8 results · AI-generated index
mit.edu
research

Large Corpora for Language Understanding

The MIT Natural Language Processing Group works with large corpora for language understanding, including the OpenWebText corpus and the BookCorpus.

commoncrawl.org
tool

Common Crawl: A Large Corpus for Language Understanding

Common Crawl is a non-profit organization that maintains a free, open corpus of web crawl data for language understanding research, totaling petabytes of data.

wikipedia.org
article

The Wikipedia Corpus: A Large Corpus for Language Understanding

The Wikipedia Corpus collects text from Wikipedia, a valuable resource for language understanding research, with over 50 million articles across many languages.

huggingface.co
article

Large-Scale Language Modeling with the Pile Corpus

The Pile Corpus is a large-scale corpus of text from the web, books, and other sources, used for training language models, with over 800 GB of text data.

arxiv.org
research

Language Understanding with Large Corpora: A Survey

This survey paper reviews the current state of language understanding research using large corpora, including the challenges and opportunities of working with large datasets.

google.com
official

Google's Large Corpus for Language Understanding

Google trains its language models, including BERT and its variants, on a massive corpus of text drawn from the web and other sources.

stanford.edu
research

The Stanford Natural Language Processing Group: Large Corpora for Language Understanding

The Stanford NLP Group uses large corpora for language understanding research, including the Stanford Question Answering Dataset (SQuAD) and the Stanford Sentiment Treebank.

youtube.com
video

Large Corpora for Language Understanding: A Tutorial

This tutorial video introduces large corpora for language understanding, including how to work with popular datasets and tools such as NLTK and spaCy.