Large Corpora for Language Understanding
OpenWebText, an open-source recreation of the WebText corpus used to train GPT-2, and BookCorpus, a collection of text from unpublished books, are two widely used large corpora for language understanding research.
Common Crawl is a non-profit organization that maintains a freely available archive of web crawl data for language understanding research; gathered since 2008, the archive spans petabytes of raw page data.
The Wikipedia Corpus is a large collection of article text from Wikipedia, a valuable resource for language understanding research with tens of millions of articles across its many language editions.
The Pile, assembled by EleutherAI, is an 825 GiB corpus of text drawn from 22 diverse sources, including web pages, books, academic papers, and code, and is widely used for training language models.
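Corpora of this scale are typically distributed as compressed JSON Lines shards and processed by streaming one document at a time rather than loading a shard into memory. The sketch below illustrates that pattern with the standard library; the `{"text": ..., "meta": ...}` record shape mirrors The Pile's published format, but gzip stands in here for the zstd compression the actual distribution uses, and the documents are made up.

```python
import gzip
import json
import os
import tempfile

# Write a tiny gzipped JSON Lines file standing in for a corpus shard.
# (Record shape follows The Pile's {"text", "meta"} convention; gzip is
# used instead of zstd so the sketch runs on the standard library alone.)
docs = [
    {"text": "First document.", "meta": {"pile_set_name": "example"}},
    {"text": "Second document.", "meta": {"pile_set_name": "example"}},
]
path = os.path.join(tempfile.mkdtemp(), "shard.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

def stream_documents(shard_path):
    """Yield one JSON document at a time without loading the whole shard."""
    with gzip.open(shard_path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

texts = [d["text"] for d in stream_documents(path)]
```

Because the reader is a generator, memory use stays constant no matter how large the shard is, which is what makes multi-hundred-gigabyte corpora practical to preprocess on a single machine.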
This survey paper reviews the current state of language understanding research using large corpora, including the challenges and opportunities of working with large datasets.
Google pre-trained BERT on roughly 3.3 billion words drawn from BookCorpus and English Wikipedia, and subsequent variants have been pre-trained on substantially larger corpora drawn from the web.
The Stanford NLP Group has built widely used datasets for language understanding research, including the Stanford Question Answering Dataset (SQuAD) and the Stanford Sentiment Treebank.
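SQuAD is distributed as nested JSON: articles contain paragraphs, each paragraph carries a context passage, and each question lists answers with character offsets into that context. A minimal sketch of flattening that layout into question-answer pairs, using a made-up passage in the published SQuAD v1.1 structure:

```python
import json

# A tiny SQuAD v1.1-style record; the passage and question are invented
# for illustration, but the nesting matches the released dataset.
squad_json = json.loads("""
{
  "version": "1.1",
  "data": [
    {
      "title": "Example",
      "paragraphs": [
        {
          "context": "SQuAD was released by Stanford in 2016.",
          "qas": [
            {
              "id": "q1",
              "question": "Who released SQuAD?",
              "answers": [{"text": "Stanford", "answer_start": 22}]
            }
          ]
        }
      ]
    }
  ]
}
""")

def iter_qa_pairs(dataset):
    """Flatten the nested SQuAD layout into (question, answer, context) triples."""
    for article in dataset["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    yield qa["question"], answer["text"], context

pairs = list(iter_qa_pairs(squad_json))
```

The `answer_start` offset lets evaluation code verify that the answer string actually appears at the claimed position in the context, which is how SQuAD grounds answers in the passage.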
This tutorial video provides an introduction to large corpora for language understanding, including how to work with popular datasets and tools, such as NLTK and spaCy.
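The first step in working with any of these corpora is tokenization and frequency counting, which NLTK and spaCy handle with full-featured tokenizers. A bare-bones standard-library stand-in, shown here only to illustrate the idea (the regex is a deliberate simplification of what those libraries do):

```python
import re
from collections import Counter

def word_frequencies(text):
    """Lowercase the text, split on simple word characters, and count tokens.

    A minimal stand-in for the tokenizers NLTK and spaCy provide; real
    tokenizers also handle punctuation, contractions, and Unicode properly.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

sample = "The corpus is large, and the corpus is useful."
freqs = word_frequencies(sample)
```

On the sample sentence this counts "the", "corpus", and "is" twice each; swapping in `nltk.word_tokenize` or a spaCy pipeline changes only the tokenization step, not the counting pattern.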