large text corpus for nlp research

nltk.org tool

NLTK Data

The Natural Language Toolkit (NLTK) provides access to a wide range of text corpora for NLP research, including the Brown Corpus and a sample of the Penn Treebank.
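
As a rough illustration of what working with an NLTK corpus can look like, the sketch below loads the Brown Corpus and counts its most frequent word tokens. The helper names `top_tokens` and `load_brown_words` are hypothetical, chosen for this example; the sketch assumes NLTK is installed and that the `brown` data package can be downloaded.

```python
from collections import Counter


def top_tokens(words, n=3):
    """Return the n most frequent lowercased alphabetic tokens (hypothetical helper)."""
    counts = Counter(w.lower() for w in words if w.isalpha())
    return counts.most_common(n)


def load_brown_words():
    """Fetch the Brown Corpus via NLTK and return its word tokens (hypothetical helper)."""
    # Imported lazily so top_tokens() can be used without NLTK installed.
    import nltk
    from nltk.corpus import brown

    # Downloads the corpus data on first use; needs network access.
    nltk.download("brown", quiet=True)
    return brown.words()
```

Calling `top_tokens(load_brown_words())` would then return the most common tokens in the corpus; the same helper works on any iterable of strings.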

arxiv.org research

Large Text Corpus for NLP Research

This article presents a large-scale text corpus for natural language processing research, consisting of over 100 million tokens from various sources, including books, articles, and websites.

commoncrawl.org tool

Common Crawl

Common Crawl is a non-profit organization that maintains a free, open repository of web crawl data widely used in NLP research; the full corpus spans petabytes and is available for download.

nlp.stanford.edu article

The Stanford Natural Language Processing Group

The Stanford NLP Group provides access to a range of text corpora, including the Stanford Question Answering Dataset (SQuAD) and the Stanford Sentiment Treebank.

huggingface.co tool

Hugging Face Datasets

Hugging Face provides a wide range of pre-trained models and datasets for NLP research, including large text corpora such as the WikiText-103 dataset.
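
A minimal sketch of reading WikiText-103 with the Hugging Face `datasets` library follows. It assumes the library is installed and that the dataset is published under the Hub id `Salesforce/wikitext` with the config `wikitext-103-raw-v1`. The function names are hypothetical, and streaming is used so the full corpus is not downloaded up front.

```python
from itertools import islice


def count_whitespace_tokens(lines):
    """Count whitespace-separated tokens across an iterable of strings (hypothetical helper)."""
    return sum(len(line.split()) for line in lines)


def stream_wikitext_lines(limit=1000):
    """Return the first `limit` text rows of the WikiText-103 training split (hypothetical helper)."""
    # Imported lazily so the counting helper works without `datasets` installed.
    from datasets import load_dataset

    # Assumed Hub id and config name; streaming=True yields rows lazily over the network.
    ds = load_dataset(
        "Salesforce/wikitext",
        "wikitext-103-raw-v1",
        split="train",
        streaming=True,
    )
    return [row["text"] for row in islice(ds, limit)]
```

For instance, `count_whitespace_tokens(stream_wikitext_lines(1000))` gives a quick token estimate for the first thousand rows; whitespace splitting is only a crude stand-in for a real tokenizer.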

ldc.upenn.edu official

Linguistic Data Consortium

The Linguistic Data Consortium (LDC) is a leading provider of linguistic resources, including large text corpora, for NLP research.

ai.google article

Google's Natural Language Processing Dataset

Google provides a range of datasets for NLP research, including the Google Question Answering Dataset and the Google Sentiment Analysis Dataset.

github.io research

The OpenWebText Corpus

The OpenWebText Corpus is a large-scale text corpus for NLP research, consisting of roughly 38 gigabytes of text data from the web.