8 results · AI-generated index
nltk.org
tool

NLTK Data

The Natural Language Toolkit (NLTK) provides access to a wide range of text corpora for NLP research, including the Brown Corpus, a sample of the Penn Treebank, and the Reuters Corpus.

arxiv.org
research

Large Text Corpus for NLP Research

This article presents a large-scale text corpus for natural language processing research, consisting of over 100 million tokens from various sources, including books, articles, and websites.

commoncrawl.org
tool

Common Crawl

Common Crawl is a non-profit organization that maintains an open repository of web crawl data for NLP research, with petabytes of data freely available for download.

nlp.stanford.edu
article

The Stanford Natural Language Processing Group

The Stanford NLP Group provides access to a range of text corpora, including the Stanford Question Answering Dataset (SQuAD) and the Stanford Sentiment Treebank.

huggingface.co
tool

Hugging Face Datasets

Hugging Face provides a wide range of pre-trained models and datasets for NLP research, including large text corpora such as the WikiText-103 dataset.

ldc.upenn.edu
official

Linguistic Data Consortium

The Linguistic Data Consortium (LDC) is a leading provider of linguistic resources, including large text corpora, for NLP research.

ai.google
article

Google's Natural Language Processing Dataset

Google provides a range of datasets for NLP research, including the Natural Questions question-answering dataset and the GoEmotions emotion-classification dataset.

github.io
research

The OpenWebText Corpus

The OpenWebText Corpus is an open-source recreation of OpenAI's WebText dataset for NLP research, consisting of about 38 gigabytes of text extracted from web pages.