NLTK Data
The Natural Language Toolkit (NLTK) provides access to a wide range of text corpora for NLP research, including the Brown Corpus and a sample of the Penn Treebank.
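As a minimal sketch of how one of these corpora might be read, the following Python loads the Brown Corpus through NLTK and counts its most frequent words. It assumes the `nltk` package is installed and that the corpus can be downloaded on first use. The helper names `load_brown_words` and `top_words` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def load_brown_words(category="news"):
    """Return the word tokens of one Brown Corpus category.

    Assumes nltk is installed; the corpus is fetched on first use.
    """
    import nltk
    from nltk.corpus import brown

    nltk.download("brown", quiet=True)  # no-op if already present
    return list(brown.words(categories=category))


def top_words(words, n=5):
    """Count alphabetic tokens case-insensitively and return the n most common."""
    counts = Counter(w.lower() for w in words if w.isalpha())
    return counts.most_common(n)
```

Calling `top_words(load_brown_words("news"))` would return a list of `(word, count)` pairs for the news category. The counting step is kept separate from the loading step so it can be reused with any list of tokens.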
This article presents a large-scale text corpus for natural language processing research, consisting of over 100 million tokens from various sources, including books, articles, and websites.
Common Crawl is a non-profit organization that provides a large corpus of web pages for NLP research, with over 20 terabytes of data available for download.
The Stanford NLP Group provides access to a range of text corpora, including the Stanford Question Answering Dataset (SQuAD) and the Stanford Sentiment Treebank.
Hugging Face provides a wide range of pre-trained models and datasets for NLP research, including large text corpora such as the WikiText-103 dataset.
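As an illustration, a dataset such as WikiText-103 might be loaded with the Hugging Face `datasets` library roughly as follows. This is a sketch under assumptions: the `datasets` package is installed, network access is available, and the dataset is published under the identifier `wikitext` with the configuration `wikitext-103-raw-v1` (on some Hub versions the identifier may be `Salesforce/wikitext`). The helper names are hypothetical.

```python
def load_wikitext_validation():
    """Load the WikiText-103 validation split.

    Assumes the `datasets` package is installed and that the dataset
    identifier and configuration name below are valid on the Hub.
    """
    from datasets import load_dataset

    return load_dataset("wikitext", "wikitext-103-raw-v1", split="validation")


def whitespace_token_count(rows):
    """Sum whitespace-separated tokens over rows that carry a "text" field."""
    return sum(len(row["text"].split()) for row in rows)
```

Calling `whitespace_token_count(load_wikitext_validation())` would give a rough size estimate for the split. Whitespace splitting is only an approximation of the tokenization used in published corpus statistics.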
The Linguistic Data Consortium (LDC) is a leading provider of linguistic resources, including large text corpora, for NLP research.
Google provides a range of datasets for NLP research, including the Google Question Answering Dataset and the Google Sentiment Analysis Dataset.
The OpenWebText Corpus is a large-scale text corpus for NLP research, consisting of over 38 gigabytes of text data from the web.