NLTK Data
The Natural Language Toolkit (NLTK) includes a wide range of free text corpora for NLP tasks, including books, articles, and websites.
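As a minimal sketch of how one of the NLTK corpora might be used, the snippet below fetches the Gutenberg sample corpus and counts its most frequent words. The helper names `top_words` and `load_gutenberg_words` are hypothetical, and the choice of the `gutenberg` package and the `austen-emma.txt` file is an assumption made for illustration; the loader needs NLTK installed and a network connection the first time it runs.

```python
from collections import Counter


def top_words(tokens, n=5):
    """Return the n most frequent alphabetic tokens, lowercased."""
    counts = Counter(t.lower() for t in tokens if t.isalpha())
    return counts.most_common(n)


def load_gutenberg_words(fileid="austen-emma.txt"):
    """Fetch the Gutenberg sample corpus if needed and return its tokens.

    Hypothetical helper: the corpus and file id are illustrative choices.
    """
    # Imported lazily so top_words can be used without NLTK installed.
    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg", quiet=True)  # skips the download if up to date
    return gutenberg.words(fileid)
```

Calling `top_words(load_gutenberg_words())` would then list the most common words in the chosen text. Keeping the counting step separate from the download means it can also be applied to tokens from any other corpus in this list.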
This paper presents a large-scale text corpus for NLP research, containing over 100 million words from various sources, including books and articles.
Common Crawl is a non-profit organization that provides a large corpus of web pages for NLP research and development, updated regularly.
Google's Dataset Search is a search engine for datasets, including text corpora for NLP, providing access to a wide range of free and open datasets.
The Stanford NLP Group provides a range of free resources, including text corpora, for NLP research and development, such as the Stanford Question Answering Dataset.
Hugging Face Datasets is a platform that provides a wide range of text corpora for NLP tasks, including datasets for language modeling, sentiment analysis, and more.
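For a sentiment analysis dataset, a rough sketch of loading data through the `datasets` library and checking its class balance could look like the following. The helper names `label_counts` and `load_reviews` are hypothetical, and the dataset name `rotten_tomatoes` with a `label` field is an assumption chosen for illustration; the loader requires the `datasets` package and network access.

```python
from collections import Counter


def label_counts(records, label_key="label"):
    """Count how many records carry each label value."""
    return Counter(r[label_key] for r in records)


def load_reviews(name="rotten_tomatoes", split="train"):
    """Load one split of a dataset from the Hugging Face Hub.

    Hypothetical helper: the default dataset name is an assumed example.
    """
    # Imported lazily so label_counts works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(name, split=split)
```

Calling `label_counts(load_reviews())` would report how many examples fall under each label, which is a quick sanity check before training on any downloaded corpus.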
The Wikipedia Corpus is a large corpus of text from Wikipedia articles, available for free download and use in NLP research and development.
The Linguistic Data Consortium (LDC) is a non-profit organization that provides a wide range of linguistic resources, including text corpora, for NLP research and development.