Natural Language Processing Datasets
Explore a wide range of large text datasets for NLP tasks, including text classification, language modeling, and question answering.
The Linguistic Data Consortium is an international organization that creates and distributes large text datasets for NLP research, including the Penn Treebank and Switchboard corpora.
Common Crawl is a non-profit organization that provides a large, freely available corpus of web pages for NLP research and development.
The Stanford NLP Group provides access to a variety of large text datasets, including the Stanford Question Answering Dataset and the Stanford Sentiment Treebank.
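For readers who download the Stanford Question Answering Dataset, the following is a minimal sketch of how its nested JSON layout (articles, then paragraphs, then question/answer pairs) might be flattened into examples. The sample record is hand-written for illustration and is not real SQuAD data; the helper name iter_examples is hypothetical, and the field names assume the v1.1 layout.

```python
import json

# Hand-written sample in the SQuAD v1.1 JSON layout (illustrative, not real data).
SAMPLE = """
{
  "version": "1.1",
  "data": [
    {
      "title": "Example_Article",
      "paragraphs": [
        {
          "context": "Common Crawl publishes a corpus of web pages.",
          "qas": [
            {
              "id": "q1",
              "question": "What does Common Crawl publish?",
              "answers": [{"text": "a corpus of web pages", "answer_start": 23}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_examples(squad):
    """Yield (id, question, answer text, span cut from the context) tuples.

    Hypothetical helper: assumes the v1.1 field names used in SAMPLE above.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    # answer_start is a character offset into the context.
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    yield qa["id"], qa["question"], answer["text"], context[start:end]


examples = list(iter_examples(json.loads(SAMPLE)))
for qid, question, answer, span in examples:
    print(qid, question, "->", answer, "| span matches:", answer == span)
```

Comparing the stored answer text with the span cut from the context is a cheap sanity check that the character offsets survived any preprocessing.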
Google Dataset Search is a search engine for datasets, including large text datasets for NLP tasks such as language modeling and text classification.
The NIH provides large text datasets for NLP research, including clinical notes and medical literature, to support the development of NLP models for healthcare applications.
Kaggle hosts a variety of NLP competitions built around large text datasets, covering tasks such as text classification and language modeling.
The ACL Anthology is a digital archive of papers and proceedings from the Association for Computational Linguistics, including research on large text datasets for NLP tasks.