Large Text Datasets for NLP

huggingface.co tool

Natural Language Processing Datasets

Explore a wide range of large text datasets for NLP tasks, including text classification, language modeling, and question answering.

ldc.upenn.edu research

Linguistic Data Consortium

The Linguistic Data Consortium (LDC) is an open consortium hosted by the University of Pennsylvania that creates and distributes language resources for NLP research, including text corpora such as the Penn Treebank and speech corpora such as Switchboard.

commoncrawl.org article

Common Crawl

Common Crawl is a non-profit organization that provides a large, freely available corpus of web pages for NLP research and development.

nlp.stanford.edu official

Stanford Natural Language Processing Group

The Stanford NLP Group provides access to a variety of large text datasets, including the Stanford Question Answering Dataset (SQuAD) and the Stanford Sentiment Treebank (SST).

datasetsearch.research.google.com tool

Google Dataset Search

Google Dataset Search is a search engine for datasets, including large text datasets for NLP tasks such as language modeling and text classification.

www.nih.gov official

National Institutes of Health (NIH) NLP Datasets

The NIH provides large text datasets for NLP research, including clinical notes and medical literature, to support the development of NLP models for healthcare applications.

www.kaggle.com news

Kaggle NLP Competitions

Kaggle hosts a variety of NLP competitions built on large text datasets, covering tasks such as text classification and language modeling.

www.aclweb.org research

The ACL Anthology

The ACL Anthology is a digital archive of papers and proceedings from the Association for Computational Linguistics, including research on large text datasets for NLP tasks.