Free large text corpora for NLP

nltk.org tool

NLTK Data

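The Natural Language Toolkit (NLTK) distributes a collection of free text corpora for NLP tasks through its NLTK Data package, including literary texts (Gutenberg), categorized prose (Brown), newswire (Reuters), and web text. Each corpus is fetched individually with nltk.download().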
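As a rough illustration of working with one of these corpora, the sketch below downloads the Gutenberg collection and computes a type-token ratio for one file. It assumes the third-party nltk package is installed and that network access is available on first use; the helper names type_token_ratio and gutenberg_ttr are made up for this example.

```python
def type_token_ratio(words):
    """Ratio of distinct alphabetic tokens to all alphabetic tokens (case-folded)."""
    tokens = [w.lower() for w in words if w.isalpha()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def gutenberg_ttr(fileid="austen-emma.txt"):
    """Download the Gutenberg corpus if needed and score one of its files.

    Assumption: `nltk` is installed; it is imported lazily so the pure
    helper above can be used without it.
    """
    import nltk
    from nltk.corpus import gutenberg

    nltk.download("gutenberg", quiet=True)  # skipped if already up to date
    return type_token_ratio(gutenberg.words(fileid))
```

Calling gutenberg_ttr() with no arguments scores Jane Austen's Emma; gutenberg.fileids() lists the other available files.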

arxiv.org research

Large Text Corpus for NLP Research

This paper presents a large-scale text corpus for NLP research, containing over 100 million words from various sources, including books and articles.

commoncrawl.org article

Common Crawl

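Common Crawl is a non-profit organization that maintains a free, open repository of web crawl data for NLP research and development. New crawls are published regularly, and each is available as raw WARC files, WAT metadata files, and WET files of extracted plain text.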
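Before downloading whole crawl archives, it can help to check which pages of a site were captured. The sketch below queries the Common Crawl URL index using only the standard library. The crawl ID passed in (for example "CC-MAIN-2024-10") is an assumption chosen for illustration; current IDs are listed at index.commoncrawl.org/collinfo.json. The function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern):
    """Build the index lookup URL for one crawl and one URL pattern."""
    params = urllib.parse.urlencode({"url": url_pattern, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def lookup_captures(crawl_id, url_pattern, limit=5):
    """Return up to `limit` capture records as dicts (needs network access).

    The index answers with one JSON object per line. A pattern with no
    captures may come back as an HTTP error, which is not handled here.
    """
    query = build_index_query(crawl_id, url_pattern)
    with urllib.request.urlopen(query) as resp:
        lines = resp.read().decode("utf-8").splitlines()
    return [json.loads(line) for line in lines[:limit] if line.strip()]
```

Each returned record names the archive file, byte offset, and length of the capture, which is enough to fetch a single page instead of a full archive.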

datasetsearch.research.google.com tool

Google's Dataset Search

Google's Dataset Search is a search engine that indexes datasets hosted elsewhere on the web, including text corpora for NLP. It links to the hosting sites rather than hosting the data itself, so availability and licensing vary by dataset.

nlp.stanford.edu official

The Stanford Natural Language Processing Group

The Stanford NLP Group provides free resources for NLP research and development, including text corpora and datasets such as the Stanford Question Answering Dataset (SQuAD).

huggingface.co tool

Hugging Face Datasets

Hugging Face Datasets is both a hosting hub and an open-source Python library (datasets) for text corpora, covering language modeling, sentiment analysis, and many other NLP tasks.
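For a sense of how a hosted corpus is consumed, the sketch below streams a small language-modeling dataset and skips blank records. It assumes the third-party datasets package is installed and that network access is available. The dataset ID "Salesforce/wikitext" with the "wikitext-2-raw-v1" configuration is only an example choice, and the helper names are hypothetical.

```python
import itertools


def non_empty_lines(records):
    """Yield the stripped "text" field of each record, skipping blank ones."""
    for record in records:
        text = record["text"].strip()
        if text:
            yield text


def preview_wikitext(n=5):
    """Return the first `n` non-empty lines of the WikiText-2 training split.

    Assumption: `datasets` is installed; it is imported lazily so the
    filter above works on any iterable of dicts without it.
    """
    from datasets import load_dataset

    stream = load_dataset(
        "Salesforce/wikitext",
        "wikitext-2-raw-v1",
        split="train",
        streaming=True,  # iterate over records without downloading the full split first
    )
    return list(itertools.islice(non_empty_lines(stream), n))
```

Streaming keeps memory use flat, which matters more for the larger corpora on the hub than for this small one.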

wikipedia.org article

The Wikipedia Corpus

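The Wikipedia Corpus is a large corpus of text from Wikipedia articles. The Wikimedia Foundation publishes full database dumps for free download, and the article text is licensed under CC BY-SA, so it can be used in NLP research and development with attribution.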
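Dumps arrive as large XML files, so they are usually parsed incrementally rather than loaded whole. The sketch below shows one way to do that with the standard library, run here against a tiny in-memory sample that mimics the dump layout. The sample content and the iter_pages helper are invented for illustration, and the text it yields is raw wiki markup, not cleaned prose.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a dump file; real dumps use the same page/title/text nesting.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
<page><title>Corpus</title><revision><text>A corpus is a body of text.</text></revision></page>
<page><title>Token</title><revision><text>A token is a unit of text.</text></revision></page>
</mediawiki>"""


def local_name(tag):
    """Drop the XML namespace prefix, which differs between dump versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a binary dump stream, one page at a time."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree


if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(page_text))
```

For a real dump, the same generator can be fed a stream opened with bz2.open(path, "rb"), where path points to a locally downloaded .xml.bz2 file.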

ldc.upenn.edu official

Linguistic Data Consortium

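The Linguistic Data Consortium (LDC), hosted by the University of Pennsylvania, is a non-profit organization that creates and distributes a wide range of linguistic resources, including text corpora, for NLP research and development. Most of its corpora require membership or a per-corpus license fee rather than being freely downloadable.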