Large Text Datasets for Language Modeling
Discover a wide range of large text datasets for training language models, including the popular WikiText and BookCorpus datasets.
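Corpora like WikiText are usually consumed for training as one long token stream chopped into fixed-length input/target pairs, with the target shifted one position right. A minimal plain-Python sketch (the toy ID list stands in for a tokenized corpus; the function name is illustrative, not from any library):

```python
def make_lm_examples(token_ids, seq_len):
    """Chop a flat token-ID stream into (input, target) pairs where the
    target is the input shifted one position to the right."""
    examples = []
    # Step by seq_len; each chunk needs one extra token for the shifted target.
    for start in range(0, len(token_ids) - seq_len, seq_len):
        chunk = token_ids[start : start + seq_len + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

# Toy stand-in for a tokenized corpus such as WikiText.
ids = list(range(10))
pairs = make_lm_examples(ids, seq_len=4)
# pairs[0] is ([0, 1, 2, 3], [1, 2, 3, 4])
```

Real pipelines do the same thing at scale, typically with overlap-free chunking as above or a sliding window for more training examples per token.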
Research paper introducing The Pile, a large-scale dataset for language modeling, comprising 825 GiB of text from 22 diverse sources.
A collection of widely used NLP datasets, including the Stanford Natural Language Inference (SNLI) corpus and the IMDB sentiment analysis dataset.
Article discussing the opportunities and challenges of using big data for language modeling, including the importance of data quality and diversity.
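One concrete piece of the data-quality work discussed above is deduplication, since web-scale corpora contain many repeated documents. A minimal sketch using exact-match hashing (the normalization here is just lowercasing and whitespace collapse; production pipelines use fuzzier matching such as MinHash to catch near-duplicates):

```python
import hashlib

def dedupe(docs):
    """Drop exact-duplicate documents, comparing a hash of normalized text."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different sentence."]
# dedupe(docs) keeps only the first of the two normalized-identical lines.
```

Hashing normalized text rather than storing the documents themselves keeps memory proportional to the number of unique documents, which matters at web scale.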
Non-profit organization providing a large corpus of web pages for language modeling and other NLP tasks, with over 25 terabytes of data.
Course notes from Stanford University's Natural Language Processing with Deep Learning course, covering language modeling with large datasets.
Introduction to the Wikipedia Corpus, a large dataset for language modeling, comprising over 50 million articles in multiple languages.
Research paper introducing the GLUE (General Language Understanding Evaluation) benchmark, a collection of datasets for evaluating language models, including question answering and sentiment analysis tasks.
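Several GLUE tasks are scored with plain classification accuracy. A minimal sketch of that metric on toy predictions (the labels here are invented for illustration, not real GLUE data):

```python
def accuracy(preds, labels):
    """Fraction of predictions matching gold labels, the metric used for
    several GLUE tasks (e.g. SST-2 sentiment classification)."""
    if len(preds) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

# Toy binary sentiment labels: 1 = positive, 0 = negative.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct -> 0.75
```

Other GLUE tasks use different metrics (F1, Matthews correlation, Pearson/Spearman correlation), so a full evaluation harness dispatches per task rather than using accuracy everywhere.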