8 results · AI-generated index
huggingface.co
tool

Large Text Datasets for Language Modeling

Discover a wide range of large text datasets for training language models, including the popular WikiText and BookCorpus datasets.

arxiv.org
research

The Pile: A Large-Scale Dataset for Language Modeling

Research paper introducing The Pile, a large-scale dataset for language modeling, comprising 825 GiB of text from 22 diverse sources.

github.com
tool

Language Modeling Datasets

A collection of popular language modeling datasets, including the Stanford Natural Language Inference Corpus and the IMDB sentiment analysis dataset.

ieee.org
article

Big Data for Language Modeling: Opportunities and Challenges

Article discussing the opportunities and challenges of using big data for language modeling, including the importance of data quality and diversity.

commoncrawl.org
official

Common Crawl: A Large Corpus of Web Pages

Non-profit organization providing a large, freely available corpus of crawled web pages for language modeling and other NLP tasks, with petabytes of data.

stanford.edu
edu

Language Modeling with Large Datasets

Course notes from Stanford University's Natural Language Processing with Deep Learning course, covering language modeling with large datasets.

wikimedia.org
official

The Wikipedia Corpus: A Large Dataset for Language Modeling

Introduction to the Wikipedia Corpus, a large dataset for language modeling, comprising over 50 million articles in multiple languages.

nyu.edu
research

Language Model Evaluation with the GLUE Benchmark

Research paper introducing the GLUE benchmark, a collection of datasets for evaluating language models, including question answering and sentiment analysis tasks.