8 results · AI-generated index
A
arxiv.org
research

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

The Pile is an 825 GiB English text corpus for language model training, built from 22 diverse sources, including academic papers, books, code, and web text.

H
huggingface.co
tool

Hugging Face Datasets: A Hub for Large-Scale Text Corpora

Hugging Face Datasets hosts thousands of ready-to-use text corpora for language model training, including the widely used Wikipedia and BookCorpus datasets, all accessible through a single load_dataset API.
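
A minimal sketch of pulling a corpus from the Hub with the datasets library (the WikiText-103 dataset ID below is just one example, not a recommendation):

    from datasets import load_dataset

    # Download and cache one Hub corpus; any dataset ID works the same way.
    ds = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

    print(ds)             # column names and row count
    print(ds[0]["text"])  # first record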

C
commoncrawl.org
official

Common Crawl: A Non-Profit Organization Providing Large-Scale Web Corpora

Common Crawl is a non-profit organization that provides freely available web crawl data for language model training, with petabytes of raw pages, metadata, and extracted plain text accumulated through monthly crawls since 2008.
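
A hedged sketch of listing one snapshot's WET (extracted plain-text) files; the snapshot ID is an example, and the URL follows Common Crawl's published crawl-data layout:

    import gzip

    import requests

    # Each crawl snapshot publishes a gzipped list of its WET file paths.
    snapshot = "CC-MAIN-2024-10"  # example snapshot ID
    paths_url = f"https://data.commoncrawl.org/crawl-data/{snapshot}/wet.paths.gz"

    resp = requests.get(paths_url, timeout=60)
    resp.raise_for_status()
    paths = gzip.decompress(resp.content).decode("utf-8").splitlines()

    print(len(paths), "WET files in this snapshot")
    # Prepend the data host to any listed path to get a download URL.
    print("https://data.commoncrawl.org/" + paths[0])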

N
nlp.stanford.edu
article

The Stanford Natural Language Processing Group: Resources for Language Model Training

The group maintains resources useful for language model training, including links to large-scale text corpora, pre-trained models, and NLP tools.

B
blog.google
news

Google's C4 Corpus: A Colossal Clean Crawled Corpus for Language Model Training

Google released the C4 corpus (Colossal Clean Crawled Corpus) alongside its T5 models: roughly 750 GB of cleaned English text filtered from Common Crawl snapshots.
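
Since C4 is far too large to download casually, a sketch using the datasets streaming mode (assuming the allenai/c4 mirror on the Hugging Face Hub):

    from datasets import load_dataset

    # streaming=True iterates over records without downloading ~750 GB.
    c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

    for i, example in enumerate(c4):
        print(example["url"], example["text"][:60])
        if i == 2:
            break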

G
github.io
tool

The OpenWebText Corpus: A Large-Scale Corpus for Language Model Training

The OpenWebText Corpus is an open-source reproduction of OpenAI's WebText: roughly 38 GB of text extracted from web pages linked from Reddit posts, a heuristic intended to select for content quality.

M
meta.wikimedia.org
official

Language Model Training with the Wikipedia Corpus

Wikipedia's freely licensed article dumps form a large-scale text corpus for language model training, with more than 60 million articles across over 300 language editions.
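
A sketch of streaming page titles straight from the official dump (the filename follows the standard dumps.wikimedia.org layout for the English edition):

    import bz2
    import xml.etree.ElementTree as ET

    import requests

    URL = ("https://dumps.wikimedia.org/enwiki/latest/"
           "enwiki-latest-pages-articles.xml.bz2")

    # Stream the bz2-compressed XML dump and print the first few page
    # titles without downloading the whole file.
    with requests.get(URL, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        xml_stream = bz2.BZ2File(resp.raw)
        seen = 0
        for _, elem in ET.iterparse(xml_stream):
            if elem.tag.endswith("}title"):  # tags are namespaced
                print(elem.text)
                seen += 1
                if seen == 5:
                    break
            elem.clear()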

A
aclweb.org
research

Large-Scale Text Corpora for Language Model Training: A Survey

This survey provides an overview of large-scale text corpora for language model training, including the Pile, Common Crawl, and Wikipedia corpora, and discusses their strengths and limitations.