Large Text Datasets for Language Modeling
Discover a wide range of large text datasets for training language models, including the popular WikiText and BookCorpus datasets.
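Corpora like WikiText are usually consumed for training as one long token stream chopped into fixed-length input/target pairs, with the target shifted one position right. A minimal plain-Python sketch (the toy ID list stands in for a tokenized corpus; the function name is illustrative, not from any library):

```python
def make_lm_examples(token_ids, seq_len):
    """Chop a flat token-ID stream into (input, target) pairs where the
    target is the input shifted one position to the right."""
    examples = []
    # Step by seq_len; each chunk needs one extra token for the shifted target.
    for start in range(0, len(token_ids) - seq_len, seq_len):
        chunk = token_ids[start : start + seq_len + 1]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

# Toy stand-in for a tokenized corpus such as WikiText.
ids = list(range(10))
pairs = make_lm_examples(ids, seq_len=4)
# pairs[0] is ([0, 1, 2, 3], [1, 2, 3, 4])
```

Real pipelines do the same thing at scale, typically with overlap-free chunking as above or a sliding window for more training examples per token.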
Research paper introducing The Pile, a large-scale dataset for language modeling, comprising 825 GiB of text from 22 diverse sources.
A collection of widely used NLP datasets, including the Stanford Natural Language Inference (SNLI) corpus and the IMDB sentiment analysis dataset.
Article discussing the opportunities and challenges of using big data for language modeling, including the importance of data quality and diversity.
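One concrete piece of the data-quality work discussed above is deduplication, since web-scale corpora contain many repeated documents. A minimal sketch using exact-match hashing (the normalization here is just lowercasing and whitespace collapse; production pipelines use fuzzier matching such as MinHash to catch near-duplicates):

```python
import hashlib

def dedupe(docs):
    """Drop exact-duplicate documents, comparing a hash of normalized text."""
    seen = set()
    kept = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept

docs = ["The cat sat.", "the  cat sat.", "A different sentence."]
# dedupe(docs) keeps only the first of the two normalized-identical lines.
```

Hashing normalized text rather than storing the documents themselves keeps memory proportional to the number of unique documents, which matters at web scale.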
Non-profit organization providing a large corpus of web pages for language modeling and other NLP tasks, with over 25 terabytes of data.
Course notes from Stanford University's Natural Language Processing with Deep Learning course, covering language modeling with large datasets.
Introduction to the Wikipedia Corpus, a large dataset for language modeling, comprising over 50 million articles in multiple languages.
Research paper introducing the GLUE (General Language Understanding Evaluation) benchmark, a collection of datasets for evaluating language models, including question answering and sentiment analysis tasks.
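Several GLUE tasks are scored with plain classification accuracy. A minimal sketch of that metric on toy predictions (the labels here are invented for illustration, not real GLUE data):

```python
def accuracy(preds, labels):
    """Fraction of predictions matching gold labels, the metric used for
    several GLUE tasks (e.g. SST-2 sentiment classification)."""
    if len(preds) != len(labels):
        raise ValueError("predictions and labels must be the same length")
    return sum(p == g for p, g in zip(preds, labels)) / len(labels)

# Toy binary sentiment labels: 1 = positive, 0 = negative.
score = accuracy([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct -> 0.75
```

Other GLUE tasks use different metrics (F1, Matthews correlation, Pearson/Spearman correlation), so a full evaluation harness dispatches per task rather than using accuracy everywhere.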