8 results · AI-generated index
huggingface.co
tool

Natural Language Processing (NLP) Datasets

The Hugging Face Datasets library provides a wide range of NLP datasets for training and fine-tuning models, including large corpora like Wikipedia and BookCorpus.

www.mdpi.com
article

Big Data for NLP: A Review of Corpus Creation and Analysis

This article reviews the creation and analysis of large corpora for NLP, highlighting the importance of big data in training accurate models.

commoncrawl.org
official

Common Crawl: A Large Corpus for NLP Research

Common Crawl is a non-profit organization that maintains a free, open corpus of crawled web pages for NLP research, with petabytes of data available for download.

arxiv.org
research

NLP Training with Large Corpora: A Case Study

This research paper presents a case study on training NLP models with large corpora and reports the benefits of larger training sets.

datasetsearch.research.google.com
tool

Google's Dataset Search for NLP

Google's Dataset Search is a search engine for datasets that indexes a wide range of NLP datasets hosted across the web, including large corpora for training and testing.

nlp.stanford.edu
edu

The Stanford Natural Language Processing Group

The Stanford NLP Group provides a range of resources for NLP research, including large corpora and pre-trained models, as well as tutorials and guides for NLP training.

towardsdatascience.com
article

NLP Training Data: Where to Find Large Corpora

This article provides an overview of where to find large corpora for NLP training, including government datasets, academic resources, and commercial providers.

corpus.byu.edu
edu

Corpus of Contemporary American English (COCA)

The Corpus of Contemporary American English is a large corpus of American English texts, with over 525 million words, available for NLP research and training.