Common Corpus: Multilingual Data Set for Training LLMs

March 20, 2024

  • Common Corpus is the largest public domain dataset released for training LLMs.
  • Common Corpus includes 500 billion words from a wide diversity of cultural heritage initiatives.
  • Common Corpus is multilingual and the largest corpus to date in English, French, Dutch, Spanish, German and Italian.
  • Common Corpus shows it is possible to train fully open LLMs on sources without copyright concerns.

Common Corpus is an international initiative coordinated by Pleias, involving researchers in LLM pretraining, AI ethics and cultural heritage, in association with major organizations committed to an open science approach for AI (HuggingFace, Occiglot, Eleuther, Nomic AI). Common Corpus has received the support of Lang:IA, a state start-up supported by the French Ministry of Culture and the Direction du numérique (Agent Public). Pleias is a French start-up specialized in training Large Language Models for document processing on fully open and auditable corpora.

Full Text of the Announcement