Hugging Face Releases FineTranslations, a Trillion-Token Multilingual Parallel Text Dataset

infoq.com
TLDR

Hugging Face has released FineTranslations, a large-scale multilingual dataset containing over 1 trillion tokens of parallel text that pairs English with 500+ other languages. The dataset aims to improve machine translation, especially for lower-resource languages, and its English side can also be used for English-only model pretraining. It was created by translating non-English content from the FineWeb2 corpus into English with Gemma 3 27B, with a focus on preserving cultural and contextual information.
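
The construction described above (translate each non-English document into English, keep both sides) can be illustrated with a minimal Python sketch. It is not the actual FineTranslations pipeline: the `ParallelRecord` fields, the short language codes, and the `stub_translate` function are all hypothetical stand-ins, and a real setup would call a translation model such as Gemma 3 27B where the stub sits.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class ParallelRecord:
    """One parallel pair (hypothetical schema, not the dataset's real one)."""
    lang: str          # language code of the original document
    source_text: str   # original non-English text
    english_text: str  # machine translation into English


def build_parallel_records(
    docs: Iterable[Tuple[str, str]],
    translate: Callable[[str, str], str],
) -> List[ParallelRecord]:
    """Translate every non-English (lang, text) document and keep both sides."""
    records = []
    for lang, text in docs:
        if lang == "en":
            continue  # only non-English content is translated
        records.append(ParallelRecord(lang, text, translate(text, lang)))
    return records


def english_only(records: Iterable[ParallelRecord]) -> List[str]:
    """Return just the English side, e.g. for English-only pretraining."""
    return [r.english_text for r in records]


def stub_translate(text: str, lang: str) -> str:
    """Hypothetical placeholder for a real model call; uses a tiny lookup."""
    lookup = {"Bonjour le monde": "Hello world", "Hola mundo": "Hello world"}
    return lookup.get(text, text)


if __name__ == "__main__":
    docs = [("fr", "Bonjour le monde"), ("en", "Hi"), ("es", "Hola mundo")]
    for record in build_parallel_records(docs, stub_translate):
        print(record)
```

Keeping the original text next to its translation is what lets the same records serve both uses mentioned in the summary: the pairs for translation training, and the English side alone for pretraining.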
