FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Personal Notes · A Guidebook on Large Language Model (LLM) Evaluation
OpenCSG Chinese Corpus: A Series of High-Quality Chinese Datasets for LLM Training
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale