The Pile: An 800GB Dataset of Diverse Text for Language Modeling
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
OpenCSG Chinese Corpus: A Series of High-Quality Chinese Datasets for LLM Training
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
LLM-generated evaluation metrics that assist data annotation for reward model training