Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
Essential-Web v1.0 - 24T tokens of organized web data
Ask-Before-Detection - Identifying and Mitigating Conformity Bias in LLM-Powered Error Detector for Math Word Problem Solutions
AnnoLLM - Making Large Language Models to Be Better Crowdsourced Annotators
FinerWeb-10BT - Refining Web Data with LLM-Based Line-Level Filtering
The Pile - An 800GB Dataset of Diverse Text for Language Modeling
CCNet - Extracting High Quality Monolingual Datasets from Web Crawl Data