FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering

arXiv: 2501.07314

https://github.com/TurkuNLP/finerweb-10bt

[TOC]

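Overview

GPT-4o mini annotates a 20,000-document sample of FineWeb line by line, letting the model create descriptive labels for low-quality lines.
The labels are grouped into nine broad categories, and a DeBERTa-v3 classifier is trained to scale the filtering to a 10-billion-token subset of FineWeb.
Results: models trained on the filtered data reach higher accuracy on the HellaSwag benchmark and hit performance targets faster, with up to 25% less data.

Core questions: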

How well can an LLM identify low-quality content missed by heuristic filters?
Does LLM-based quality filtering of training datasets improve model performance?

The paper defines high-quality data as:

human-written, continuous English text from the main content of a website, reflecting natural language use across diverse contexts and domains.

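Typical examples include the main text of interviews, forum posts, news articles, blogs and recipes.

Low-quality content, by contrast, includes repetitive elements such as navigation menus, copyright notices, programming code and metadata.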

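Filtering operates at three levels:

Document level: whole documents are dropped by simple rules
- documents with fewer than three sentences
- documents with excessive repetition
Line level:
- drop lines containing terms such as javascript, lines consisting only of digits, and lines below a length threshold
Character level:
- remove citation markers common on Wikipedia, such as [1] and [citation needed]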

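Existing filtering methods are dataset-specific: the thresholds are tuned to the dataset at hand. For example, FineWeb's custom filters remove:

Documents whose fraction of lines ending in punctuation is ≤ 0.12 (removes 10.14% of tokens, more efficient than the 30% removed by C4's terminal-punctuation filter)

Documents whose fraction of characters in duplicated lines is ≥ 0.1 (removes 12.47% of tokens)

Documents whose fraction of short lines (fewer than 30 characters) is ≥ 0.67 (removes 3.73% of tokens)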
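
The three document-level rules above can be sketched in a few lines of Python. This is a minimal illustration, not FineWeb's implementation: the function names, the set of characters treated as line-ending punctuation, and the way duplicated characters are counted (every copy of any repeated line) are all assumptions.

```python
from collections import Counter

# Assumption: which characters count as line-ending punctuation.
TERMINAL_PUNCT = (".", "!", "?", '"')


def document_stats(text: str) -> dict:
    """Compute the three per-document ratios used by the rules above."""
    lines = [ln.strip() for ln in text.split("\n") if ln.strip()]
    if not lines:
        return {"punct_ratio": 0.0, "dup_char_ratio": 0.0, "short_ratio": 0.0}
    counts = Counter(lines)
    total_chars = sum(len(ln) for ln in lines)
    # Assumption: count the characters of every copy of a repeated line.
    dup_chars = sum(len(ln) * n for ln, n in counts.items() if n > 1)
    return {
        "punct_ratio": sum(ln.endswith(TERMINAL_PUNCT) for ln in lines) / len(lines),
        "dup_char_ratio": dup_chars / total_chars,
        "short_ratio": sum(len(ln) < 30 for ln in lines) / len(lines),
    }


def should_drop(text: str) -> bool:
    """Apply the three thresholds quoted in the notes."""
    s = document_stats(text)
    return (
        s["punct_ratio"] <= 0.12
        or s["dup_char_ratio"] >= 0.1
        or s["short_ratio"] >= 0.67
    )


menu = "Home\nAbout\nContact\nLogin"
prose = (
    "This is a reasonably long sentence that ends with a period.\n"
    "Here is another complete sentence, also long enough to pass."
)
print(should_drop(menu), should_drop(prose))  # True False
```

Because each rule looks only at whole-document ratios, a mostly clean page still keeps its stray menu or footer lines, which is the gap that line-level filtering targets.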

Method

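Data source: FineWeb. A 10-billion-token sample of FineWeb (about 15 million documents) is used, referred to as FineWeb-10BT.
20,000 documents are sampled and annotated with GPT-4o mini, which generates a descriptive label for each line and marks it as high or low quality.
o1-preview groups the large number of generated labels into a smaller, more manageable set.
An encoder-based classifier is trained and scaled to all of FineWeb-10BT.
GPT-2 is trained on FineWeb-10BT before and after cleaning and benchmarked on HellaSwag.

The whole process is data-driven and does not rely on a fixed set of categories.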

Experiments

Line labelling with GPT-4o mini


    # System prompt
    system = "You are an expert text classifier specializing in LLM training data. Your task is to classify each line of text based on its suitability for inclusion in a language model training dataset. High-quality content is clean, meaningful, well-structured, and useful for training language models. Low-quality content includes boilerplate elements (e.g., navigation menus, footers), non-linguistic symbols, formatting tags, placeholders like 'Lorem ipsum', and spammy, irrelevant, or toxic language."

    # User prompt
    prompt = f"""
    **Instructions:**

    1. **Line Identification and Separation**:
       - Each line starts with "Line X:" where X is the line number. Treat each "Line X:" as a single unit, regardless of length; do not split lines.
       - Lines are separated by newline characters (`\\n`) and dashes (`------`). If there's no newline character, treat the entire text as a single line.

    2. **Contextual Classification**:
       - Use the context of all lines when classifying each one, as they are sequential and from the same document.
       - For example, a line starting with a hyphen might be part of a list and should be classified as "Clean."

    3. **Assigning Labels**:
       - Assign **exactly one label** to each line.
       - If the line is suitable for inclusion, label it **"Clean"**.
       - If not, assign a specific and descriptive label explaining why it's unsuitable.
       - **Prefer labels from the provided list**. Only create a new label (max three words) if absolutely necessary.
       - **Do not use vague labels** like "Low-Quality," "Bad," "Unsuitable," etc. Labels must be specific and descriptive.

    4. **Focus on Linguistic Content**:
       - Retain valuable and diverse linguistic content suitable for language model pre-training, including natural language patterns, standard advertising copy, commercial language, and promotional content written in natural language.

    5. **Tolerance for Minor Errors and Toxic Language**:
       - Minor grammatical errors, typos, or small mistakes do not disqualify a line from being "Clean." Only exclude lines with pervasive errors that significantly hinder understanding.
       - Mild expletives and controversial opinions do not disqualify a line from being "Clean." Only exclude lines with blatantly hateful, harmful or toxic content.

    6. **Output Format**:
       - Your output must have exactly the same number of lines as the input, matching each line number correctly.
       - Output only the line number followed by the label, separated by a colon.
       - Do not include any additional text or explanations.
       - Do not output dashes between the lines.

    **Guidelines for "Clean" Lines**:

    Assign "Clean" to lines that:

    - Represent natural language suitable for training language models.
    - Include informal internet language, grammatical errors, questions, partial sentences, and common online expressions.
    - Contain standard advertising or commercial language in natural sentences.
    - Have properly formatted titles, headings, and readable content, even with stylistic elements.
    - Include minor in-text elements like email addresses, dates, or URLs within natural sentences.
    - Are general promotional content written in natural language.

    **Guidelines for Non-"Clean" Lines**:

    Lines not classified as "Clean" need a specific and descriptive label. Examples include lines that:

    - Contain blatantly hateful or harmful language. 
    - Are long passages of non-English text (excluding common foreign phrases used in English).
    - Include disclaimers, copyright notices, terms, and conditions.
    - Consist of menu items, login links, buttons, or navigation menus.
    - Contain random characters, garbled text, or excessive symbols.
    - Include programming code, HTML tags, or markup languages (when actual code or markup appears).
    - Present keywords, tags, or similar data without sufficient context.
    - Are irrelevant or spam-like content not suitable for training.
    - Are **excessively** promotional without natural language structure (e.g., a list of product names and prices without sentences).

    **Possible Labels for Non-"Clean" Lines**:

    {non_quality_labels}

    **Example Input:**

    Line 1: Welcome to our website!
    ------
    Line 2: Contact us at support@example.com.
    ------
    Line 3: ***** $$$$$
    ------
    Line 4: <div>Content</div>
    ------

    **Example Output:**

    Line 1: Clean  
    Line 2: Clean  
    Line 3: Encoding Errors  
    Line 4: HTML Tags

    **Now, classify the following lines:**

    {input}
    """

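Initially no non-Clean labels are provided. The model generates them incrementally, preferring existing labels and extending the list only when needed.
To avoid order effects, the label list is randomly shuffled after each iteration.
Each document is split into chunks of at most 15 lines, so that lines can be classified in context.
A single line may not exceed 200 characters; longer lines are split at punctuation into new lines.
- The paper notes that overly long lines cause erroneous model output.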
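
The preprocessing and output handling described above can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names are hypothetical, the exact punctuation rule for splitting long lines is an assumption, and so is restarting the `Line X:` numbering at 1 in every chunk. The API call to GPT-4o mini is deliberately left out.

```python
import re

MAX_CHARS = 200  # from the notes: longer lines are split at punctuation
MAX_LINES = 15   # from the notes: at most 15 lines per chunk


def split_long_line(line: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split an over-long line after punctuation, packing pieces up to max_chars."""
    if len(line) <= max_chars:
        return [line]
    # Assumed rule: break after . ! ? ; , when followed by whitespace.
    pieces = re.split(r"(?<=[.!?;,])\s+", line)
    out, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > max_chars:
            out.append(current)
            current = piece
        else:
            current = candidate
    if current:
        out.append(current)
    return out  # a piece with no punctuation may still exceed max_chars


def make_chunks(document: str, max_lines: int = MAX_LINES) -> list[list[str]]:
    """Split a document into chunks of at most max_lines non-empty lines."""
    lines = []
    for raw in document.split("\n"):
        if raw.strip():
            lines.extend(split_long_line(raw.strip()))
    return [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]


def format_chunk(chunk: list[str]) -> str:
    """Render a chunk in the 'Line X: ...' / '------' layout of the prompt's example input."""
    return "".join(f"Line {i}: {line}\n------\n" for i, line in enumerate(chunk, start=1))


LABEL_ROW = re.compile(r"^\s*Line\s+(\d+)\s*:\s*(.+?)\s*$")


def parse_labels(response: str, expected_lines: int) -> list[str]:
    """Parse 'Line X: Label' rows and check they cover lines 1..N exactly once."""
    labels = {}
    for row in response.splitlines():
        if not row.strip():
            continue
        match = LABEL_ROW.match(row)
        if match is None:
            raise ValueError(f"unparseable row: {row!r}")
        number = int(match.group(1))
        if number in labels:
            raise ValueError(f"duplicate line number: {number}")
        labels[number] = match.group(2)
    if sorted(labels) != list(range(1, expected_lines + 1)):
        raise ValueError("response does not match the input line numbers")
    return [labels[i] for i in range(1, expected_lines + 1)]
```

Validating the response against the expected line count matters because the prompt requires exactly one output row per input line; a chunk whose response fails the check can simply be retried.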

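Figure: 2D UMAP projection of the 50 most frequent labels

The size of each dot corresponds to the relative frequency of that label.

Legal text appears in the upper left, adult and harmful content clusters in the upper-middle right, and bibliographic references sit near the bottom. Contact information (such as times, dates and phone numbers) is loosely scattered on the left, while technical content (such as programming code) lies in the centre. These patterns suggest that the LLM-generated labels distinguish line quality effectively and provide a reliable basis for building the final taxonomy.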

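83% of the lines were labelled Clean
547 distinct labels were generated, some of which occur only once
- after manual review, these were relabelled directly as Clean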

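Label grouping

The remaining 382 labels are grouped by o1-preview (a reasoning model) into a smaller set of broad, more manageable categories.

The model is instructed to create clear, unambiguous categories
Each label may belong to only one group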

Category	Lines	%
Clean	283,267	86.24
Formatting, Style & Errors	13,150	4.00
Bibliographical & Citation References	8,768	2.67
Promotional & Spam Content	7,339	2.23
Contact & Identification Information	3,898	1.19
Navigation & Interface Elements	3,327	1.01
Technical Specifications & Metadata	3,298	1.00
Legal & Administrative Content	2,992	0.91
Offensive or Inappropriate Content	2,433	0.74
Total	328,472	100

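The model can make mistakes, for example failing to assign every label, or placing a label in more than one category.

These cases are fixed by hand.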
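
The two failure modes mentioned above are easy to detect automatically before the manual fix. The sketch below is only an illustration: the function name and the example labels and groups are hypothetical, and the paper does not describe how its check was done.

```python
from collections import Counter


def check_grouping(labels: set[str], groups: dict[str, list[str]]) -> dict[str, list[str]]:
    """Report labels that are unassigned, assigned to several groups, or not in the label set."""
    counts = Counter(label for members in groups.values() for label in members)
    assigned = set(counts)
    return {
        "missing": sorted(labels - assigned),
        "duplicated": sorted(label for label, n in counts.items() if n > 1),
        "unknown": sorted(assigned - labels),
    }


# Hypothetical labels and grouping, for illustration only.
labels = {"HTML Tags", "Copyright Notice", "Navigation Menu"}
groups = {
    "Formatting, Style & Errors": ["HTML Tags"],
    "Legal & Administrative Content": ["Copyright Notice", "HTML Tags"],
}
report = check_grouping(labels, groups)
print(report)
```

An empty report (three empty lists) means every label sits in exactly one group, which is the constraint given to the model.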

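Inter-Annotator Agreement (IAA) experiment

726 lines from 50 documents are sampled and independently classified by human annotators into the nine categories. Agreement is measured with Cohen's kappa: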

$$ \kappa = \frac{p_o - p_e}{1 - p_e} $$

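As a worked example, suppose two annotators (A and B) classify the sentiment of 100 texts as Positive or Negative. Their annotations are summarised in the following table:

	B: Positive	B: Negative	Total
A: Positive	50	10	60
A: Negative	20	20	40
Total	70	30	100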

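$p_o$ is the proportion of items on which the two annotators actually agree, i.e. the sum of the diagonal cells divided by the total.

The two annotators agree on 70 items (50 Positive + 20 Negative), so $p_o = 0.7$.

$p_e$ is the agreement expected if both annotators labelled at random according to their own label frequencies. Compute the joint probability of chance agreement for each class, then sum.

Probability that A labels Positive: $P_{\text{A+}} = \frac{60}{100} = 0.6$

Probability that A labels Negative: $P_{\text{A-}} = \frac{40}{100} = 0.4$

Probability that B labels Positive: $P_{\text{B+}} = \frac{70}{100} = 0.7$

Probability that B labels Negative: $P_{\text{B-}} = \frac{30}{100} = 0.3$

Given these probabilities, compute the probability that the two agree under random labelling:

Both label Positive by chance: $P_{\text{A+}} \times P_{\text{B+}} = 0.6 \times 0.7 = 0.42$

Both label Negative by chance: $P_{\text{A-}} \times P_{\text{B-}} = 0.4 \times 0.3 = 0.12$

Therefore: $p_e = 0.42 + 0.12 = 0.54$

Interpretation:
If both annotators labelled at random with these label frequencies, about 54% of the items would be expected to agree by coincidence.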

$$ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.7 - 0.54}{1 - 0.54} = \frac{0.16}{0.46} \approx 0.348 $$

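κ ≈ 0.35: this falls between 0.2 and 0.4, indicating "fair" agreement (only somewhat above chance).

Compared with the raw agreement rate of 70%: using 70% directly would overstate agreement, whereas Cohen's kappa removes the chance component and gives a stricter assessment.

$p_o$: the directly observed proportion on the diagonal.

$p_e$: the "chance agreement" probability derived from the marginal distributions, reflecting spurious agreement due to coincidence.

What kappa means: it quantifies agreement beyond the chance level and avoids overestimating reliability.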


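κ range	Strength of agreement	Interpretation
κ ≤ 0	No better than chance	Agreement is at or below random guessing (rare; may indicate systematic disagreement or annotation errors).
0 < κ ≤ 0.2	Slight (negligible)	Agreement is very low and of almost no practical significance.
0.2 < κ ≤ 0.4	Fair (weak)	Agreement is weak but above chance (treat results with caution).
0.4 < κ ≤ 0.6	Moderate	Agreement is moderate and results are reasonably reliable (common in human annotation tasks).
0.6 < κ ≤ 0.8	Substantial	Agreement is strong and results are reliable (e.g. diagnoses by specialist physicians or strict annotation protocols).
0.8 < κ ≤ 1	Almost perfect	Agreement is very high, close to perfect (rare; usually worth checking for overfitting or overly simple annotation rules).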
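
The worked example can be checked numerically. The function below is a minimal sketch of Cohen's kappa computed from a square confusion matrix (rows for annotator A, columns for annotator B); the function name is an illustrative choice, and the only data used is the 2×2 table from the example.

```python
def cohens_kappa(matrix: list[list[int]]) -> float:
    """Cohen's kappa from a square confusion matrix (rows: annotator A, columns: annotator B)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of the two annotators' marginals, summed over classes.
    row_marginals = [sum(row) / total for row in matrix]
    col_marginals = [sum(col) / total for col in zip(*matrix)]
    p_e = sum(r * c for r, c in zip(row_marginals, col_marginals))
    return (p_o - p_e) / (1 - p_e)


kappa = cohens_kappa([[50, 10], [20, 20]])
print(round(kappa, 3))  # 0.348
```

The same function works unchanged for the nine-category setting, since it only assumes that both annotators use the same ordered set of classes.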

The IAA experiment gives:

	A1	A2	Avg.
All labels	0.79	0.60	0.70
Clean vs. Non-clean	0.78	0.67	0.73

Overall, the LLM-based classification produces acceptable labels for FineWeb text.

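Classifier training

DeBERTa-v3
Stella-en-400M-v5
XLM-RoBERTa-base (multilingual)

We first extract the individual lines from the documents and treat each line as a separate sample. The data is then shuffled and split by stratified sampling into training (70%), development (10%) and test (20%) sets. A classification head is added to each model to produce a probability distribution over the 9 classes for each line, and the head is fine-tuned together with the base model.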
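
A stratified 70/10/20 split can be sketched with the standard library alone. This is illustrative only: the function name, the rounding of per-class counts and the fixed seed are assumptions, and the paper's own code may split differently.

```python
import random
from collections import defaultdict


def stratified_split(samples, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split (text, label) pairs into train/dev/test, keeping label proportions per split."""
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_train = round(fractions[0] * len(items))
        n_dev = round(fractions[1] * len(items))
        train += items[:n_train]
        dev += items[n_train:n_train + n_dev]
        test += items[n_train + n_dev:]  # the remainder goes to the test set
    for split in (train, dev, test):
        rng.shuffle(split)  # mix the classes again inside each split
    return train, dev, test


# Toy data: 80 Clean lines and 20 navigation lines.
samples = [(f"clean line {i}", "Clean") for i in range(80)]
samples += [(f"nav line {i}", "Navigation & Interface Elements") for i in range(20)]
train, dev, test = stratified_split(samples)
print(len(train), len(dev), len(test))  # 70 10 20
```

Stratifying matters here because Clean dominates the data: a plain random split could leave the rarest categories with very few lines in the development or test set.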

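We use bfloat16 precision, a learning rate of 1e-5 and a batch size of 16. Early stopping is based on the evaluation loss (patience 5), with a maximum of 5 epochs, although the models usually converge after the first epoch. Label smoothing of 0.1 is applied to the cross-entropy loss to improve generalisation. All training runs on a single A100 GPU.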
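
To make the label-smoothing term concrete, here is a small sketch of a smoothed cross-entropy for a single line. It assumes the common convention in which the one-hot target is mixed with a uniform distribution over all K classes (the convention used by PyTorch's `CrossEntropyLoss(label_smoothing=...)`); the function name and the example probabilities are made up.

```python
import math


def smoothed_cross_entropy(probs, target, epsilon=0.1):
    """Cross-entropy against a target that puts 1 - epsilon + epsilon/K on the true class
    and epsilon/K on every other class (K = number of classes)."""
    k = len(probs)
    loss = 0.0
    for i, p in enumerate(probs):
        q = epsilon / k + ((1.0 - epsilon) if i == target else 0.0)
        loss -= q * math.log(p)
    return loss


# A very confident 9-class prediction for class 0 (made-up numbers that sum to 1).
confident = [0.92] + [0.01] * 8
plain = smoothed_cross_entropy(confident, target=0, epsilon=0.0)
smoothed = smoothed_cross_entropy(confident, target=0, epsilon=0.1)
print(plain < smoothed)  # True
```

With smoothing, pushing the wrong classes towards probability zero is penalised, which discourages the extreme confidence that a dataset dominated by Clean lines would otherwise encourage.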

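Figure: classifier confusion matrix

Most misclassified samples fall into the Clean category, which indicates that the other categories are well separated from one another.
Offensive or Inappropriate Content is the least well-separated category, reflecting the inherent difficulty of defining the boundary of offensive material in the LLM training data.
Bibliographical & Citation References is the best-separated category, thanks to its easily recognisable format and content.

The classifier is more likely to mislabel low-quality lines as Clean than to mislabel high-quality lines as low-quality. This bias reduces the risk of discarding valuable data from the dataset.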

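Data cleaning

With Clean lines making up 86% of the data, the classifier may well produce overconfident predictions.

Platt scaling is used:
- a Platt logistic regression model is fitted on the held-out test set
- it is stacked on top of the classifier when predicting quality scores for FineWeb-10BT
- left as an open item in these notes; not examined further for now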
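
For orientation, Platt scaling fits a one-dimensional logistic regression p = sigmoid(a·s + b) that maps a raw classifier score s to a calibrated probability. The sketch below fits a and b by plain gradient descent on toy data. It is not the paper's procedure: the function names, the choice of score as input, the learning rate, the step count and the data are all assumptions.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, labels, lr=0.1, steps=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the mean log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            error = sigmoid(a * s + b) - y  # derivative of the log loss w.r.t. the logit
            grad_a += error * s
            grad_b += error
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score: float, a: float, b: float) -> float:
    """Map a raw score to a calibrated probability of the positive (Clean) class."""
    return sigmoid(a * score + b)


# Toy held-out data: raw scores and whether the line was truly Clean (1) or not (0).
scores = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0, -1.0, 1.0]
labels = [0, 0, 0, 1, 1, 1, 1, 0]
a, b = fit_platt(scores, labels)
print(calibrate(3.0, a, b) > calibrate(-3.0, a, b))  # True
```

Only two parameters are learned, so the calibrator can be fitted on a small held-out set and then applied to every score the classifier produces at scale.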

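The whole dataset is split into shards, and each shard is processed in batches of 128 lines.

The task is reduced to binary classification: is a line Clean or not?
The threshold is set to either 0.5 or 0.9.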
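
The final filtering step can be sketched as follows, assuming each line already has a Clean probability from the classifier. The helper names, the use of `>=` at the threshold and the example probabilities are illustrative assumptions.

```python
def batches(items, size=128):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def keep_clean(lines, clean_probs, threshold=0.5):
    """Keep the lines whose Clean probability is at least the threshold."""
    return [line for line, p in zip(lines, clean_probs) if p >= threshold]


lines = ["Home | About | Login", "The recipe needs two eggs.", "Posted by admin"]
probs = [0.05, 0.97, 0.62]  # made-up Clean probabilities
print(keep_clean(lines, probs, 0.5))  # ['The recipe needs two eggs.', 'Posted by admin']
print(keep_clean(lines, probs, 0.9))  # ['The recipe needs two eggs.']
```

The stricter 0.9 threshold trades recall for precision: it removes more borderline lines at the cost of discarding some acceptable text.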

GPT-2 training results

LLM data cleaning · paper notes (3)
