Chapter 13: Preference Data and Reward Signals¶
Abstract¶
After supervised fine-tuning, a model can already follow instructions and produce formatted responses. Whether it can consistently respond in the manner an organization expects, however, depends on the design of preference data and reward signals. This chapter is addressed to teams responsible for constructing preference data and building reward models. It explains how preference data determines a model's behavioral style and why preference alignment is still necessary after SFT. The chapter first distinguishes three supervision paradigms—pairwise preference, scalar score, and process reward—along with their respective data requirements. It then discusses sources of preference (expert annotation, user feedback, model-as-judge, and rule-based arbitration) and strategies for mixing online, offline, and synthetic preference data. It proceeds to address annotation disagreement, style bias, and score drift, offering consistency governance methods including arbitration, re-labeling, annotator calibration, and golden sets. Finally, it establishes an interface mapping between preference data and training methods such as DPO, RM, RLAIF, and PRM, clarifying when to use pairwise preferences and when to use process supervision.
After SFT, a model can typically handle basic tasks: it understands instructions and produces answers in a given format. But in real-world deployment, the question is rarely whether the model can answer at all—it is whether the model will consistently answer in the way the organization expects.
The choices at issue here are not the low-level sampling decisions about which token the model selects next. They concern how a model, when faced with multiple response paths that are all "basically correct," will favor a particular style, follow a particular ordering of values, prioritize certain objectives, and make trade-offs when objectives conflict. A model that has not undergone preference alignment may be factually sound yet still exhibit verbosity, sycophancy, excessive conservatism, vagueness, overconfidence, over-templating, loosening of risk boundaries, or inconsistent business messaging. Such a model can be used, but it cannot be used stably, at scale, or with confidence.
This is precisely why preference data and reward signals occupy an independent position in the model lifecycle. Their role is not to teach the model "what knowledge exists in the world," nor "what the correct answers to tasks are." They further specify: given multiple candidate behaviors that are all acceptable, what should be encouraged, what must be suppressed, what is acceptable in ordinary contexts but impermissible in high-risk ones, and what may please users yet violates business norms. In other words, the focus of preference learning is not knowledge acquisition—it is shaping a value ordering (Christiano et al. 2017; Ziegler et al. 2019).
For teams responsible for data design, annotation specifications, training interfaces, and deployment governance, this means preference data cannot be treated as a simple extension of SFT data. SFT is more like specifying what the model "can do," while preference data specifies "how the model does those things, whose needs it prioritizes, how it handles conflicts, and when it should stop." The former primarily addresses capability; the latter further addresses controllability, consistency, and organizational fit. Many teams, having completed SFT, feel the model is "good enough." Truly mature teams recognize that the preference phase is the critical engineering step for translating abstract business requirements, risk boundaries, and brand voice into training signals.
This chapter is addressed to team members who need to take preference learning from concept to concrete data construction and training interfaces. It systematically discusses the role of preference data, the relationship between preference pairs and reward models, process reward design, multi-objective preference aggregation and Pareto trade-offs, preference sources and supervision modes, noise and consistency governance, and how this data maps to training methods such as DPO, RM, RLAIF, and PRM (Lightman et al. 2024; Uesato et al. 2022). We emphasize a core thesis: preference learning should not be understood as "one more round of human feedback"—the focus is on explicitly constructing an organization-level behavioral ranking system. Reward signals, likewise, should not be treated as scores casually appended during training; they are the critical interface for formally converting that ranking system into learnable training objectives.
Keywords¶
Preference data and reward signals; supervised fine-tuning; preference data; alignment data; quality evaluation
Learning Objectives¶
- Explain why preference data determines a model's behavioral style and the fundamental reasons preference alignment is still needed after supervised fine-tuning.
- Distinguish the definitions, applicable tasks, and data requirements of three signal types: pairwise preference, scalar reward, and process reward.
- Compare the different dependencies of direct preference optimization and reward modeling on "comparison quality" and "scale stability," and design candidate construction strategies accordingly.
- Identify conflicts in multi-objective preferences, and apply Pareto trade-off thinking to design multi-layer reward systems.
- Govern annotation disagreement, style bias, and score drift—improving preference data reliability through arbitration, re-labeling, annotator calibration, and golden sets.
13.1 The Role of Preference Data¶
13.1.1 What Problem Does Preference Data Solve: Style, Trade-offs, and Behavioral Ranking¶
SFT can teach a model the basic response pattern for a given class of tasks, but it does not necessarily determine how the model chooses among multiple viable answers. The essence of SFT is point-to-point imitation, encouraging the model to approximate a reference output. In the real world, however, a large proportion of scenarios have no single correct answer—the same question may simultaneously admit a more concise response, a more complete one, a more cautious one, and a more encouraging one; all are internally consistent, but an organization will not treat them equally. What truly determines behavioral style is not whether the model can produce these answers, but which way it tends to lean when choosing among them.
Behavioral style is not determined by a single template or a choice of tone words; it solidifies into a statistical outcome through an accumulation of local choices. A model is perceived as "professional" not because it routinely says "the following is a professional analysis," but because across a large number of edge cases it more consistently hedges uncertainty, explains its reasoning, and articulates limitations. In other words, style is a ranking outcome, not merely a surface feature.
From a data construction perspective, preference data determines behavioral style through at least four layers: defining comparison dimensions (truthfulness, helpfulness, compliance, conciseness, etc.), defining the priority ordering among dimensions (correctness outweighs fluency in high-risk tasks), defining how conflicts are resolved, and defining the range of variation the organization can tolerate. Preference data is what transforms an abstract "behavioral style" into locally comparable facts that are visible during training.
Preference learning also converts originally ambiguous value judgments into a supervisory form that can enter training. When a team says "the model should be more robust," this statement cannot directly serve as a trainable label. Only when, for the same input, candidate A offers a definitive judgment while candidate B explicitly notes insufficient evidence and limitations—and the annotation system consistently marks B as preferred—does "robustness" get translated into a ranking bias that model parameters can perceive. The core value of preference data lies in making the organization's implicit value function explicit: when correctness conflicts with naturalness, which takes precedence? When safety boundaries conflict with user experience, which takes precedence? When completeness conflicts with conciseness, which takes precedence?
The stability of behavioral style depends on the directional consistency of the preference distribution. Style is not determined by individual samples; it emerges from an overall trend that is reinforced repeatedly across a population of samples. Unlike factual knowledge, which is largely retained once learned, stylistic behavior is gradually "pushed" in a direction within a fuzzy space. A model will not automatically generalize "in high-risk scenarios, always state limitations first" to all related tasks after encountering that principle once. What it actually learns is which expression patterns, response structures, and tone choices are more likely to receive positive feedback.
Suppose a team wants the model to always state limitations first in high-risk scenarios. If such samples appear sparsely in the training set, the model will struggle to generalize. Only when this preference is consistently repeated across different topics, different phrasings, and different difficulty levels will the model learn it as a stable tendency. Conversely, if half the training samples reward caution and the other half reward strong affirmation, the model will ultimately develop a wavering mixed strategy—sometimes cautious, sometimes assertive. Each individual response may seem defensible, but together they exhibit no unified behavioral center of gravity. For real-world products, a state of "occasionally correct but overall unstable" is often harder to address than pure factual errors, because it undermines users' expectations of the system's overall behavior.
The worst thing that can happen to preference samples is directional inconsistency. A large set of directionally inconsistent samples teaches the model only to "randomly sample one way of speaking depending on context," whereas a smaller set of directionally consistent samples forms a stable behavioral tendency more readily. The formation of behavioral style is fundamentally a distributional consequence, not an accumulation of memorable phrases.
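Directional consistency can be made measurable by tagging each pair with the rule it is meant to exercise and tracking how often the labeled winner actually follows that rule. The sketch below is a minimal illustration under stated assumptions: the `rule` and `winner_follows_rule` fields and the 0.8 threshold are hypothetical, not part of any standard schema.

```python
from collections import defaultdict

def directional_consistency(pairs):
    """Return, per rule, the fraction of pairs whose winner follows the rule."""
    totals = defaultdict(int)
    follows = defaultdict(int)
    for p in pairs:
        totals[p["rule"]] += 1
        if p["winner_follows_rule"]:
            follows[p["rule"]] += 1
    return {rule: follows[rule] / totals[rule] for rule in totals}

def flag_inconsistent(rates, threshold=0.8):
    """Rules whose agreement rate falls below the (assumed) threshold."""
    return sorted(rule for rule, rate in rates.items() if rate < threshold)

# Hypothetical labels: half the "state_limits_first" pairs reward caution,
# the other half reward strong affirmation.
pairs = [
    {"rule": "state_limits_first", "winner_follows_rule": True},
    {"rule": "state_limits_first", "winner_follows_rule": False},
    {"rule": "state_limits_first", "winner_follows_rule": True},
    {"rule": "state_limits_first", "winner_follows_rule": False},
    {"rule": "avoid_overpromising", "winner_follows_rule": True},
    {"rule": "avoid_overpromising", "winner_follows_rule": True},
]
rates = directional_consistency(pairs)
flagged = flag_inconsistent(rates)
print(rates, flagged)
```

A rate near 0.5 indicates that the dataset is pushing the model in both directions at once, which is the wavering mixed strategy described above.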
Reward Hacking: When Preference Signals Capture Only Surface Features. A cautionary pattern must be noted here: if preference signals fail to capture task intent itself, the model will not automatically understand "why that answer is better"; it will only learn "what kind of output tends to win." This failure mode is known as reward hacking. In engineering practice, it has at least four common manifestations:
- Format exploitation: The reward model focuses on learning "whether fields are complete" and "whether brackets are properly closed," so the model prioritizes generating structurally complete but content-empty answers—all JSON fields present, yet values uniformly "unable to determine" or "recommend manual review."
- Length exploitation: Annotators over-favor short answers, and the reward model mistakenly learns that "short" equals "good." The model truncates its responses prematurely on complex questions, omitting necessary conditions and withholding uncertainty. For tasks such as legal interpretation or medical risk explanation, high-quality answers inherently require preserving key premises and boundary conditions.
- Tone exploitation: Polite, positive expressions are more likely to be labeled as preferred, and the model learns to wrap insufficient content in warm language—repeatedly saying "you're right" and "thank you for pointing that out" without actually diagnosing the error source or providing actionable corrections.
- Safety template abuse: The model treats "defaulting to a safety template" as the most reliably high-reward behavior, refusing not only genuinely dangerous questions but also entirely benign ones it could safely answer—appearing safe on the surface while being practically unusable in high-risk domains.
The core lesson from these cases is: format should serve content organization rather than replace content; conciseness should eliminate redundancy rather than eliminate justification; politeness should improve interaction rather than mask empty responses; and safety should control risk rather than become indiscriminate refusal. Preference data must decompose organizational requirements down to the sample level. Words like "professional, trustworthy, natural, robust" cannot directly train a model; only when they are written into the comparison rules for candidate answers will the model receive clear signals: when to be more cautious, when to be more direct, and when to prioritize stating boundaries.
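Some of these surface biases can be screened for before training. As one hedged example, the sketch below checks a pairwise dataset for length bias by measuring how often the shorter response wins. The word-count proxy and the sample pairs are illustrative assumptions, and a high rate is only a prompt for manual review, not proof of a problem.

```python
def word_count(text):
    """Crude length proxy: whitespace-separated tokens."""
    return len(text.split())

def shorter_wins_rate(pairs):
    """Fraction of pairs (ignoring ties in length) where the winner is shorter."""
    decided = [
        (word_count(p["winner"]), word_count(p["loser"]))
        for p in pairs
        if word_count(p["winner"]) != word_count(p["loser"])
    ]
    if not decided:
        return None  # no pair differs in length, so the rate is undefined
    return sum(1 for w, l in decided if w < l) / len(decided)

# Hypothetical pairs for illustration only.
pairs = [
    {"winner": "ok", "loser": "a much longer answer"},
    {"winner": "short", "loser": "a longer reply here"},
    {"winner": "brief", "loser": "considerably more verbose"},
    {"winner": "a detailed answer with conditions", "loser": "no"},
]
rate = shorter_wins_rate(pairs)
print(rate)
```

The same pattern extends to the other manifestations, for example by counting how often the winner is a refusal template or how often it contains placeholder values.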
13.1.2 Why Preference Alignment Is Still Needed After SFT: Separation of Capability and Strategy¶
A common misconception is that sufficiently rich SFT data eliminates the need for preference alignment. But the two answer fundamentally different questions: SFT answers "what should the model learn to produce," while preference alignment answers "when multiple candidate outputs are all usable, which one should actually be preferred" (Askell et al. 2021; Ouyang et al. 2022; Bai et al. 2022a).
SFT is essentially point-to-point imitation, well suited to building task capability; preference alignment is candidate-wise ranking, well suited to shaping value ordering. Post-SFT models often exhibit "over-helpfulness" (cramming everything conceivable into the answer), "simulated certainty" (fluent and confident but lacking boundary declarations), and "style rigidity" (unable to switch to a more structured mode when a high-risk request appears). Many problems difficult to express within SFT—"this answer is correct but more verbose," "the information is complete but boundary declarations are insufficient," "it sounds natural but gives the impression of over-commitment"—are much better expressed through preference comparison. Their essence is not absolute right or wrong, but relative better or worse.
From a functional layering perspective, SFT more closely resembles building a foundational skill library (placing tools in the toolbox), while preference alignment more closely resembles establishing dispatch principles (which of the available tools to use by default). A model may be capable of producing both long and short answers, both assertive recommendations and cautious hedges. SFT makes it "capable of all," but does not automatically determine which it "selects more often." What real-world deployed systems fear most is not a model that is incapable, but a model that is capable yet makes the wrong choice.
This "capable yet wrong choice" problem is often inconspicuous in offline testing, because offline evaluation more readily reveals whether the model answered the question at all—it does not necessarily expose the model's tendency to choose among multiple viable answers. Yet much of what makes a product "good to use" is determined not by the ceiling of capability, but by default trade-off behavior. An enterprise customer service assistant that always gives lengthy answers to simple questions feels inefficient to users; a medical assistant that always tries to complete an answer with incomplete symptom information rather than first triaging the risk—even if medically knowledgeable—remains unsafe. These are not gaps in capability; they are priority ordering biases.
This is also why the same base model, shaped by different preference data, can exhibit very different system character—some more restrained, some more proactive, some leading with conclusions, some leading with conditions. The difference lies primarily in "default choices." What preference alignment truly addresses is not "can it be done," but "among multiple options that can all be done, which is most worth being repeatedly selected." It gives the model a stable decision center of gravity—without this center, the model easily exhibits inconsistency across different inputs: concise today, expansive tomorrow; cautious this time, quick to conclude the next.
Therefore, the key to deciding whether a problem belongs in SFT or the preference phase is to determine whether it represents "cannot do" or "can do but keeps choosing wrong." The former is a capability problem; the latter is a strategy problem. Distinguishing these two types is what gives the entire post-training pipeline its structure: first, use SFT to equip the model with sufficiently broad and stable capability interfaces; then, use preference alignment to organize those capabilities into behavioral sequences that better match product goals. SFT addresses the capability foundation; preference alignment addresses strategic consistency—moving the model from "can answer" toward "will answer in the way you want."
13.1.3 From Human Preferences to Business Preferences: Conflicts, Translation, and Annotation Implementation¶
In discussions of preference learning, "human preferences" are often treated as a naturally legitimate target. For enterprise-grade systems, however, what must genuinely be modeled is a superposition of multiple preference sources: ordinary users' interaction preferences, domain experts' professional judgment, business stakeholders' organizational objectives, risk and compliance baseline requirements, and brand teams' voice consistency requirements. These overlap substantially but are far from identical—user preferences favor naturalness, immediacy, fewer restrictions, and a human-like feel; business preferences require clear boundaries, restrained commitments, and auditable information. Users may want the model to give clear recommendations even when evidence is thin; business stakeholders require explicit acknowledgment of uncertainty. Users prefer conversation that feels "like a friend"; enterprise services require formal, robust, and accountable communication.
Preference data construction therefore cannot rest on the crude assumption of "let annotators choose the answer they prefer by intuition." A mature preference system must first distinguish "experience preference," "professional preference," "compliance preference," "brand preference," and "business preference," then decide how they are aggregated—otherwise "user-centric" may devolve into "short-term subjective satisfaction-centric," undermining long-term system stability.
It is precisely here that Pareto trade-offs become indispensable. For multi-objective systems, no uniquely optimal solution exists—one response may score very high on politeness but have weak boundary declarations, while another may be very robust yet feel mechanical. What the business truly must do is not abstractly declare which is "better," but judge by scenario: for high-risk tasks, we are more willing to sacrifice naturalness in exchange for safety consistency; for marketing scenarios, we are willing to give the model greater expressive freedom. The role of preference data is exactly to transform this strategy from verbal discussion into trainable ranking evidence.
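The scenario-dependent trade-off can be sketched in code: first discard candidates that are dominated on every objective, then pick among the remaining Pareto-optimal candidates using scenario-specific weights. The two objectives, the scores, and the weight values below are illustrative assumptions, not recommended settings.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    ]

def pick(candidates, front, weights):
    """Choose the front member with the highest weighted sum for a scenario."""
    return max(front, key=lambda n: sum(w * s for w, s in zip(weights, candidates[n])))

# Scores are (naturalness, boundary clarity); all values are hypothetical.
candidates = {
    "A": (0.9, 0.4),  # polite and natural, weak boundary declarations
    "B": (0.5, 0.9),  # robust, but feels mechanical
    "C": (0.4, 0.3),  # worse than both A and B on every objective
}
front = pareto_front(candidates)
high_risk = pick(candidates, front, weights=(0.2, 0.8))
marketing = pick(candidates, front, weights=(0.8, 0.2))
print(front, high_risk, marketing)
```

Neither A nor B is "better" in the abstract; the winner changes with the scenario weights, and that scenario-conditioned ranking is what the preference labels need to record.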
Business preferences that do not enter annotation specifications cannot enter training. Principled requirements such as "more aligned with brand voice," "more like a professional consultant," and "more robust yet not mechanical" must be further decomposed into operationalizable judgment rules for annotators: "more aligned with brand voice" can be broken down into whether overly colloquial language is avoided, whether consistent forms of address are maintained, and whether exaggerated promises are avoided; "more like a professional consultant" can be broken down into whether the basis for judgment is stated, whether uncertainty is acknowledged, and whether conditions and limitations are given. Before business preferences can enter training, they must first undergo a "data translation"—training methods are merely the consuming end; the true conversion of business goals into supervisory structure is accomplished by data design itself.
Figure 13-1 illustrates the workflow from preference data to reward signal.
Figure 13-1: Flowchart from preference data to reward signal.
13.2 Pairwise Preferences, Scalar Rewards, and Process Rewards¶
Definitions of Pairwise Preference, Scalar Score, and Process Reward¶
Preference learning is often summarized as "giving the model feedback," but from a data engineering perspective this formulation is far too vague. When entering training, feedback must exist as a concrete data structure, and different training methods accept different forms of feedback. For data teams, the first question to clarify is not "should we collect human feedback," but "in what form should we express feedback." In current mainstream practice, the three most fundamental forms are pairwise preference, scalar score, and process reward (Uesato et al. 2022; Bradley & Terry 1952).
Pairwise preference is the most classic and currently most widely used form. The core idea is to generate two or more candidate outputs from the same input, then have a human, model, or rule system judge which is better. The most typical data structure is:
(x, y_w, y_l)
where x denotes the input, y_w the winning response, and y_l the losing response. "Winning" does not mean absolutely correct—it means more preferred on the current comparison dimension and in the current task context. The greatest advantage of pairwise preference is that it avoids many scale inconsistency problems inherent in absolute scoring. An annotator may not stably judge whether a response merits a 3 or a 4 on a rating scale, but can more easily judge which of two candidates is more worth retaining. For this reason, pairwise preference is often viewed as a low-overhead, high-robustness data interface (Bradley & Terry 1952).
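The Bradley-Terry model cited above gives one common way to turn such pairs into a probability. As a sketch, assume each response has a latent score $r(x, y)$ (the symbol $r$ is introduced here for illustration) and let $\sigma$ be the logistic function. The probability that $y_w$ is preferred over $y_l$ is then modeled as:

```latex
P(y_w \succ y_l \mid x)
  = \sigma\bigl(r(x, y_w) - r(x, y_l)\bigr)
  = \frac{\exp r(x, y_w)}{\exp r(x, y_w) + \exp r(x, y_l)}
```

Only the difference between the two scores matters, which is one reason pairwise labels sidestep the question of what absolute value a response "deserves."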
Scalar score assigns a numerical rating to a single output. It can be expressed as:
(x, y, r)
where r may be a discrete level or a continuous value. Scalar scores are highly intuitive and especially well suited to scenarios requiring the training of a Reward Model (RM), since the RM's objective is to learn a function that maps an input-output pair to a real-valued score (Ouyang et al. 2022). From a unified interface perspective, scalar scores also have the advantage of being more convenient for downstream analysis, ranking, threshold-cutting, and cross-system sharing. However, they are also more prone to score drift and scale bias. Different annotators may have inconsistent understandings of "4 points" versus "5 points," and the density of high scores may differ across task pools, so scalar rewards typically incur greater governance costs in collection, calibration, and normalization.
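Per-annotator standardization is one simple way to reduce scale bias. The sketch below converts each rater's raw scores to z-scores so that a lenient rater's 5 and a strict rater's 3 can land at the same relative position. The field names and sample records are assumptions, and z-scoring is only one of several possible calibration choices.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_per_rater(records):
    """Attach a per-rater z-score to each record (0.0 if the rater never varies)."""
    by_rater = defaultdict(list)
    for r in records:
        by_rater[r["rater"]].append(r["score"])
    stats = {rater: (mean(s), pstdev(s)) for rater, s in by_rater.items()}
    out = []
    for r in records:
        mu, sd = stats[r["rater"]]
        z = 0.0 if sd == 0 else (r["score"] - mu) / sd
        out.append({**r, "z": z})
    return out

# Hypothetical ratings: both raters agree on the ordering but use different scales.
records = [
    {"rater": "r_lenient", "item": "a", "score": 5},
    {"rater": "r_lenient", "item": "b", "score": 4},
    {"rater": "r_strict", "item": "a", "score": 3},
    {"rater": "r_strict", "item": "b", "score": 2},
]
normalized = normalize_per_rater(records)
for row in normalized:
    print(row["rater"], row["item"], row["score"], round(row["z"], 2))
```

This only works when each rater has scored enough items from a comparable task mix; otherwise differences in the task pool are mistaken for differences in strictness.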
Process reward addresses a different class of problems: for some tasks, value is not expressed solely in the final answer but in the process by which the model arrives at that answer. For such tasks, scoring only the final result is often insufficient, because errors may occur in intermediate steps and those intermediate errors can sometimes be masked by a surface-correct final output. The structure of process reward can be abstracted as:
(x, {s_1, s_2, ..., s_t}, {r_1, r_2, ..., r_t})
where s_i denotes an intermediate step, reasoning segment, tool-call action, or local decision state, and r_i denotes the local reward for the corresponding step. Process rewards are most suitable for long-chain reasoning, multi-tool invocation, complex execution planning, code generation, and Agent workflows, because the success or failure of such tasks is not determined solely by the terminal state—it depends heavily on the quality of intermediate behavior (Uesato et al. 2022).
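The difference between terminal and step-level feedback can be seen on a single trajectory. In the sketch below, an outcome-only reward reports success while the step rewards expose a failed intermediate step. The aggregation choices (`min` and `mean`) and the 0.5 threshold are illustrative assumptions, not a prescribed method.

```python
def outcome_reward(final_correct):
    """Terminal-state reward: looks only at the final answer."""
    return 1.0 if final_correct else 0.0

def process_reward(step_rewards, how="min"):
    """Aggregate step rewards r_1..r_t into one trajectory-level value."""
    if not step_rewards:
        raise ValueError("trajectory has no steps")
    if how == "min":
        return float(min(step_rewards))  # one bad step caps the whole path
    if how == "mean":
        return sum(step_rewards) / len(step_rewards)
    raise ValueError(f"unknown aggregation: {how}")

def first_failing_step(step_rewards, threshold=0.5):
    """Index of the first step whose reward is below the threshold, else None."""
    for i, r in enumerate(step_rewards):
        if r < threshold:
            return i
    return None

# Hypothetical trajectory: step 2 (index 1) picked the wrong tool parameters,
# yet the final answer happened to come out right.
trajectory = {"step_rewards": [1, 0, 1], "final_correct": True}
terminal = outcome_reward(trajectory["final_correct"])
path_min = process_reward(trajectory["step_rewards"], how="min")
bad_step = first_failing_step(trajectory["step_rewards"])
print(terminal, path_min, bad_step)
```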
From a data systems perspective, the differences among these three signal types are not merely a matter of "different storage formats"; they also reflect different functional roles in expressing preferences. Pairwise preference is better at expressing relative ranking; scalar reward is more suitable for forming a unified scoring layer; process reward pulls feedback backward from the terminal state into intermediate steps. Teams should choose the appropriate interface based on task characteristics, annotation capabilities, and training objectives, rather than treating "reward" as a single monolithic concept.
Furthermore, these three types correspond to three distinct problem orientations. Pairwise preference is more appropriate for answering "which of these two outputs should be retained"; scalar reward is more appropriate for answering "roughly what quality level does this output represent overall"; process reward is more appropriate for answering "through what path was this output obtained, and is the path itself worth encouraging." If a team has not even differentiated these problem orientations, a common failure mode arises: trying to solve complex long-chain behavior problems while collecting only overall satisfaction labels, resulting in a fundamental mismatch between training objectives and task structure.
The following snippet shows minimal data formats for the three types of preference/reward signals, in JSONL.
The same task input x can record supervisory signals using different "feedback interfaces." The three minimal viable structures below facilitate implementation in data pipelines and annotation platforms.
Listing 13-1 provides a JSON data example.
{"type":"pairwise","prompt":"User: Explain what DPO is in three sentences.","winner":"(A more concise answer that covers the key points)","loser":"(A more verbose answer that drifts off topic)","meta":{"task":"explain","dims":["helpful","concise"],"source":"human"}}
{"type":"scalar","prompt":"User: Please summarize the following policy document...","response":"(candidate response)","score":4,"meta":{"task":"summary","rubric":"v1","rater":"r_1027"}}
{"type":"process","prompt":"User: Break this problem into three steps and complete the retrieval.","steps":[{"state":"Formulate retrieval query","output":"(step1 text)","reward":1},{"state":"Select tool and parameters","output":"(step2 tool parameters)","reward":0},{"state":"Integrate evidence and respond","output":"(step3 final answer)","reward":1}],"meta":{"task":"rag_agent","unit":"step"}}
Listing 13-1: JSON data example.
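Records like these are easier to keep consistent when the pipeline validates them on ingestion. The following is a minimal sketch using only the standard library. The required-field sets mirror the field names used in the listing, and the function name `validate_record` is hypothetical.

```python
import json

# Required top-level fields per record type, following the listing above.
REQUIRED_FIELDS = {
    "pairwise": {"prompt", "winner", "loser"},
    "scalar": {"prompt", "response", "score"},
    "process": {"prompt", "steps"},
}

def validate_record(line):
    """Parse one JSONL line and check that it has the fields its type requires."""
    rec = json.loads(line)
    kind = rec.get("type")
    if kind not in REQUIRED_FIELDS:
        raise ValueError(f"unknown type: {kind!r}")
    missing = REQUIRED_FIELDS[kind] - set(rec)
    if missing:
        raise ValueError(f"{kind} record missing fields: {sorted(missing)}")
    if kind == "process":
        for i, step in enumerate(rec["steps"]):
            if "reward" not in step:
                raise ValueError(f"step {i} has no reward")
    return rec

# Shortened sample lines; one JSON object per line, as JSONL requires.
lines = [
    '{"type":"pairwise","prompt":"p","winner":"a","loser":"b"}',
    '{"type":"scalar","prompt":"p","response":"a","score":4}',
    '{"type":"process","prompt":"p","steps":[{"state":"s1","output":"o1","reward":1}]}',
]
records = [validate_record(line) for line in lines]
kinds = [r["type"] for r in records]
print(kinds)
```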
Which Tasks Are Suited to Each of the Three Reward Signal Types¶
From the perspective of task types, pairwise preference, scalar reward, and process reward do not have an absolute hierarchy—they are better understood as three interfaces suited to different problem levels.
Pairwise preference is especially well suited to open-ended generation, style optimization, and tasks with clear relative trade-offs. Examples include question answering, summary rewriting, email revision, refusal template optimization, customer service script selection, and brand messaging standardization. In such tasks, what teams truly care about is generally not "how many points does this response score," but rather "among several acceptable answers, which would we most prefer to keep." Expressing relative merits through pairwise preference is therefore the most natural fit.
Scalar reward is better suited to scenarios requiring unified ranking and cross-system sharing. For example, when teams want to build a unified candidate scorer, rank outputs from different model versions, coarsely filter online feedback samples, perform quality control on automatically generated data, or share a single reward interface across multiple tasks, the value of scalar reward becomes more apparent. Its advantage lies less in the initial annotation and more in downstream platform reuse.
Process reward is suited to tasks where "getting the result right is not enough—the path must also be right." Examples include complex reasoning, RAG retrieval planning, Agent tool invocation, code repair, multi-step form operations, and workflow orchestration. The critical questions for these tasks go beyond whether the final output is usable; they include whether intermediate steps are stable, interpretable, transferable, and auditable. Rewarding only the terminal state in such cases is likely to conceal real risks.
Data Requirements for Direct Preference Optimization and Reward Modeling¶
When preference data actually enters training methods, the two most common paths are Direct Preference Optimization (DPO) and Reward Modeling (RM) (Stiennon et al. 2020; Rafailov et al. 2023). Both rely on preference data, but their requirements for data quality, data format, and governance priorities are not identical. Much of the wasted effort in the preference learning phase stems from not first clarifying: what kind of data does our training method actually need to consume?
DPO's advantage is directness. It typically does not require training a separate explicit reward model; it can directly use preference pairs, increasing the relative probability of winning answers over losing answers during parameter updates. In other words, DPO directly converts "we prefer A over B" into a training objective. For DPO, therefore, the most critical data asset is high-quality preference pairs. The key is not "how many points each answer deserves" but whether the comparison relationship is reliable. As long as win/lose relationships are stable, candidates are sufficiently differentiated in quality, and annotation rules are consistent, DPO can typically work effectively.
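To show how a preference pair becomes a training objective, the sketch below computes the per-pair DPO loss from Rafailov et al. (2023) in plain Python. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available. The argument names and the value `beta=0.1` are illustrative; real training would batch this in a tensor library.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss: -log(sigmoid(beta * (winner log-ratio - loser log-ratio)))."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical sequence log-probabilities for one (x, y_w, y_l) triple.
same_as_ref = dpo_loss(-10.0, -12.0, -10.0, -12.0)    # policy equals reference
favors_winner = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # winner up, loser down
favors_loser = dpo_loss(-12.0, -10.0, -10.0, -12.0)   # the opposite shift
print(same_as_ref, favors_winner, favors_loser)
```

The loss depends only on which response is labeled the winner, not on any absolute score, which is why unreliable win/lose labels damage DPO so directly.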
This means that when serving DPO, data teams should focus on candidate generation mechanisms, hard-case coverage, win/lose relationship consistency, and clarity of comparison standards. For instance, if candidates always differ by a large margin, DPO will only learn to distinguish coarsely between "obviously good and bad," making it difficult to adapt to the subtle but important stylistic differences that matter online. If candidates are always extremely close yet the specification is unclear, annotation disagreement increases and training signals weaken. DPO may appear to save one layer of reward model training, but it actually places very high demands on the construction quality of paired samples.
The RM approach, in contrast, attempts to first learn a "scorer." It aims for the model not only to know which of A and B is better, but to produce reasonably stable quality estimates of individual outputs. Such methods typically demand more of the data, because they require signals to be cross-sample comparable to some degree. Even when the starting data are preference pairs, teams typically need to consider whether those relative comparisons are sufficient to constrain a stable scoring function. When scalar rewards are used directly, additional problems of score scale inconsistency, task distribution imbalance, and temporal drift must be resolved.
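A toy version of learning a scorer from preference pairs makes the idea concrete. The sketch below fits one scalar score per response by gradient descent on the Bradley-Terry loss, -log σ(s_w - s_l). It is a deliberately simplified stand-in for a reward model, which would score responses through a neural network rather than a lookup table. The data, learning rate, and step count are illustrative assumptions.

```python
import math

def fit_scores(pairs, steps=2000, lr=0.5):
    """Fit a score per item from (winner, loser) pairs under the Bradley-Terry loss."""
    scores = {}
    for w, l in pairs:
        scores.setdefault(w, 0.0)
        scores.setdefault(l, 0.0)
    for _ in range(steps):
        grads = dict.fromkeys(scores, 0.0)
        for w, l in pairs:
            p = 1.0 / (1.0 + math.exp(-(scores[w] - scores[l])))
            # d/ds_w of -log(sigmoid(s_w - s_l)) is -(1 - p); d/ds_l is +(1 - p).
            grads[w] -= 1.0 - p
            grads[l] += 1.0 - p
        for k in scores:
            scores[k] -= lr * grads[k] / len(pairs)
    return scores

# Hypothetical comparisons with some disagreement: "a" beats "b" 3 times out of 4,
# and "b" beats "c" 3 times out of 4.
pairs = [("a", "b")] * 3 + [("b", "a")] + [("b", "c")] * 3 + [("c", "b")]
scores = fit_scores(pairs)
print({k: round(v, 3) for k, v in scores.items()})
```

Although "a" and "c" were never compared directly, the fitted scores place them on one axis. That cross-sample comparability is what RM needs, and it holds only if the underlying comparisons are consistent enough to constrain it.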
The benefit of RM is that once a reliable reward model is trained, it can be reused across many contexts: candidate ranking, offline evaluation, policy optimization, online monitoring, and model-as-judge assistance. From a platform-building perspective, RM more closely resembles establishing a unified "reward interface." But for precisely this reason, its data system requirements are also more stringent. Teams cannot focus solely on who wins and who loses; they must also attend to why one wins, by how much, whether scores are comparable across different scenarios, and how dimension conflicts are aggregated.
From an engineering practice standpoint, if a team's most mature assets are large-scale preference pairs and the goal is to quickly steer model behavior toward better styles, DPO is usually the more convenient starting point. If the team wants to build long-term reusable reward infrastructure—enabling different models, different training phases, and different evaluation contexts to share the same reward representation—the value of RM will be higher. But whichever path is taken, the true prerequisite is not merely "collecting more feedback"; it also requires first establishing solid data foundations: preference definitions, candidate generation, scoring dimensions, noise governance, and version management.
DPO Depends More on "Comparison Quality"; RM Depends More on "Scale Stability"¶
From a data governance perspective, the difference between DPO and RM can be stated more plainly: DPO is more vulnerable to unstable comparison relationships, while RM is more vulnerable to unstable scoring scales (Stiennon et al. 2020; Rafailov et al. 2023).
For DPO, as long as the win/lose relationship between candidates on the same input is sufficiently clear, many problems are manageable. It is not particularly concerned with whether a given response scores 8 or 9; it cares more about whether the annotation system can consistently and stably identify the better option between A and B. DPO's primary risks therefore lie in poor candidate construction, imbalanced difficulty distribution, ambiguous win/lose criteria, and inconsistent annotator standards.
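The dependence on win/lose labels rather than on absolute scores can be made concrete with the per-pair DPO loss from Rafailov et al. 2023. The sketch below is illustrative only: the function name, argument names, and the default `beta` are assumptions, and the inputs are summed sequence log-probabilities that a real pipeline would obtain from the policy and reference models.

```python
import math

def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss computed from sequence log-probabilities."""
    # Implicit reward margin: how much more the policy favors the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), with a guard against overflow in exp().
    if margin < -30:
        return -margin
    return math.log1p(math.exp(-margin))

# Same log-probabilities, but with the chosen/rejected labels swapped.
consistent = dpo_pair_loss(-5.0, -9.0, -7.0, -7.0)
flipped = dpo_pair_loss(-9.0, -5.0, -7.0, -7.0)
print(consistent < flipped)
```

No absolute score appears anywhere in the computation. The only supervision is which response was labeled as chosen, so an inconsistent or flipped label pushes the policy in exactly the wrong direction for that pair.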
RM, because it must learn a cross-sample generalizable scoring function, must face another problem: can scores given to different samples, at different times, by different annotators be treated as quantities on the same coordinate axis? If not, what the reward model learns will not be a quality function but a noise function confounded by personnel differences, temporal differences, and task distribution differences. RM therefore typically requires greater rigor in calibration, normalization, source labeling, and stratified modeling.
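As one minimal illustration of the normalization step, scalar scores can be standardized within each annotator's own distribution before they are pooled. This is a sketch under assumptions: the record keys `annotator` and `score` are hypothetical, and z-scoring is only one of several possible calibration methods.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalize_by_annotator(records):
    """Z-score each score within its annotator's own distribution.

    records: list of dicts with hypothetical keys "annotator" and "score".
    Returns new dicts with an added "z" field; inputs are not modified.
    """
    by_annotator = defaultdict(list)
    for r in records:
        by_annotator[r["annotator"]].append(r["score"])
    stats = {a: (mean(s), pstdev(s)) for a, s in by_annotator.items()}
    out = []
    for r in records:
        mu, sigma = stats[r["annotator"]]
        # An annotator who always gives the same score carries no ranking
        # information, so map their scores to 0 instead of dividing by zero.
        z = (r["score"] - mu) / sigma if sigma > 0 else 0.0
        out.append({**r, "z": z})
    return out
```

This removes differences in strictness between annotators, but it also assumes that each annotator saw a comparable mix of samples. If one annotator was assigned mostly weak candidates, per-annotator normalization would erase a real quality difference, which is why source labeling and stratification are listed alongside normalization.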
This is why many teams start quickly with preference pairs for DPO, then gradually build the more complex RM infrastructure. The former addresses "first getting the direction right"; the latter addresses "building a unified reward infrastructure." The two are not mutually exclusive and often correspond to different maturity stages.
Why Candidate Construction Is a Prerequisite for Reward Signal Quality¶
Whether using DPO or RM, one frequently underestimated issue is candidate construction. Many teams place their emphasis on "whether annotation is accurate," overlooking the most critical step before annotation: how the candidate answers were generated. If the quality of the candidate pool is imbalanced, even the strictest annotation can only yield low-resolution preferences.
For example, if candidate A is obviously superior to candidate B, annotation is straightforward, but such samples can only teach the model to distinguish coarse-grained good from bad. If candidates A and B are extremely close yet the specification does not clarify priority among conflicting dimensions, high-disagreement samples increase and training signals become brittle. The ideal case is generally when candidates differ in ways that are genuinely meaningful for the business, but the differences are not so stark as to be immediately reducible to a single obvious surface dimension. Only such samples force the model to learn more fine-grained behavioral trade-offs.
Candidate construction is therefore the upstream core of preference learning. Which models the candidates come from, how differences are controlled, whether multiple styles are covered, whether boundary conflicts are included, and whether samples are deliberately gathered for cases that "look acceptable either way but the organization prefers one"—all of these directly determine whether the resulting reward signals carry real information.
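A simple way to audit a candidate pool before annotation is to profile how far apart the candidates in each pair are. The sketch below assumes that some proxy quality score in [0, 1] is available for each candidate (for example from an existing judge or heuristic); the thresholds and bucket names are hypothetical and would need tuning per task.

```python
from collections import Counter

def gap_bucket(score_a, score_b, small=0.05, large=0.4):
    """Label a candidate pair by the gap between two proxy quality scores."""
    gap = abs(score_a - score_b)
    if gap >= large:
        return "coarse"       # obvious winner: teaches only good vs. bad
    if gap < small:
        return "near_tie"     # likely high disagreement without a clear spec
    return "informative"      # meaningful but not obvious difference

def difficulty_profile(pairs):
    """pairs: iterable of (score_a, score_b). Returns the share of each bucket."""
    counts = Counter(gap_bucket(a, b) for a, b in pairs)
    total = sum(counts.values()) or 1  # avoid division by zero on empty input
    return {k: counts[k] / total for k in ("coarse", "informative", "near_tie")}
```

A pool dominated by the "coarse" bucket suggests that candidates should be drawn from closer model variants or sampling settings, while a pool dominated by "near_tie" suggests that the specification needs to state priorities before annotation begins.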
Outcome Reward vs. Process Reward: From "Accepting Results" to "Supervising Behavior"¶
Many teams new to preference learning default to evaluating only the final output: is this response good overall? This is a reasonable starting point—for tasks like summarization, classification, customer service replies, and refusal quality, the result itself is the primary value carrier, and outcome reward is typically sufficient.
But in complex tasks, focusing only on the terminal state leads to two typical problems. The first is "getting lucky"—the model arrives at a correct result through unreliable reasoning, skipped steps, or lucky guessing. Rewarding only the terminal state encourages the model to reinforce "getting the answer right is all that matters." The second is "superficially usable"—the final output appears to satisfy requirements, but the intermediate process violates important principles: necessary retrieval was skipped, evidence verification was omitted, tool invocation order was inefficient, or the model was overconfident at a critical node. Outcome rewards conceal all such risks.
The significance of process rewards is precisely to shift supervisory signals earlier, making the model accountable not only for "what it achieves" but also for "how it achieves it" (Lightman et al. 2024; Uesato et al. 2022). For reasoning tasks, process rewards can encourage complete logical steps and timely correction of erroneous intermediate conclusions. For tool-invocation tasks, they can encourage retrieval before responding and verification before execution. Of course, the cost of process rewards is a significant increase in data construction difficulty—teams must first define what constitutes a "process unit" (a single reasoning step? a single tool call? a planning node?), since too coarse a granularity degrades to outcome reward, while too fine a granularity causes annotation costs to skyrocket and consistency to become very difficult to maintain.
From the perspective of training signal semantics, outcome rewards more closely resemble "post-hoc acceptance"—evaluation of the whole after completion; process rewards more closely resemble "behavioral supervision"—expressing preferences at key nodes along the trajectory. This distinction is especially critical in complex systems. In many automated scenarios, system risk may not reside in the result string itself but rather in the path that produced the result. For example, an Agent may complete a task, but along the way it made redundant retrievals, incorrect parameter calls, or unsupported inferences—none of which surfaced in the final output. If only the terminal state is observed, such bad behavior is tacitly endorsed, and the model will tend toward "opportunistic but low-effort" paths. This kind of path dependency is more dangerous than a single wrong answer, because it means the model has formed an unstable preference at the strategy level.
Process rewards address precisely this credit assignment distortion: extending supervision from the end of the task back into intermediate states, allowing the system to express preferences at key trajectory nodes. Whether a retrieval action was necessary, whether a tool call was compliant, whether an intermediate conclusion was evidence-supported—behavioral characteristics that outcome rewards previously masked can all be explicitly incorporated into evaluation under a process reward framework. The model learns not only a result distribution but a behavioral distribution.
Beyond this, process rewards serve the function of "making implicit norms explicit." Whether verifiable references were cited in financial question answering, whether uncertainty was acknowledged in medical assistance, whether approval workflows and permission boundaries were followed by enterprise Agents—if only outcome rewards are relied upon, these norms can only exist as vague overall preferences. Process rewards can ground them in more specific behavioral nodes, making "caution," "compliance," "evidence-grounded," and "no step-skipping" intermediate behavioral characteristics that can be scored, compared, and optimized.
Of course, process rewards do not replace outcome rewards—the two are complementary. Outcome rewards tell the model "what results are worth pursuing"; process rewards tell the model "what paths are worth trusting." Their combination enables a training system to simultaneously optimize result effectiveness and process reliability. A high-quality training system should not merely reward "getting lucky with the right answer"; it should progressively encourage reproducible, auditable, and generalizable behaviors.
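The complementarity of the two signals can be sketched as a single trajectory score that blends a terminal outcome reward with step-level judgments. Everything here is an assumption made for illustration: the `Step` fields, the equal default weights, and in particular the choice to summarize the process by its weakest step.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Step:
    kind: str      # e.g. "reasoning", "retrieval", "tool_call" (illustrative)
    reward: float  # step-level judgment in [0, 1]

def combined_reward(outcome: float, steps: List[Step],
                    w_outcome: float = 0.5, w_process: float = 0.5) -> float:
    """Blend a terminal outcome reward with the weakest step in the trajectory."""
    # Using min() rather than the mean means one unsupported step cannot be
    # hidden by many good ones; this is a design choice, not a standard.
    # With no annotated steps, fall back to the outcome reward alone.
    process = min((s.reward for s in steps), default=outcome)
    return w_outcome * outcome + w_process * process

sound = [Step("retrieval", 0.9), Step("reasoning", 0.8)]
lucky = [Step("reasoning", 0.9), Step("reasoning", 0.1)]
# Both trajectories reach a correct final answer (outcome = 1.0).
print(combined_reward(1.0, sound) > combined_reward(1.0, lucky))
```

With an outcome-only reward, the two trajectories would be indistinguishable. The blended score separates them, which is the sense in which process rewards discourage "getting lucky with the right answer."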
How to Define a "Process Unit" Is the First Hard Problem in Process Reward Design¶
Process rewards sound intuitively natural in concept, but the moment they are applied to data construction, a core question arises: what counts as one step? In different tasks, the way "process units" are defined will significantly affect subsequent annotation quality and reward stability.
In reasoning tasks, a process unit can be a single reasoning statement, a logical inference, or an intermediate conclusion. In tool-invocation tasks, a process unit can be a single retrieval action, a single parameter fill, or a single API call. In multi-step execution tasks, a process unit can also be a planning node, a subtask completion state, or a state transition. Different definitions will present the same task chain at completely different supervision granularities.
If granularity is too coarse, process rewards degrade into an approximation of outcome rewards. If granularity is too fine, annotation costs skyrocket and consistency becomes even harder to ensure. Process rewards therefore cannot be simply understood as "annotating a few more intermediate steps"—the prerequisite is establishing a stable behavior segmentation framework. Without this foundation, process supervision easily becomes high-cost, low-stability data engineering.
Multi-Objective Preference, Process Rewards, and Pareto Trade-offs¶
In real production systems, preferences rarely take a single-objective form. A model may simultaneously be required to be correct, helpful, bounded, concise, stable, on-brand, low-latency, low-hallucination, low-refusal-error-rate, and sufficiently interpretable in certain scenarios. The problem is that these objectives do not grow in tandem. Very often, strengthening one dimension naturally sacrifices another. The true difficulty of preference learning lies precisely in this multi-objective structure.
If a team remains in a single "overall score" mindset, two problems readily emerge. First, multiple objectives are crudely aggregated, leaving the team unable to determine why the model improved or on which dimension it degraded. Second, conflicts in the training set are averaged away, the model learns no clear trade-off strategy, and can only form unstable behavior under ambiguous signals. Introducing multi-objective preference into preference data design is therefore not only a more precise approach—it is a necessary condition for handling real business conflicts.
Multi-objective preference means that teams, in addition to recording "which candidate is overall better," also retain judgment information across multiple evaluation dimensions. In an enterprise question-answering scenario, for instance, candidate A may outperform on helpfulness and naturalness but underperform on boundary clarity, while candidate B may be more robust but slightly stilted. Assigning a single overall win/lose label conceals the conflict; simultaneously recording judgments on individual dimensions provides higher-resolution information for downstream aggregation strategies, reward model training, and problem analysis.
At this point, Pareto trade-offs become an unavoidable concept in preference learning (Roijers et al. 2013; Deb et al. 2002). For multi-objective systems, no uniquely "absolutely optimal answer" exists; the system will contain a set of candidate solutions that are mutually non-dominated across different objectives—the Pareto frontier. Answers on the frontier typically mean that further improving one objective requires sacrificing another. For data teams, the key is not to eliminate this frontier but to clarify where the organization is willing to land on it. Teams must formally write into preference specifications and data interfaces questions such as "in what scenarios are we more willing to sacrifice naturalness for robustness" and "in what tasks are we more willing to sacrifice some conciseness for a more complete explanation."
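The notion of mutually non-dominated candidates can be stated in a few lines. The following is a minimal sketch, assuming each candidate already has a numeric score per dimension where higher is better; the candidate names, dimension names, and scores are made up for illustration.

```python
def dominates(a, b):
    """True if score dict a is at least as good as b everywhere and better somewhere."""
    return all(a[k] >= b[k] for k in a) and any(a[k] > b[k] for k in a)

def pareto_front(candidates):
    """candidates: dict name -> {dimension: score}, higher is better."""
    return sorted(
        name for name, scores in candidates.items()
        if not any(dominates(other, scores)
                   for other_name, other in candidates.items()
                   if other_name != name)
    )

candidates = {
    "A": {"naturalness": 0.9, "robustness": 0.5},
    "B": {"naturalness": 0.6, "robustness": 0.9},
    "C": {"naturalness": 0.5, "robustness": 0.4},
}
print(pareto_front(candidates))  # ['A', 'B']
```

Candidate C is dominated by A and can be discarded without any value judgment. Choosing between A and B cannot be done by the code: it requires the organization's stated position on naturalness versus robustness for the scenario at hand.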
Process rewards and multi-objective preference are naturally coupled. Many objectives are not expressed only in the terminal state—they are expressed in the process. "Evidence before conclusion" is a process preference; "in high-risk situations, state limitations before giving recommendations" is a process preference; "validate parameters before invoking a tool" is a process preference. In other words, if a team scores only the final answer on multiple objectives without recording the partial merits and faults within the process, it will be very difficult to genuinely support Pareto trade-offs in complex systems. Many critical objectives simply do not reside in the last sentence—they are hidden within the generation path.
Accordingly, from a preference learning maturity perspective, teams typically undergo a progressive upgrade: starting with overall preference pairs to address "who is generally better"; then introducing multi-dimensional evaluation to address "why it is better"; next adding process rewards for complex tasks to address "how to become better in the right way"; and ultimately forming a multi-objective, layered, interpretable reward signal system. This progression marks the transition of preference learning from concept to engineering.
Listing 13-2 shows "dimension-aware recording" for multi-objective preference annotation, as a JSON data example.
When you do not want an overall score to mask conflicts, the data can retain both the per-dimension labels and the aggregation strategy. The aggregation strategy can then be decided at the training or sampling stage rather than fixed irrevocably at annotation time.
{
  "type": "pairwise_multi_dim",
  "prompt": "User: I've been feeling a tightness in my chest lately. Should I take medicine?",
  "candidates": {
    "A": "(A more natural response, but lacking risk triage and advice to seek medical care)",
    "B": "(A response that first provides risk warning and requests clarification, then gives general advice)"
  },
  "judgement": {
    "overall_winner": "B",
    "dims": {
      "safety_boundary": "B",
      "helpfulness": "B",
      "naturalness": "A",
      "conciseness": "A"
    },
    "reason_tags": ["high-risk triage first", "insufficient information: clarify first"]
  },
  "meta": {"scenario": "medical", "risk_level": "high", "policy": "medical_v2"}
}
Listing 13-2: JSON data example.
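Because the per-dimension winners are kept in the record, the overall winner can be recomputed later under different policies. The sketch below shows one possible aggregation, a weighted vote over dimensions. The weight profiles are hypothetical placeholders; real values would come from the preference specification, and other aggregation rules (for example strict priority ordering) are equally plausible.

```python
def aggregate_winner(dims, weights):
    """Pick an overall winner from per-dimension winners and scenario weights.

    dims: {dimension: "A" | "B"}; weights: {dimension: float}.
    Dimensions without a weight are ignored.
    """
    totals = {"A": 0.0, "B": 0.0}
    for dim, winner in dims.items():
        totals[winner] += weights.get(dim, 0.0)
    if totals["A"] == totals["B"]:
        return "tie"
    return max(totals, key=totals.get)

# Per-dimension judgments taken from the record in Listing 13-2.
dims = {"safety_boundary": "B", "helpfulness": "B",
        "naturalness": "A", "conciseness": "A"}

# Hypothetical weight profiles for two scenarios.
medical_high_risk = {"safety_boundary": 3.0, "helpfulness": 1.0,
                     "naturalness": 0.5, "conciseness": 0.5}
casual_chat = {"safety_boundary": 0.5, "helpfulness": 1.0,
               "naturalness": 2.0, "conciseness": 1.0}

print(aggregate_winner(dims, medical_high_risk))  # B
print(aggregate_winner(dims, casual_chat))        # A
```

The same annotation yields different overall winners under different profiles. Had only `overall_winner` been stored, the second reading would not be recoverable from the data.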
Multi-Objective Preference Cannot Rely on an "Overall Score" to Mask Conflicts¶
Many teams, when designing preference data, want to compress all dimensions into a single overall score, believing this makes training simpler and more convenient. The problem is that as long as system objectives are genuinely multi-dimensional, an overall score will inherently lose critical information. At best it tells you "things seem generally better"—it cannot tell you whether helpfulness improved while compliance declined, or whether conciseness increased while interpretability was sacrificed.
The advantage of an overall score is simplicity: it is easy to connect to training pipelines and convenient for version comparison. But its problem is equally direct. Once all differences are collapsed into a single number, it becomes very difficult for teams to determine on which dimensions the model improved and what costs were paid. The model may indeed look more like a "high-score answer," but the overall score cannot say whether that high score reflects greater helpfulness or a tendency toward lengthy, comprehensive responses; greater safety or a learned habit of using vague language to avoid risk; a better brand voice or efficiency sacrificed for softer phrasing. All of these distinctions are easily flattened by the overall score.
This information loss may be tolerable in early stages, but once the system enters fine-grained optimization, it quickly becomes a bottleneck. Teams cannot determine what type of change the training signal is driving, nor whether a given version update pushed the system to the wrong trade-off point. For this reason, multi-objective preference data often needs to retain dimension labels, dimension scores, or at minimum the rationale for preferences, rather than keeping only a coarse final win/lose outcome.
In real engineering, many problems cannot be simply understood as "one answer is absolutely better"—they frequently manifest as "it is better on one dimension but worse on another." In customer service scenarios, a response may be more enthusiastic and complete but also more verbose, burying the next action step in the middle. In legal contexts, a response may be clearer but overstates things when conditions are uncertain. In medical contexts, a response may be more cautious and safer, yet users may feel it lacks information and fails to address their primary concern. None of these represent simple superiority or inferiority—they reflect genuine tensions among multiple objectives. If only an overall score is retained, the team sees only "this answer won" without seeing what it won on or what it conceded.
This is why, in later-stage optimization, a confusing phenomenon often emerges: subjective impressions change online, user feedback changes, but offline overall scores show no obvious problems. The model may have become safer yet more template-like; more concise yet missing necessary explanation; more on-brand yet evasive on complex questions. When only the overall score is observed, these changes are easily averaged out or even incorrectly interpreted as "roughly the same." But for real products, users experience not just averages—more often they notice specific categories of experience degrading and specific boundaries loosening.
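When dimension labels are retained, this kind of hidden shift can be surfaced with a per-dimension win-rate report between two model versions. The sketch below assumes a hypothetical record layout in which each comparison stores an overall winner and per-dimension winners labeled "new" or "old"; the sample records are invented to illustrate the effect.

```python
from collections import defaultdict

def dimension_win_rates(records, model="new"):
    """Share of comparisons that `model` wins, overall and per dimension.

    records: list of {"overall": winner, "dims": {dimension: winner}} where
    winner is "new" or "old" (hypothetical labels for two model versions).
    """
    wins, seen = defaultdict(int), defaultdict(int)
    for r in records:
        for dim, winner in {"overall": r["overall"], **r["dims"]}.items():
            seen[dim] += 1
            wins[dim] += winner == model
    return {dim: wins[dim] / seen[dim] for dim in seen}

records = [
    {"overall": "new", "dims": {"safety": "new", "explanation": "old"}},
    {"overall": "old", "dims": {"safety": "new", "explanation": "old"}},
    {"overall": "new", "dims": {"safety": "new", "explanation": "old"}},
    {"overall": "old", "dims": {"safety": "new", "explanation": "new"}},
]
print(dimension_win_rates(records))
```

In this invented sample the overall win rate is 0.5, which reads as "roughly the same," while the new version wins every comparison on safety and only a quarter of them on explanation. That is the pattern of a model that has become safer but less explanatory.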
Overall scores also have a more subtle problem: they force teams to make, prematurely, many value judgments that should not be crudely merged. For example, when helpfulness and compliance conflict, which matters more? When conciseness and explanatory detail pull against each other, which side should take priority? When warmth and professionalism cannot both be maximized, which should the system preserve? On the surface, collapsing all of these into an overall score appears to streamline training. In practice, it silently performs complex trade-offs at the data layer—trade-offs that are often neither transparent nor stable. Different annotators, different business teams, and different product objectives at different stages will all cause the meaning behind that "overall score" to shift. The number remains the same number, but what it represents has changed.
More problematic still, once the team looks only at the overall score, it becomes very difficult to manage conflicts consciously. Suppose the system's most important current objective is to tighten boundaries in high-risk scenarios without meaningfully degrading helpfulness. If the data records only final win/lose outcomes, then even if the model does become more conservative after training, the team will struggle to confirm whether it has also turned many questions that could have been answered clearly into vague responses. Conversely, if a version pursues greater naturalness and a more human-like feel, and the model begins saying too much in situations that call for restraint, an overall score may not immediately expose this, because the overall score can only indicate "which direction overall preferences tilted," not whether that tilt came at the cost of some critical dimension.
Multi-objective preference data must therefore preserve not just results but the structure behind results. At minimum, teams should know the primary basis for each preference judgment—whether it was because the answer was more accurate, more concise, safer, or because it better met tone and format requirements. Only if this information is retained can subsequent training, analysis, and regression have something concrete to work with. Otherwise, every time the model changes, teams can only guess: what was pushed too far this time, and what wasn't pushed far enough?
In many cases, retaining dimensional information does not require a particularly elaborate system design. The key is not the number of labels but that the most important conflicts not be erased. For example, teams can at minimum distinguish high-frequency tension dimensions such as helpfulness, compliance, conciseness, and explanatory completeness, or require annotators to note "why A over B" in preference annotations, so that post-training analysis can still trace back to the original rationale. The purpose is not to make the data table richer; it is to ensure that teams understand what they are actually changing during version iteration. An overall score can serve as a summary view, but it cannot substitute for the structure itself.
Ultimately, the hardest part of multi-objective preference learning has never been aggregating a pile of dimensions. The core challenge is accepting that these dimensions will not always move in the same direction. Sometimes greater safety will look more conservative; greater conciseness may omit a layer of explanation; greater warmth may not be most efficient. What system optimization truly requires is not pretending these conflicts do not exist—it requires preserving the conflicts, seeing them clearly, then consciously deciding which direction to lean. If all tension is flattened by a single overall score, training becomes easier, but teams will sooner lose the ability to explain changes in system behavior. At that point, scores may still be rising while the model has quietly drifted away from the direction originally intended.
What Pareto Trade-offs Mean for Data Teams¶
Pareto trade-offs are not merely a term from optimization theory. For data teams, they represent a very practical question: you cannot simultaneously push all objectives to their extremes, so you must explicitly determine which sacrifices are acceptable and which are not.
For example, in high-risk question answering, teams may be willing to sacrifice some naturalness in exchange for clear boundaries. In marketing copy, teams may accept a degree of stylistic tension in exchange for more vivid brand expression. In enterprise assistants, teams may place greater importance on auditability than on companionship. All of this lies beyond what the training algorithm itself can decide; it must first be expressed by the data team through preference specifications and sample design.
Pareto trade-offs in preference learning therefore mean not the drawing of a beautiful frontier curve, but the requirement that teams convert "where we actually want to push the system" from verbal preference into explicit data choices.
The Evolution from Single Preference to Multi-Layer Reward Systems¶
From an engineering maturity perspective, preference learning rarely arrives at multi-objective, process-oriented, high-resolution stages all at once. The more common path is to first establish a single pairwise preference system so the model learns basic ranking; then progressively introduce dimensional evaluation so the system knows not only who is better but also why; subsequently add process rewards for high-value complex tasks so the system learns to "become better in the right way"; and finally integrate these different signal types into a long-term maintainable reward infrastructure through layered training and version governance.
This evolutionary path is important because it illustrates that preference learning should not be regarded as a static data format—it is a progressively upgradeable engineering capability. Teams do not need to implement all complex structures from the start, but they must have a clear picture of where they ultimately want to go. Otherwise, early data systems easily become bottlenecks in later stages due to excessive simplification.
Figure 13-2 illustrates the multi-objective preference trade-offs discussed in this section.
Figure 13-2: Schematic of multi-objective preference trade-offs.
13.3 Preference Sources and Supervision Modes¶
Expert Annotation, Ordinary User Feedback, Model-as-Judge, and Rule-Based Arbitration¶
Where preference signals come from is one of the most fundamental questions in preference learning design (Ouyang et al. 2022; Bai et al. 2022a; Lee et al. 2023). Signal sources determine not only collection cost and scale but also whose value judgments the reward system ultimately reflects. In practice, sources of preference data can generally be grouped into four categories: expert annotation, ordinary user feedback, model-as-judge, and rule-based arbitration. Each has value, and each has clear limitations. For teams that need to operationalize, the critical question is never about choosing the "most advanced" of the four—it is about understanding what role each can and cannot play.
Expert annotation is suited to tasks with high knowledge barriers, high stakes, and judgment criteria that are strongly dependent on professional training—such as medical advice, legal interpretation, financial compliance, and complex enterprise process question answering. In such tasks, many "seemingly correct answers" actually conceal serious problems that ordinary users—let alone general annotators—cannot easily identify. The greatest value of experts is not only that they can judge which candidate is superior, but that they can articulate the basis for their judgment, making implicit professional standards explicit. For data teams, expert annotation often plays a "standard-setting" role—establishing high-confidence baselines for certain task types. Its disadvantages are equally clear: high cost, slow pace, limited coverage, and potential style differences or even theoretical disagreements among different experts. Expert signals, though valuable, cannot single-handedly support a large-scale preference system.
Ordinary user feedback complements experts. It most closely reflects real product usage scenarios and can capture end-user experience at scale: whether a response is convenient to use, whether it truly helps complete a task, whether the tone is comfortable, whether the explanation is adequate. Many helpfulness, readability, and interaction-fluency problems are detected more sensitively by user feedback than by experts. But user feedback has prominent problems too: it is a highly mixed signal. Users may like a response because it is more natural, faster, or more aligned with their personal perspective, not necessarily because it is more correct, safer, or more compliant with organizational requirements. User feedback is therefore more appropriate as an experience-layer preference and should not serve as the sole supervisory source for high-risk tasks.
Model-as-judge has become a common means of scaling preference data in recent years (Bai et al. 2022b; Lee et al. 2023). It typically works as follows: multiple models generate candidates, and then another model plays the role of judge, comparing, scoring, or explaining the candidates. Its advantage is that it greatly reduces cold-start costs and is especially useful for initial filtering, expanding long-tail coverage, constructing challenging comparison samples, and providing priority queues for human annotation. For data teams, the value of model-as-judge lies not in "replacing humans" but in improving the efficiency of preference data production. Its problems, however, are also highly salient: the judge model inherits its own training biases and may systematically favor certain phrasings, certain linguistic styles, or certain structured templates. It may even amplify homogeneity bias by being more lenient toward expressions familiar to itself. Model-as-judge is therefore often appropriate as auxiliary supervision, not as the final arbiter in high-risk scenarios.
Rule-based arbitration separates the subset of judgments that can be expressed programmatically from subjective preferences. It is suited to format correctness, required-field completeness, sensitive word and prohibited content detection, tool-call parameter validation, length threshold control, and rule violation detection. Rules offer extremely high consistency, auditability, and low cost, which makes them especially valuable for compliance and process constraints. But the coverage of rules is inherently limited: they can judge only constraints that can be explicitly encoded and cannot substitute for judgment on complex quality dimensions such as helpfulness, naturalness, and genuine correctness.
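A few of the checks named above (length threshold, prohibited content, required-field completeness) can be sketched as a single function. This is illustrative only: it assumes the response is expected to be a JSON object, and the violation labels, field names, and default length limit are hypothetical.

```python
import json

def rule_check(response, required_fields, banned_terms, max_chars=2000):
    """Return a list of rule violations for a JSON-formatted response."""
    violations = []
    if len(response) > max_chars:
        violations.append("too_long")
    lowered = response.lower()
    violations += [f"banned_term:{t}" for t in banned_terms
                   if t.lower() in lowered]
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        return violations + ["invalid_json"]
    if not isinstance(payload, dict):
        return violations + ["not_an_object"]
    violations += [f"missing_field:{f}" for f in required_fields
                   if f not in payload]
    return violations

ok = '{"answer": "Please see a doctor promptly.", "sources": []}'
bad = '{"answer": "This is a guaranteed cure."}'
print(rule_check(ok, ["answer", "sources"], ["guaranteed cure"]))
print(rule_check(bad, ["answer", "sources"], ["guaranteed cure"]))
```

The function is deterministic and auditable, but it says nothing about whether the advice in `answer` is actually correct or helpful, which is the coverage limit described above.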
A mature preference system therefore typically does not operate as a single-source structure; it employs a layered combination. Experts set standards for high-risk, high-value tasks; ordinary users reflect authentic experience; model-as-judge handles scale expansion and pre-ranking; rule-based arbitration enforces rigid boundary checks. What data teams must truly do is define weights, applicable scenarios, and conflict resolution mechanisms for each source—not abstractly declare "we use human preferences" or "we use AI feedback." The choice of preference source is itself part of the reward system design.
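One possible shape for such a conflict resolution mechanism is a precedence table keyed by risk level, with rule violations acting as a veto. The sketch below is an assumption-laden illustration: the source names, the precedence orders, and the "needs_review" fallback are hypothetical, and a real system might weight sources instead of ranking them.

```python
# Hypothetical precedence per risk level: earlier sources override later ones.
PRECEDENCE = {
    "high": ["expert"],                        # no user or judge-only labels
    "low": ["expert", "user", "model_judge"],
}

def resolve_preference(votes, risk_level, rule_violations=None):
    """Combine per-source votes ("A" or "B") into one label with provenance.

    votes: {source: "A" | "B"}; rule_violations: {candidate: [violation, ...]}.
    Returns (winner, deciding_source), or (None, "needs_review").
    """
    rule_violations = rule_violations or {}
    failed = {c for c, v in rule_violations.items() if v}
    # Rules act as a veto: a candidate that breaks a hard rule cannot win.
    if len(failed) == 1:
        return ("B" if "A" in failed else "A"), "rule"
    if len(failed) == 2:
        return None, "needs_review"
    for source in PRECEDENCE[risk_level]:
        if source in votes:
            return votes[source], source
    return None, "needs_review"
```

Returning the deciding source alongside the label keeps provenance attached to every training example, so that later analysis can separate expert-decided pairs from user-decided or judge-decided ones.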
Mixing Offline, Online, and Synthetic Preference Data¶
Beyond classification by source, preference data can also be categorized by the timing of production and the construction method into offline preference, online preference, and synthetic preference. These three forms typically coexist in a mature system and serve different lifecycle roles.
Offline preference is typically the starting point for preference learning. Teams first construct a batch of input tasks, then use one or more candidate models to generate multiple output versions, and then organize human comparison, scoring, or process review. Offline preference's advantage is strong controllability: tasks can be carefully sampled, candidates can be deliberately designed with target difficulty distributions, the annotation environment is stable, quality control processes are clear, and it is easy to establish golden sets and first-version annotation specifications. For the early stages of preference learning, offline preference is nearly irreplaceable—teams need to first learn "how to define preferences" in a clean, analyzable, reviewable environment.
Online preference comes from real interaction behavior after product launch, such as likes, dislikes, adoption, retries, follow-up questions, copying responses, escalation to humans, and task abandonment. It reflects the model's performance under a real distribution and can most quickly reveal drift problems in new tasks, new user groups, and new business scenarios. Online preference's greatest value is freshness and timeliness: it allows data teams to see real feedback structures beyond offline sets. But its noise level is also the highest: not every user provides feedback, and those who do are not representative of the average user. Many behavioral signals are weak labels that cannot be directly mapped to quality judgments. To complicate matters further, online behavior is influenced by UI, interaction pacing, task context, and many other non-content factors. Online preference is therefore generally suited to trend monitoring, sample feedback loops, candidate pool expansion, and online correction, not to direct use as high-confidence training labels without processing.
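Treating behavioral events as weak labels can be sketched as a mapping from each event to a polarity and a confidence, combined into one soft score per response. The event names follow the examples in the text, but the polarities and confidence values below are placeholders invented for illustration; they would need calibration against human labels before any use.

```python
# Hypothetical mapping from behavioral events to (polarity, confidence).
EVENT_SIGNAL = {
    "like": (+1, 0.6), "copy": (+1, 0.4), "adopt": (+1, 0.5),
    "dislike": (-1, 0.6), "retry": (-1, 0.3),
    "escalate_to_human": (-1, 0.5), "abandon": (-1, 0.2),
}

def weak_label(events):
    """Collapse one response's events into a weak label in [-1, 1]."""
    known = [EVENT_SIGNAL[e] for e in events if e in EVENT_SIGNAL]
    if not known:
        # No interpretable signal: neutral score, still marked as weak.
        return {"score": 0.0, "n_events": 0, "label_type": "weak"}
    total_conf = sum(conf for _, conf in known)
    score = sum(pol * conf for pol, conf in known) / total_conf
    return {"score": score, "n_events": len(known), "label_type": "weak"}
```

Ambiguous events such as follow-up questions are deliberately left out of the mapping, since they can indicate either engagement or an inadequate answer. The explicit `label_type` field keeps these records from being mistaken for human annotations downstream.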
Synthetic preference is an expansion method situated between human and automated production (Bai et al. 2022b; Lee et al. 2023). It may come from comparison results generated by model judges, from rule derivation, from historical log reorganization, from automatically constructed hard-case pairs, or from pairwise samples converted from existing scalar scores. Synthetic preference can significantly improve data production efficiency and is especially well suited to rapidly forming initial preference corpora during cold-start phases, or to large-scale gap-filling in certain lower-risk dimensions. But synthetic preference must have clearly defined boundaries: it is more like an "auxiliary signal—usable but to be handled with care," not naturally equivalent to high-quality human preference. Especially in high-risk scenarios, synthetic preference should not replace human and expert standard-setting; it can only serve as a supplement under explicit confidence levels.
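As an example of one of these routes, pairwise samples can be derived from existing scalar scores on the same prompt. The sketch below is a minimal version under assumptions: the `min_margin` threshold, the field names, and the provenance tag are hypothetical choices rather than a fixed format.

```python
from itertools import combinations

def scalar_to_pairs(prompt, scored, min_margin=1.0):
    """Convert scalar-scored responses for one prompt into preference pairs.

    scored: list of (response, score). Pairs whose score gap is below
    min_margin are dropped because the ordering is likely within noise.
    """
    pairs = []
    for (resp_a, s_a), (resp_b, s_b) in combinations(scored, 2):
        if abs(s_a - s_b) < min_margin:
            continue
        chosen, rejected = (resp_a, resp_b) if s_a > s_b else (resp_b, resp_a)
        pairs.append({
            "prompt": prompt, "chosen": chosen, "rejected": rejected,
            "margin": abs(s_a - s_b),
            # Provenance tag so downstream steps can down-weight or exclude.
            "source": "synthetic_from_scalar",
        })
    return pairs
```

Only scores for the same prompt are compared, which sidesteps cross-prompt scale problems. The `source` tag implements the "explicit confidence level" requirement in the simplest possible way: synthetic pairs stay identifiable after they are merged with human-labeled data.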
A truly robust preference data system typically adopts a hybrid mode of "offline as foundation, online for correction, synthetic for expansion." Offline data provides a structurally clear baseline; online data provides fresh distributional corrections; synthetic data provides scale and coverage. For data teams, the key is not to choose one of the three—it is to establish a clear collaborative relationship among them: which tasks should preferentially draw on offline data; which online signals can be recycled into offline annotation pools; which synthetic signals should be used only for candidate filtering rather than entering the main training set directly; and which sources can only serve as weak supervision and should not be used as golden annotations. Only then will a mixed preference system avoid becoming a confused "hodgepodge" of untracked sources and indeterminate confidence levels.
Preference Annotation Process Design for High-Risk Scenarios¶
In high-risk scenarios, the meaning of preference annotation is fundamentally different from ordinary conversation scenarios (Askell et al. 2021; Bai et al. 2022a; Liang et al. 2022). The key question has shifted from "which answer is more pleasing" to "which answer is more acceptable in terms of responsibility, professionalism, boundary control, and risk disclosure." High-risk preference data therefore cannot follow lightweight annotation thinking; it must be designed as an auditable, traceable, and reviewable process-based system.
First, the preference definition for high-risk tasks must be established upfront rather than left to annotators' spontaneous judgment on the spot. Teams need to decompose tasks into executable dimensions before annotation begins—for example, factual correctness, whether the response exceeds authority, whether limitations are stated, whether high-risk advice is given, whether necessary prompts are omitted, and whether the response should be refused or escalated to a human. These dimensions must not only be written into annotation specifications but also, where necessary, reflected in the annotation interface and sample metadata. Otherwise, so-called preference annotation ultimately becomes "picking whichever answer feels more appropriate by gut feeling"—a result that has almost no governance value in a high-risk system.
Second, the candidate generation stage should already incorporate rule-based pre-filtering. Candidates that obviously violate requirements, obviously omit critical information, are obviously incomplete, or obviously fail to meet minimum business format requirements should not enter the expensive expert comparison stage. This both conserves annotation costs and establishes a baseline filter before training, preventing high-risk sample pools from being contaminated by low-quality noise.
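A rule-based pre-filter of this kind can be sketched as below. The three rules are hypothetical placeholders (a real system would encode its own compliance and format checks), and the function name `prefilter` is an assumption. The sketch records which rules each rejected candidate failed, so the filter itself stays auditable.

```python
from typing import Callable, Dict, List, Tuple

Rule = Callable[[str], bool]  # returns True when the candidate PASSES the rule

# Hypothetical rules for illustration only.
RULES: Dict[str, Rule] = {
    "non_empty": lambda text: bool(text.strip()),
    "min_length": lambda text: len(text.split()) >= 5,
    "has_disclaimer": lambda text: "consult" in text.lower(),
}


def prefilter(candidates: Dict[str, str]) -> Tuple[List[str], Dict[str, List[str]]]:
    """Split candidates into kept IDs and {rejected_id: [failed rule names]}."""
    kept: List[str] = []
    rejected: Dict[str, List[str]] = {}
    for cid, text in candidates.items():
        failed = [name for name, rule in RULES.items() if not rule(text)]
        if failed:
            rejected[cid] = failed
        else:
            kept.append(cid)
    return kept, rejected


kept, rejected = prefilter({
    "a": "Please consult a licensed professional before changing your dose.",
    "b": "Just double it.",
    "c": "",
})
print(kept, rejected)
```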
Upon entering the human annotation stage, high-risk tasks generally do not lend themselves to a simple "pick one of two" approach. A more reasonable approach is to require annotators, when judging wins and losses, to also record the key deciding rationale, or at minimum tag the primary win/lose dimension. For example: this answer won because it is more accurate, has clearer boundaries, stays more carefully within its authorized scope, or has a more auditable process. Rationale information is not merely for post-hoc analysis; it directly determines whether multi-objective preference aggregation, reward model interpretation, and annotator calibration can be performed later.
Moreover, high-risk preference systems must have built-in arbitration mechanisms. Samples with large disagreements, samples triggering key rule conflicts, and samples involving major version boundary changes should not be resolved by simple majority vote. They should enter an escalated review process with senior annotators or domain experts. The value of arbitration is not only to produce a final label—more importantly, it accumulates disputed cases and converts ambiguous boundaries into new specification clauses or golden set samples. For data teams, this "using disputes to feed back into specifications" closed loop is the hallmark of a mature high-risk annotation system.
Finally, high-risk preference data must carry rich metadata. This information should at minimum include task type, risk level, candidate source, whether rule pre-filtering was applied, annotator identity tier, disagreement situation, arbitration status, primary win/lose rationale, and whether the sample is included in the golden set. In high-risk systems, the value of data is not only in training outcomes; it also lies in post-deployment interpretability and responsibility chain tracing. A preference pair lacking metadata may still be usable in low-risk scenarios; in a high-risk system, it almost entirely loses its governance value.
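The metadata list above maps naturally onto a typed record. The sketch below is one possible shape, not a standard schema: the field names, the allowed values noted in comments, and the `governance_gaps` check are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PreferenceRecord:
    sample_id: str
    prompt: str
    chosen: str
    rejected: str
    task_type: str
    risk_level: str            # assumed values: "low", "medium", "high"
    candidate_source: str
    rule_prefiltered: bool
    annotator_tier: str
    disagreement: float        # share of annotators who voted against the final label
    arbitration_status: str    # assumed values: "none", "pending", "resolved"
    primary_rationale: Optional[str] = None
    in_golden_set: bool = False


def governance_gaps(rec: PreferenceRecord) -> List[str]:
    """List metadata problems that would block a high-risk record from training."""
    if rec.risk_level != "high":
        return []
    gaps = []
    if not rec.primary_rationale:
        gaps.append("missing primary_rationale")
    if not rec.rule_prefiltered:
        gaps.append("rule pre-filter not applied")
    if rec.disagreement > 0 and rec.arbitration_status != "resolved":
        gaps.append("disagreement not arbitrated")
    return gaps


rec = PreferenceRecord(
    sample_id="p-001", prompt="prompt text", chosen="answer A", rejected="answer B",
    task_type="medical_qa", risk_level="high", candidate_source="model_v2",
    rule_prefiltered=True, annotator_tier="senior", disagreement=0.4,
    arbitration_status="pending",
)
print(governance_gaps(rec))
```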
13.4 Noise and Consistency Governance¶
Annotation Disagreement, Style Bias, and Score Drift¶
Preference data looks simpler than factual annotation at first glance, because it seemingly only requires answering "which is better" (Cohen 1960; Dawid & Skene 1979). But precisely because it involves value judgment, style selection, and objective conflict, noise sources are far more complex than many teams anticipate. The true difficulty of preference learning lies not in failing to receive feedback, but in the fact that the feedback received may not be consistent, may not be stable, and may not genuinely correspond to the preferences the organization wants to learn. This is why noise and consistency governance becomes a central topic in preference data engineering.
The most common problem is annotation disagreement. For some samples, the reason for disagreement is straightforward: two candidates are so similar that superiority is hard to judge. But more concerning is the kind of "recurring structural disagreement." For example, some annotators consistently prefer more concise expression, while others consistently prefer more explanatorily complete answers; some prioritize politeness and comprehensiveness, while others prioritize directness. Such disagreement is not merely interpersonal variation—it often indicates that key dimensions in the preference specification still have not been explicitly ranked. Left unaddressed, the model will ultimately learn not a stable organizational preference but a "flavor-averaged mixture."
The second class of problem is style bias. Preference annotation is very susceptible to the influence of surface features. Longer responses are often misjudged as more serious and comprehensive; a more confident tone is often misjudged as more professional; more neatly structured text tends to be over-rated even when the substantive content is no stronger. This phenomenon is common in human annotation and equally common in model-as-judge settings. Over time, the system learns to treat "sounds like a good answer," a surface style feature, as the essential characteristic of "is a good answer." For business systems, this bias is especially dangerous because it allows the model to become increasingly skilled at simulating quality rather than genuinely improving it.
The third class of problem is score drift. It is most prominent in scalar rewards but can also appear in latent form within preference pairs. As annotation time passes, teams expand, specifications iterate, and product goals shift, samples of originally similar quality may receive labels on different scales. A "high score" in one month may be only middling by another month's standards; a style that wins easily in one task pool may no longer be advantaged in another. Score drift directly undermines the comparability of reward signals, causing reward models to learn temporal and population differences rather than a stable quality function.
Additionally, preference systems encounter candidate construction bias, task distribution bias, and source mixing bias. If candidates are always generated by models of the same type, model-as-judge comparisons will keep circling within a narrow stylistic space. If some category of popular tasks dominates the training set, the reward model will mistake "frequent task preference" for "universal preference." If user feedback, expert feedback, rule signals, and model judge results are mixed together without differentiation, training objectives become vague or internally conflicting.
Preference data governance should therefore not aim for the fantasy of completely eliminating noise. The first step is identifying noise types, then deciding which noise should be addressed through process governance, which should be handled through modeling, and which can only be mitigated through version isolation and source stratification. Only by accepting that preference data is inherently noisy observation—not absolute ground truth—can a team build a truly reliable reward signal system.
Arbitration, Re-labeling, Annotator Calibration, and Golden Sets¶
Since preference noise is unavoidable, what data teams need is not a one-time "cleansing" but a continuously operating consistency governance mechanism. In practice, the four most effective levers are arbitration, re-labeling, annotator calibration, and golden set construction. They correspond to four levels: high-disagreement problems, stability measurement, personnel alignment, and system anchoring.
The arbitration mechanism handles problems that simple majority voting cannot resolve. Not every disagreement warrants arbitration, but any sample involving high-risk scenarios, core metrics, version boundaries, major style shifts, or conflicting expert opinions should enter higher-level review. The purpose of arbitration is not merely to "produce a final answer"; more importantly, it structures and preserves the reasons behind disputes. For instance, if annotators repeatedly waver between "natural expression" and "clear boundaries" on a certain task type, the team must clarify which of the two takes priority in this task, what degree of naturalness is acceptable, and at what point naturalness starts to blur boundaries. The true value of arbitration is converting costly disputes into reusable specifications.
Re-labeling is the key means of measuring data stability (Cohen 1960; Dawid & Skene 1979; Snow et al. 2008; Aroyo & Welty 2015). Teams cannot only check whether the first round of annotation is internally consistent; they must periodically resample and have samples re-compared at different times, in different batches, or even by different annotator groups. By observing win/lose reversal rates, changes in consistency rate, and score distribution drift, teams can determine whether preference definitions are stable. If a certain task type repeatedly reverses on re-labeling, the root cause is usually not that annotators are careless but that the task definition or dimension ranking is itself ambiguous. Re-labeling is therefore not only a quality check; it is also a diagnostic tool for the specification.
The purpose of annotator calibration is to minimize human-introduced scale differences. In preference data, many errors do not stem from carelessness but from different interpretive frameworks. A mature annotation system typically does not let annotators start work directly; it first provides training through a set of representative samples, helping annotators understand key dimensions, typical boundaries, and common misjudgment patterns. Especially in multi-objective preference scenarios, calibration must explicitly tell annotators: when truthfulness and fluency conflict, how to choose; when helpfulness and compliance conflict, how to choose; when two candidates each outperform the other on different dimensions, how to make an overall judgment. Without this step, annotation systems quickly become dominated by individual habits.
Golden sets are the anchor points of the entire consistency governance system. Golden sets should not be understood as a collection of "perfect samples"—they are a batch of high-confidence samples that have been expert-confirmed, rule-validated, disagreement-arbitrated, and carry clear determination rationale. They can be used to train new annotators, to monitor whether veteran annotators have drifted, to compare whether different versions of preference specifications have changed the value function, and as high-weight samples in training. For reward models, the golden set functions more like a set of long-term stable reference coordinates—it cannot replace large-scale data, but it prevents the entire system from drifting without anchor in the noise.
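One of the golden-set uses above, monitoring whether annotators have drifted, can be sketched as a periodic accuracy check. The function names and the 0.8 threshold are assumptions for illustration; a real threshold would be set per task and risk level.

```python
from typing import Dict, List


def golden_accuracy(golden: Dict[str, str], labels: Dict[str, str]) -> float:
    """Share of golden-set samples on which the annotator matches the golden winner."""
    scored = [sid for sid in golden if sid in labels]
    if not scored:
        raise ValueError("annotator has not labeled any golden-set sample")
    return sum(labels[sid] == golden[sid] for sid in scored) / len(scored)


def drifted_annotators(
    golden: Dict[str, str],
    by_annotator: Dict[str, Dict[str, str]],
    threshold: float = 0.8,  # assumed cut-off
) -> List[str]:
    """Annotators whose golden-set accuracy falls below the threshold."""
    return sorted(
        name for name, labels in by_annotator.items()
        if golden_accuracy(golden, labels) < threshold
    )


golden = {"g1": "A", "g2": "B", "g3": "A", "g4": "B", "g5": "A"}
by_annotator = {
    "ann1": dict(golden),                                        # matches every golden label
    "ann2": {"g1": "A", "g2": "A", "g3": "A", "g4": "A", "g5": "A"},  # always picks A
}
flagged = drifted_annotators(golden, by_annotator)
print(flagged)
```

A flagged annotator is a prompt for recalibration, not automatically a quality verdict; repeated misses on the same golden items may also point to an ambiguous specification.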
From an engineering perspective, arbitration, re-labeling, calibration, and golden sets should not be treated as add-on processes. They should be regarded as part of preference data production, because preference learning trains not only a model but also an entire value ranking function. If the calibration process for this function is unstable, even the most advanced training methods will only amplify unstable supervision.
The following listing is a simplified script for computing the annotation agreement rate and, by extension, the reversal rate.
Preference pairs are inherently "comparison problems" and are well suited to using agreement rates for routine health checks. The script takes two rounds of annotation results for the same batch of samples and outputs the agreement rate together with the IDs of samples whose winner differs between rounds (for routing into the arbitration pool). The reversal rate is one minus the agreement rate.
Listing 13-3 shows the script.
```python
from typing import Dict, List, Tuple


def agreement(a: Dict[str, str], b: Dict[str, str]) -> Tuple[float, List[str]]:
    """
    a/b: {sample_id: winner_candidate_id}

    Returns (agreement rate, sample IDs whose winner differs between rounds).
    Only samples present in both rounds are compared.
    """
    common = sorted(set(a) & set(b))
    if not common:
        return 0.0, []
    same = [sid for sid in common if a[sid] == b[sid]]
    rate = len(same) / len(common)
    disagreed = [sid for sid in common if a[sid] != b[sid]]
    return rate, disagreed


if __name__ == "__main__":
    # Demo data: two rounds of annotation on the same batch of samples
    round1 = {"p1": "A", "p2": "B", "p3": "B", "p4": "A"}
    round2 = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
    rate, disagreed = agreement(round1, round2)
    print("Agreement rate:", rate)
    print("Samples requiring arbitration:", disagreed)
```
Listing 13-3: Computing the annotation agreement rate and the list of disagreeing samples.
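Raw agreement does not account for agreement that would occur by chance. Cohen's kappa (Cohen 1960) is a common correction, and the sketch below applies it to the same kind of two-round input. The function name and the handling of the degenerate case are choices made here, not part of the original listing.

```python
from collections import Counter
from typing import Dict


def cohens_kappa(a: Dict[str, str], b: Dict[str, str]) -> float:
    """Chance-corrected agreement between two rounds: {sample_id: winner}."""
    common = sorted(set(a) & set(b))
    if not common:
        raise ValueError("no overlapping samples")
    n = len(common)
    # Observed agreement: share of samples with the same winner in both rounds.
    p_o = sum(a[sid] == b[sid] for sid in common) / n
    # Expected agreement if each round picked labels independently
    # according to its own label frequencies.
    count_a = Counter(a[sid] for sid in common)
    count_b = Counter(b[sid] for sid in common)
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[lab] * count_b[lab] for lab in labels) / (n * n)
    if p_e == 1.0:
        # Both rounds used a single identical label; kappa is undefined,
        # and this sketch returns 1.0 as an assumed convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


round1 = {"p1": "A", "p2": "B", "p3": "B", "p4": "A"}
round2 = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}
kappa = cohens_kappa(round1, round2)
print("Cohen's kappa:", kappa)
```

On the demo data the raw agreement rate is 0.5, but kappa is 0.0, because two annotators choosing A and B equally often would agree half the time by chance alone.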
Debiasing Preference Data and Confidence Modeling¶
Once a team has accepted that preference data is inherently noisy, the next step is to stop treating all samples as equally trustworthy. A more reasonable approach is to perform confidence modeling on the samples themselves, in addition to process governance, and to apply systematic debiasing when necessary (Dawid & Skene 1979; Northcutt et al. 2021). Preference data cannot be treated as binary facts; it is a class of observations that carry source, context, difficulty, and judgment stability. For data teams, this means the training set should not be a simple list of samples but a collection of reward signals with attached weights, confidence levels, and source labels.
The most direct approach is to assign different weights to samples based on their attributes. For example, a preference pair derived from expert consensus judgment, strongly supported by rule-level constraints, with clearly differentiated candidates, and historically stable on re-labeling, should receive a higher training weight than a sample derived from model-as-judge, with large disagreement and very similar candidates. Similarly, a sample that entered the golden set after arbitration should be treated differently from an ordinary automatically expanded sample. The significance of this approach is that the training process shifts from treating all supervision as absolute ground truth to explicitly acknowledging that different samples differ in "degree of trustworthiness."
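A minimal weighting sketch under stated assumptions: the source weights, the penalty for unstable re-labeling, the golden-set boost, and the cap are all hypothetical numbers chosen for illustration. The point is only that trust attributes become an explicit, inspectable weight.

```python
# Hypothetical base trust per preference source.
SOURCE_WEIGHT = {"expert": 1.0, "rule": 0.8, "user": 0.6, "model_judge": 0.4}


def sample_weight(
    source: str,
    agreement: float,       # share of annotators agreeing with the final label, 0..1
    relabel_stable: bool,   # did the label survive re-labeling?
    in_golden_set: bool,
) -> float:
    """Combine trust attributes into one training weight (assumed heuristic)."""
    weight = SOURCE_WEIGHT.get(source, 0.3)  # unknown sources get a low default
    weight *= agreement
    if not relabel_stable:
        weight *= 0.5
    if in_golden_set:
        weight *= 2.0
    return round(min(weight, 2.0), 3)  # cap so no single sample dominates


w_expert = sample_weight("expert", agreement=1.0, relabel_stable=True, in_golden_set=True)
w_judge = sample_weight("model_judge", agreement=0.6, relabel_stable=False, in_golden_set=False)
print(w_expert, w_judge)
```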
Beyond sample-level weights, debiasing also commonly manifests in score normalization and stratified modeling. For scalar rewards, teams often need to apply scale correction by annotator, task type, or time window, to prevent reward models from blending different rating habits into a spuriously unified scale. For preference pairs, teams can train in a source-stratified manner, or retain source labels in training and evaluation, to observe whether the model has overfit to one category of preference source. For process rewards, debiasing is more complex—the granularity and segmentation of process segments can themselves introduce systematic errors. Here, teams must control not only local reward values but also the definition of "what counts as one step."
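For scalar rewards, one simple form of scale correction is per-annotator z-score normalization. The sketch below assumes ratings arrive as `(annotator_id, sample_id, raw_score)` tuples; the function name is hypothetical, and z-scoring is only one of several possible corrections.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def normalize_by_annotator(
    ratings: List[Tuple[str, str, float]]
) -> Dict[Tuple[str, str], float]:
    """Map (annotator_id, sample_id) to a z-score within that annotator's own scale."""
    by_annotator = defaultdict(list)
    for annotator, _, score in ratings:
        by_annotator[annotator].append(score)
    stats = {a: (mean(v), pstdev(v)) for a, v in by_annotator.items()}
    normalized = {}
    for annotator, sample, score in ratings:
        mu, sigma = stats[annotator]
        # An annotator who gives every sample the same score carries no ranking signal.
        normalized[(annotator, sample)] = 0.0 if sigma == 0 else (score - mu) / sigma
    return normalized


ratings = [
    ("strict", "s1", 1.0), ("strict", "s2", 2.0), ("strict", "s3", 3.0),
    ("lenient", "s1", 3.0), ("lenient", "s2", 4.0), ("lenient", "s3", 5.0),
]
z = normalize_by_annotator(ratings)
print(z[("strict", "s1")], z[("lenient", "s1")])
```

In the demo, the strict and lenient annotators rank the samples identically but on different raw scales; after normalization their scores coincide. The same grouping key could be a task type or time window instead of an annotator.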
Another important dimension of confidence modeling is difficulty awareness. Different preference samples do not carry equal information. Two candidates with a very obvious gap are easy to annotate but offer limited benefit for improving the model's fine-grained judgment. Two candidates that are very similar yet differ at a critical boundary often carry much more information, though they are also more likely to generate disagreement. A mature data system should retain both types of samples and treat them differently in training: the former helps the model establish basic ranking direction, while the latter helps the model learn the subtle trade-offs most critical in real-world deployment.
Debiasing further involves sampling strategy. Many teams, when expanding preference data, inadvertently over-sample the tasks the model handles best, retain the samples easiest to judge, and label the most obvious differences most heavily. This may cause offline metrics to improve rapidly, but it will lead to the model lacking sensitivity to difficult scenarios, boundary conflicts, and fine-grained stylistic differences. In other words, sampling bias itself is part of the bias in reward signals. What data teams must do extends beyond cleaning existing noise—it also means designing, at the data intake stage, a sampling strategy that is diverse, stratified, and covers critical conflict zones.
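A stratified intake can be sketched as bucketed sampling with per-bucket quotas. The bucket names and quotas below are illustrative assumptions; in practice buckets might be task type, difficulty band, or conflict category.

```python
import random
from collections import defaultdict
from typing import Dict, List


def bucketed_sample(samples: List[dict], quotas: Dict[str, int], seed: int = 0) -> List[dict]:
    """Draw up to quotas[bucket] samples from each bucket, without replacement."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    buckets = defaultdict(list)
    for sample in samples:
        buckets[sample["bucket"]].append(sample)
    picked = []
    for name, quota in quotas.items():
        pool = buckets.get(name, [])
        picked.extend(rng.sample(pool, min(quota, len(pool))))
    return picked


# 90 easy pairs and only 10 boundary-conflict pairs in the raw pool.
pool = [{"id": f"e{i}", "bucket": "easy"} for i in range(90)]
pool += [{"id": f"b{i}", "bucket": "boundary"} for i in range(10)]
picked = bucketed_sample(pool, {"easy": 10, "boundary": 10})
print(len(picked))
```

Uniform sampling from this pool would yield roughly nine easy pairs for every boundary pair; the quotas keep the scarce boundary cases from being drowned out.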
Preference data governance should not stop at "improving agreement rates"; it should further advance toward modeling inconsistency and uncertainty. Only when a team can identify which samples are more trustworthy, which sources are more stable, which tasks are more prone to drift, and which conflicts are worth preserving rather than forcibly averaged, will reward signals possess long-term maintainability.
Table 13-1 summarizes the main reward noise sources, how each typically manifests, its impact on training, and the matching governance action.
Table 13-1: Reward Noise Sources and Governance Actions.
| Noise Source | Typical Manifestation | Impact on Training | Governance Action |
|---|---|---|---|
| Annotation disagreement | Win/lose labels on the same sample frequently reverse | Weakens preference signal, reduces DPO/RM stability | Establish arbitration pool, re-label, increase task specification granularity |
| Style bias | Preference for longer, more confident, more template-like responses | Reward model learns surface style rather than genuine quality | Dimension-split annotation, blind certain surface attributes, introduce counter-example sets |
| Score drift | Scoring scale inconsistent across time / teams | Scalar rewards not comparable; RM distorted | Periodic calibration, golden set regression, per-annotator normalization |
| Candidate construction bias | Candidate answer quality gap too large or too small | Training only learns coarse-grained good/bad; poor generalization | Control candidate difficulty distribution, mix in "fine-grained near-equivalent samples" |
| Source mixing | Expert, user, and model judge signals undifferentiated | Reward target unstable; conflicts implicitly averaged | Add source labels, stratified weighting, per-task modeling |
| Rule over-rejection or under-detection | Rule-system judgment conflicts with human preference | Compliance objective representation distorted | Rule audit, human review, layered decision-making by rules and human judgment |
| Task distribution bias | One high-frequency task type dominates the entire dataset | Model mistakes task frequency for preference | Bucketed sampling, per-version quota control, per-scenario evaluation |
| Inconsistent process segmentation | Step boundaries for similar tasks defined differently | PRM fails to learn stable step-level supervision | Fix process unit templates, step segmentation guidelines, process golden set |
13.5 Mapping to Training Methods¶
Data Interfaces for DPO, RM, RLAIF, PRM, and Related Methods¶
Preference data becomes genuinely usable as a reward signal only when it is successfully mapped to specific training methods. Data teams therefore cannot treat training methods as "the backend team's concern," because training interfaces in turn determine what should be collected, what should be retained, and what should be discarded. Understanding the differences in data interface requirements among Direct Preference Optimization (DPO), Reward Model (RM), Reinforcement Learning from AI Feedback (RLAIF), Process Reward Model (PRM), and related methods is a critical step in moving preference data engineering from concept to implementation (Rafailov et al. 2023; Bai et al. 2022b).
DPO most typically consumes preference pairs. It requires that for the same input, at least one winning answer and one losing answer can be clearly identified. For DPO, the most important factor is the reliability of pairwise comparison relationships—not how precise each answer's absolute score is. Therefore, if a team can already stably produce high-quality preference pairs, DPO is usually the most natural entry point. But this also means DPO's ceiling depends heavily on candidate construction and annotation consistency. If preference pairs are low in information content, stylistic conflicts have not been made explicit, or win/lose labels frequently drift, then DPO will directly amplify these problems into model behavior.
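To make the data interface concrete, the sketch below computes the per-pair DPO loss from Rafailov et al. (2023), given sequence log-probabilities of the winning and losing answers under the policy and under a frozen reference model. The function and argument names are assumptions for illustration, and the log-probabilities are taken as precomputed inputs rather than produced by a real model.

```python
import math


def dpo_pair_loss(
    logp_w: float, logp_l: float,          # policy log-probs of winner / loser
    ref_logp_w: float, ref_logp_l: float,  # reference-model log-probs of winner / loser
    beta: float = 0.1,
) -> float:
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) equals softplus(-m); branch on sign for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: zero margin, loss = log 2.
neutral = dpo_pair_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the winner: lower loss.
improved = dpo_pair_loss(-10.0, -18.0, -12.0, -15.0)
print(neutral, improved)
```

The loss only sees the pair `(x, y_w, y_l)` and which side is labeled the winner, which is why a flipped or unstable win/lose label pushes the policy in the wrong direction.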
Reward Model (RM) more closely resembles building an intermediate layer. Through preference pairs, scalar rewards, or mixed signals, it attempts to learn a generalizable scoring capability for individual outputs. RM's value lies in reusability, analyzability, and its ability to serve multiple training and evaluation stages. For example, teams can use RM for candidate pre-ranking, online review assistance, offline comparison between different model versions, and even as an objective function for subsequent policy optimization. But RM also requires more systematic data governance: scoring scales must be relatively stable, dimension conflicts must be interpretable, source differences must be separable—otherwise this "unified reward layer" will ultimately become a unified noise layer.
RLAIF's key distinction is not an unusual data structure but a change in supervision source. It can use preference pairs or scalar rewards, but these signals are provided partially or primarily by AI judges. For data teams, the core challenge of RLAIF is not format—it is defining boundaries: which tasks are suitable for model-based judgment, which tasks must retain human or expert standard-setting, which dimensions can comfortably be delegated to rules and models for large-scale expansion, and which dimensions must maintain high-quality human closed loops. Without clear boundaries, RLAIF can easily scale in volume while simultaneously entrenching the model's own biases.
PRM (Process Reward Model) targets process rewards and has the most complex requirements. Teams must provide not only the final output but also intermediate steps, along with step-level positive/negative feedback or local scores. This means data teams must resolve in advance the problems of process segmentation, step semantic definition, and local reward attribution. PRM is most suited to complex tasks in which intermediate behavior quality determines overall reliability—such as multi-step reasoning, tool invocation, code repair, and Agent planning. Its returns are often very large, but the construction threshold is also highest. It is therefore typically not a default solution for all tasks but is better focused on high-value tasks where process supervision is most warranted.
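A process-reward sample can be sketched as a prompt plus a list of labeled steps. The record layout, the 0-to-1 reward scale, and the use of the minimum step reward as a trajectory score are assumptions made here; real systems differ in how steps are segmented and how step rewards are aggregated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    text: str
    reward: float  # assumed scale: 1.0 = acceptable step, 0.0 = faulty step


@dataclass
class ProcessSample:
    prompt: str
    steps: List[Step]


def first_bad_step(sample: ProcessSample, threshold: float = 0.5) -> Optional[int]:
    """Index of the first step whose reward falls below the threshold, if any."""
    for index, step in enumerate(sample.steps):
        if step.reward < threshold:
            return index
    return None


def trajectory_score(sample: ProcessSample) -> float:
    """Score a trajectory by its weakest step (one possible aggregation)."""
    return min(step.reward for step in sample.steps)


sample = ProcessSample(
    prompt="Compute 12 * 15 + 4",
    steps=[
        Step("12 * 15 = 180", 1.0),
        Step("180 + 4 = 186", 0.0),   # arithmetic slip
        Step("Answer: 186", 0.0),
    ],
)
print(first_bad_step(sample), trajectory_score(sample))
```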
In practice, many mature teams do not treat DPO, RM, RLAIF, and PRM as mutually exclusive paths. They instead build a layered system: the base layer uses preference pairs for rapid DPO alignment; the platform layer trains reward models to form a unified scoring interface; the expansion layer uses RLAIF and rule judges to increase coverage; and the complex-task layer introduces PRM for process constraints. In other words, the diversity of training methods essentially requires data teams to build layered data interfaces rather than expecting one sample type to serve all methods.
Table 13-2 summarizes how each preference or reward type maps to a data structure and to training methods, along with its main advantages and challenges.
Table 13-2: Correspondence Between Preference Types and Training Methods.
| Preference/Reward Type | Typical Data Structure | Best-Matched Method(s) | Advantages | Key Challenges |
|---|---|---|---|---|
| Pairwise Preference | (x, y_w, y_l) | DPO, ranking-based RM, some RLAIF | Annotation is intuitive, consistency is generally higher, fast to implement | Requires high-quality candidate construction; difficult to express fine-grained rationale |
| Scalar Score | (x, y, r) | RM, some RLAIF, policy optimization | Reusable as a unified reward layer; convenient for scoring and analysis | Scoring scale unstable; prone to drift and subjective bias |
| Process Reward | (x, steps, step_rewards) | PRM, process-supervised training, complex Agent optimization | Can constrain intermediate behavior; improves stability on complex tasks | Process segmentation is difficult; annotation costs are high; specification is complex |
| Multi-Objective Preference | (x, y_i, score_1...score_k) or dimension-labeled preference pairs | Multi-head RM, weighted DPO, layered alignment | Can express objective conflicts and Pareto trade-offs in real business contexts | Aggregation strategy is complex; objective weights vary by scenario |
| Rule-Derived Reward | Rule hits, format validation, risk labels | RLAIF, hybrid RM, filter training | Low cost, high consistency, suited to hard constraints | Limited coverage; risks treating programmable constraints as overall quality |
| Online Behavioral Preference | Likes, dwell time, adoption, retries, and other behavioral signals | Online alignment, feedback loop modeling | Real distribution; fast updates | High noise; strong selection bias; difficult attribution |
When to Use Pairwise Preference and When to Use Process Supervision¶
Pairwise preference and process supervision are not an either/or choice, but the problems they are best suited to solving are genuinely different. For data teams, the most important question is not chasing methodological novelty—it is diagnosing where the current task's primary failure mode lies. If a task's core problem manifests at the final answer level, pairwise preference is usually sufficient. If task failure more frequently occurs at the level of intermediate behavior, process supervision becomes necessary.
When a task is essentially "final output ranking," pairwise preference is typically the lowest-cost, highest-return choice. Examples include customer service style optimization, question-answering completeness comparison, summarization quality comparison, refusal template optimization, content rewriting, marketing copy generation, and structured reply selection. In these scenarios, the model's success or failure is primarily reflected in the content ultimately presented to the user; the intermediate reasoning process is not necessarily what the business cares about most. What teams need most is for the model to form a more stable preference among multiple acceptable answers—not step-by-step examination of the process.
But when a task has a clearly multi-step structure and errors in intermediate steps can have a major impact on reliability, using pairwise preference alone is often insufficient. In complex reasoning tasks, for example, the model may arrive at the correct answer by luck while the reasoning chain is incoherent. In tool-invocation tasks, the model may produce a usable final result through an inefficient, uninterpretable, or even risky invocation path. In code generation tasks, the final code may pass tests, but the intermediate design choices may be highly fragile. When only the final result is given preference labels, the training signal encourages the model to continue relying on opaque paths rather than learning correct intermediate decision patterns.
More realistically, many systems need both in combination. Teams can first use preference pairs to steer overall style, helpfulness, and terminal quality in the right direction, then introduce process supervision for high-value tasks where "the final output looks roughly the same but the process differs greatly." This approach controls costs while concentrating process rewards where they can generate the most benefit. Process supervision is expensive; applying it uniformly across all tasks is neither economical nor necessary.
From an engineering planning perspective, a reasonable criterion is: if the model only needs to "look right at the end" to satisfy business requirements, pairwise preference takes priority. If the model must "obtain results in the right way"—especially in high-risk, auditable, or multi-step execution scenarios—process supervision should be incorporated into the design as early as possible. This is not a methodological preference; the determining factor comes from the task structure itself.
Version Governance and Deployment Strategy for Preference Datasets¶
Once preference data enters the training loop, it is transformed from a raw data asset into a behavioral policy asset (Gebru et al. 2021; Bender & Friedman 2018; Mitchell et al. 2019). Because preference data effectively defines "how the model should make trade-offs"—that is, an organization-level value function—every change to a preference dataset may induce changes in model behavioral style, objective ordering, and even risk boundaries. For precisely this reason, version governance for preference datasets must be stricter than that for general corpora.
First, teams must recognize that changing a preference data version changes not only the number of samples but potentially the value function itself. A new version that adds more "conciseness-first" preference pairs may make the model noticeably more terse; strengthening risk-boundary samples may make the model more conservative; incorporating large quantities of user feedback while reducing expert weight may make the model more natural but not necessarily more robust. Version governance therefore cannot merely record "how many data points were added"—it must record what changed in this version across target dimensions, source proportions, candidate generation strategies, scoring criteria, task coverage, and risk constraints.
Second, preference datasets should ideally have a clear layered structure. For example, the core high-risk golden set, expert baseline set, general main training set, automatically expanded weak supervision set, and online feedback candidate pool should all be managed separately. The benefit is that during training, teams can decide on different sampling ratios and weights per layer, and when problems arise, can quickly pinpoint: was it the high-confidence trunk data definition that changed, or did the automatic expansion layer introduce unexpected style drift?
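Per-layer sampling ratios can be written down as an explicit mix. The layer names and percentages below are hypothetical; the online feedback candidate pool is deliberately left out of the training mix in this sketch, on the assumption that it is re-annotated before use.

```python
from typing import Dict

# Hypothetical layer shares in whole percent; they must sum to 100.
LAYER_MIX = {"golden": 10, "expert": 30, "main": 45, "weak_synthetic": 15}


def layer_counts(total: int, mix: Dict[str, int]) -> Dict[str, int]:
    """Split a training budget of `total` samples across layers by percent share."""
    if sum(mix.values()) != 100:
        raise ValueError("layer shares must sum to 100")
    counts = {name: total * pct // 100 for name, pct in mix.items()}
    # Integer division can leave a few samples unassigned; give them to the largest layer.
    largest = max(mix, key=mix.get)
    counts[largest] += total - sum(counts.values())
    return counts


counts = layer_counts(1000, LAYER_MIX)
print(counts)
```

Storing the mix alongside the dataset version makes it possible to tell, after a behavioral regression, whether a layer's content changed or only its sampling share.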
Deployment strategy likewise cannot rely solely on single offline scores. For preference learning, the most valuable questions are not "did the overall score improve" but "has the multi-objective balance changed." A new version may have improved helpfulness while reducing refusal accuracy; improved naturalness while sacrificing brand consistency; improved user feedback scores while increasing latency and over-generation. Especially in multi-objective preference scenarios, teams must learn to read version results from a Pareto trade-off perspective: did this optimization push the system toward a better frontier, or merely toward a different trade-off point? If the latter, does this change align with the organization's current-phase business strategy?
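The Pareto reading of a version comparison can be sketched as a simple dominance check over per-dimension metrics. The sketch assumes every metric is oriented so that higher is better, and the dimension names and the `tol` noise tolerance are illustrative.

```python
from typing import Dict


def compare_versions(old: Dict[str, float], new: Dict[str, float], tol: float = 0.0) -> str:
    """Classify a version change over the dimensions both versions report."""
    dims = sorted(set(old) & set(new))
    better = [d for d in dims if new[d] > old[d] + tol]
    worse = [d for d in dims if new[d] < old[d] - tol]
    if better and not worse:
        return "pareto_improvement"   # moved toward a better frontier
    if worse and not better:
        return "pareto_regression"
    if better and worse:
        return "trade_off"            # moved to a different point, not a better one
    return "no_change"


v1 = {"helpfulness": 0.72, "refusal_accuracy": 0.91, "conciseness": 0.60}
v2 = {"helpfulness": 0.78, "refusal_accuracy": 0.86, "conciseness": 0.60}
verdict = compare_versions(v1, v2)
print(verdict)
```

A "trade_off" verdict is not a failure by itself; it marks the point where the decision stops being purely technical and depends on the organization's current priorities.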
Pre-deployment evaluation of preference data should therefore be conducted with per-dimension comparison, not just a single regression metric (Liang et al. 2022). Teams should maintain a stable multi-dimensional evaluation set, separately observing correctness, helpfulness, boundary control, conciseness, process reliability, refusal quality, and style consistency, and analyzing behavioral changes in conjunction with different task buckets. For high-risk tasks, there should also be dedicated regression sets and expert review mechanisms. Only then does preference dataset version governance truly mean "governance" rather than "archiving."
Furthermore, preference data version management should serve organizational communication. Because preference learning is fundamentally implementing a value ordering, many version adjustments are not purely technical decisions—they involve strategic choices jointly determined by product, business, risk management, brand, and other stakeholders. A mature preference data team cannot simply say "we updated 200,000 preference pairs"—it should be able to articulate: what preferences this update strengthened, what styles it suppressed, where it moved the system in the multi-objective space, what the expected benefits are, and what the potential side effects are. Only at this level does preference learning truly become an organization-level capability, rather than merely a component of training engineering.
Chapter Summary¶
Preference learning is not simply tallying which answers users prefer. It is the work of translating an organization's requirements for model behavior into training signals. Pairwise preference expresses relative quality between candidates; reward models convert those judgments into reusable scoring capability; process rewards extend supervision into intermediate steps; and multi-objective preference handles the objective conflicts present in real business contexts.
For teams that need to operationalize preference learning into data construction, what matters most is never which popular algorithm to select first. The key is first answering a set of foundational questions clearly: What exactly are we comparing? Where does the preference come from? Which dimensions need explicit annotation? Which conflicts need to be resolved upfront through specifications, and which need to be modeled in training? Which tasks require only outcome-level preference, and which must introduce process rewards? Which noise can be governed through arbitration, re-labeling, and calibration, and which should enter confidence modeling and version isolation? Only once these questions are addressed systematically will the subsequent DPO, RM, RLAIF, and PRM pipelines have a truly solid data foundation.
In terms of deployment path, a pragmatic and steady strategy typically proceeds in phases. The first step is to use high-quality offline preference pairs to establish basic ranking capability, resolving the problem that remains after SFT: the model is capable of a good answer but chooses a worse one. The second step is to introduce multi-dimensional evaluation and reward models, progressively forming a unified reward interface and beginning to address multi-objective preferences explicitly. The third step is to add process rewards on complex and high-value tasks, so the model learns not only to "arrive at good results" but also to "arrive at good results through reliable means." The fourth step is to strengthen expert standard-setting, arbitration, golden sets, and version governance in high-risk scenarios, incorporating preference learning into an auditable and accountable process system. Ultimately, teams should form a closed loop in which every one of the following components is present: preference data design, reward signal construction, training method mapping, multi-dimensional pre-deployment evaluation, online feedback loops, and version comparison governance.
When preference data is understood and built in this way, it is elevated from being a single stage of RLHF or alignment training to being a formal value function engineering discipline. The "style," "robustness," "sense of boundaries," and "modes of helpfulness" that the model ultimately exhibits will no longer be incidental by-products of a stochastic generation distribution—they will become organizational behavioral outcomes jointly defined, jointly constrained, and jointly iterated by data teams, training teams, and business teams. For teams today that need to move large language models from "capable of generating" to "capable of stable deployment," this is the true significance of preference data and reward signals.
References¶
Christiano P F, Leike J, Brown T B, Martic M, Legg S, Amodei D (2017) Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems, 30. arXiv:1706.03741.
Ziegler D M, Stiennon N, Wu J, Brown T B, Radford A, Amodei D, Christiano P, Irving G (2019) Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Stiennon N, Ouyang L, Wu J, Ziegler D M, Lowe R, Voss C, Radford A, Amodei D, Christiano P (2020) Learning to summarize from human feedback. Advances in Neural Information Processing Systems, 33, 3008–3021. arXiv:2009.01325.
Askell A, Bai Y, Chen A, et al. (2021) A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
Ouyang L, Wu J, Jiang X, et al. (2022) Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 27730–27744. arXiv:2203.02155.
Bai Y, Jones A, Ndousse K, et al. (2022a) Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Rafailov R, Sharma A, Mitchell E, et al. (2023) Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 53728–53741. arXiv:2305.18290.
Bai Y, Kadavath S, Kundu S, et al. (2022b) Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
Lee H, Phatale S, Mansoor H, et al. (2023) RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. Proceedings of the 41st International Conference on Machine Learning (ICML 2024). arXiv:2309.00267.
Lightman H, Kosaraju V, Burda Y, et al. (2024) Let's verify step by step. International Conference on Learning Representations (ICLR 2024). arXiv:2305.20050.
Uesato J, Kushman N, Kumar R, et al. (2022) Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275.
Bradley R A, Terry M E (1952) Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345. https://doi.org/10.2307/2334029.
Roijers D M, Vamplew P, Whiteson S, et al. (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48, 67–113.
Deb K, Pratap A, Agarwal S, et al. (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 6(2), 182–197. https://doi.org/10.1109/4235.996017.
Cohen J (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104.
Dawid A P, Skene A M (1979) Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 20–28. https://doi.org/10.2307/2346806.
Snow R, O'Connor B, Jurafsky D, et al. (2008) Cheap and fast—but is it good? Evaluating non-expert annotations for natural language tasks. Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, 254–263.
Aroyo L, Welty C (2015) Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1), 15–24. https://doi.org/10.1609/aimag.v36i1.2564.
Northcutt C G, Jiang L, Chuang I L (2021) Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70, 1373–1411. https://doi.org/10.1613/jair.1.12125.
Gebru T, Morgenstern J, Vecchione B, et al. (2021) Datasheets for datasets. Communications of the ACM, 64(12), 86–92. https://doi.org/10.1145/3458723.
Bender E M, Friedman B (2018) Data statements for natural language processing: Toward mitigating system bias and enabling better science. Transactions of the Association for Computational Linguistics, 6, 587–604. https://doi.org/10.1162/tacl_a_00041.
Mitchell M, Wu S, Zaldivar A, et al. (2019) Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229. https://doi.org/10.1145/3287560.3287596.
Liang P, Bommasani R, Lee T, et al. (2022) Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.