Create Your First Reward
ReinforceNow supports training models with reinforcement learning using custom reward functions.
We will train Qwen/Qwen3-8B with reinforcement learning on the OpenMathReasoning dataset, teaching the model to solve math reasoning problems.
If you want to learn more about reinforcement learning, check out our blog.
Step 1: Initialize the Template
To start, let's initialize the template for this tutorial. Create a new folder and run:
```bash
rnow init --template tutorial-reward
```

This generates three files:

- train.jsonl - Training data
- config.yml - Configuration file
- rewards.py - Reward function (incomplete)
Understanding train.jsonl
Each entry contains messages (the conversation), rewards (which reward functions to use), and metadata (including expected_answer).
Note: For reasoning problems, we tell the model to output its answer in \boxed{} so we can extract the final answer without picking up intermediate steps from the reasoning.
```json
{
  "messages": [
    {
      "role": "user",
      "content": "Solve the following math problem step by step. Show your reasoning, then provide your final answer inside \\boxed{}.\n\nWhat is the smallest number of planes needed to divide a cube into congruent tetrahedra?"
    }
  ],
  "rewards": ["accuracy"],
  "metadata": {
    "expected_answer": "3"
  }
}
```

For more information on train.jsonl and config.yml, see Reference.
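If you want to add your own problems, you can append rows in the same shape programmatically. The snippet below is a minimal sketch assuming only the three top-level keys shown above (messages, rewards, metadata); the extra problem is hypothetical, and the Reference covers the full schema.

```python
# Minimal sketch for appending extra rows to train.jsonl.
# Assumes the three top-level keys shown above; the example problem is hypothetical.
import json

PROMPT = (
    "Solve the following math problem step by step. Show your reasoning, "
    "then provide your final answer inside \\boxed{}.\n\n"
)

problems = [{"question": "What is 7 * 8?", "answer": "56"}]

with open("train.jsonl", "a") as f:
    for p in problems:
        row = {
            "messages": [{"role": "user", "content": PROMPT + p["question"]}],
            "rewards": ["accuracy"],
            "metadata": {"expected_answer": p["answer"]},
        }
        f.write(json.dumps(row) + "\n")
```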
Step 2: Define the Reward Function
We will create a reward function that extracts the model's answer from \boxed{} and compares it to the expected answer, returning 1.0 if the answer is correct and 0.0 otherwise.
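To make that contract concrete, here is a naive sketch using plain string matching. This is not the approach the tutorial uses: exact string comparison misses equivalent forms such as \frac{1}{2} vs 0.5 and breaks on nested braces, which is why we use math-verify below.

```python
# Naive sketch: regex-extract the boxed answer and compare strings.
# Shown only to illustrate the 1.0 / 0.0 contract; it fails on
# equivalent forms (\frac{1}{2} vs 0.5) and on nested braces.
import re

def naive_accuracy(expected: str, completion: str) -> float:
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == expected.strip() else 0.0
```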
Using math-verify
We use math-verify, a library for verifying mathematical expressions. It handles:

- LaTeX extraction - Extracts math from \boxed{}, $...$, \(...\), etc.
- Symbolic comparison - Converts LaTeX to SymPy and compares mathematically, so \frac{1}{2} equals 0.5 (see the quick check after this list)
- Nested braces - Correctly handles expressions like \boxed{\frac{4}{5}}
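As a quick sanity check of the symbolic comparison (a minimal sketch; it assumes math-verify is installed in your environment):

```python
from math_verify import parse, verify

# \frac{1}{2} and 0.5 parse to equivalent SymPy expressions
gold = parse("$\\frac{1}{2}$")
pred = parse("$0.5$")
print(verify(gold, pred))  # True
```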
Create the Function
Start by creating the function with the @reward decorator in rewards.py. Every reward function must be decorated with @reward.
```python
@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    ...
```

Understanding RewardArgs
A reward function receives:

- args - Contains metadata and variables from your training data
- messages - The full message history
You can access any field from your metadata object via args.metadata["field_name"].
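For example, a reward function can read the expected answer from metadata and the model's final response from the last message. The substring check below (and the name contains_answer) is only an illustration of how these fields are accessed, not the accuracy reward we are building:

```python
from rnow.core import reward, RewardArgs

@reward
def contains_answer(args: RewardArgs, messages: list) -> float:
    """Illustrative only: shows how metadata and messages are accessed."""
    expected = args.metadata["expected_answer"]  # from train.jsonl metadata
    completion = messages[-1]["content"]         # the model's final response
    return 1.0 if expected in completion else 0.0
```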
Parse and Verify
Use math-verify to parse both the expected answer and the model's response, then verify they match:
```python
from math_verify import LatexExtractionConfig, parse, verify

# Parse expected answer (from metadata)
gold = parse(args.metadata["expected_answer"])

# Parse model response, prioritizing \boxed{} content
pred = parse(
    messages[-1]["content"],
    extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
)

# Verify mathematical equivalence
return 1.0 if verify(gold, pred) else 0.0
```

The boxed_match_priority=0 tells math-verify to prioritize extracting content from \boxed{} tags.
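For instance, given a made-up completion that ends in \boxed{3}, the parsed prediction verifies against the expected answer (a small sketch of how these calls behave):

```python
from math_verify import LatexExtractionConfig, parse, verify

# Made-up completion text, for illustration only.
completion = "Three planes suffice, so the answer is \\boxed{3}."

gold = parse("3")
pred = parse(
    completion,
    extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
)
print(verify(gold, pred))  # True
```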
Complete Example
Here's the full rewards.py:
```python
from math_verify import LatexExtractionConfig, parse, verify
from rnow.core import reward, RewardArgs


@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Check if the boxed answer matches the expected answer."""
    gold = parse(args.metadata["expected_answer"])
    pred = parse(
        messages[-1]["content"],
        extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
    )
    if not pred:
        return 0.0
    return 1.0 if verify(gold, pred) else 0.0
```

Step 3: Run Your Experiment
We're all set!
```bash
rnow run
```