Create Your First Reward
ReinforceNow supports training models with reinforcement learning using custom reward functions.
We will train Qwen/Qwen3-8B with reinforcement learning on the OpenMathReasoning dataset, teaching the model to solve math reasoning problems.
If you want to learn more about reinforcement learning, check out our blog.
Step 1: Initialize the Template
To start, let's initialize the template for this tutorial. Create a new folder and run:
```bash
rnow init --template tutorial-reward
```

This generates three files:

- train.jsonl - Training data
- config.yml - Configuration file
- rewards.py - Reward function (incomplete)
Understanding train.jsonl
Each entry contains messages (the conversation), rewards (which reward functions to use), and metadata (including expected_answer).
Note: For reasoning problems, we tell the model to output its answer in \boxed{} so we can extract the final answer without picking up intermediate steps from the reasoning.
```json
{
  "messages": [
    {
      "role": "user",
      "content": "Solve the following math problem step by step. Show your reasoning, then provide your final answer inside \\boxed{}.\n\nWhat is the smallest number of planes needed to divide a cube into congruent tetrahedra?"
    }
  ],
  "rewards": ["accuracy"],
  "metadata": {
    "expected_answer": "3"
  }
}
```

For more information on train.jsonl and config.yml, see Reference.
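If you want to add your own problems, you can append rows in the same shape programmatically. The snippet below is a minimal sketch assuming only the three top-level keys shown above (messages, rewards, metadata); the extra problem is hypothetical, and the Reference covers the full schema.

```python
# Minimal sketch for appending extra rows to train.jsonl.
# Assumes the three top-level keys shown above; the example problem is hypothetical.
import json

PROMPT = (
    "Solve the following math problem step by step. Show your reasoning, "
    "then provide your final answer inside \\boxed{}.\n\n"
)

problems = [{"question": "What is 7 * 8?", "answer": "56"}]

with open("train.jsonl", "a") as f:
    for p in problems:
        row = {
            "messages": [{"role": "user", "content": PROMPT + p["question"]}],
            "rewards": ["accuracy"],
            "metadata": {"expected_answer": p["answer"]},
        }
        f.write(json.dumps(row) + "\n")
```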
Step 2: Define the Reward Function
We will create a reward function that extracts the model's answer from \boxed{} and compares it to the expected answer, returning 1.0 if the answer is correct and 0.0 otherwise.
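To make that contract concrete, here is a naive sketch using plain string matching. This is not the approach the tutorial uses: exact string comparison misses equivalent forms such as \frac{1}{2} vs 0.5 and breaks on nested braces, which is why we use math-verify below.

```python
# Naive sketch: regex-extract the boxed answer and compare strings.
# Shown only to illustrate the 1.0 / 0.0 contract; it fails on
# equivalent forms (\frac{1}{2} vs 0.5) and on nested braces.
import re

def naive_accuracy(expected: str, completion: str) -> float:
    match = re.search(r"\\boxed\{([^{}]*)\}", completion)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == expected.strip() else 0.0
```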
Using math-verify
We use math-verify, a library for verifying mathematical expressions. It handles:

- LaTeX extraction - Extracts math from \boxed{}, $...$, \(...\), etc.
- Symbolic comparison - Converts LaTeX to SymPy and compares mathematically, so \frac{1}{2} equals 0.5 (see the quick check after this list)
- Nested braces - Correctly handles expressions like \boxed{\frac{4}{5}}
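As a quick sanity check of the symbolic comparison (a minimal sketch; it assumes math-verify is installed in your environment):

```python
from math_verify import parse, verify

# \frac{1}{2} and 0.5 parse to equivalent SymPy expressions
gold = parse("$\\frac{1}{2}$")
pred = parse("$0.5$")
print(verify(gold, pred))  # True
```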
Create the Function
Start by creating the function with the @reward decorator in rewards.py. Every reward function must be decorated with @reward.
```python
@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    ...
```

Understanding RewardArgs
A reward function receives:

- args - Contains metadata and variables from your training data
- messages - The full message history
You can access any field from your metadata object via args.metadata["field_name"].
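For example, a reward function can read the expected answer from metadata and the model's final response from the last message. The substring check below (and the name contains_answer) is only an illustration of how these fields are accessed, not the accuracy reward we are building:

```python
from rnow.core import reward, RewardArgs

@reward
def contains_answer(args: RewardArgs, messages: list) -> float:
    """Illustrative only: shows how metadata and messages are accessed."""
    expected = args.metadata["expected_answer"]  # from train.jsonl metadata
    completion = messages[-1]["content"]         # the model's final response
    return 1.0 if expected in completion else 0.0
```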
Parse and Verify
Use math-verify to parse both the expected answer and the model's response, then verify they match:
```python
from math_verify import LatexExtractionConfig, parse, verify

# Parse expected answer (from metadata)
gold = parse(args.metadata["expected_answer"])

# Parse model response, prioritizing \boxed{} content
pred = parse(
    messages[-1]["content"],
    extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
)

# Verify mathematical equivalence
return 1.0 if verify(gold, pred) else 0.0
```

The boxed_match_priority=0 tells math-verify to prioritize extracting content from \boxed{} tags.
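For instance, given a made-up completion that ends in \boxed{3}, the parsed prediction verifies against the expected answer (a small sketch of how these calls behave):

```python
from math_verify import LatexExtractionConfig, parse, verify

# Made-up completion text, for illustration only.
completion = "Three planes suffice, so the answer is \\boxed{3}."

gold = parse("3")
pred = parse(
    completion,
    extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
)
print(verify(gold, pred))  # True
```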
Complete Example
Here's the full rewards.py:
```python
from math_verify import LatexExtractionConfig, parse, verify
from rnow.core import reward, RewardArgs


@reward
def accuracy(args: RewardArgs, messages: list) -> float:
    """Check if the boxed answer matches the expected answer."""
    gold = parse(args.metadata["expected_answer"])
    pred = parse(
        messages[-1]["content"],
        extraction_config=[LatexExtractionConfig(boxed_match_priority=0)]
    )
    if not pred:
        return 0.0
    return 1.0 if verify(gold, pred) else 0.0
```

Step 3: Run Your Experiment
We're all set!
```bash
rnow run
```