ReinforceNow

config.yml

Reference for config.yml project configuration.

We currently support two types of experiments: Reinforcement Learning and Supervised Fine-Tuning.

Templates such as sft, rl-single, and rl-multi are also available through the CLI. To initialize a project using one of these templates, run:

rnow init --template <name>
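
For example, to scaffold a single-turn RL project from the rl-single template:

rnow init --template rl-single

The generated config.yml should follow the same structure as the full examples below.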

We provide full example configuration files for both experiment types:

Full Example (RL)

project_id: ""
project_name: "My RL Project"
dataset_id: ""
dataset_type: rl
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  group_size: 16
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
  qlora_alpha: 64
algorithm:
  loss_fn: ppo
  adv_estimator: grpo
  kl_penalty_coef: 0.01
rollout:
  max_turns: 1
  max_tokens: 2048
  termination_policy: last_tool
trainer:
  num_epochs: 30
  learning_rate: 0.0001
  save_step: 20
  eval_step: 0

Full Example (SFT)

project_id: ""
project_name: "My SFT Project"
dataset_id: ""
dataset_type: sft
organization_id: ""
data:
  train_file: train.jsonl
  batch_size: 2
  val_split: 0.2
model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32
trainer:
  num_epochs: 40
  learning_rate: 0.0001
  save_step: 2
  eval_step: 3

Project Fields

project_id: Unique project identifier (auto-generated). Required.

project_name: Human-readable project name. Required.

dataset_id: Unique dataset identifier (auto-generated). Required.

dataset_type: Type of experiment. Must be rl or sft. Required.

organization_id: Organization ID. Uses active org if not set.

data

train_file: Path to training data file. Default: train.jsonl

batch_size: Training batch size. Max: 32. Required.

group_size: Number of parallel rollouts per prompt. Default: 4, Max: 64. RL only.

Note: batch_size * group_size must be <= 2048 (sandbox concurrency limit).
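
For example, the data block below sits exactly at that limit (32 * 64 = 2048); raising either value further would exceed it:

data:
  train_file: train.jsonl
  batch_size: 32
  group_size: 64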

val_split: Validation split ratio (0-1). SFT only.

model

path: Model to train. Must be from the supported models list, or a checkpoint ID from a previous training run to continue training. Required.

Multi-model training: You can provide a list of models to train them all with the same configuration. The CLI will submit separate runs for each model:

path:
  - Qwen/Qwen3-4B-Instruct-2507
  - Qwen/Qwen3-8B
  - Qwen/Qwen3-30B-A3B

qlora_rank: QLoRA rank for efficient fine-tuning. Default: 32

Maximum LoRA rank by model:

  • 32: openai/gpt-oss-120b, openai/gpt-oss-20b
  • 64: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B, deepseek-ai/DeepSeek-V3.1
  • 128: Qwen/Qwen3-32B, Qwen/Qwen3-8B*, Qwen/Qwen3-4B-Instruct-2507, all meta-llama/* models

qlora_alpha: QLoRA alpha scaling factor. Default: qlora_rank * 2
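
For example, with a rank of 32 the alpha defaults to 64, so setting it explicitly is only needed to override that scaling:

model:
  path: Qwen/Qwen3-8B
  qlora_rank: 32      # alpha defaults to 64 (qlora_rank * 2)
  qlora_alpha: 64     # explicit value; omit to accept the default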

Supported Models: Qwen/Qwen3-235B-A22B-Instruct-2507, Qwen/Qwen3-30B-A3B-Instruct-2507, Qwen/Qwen3-30B-A3B, Qwen/Qwen3-30B-A3B-Base, Qwen/Qwen3-32B, Qwen/Qwen3-8B, Qwen/Qwen3-8B-Base, Qwen/Qwen3-4B-Instruct-2507, meta-llama/Llama-3.1-70B, meta-llama/Llama-3.3-70B-Instruct, meta-llama/Llama-3.1-8B, meta-llama/Llama-3.1-8B-Instruct, meta-llama/Llama-3.2-3B, meta-llama/Llama-3.2-1B, deepseek-ai/DeepSeek-V3.1, deepseek-ai/DeepSeek-V3.1-Base, openai/gpt-oss-120b, openai/gpt-oss-20b

You can also use a checkpoint ID from a previous training run to continue training from that checkpoint.
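
To continue training, put the checkpoint ID in model.path instead of a model name. The value below is a placeholder; use the checkpoint ID reported by your earlier run:

model:
  path: <checkpoint-id>   # placeholder: replace with a real checkpoint ID
  qlora_rank: 32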

algorithm

RL only.

loss_fn: Loss function. Default: ppo

Possible values:

  • ppo: Proximal Policy Optimization
  • importance_sampling: Importance Sampling

adv_estimator: Advantage estimator. Default: grpo

Possible values:

  • grpo: Group Relative Policy Optimization (recommended)
  • gae: Generalized Advantage Estimation
  • reinforce: Classic REINFORCE algorithm

kl_penalty_coef: KL divergence penalty coefficient. Default: 0.01

rollout

RL only.

max_turns: Maximum conversation turns. Default: 1

max_tokens: Maximum tokens per model response. Default: 2048

Important: This value must fit within the model's context window along with your prompt. The CLI validates that max_prompt_tokens + max_tokens < context_window.

For reasoning models: Models that use <think> tags consume many tokens for chain-of-thought reasoning. Set max_tokens to 8192 or higher (e.g., 16384) to allow sufficient space for reasoning.
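
As a sketch, a rollout block for a reasoning model might raise the token budget like this (assuming the model's context window fits your prompt plus this value):

rollout:
  max_turns: 1
  max_tokens: 16384   # extra headroom for <think> chain-of-thought tokens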

termination_policy: When to end an episode. Default: last_tool

Possible values:

  • last_tool: Episode ends when model responds without a tool call
  • max_turns: Episode always runs for exactly max_turns turns

Notes:

  • A turn is one model generation. Each assistant response (with any tool calls and results) counts as one turn.
  • Pre-existing messages in train.jsonl do not count toward max_turns.

thinking_mode: Chain-of-thought mode. Optional.

Possible values:

  • disabled: Explicitly disable reasoning
  • easy: Light reasoning
  • medium: Moderate reasoning
  • hard: Deep reasoning

When not specified, reasoning is automatically enabled for supported models. Set to disabled to turn it off.
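
For example, to explicitly disable reasoning (thinking_mode sits under rollout like the other fields in this section):

rollout:
  thinking_mode: disabled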

max_tool_response_chars: Maximum characters for tool responses. Default: 4000

Longer responses are truncated. Set to null to disable truncation.
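
For example, to disable truncation entirely:

rollout:
  max_tool_response_chars: null   # tool responses are passed through untruncated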

mcp_url: MCP server URL(s). Optional.

Connect to MCP servers to provide tools to your agent. Can be a single URL or an array of URLs:
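
A sketch of both forms, with placeholder URLs standing in for your own MCP servers:

rollout:
  # a single MCP server
  mcp_url: https://example.com/mcp

or, for several servers:

rollout:
  mcp_url:
    - https://example.com/mcp
    - https://tools.example.com/mcp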

Note: You can use both mcp_url and env.py together—tools from both sources will be available to your agent.

See the MCP Tutorial for setup instructions.

trainer

num_epochs: Number of training epochs. Required.

learning_rate: Learning rate. Default: 0.0001

save_step: Save checkpoint every N steps. 0 = end of epoch only. Default: 0

eval_step: Evaluate every N steps. 0 = end of epoch only. Default: 0