Models

Reference for all supported models on ReinforceNow.

Qwen Models (Text)

| Model ID | Parameters | Description |
|---|---|---|
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 235B (22B active) | Largest Qwen3 MoE, instruction-tuned |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B (3B active) | Medium Qwen3 MoE, instruction-tuned |
| Qwen/Qwen3-30B-A3B | 30B (3B active) | Medium Qwen3 MoE, hybrid (thinking optional) |
| Qwen/Qwen3-30B-A3B-Base | 30B (3B active) | Medium Qwen3 MoE, base model |
| Qwen/Qwen3-32B | 32B | Large dense Qwen3, hybrid (thinking optional) |
| Qwen/Qwen3-8B | 8B | Recommended for most tasks, hybrid (thinking optional) |
| Qwen/Qwen3-8B-Base | 8B | Base model (not instruction-tuned) |
| Qwen/Qwen3-4B-Instruct-2507 | 4B | Smallest Qwen3, instruction-tuned |

Qwen Models (Vision)

Vision-language models that can process images alongside text.

| Model ID | Parameters | Description |
|---|---|---|
| Qwen/Qwen3-VL-235B-A22B-Instruct | 235B (22B active) | Largest Qwen3 vision MoE |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 30B (3B active) | Medium Qwen3 vision MoE |
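To train on multimodal data, point the `model` field at one of the vision models above. A minimal sketch, assuming the rest of your `config.yml` stays the same:

```yaml
params:
  model: "Qwen/Qwen3-VL-30B-A3B-Instruct"  # vision-language MoE for image + text tasks
```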

Meta Llama Models

| Model ID | Parameters | Description |
|---|---|---|
| meta-llama/Llama-3.1-70B | 70B | Large Llama 3.1 |
| meta-llama/Llama-3.3-70B-Instruct | 70B | Latest Llama 3.3, instruction-tuned |
| meta-llama/Llama-3.1-8B | 8B | Medium Llama 3.1 |
| meta-llama/Llama-3.1-8B-Instruct | 8B | Medium Llama 3.1, instruction-tuned |
| meta-llama/Llama-3.2-3B | 3B | Small Llama 3.2 |
| meta-llama/Llama-3.2-1B | 1B | Smallest Llama, fast iteration |

DeepSeek Models

| Model ID | Parameters | Description |
|---|---|---|
| deepseek-ai/DeepSeek-V3.1 | MoE | Latest DeepSeek V3 |
| deepseek-ai/DeepSeek-V3.1-Base | MoE | DeepSeek V3 base model |

OpenAI Open Source Models (Reasoning)

Reasoning models that always use chain-of-thought before responding.

| Model ID | Parameters | Description |
|---|---|---|
| openai/gpt-oss-120b | 120B (MoE) | Large reasoning model |
| openai/gpt-oss-20b | 20B (MoE) | Medium reasoning model |
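For tasks that benefit from chain-of-thought, select a reasoning model the same way as any other model. A minimal sketch, assuming the rest of your `config.yml` is unchanged:

```yaml
params:
  model: "openai/gpt-oss-20b"  # reasoning model; always thinks before responding
```

Keep in mind that reasoning models generate chain-of-thought tokens before every response, so rollouts take longer than with instruction models of similar size.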

Moonshot Models (Reasoning)

| Model ID | Parameters | Description |
|---|---|---|
| moonshotai/Kimi-K2-Thinking | 1T+ (MoE) | Largest reasoning model, built for long chains of reasoning and tool use |

Choosing a Model

For Getting Started

Start with smaller models for faster iteration:

```yaml
params:
  model: "Qwen/Qwen3-8B"  # Good balance of speed and capability
```

Or for very fast experiments:

```yaml
params:
  model: "meta-llama/Llama-3.2-1B"  # Fastest iteration
```

For Production

Scale up once your reward functions are working:

```yaml
params:
  model: "Qwen/Qwen3-32B"  # Higher capability
```

Base vs Instruct Models

| Type | When to Use |
|---|---|
| Instruct (default) | Most RL tasks; the model already follows instructions |
| Base | When you want to train without prior instruction tuning shaping the model's behavior |
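If you do want an untuned starting point, swap in one of the `-Base` model IDs from the tables above. A minimal sketch:

```yaml
params:
  model: "Qwen/Qwen3-8B-Base"  # base model; expect to shape instruction-following via your rewards
```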

Model Types

| Type | Description | When to Use |
|---|---|---|
| Instruction | Fast inference, no chain-of-thought | Low-latency tasks |
| Hybrid | Optional thinking mode | Flexible; enable thinking for complex tasks |
| Reasoning | Always uses chain-of-thought | Complex reasoning, tool use, multi-step problems |
| Vision | Can process images alongside text | Multimodal tasks with images |

Model Capabilities

| Model Size | Best For |
|---|---|
| 1B-4B | Quick experiments, simple tasks |
| 8B | Most RL training, good balance |
| 30B-70B | Complex reasoning, production quality |
| 120B+ | State-of-the-art performance |
| Vision models | Multimodal tasks with images |
| Reasoning models | Complex multi-step reasoning and tool use |

Usage in config.yml

```yaml
project_id: ""
project_name: "My Project"
dataset_id: ""
dataset_type: "rl"
params:
  model: "Qwen/Qwen3-8B"  # Choose from supported models
  batch_size: 2
  num_epochs: 30
```