# Models
Reference for all supported models on ReinforceNow.
## Qwen Models (Text)
| Model ID | Parameters | Description |
|---|---|---|
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 235B (22B active) | Largest Qwen3 MoE, instruction-tuned |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B (3B active) | Medium Qwen3 MoE, instruction-tuned |
| Qwen/Qwen3-30B-A3B | 30B (3B active) | Medium Qwen3 MoE, hybrid (thinking optional) |
| Qwen/Qwen3-30B-A3B-Base | 30B (3B active) | Medium Qwen3 MoE, base model |
| Qwen/Qwen3-32B | 32B | Large dense Qwen3, hybrid (thinking optional) |
| Qwen/Qwen3-8B | 8B | Recommended for most tasks, hybrid (thinking optional) |
| Qwen/Qwen3-8B-Base | 8B | Base model (not instruction-tuned) |
| Qwen/Qwen3-4B-Instruct-2507 | 4B | Smallest Qwen3, instruction-tuned |
## Qwen Models (Vision)
Vision-language models that can process images alongside text.
| Model ID | Parameters | Description |
|---|---|---|
| Qwen/Qwen3-VL-235B-A22B-Instruct | 235B (22B active) | Largest Qwen3 vision MoE |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 30B (3B active) | Medium Qwen3 vision MoE |
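
To train on image inputs, select a vision model the same way as a text model in config.yml. A minimal sketch of the `params` block (dataset requirements for image data are not covered on this page):

```yaml
params:
  model: "Qwen/Qwen3-VL-30B-A3B-Instruct"  # vision-language model: processes images alongside text
```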
## Meta Llama Models
| Model ID | Parameters | Description |
|---|---|---|
| meta-llama/Llama-3.1-70B | 70B | Large Llama 3.1 |
| meta-llama/Llama-3.3-70B-Instruct | 70B | Latest Llama 3.3, instruction-tuned |
| meta-llama/Llama-3.1-8B | 8B | Medium Llama 3.1 |
| meta-llama/Llama-3.1-8B-Instruct | 8B | Medium Llama 3.1, instruction-tuned |
| meta-llama/Llama-3.2-3B | 3B | Small Llama 3.2 |
| meta-llama/Llama-3.2-1B | 1B | Smallest Llama, fast iteration |
## DeepSeek Models
| Model ID | Parameters | Description |
|---|---|---|
| deepseek-ai/DeepSeek-V3.1 | 671B (37B active) | Latest DeepSeek V3 |
| deepseek-ai/DeepSeek-V3.1-Base | 671B (37B active) | DeepSeek V3 base model |
## OpenAI Open Source Models (Reasoning)
Reasoning models that always use chain-of-thought before responding.
| Model ID | Parameters | Description |
|---|---|---|
| openai/gpt-oss-120b | 120B (MoE) | Large reasoning model |
| openai/gpt-oss-20b | 20B (MoE) | Medium reasoning model |
## Moonshot Models (Reasoning)
| Model ID | Parameters | Description |
|---|---|---|
| moonshotai/Kimi-K2-Thinking | 1T+ (MoE) | Largest reasoning model, built for long chains of reasoning and tool use |
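
Reasoning models need no extra configuration to enable chain-of-thought; as noted above, they always reason before responding. A minimal sketch using the same `params.model` field shown throughout this page:

```yaml
params:
  model: "openai/gpt-oss-20b"  # reasoning model: chain-of-thought is always on
```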
## Choosing a Model
### For Getting Started
Start with smaller models for faster iteration:
```yaml
params:
  model: "Qwen/Qwen3-8B"  # Good balance of speed and capability
```

Or for very fast experiments:

```yaml
params:
  model: "meta-llama/Llama-3.2-1B"  # Fastest iteration
```

### For Production
Scale up once your reward functions are working:
```yaml
params:
  model: "Qwen/Qwen3-32B"  # Higher capability
```

### Base vs Instruct Models
| Type | When to Use |
|---|---|
| Instruct (default) | Most RL tasks - already follows instructions |
| Base | When you want to train from scratch without prior instruction tuning |
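
To start from an untuned checkpoint, point `params.model` at a `-Base` variant; a minimal sketch:

```yaml
params:
  model: "Qwen/Qwen3-8B-Base"  # base model: no prior instruction tuning
```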
## Model Types
| Type | Description | When to Use |
|---|---|---|
| Instruction | Fast inference, no chain-of-thought | Low latency tasks |
| Hybrid | Optional thinking mode | Flexible - enable thinking for complex tasks |
| Reasoning | Always uses chain-of-thought | Complex reasoning, tool use, multi-step problems |
| Vision | Can process images alongside text | Multimodal tasks with images |
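
For hybrid models, how thinking mode is toggled is specific to ReinforceNow and not documented on this page; the `enable_thinking` key below is hypothetical and shown only to illustrate the idea:

```yaml
params:
  model: "Qwen/Qwen3-8B"   # hybrid: thinking optional
  enable_thinking: true    # hypothetical flag, not a confirmed setting; check the config reference
```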
## Model Capabilities
| Model Size / Type | Best For |
|---|---|
| 1B-4B | Quick experiments, simple tasks |
| 8B | Most RL training, good balance |
| 30B-70B | Complex reasoning, production quality |
| 120B+ | State-of-the-art performance |
| Vision models | Multimodal tasks with images |
| Reasoning models | Complex multi-step reasoning and tool use |
## Usage in config.yml
```yaml
project_id: ""
project_name: "My Project"
dataset_id: ""
dataset_type: "rl"
params:
  model: "Qwen/Qwen3-8B"  # Choose from supported models
  batch_size: 2
  num_epochs: 30
```