Models

Reference for all supported models on ReinforceNow.

Qwen Models (Text)

| Model ID | Parameters | Description |
|---|---|---|
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 235B (22B active) | Largest Qwen3 MoE, instruction-tuned |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 30B (3B active) | Medium Qwen3 MoE, instruction-tuned |
| Qwen/Qwen3-30B-A3B | 30B (3B active) | Medium Qwen3 MoE, hybrid (thinking optional) |
| Qwen/Qwen3-30B-A3B-Base | 30B (3B active) | Medium Qwen3 MoE, base model |
| Qwen/Qwen3-32B | 32B | Large dense Qwen3, hybrid (thinking optional) |
| Qwen/Qwen3-8B | 8B | Recommended for most tasks, hybrid (thinking optional) |
| Qwen/Qwen3-8B-Base | 8B | Base model (not instruction-tuned) |
| Qwen/Qwen3-4B-Instruct-2507 | 4B | Smallest Qwen3, instruction-tuned |

Qwen Models (Vision)

Vision-language models that can process images alongside text.

| Model ID | Parameters | Description |
|---|---|---|
| Qwen/Qwen3-VL-235B-A22B-Instruct | 235B (22B active) | Largest Qwen3 vision MoE |
| Qwen/Qwen3-VL-30B-A3B-Instruct | 30B (3B active) | Medium Qwen3 vision MoE |
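To train on multimodal data, point the `model` field at one of the vision models above. A minimal sketch, assuming the rest of your `config.yml` stays the same:

```yaml
params:
  model: "Qwen/Qwen3-VL-30B-A3B-Instruct"  # vision-language MoE for image + text tasks
```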

Meta Llama Models

| Model ID | Parameters | Description |
|---|---|---|
| meta-llama/Llama-3.1-70B | 70B | Large Llama 3.1 |
| meta-llama/Llama-3.3-70B-Instruct | 70B | Latest Llama 3.3, instruction-tuned |
| meta-llama/Llama-3.1-8B | 8B | Medium Llama 3.1 |
| meta-llama/Llama-3.1-8B-Instruct | 8B | Medium Llama 3.1, instruction-tuned |
| meta-llama/Llama-3.2-3B | 3B | Small Llama 3.2 |
| meta-llama/Llama-3.2-1B | 1B | Smallest Llama, fast iteration |

DeepSeek Models

| Model ID | Parameters | Description |
|---|---|---|
| deepseek-ai/DeepSeek-V3.1 | MoE | Latest DeepSeek V3 |
| deepseek-ai/DeepSeek-V3.1-Base | MoE | DeepSeek V3 base model |

OpenAI Open Source Models (Reasoning)

Reasoning models that always use chain-of-thought before responding.

| Model ID | Parameters | Description |
|---|---|---|
| openai/gpt-oss-120b | 120B (MoE) | Large reasoning model |
| openai/gpt-oss-20b | 20B (MoE) | Medium reasoning model |
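For tasks that benefit from chain-of-thought, select a reasoning model the same way as any other model. A minimal sketch, assuming the rest of your `config.yml` is unchanged:

```yaml
params:
  model: "openai/gpt-oss-20b"  # reasoning model; always thinks before responding
```

Keep in mind that reasoning models generate chain-of-thought tokens before every response, so rollouts take longer than with instruction models of similar size.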

Moonshot Models (Reasoning)

| Model ID | Parameters | Description |
|---|---|---|
| moonshotai/Kimi-K2-Thinking | 1T+ (MoE) | Largest reasoning model, built for long chains of reasoning and tool use |

Choosing a Model

For Getting Started

Start with smaller models for faster iteration:

```yaml
params:
  model: "Qwen/Qwen3-8B"  # Good balance of speed and capability
```

Or for very fast experiments:

```yaml
params:
  model: "meta-llama/Llama-3.2-1B"  # Fastest iteration
```

For Production

Scale up once your reward functions are working:

```yaml
params:
  model: "Qwen/Qwen3-32B"  # Higher capability
```

Base vs Instruct Models

| Type | When to Use |
|---|---|
| Instruct (default) | Most RL tasks; the model already follows instructions |
| Base | When you want to train without prior instruction tuning shaping the model's behavior |
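If you do want an untuned starting point, swap in one of the `-Base` model IDs from the tables above. A minimal sketch:

```yaml
params:
  model: "Qwen/Qwen3-8B-Base"  # base model; expect to shape instruction-following via your rewards
```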

Model Types

| Type | Description | When to Use |
|---|---|---|
| Instruction | Fast inference, no chain-of-thought | Low-latency tasks |
| Hybrid | Optional thinking mode | Flexible; enable thinking for complex tasks |
| Reasoning | Always uses chain-of-thought | Complex reasoning, tool use, multi-step problems |
| Vision | Can process images alongside text | Multimodal tasks with images |

Model Capabilities

| Model Size | Best For |
|---|---|
| 1B-4B | Quick experiments, simple tasks |
| 8B | Most RL training, good balance |
| 30B-70B | Complex reasoning, production quality |
| 120B+ | State-of-the-art performance |
| Vision models | Multimodal tasks with images |
| Reasoning models | Complex multi-step reasoning and tool use |

Usage in config.yml

```yaml
project_id: ""
project_name: "My Project"
dataset_id: ""
dataset_type: "rl"
params:
  model: "Qwen/Qwen3-8B"  # Choose from supported models
  batch_size: 2
  num_epochs: 30
```