ReinforceNow

Distillation

Distillation transfers knowledge from a large "teacher" model to a smaller "student" model. We'll use the MedBrowseComp dataset, a medical information-seeking benchmark that requires real web browsing.

Paste these prompts into Claude Code.

On-Policy Distillation

The student trains on its own sampled trajectories, with token-level supervision from an open-weights teacher at every step.
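The core objective can be sketched as a per-token reverse KL between student and teacher distributions, evaluated on tokens the student itself generated. This is a minimal illustrative sketch with toy tensors, not the rnow implementation; the reverse-KL choice is the standard one for on-policy distillation and is an assumption here.

```python
import torch
import torch.nn.functional as F

def per_token_reverse_kl(student_logits, teacher_logits):
    """Per-token KL(student || teacher). On-policy distillation scores
    every token of a student-sampled trajectory against the teacher's
    full next-token distribution, giving dense supervision."""
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    # KL(p_s || p_t) = sum_v p_s(v) * (log p_s(v) - log p_t(v))
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)

# Toy shapes (batch, seq_len, vocab) stand in for real model outputs.
torch.manual_seed(0)
student = torch.randn(2, 4, 16, requires_grad=True)
teacher = torch.randn(2, 4, 16)

loss = per_token_reverse_kl(student, teacher).mean()
loss.backward()  # gradients flow only into the student logits
```

In practice the teacher (e.g. Qwen3-32B) runs in inference mode over the student's rollouts, so only the student receives gradients.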

Claude Code Prompt
Train an on-policy distillation agent on the MedBrowseComp dataset:
https://huggingface.co/datasets/AIM-Harvard/MedBrowseComp

Use the on-distill-agent template with:
- Student: Qwen/Qwen3-8B
- Teacher: Qwen/Qwen3-32B
- Tools: browse (crawl4ai for real web content)

Process the entire train split (605 samples).

Before starting:
1. Ensure rnow is installed: uv pip install rnow
2. Run: rnow init --template on-distill-agent
3. Verify skills are loaded (check for rnow-* skills)
4. If skills not loaded, run: rnow init --template blank (to copy skills)

Off-Policy Distillation

Generate teacher completions up front, then train the student with SFT on the collected pairs.
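The data-generation step above can be sketched as a concurrent collection loop (the `concurrency=5` default mirrors the prompt below). `teacher_fn` is a hypothetical stand-in for the real teacher call, e.g. openai/gpt-4o via OpenRouter; this is not the rnow API.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_teacher_data(questions, teacher_fn, concurrency=5):
    """Off-policy distillation, step 1: gather teacher completions
    concurrently, then (step 2, not shown) run ordinary SFT on the
    resulting prompt/completion pairs."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        completions = list(pool.map(teacher_fn, questions))
    return [
        {"prompt": q, "completion": c}
        for q, c in zip(questions, completions)
    ]

# Stub teacher for illustration; in practice this is an API request.
records = collect_teacher_data(
    ["Which drug was approved for condition X in 2020?"],
    lambda q: "Answer synthesized from browsed sources...",
)
```

Because supervision comes from fixed teacher text rather than the student's own rollouts, the completions can be generated once and reused across training runs.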

Claude Code Prompt
Train an off-policy distillation agent on the MedBrowseComp dataset:
https://huggingface.co/datasets/AIM-Harvard/MedBrowseComp

Use the off-distill-agent template with:
- Student: Qwen/Qwen3-8B
- Teacher: openai/gpt-4o via OpenRouter
- Tools: browse (crawl4ai for real web content)
- Concurrency: 5 for data generation

Process the entire train split (605 samples).
Set OPENROUTER_API_KEY in .env file.

Before starting:
1. Ensure rnow is installed: uv pip install rnow
2. Run: rnow init --template off-distill-agent
3. Verify skills are loaded (check for rnow-* skills)
4. If skills not loaded, run: rnow init --template blank (to copy skills)