While math tasks benefit more from reasoning (higher InfoGain), medical tasks rely heavily on domain knowledge (higher KI). In fact, KI shows a stronger correlation (0.998) with task accuracy than InfoGain (0.698) in medical benchmarks.
Qwen-Base consistently outperforms DeepSeek-R1-distilled models across accuracy, InfoGain, and KI. The R1-distilled model struggles with medical adaptation, likely due to pretraining bias toward math/code domains
The authors ablate every step in the data pipeline, question sourcing, filtering, teacher choice, deduplication, and answer sampling, showing how each incrementally lifts performance. For instance, using multiple answers per question (16×) improved results more than simply increasing question diversity.
Surprisingly, QwQ-32B yielded better student models than DeepSeek-R1 or Phi-4 despite lower benchmark scores, suggesting teacher choice affects trace quality more than raw performance.
The full datasets and models are released at openthoughts.ai, providing a reproducible benchmark for open reasoning research.
Contrary to human intuition (fast-then-slow), models benefit from beginning with slow reasoning before transitioning to faster inference. This “frontloaded effort” schedule leads to more accurate problem solving.
Among several tested functions for controlling “wait” insertion (constant, linear increase, exponential/linear anneal), linear anneal—gradually reducing “wait” token frequency, proved best across multiple models and datasets.