2025.06.09 - Stream Notes

Stream Notes
- Catch Up
  - Exhausted
  - Stopping magnesium
    - Massive headaches
    - Brain fog
- AI News
  - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
    - Apparently it’s making the rounds on X/Twitter
  - From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning
  - Knowledge or Reasoning? A Close Look at How LLMs Think Across Domains
    - While math tasks benefit more from reasoning (higher InfoGain), medical tasks rely heavily on domain knowledge (higher KI). In fact, KI shows a stronger correlation (0.998) with task accuracy than InfoGain (0.698) in medical benchmarks.
    - Qwen-Base consistently outperforms DeepSeek-R1-distilled models across accuracy, InfoGain, and KI. The R1-distilled model struggles with medical adaptation, likely due to pretraining bias toward math/code domains.
  - OpenThoughts: Data Recipes for Reasoning Models
    - The authors ablate every step in the data pipeline (question sourcing, filtering, teacher choice, deduplication, and answer sampling), showing how each incrementally lifts performance. For instance, sampling multiple answers per question (16×) improved results more than simply increasing question diversity.
    - Surprisingly, QwQ-32B yielded better student models than DeepSeek-R1 or Phi-4 despite lower benchmark scores, suggesting teacher choice affects trace quality more than raw performance.
    - The full datasets and models are released at openthoughts.ai, providing a reproducible benchmark for open reasoning research.
  - Coding Agents with Multimodal Browsing are Generalist Problem Solvers
    - Currently unclear what makes this different from giving an agent more tools
  - Self-Challenging Language Model Agents
  - AlphaOne: Reasoning Models Thinking Slow and Fast at Test Time
    - Contrary to human intuition (fast-then-slow), models benefit from beginning with slow reasoning before transitioning to faster inference. This “frontloaded effort” schedule leads to more accurate problem solving.
    - Among several tested functions for controlling “wait” insertion (constant, linear increase, exponential/linear anneal), linear anneal, which gradually reduces “wait” token frequency, proved best across multiple models and datasets.
  - The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
    - An eight-terabyte collection of openly licensed text designed for LLM pretraining
  - RewardBench 2: Advancing Reward Model Evaluation
  - How much do language models memorize?
    - Saw this last week!
  - OpenAI hit $10Bn in annual recurring revenue
    - Does this seem sus?
      - OpenAI claims they are serving more than 500 million weekly active users and 3 million paying business customers.
        
        According to DataReportal, 5.63 billion people worldwide were using the internet at the start of April 2025
        
        That would mean ~9% of the world’s internet users were using ChatGPT each week
        
        If we apply free-to-play analytics to this, what is our assumption on distribution of paying users?
        
        If 2% of users are paying, that means 10 million paying users
        
        We can assume paying users are split between the $20/month and $200/month plans
        
        Say each paying user is somehow worth $60/month on average: that is $600 million/month, or $7.2bn a year
        
        It seems unlikely that many people are paying $200/month. If all 10m users are paying $20/month, that works out to $2.4bn a year, well short of $10bn…
      - Also, Annual Recurring Revenue is a forward-looking run rate (current recurring revenue annualized), not revenue already earned
        
        Am curious what actual revenue they made since last year
- We raided deadslap
  - What was your raid message a few streams ago?
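
The back-of-the-envelope check on OpenAI's $10Bn ARR claim in the AI News notes above can be sketched in a few lines of Python. This is only a rough model: the user counts are the figures quoted in the notes, while the 2% paying share and the per-user price points are the notes' own guesses, not reported data. The function and variable names are made up for illustration.

```python
# Figures quoted in the notes above.
WEEKLY_ACTIVE_USERS = 500_000_000   # OpenAI's claimed weekly active users
INTERNET_USERS = 5_630_000_000      # DataReportal, start of April 2025
CLAIMED_ARR = 10_000_000_000        # $10bn annual recurring revenue

# Assumption from the notes (not reported data): 2% of users pay.
PAYING_PERCENT = 2


def annual_revenue(paying_users: int, monthly_price: float) -> float:
    """Yearly revenue if every paying user pays `monthly_price` per month."""
    return paying_users * monthly_price * 12


def required_monthly_price(paying_users: int, target_arr: float) -> float:
    """Average monthly spend per paying user needed to reach `target_arr`."""
    return target_arr / (paying_users * 12)


share_of_internet_users = WEEKLY_ACTIVE_USERS / INTERNET_USERS
paying_users = WEEKLY_ACTIVE_USERS * PAYING_PERCENT // 100

print(f"Share of internet users: {share_of_internet_users:.1%}")
print(f"Paying users at {PAYING_PERCENT}%: {paying_users:,}")
for price in (20, 60, 200):
    yearly_bn = annual_revenue(paying_users, price) / 1e9
    print(f"${price}/month -> ${yearly_bn:.1f}bn a year")
needed = required_monthly_price(paying_users, CLAIMED_ARR)
print(f"Average needed for $10bn: ${needed:.2f}/month")
```

Under these assumptions, the average paying user would need to spend roughly $83/month to reach $10bn. The estimate also leaves out the 3 million paying business customers mentioned in the claim, so it is a lower bound on what consumer subscriptions alone would have to cover.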
