2025.06.02 - Stream Notes

Stream Notes
- Catch Up
  - Wedding
  - Car work
  - Coworker meetup
  - I started doing Leetcode prep again
- AI News
  - NLP AI papers of the week
    - “Sufficient Context”: context that plausibly allows answering a query, without requiring ground truth
      - Even when sufficient context is present, state-of-the-art LLMs like GPT-4o, Claude 3.5, and Gemini 1.5 still hallucinate answers more often than they abstain.
      - Conversely, models can sometimes answer correctly despite insufficient context, likely leveraging parametric memory.
      - Propose a “selective RAG” method that combines model self-confidence with the sufficient context autorater to decide whether to answer or abstain.
    - Self-Improving Code Agents
    - Intuitor is a reinforcement learning method that fine-tunes large language models (LLMs) using self-certainty – the model’s own internal confidence – as the sole reward
      - Framework is intended to allow language models to learn without any external rewards, gold labels, or verifiers
      - Enables scalable and domain-agnostic fine-tuning of LLMs in settings where human feedback or verifiable supervision is expensive or unavailable
    - Learn to Reason via Mixture-of-Thought
      - Mixed multi-modal reasoning
        
        Language Model (interpretability)
        
        Code Model (structured reasoning)
        
        Truth Tables (logic)
      - Supposed to be more like how humans reason (using multiple modes of thought)
    - Alita: Generalist Agent for Scalable Agentic Reasoning
      - Does web searches and code synthesis to make its own MCPs
    - State-of-the-Art Prompting for AI Agents
      - Be Hyper-Specific & Detailed (The “Manager” Approach):
        
        Summary: Treat your LLM like a new employee. Provide very long, detailed prompts that clearly define their role, the task, the desired output, and any constraints.
        
        Example: Parahelp’s customer support agent prompt is 6+ pages, meticulously outlining instructions for managing tool calls.
        
        6 pages of prompt seems like a lot
      - Assign a Clear Role (Persona Prompting):
        
        Summary: Start by telling the LLM who it is (e.g., “You are a manager of a customer service agent,” “You are an expert prompt engineer”). This sets the context, tone, and expected expertise.
        
        Benefit: Helps the LLM adopt the desired style and reasoning for the task.
      - Outline the Task & Provide a Plan:
        
        Summary: Clearly state the LLM’s primary task (e.g., “Your task is to approve or reject a tool call…”). Break down complex tasks into a step-by-step plan for the LLM to follow.
        
        Benefit: Improves reliability and makes complex operations more manageable for the LLM.
      - Structure Your Prompt (and Expected Output):
        
        Summary: Use formatting like Markdown (headers, bullet points) or even XML-like tags to structure your instructions and define the expected output format.
        
        Example: Parahelp uses XML-like tags like <manager_verify>accept</manager_verify> for structured responses.
        
        Benefit: Makes it easier for the LLM to parse instructions and generate consistent, machine-readable output.
      - Meta-Prompting (LLM, Improve Thyself!):
        
        Summary: Use an LLM to help you write or refine your prompts. Give it your current prompt, examples of good/bad outputs, and ask it to “make this prompt better” or critique it.
        
        Benefit: LLMs know “themselves” well and can often suggest effective improvements you might not think of.
      - Provide Examples (Few-Shot & In-Context Learning):
        
        Summary: For complex tasks or when the LLM needs to follow a specific style or format, include a few high-quality examples of input-output pairs directly in the prompt.
        
        Example: Jazzberry (AI bug finder) feeds hard examples to guide the LLM.
        
        Benefit: Significantly improves the LLM’s ability to understand and replicate desired behavior.
      - Prompt Folding & Dynamic Generation:
        
        Summary: Design prompts that can dynamically generate more specialized sub-prompts based on the context or previous outputs in a multi-stage workflow.
        
        Example: A classifier prompt that, based on a query, generates a more specialized prompt for the next stage.
        
        Benefit: Creates more adaptive and efficient agentic systems.
      - Implement an “Escape Hatch”:
        
        Summary: Instruct the LLM to explicitly state when it doesn’t know the answer or lacks sufficient information, rather than hallucinating or making things up.
        
        Example: “If you do not have enough information to make a determination, say ‘I don’t know’ and ask for clarification.”
        
        Benefit: Reduces incorrect outputs and improves trustworthiness.
      - Use Debug Info & Thinking Traces:
        
        Summary: Ask the LLM to include a section in its output explaining its reasoning or why it made certain choices (“debug info”). Some models (like Gemini 1.5 Pro) also provide “thinking traces.”
        
        Benefit: Provides invaluable insight for debugging and improving prompts.
      - Evals are Your Crown Jewels:
        
        Summary: The prompts are important, but the evaluation suite (the set of test cases to measure prompt quality and performance) is your most valuable IP.
        
        Benefit: Evals are essential for knowing why a prompt works and for iterating effectively.
      - Consider Model “Personalities” & Distillation:
        
        Summary: Different LLMs have different “personalities” (e.g., Claude is often more “human-like,” Llama 4 might need more explicit steering). You can use a larger, more capable model for complex meta-prompting/refinement and then “distill” the resulting optimized prompts for use with smaller, faster, or cheaper models in production.
        
        Benefit: Optimizes for both quality (from larger models) and cost/latency (with smaller models).
      - These prompt guidelines don’t seem that informative…in best case scenario, they seem intended for non technical people to “vibe code” technical work
  - Scraping millions of jobs with GPT4o-mini
    - Project located here at hiring.cafe
  - https://www.nytimes.com/2025/06/02/dining/ai-chefs-restaurants.html
- Coding
  - https://www.zdnet.com/article/java-at-30-how-a-language-designed-for-a-failed-gadget-became-a-global-powerhouse/
  - TODO see if you can make your own summaries/topics from the smol news AI newsletter
    - Daily is too frequent/noisy
    - NLP AI Newsletter is weekly and much more manageable
- From Chat / Derail
  - #nogoodnick_
    - badminton bot, trained in simulation, very impressive
    - what libraries or frameworks do people use to drive LLMs? langchain? or are there better options nowadays?
  - #jessynberries
    - Craziest programming ever made: Malbolge
  - #hittaus
    - Other crazy programming languages
      - Piet
    - Google released Jules in beta, an async coding agent
- We raided togglebit

2025.06.02 – Stream Notes

Socials

Related