The 26M Model That Does What Gemini Does

Opening

Needle landed on Hacker News Tuesday with 280 upvotes and 102 comments. The claim: a 26-million parameter model that matches Gemini's tool-calling behavior, distilled down from the large model's behavioral signature on that specific task.

That framing matters. This is not a benchmark comparison. It is a claim that structured function-call behavior, the one capability that makes agents actually work in production, is separable from model scale. You can distill it out, compress it, and run it cheaply without an API dependency.

If the eval holds on diverse tool schemas outside the training distribution, the next question is direct: which other expensive hosted capabilities are distillable? Tool calling is the first obvious candidate. JSON extraction, classification, and schema adherence are all learnable behaviors that do not require 70 billion parameters to execute reliably. They require training data from a model that already does them well.

ByteDance's deer-flow hit the same news cycle: 67,000 stars and a full long-horizon agent harness covering research, code, creation, sandboxed execution, subagents, and memory. It is open source and, judging by star count, widely adopted. A small distilled tool-calling model running inside a mature open-source harness is the stack that starts eating hosted agent API spend. Needle is in today's signals and deer-flow is in the drops, both below.

Today's Signals

Needle: 26M-parameter tool-calling model distilled from Gemini, 280 HN points, 102 comments. The repo documents the distillation methodology. If the eval holds on tool schemas outside the training set, this is the first credible public evidence that structured function-call behavior is separable from model scale. Implications for inference cost at production volume are significant. (github.com/cactus-compute/needle)
DeepMind: reimagining the mouse pointer for the AI era, 138 HN points, 115 comments. The post describes a pointer model that understands UI semantics rather than pixel coordinates. Not a product launch, a research direction. Operators building browser-use agents should read it: the framing suggests current selector-based approaches will look like telnet within two years. (deepmind.google/blog/ai-pointer/)
OpenAI shipped four Codex case studies in 48 hours: finance teams, NVIDIA engineers, AutoScout24, and a Q1 2026 ChatGPT adoption update. The pattern: large enterprises deploying Codex for internal tooling and workflow automation. OpenAI is assembling an enterprise reference library. That is a sales motion with a long tail. (openai.com/academy/how-finance-teams-use-codex)
Anthropic growing 10x per year, per Latent Space AINews from Saturday. If the trajectory holds, pricing pressure on GPT-4-class inference accelerates as Anthropic scales. Operators who locked in annual API contracts in Q4 2025 made a defensible call. (latent.space/p/ainews-anthropic-growing-10xyear)
Statewright: visual state machines for agent reliability, 74 HN points, 24 comments. State machine-based orchestration addresses the most common production failure mode: agents that drift on their own state across a long run and get stuck in unrecoverable branches. Worth a look if you run agents with more than three sequential steps. (github.com/statewright/statewright)

The Drops

[REPO] bytedance/deer-flow (GitHub): Long-horizon agent harness from ByteDance. Covers research, code generation, content creation, sandboxed execution, subagents, memory, tool routing, and a message gateway. 67,000 stars in one year. The architecture handles the full stack most operators are currently gluing together by hand: task decomposition, tool routing, memory management, subagent spawning. If you're building multi-step agents and stitching frameworks together manually, this is the reference implementation worth reading before your next architecture decision.

[REPO] langgenius/dify (GitHub): Production-ready agentic workflow platform, 141,000 stars, pushed this morning. Visual workflow editor, LLM routing, RAG, agent orchestration, API publishing. The star count reflects real production adoption, not hype. If a client asks for an agent platform they can own and self-host, this is the credible answer with the community to back it up.

[REPO] hiyouga/LlamaFactory (GitHub): Unified fine-tuning interface for 100-plus LLMs and vision-language models. 71,000 stars, published at ACL 2024. Supports LoRA, QLoRA, and full fine-tune across model families. The Needle distillation story requires a solid fine-tuning harness to execute. LlamaFactory is what operators actually use in production. If capability distillation becomes a workflow you want to run, this is the tooling that makes it tractable.

The Stack

[TOOL] llm 0.32a2 (simonwillison.net)

Simon Willison's llm CLI shipped 0.32a2 this week. The alpha tag is not a stability warning; the tool has been production-usable for over a year. The 0.32 cycle adds plugin improvements and model routing changes worth tracking if llm is already in your automation stack.

The use case for operators: llm as a Unix-composable model interface. Pipe text in, get structured output out, chain with standard shell tools. No framework, no SDK, no client library to maintain. For batch processing tasks (transcript extraction, content classification, log summarization), it is faster to prototype with llm than with any SDK wrapper. You get the tool working in 20 minutes and decide later whether it needs a proper service wrapper.

The plugin suite covers Ollama, OpenAI, Anthropic, Gemini, and most open-weight models via the llama.cpp backend. One CLI, one place for keys and defaults.
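For batch jobs, the same pipe-in, text-out pattern can be driven from a short script. The sketch below is one possible shape, not a recommended architecture: it assumes llm is installed and on PATH with a default model and key already configured, it relies on the CLI's -s (system prompt) and -m (model) options with the prompt read from stdin, and the helper names build_llm_command and run_llm are hypothetical.

```python
import subprocess
from typing import List, Optional


def build_llm_command(system: str, model: Optional[str] = None) -> List[str]:
    """Build the argv for an llm call that reads its prompt from stdin."""
    cmd = ["llm", "-s", system]
    if model:
        # -m selects a model; omit it to use whatever default is configured.
        cmd += ["-m", model]
    return cmd


def run_llm(text: str, system: str, model: Optional[str] = None) -> str:
    """Pipe `text` into llm and return its stdout (hypothetical helper)."""
    result = subprocess.run(
        build_llm_command(system, model),
        input=text,
        capture_output=True,
        text=True,
        check=True,  # raise if the CLI exits non-zero
    )
    return result.stdout.strip()
```

Calling run_llm(transcript, "Classify this transcript as standup, review, or client call. Reply with one word.") in a loop over files is the throwaway-script version; the decision about a proper service wrapper can wait until the prompt is settled.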

My take: if you are writing throwaway scripts that call the OpenAI API directly, switch to llm. The composability dividend shows up immediately. The 0.32 alpha is stable enough for low-stakes automation today.

The Onboard

Wednesday pattern: complexity-based model routing.

The default behavior is to send every agent call to the best available model. The behavior that preserves margin is to classify task complexity first and route accordingly.

The practical split: use a fast cheap model (Haiku, Gemini Flash, or an 8B local via Ollama) for classification, extraction, and any task where the schema is tight and failure cost is low. Reserve the expensive model for judgment calls, ambiguous inputs, and tasks where a wrong answer breaks something downstream.

The routing function scores each incoming task on three axes: output schema complexity (loose vs. strict JSON), downstream consequence (reversible vs. irreversible action), and whether the task requires multi-hop reasoning. Two or more axes flagged, route to the large model. Fewer than two, route cheap.
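The two-of-three rule fits in a few lines. This is a minimal sketch under stated assumptions: the Task fields, the model name placeholders, and the decision about what counts as "flagged" on each axis are all illustrative, and how a task gets flagged in the first place (hand-labeled per task type, or by a cheap classifier) is left open.

```python
from dataclasses import dataclass

# Placeholder names; substitute whatever cheap and large models you run.
CHEAP_MODEL = "cheap-model"
LARGE_MODEL = "large-model"


@dataclass(frozen=True)
class Task:
    schema_complex: bool  # output schema axis flagged as hard for a small model
    irreversible: bool    # downstream action cannot be undone
    multi_hop: bool       # task needs multi-hop reasoning


def complexity_score(task: Task) -> int:
    """Count how many of the three axes are flagged."""
    return sum((task.schema_complex, task.irreversible, task.multi_hop))


def route(task: Task) -> str:
    """Two or more flagged axes go to the large model; otherwise route cheap."""
    return LARGE_MODEL if complexity_score(task) >= 2 else CHEAP_MODEL
```

A task flagged only on schema complexity routes cheap; add an irreversible downstream action and it routes to the large model.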

Teams that implement this typically see a 60-80% reduction in inference spend within two weeks, with no measurable quality degradation on the tasks that routed down. A frontier model bills at its ceiling price whether the task needed the ceiling or the floor.

The Frame

The Distillation Trade

Every capability that migrates from a large model into a smaller one is a transfer of economic leverage from the lab to the operator.

Needle is one data point. But it names a direction that has been building for 18 months. The hard part of reliable agent behavior is not raw model intelligence; it is consistent structured output at a specific interface. Tool calling. JSON extraction. Function schema adherence. These are learnable, teachable behaviors. They do not require hundreds of billions of parameters to execute. They require training data from a model that already executes them well, run through a fine-tuning loop that compresses the behavioral signature into something smaller.

That is what distillation does. You extract a large model's behavioral pattern on a narrow task and compress it into something that runs locally, cheaply, without a rate limit, without an API contract, without a vendor dependency. The large model becomes a teacher. The small model becomes the production worker.
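The teacher half of that loop is mostly data plumbing. The sketch below shows one way to collect teacher tool calls into JSONL training records, keeping only outputs that parse and carry the required arguments. It is not Needle's documented method: the tool schema shape (a "name" plus JSON-Schema-style "parameters"), the record fields, and the teacher callable are all assumptions, and the teacher here stands in for whatever hosted model call you would actually make.

```python
import json
from typing import Callable, Iterable, List


def is_valid_call(raw: str, tool: dict) -> bool:
    """Check that a teacher output is a JSON tool call matching the schema."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict) or call.get("name") != tool["name"]:
        return False
    args = call.get("arguments")
    if not isinstance(args, dict):
        return False
    required = tool["parameters"].get("required", [])
    return all(key in args for key in required)


def build_dataset(
    prompts: Iterable[str],
    tool: dict,
    teacher: Callable[[str, dict], str],  # hypothetical wrapper around the large model
) -> List[str]:
    """Return JSONL lines of (prompt, tool, completion) for valid teacher calls."""
    lines = []
    for prompt in prompts:
        raw = teacher(prompt, tool)
        if is_valid_call(raw, tool):
            record = {"prompt": prompt, "tool": tool, "completion": json.loads(raw)}
            lines.append(json.dumps(record))
    return lines
```

Filtering at collection time matters because the student learns whatever the teacher emits, malformed calls included.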

The counterargument is edge-case reliability. Distilled small models fail on inputs outside their training distribution. That is real. The answer is not to avoid distillation; it is to build the eval suite that catches edge-case failures before you promote the small model to production. That is an engineering problem with a known solution.
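A promotion gate of that kind can be small. The sketch below scores a candidate model separately on in-distribution and held-out cases and only approves promotion when both clear a threshold. The EvalCase shape, the exact-match comparison, and the threshold values are assumptions for illustration; real suites usually need looser argument matching and far more cases.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    expected: dict         # expected tool call, e.g. {"name": ..., "arguments": {...}}
    in_distribution: bool  # False for tool schemas held out of training


def pass_rate(model: Callable[[str], dict], cases: List[EvalCase]) -> float:
    """Fraction of cases where the model's tool call exactly matches."""
    if not cases:
        return 0.0  # an empty slice is treated as a failure, not a pass
    hits = sum(1 for case in cases if model(case.prompt) == case.expected)
    return hits / len(cases)


def ready_to_promote(
    model: Callable[[str], dict],
    cases: List[EvalCase],
    min_in: float = 0.98,   # assumed threshold, tune to your failure cost
    min_out: float = 0.95,  # assumed threshold for held-out schemas
) -> bool:
    """Approve only if both slices clear their thresholds."""
    in_cases = [c for c in cases if c.in_distribution]
    out_cases = [c for c in cases if not c.in_distribution]
    return (
        pass_rate(model, in_cases) >= min_in
        and pass_rate(model, out_cases) >= min_out
    )
```

Scoring the held-out slice on its own is the point: a single blended pass rate can hide exactly the out-of-distribution failures the gate exists to catch.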

Anthropic growing 10x per year is not in tension with this story. The lab building the best large models and the operators distilling specific capabilities out of those models are not competing; they sit in different parts of the stack. Labs win on frontier capability, operators win on inference economics, and distillation is the bridge between the two.

The operators who build a distillation workflow early will have a structural cost advantage that compounds with each model generation. The ones who stay on full-size hosted models for every task will watch that gap widen every time a new small model ships.

Builder's Brief

Friday's kit is the Meeting Notes to Slack Summary build. The specific thing most teams get wrong before writing a single line of code: Slack format. A prose dump in a DM gets skipped. A Block Kit post with a bold Decisions header, owner-tagged action items, and a due-date column gets acted on. That is the product. The transcript is just the input.
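To make the format concrete, here is a rough sketch of a Block Kit payload with that shape. It is not the kit's template generator: the function name and the input dictionaries are hypothetical. It uses Slack's header, section, and divider block types, user mentions in the form <@USER_ID>, and a section's fields list, which Slack renders as two columns.

```python
from typing import Dict, List


def build_summary_blocks(
    decisions: List[str],
    action_items: List[Dict[str, str]],  # assumed keys: owner_id, task, due
) -> List[dict]:
    """Build Block Kit blocks: a Decisions header, then owner-tagged actions."""
    decision_text = "\n".join(f"• {d}" for d in decisions) or "_None recorded_"
    blocks: List[dict] = [
        {"type": "header", "text": {"type": "plain_text", "text": "Decisions"}},
        {"type": "section", "text": {"type": "mrkdwn", "text": decision_text}},
        {"type": "divider"},
        {"type": "header", "text": {"type": "plain_text", "text": "Action items"}},
    ]
    for item in action_items:
        blocks.append(
            {
                "type": "section",
                # Two fields render side by side: owner and task, then due date.
                "fields": [
                    {"type": "mrkdwn", "text": f"<@{item['owner_id']}> {item['task']}"},
                    {"type": "mrkdwn", "text": f"*Due:* {item['due']}"},
                ],
            }
        )
    return blocks
```

The returned list goes in the blocks field of a chat.postMessage call. Slack caps a message at 50 blocks, so long action lists need truncation or a follow-up post.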

The kit includes the Claude extraction prompt (tested across five transcript types: standup, product review, client call, 1:1, all-hands), the Block Kit template generator with three layout variants, and the margin math. At $49 per workspace and under $0.16 per month in Claude Haiku costs, the gap between Otter.ai's per-seat pricing and this stack's cost structure is where your margin lives.

Full breakdown Friday for Operator Access subscribers.

Before You Go

If Needle's distillation holds on your tool schemas, what is the last hosted API call your agents actually need?

You are reading The AIgent. Forward this to one builder who should be on the list.