We break down how Z.ai’s GLM-5.1 scored 58.4 on SWE-Bench Pro, edging past leading proprietary models on that coding benchmark. Then we dig into why its MIT license, open weights, and tool-use focus could reshape the business of AI coding assistants.
Episodes (5)
Andy breaks down why LLM demos can fail in production even when the model fits on the GPU: the real pressure often comes from the KV cache during long prompts and high concurrency. He also explains Google Research’s TurboQuant approach, how 3-bit cache compression could slash memory use and infrastructure costs, and what to test before trying it in a self-hosted stack.
We break down OpenAI’s GPT-5.4 and its native computer-use abilities, from screenshot-driven clicks and typing to why the 75% OSWorld score matters for real office automation. The episode also covers developer controls, finance and ops use cases, pricing, and the guardrails you’ll need before putting it into production.
Anthropic’s restricted release of Mythos, its most powerful model, to top defenders raises a huge question: is this a security breakthrough or the start of a new offensive AI arms race? We dig into the model’s reported ability to independently find, reproduce, and exploit a 17-year-old FreeBSD flaw, and what that means for patching, disclosure, and enterprise defense.
OpenAI’s acquisition of Hiro Finance comes with a rapid shutdown, permanent data deletion, and a seven-day window for users to export their information. The episode explores why Hiro’s verified financial math mattered, what Ethan Bloch’s team brings to OpenAI, and how this deal could signal a bigger push into domain-specific AI for personal finance.
