GPT-5.4 Can Use Your Desktop Now
We break down OpenAI’s GPT-5.4 and its native computer-use abilities, from screenshot-driven clicks and typing to why the 75% OSWorld score matters for real office automation. The episode also covers developer controls, finance and ops use cases, pricing, and the guardrails you’ll need before putting it into production.
Chapter 1
The model that can actually run your desktop
Andy InfoFina
Welcome to the show -- and I need to start with one number that made me sit up straight: 75%. [pauses] That is GPT-5.4’s score on OSWorld, and the human expert baseline is 72.4%.
Andy InfoFina
So this dropped March 5, 2026, and the big deal is not just “new model, bigger number, everybody clap.” [deadpan] We have enough of that already. The big deal is that OpenAI says GPT-5.4 is the first general-purpose GPT model with native computer use built in. Native. Meaning not some wrapper glued on top, not a brittle plugin, not a tiny demo where everything is carefully staged. The model can click, type, scroll, and navigate UI screens directly.
Andy InfoFina
And if you work in normal office software -- web apps, portals, internal tools, weird dashboards made in 2018 that somehow run payroll -- this is the moment the story changes. Because “automation” used to mean code-first automation. You wrote Playwright logic. You built RPA flows. You prayed nobody changed the layout. Now it starts to mean: show the model the screen and let it work.
Andy InfoFina
That sounds small until you think about what people actually do all day. Scrape data from a web app. Fill a multi-step form. Copy numbers from one system into another. Open desktop software, click through tabs, export something to CSV, upload it somewhere else. A lot of that work has been automatable in theory for years. In practice? [skeptical] It was often too fragile, too custom, too annoying to maintain.
Andy InfoFina
OSWorld is the benchmark to keep in your head here, because 75% versus a 72.4% human expert baseline is the kind of stat that turns “cool agent demo” into “okay... this might actually be operational.” Not perfect. I’m not saying hand this thing your production environment and go make coffee. But I am saying the center of gravity just shifted.
Andy InfoFina
And honestly, for me, this is where AI gets very real for spreadsheet people. Not in the abstract, not in some future robot office. In the painfully ordinary stuff. The screen is there. The buttons are there. The workflow is there. And now the model can see the same mess you see and still get through it. [softly] That’s a different category of tool.
Chapter 2
How you actually wire it up
Andy InfoFina
If you’re wondering how this looks in the API, the family names are pretty straightforward. The main model is gpt-5.4, and there are variants: gpt-5.4-pro, gpt-5.4-mini, and gpt-5.4-nano. If you want computer use, you enable it by passing computer_use as a tool type in the request.
Andy InfoFina
That little detail matters because this is not some separate product category off in a corner. It’s part of the model workflow. And the workflow itself is screenshot-driven: you send the model the UI state, and it returns mouse and keyboard actions. In some cases, it can also return Playwright-compatible code that operates the interface.
Andy InfoFina
So imagine the loop. The model sees the current screen, decides what to do next, clicks or types, gets the next screen, and keeps going. That is very different from the old way where you had to pre-script every selector and every branch like you were writing stage directions for a very nervous robot.
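For readers following along in text, the loop described here can be sketched in a few lines of Python. This is only an illustration of the observe, decide, act cycle, not OpenAI's client code: the `Action` type, `run_loop`, and the three callables (`take_screenshot`, `decide`, `execute`) are hypothetical names, and a real `decide` would presumably wrap a model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    """One step chosen for the UI (hypothetical shape, for illustration only)."""
    kind: str           # e.g. "click", "type", "scroll", or "done"
    payload: str = ""   # e.g. a button label or the text to type


def run_loop(
    take_screenshot: Callable[[], bytes],
    decide: Callable[[bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 20,
) -> List[Action]:
    """Observe the screen, pick an action, perform it, and repeat.

    Stops when `decide` returns a "done" action or after `max_steps`
    iterations, so a confused agent cannot run forever.
    """
    history: List[Action] = []
    for _ in range(max_steps):
        screen = take_screenshot()          # current UI state
        action = decide(screen, history)    # stand-in for the model call
        history.append(action)
        if action.kind == "done":
            break
        execute(action)                     # click / type / scroll for real
    return history


if __name__ == "__main__":
    # Scripted stand-in for the model so the sketch runs without any API.
    script = iter([Action("click", "Next"), Action("type", "42"), Action("done")])
    steps = run_loop(lambda: b"", lambda screen, hist: next(script), print)
    print([a.kind for a in steps])
```

The step cap is a design choice worth keeping in any real version: it bounds both cost and the damage a looping agent can do.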
Andy InfoFina
And here’s the underappreciated part: the 1-million-token context window. That matters a LOT for this kind of work, because you can keep full UI history, logs, and prior screen states in the same session instead of constantly dropping context. If a workflow spans a bunch of screens, retries, or odd edge cases, the model doesn’t have to keep forgetting what just happened five minutes ago.
Andy InfoFina
This is where I had a very personal flashback. [laughs] The last time I fought a flaky internal tool, I kept blaming the code. But if I’m being honest, most of the pain was not the code. The pain was the UI changing underneath me. A button moved. A label changed. A modal appeared where no modal had any right to appear. Suddenly the automation wasn’t “wrong,” it was blind.
Andy InfoFina
So if you’ve ever spent half a day fixing browser automation because somebody renamed a button from Submit to Continue, this should make immediate sense. The promise here is not magic. The promise is resilience. Less hard-coded guessing. More “look at the screen that actually exists right now and respond to it.” That’s a big upgrade.
Chapter 3
Why finance and ops should care -- and where the catch is
Andy InfoFina
The people who should probably care first are developers and teams in finance and operations. Not because it’s flashy -- [sarcastic] nobody has ever accused procurement systems of being flashy -- but because these departments run on repetitive, semi-structured screen work.
Andy InfoFina
Think invoice portals, internal dashboards, data entry, claims workflows, procurement systems. Places where humans do the same screen dance every day. Open this. Copy that. Validate this field. Click next. Download the file. Re-upload the file to a totally different system because apparently software vendors enjoy pranks.
Andy InfoFina
That’s where GPT-5.4 could replace a lot of brittle glue. The classic problem with Playwright scripts and older RPA tools is almost comically simple: somebody updated the button label, and now the bot is dead. Or the layout shifted. Or the menu is nested differently. The workflow itself didn’t change, but the script was written for a frozen version of reality.
Andy InfoFina
Now, OpenAI says the model is steerable via developer messages, and that is CRITICAL in sensitive environments. You can constrain which apps it may touch and which actions it can take before it ever sees a live system. That is the difference between “interesting toy” and “maybe we can let this near a real process.”
Andy InfoFina
But here’s the catch -- and honestly this is the real story. You need confirmation policies before production access. Full stop. If a model can operate software, then you need clear rules around what it may do autonomously and what requires a human yes. Especially anywhere money moves, claims get approved, records change, or compliance matters.
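One way to picture a confirmation policy is as a small gate that every proposed action passes through before it is executed. The sketch below is an assumption about how a team might structure that gate in its own code; the app names, action names, and `check_action` function are all hypothetical and are not part of any OpenAI API.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"      # safe to run autonomously
    CONFIRM = "confirm"  # pause and ask a human for a yes
    DENY = "deny"        # never run, regardless of confirmation


# Hypothetical allowlist and risk list; a real deployment would define its own.
ALLOWED_APPS = {"invoice_portal", "reporting_dashboard"}
RISKY_ACTIONS = {"submit_payment", "approve_claim", "delete_record"}


def check_action(app: str, action: str) -> Decision:
    """Decide whether a proposed action may run, needs sign-off, or is blocked."""
    if app not in ALLOWED_APPS:
        return Decision.DENY         # the agent may not touch this app at all
    if action in RISKY_ACTIONS:
        return Decision.CONFIRM      # money moves or records change: ask first
    return Decision.ALLOW


if __name__ == "__main__":
    print(check_action("invoice_portal", "download_csv"))
    print(check_action("invoice_portal", "submit_payment"))
    print(check_action("payroll_admin", "download_csv"))
```

Keeping the gate outside the model means the rules still hold even if a prompt-level instruction is ignored or misread.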
Andy InfoFina
And then there’s latency. Screenshot-based control adds latency. That doesn’t kill the idea, but it means any time-sensitive pipeline needs benchmarking before anyone gives it the green light. If a workflow is okay taking a little longer in exchange for being more flexible, great. If the process is truly time-critical, don’t assume. Measure it.
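"Measure it" can be as simple as timing one full screenshot-to-action step many times and comparing the slow tail against a budget. The helper below is a minimal sketch under that assumption; `benchmark_step` and `within_budget` are hypothetical names, and `step` stands in for whatever single round trip a team actually wants to time.

```python
import math
import statistics
import time
from typing import Callable, Dict


def benchmark_step(step: Callable[[], None], runs: int = 20) -> Dict[str, float]:
    """Time `step` repeatedly and report median, p95, and max in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        step()
        samples.append(time.perf_counter() - start)
    samples.sort()
    # Nearest-rank 95th percentile on the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "median_s": statistics.median(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }


def within_budget(stats: Dict[str, float], budget_s: float) -> bool:
    """Judge by the slow tail rather than the median: that is what users feel."""
    return stats["p95_s"] <= budget_s


if __name__ == "__main__":
    # Placeholder step; replace with one real screenshot -> action round trip.
    stats = benchmark_step(lambda: time.sleep(0.01), runs=10)
    print(stats, within_budget(stats, budget_s=0.5))
```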
Andy InfoFina
So the upside is obvious: less brittle automation in exactly the kinds of back-office workflows that eat hours. The caution is just as obvious: steer it tightly, gate risky actions, and benchmark the speed before you trust it. [reflective] Which, honestly, is a pretty adult way for this whole agent conversation to grow up.
Chapter 4
The price, the lighter model, and the race everyone else is running
Andy InfoFina
Let’s talk money, because this is where a lot of “wow” turns into actual decisions. Base pricing for GPT-5.4 lands at $2.50 per million input tokens. And GPT-5.4 mini keeps the same built-in computer use at lower cost, which makes it the obvious candidate for high-volume or budget-sensitive automation.
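To make the $2.50 figure concrete, here is a back-of-the-envelope calculation. Only the input price comes from the episode; the step count, tokens per step, and daily run volume are made-up illustration values, and the estimate ignores output tokens and any growth in context from accumulated history, neither of which the episode quantifies.

```python
# Input price quoted in the episode for GPT-5.4, in USD per million tokens.
INPUT_PRICE_PER_MILLION_USD = 2.50


def input_cost_usd(input_tokens: int,
                   price_per_million: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Cost of the input tokens alone at a flat per-million price."""
    return input_tokens / 1_000_000 * price_per_million


def daily_input_cost_usd(steps_per_run: int, tokens_per_step: int,
                         runs_per_day: int,
                         price_per_million: float = INPUT_PRICE_PER_MILLION_USD) -> float:
    """Rough daily input cost, assuming every step sends the same token count."""
    total_tokens = steps_per_run * tokens_per_step * runs_per_day
    return input_cost_usd(total_tokens, price_per_million)


if __name__ == "__main__":
    # Hypothetical workload: 30 steps per run, 2,000 input tokens per step,
    # 500 runs per day = 30,000,000 input tokens per day.
    print(daily_input_cost_usd(30, 2_000, 500))  # 75.0
```

Passing a different `price_per_million` lets the same arithmetic be rerun for a cheaper variant once its actual price is known.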
Andy InfoFina
That’s the model I would expect a lot of companies to test first. Not because mini sounds cute -- [chuckles] terrible technical criterion -- but because if you’re running repetitive workflows all day, cost discipline matters fast. A cheaper model with built-in computer use is exactly what ops teams are gonna look at.
Andy InfoFina
Now here is the detail people are absolutely going to miss if they skim the product page: GPT-5.4 nano does NOT include computer use. I’m gonna say that again because somebody, somewhere, is about to wire the wrong model into a workflow and then have a very bad Tuesday. Nano is not the cheap version that does everything. It does not have computer use.
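A small guard in your own code can catch the nano mistake before a request is ever sent. In the sketch below, the model names and the `computer_use` tool type come from the episode, but the request shape (`tools`, `input`) and the helper names are assumptions for illustration, not a confirmed API schema. The table also leaves out gpt-5.4-pro on purpose, because the episode does not say whether it supports computer use.

```python
from typing import Dict, Optional

# Support as stated in the episode. Models not listed are treated as unknown.
COMPUTER_USE_SUPPORT: Dict[str, bool] = {
    "gpt-5.4": True,
    "gpt-5.4-mini": True,
    "gpt-5.4-nano": False,
}


def supports_computer_use(model: str) -> Optional[bool]:
    """Return True/False if known, or None when support has not been confirmed."""
    return COMPUTER_USE_SUPPORT.get(model)


def build_request(model: str, task: str) -> dict:
    """Build a request payload, refusing models without confirmed support.

    The payload layout here is an assumed shape, not a documented schema.
    """
    support = supports_computer_use(model)
    if support is None:
        raise ValueError(f"computer use support for {model!r} is not confirmed")
    if not support:
        raise ValueError(f"{model!r} does not include computer use")
    return {"model": model, "tools": [{"type": "computer_use"}], "input": task}


if __name__ == "__main__":
    print(build_request("gpt-5.4-mini", "Export the March invoices to CSV"))
    try:
        build_request("gpt-5.4-nano", "Export the March invoices to CSV")
    except ValueError as err:
        print("blocked:", err)
```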
Andy InfoFina
And all of this is happening while the rest of the market is sprinting. Google’s Gemini 3.1 Flash-Lite claims 2.5x faster time-to-first-token than Gemini 2.5 Flash. Gemini 3.1 Flash TTS now supports more than 70 languages and native multi-speaker dialogue. So even outside direct desktop control, the competition is pushing hard on speed and real-world usability.
Andy InfoFina
Then you’ve got Meta’s Muse Spark, the first model from Meta Superintelligence Labs under Alexandr Wang. The notable angle there is native multimodal reasoning and multi-agent orchestration. Which is a fancy phrase, but the plain-English version is: the big players are all trying to build systems that can handle different kinds of input and coordinate more complex task flows.
Andy InfoFina
So the question is not whether agents are coming. [pauses] I think that question is basically over. The question is which stack gets RELIABLE enough first. Reliable enough to trust with ugly enterprise software. Reliable enough to survive UI drift. Reliable enough that a finance team stops treating it like a demo and starts treating it like infrastructure.
Andy InfoFina
And that’s the thought I’m leaving you with today: once a model can use a computer the way a person does, the bottleneck stops being “can AI do the task?” and becomes “what are we willing to let it touch?” [warmly] That’s a much more interesting question. Talk soon.
