
AI from the perspective of a software developer, year-end 2025

voltairesviceroy.substack.com

Read this article on Substack for images and better formatting

This article is a summary of where I think AI stands for software developers as of 2025. It will cover the following:

  • How much AI has progressed in 2025, based on vibes.

  • A summary of how I’m currently using AI for my work, and an estimate of how much it’s speeding me up in certain tasks.

  • A summary of the top-down attempt from the executives of the company to get people to use AI.

  • How much I’ve seen the other developers that I work with use AI of their own volition.

  • What advancements in AI I’d find most helpful going forwards.

I’ll keep my explanations non-technical so you don’t have to be a software engineer yourself to understand them.

As 2025 comes to a close, it will mark the end of the third full year of the LLM-driven AI hype cycle. Things are still going strong, although there’s been an uptick in talk of an impending dot-com-style crash. Basically everyone expects that to happen eventually; it’s just a question of when and how much it will bite. AI is almost certainly overvalued relative to the amount of revenue it could generate in the short term, but it’s also clear that a lot of people “predicting” a crash are actually just wishcasting for AI to go away like VR, NFTs, the Metaverse, etc. have done. That’s simply not going to happen. There will be no putting this genie back into the bottle.

Where AI stands for developers

The past few years of using AI to build software have looked something like this:

2023: This AI thing is nifty, and it makes a decent replacement for StackOverflow for small, specific tasks.

2024: OK, this is getting scarily good. Unless I’m missing something, this seems like it will change coding forever.

2025: Yep, coding has definitely been changed forever.

There’s no doubt in my mind that AI is the way that a lot of future coding will be done. This idea mostly solidified in 2024, and what I’ve seen in 2025 has fortified it further. The only question is how far it goes. That almost entirely depends on how good the technology eventually becomes, and how quickly. If AI stalls at its current level, then it would only produce a long-run productivity increase for coders of 20-50% or so. If it fulfills the two features I discuss towards the end of this article, then it could automate huge swathes of coding jobs and increase productivity by 100%+, but there will still almost certainly be human work to do in interfacing with + building these systems. I don’t foresee a total replacement of SWEs anytime in the near future without a massive recursive self-improvement loop that comes out of left field.

If you’re a developer, you’re shooting yourself in the foot if you’re not using AI to speed up at least some parts of your job. This isn’t an employment risk now, but it could be in 5-10 years if progress continues at its current pace1. My range for “reasonable disagreement” is the following: on the low end it’s something like “AI is useful in some areas, but it’s still woefully deficient for most parts of the job”, and on the higher end it’s something like “AI is extremely useful for large parts of the job, although you can’t trust it blindly yet”. On the upper end I remain pessimistic about “vibe coding” for anything except the most disposable proof-of-concept software, while on the lower end I respect anyone a bit less if they confidently dismiss AI completely in terms of programming and claim it's just another passing fad. AI is well within those two extremes.

AI progress in 2025

As for how AI progressed in 2025, results were mixed.

If evaluated from a historical context where technological progress is nearly stalled outside of the world of bits, then 2025 was a bumper year. Indeed, every year since 2023 has seen bigger improvements in AI than in practically any other field.

On the other hand, if we evaluate against the claims of the hypester CEOs, FOOM doomers, and optimistic accelerationists, then 2025 was highly disappointing (where’s my singularity???).

Judged against my own expectations, which were somewhere in between those two extremes, 2025 was a slight-to-moderate disappointment. There was a fair bit of genuine progress, but there’s still a lot of work to be done.

There were two major areas of advancement: LLM models got smarter, and the tools to use them got a bit better.

In terms of LLMs getting smarter, most of the tangible advancements came in the first few months of the year as reasoning paradigms proliferated, with Gemini 2.5 Pro, ChatGPT o3, Claude Sonnet 3.7, and DeepSeek taking the world by storm. This happened mostly between January and April2. These were good enough that they could solve basically any self-contained coding challenge within a few prompts, as long as you supplied enough context. They could still make stupid mistakes every now and then, and could get caught in doom loops occasionally, but these weren’t particularly hard to resolve if you knew what to look for and had experience with the tricks that would generally solve them. Compare the benchmarks of someone using ChatGPT 4o in November 2024 vs ChatGPT o3 in April of 2025 (see the chart in the Substack version).

After this, though, there wasn’t much beyond small, iterative improvements. New versions of Claude, Gemini 2.5 getting tuned in May and June before 3.0 released later in the year, ChatGPT getting v5, 5.1, and 5.2 -- none of these had any “oh wow” moments for me, nor did I notice any significant improvements over several months of use. I’m playing with Sonnet and Opus 4.5 now, and they’re… maybe a small step up? I’ve noticed they will correctly ask for additional context more often rather than hallucinate as weaker models tended to do, but this might also be explainable by me just improving my prompts. I keep up with model releases pretty well, and I see their benchmarks continuing to slowly rise, but it doesn’t really feel like I’m using a model that’s obviously smarter today in December 2025 than the one I was using in April 2025. Obviously this is all highly anecdotal, but vibes and personal use are how a lot of people primarily judge AI.

Part of the issue is almost certainly diminishing returns. Would it make that much of a difference if LLMs could code some feature in an average of 4 reprompts instead of 5? It’d probably be useful if you could consistently get that number down to 1 for all coding tasks as you could then trust LLMs like we do for compilers (it’s very rare for a dev to check compiler output any more). But getting AI to that level will take exponentially more work, and I doubt it will be doable at all without unhobbling AI in other ways, e.g. automating context management instead of relying on a human to do that.

The other major area of improvement was in the tools to use these LLMs, and by that I’m talking about AI “agent” tools like ChatGPT Codex and Claude Code. These are glorified wrappers that are nonetheless useful at doing 2 major things:

  1. They edit code in-line. In 2024 most people relied on copy+pasting code from the browser-based versions of LLMs, which caused a number of problems. Namely, either the AI would slowly regenerate the entire program each time there was a single alteration which would take forever and burn a bunch of tokens, or they would only regenerate parts of the code and thus rely on the user to selectively copy+paste the updated code in the right locations, which caused an endless series of headaches whenever there was a mistake.
  2. They have a limited ability to go out and find the context that they need, assuming it’s relatively close by (i.e. in the same repo).

I don’t want to undersell the usefulness of tools like Claude Code, as those two things are indeed very helpful. However, the reason I called them “glorified wrappers” and put the “agent” descriptor in quotation marks is because they’re still extremely limited, and aren’t what I had in mind when people were talking about “agents” a year or two ago. They cannot access the full suite of tools a typical fullstack engineer needs to do his job. They cannot test a program, visually inspect its output, and try again if things don’t look right. Heck, they don’t even do the basic step of recompiling the program to see if there are any obvious errors with the code that was just added. When people discussed “agents”, I had something in mind that was much more general-purpose than what we have right now.

Both Sam Altman and Jensen Huang stated that 2025 would be the “Year of AI Agents”, but this largely did not happen, at least in terms of general-purpose computer agents. There were a few releases like Operator that generated some buzz earlier in the year, but they were so bad that they were functionally worthless, and I’ve heard very little in the way of progress since then. I feel this is a bigger deal than most people think, as it’s almost certainly necessary (though not sufficient) for AI to have at least mediocre skills in general computer use for it to break free of the rut of short-time-horizon tasks it’s currently stuck in.

AI on the job in 2025

Below is a chart of the tasks I do as a software engineer, how helpful AI is in completing them, and the approximate percent of my working hours that I spent on them in 2020 vs today. My job over the past year has been modernizing a reporting tool that our Research team uses for statistical releases. The old tool was designed in the 90s using SQR and Java Swing, i.e. it’s two fossils duct-taped together. My task is to rebuild it in SQL and React.

| Task group | AI utility (out of 10) | % time in 2020 | % time in 2025 |
| --- | --- | --- | --- |
| Gathering requirements and useful meetings | 0 | 5% | 5% |
| Pointless meetings | 0 | 10% | 10% |
| Translating SQR to SQL (backend) | 7 | 50% | 25% |
| Creating the frontend | 8 | 15% | 10% |
| Creating the Excel templates | 2 | 10% | 15% |
| Dealing with the middle “guts” (service layer, service email, etc.) | 3 | 5% | 15% |
| Dealing with server issues | 4 | 10% | 20% |

Gathering requirements and useful meetings: AI is not helpful here since I need to know how I should be designing things, and that’s just most easily done by being present and asking questions. Perhaps I could use AI to help take notes, but that seems like more of a hassle than it’s worth so far.

Pointless meetings: AI is also not helpful here since these are socially enforced. Every day we have a completely unnecessary 30-45 minute “standup” (we are always sitting) where we go around the room with all the devs explaining what they worked on to the manager, like a mini performance review. These often devolve into debugging sessions if any task takes longer than a day. I think the manager is trying to offer helpful suggestions, but the way he phrases things always makes it sound like thinly veiled accusations that people aren’t working fast enough. He also has no real idea of the nuances of what people are working on, so his suggestions are almost always unhelpful. God I hate these meetings so much.

Translating SQR to SQL: In 2020 this would have been by far the bulk of my work. This is where the logic for the reports is, and AI has been a massive help. AI has been so good that I’ve just never bothered to learn much about SQR -- it’s close enough to SQL that I can get the gist of it, while any of the more complicated bits can be dealt with by the LLM. AI can often one-shot the data portions of the logic, although it has trouble with the more ad-hoc formatting bits. The reason I give this a 7/10 instead of a 10/10 is because my human skills are still needed to give context, define the translation goals, break compound reports into manageable chunks, reprompt when errors occur, and press the “compile” button. I’m more of a coordinator while AI is the programmer, and as a result I spend a lot less time on this part than I would otherwise need to.

Creating the frontend: AI is also quite helpful in wiring up the frontend. We have less ad hoc jank here, which typically means fewer reprompts are necessary, but my human skills are still required to give context, press the “compile” button, etc.

Creating Excel templates: The people we’re creating reports for are very particular about the formatting, which means AI isn’t very helpful. I’ve tried it a few times, but it’s just faster to do things myself to get the exact bolding, column widths, etc., though AI is good at giving conditional formatting formulas. Since AI has sped up the coding parts I’ve ended up working faster, which means I end up spending a greater percentage of my time on these.

Dealing with the middle “guts” and Dealing with server issues: The servers and the middleware “guts” are a tangled mess of cruft solutions that have built up over the years. Before AI I would have tried to avoid touching this stuff as much as possible, but now it’s much easier to understand what’s going on and thus to make improvements. If the servers go down (which is very frequent) I can try to proactively fix the problem myself. I’ve also played around with Jenkins to automate some of the server redeployments, and learned how to pass info through the various layers to tell if something is going wrong. This stuff is too tangled and spread out over different repos for a tool like Claude Code to automate anything, but LLMs can at least tell me what’s going on when I copy+paste code.

Overall, I’d estimate my work speed has gone up by about 50%. That’s spread out over 3 main areas:

  • 10% working faster and doing more.
  • 20% spending more time to design more robustly, avoid tech debt, and build automation, mostly in the middle “guts” and server areas.
  • 20% dark leisure -- most of this article was written while at work, for instance.

Management’s push for AI

As the executives at the company I work for have tried to implement AI, they’ve done so in an extremely clumsy way that’s convinced me that most people have no real clue what they’re doing yet. If what I’ve witnessed is even somewhat representative, then the public’s overall understanding of AI is currently at the level of a senile, tech-illiterate grandma who thinks that Google is a real person you ask questions to online rather than a “search engine”. Even something as basic as knowing ChatGPT has a model selector with this thing called “Thinking” mode easily puts you in the top 10% of users, perhaps even the top 1%.

Upper management is plugged into whatever is trendy in the business world, and AI is certainly “in” right now. The CTO (with implicit pushing + approval from the rest of the executive team, especially the CEO) is heavily promoting AI use, but this seems destined to be mostly a farce for the usual reason that all bandwagony business trends tend to be -- saying the firm is “using AI” is the goal moreso than any extra efficiency it can give. Oh sure, they’d happily take any extra efficiency if it was incidental, but it’s not really the primary focus. This acts as one of the most powerful anti-advertisements for AI use among the developers where I work. Fads like low-code tools or frameworks like CoffeeScript come along perennially and almost always disappoint. It’s no wonder that many devs instinctively recoil when a slimy MBA comes in who clearly doesn’t know much about AI or programming, and starts blathering on with “let me tell you about the future of software development!” We’ve seen it all before, so now whenever I try to sing the praises of AI I have to overcome that stench. “No really, AI is a powerful tool, just ignore how management is trying to ham-handedly shove it into everything, including places it obviously doesn’t belong”.

Since I’ve become known as the “AI guy” among the developers, I got put on the CTO’s “AI working group” -- basically a bunch of meetings twice a month where we discuss how AI could be implemented across the organization. I tried to go in with an open mind, but that didn’t last long.

Meeting 1: The CTO gathers the developers and asks for input on what AI tools we could augment the development stack with. He mentions examples including Github Copilot, Claude Code, and a few others. I express strong interest in Claude Code as I’ve heard good things about it, while another dev chimes in to back me up. The CTO responds with “interesting perspective… but what have you heard about Copilot?” It becomes clear that the meeting is consultation theater and that his mind is already made up -- he tells us he’s attended several conferences where Copilot was mentioned favorably and he has a bunch of promotional material to share with us. I ask if we could try both Copilot and Claude Code, but that idea is also shut down. I do some research on my phone and discover that Github Copilot is just a SaaS wrapper with the useful models locked behind “premium requests” that you have to pay a subscription to access. The difference is between a model like Sonnet 4.5 on the premium side and something like ChatGPT 4o-mini on the free tier -- if you’ve played around with those models, you’d immediately know there’s a vast gulf in capabilities there. I ask if we can get the business license to use the advanced models, and the CTO asks if the advanced models are really necessary. I cringe internally a bit, but convince him that they are, and he agrees to get the business licenses.

Meeting 2: The CTO gathers 2-3 employees from each department of the company for us to discuss how each of us is using AI in our day-to-day work environment. Well, at least that’s ostensibly what was supposed to happen. In practice it becomes an entire hour of introductions and each of us giving a little blurb on what our overall opinion is on AI. Basically everyone says some variation of “It sure seems interesting and I look forward to hearing how other people use it” without anyone actually saying how they use it beyond the most basic tasks like drafting emails. The only interesting part was when one of the employees said she hopes we can minimize AI use as much as possible due to how much power and water it uses. The CTO visibly recoils a bit on hearing this, but collects himself quickly to try to maintain that atmosphere of “all opinions are valid here”. I cringe a bit internally since the bit about water usage is fake, but I know the perils of turning the workplace into a forum for political debate so I say nothing.

Meeting 3: The CTO gathers the 2-3 employees per department and announces an “Action Plan” where the company will be implementing a feature that will allow us to build our own little mini Copilots that have specific context in them. He requests that each of us comes up with 2-3 use cases for where these could be utilized. These are using tiny models that will be worthless for coding so I mentally check out.

Meeting 4: This meeting involves the details of the “Action Plan” but I strategically take a vacation day to miss it.

How I’m seeing other developers (my coworkers) use AI

Despite the executives’ clumsy attempts with a top-down approach, I’ve seen a decent amount of bottom-up usage of AI among the developers that I work with, though it’s highly stratified.

Older dev #1: This guy is a month or two from retirement. He’s shown approximately zero interest in using AI himself, although he’s been impressed when he’s seen me use it.

Older dev #2: This zany individual has since left the company. He was, for some reason, a big fan of Perplexity -- an AI tool that I’ve heard very little positive feedback about from anyone else. Beyond this, he tried using AI to solve a difficult coding challenge (likely one he didn’t want to deal with personally), but a combination of 1) it being a tough problem, 2) him using a weak (free) AI model, and 3) him not being particularly experienced in aspects like how to break down a problem for AI and how to give it the right amount of context, all combined to produce an unsatisfactory result. After this, he mostly dismissed AI in the software world as any other passing fad.

Middle age dev #1: This guy, probably the most skilled dev out of all of us, has tried using AI at least a little and had a few positive stories to share about it, but it seems to be something he’ll only use occasionally. When we were debugging an issue together, his first instinct was the old method of Google searching + StackOverflow like it’s still 2020.

Middle age dev #2: He has shown some interest in using AI, but has little gumption to do it himself. He asked me for advice on using AI on a problem he clearly didn’t want to deal with since it was long and complicated. I told him that AI would probably only give middling results on a task like that, to his disappointment. I’ve seen very little AI use after this.

Middle age dev #3: I’ve seen him use AI a little bit here and there, but mostly just for debugging things, not systematically to generate code.

Middle age dev #4: He’s basically using AI as much as I do. I’ve asked him his opinions on the subject, and they’re a close mirror to my own -- he thinks AI is here to stay, but he’s doubtful of things like vibe coding. He subscribes to the $100 plan of Claude, and I see him using Claude Code almost all the time.

Younger dev #1: I see him using AI a decent amount, although it seems somewhat unsophisticated -- he’s been using the Github Copilot trial that the CTO pushed for. A lot of the time savings seem to be going towards dark leisure.

Manager: He’s tried to use AI out of what seems like pure obligation -- partially from the CTO pushing down from above, and partially from also being plugged into the MBA zeitgeist that’s saying that AI is trendy and implicitly that anyone who isn’t using it will “fall behind”. He offhandedly mentioned how he tries to set aside 30 minutes per day to test out AI, but the features he’s most excited about are extremely basic ones like voice mode and the fact that ChatGPT retains some basic memories between conversations. When he complained about hallucinations in a particular instance, I asked him what model he was using and he reacted with total confusion. Then we went to his screen and he was on the basic version of ChatGPT 5 instead of ChatGPT 5 Thinking, and I had to explain the difference -- so he’s not exactly a power-user.

Wishlist #1 for AI development -- more context!

One of the most critical aspects of programming with AI is managing its context window. LLMs learn pretty effectively within this window, but long conversations will rack up usage on your daily allotment or your API costs rather quickly. You want to start new conversations whenever you’re done working on a small or medium sized chunk of your program to flush out the old context and keep everything chugging along smoothly. When you start a new conversation, you’ll need to supply the initial context once again. In my case I keep several text files that have general guidelines for how the AI should approach the frontend, the backend, various connector pipelines, etc. Then I supply the latest version of the relevant code, which can span multiple files that I’ll need to manually select based on whatever feature I’m adding. Finally, I’ll need to tell the AI what I want it to build and the general gist of how that fits into the code I just supplied. All of this navigating around and thinking about what info should be added vs what should be omitted takes time. It’s not a huge amount of time, but it’s not trivial either -- of the time I used to spend manually coding, I’d reckon about 15% now goes to supplying context to the AI, and 85% to working with the code it spits out (integrating it, debugging it, reprompting if it’s wrong, etc.).
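To make the ritual concrete, below is a minimal sketch (in Python, purely as an illustration -- the file names and layout are hypothetical, not my actual repos) of the gathering-and-pasting I’m describing: pull in the guideline file, the handful of source files relevant to the feature, and the task description, then glue them into one prompt.

```python
# A minimal, hypothetical sketch of assembling context for a new AI conversation.
# None of these paths reflect a real repo layout; they're placeholders.
from pathlib import Path

GUIDELINES = Path("ai_guidelines/backend_notes.txt")   # general "how to approach this code" rules
FEATURE_FILES = [                                       # hand-picked files for the feature at hand
    Path("reports/monthly_summary.sql"),
    Path("service/ReportRouter.java"),
]

def build_prompt(task_description: str) -> str:
    """Concatenate the guidelines, the selected source files, and the task into one prompt."""
    parts = [GUIDELINES.read_text(encoding="utf-8")]
    for src in FEATURE_FILES:
        parts.append(f"--- {src} ---\n{src.read_text(encoding='utf-8')}")
    parts.append(f"TASK:\n{task_description}")
    return "\n\n".join(parts)

if __name__ == "__main__":
    print(build_prompt("Add a year-over-year comparison column to the monthly summary report."))
```

Even with a helper like that, the thinking part -- deciding which files actually matter for this feature -- is still on me, and that is exactly the part that eats the time.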

Problems arise if I get this wrong and don’t supply enough context. The AI will hallucinate in random directions, sometimes adding duplicative functions assuming they don’t exist, while other times it tries to rely on features that haven’t been created yet. The outcome of this is never catastrophic -- usually the code just fails to compile -- but it can waste a surprising amount of time if the issue only breaks at one critical line halfway through the generated output. If I’m being lazy and just copy + pasting what the AI is churning out, I’ll get the compiler error and copy + paste that back into the AI expecting it to fix it, which can lead to several rounds of the AI helplessly trying to debug a phantom error of a feature that doesn’t exist yet. Eventually either I’ll read through things more closely or the AI will figure out “hey wait, this function is totally missing” and things will be resolved, but this can take 15 or 30 or even 45 minutes sometimes. If you include the time I spend managing context, along with all the issues I have to resolve with not supplying the right context, then that 15%-85% split I mentioned in the previous paragraph can look more like 35%-65%. In other words, on a particularly bad day I might spend more than 1/3rd of my productive time just wrangling with context rather than actually building anything.

Some of this is just on me as a simple “skill issue”. I should be more attentive both in supplying context and in checking what the AI is pumping out. But on the other hand, if I exhaustively proofread everything the AI does then that would destroy much of the time savings of using AI in the first place. There’s a balance to be struck, and finding that balance is a skill. Even in a scenario where LLMs stalled out at their current capabilities as of 2025, I doubt I’d ever completely eliminate mistakes in managing context, for the same reason it’s unrealistic for even experienced developers to never have bugs in their code, but I’m confident I could get the time lost below 20% of my productive work time.

<20% isn’t that much, but I think this still underestimates the time lost to managing context for one final reason: it’s boring, and that makes me want to do other things and/or procrastinate. Managing context mostly just consists of navigating to various files and copy + pasting chunks of code, then writing dry explanations of how this should fit together with what you actually want to build. It’s not difficult by any absolute metric, but it’s fairly easy to forget one important file here or there if you’re not paying attention. Starting new AI conversations becomes a rupture in the flow state, which gives my brain the excuse to think “Break time! I’ll check my emails now for a little bit…” And then that “little bit” sometimes turns into 60+ minutes of lost productivity for normal human reasons.

Obviously this is a willpower issue on my part. But it’s also not particularly reasonable to demand that humans never procrastinate, especially on boring, bespoke tasks. My preferred solution to this issue would just be for someone else to fix the problem for me.

If we had unlimited access to infinite context windows, this entire subset of problems would basically disappear. Instead of needing to flush out the context every so often, we could just have one giant continuous conversation. Instead of strictly rationing context piecemeal for any given request, we could just dump the entire source code into the AI and let it sort itself out.

The fantasy of an infinite context window is fairly analogous to the idea of AI agents that are capable of “continuously learning”. We’re probably several years if not a decade+ away from a robust version of that. However, even moderate increases in usable context window sizes would still be very helpful. A usable context window that was twice as big would mean we’d need to refresh conversations half as often, which would mean half the time spent on supplying context, half the chances to screw up and provide too little, and half the opportunities to procrastinate on the issue. An order-of-magnitude increase could mean that a large codebase only needs 3-6 conversations going for each of its broad areas, rather than 30-60 separate conversations for each feature.

It almost feels like a fever dream that ChatGPT 4.0 launched with a context window of just 8K tokens back in early 2023, with a special extended version that could accept 32k tokens. Today, ChatGPT 5.1 and Sonnet/Opus 4.5 both have 200K+ context windows, while Google Gemini 3.0 advertises 1 million.

But it should be noted that just because they advertise something doesn’t mean it’s necessarily true for practical purposes. I used the term “usable context window” a few times, which depends on 2 main factors:

  • How good the compression algorithms are. From my limited research, it appears that context windows follow an O(N²) relationship, i.e. increasing context by 10x requires ~100x more compute, which is brutal. To get around this, model developers use tricks like chunking, retrieval, memory tokens, and recurrence to compress past context into a smaller representation. This allows for larger context sizes, but comes at the cost of making in-context elements “foggy” if the compression is too aggressive. Companies could advertise a context window of 10M or 100M, but if the details from 100K tokens ago are too compressed to be usable then that wouldn’t be worth much.
  • How much access end-users actually get at a reasonable cost. Longer conversations with more stuff in the context window use more resources, which model providers meter with increased API costs and daily allotment usage. I recently had an exceedingly productive conversation with Claude Sonnet on a long and complicated chunk of code that was almost certainly near the limit of the context window, and each message + response took a full 10% of my 6-hour allotment on the $20/month plan. Even if the model’s recall is perfect in a 10M context window, if companies charge an arm and a leg to use anything past 500K then 10M once again doesn’t mean much to the average user.

It’s almost certain that LLMs will continue to get cheaper over time. There’s a very robust trend towards more cost-efficiency per performance, so that half of the equation should continue to improve rapidly over the next few years.

As for how big the context windows will grow, that part is much less clear. I would expect it to get better, but not necessarily at a Moore’s Law exponential rate. The O(N²) relationship is the stick-in-the-mud here, and I’d expect lumpy jumps from time to time when there are breakthroughs in compression algorithms, or if companies feel they have sufficient compute to eat inflated costs.
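To put a toy number on the O(N²) point (my own back-of-the-envelope illustration, not a real cost model from any provider), here is how raw attention compute would scale relative to the old 8K window if nothing clever were done:

```python
# Toy illustration only: naive self-attention compares every token against every
# other token, so compute grows roughly with the square of the context length.
def relative_attention_cost(context_tokens: int, baseline_tokens: int = 8_000) -> float:
    """Rough O(N^2) scaling relative to an 8K-token baseline."""
    return (context_tokens / baseline_tokens) ** 2

for n in (8_000, 32_000, 200_000, 1_000_000):
    print(f"{n:>9,} tokens -> ~{relative_attention_cost(n):,.0f}x the 8K-window compute")
# Roughly 1x, 16x, 625x, and 15,625x -- which is why the compression tricks above exist.
```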

But there’s also somewhat of a case for pessimism on the rate of improvement, as it feels like context is getting a lot less attention than it used to. Part of that might be due to how most casual users won’t need super huge context windows unless they’re doing some crazy long RP session over multiple weeks. Context limits used to be highlighted metrics when new models were released, but many new models don’t even bother mentioning them at all anymore. This feels strange when there are some very goofy benchmarks like Humanity’s Last Exam that get cited almost every time there’s a 1-2 point improvement. The only benchmark I know of that tests context recall is Fiction Livebench, which is questionably designed and rarely updated -- as of December 2025 the latest results are from September 2025, with many of the most recent models not listed.

The days of 8K token context limits feel similar to when video games in the 80s had to be compressed down to just a few hundred kilobytes (e.g. Final Fantasy 1 was less than 200 KB). In just a few years we’ve entered the equivalent of the megabytes era, which is a lot better but is still quite restrictive in a lot of ways.

Wishlist #2 for AI development -- better general computer use

The other innovation that would substantially improve AI coding at this point would be for the AIs to learn how to use computers like humans can. There was quite a bit of noise around this in late 2024 with Claude and in early 2025 with Operator, but the consensus at the time was that they were both terrible. They were at the GPT-2 era of functionality that could sometimes do something useful, but relying on them was a fool’s errand.

Since then, we haven’t heard much in the way of progress. Gemini had a version of computer use in October, but it was mostly focused on browsers and still wasn’t very good. General computer use agents are still broadly worthless as of year-end 2025.

Competent general computer use would be extremely helpful for moving between various layers of software. Most nontrivial modern applications end up being a horrible Frankenstein monstrosity of many different frameworks that each deal with one specific issue. Software developers have euphemistically come to refer to this as a “stack”. For a basic example, a customer who logs into a site will see the frontend in a language like Javascript, while validating their login credentials and serving account details is dealt with by a backend in a language like SQL.

In practice, just having 2 parts like this is VERY conservative. For a more realistic example, I’ll go with what I’m working on at my job. I’m rebuilding an internal app for statistical data that needs to Create, Read, Update, and Delete entries. It’s a bit more complicated than your average CRUD app, but at the end of the day it’s still just a CRUD app -- it’s not exactly rocket science. Yet even for this, I have to deal with the following:

  • Monthly: The frontend.
  • service-layer: An interstitial layer that routes most of the info from the frontend to the backend.
  • azure-sharepoint-online: Another interstitial layer that routes some info from the frontend to the backend.
  • service-reports: This receives calls from the backend and packages them into formatted excel reports.
  • service-email: This emails finished reports to the end-user.
  • The backend to query data goes through TSQL, which we build and debug with an IDE called DBArtisan.
  • The formatting of reports happens through a package called Aspose, which turns TSQL results into an Excel format.
  • The Excel sheets themselves also require a bunch of manual formatting for each report.
  • Jira tickets for documenting work.
  • Bitbucket for source control, and navigating between 3 different environments of DEV, TEST, and PROD.
  • The internal server that hosts this app runs off of a Linux box and Apache Tomcat.

AI within any single one of these domains is quite good. Claude Code can hammer out a new UI element in the Monthly repo pretty effectively (assuming I provide it adequate context). Similarly, I can go to the browser interface of any frontier AI and generate SQL that I can paste into DBArtisan to do backend calls within a handful of prompts. But navigating between any of these areas remains the domain of humans for now. AI simply doesn’t have the capacity to operate on the entire stack from end-to-end, and so its ability to accomplish most long-horizon programming tasks is gimped before it's even out of the starting gate. Before AI, I reckon I spent 80% of my time writing individual features, and 20% of the time gluing them together. Now I spend about 1/3rd of my time supplying context to the AI, 1/3rd of my time verifying AI output and reprompting, and 1/3rd of my time gluing various AI-built things together.

An AI that could navigate the entire length of an arbitrary software stack is a necessary (though not sufficient) requirement to be a drop-in replacement for a software developer. The most robust solution would be for the AI to just learn how to use computers like a human could. The main benchmark for this is OSWorld. Unfortunately, this is [also quite bad](https://epoch.ai/blog/what-does-osworld-tell-us-about-ais-ability-to-use-computers) like Fiction Livebench is, although at least it gets updated more frequently. Part of the issue is that many of the questions that are ostensibly supposed to be judging computer + keyboard usage can be cheated through the terminal where AI is stronger. However, this got me thinking “maybe this could be part of the solution?”

By that, I mean another potential solution to the problem is to flip it upside down: instead of designing an AI that can use arbitrary computer applications, we could design computer applications that an AI could more effectively use. Claude Code can be seen as an early version of this, as it’s very good at building things through the terminal. I want to use it as much as possible, so when designing future projects I’d look for backend solutions that play to its strengths while avoiding its weaknesses. For example, in the future I’ll look for pipelines that avoid calling SQL through janky IDEs like DBArtisan. Also, while breaking the main body of our apps across 5 different repos may have made sense pre-2023, now it just makes the context management portion more difficult than it needs to be. Beyond what I could do as a single software dev, the AI designers and third-party scaffolding companies could also probably create individual apps or bundled computer environments where AI excels. These types of solutions would be a lot more brittle than simply making a competent general-purpose computer agent, but they’re probably a lot easier, and so they could serve a transitional role like how the somewhat-dirty natural gas is helping us move away from ultra-dirty coal to clean renewables.
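As one concrete (and purely hypothetical) example of what “playing to its strengths” could look like: a small script that runs a .sql file against the backend straight from the terminal, so an agent could execute a query and read the results itself instead of waiting for a human to paste things into a GUI IDE. This is a sketch assuming a SQL Server backend and the pyodbc package; the connection string and paths are placeholders, not our real setup.

```python
# Hypothetical sketch: a terminal-friendly SQL runner that an agent could invoke.
# Assumes SQL Server and the pyodbc package; the connection string is a placeholder.
import sys
import pyodbc

def run_sql_file(path: str, conn_str: str) -> None:
    with open(path, encoding="utf-8") as f:
        sql = f.read()  # note: scripts with GO batch separators would need to be split first
    with pyodbc.connect(conn_str) as conn:
        cursor = conn.cursor()
        cursor.execute(sql)
        if cursor.description:                # the statement returned rows
            for row in cursor.fetchmany(20):  # print a small sample for the agent to inspect
                print(row)
        conn.commit()

if __name__ == "__main__":
    run_sql_file(
        sys.argv[1],
        "DRIVER={ODBC Driver 18 for SQL Server};SERVER=dev-db;DATABASE=reports;Trusted_Connection=yes;",
    )
```

With something like that on the path, “run this query and show me the first few rows” becomes a step the agent can do on its own rather than one that has to wait for me.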

Conclusion

As I said before, I don’t think that LLMs delivering slightly more accurate one-shot answers is a good way to further their programming capabilities at this point. This is what a lot of benchmarks are set up to measure, which is part of the reason I think a lot of benchmarks are kind of superfluous now. If continuing in that direction is all 2026 has in store, I’ll be very disappointed. On the other hand, if there’s even a moderate amount of progress in the two areas I’ve described, then we could see a discontinuity, with LLM programming ability jumping up to where it can suddenly do 80% of a senior programmer’s job instead of the current 30-40%. These two areas are pretty similar to what Dwarkesh cited recently, and progress (or lack thereof) in them is what I’ll be watching most closely as a leading indicator of how well LLMs will scale in the near future.

I can’t wait for what AI has in store in the next few years. I hope it keeps scaling and we get more periods like early 2025 with rapid progress. AI has been far more positive than negative, though it will probably take the broader public many years, or perhaps even a decade+ to realize this. There are other AI tools I’ve played around with like Nano Banana Pro that are extremely impressive on their own terms, but which are beyond the scope of this article.

In any case, I hope this article has given you some insight into what AI looks like on the frontlines of software development.


Beyond what I've listed here, I've also used AI for data analytics using R/Python. It's really good at disentangling the ggplot2 package for instance, which is something I had always wanted to make better use of pre-LLM but the syntax was rough. It's good at helping generate code to clean data and do all the other stuff related to data science.

I've also used it a bit for some hobbyist game development in C#. I don't know much C# myself, and LLMs are just a massive help when it comes to getting started and learning fast. They also help prevent tech debt that comes from using novice solutions that I'd otherwise be prone to.

At this point it's better to ask "what standalone programming tasks can't LLMs help with", and the answer is very little. They're less of a speedup compared to very experienced developers that have been working on the same huge codebase for 20+ years, but even in that scenario they can grind out boilerplate if you know how to prompt them.
