May 13, 2026·5 min read·Programming

Don’t read AI-written Code

Sounds crazy. I know. But it makes more sense than you’d expect.

Sami Hindi

🔑

To be clear up front: this isn’t an argument against verifying AI-generated code. It’s an argument that reading it line-by-line, as a routine gate on every change, is the wrong place to do that verification.

At the beginning of the AI coding revolution (mid-2025 to early 2026), developers used to hide their use of AI in order to seem more technical and avoid being labelled a “vibe-coder”.

One way they did that was by saying they didn’t use the “Agent” mode in Cursor and preferred “Tab-to-Complete” instead. We all know now that this was mostly untrue (although Tab-to-Complete was very useful when LLMs weren’t that smart yet).

Now, the same sentiment still exists, just attached to a different thing. One side openly admits they’ve stopped reading the code that AI tools generate for them. The other side still insists:

“I need to review EVERY LINE OF CODE AI generates. I just don’t trust the models, they’re not there yet.”

Here’s the problem with that line of thinking: line-by-line reading is a weak verification gate, especially since our review instincts were trained on human code. We scan for bad variable names, inconsistent formatting, and obvious mistakes. However, AI-generated code is clean by default (usually). It can dress up the worst code ever, to be as beautiful as Sydney Sweeney (debatable).

So a reviewer skims it, sees nothing ugly, and approves, while the logic bug falls straight through. Reading AI code genuinely feels like verification, while doing far less verifying than reading human code does.
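To make that concrete, here is a hypothetical sketch. The function names and the requirement are invented for illustration and don’t come from any real codebase. Assume the requirement says “pages are numbered starting at 1”. The first function is tidy, typed, and documented, yet it returns the wrong page. The comment marking the bug is there for you, the reader; in a real diff it wouldn’t be.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def paginate(items: Sequence[T], page: int, page_size: int) -> list[T]:
    """Return the items on the requested page.

    Reads cleanly, but treats `page` as 0-indexed, while the (assumed)
    requirement says pages start at 1. Page 1 silently skips the first
    `page_size` items.
    """
    start = page * page_size  # bug: should be (page - 1) * page_size
    return list(items[start:start + page_size])


def paginate_fixed(items: Sequence[T], page: int, page_size: int) -> list[T]:
    """Same signature, with 1-indexed pages as the requirement asks."""
    start = (page - 1) * page_size
    return list(items[start:start + page_size])


items = ["a", "b", "c", "d", "e"]

# A check derived from the requirement, not from the code:
# page 1 with a page size of 2 must contain the first two items.
print(paginate(items, page=1, page_size=2))        # ['c', 'd']
print(paginate_fixed(items, page=1, page_size=2))  # ['a', 'b']
```

Both versions look equally reviewable. Only running them against the requirement tells them apart.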

One way I’ve personally been trying to modernize my reviewing skills is by using prompts similar to this one:

Make an html page explaining our diff, 
and give each feature change a score from 1 to 10.

At the end, include a table with Summary, Score, Improvement Suggestion,
and whatever other column you need.

An important detail, however, is that our job as developers didn’t really get much easier. Usually, a new abstraction drastically reduces the complexity of your job. However, I would claim that AI is not just “another abstraction”. If you’d like to learn more about this claim, refer to “Agentic Coding is a Trap” by Lars Faye. Essentially, it boils down to the fact that a higher level of ambiguity is not a higher level of abstraction.

In this case, I believe the complexity simply shifted from generating and reviewing the code to making sure the right prerequisites exist, so that the AI refrains from “sloppifying” your codebase and actually writes high-quality, easily maintainable code.

This includes:

having a small & concise AGENTS.md/CLAUDE.md (see this study)
having a clean project tree (src/, artifacts/, etc.)
a dedicated docs folder where the AI agent continuously writes documentation for the software it creates (preferably separated by date, as agents in general find it hard to keep revising the same documentation. You may see different results, but this has been my experience)
using an Agent-centric task management system, like beads by Steve Yegge
a test suite that the human actually owns. If you’re not reading the code, the tests are doing the verifying, so they can’t be written by the same model that wrote the code (or at least not in the same session), or they’ll share its blind spots. Derive at least some of them straight from the requirements, by hand.
writing detailed goal-posts
- without specifying in great detail how to get there. This is almost worth its own separate article.
  Essentially, models like GPT-5.5 are smart enough to pick the best available option, according to the context they have.
  If you’re using Claude Opus, you should definitely give it some recommendations and opinions, so it has a starting point.
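The human-owned test suite from the list above can be sketched as a table of cases copied by hand from the requirement text. Everything here is hypothetical: the requirement (“usernames are 3 to 20 characters, start with a letter, and contain only letters, digits, and underscores”), the `is_valid_username` candidate, and the helper names. The candidate stands in for whatever the model generated. The table is the part you write yourself.

```python
import re
from typing import Callable

# Hand-written from the (assumed) requirement, one or more rows per clause.
# Each row: (input, expected result, the clause it comes from).
REQUIREMENT_CASES = [
    ("ab", False, "at least 3 characters"),
    ("abc", True, "at least 3 characters"),
    ("a" * 20, True, "at most 20 characters"),
    ("a" * 21, False, "at most 20 characters"),
    ("1abc", False, "starts with a letter"),
    ("_abc", False, "starts with a letter"),
    ("abc_123", True, "letters, digits, underscores only"),
    ("abc-123", False, "letters, digits, underscores only"),
]


def failing_cases(candidate: Callable[[str], bool]) -> list[str]:
    """Run the candidate as a black box and describe every row it gets wrong."""
    failures = []
    for value, expected, clause in REQUIREMENT_CASES:
        actual = candidate(value)
        if actual != expected:
            failures.append(
                f"{value!r}: expected {expected}, got {actual} ({clause})"
            )
    return failures


def is_valid_username(name: str) -> bool:
    """Stand-in for model-generated code; the tests never look inside it."""
    return re.fullmatch(r"[A-Za-z][A-Za-z0-9_]{2,19}", name) is not None


print(failing_cases(is_valid_username))  # [] when every clause holds
```

Because each row names the clause it came from, a failure points back at the requirement rather than at the implementation, which is the point when you are not reading the implementation.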

However, this is much, much easier said than done. In my experience with AI coding tools, each model and provider has its own set of “tendencies”. I like to refer to this set of built-in, optional rules, which appear as a result of post-training and RLHF, as the Model’s Personality. This is why it’s so important to have someone on the team who spends a lot of time with these models, so they can figure out which model is best for each task and how to write the prompt to get the best possible result on the first try. On the other hand, if you update the (system) prompt with instructions that directly contradict the model’s personality, you can, and WILL, get worse results than when you fall in line with its choices and behaviors. Obviously this doesn’t mean you should throw away all your preferences. The point is to be more aware of the model’s tendencies and adapt your instructions to them.

⚠️

This doesn’t apply to all labs or models, but it does apply to most frontier research labs like Anthropic & OpenAI. For example, try telling Codex to code without writing tests, or the opposite with Claude. It’ll work to a certain extent, but once you start running out of context, for example, you’ll start facing issues.

Let’s address the elephant in the room: if nobody reads it, who debugs it when it breaks in production? Well, the reading doesn’t really disappear. But it stops being a mandatory gate on every change and becomes on-demand debugging. When something breaks, you debug. How? Find the logs and give them to your model of choice. Even at that point, you still shouldn’t be reading the code.
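A minimal sketch of that debugging loop, under stated assumptions: the log format (one `LEVEL message` entry per line), the `ERROR` marker, and both helper names are hypothetical, and the resulting text is simply pasted into whichever model you use. The idea is to hand over the failing lines plus a little surrounding context rather than the whole log.

```python
def error_context(log_lines: list[str], context: int = 2) -> list[str]:
    """Keep each line containing 'ERROR' plus `context` lines on either side."""
    keep: set[int] = set()
    for i, line in enumerate(log_lines):
        if "ERROR" in line:
            first = max(0, i - context)
            last = min(len(log_lines), i + context + 1)
            keep.update(range(first, last))
    return [log_lines[i] for i in sorted(keep)]


def debugging_prompt(log_lines: list[str], symptom: str) -> str:
    """Combine a one-line symptom description with the relevant log excerpt."""
    excerpt = "\n".join(error_context(log_lines))
    return (
        f"Production issue: {symptom}\n"
        "Relevant log excerpt:\n"
        f"{excerpt}\n"
        "Find the root cause and propose a fix."
    )


# Made-up log lines, for illustration only.
logs = [
    "INFO request 41 started",
    "INFO request 41 finished",
    "INFO request 42 started",
    "ERROR request 42 failed: KeyError 'user_id'",
    "INFO request 43 started",
    "INFO request 43 finished",
    "INFO request 44 started",
]
print(debugging_prompt(logs, "checkout returns HTTP 500"))
```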

All the things mentioned in this article almost require one person to own the repository or project they’re working on, or at least the main / master branch.

Why, you might wonder? Well, remember the technical discussions you used to have with your co-workers about the specific syntax? About when you’re fine with unsafe calls in Rust?

While that was a big tangent from the original topic, it is still related, because code-generation & reviewing generated code go hand-in-hand.