What works well so far

This article is an unordered collection of things that I have tried to do with AI (mostly Claude Code using Opus 4.6, to be more specific).

Code Review

Brilliant. Claude is great at code review. We have integrated the claude github action (yes, the standard one from anthropic) into our workflow and ask claude for a review by mentioning one of various flavors of @claude review. It tends to uncover bugs that no human would ever detect.

Variants of this:

@claude review, keep it brief
@claude review, output a table of issues
@claude review, focus on changes since last review

Does not replace human review though. Humans tend to find an entirely different class of issues that AI has, so far, proven unable to detect: the ones that require oversight and large-scale context. Human code review is also about 80% of how we developers communicate with each other. Before we give this up, we'll at least need a new, ideally better suited medium to move this communication onto.

It's also really important to not just implement all suggested changes verbatim. I'd estimate that I usually implement 50% of claude's feedback and reject the rest (the share is rising as models improve though).

Simple issues in github actions

I assign simple issues directly to claude via @-mention (@claude implement this). This works really well if the issue is self-contained and small enough. It implements the feature just from the issue description and saves me all the ceremony of checking out a branch, implementing a change, committing, pushing, creating a pull request etc. I can even iterate on it with claude using further @claude comments on the pull request.

The only caveat: if the issue turns out to be too hairy to be solved with at most one round of corrections, it makes a lot of sense to check out the branch and work on it locally. Iteration cycles are slow, github actions take time to set up and tear down, and each run burns tokens recreating context from scratch.

This is where better tooling could shine. Maybe the solution is already out there but I'm conservative enough to scan the landscape for another while before I settle on a solution.

Claude code online

Claude has an online feature at claude.ai/code. I love this feature. I can pick a repo and issue or branch and just start working there. It runs claude code in some VM on anthropic's servers. I do not have any sort of environment set up in there (and I'm not going to trust anthropic with any of my secrets, I hear these guys vibe-code a lot...) so this is most useful for self-contained projects like ai-pod but still, really cool.

This solves, to a degree, the same problem as the github actions setup but doesn't lose context on every run, and interaction is immediate, just like with local claude code. Plus it can help me brainstorm things on my phone with the actual code as context. On the other hand it's another interface I have to keep track of in addition to github comments and my local terminals.

Issue analysis

I have a scheduled job that makes claude pick one issue per day, analyze it, ask questions and suggest an implementation plan. This is essentially a test run for whether claude could actually enhance all our issues.

We had mixed results with this. On some issues claude made useful remarks, on some it ended up producing a long wall of text that is, in the end, not worth the time it takes to read it. Might be possible to get useful results with a better prompt here I guess.
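The post does not show how the daily job picks an issue or what prompt it uses, so the following is only a minimal sketch under stated assumptions: the `Issue` shape, the "oldest unanalyzed issue first" rule, and the word limit in the prompt are all hypothetical. The word limit is one possible way to push back against the wall-of-text outcome.

```typescript
// Hypothetical issue shape for illustration; not the GitHub API type.
interface Issue {
  number: number;
  title: string;
  analyzed: boolean; // assumed to be derived from a label the job sets
}

// Assumed selection rule: the oldest issue (lowest number) not yet analyzed.
function pickIssueForToday(issues: Issue[]): Issue | undefined {
  return issues
    .filter((issue) => !issue.analyzed)
    .sort((a, b) => a.number - b.number)[0];
}

// Build a prompt with explicit limits on questions and plan length.
function buildAnalysisPrompt(issue: Issue, maxWords = 200): string {
  return [
    `Analyze issue #${issue.number}: ${issue.title}`,
    `Ask at most three clarifying questions.`,
    `Suggest an implementation plan in at most ${maxWords} words.`,
  ].join("\n");
}
```

Whether a hard length limit actually improves the remarks is something only a few weeks of runs would show.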

Writing

All the blog posts I publish are completely hand-written. Writing is, for me personally, mainly a tool to slow down my own thoughts. AI is there to accelerate them. When I tried formulating things in tandem with claude I noticed that the output does not feel like my own and the process isn't half as helpful as doing it on my own.

Nevertheless, I still created a workflow so I can write blog posts in markdown files on my computer and sync them to DatoCMS instead of writing in their web interface. That way I can still have claude:

translate what I write from english to german
fix typos, inconsistencies, stylistic issues
summarize the result for me.

Woman at a pool, gigantic checkmark and x symbol in the back — I love the color-blocking style template on Nano Banana but it does have a hard time producing anything without a pool from it.

Implementing (comprehensive) features

There are features that are well suited to AI-implementation. If there is one thing I have learned from recent months of AI-use, it is which features are Vibe-Codable and which are not.

My rule of thumb is: if I know exactly where I want to go with a feature and how it should be implemented then claude will usually be able to produce reasonably good code for it, even without me spelling out my thoughts too much. AI just does not deal well with ambiguity and ambiguity in my thoughts translates directly into ambiguity in my prompts.

If I do not see at least the rough shape of the result I want to produce, it's time to get into it manually. Sure, sometimes a good planning session helps. But then again, my own ambiguity is usually due to me working on a system that I do not know well, and in those cases sketching a code solution and seeing how edge cases and ugly workarounds emerge is just more direct and useful feedback to me than a famously sycophantic LLM.

This has led to a change of my daily routine: In the morning I decide whether I will be working on a bunch of issues in parallel using AI or on a single issue with focus but without AI. Doing both at the same time does not work because the required state of mind is just so different.

To break it down, I think the main thing AI cannot figure out on its own is intent. It can produce plans, solutions, tests, code in all kinds of languages, great words, maybe the best words. It just cannot figure out the intent of a feature, so this is what a human needs to input. If the intent of a change is not clear to me then I cannot communicate it to the LLM, and it will produce unintentional code. That code might be correct, but only by accident.

Refactoring

Claude has made it easy to do those long-awaited refactorings. I have always been a fan of bold refactorings. Until now I was limited to things I could do semi-automatically with regexes and some oversight. Now claude can do almost arbitrary refactorings and it is really good at it.

This use case does not present the same challenges as actually implementing new features. With a refactoring the code is already there, the intent is encoded in it and all claude needs to do is adjust it following a strict pattern.

There are two remaining issues with claude-based refactorings:

claude is surprisingly bad at doing things exhaustively. The results are on a best-effort level. This is not a problem when the compiler can guide it, and in some instances it can work through lists of references using static analysis tooling such as language servers (LSP). Sadly those are also usually the cases where I do not need claude at all to do the refactoring. This does get better with increasing model capabilities.
Refactorings touch lots of code. When claude does one I generally lose some confidence in all the code it touched because it does, every once in a while, introduce what I call its little easter eggs.
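As an illustration of what "the compiler can guide it" can look like in TypeScript, here is a minimal sketch of an exhaustiveness check. `PaymentState`, `assertNever` and `describePayment` are hypothetical names made up for this example. If a refactoring adds a new member to the union, every `switch` written this way stops compiling until the new case is handled, so nothing depends on best effort.

```typescript
// Hypothetical domain type, used only for illustration.
type PaymentState = "pending" | "paid" | "refunded";

// Only callable with a value of type `never`, which is what remains
// once every member of a union has been handled.
function assertNever(value: never): never {
  throw new Error(`Unhandled case: ${String(value)}`);
}

function describePayment(state: PaymentState): string {
  switch (state) {
    case "pending":
      return "Waiting for payment";
    case "paid":
      return "Payment received";
    case "refunded":
      return "Payment refunded";
    default:
      // Adding a member to PaymentState turns this line into a
      // compile error until the new case is handled above.
      return assertNever(state);
  }
}
```

This only covers places that already go through such a check. Code that compares against string literals elsewhere still needs a search or a reference list.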

Tests

AI can produce very useful tests. I can describe a scenario I want to test in plain english and it will produce a working test for it, however untestable the scenario might be. The test code can get ugly but it usually works fine and tends to be at least useful enough to guide the LLM to actually fix the problem the test was written for.

What does not work:

"Add some tests." Bad prompt. Don't do this, it will add meaningless tests that only create additional work and cost precious pipeline minutes.
Fixing tests and code. The AI does not understand the intention of either code or test, so if it is allowed to change both it will essentially pick at random whether the test or the code should be fixed. This leads to horribly bad results sometimes:

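import { describe, expect, it, vi } from 'vitest'
import { enforceAuth } from '@lib/auth'

// Anti-pattern: the mock replaces the very function under test,
// so the assertion only checks the mock and always passes.
vi.mock('@lib/auth', () => ({ enforceAuth() { return false } }))

describe('enforceAuth', () => {
  it('returns false if no user is authenticated', () => {
    expect(enforceAuth()).toBe(false)
  })
})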
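The test above passes no matter what the real enforceAuth does, because it only ever calls the mock. For contrast, here is a minimal sketch of a test that exercises real logic. The actual '@lib/auth' module is not shown in this post, so the `Session` type and this `enforceAuth` signature are hypothetical stand-ins. The sketch uses a small `check` helper instead of a test runner so that it runs on its own.

```typescript
// Hypothetical stand-in for the real module; the session parameter is an
// assumption made for this example.
interface Session {
  userId?: string;
}

function enforceAuth(session: Session): boolean {
  return session.userId !== undefined;
}

// Minimal assertion helper so the sketch needs no test framework.
function check(condition: boolean, message: string): void {
  if (!condition) throw new Error(message);
}

// The input is controlled; the function under test is the real one.
check(enforceAuth({}) === false, "unauthenticated session must be rejected");
check(enforceAuth({ userId: "u1" }) === true, "authenticated session must pass");
```

If someone breaks the implementation, these checks fail, which is the property the mocked version lacks.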

Rewrites

It is very easy now to approach a legacy codebase and rewrite it from scratch in a couple of days with an LLM. A rewrite usually has a goal. This goal might be to improve overall performance.

The result will usually look great: all the code will look clean and the targeted metric will be met.

After a while one tends to notice that some feature that was present in the old codebase is missing. For example, claude increased a website's performance score by 20% and accessibility (as measured by lighthouse) by 40%. This sounds great. But along the way it might have lost the structured data that did so much for the actual search ranking.

Now, these things can usually be caught by careful review or audits and then fixed. But by producing code that looks extremely plausible, claude has accidentally made auditing the code a lot harder. Usually an experienced developer can scan a codebase for patterns such as frequently touched files, known code smells, exotic naming, repetition, dead code, general code-ugliness. All of these techniques fail on a claude-rewrite of a codebase, so auditing it is really down to auditing the result. This leaves us with just half the toolkit.
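Auditing the result can be partly automated. As a minimal sketch for the structured data example, the functions below compare the JSON-LD blocks of an old and a new page and report which schema.org types went missing. The names `jsonLdTypes` and `missingTypes` are hypothetical, and the regex-based extraction is an assumption that only suits a quick check: it ignores microdata, RDFa and nested `@graph` entries.

```typescript
// Collect the "@type" values from all JSON-LD script blocks in an HTML string.
function jsonLdTypes(html: string): string[] {
  const pattern =
    /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const types: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    try {
      const data: unknown = JSON.parse(match[1]);
      const items: unknown[] = Array.isArray(data) ? data : [data];
      for (let i = 0; i < items.length; i++) {
        const item = items[i];
        if (item !== null && typeof item === "object") {
          const type = (item as Record<string, unknown>)["@type"];
          if (typeof type === "string" && types.indexOf(type) === -1) {
            types.push(type);
          }
        }
      }
    } catch (error) {
      // Invalid JSON-LD is skipped here; a real audit should report it.
    }
  }
  return types;
}

// Types present in the old page but absent from the rewritten one.
function missingTypes(oldHtml: string, newHtml: string): string[] {
  const after = jsonLdTypes(newHtml);
  return jsonLdTypes(oldHtml).filter((type) => after.indexOf(type) === -1);
}
```

Run against a handful of representative pages before and after the rewrite, a non-empty result from `missingTypes` is a cheap signal that something was dropped.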

Summary

AI, and Claude Code specifically, has become a genuine productivity multiplier for certain tasks — code review, simple GitHub issues, refactoring, and test writing stand out as clear wins. The common thread in what works: the intent is already clear and encoded either in my head or in existing code. AI executes well when it can follow a defined shape.

The failure modes are equally consistent: ambiguous goals produce ambiguous code, exhaustive tasks get done on a best-effort basis, and rewrites that look clean can silently drop important behavior. The toolkit for auditing AI-generated code is smaller than for human-written code, which is an underappreciated risk.

The meta-lesson I keep coming back to is that AI amplifies clarity. If I know what I want, AI gets me there faster. If I don't, it gets me somewhere plausible-looking — which can be worse than nothing.

Handwritten, Summary + Grammar / Style correction by Claude