Originally, I was going to write about using OpenAI Codex versus Google Gemini CLI to help me code my personal photos website. I could, and may still, do that. Here, however, we’ll tell a story about building a Hugo theme with Codex: where things went right, and where they did not. We might also take digressions into commentary on the reality of software development and Fred Brooks’s 1986 essay ‘No Silver Bullet…’.
First, context
I have a personal photo website currently implemented as static files hosted by Cloudflare. The static files for the site are in turn generated by Hugo, a static site generator. Until recently, I used someone else’s theme templates, which carried the limitation of not supporting avif-compressed images and not correctly supporting webp-compressed images. It also did not support use of the HTML <picture> element to offer clients multiple image sizes based on display size. In short, the theme did not apply modern tools to optimize delivery of assets to clients. This annoyed me.
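For readers unfamiliar with it, the <picture> element lets the browser choose both the best format it supports and an appropriately sized file. The kind of markup I wanted the theme to emit looks roughly like this (a hand-written illustration with made-up filenames, not output from the theme):

```html
<!-- Offer avif and webp, fall back to jpg; let the browser pick a size via srcset/sizes. -->
<picture>
  <source type="image/avif"
          srcset="sunset-640.avif 640w, sunset-1280.avif 1280w"
          sizes="(min-width: 768px) 50vw, 100vw">
  <source type="image/webp"
          srcset="sunset-640.webp 640w, sunset-1280.webp 1280w"
          sizes="(min-width: 768px) 50vw, 100vw">
  <img src="sunset-640.jpg"
       srcset="sunset-640.jpg 640w, sunset-1280.jpg 1280w"
       sizes="(min-width: 768px) 50vw, 100vw"
       alt="Sunset over the bay" loading="lazy">
</picture>
```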
My goal with Codex and Gemini CLI was threefold:
- Use Codex to help author templates for Hugo
- Use Gemini CLI to help author updates to an existing photo lightbox library
- Write something about the above
The intended execution loop
At a high level, the overarching flow went like this:
*…preceding context-setup prompts, like instructing it to create a template for use by a CLI tool…*
We're building a suite of tools, libraries, and ultimately an image gallery theme for the hugo static site generator. I expect that when the work is done we will have two primary outputs,
1. A javascript library that displays multimedia, primarily images, from a gallery of images in a lightbox.
2. A customizable theme that can be used with the hugo static site generator.
Given the above expected outputs, I expect that we will primarily use javascript (or typescript), css (or sass), and golang's templating language. For content that will be served to browsers, we can assume modern browsers and use things like ecmascript modules.
The initial user of the library will be the theme and the initial user of the theme will be myself, but we can reasonably expect that other developers or bloggers might use the library and theme.
Further context on the background for this project,
1. There are many other javascript lightbox libraries. I want some specific features though. I specifically want something that supports use of the HTML <picture> element to allow the user agent to choose the best available image format. For example, I want to be able to offer avif, webp, and jpg images for maximum efficiency.
2. Most existing hugo themes assume use of hugo's (and go's) builtin image processing. Unfortunately, go's libraries do not support avif and also have a bug in webp processing. Therefore, I want to either preprocess my images and provide the theme with a folder of prerendered thumbnails and fullsize images for use. Another option is to have the theme use an external application like graphicsmagick, but that is not preferred.
*…conversation continues onward to writing the specification…*
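As an aside, the “folder of prerendered thumbnails” idea from the prompt maps fairly naturally onto Hugo page bundles. A minimal sketch of the lookup side, assuming pre-generated avif and webp siblings sit next to each jpg in the bundle (my own illustration, not the project’s actual template):

```go-html-template
{{/* Illustrative only: find prerendered siblings of each jpg in the page bundle. */}}
{{ range .Resources.Match "*.jpg" }}
  {{ $stem := path.BaseName .Name }}
  {{ $avif := $.Resources.GetMatch (printf "%s.avif" $stem) }}
  {{ $webp := $.Resources.GetMatch (printf "%s.webp" $stem) }}
  {{/* $avif / $webp may be nil if a variant was not pre-generated. */}}
{{ end }}
```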
This was followed by the feature workflow:
Please read the instructions file and implement the feature listed under item #1. A static grid of images populated from a page's resource bundle.
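For orientation, the feature being requested amounts to a template that walks a page’s resource bundle and emits a grid of images. A minimal sketch under those assumptions (placeholder layout name and class names; not the code Codex actually produced):

```go-html-template
{{/* layouts/_default/gallery.html (hypothetical name) */}}
{{ define "main" }}
  <ul class="photo-grid">
    {{/* Every image in the page's bundle becomes one grid cell. */}}
    {{ range .Resources.ByType "image" }}
      <li class="photo-grid__cell">
        <img src="{{ .RelPermalink }}" alt="{{ .Title }}" loading="lazy">
      </li>
    {{ end }}
  </ul>
{{ end }}
```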
The plan was wrong from the start, dummy!
Various models from OpenAI have been known for confirming biases, to put it mildly. As the last link shows, GPT-5 and GPT-5 Codex purportedly have improvements to make them less sycophantic.
Despite knowing this tendency full well, I blithely started a chat with ChatGPT that made a critical assumption (and even acknowledged it, no less): the approach would use Hugo despite its lack of support for modern formats (webp: due to golang library issues; avif: due to multiple issues).
This request might have given me pause in former jobs where I was tasked with developing software.
A good developer might even turn around to their manager or product manager and ask, “Are you sure you want to do this? We might spend a bunch of time and only then build a subpar experience because of underlying tool issues. What about considering X?”
It’s 2025; “X” is probably Astro. Regardless, ChatGPT is (currently) not going to do this. In a way that echoes hard-working junior developers, it will try its hardest to make the request work as stated.
A brief digression on an old essay
Brooks’s 1986 essay, ‘No Silver Bullet…’, distinguishes between the essential and accidental difficulties of software development. Brooks, who incidentally passed away in late 2022 at the dawn of the current AI hype cycle, was probably not considering conversational agents that generate code. Amusingly, Brooks does address AI, though, in the form of expert systems (the hot AI approach of the 1980s) and tools to create software directly from a specification.
Some of what is old is new again. To the credit of the modern day, generative AI meets some of Brooks’s requirements for AI to accelerate software development in ways that no prior expert system ever could. As we see below (and possibly in later posts about Gemini CLI), though, it also fails at what Brooks really hoped it could do: distill expert knowledge and make it easily usable by junior developers. At least in its current form, these tools instead seem most powerful with experienced hands guiding the overall design and implementation.
Still, the question remains whether generative AI addresses the accidental or the essential complexity of software development. In prior engineering management positions I have often advised rising senior engineers that writing code is rarely the hard part; the hard part is figuring out what to build in the first place. Today’s generative tools save typing, sometimes lots of typing, but they still cannot make critical design decisions.
Like Brooks, I come down on the side of today’s AI solving accidental complexity in the work. That said, I am also happy to admit that this statement has a time and place, 2025-09-22. It may be totally wrong in 12 months, or it might not.
10 steps forward, 9 steps back.
The feature workflow
It worked.
In a way.
It worked
Here’s the very first commit of the README.md file that was written by Codex. Along with that same commit is a working theme that pulls together image files and puts them in a grid. Even better, later that day I made this commit that added support for thumbnails based on a description of the feature in the README.md. There’s already trouble brewing, though: note the commit message, “Vibe-ish…”.
Six commits in, and, as others on the internet have written, I found myself course-correcting. Codex’s GPT-5 model would get stuck on golang template usage, produce unnecessary `with` context blocks, and, more subtly, need frequent plan corrections on where to place functionality.
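To make “unnecessary `with` context blocks” concrete, here is a contrived example of the pattern, not a verbatim excerpt from the generated code:

```go-html-template
{{/* Generated-style: nested with blocks that only shuffle the context around. */}}
{{ with .Params }}
  {{ with .title }}
    <h2>{{ . }}</h2>
  {{ end }}
{{ end }}

{{/* The same thing, with a single guard. */}}
{{ with .Params.title }}<h2>{{ . }}</h2>{{ end }}
```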
On the one hand, this validated the value of the “plan → review → update plan → implement plan” loop. On the other hand, I am genuinely perplexed by OpenAI’s proclamations of letting Codex run for “hours” without intervention.
In a way
Still, I persisted; I wanted to experience the power of a coding agent doing all of the typing for me. I wrote a description for a larger feature and let it go. What’s not visible in the commit log are tweaks to AGENTS.md and a few aborted attempts to ask Codex to refactor with a focus on reusability. I will grant two things to Codex, the tool and the model:
- Go templates are sparsely used compared to React-style JSX or other web-application-focused templating tools. Sparser use means less data underlying the model, and less data behind the model means worse results.
- The project does not have an end-to-end test harness. Instead, the agent is instructed to validate that the templates are syntactically correct and generate output via Hugo’s `build` command. This is a trade-off because it means the agent cannot execute, for example, a Playwright test to validate changes.
On the other hand, the main issues with the generated code were even more basic than the above caveats would suggest. Issues ranged from multiple regular expressions where one would do, to large blocks of duplicate code, to no reuse via template partials as instructed. Perhaps I am being unfair, but these seem like reasonable expectations given Codex’s advertising and a reasonable AGENTS.md file.
Later, I rewrote this feature to refactor the generated implementation into multiple partials. My workflow now included a review-and-refactor step.
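The refactor itself is mundane Hugo mechanics: repeated markup moves into a file under layouts/partials/ and the calling template invokes it with partial. Roughly like this (the partial name and markup are illustrative, not the theme’s actual files):

```go-html-template
{{/* layouts/partials/photo-figure.html: render one image, nothing else. */}}
<figure class="photo">
  <img src="{{ .RelPermalink }}" alt="{{ .Title }}" loading="lazy">
</figure>

{{/* In the calling template: */}}
{{ range .Resources.ByType "image" }}
  {{ partial "photo-figure.html" . }}
{{ end }}
```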
The improved feature workflow
And Next?
After this, I adapted my approach to generally assume that I would need to rewrite the generated implementation. Codex was still useful for first implementations (another example), which then got refactored into modules.
My photos site uses this theme today. I expect that over the coming weeks I will continue to make some further tweaks to it using Codex and Gemini CLI.
Dear reader, at this point you likely expect a conclusion that judges AI coding tools as worthless and half-baked. Well, they are half-baked insofar as the marketing claims go: the marketing overstates both their breadth of capabilities and their independence from human guidance. I do not think they are worthless. For this project, Codex’s main value was both subtle and powerful: it made getting started easy. The time cost of boilerplate is real, and it is often a reason why I decide not to bother starting a personal project. A version of this same cost analysis applies to commercially developed software, with equal benefit.
I turned to Gemini CLI to work on updates to an existing lightbox library after getting this theme working. We’ll discuss that next, and why things have gone slightly better there, before returning to the main issue of the overall plan being wrong from the start.