If Part 1’s theme was attempting to ‘vibe’ through golang text templates with codex, then Part 2’s theme is attempting to ‘vibe’ through updates to a legacy javascript library with Gemini CLI. As planned, this did not go as planned. I recommend at least skimming Part 1’s narrative, if not reading it in full, as the following builds on it incrementally.

Context, again

First, a brief refresher on the origin of this work. I was dissatisfied with an existing hugo theme for my image gallery. I also wanted to present images using modern formats like avif, which are unsupported by hugo’s builtin image processing. This had impacts on how the templates for my new theme were structured. It also meant client-side, in-browser processing of image tags (e.g. exif or XMP data) for presentation.

I want to, again, acknowledge that tag handling for presentation would be better preprocessed, either by the tools generating static files (hello, again, Astro) or by a server-side framework. There is a non-zero probability of a ‘Part 3’ post in this series that replaces all of this work with new tools that properly support avif and webp.

In the meantime, though, my goal for the work this post covers was to achieve the following.

A screenshot highlighting use of title and copyright text from image tags

Tag-based copyright and title

The text inside the red box is extracted at runtime from tags present on the image file.
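
To make the goal concrete, here is a rough sketch of the kind of in-browser tag extraction this implies. It uses the exifreader package as one example of a reader; the tag keys and element selector are illustrative, not the theme’s actual code.

```js
// A sketch only: exifreader is one possible reader, and the tag keys
// ('title', 'Copyright') depend on how the image was actually tagged.
import ExifReader from 'exifreader';

async function readCaption(imageUrl) {
  const response = await fetch(imageUrl);
  const buffer = await response.arrayBuffer();
  const tags = ExifReader.load(buffer); // synchronous when given an ArrayBuffer

  return {
    title: tags.title?.description ?? '',
    copyright: tags.Copyright?.description ?? '',
  };
}

// Usage: populate a (hypothetical) caption element once the tags are read.
readCaption('/images/example.avif').then(({ title, copyright }) => {
  document.querySelector('.caption').textContent = `${title} ${copyright}`.trim();
});
```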

That’s it. Now, with the context out of the way, let’s address more context: our starting point.

Lightbox2

I picked Lightbox2 because:

  • I didn’t want to write my own lightbox library
  • Lightbox2 was near the top of google search results for me (no, really)
    A screenshot of google search results with Lightbox2 near the top.

    Near the top

  • And, most importantly, Lightbox2 worked in a way I wanted. It does nothing but work as a lightbox and it uses standard data- HTML element attributes to achieve that functionality.
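
To make that last point concrete: plain markup opts images into the lightbox and the script discovers them at runtime. The sketch below is not Lightbox2’s source, just an illustration of the data- attribute approach; data-lightbox (grouping) and data-title (captions) are the attributes Lightbox2 documents.

```js
// Illustration of data- attribute driven behavior, not Lightbox2's actual implementation.
// Markup opts in with something like:
//   <a href="big.avif" data-lightbox="gallery" data-title="A caption">...</a>
document.addEventListener('click', (event) => {
  const link = event.target.closest('a[data-lightbox]');
  if (!link) return;

  event.preventDefault();
  openOverlay({
    src: link.getAttribute('href'),
    album: link.dataset.lightbox,    // groups images into an album
    title: link.dataset.title ?? '', // optional caption text
  });
});

// openOverlay is a hypothetical stand-in for whatever renders the overlay.
function openOverlay({ src, album, title }) {
  console.log('show', src, 'from', album, 'captioned', title);
}
```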

Unfortunately, Lightbox2 has some downsides too:

  • No tests
  • Early 2010s-era toolset
  • And written in 2000s style JS

Fine. AI can help with all of that, right? Furthermore, as a challenge, let’s try it with Gemini’s free tier!

Baby steps

Like last time, I used the tool, Gemini CLI integrated with VS Code, to bootstrap itself: it generated a reasonable agents file, GEMINI.md, that I later extended and used with both Gemini CLI and codex. Next, I decomposed the work into three phases to simplify prompting:

  1. Dev tool updates
  2. New tests as needed
  3. Functionality changes

Perhaps I was being overly cautious after my experiences in Part 1, but from there I further broke ‘Dev tool updates’ into bite-size tasks, the kind I might hand to an intern in their first week. Some examples:

  • “We’re going to consolidate build tools to npm and away from separate commands for grunt and bower. To start, please add npm run scripts that execute bower, grunt, etc as needed to install, build, and deploy”
  • “We’re going to continue consolidating, please propose a plan for replacing bower as a dependency manager with just npm”…“Please implement this plan”
  • “…continuing, please propose a plan for replacing grunt with purpose built tools and npm run commands. I’d specifically like to use rollup for javascript library packaging. Please maintain the same outputs as currently generated by grunt.”
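
For reference, the grunt replacement in that last prompt boils down to a rollup config along these lines. The entry point, output paths, and UMD global name are illustrative rather than copied from the repository.

```js
// rollup.config.js — a sketch of the kind of config that replaces the grunt build,
// producing both a readable bundle and a minified one. Paths and names are illustrative.
import terser from '@rollup/plugin-terser';

export default {
  input: 'src/js/lightbox.js',
  output: [
    { file: 'dist/js/lightbox.js', format: 'umd', name: 'lightbox' },
    { file: 'dist/js/lightbox.min.js', format: 'umd', name: 'lightbox', plugins: [terser()] },
  ],
};
```

Each output is then wired to an npm run script (e.g. a "build": "rollup -c" entry in package.json) so the consolidated commands from the first prompt remain the entry point.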

This initially went very well with Gemini CLI: there were few unexpected outputs, and what it produced was well aligned with the prompts. That stopped working when we got to converting from jscs to eslint.

Clean up

My second build tool update pull request has many more changes than the first, including removing grunt, removing bower, and updating the linting tools. I did this incrementally. For linting, I broke the prompting to Gemini down further into (1) generating a ruleset with the new tools and (2) running the lint commands and fixing the resulting errors.

Gemini failed to complete (2), in the literal sense that the free tier hit a rate limit for tool calls before it could fix the underlying lint errors. The tool generated a reasonable approach: run npm run lint, run npx eslint --fix, and then make manual changes to resolve the remaining issues. Unfortunately, the model was unable to correlate the lint errors to actionable changes in the files being fixed, instead looping on whitespace changes and re-running npm run lint. Mirroring my experience in Part 1, this felt like watching an intern struggle with an obscure linting rule. I stepped in to fix the issue and then continued.

Aside: I am aware that this could also have been solved by simplifying the Gemini-generated eslint configuration by hand. Leaving it and watching the tool (fail to) work through the linting issues was an intentional choice. The outcome is the same for the sake of evaluating Gemini CLI in my personal workflow.
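
For context on what ‘simplifying by hand’ would have looked like: with modern eslint the whole thing can be a small flat config. The file below is a sketch with illustrative globs, globals, and rules, not the configuration from the repository.

```js
// eslint.config.js — a pared-down flat config sketch; globs, globals, and rules are illustrative.
import js from '@eslint/js';

export default [
  js.configs.recommended,
  {
    files: ['src/**/*.js'],
    languageOptions: {
      ecmaVersion: 2022,
      sourceType: 'module',
      globals: { window: 'readonly', document: 'readonly' },
    },
    rules: {
      // Keep the rule set small enough that failures map to obvious, actionable edits.
      'no-unused-vars': 'warn',
    },
  },
];
```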

Tests

Adding tests.

Could I have just made the changes to the library and moved on with my life? Of course, but then I would have nothing to write about.

Using the workflow from Part 1, I got started by adding baseline tests for the functions I would be modifying.

A diagram showing a write --> propose --> review --> implement --> refactor loop for codex implementations

The improved feature workflow

Using Gemini CLI to add the dependencies, set up the test run commands, add a github workflow, and add skeletal dummy tests went swimmingly. Prompting Gemini to write initial tests for the exported lightbox module went ‘OK’, but yielded strange behavior. The generated output often mocked more underlying functions than needed or mocked things that were never called.
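
For contrast, the baseline shape I was after is closer to the following. It assumes Jest, and buildCaption is an invented stand-in for an exported helper; nothing is mocked because nothing needs to be.

```js
// A sketch of a minimal baseline test, assuming Jest. 'buildCaption' is a
// hypothetical exported helper, not a real Lightbox2 API.
import { buildCaption } from '../src/js/lightbox.js';

describe('buildCaption', () => {
  test('combines title and copyright when both are present', () => {
    expect(buildCaption({ title: 'Dog', copyright: '© Me' })).toBe('Dog © Me');
  });

  test('falls back to an empty string when tags are missing', () => {
    expect(buildCaption({})).toBe('');
  });
});
```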

Gemini ultimately failed to complete this task, though. Prompts for more tests would return a plan but no implementation changes. I attribute this to hitting tool call limits and to using Gemini Flash instead of Pro.

Interstitial Conclusions

Gemini’s free tier is great! I got a bunch of ‘free to me’ tool usage that wrote boilerplate dev toolchain updates. The free tier is not enough for day-to-day development, but it worked great for lightweight use and, again, helped get the work started. Similar to last time, it is unclear to me how useful this would be for someone without a background in software development, but it’s good enough as an assistant tool with known limitations.

I used codex for the rest of the work discussed here.

Adding functionality

In a reversal from Part 1, codex observed issues with my implementation choices and saved me time.

A workflow diagram showing codex used as a reviewer and test writer.

AI working as a test writer

I wanted to play with different ways of integrating a new dependency into the library and updated my workflow with codex accordingly. In this approach, codex acts as a reviewer and ‘QA’ unit test writer.

On my first loop, codex identified two issues:

  1. My code had a typo, where I had attempted to use a method called loan instead of load
  2. load was an asynchronous method used in a synchronous context.

This was great! I had legitimately made a typo and hadn’t noticed it because I was transitioning immediately from writing a couple lines of new code to asking the agent to write tests for me. No manual functional testing here!

Unfortunately, codex then took both issues as gospel and ran with them. It stubbed out asynchronous mocks for the invented method and tried to write tests that invoked synchronous methods with inner asynchronous code. Continuing the theme of a junior developer trying hard to make it work, codex added code that put test cases to sleep using a fixed timer in order to let the async mocks resolve. This was another great example of codex being a piece of software and not a human reviewer.
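
To illustrate the difference (with invented names, assuming Jest and a jsdom environment), this is roughly the fixed-timer pattern codex produced versus simply awaiting the promise the code under test already exposes.

```js
// Every name here is invented for illustration.
// A stand-in for the real async dependency (a tag reader with an async load).
const tagReader = { readTags: async () => ({ title: 'Dog' }) };

// A stand-in for the code under test: read tags, then write a caption into the DOM.
async function showImageWithCaption(src) {
  const tags = await tagReader.readTags(src);
  document.body.innerHTML = `<div class="caption">${tags.title}</div>`;
}

// Roughly what codex wrote: fire and forget, then sleep on a fixed timer.
test('caption appears (flaky version)', async () => {
  showImageWithCaption('photo.avif');
  await new Promise((resolve) => setTimeout(resolve, 500)); // arbitrary wait
  expect(document.querySelector('.caption').textContent).toContain('Dog');
});

// What I wanted: await the promise the function already returns.
test('caption appears (deterministic version)', async () => {
  await showImageWithCaption('photo.avif');
  expect(document.querySelector('.caption').textContent).toContain('Dog');
});
```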

The tests ‘passed’ when it was done, but I ended up undoing almost all of the work after fixing the two identified issues.

After the initial issue, though, the work went well. I made small, incremental changes of fewer than ten lines, ran the existing tests, had codex add more tests, and repeated until I had a working feature.

Miscellaneous complaints

When given the option, Gemini and codex tended to generate output that would work but was slightly out of date. This occurred in Part 1, and it occurred here too in a couple of critical places:

  1. Adding prettier as a stylelint plugin. In particular, Gemini was hung up on the changes in stylelint v15+ that simplified configuration. The agent looped on trying to resolve conflicting prettier and stylelint package versions in npm instead of using the newer configuration style (sketched after this list).
  2. Generating a github actions publish workflow. When prompted to create a release generation workflow, codex generated one that worked but exclusively used outdated and deprecated actions. In at least one case, the action had been deprecated, with links to supported replacements, for at least four years.
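
For the stylelint case, the up-to-date configuration is genuinely small. The sketch below shows one plausible shape, with prettier run through the stylelint-prettier plugin; the extends list and rules are illustrative, not the exact file I ended up writing.

```js
// .stylelintrc.js — a sketch of a stylelint v15+ setup. The old stylelint-config-prettier
// shim is unnecessary now that stylelint's stylistic rules are deprecated.
// Illustrative only, not the repository's actual configuration.
module.exports = {
  extends: ['stylelint-config-standard'],
  plugins: ['stylelint-prettier'],
  rules: {
    'prettier/prettier': true,
  },
};
```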

I am unsure what to make of this. As an LLM novice, at best, I can only speculate that it comes from a combination of underlying training biases and my failure to prompt in a way that would trigger the tool to perform a web search for more current techniques. In the cases above, I resolved the issues by handwriting up-to-date configurations.

Further conclusions, and next?

Overall, codex and Gemini were more helpful working on legacy javascript than they were authoring golang text templates from scratch. Both tools worked well with tight workflows that relied on a human, me, giving incremental direction and guidance in the form of prompts and handwritten code. Occasionally, weird things happened that surprised me. If I did not have a software development background, there were a couple of spots where I might have gotten stuck for a while, but there were no insurmountable problems.

Yet again, I marvel at the claims from OpenAI and, most recently, Anthropic that their agents work independently for tens of hours at a time. I wonder how much time was spent crafting the initial specifications for the agents to follow, and how many failed attempts were made before the successful run used for marketing materials occurred.

In a single sentence: these tools are useful assistants that have saved me time, but they are not magic replacements for senior or staff level developers. If there is a Part 3 to this series, it will likely focus on (re)developing functionality from a more appropriate baseline toolset.