Over the course of two experiments across two half-day coding sessions, I got a better glimpse of what working with coding agents might look like over the next couple of years.

1. I teach the agent to build a simple Rails CMS

This is pretty unremarkable, given my experience. But, like so many developers lately, I wanted more reps exploring agent coding workflows. In particular: what works for guiding an agent and keeping it on track?

I did a bit of the upfront setup with my wits alone. Getting the boilerplate exactly right is where LLM assistants seem to most reliably get themselves into early trouble. Plus, Rails’ whole thing is accelerating this part of a project. If it were somehow slow going, that would be a tremendous failing of the framework. But it wasn’t!

Next, I set up the basic models I wanted for this particular little CMS. Again, building new models just right is a spot where an LLM often wanders off the path, so I did that part myself. Perhaps this is also where I’m most opinionated about getting things right. 🤷🏻‍♂️

Finally, I made sure the app had good guardrails in place. Namely, Justfile tasks to run tests, rubocop checks, and brakeman. Claude Code seems to do really well given that kind of guidance, so it makes sense to set them up at the start.
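For reference, here’s a minimal sketch of what those guardrail tasks might look like in a Justfile. The recipe names and the `bin/rails test` runner are my assumptions, not necessarily what this project used:

```just
# Guardrail recipes the agent can run after each change (names are illustrative).
test:
    bin/rails test

lint:
    bundle exec rubocop

security:
    bundle exec brakeman -q

# One recipe to run everything before handing work back.
check: test lint security
```

A single catch-all recipe like `just check` gives the agent one obvious command for verifying its own work.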

From there, I let the machine take the wheel. I gave it brief instructions, sometimes using plan mode to build up context first, and let it go. Document and element model CRUD, done. Tweaking a DaisyUI theme to look a little more Swiss design or Bauhaus-y, done. Building up a document editor that works with the document and element models at the same time, done.

Result: taking a little time up front to lay the right foundation and guardrails makes agent coding way better.

2. The agent teaches me about LLM evals

This one is (unintentionally) the opposite of what I was doing with Claude and Rails. In that case, I was the expert, guiding Claude towards an outcome I knew was easily within reach for this project.

For this project, I was experimenting with using Claude to teach me how to do LLM evals. My understanding is that these are super important for building AI-based apps; they’re as close as you can get to a TDD-style feedback loop for verifying your work with nondeterministic language models.

To start, I used Claude (not Claude Code) and asked for a starting point for learning LLM evals. I wanted something simple, like a hello world for that problem domain. I’d rather not set up multiple ML frameworks (looking at you, Conda), so I said I wanted to run the models locally, and that I was fine with using Python.
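To give a sense of what a hello world in this space looks like, here’s a rough sketch of a tiny eval that runs a small model locally and scores a couple of test cases. The model name, prompts, and naive substring check are illustrative; this isn’t the code Claude generated:

```python
# Minimal "hello world" LLM eval: run a small local model over a few test cases
# and score each answer with a simple substring check.
# Assumes `transformers` and `torch` are installed (e.g., `uv add torch transformers`).
from transformers import pipeline

# Any small instruction-tuned model works here; this one is just an example.
generate = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

# Each case pairs a prompt with a string the answer should contain.
cases = [
    {"prompt": "What is the capital of France? Answer in one word.", "expect": "Paris"},
    {"prompt": "What is 2 + 2? Answer with just the number.", "expect": "4"},
]

passed = 0
for case in cases:
    messages = [{"role": "user", "content": case["prompt"]}]
    result = generate(messages, max_new_tokens=32)[0]["generated_text"]
    # Chat-style pipelines return the whole conversation; the last message is the reply.
    answer = result[-1]["content"] if isinstance(result, list) else result
    ok = case["expect"].lower() in answer.lower()
    passed += ok
    print(f"{'PASS' if ok else 'FAIL'}: {case['prompt']!r} -> {answer.strip()!r}")

print(f"{passed}/{len(cases)} passed")
```

The point isn’t the scoring method (substring matching is about as naive as it gets); it’s that you get a repeatable pass/fail loop you can rerun as prompts and models change.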

Claude generated some Python code, which I pasted into a new uv project and got started. From this point forward, I was using Claude Code to restructure the code, like moving code out of Jupyter notebooks into source files or putting code plus explanations back into Jupyter notebooks. I’d also ask why a particular approach was taken or what a piece of jargon meant.

In this case, I was really using Claude Code to teach myself something new, and it was the expert guiding me. For something like this, where I’m not up to speed on the specifics of machine learning and its problem domain, it’s a perfect approach.

Result: I got way ahead on a new-to-me topic over the course of a couple of hours.

3. 💥 At the same time

The big reveal: I was working on these two projects at the same time. Two terminal tabs, two Claude Code instances. “Talk” to the first one, let it run while I talk to the second one. First one finishes, talk to it, rinse and repeat.

At first, this was like pairing with two developers working on entirely different projects, sitting on either side of me and sometimes trying to talk at the same time. My brain hurt, and it felt like I was moving slowly enough that single-tasking would have been far more effective.

After 60 minutes or so, things seemed to pick up. I felt like I was doing a better job of bouncing back and forth between the two agents. I was quickly improving my prompts, allowing each agent to work more effectively while I was busy with the other. I wouldn’t say I mastered multitasking in one hour, though. It was more that I came to grips with keeping each agent going, like keeping a room full of eager developers moving independently despite each needing frequent direction.

It’s still crucial to taste test the work frequently. Product-focused developers and managers will have a leg up here because they’re used to working this way. I think anyone with good attention to detail and the ability to communicate what they want will have a similar advantage.

From all this, I learned a few things:

  • Working with agents is going to look similar to multitasking and delegation.
  • To keep up with multitasking, humans will need to get as good at context management as agents (we both have limited context windows).
  • Often, we’ll mentor agents to get the best results, particularly by giving them good guardrails (automated and dynamic context generation).
  • In some cases, the agents can teach us just enough to get stuff done.
  • For better or worse, there’s still much to explore, and the frontier is moving quickly.