Recently, I spent a lot of time building ChartPop, an agent that automates the work of “coming up with actionable insights for key stakeholders” and turns the results into presentation-ready charts. It’s not perfect, but I’ve learned a lot along the way about building agents with structured data in the loop. Some of these lessons are well known; others maybe not so much. Here are my top tips for anyone starting out on the journey of building a data-driven agent:
1. LLMs can ace math contests, but still mess up basic addition.
A model that can win Math Olympiads might still, without a code interpreter tool, fail at adding numbers under 100. If you don’t want it to hallucinate numbers, calculate every number with code; reasoning alone isn’t reliable.
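A minimal sketch of this pattern, assuming the model hands back a plain arithmetic expression rather than a final number (the `safe_eval` helper here is my own illustration, not part of any library):

```python
# Never trust the model's arithmetic: ask it for the *expression*,
# then evaluate that expression with real code.
import ast
import operator

# Whitelisted binary operators for safe arithmetic evaluation.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str) -> float:
    """Evaluate a pure-arithmetic expression produced by the model."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"Disallowed expression node: {ast.dump(node)}")
    return _eval(ast.parse(expr, mode="eval"))

# Instead of asking "what is 17.4% of 8,243?", ask the model for the
# expression and compute the number yourself:
print(safe_eval("8243 * 0.174"))  # the number comes from code, not tokens
```

The same idea scales up: a full code-interpreter tool is just a more general version of this, where the model writes the whole computation and your runtime produces the numbers.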
2. Let your agent write code — but watch what it does.
Giving the model the ability to generate Python is powerful. It’s more flexible than, say, SQL, but it comes with hazards. The model will happily make its code “work” by, for example, converting a pandas DataFrame column with errors="coerce" instead of first fixing inconsistent date formats. Code that runs successfully isn’t a good enough success metric. SQL has fewer gotchas because it’s less flexible and your data is already typed, but handling large, poorly documented schemas using, e.g., RAG introduces its own headaches. You’ll only notice these horrors if you log your analysis traces.
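Here is the coerce trap in miniature, with made-up data: the conversion “succeeds”, but most of the rows silently vanish.

```python
# Mixed date formats don't raise with errors="coerce" -- they silently
# become NaT, and downstream aggregates quietly drop those rows.
import pandas as pd

raw = pd.Series(["2024-01-05", "05/01/2024", "Jan 5 2024", "n/a"])

coerced = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(coerced.isna().sum())  # 3 of the 4 rows are now NaT

# The agent's code "ran successfully", but most of the data is gone.
# Safer: fail loudly, then fix the inconsistent formats before converting.
try:
    pd.to_datetime(raw, format="%Y-%m-%d", errors="raise")
except ValueError as err:
    print(f"Inconsistent formats need fixing first: {err}")
```

Nothing in the coerced path raises or warns, which is exactly why you only catch it by reading the generated code in your traces.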
3. Combing through your agent logs is table stakes.
Which brings me to logging your agent calls. I spent a lot of time reviewing agent traces to uncover blind spots and subtle bugs; looking only at the end result isn’t enough. You can log every request and response to whatever database you’re already using, or you can use LangSmith or Langfuse (open-source) to capture traces. It also helped enormously that I was a domain expert in the process I was trying to model with an agent.
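If you’d rather not adopt a tracing platform yet, the do-it-yourself version can be very small. This is a sketch under my own assumptions: `log_call`, the field names, and the JSONL file are all hypothetical, stand-ins for whatever store you actually use.

```python
# Minimal trace logging: append every request/response pair as one JSON
# line, so traces can be grepped, diffed, and replayed later.
import json
import time
import uuid

TRACE_FILE = "agent_traces.jsonl"  # swap for your database of choice

def log_call(model: str, messages: list, response: str) -> None:
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "messages": messages,
        "response": response,
    }
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

# Wrap every model call, not just the one that produces the final answer:
log_call(
    "some-model",
    [{"role": "user", "content": "Plot revenue by region"}],
    "<generated pandas code>",
)
```

The important design choice is logging intermediate calls, not just final outputs: the coerce-style bugs above only show up in the middle of a trace.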
4. Don’t obsess over benchmarked solutions.
For working with data in data warehouses, you can look at the text-to-SQL benchmarks Spider 2.0 and BIRD for ideas, but the published solutions tend to overfit to the benchmark environment and even the hidden test set. The benchmarks I’m aware of don’t come with query logs, very large schemas, or a dbt manifest, so those solutions have nothing to say about using such resources to handle more realistic setups.
5. Over-prescribed instructions will bite you.
Sometimes logs reveal issues you thought you’d already covered, only for you to find that your prescriptive instructions introduced subtle contradictions that tripped up the model. Even without contradictions, editing these instructions sometimes works and sometimes turns into a game of whack-a-mole. It’s almost always better to re-architect your agent, or to define instructions around end goals, than to tightly prescribe every step.
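To make the contrast concrete, here are two invented system prompts for the same task. Both are illustrations of the failure mode, not prompts I’m recommending verbatim; note how the prescriptive version manages to contradict itself in only four lines.

```python
# Two hypothetical system prompts for the same data-analysis task.

# Step-by-step rules accumulate until they collide: rule 2 tells the
# model to mutate the data that rule 4 forbids it from touching.
PRESCRIPTIVE = """\
1. Always load the data with pandas.
2. Always drop rows with null values before analysis.
3. Always report totals for every numeric column.
4. Never modify the raw data.
"""

# A goal-oriented prompt states the end state and the hard constraints,
# and leaves the intermediate steps to the model.
GOAL_ORIENTED = """\
Produce a chart a stakeholder can act on. Every number must come from
executed code, and any rows you exclude must be listed with a reason.
"""
```

The goal-oriented version is also easier to maintain: fixing a bad behavior means tightening one constraint, not rebalancing a list of steps against each other.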