Recently, I spent a lot of time building ChartPop, an agent that automates the work of “coming up with actionable insights for key stakeholders” and turns the results into presentation-ready charts. It’s not perfect, but I’ve learned a lot along the way about building agents with structured data in the loop. Some of these lessons are well known; others maybe not so much. Here are my top tips for anyone starting out on the journey of building a data-driven agent:
1. LLMs can ace math contests, but still mess up basic addition.
A model that can win Math Olympiads might still, without a code interpreter tool, fail at adding numbers under 100. If you don’t want it to hallucinate numbers, calculate every number with code; reasoning alone isn’t reliable.
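A minimal sketch of this pattern, assuming the model hands back a plain arithmetic expression rather than a final number (the `safe_eval` helper here is my own illustration, not part of any library):

```python
# Never trust the model's arithmetic: ask it for the *expression*,
# then evaluate that expression with real code.
import ast
import operator

# Whitelisted binary operators for safe arithmetic evaluation.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr: str) -> float:
    """Evaluate a pure-arithmetic expression produced by the model."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError(f"Disallowed expression node: {ast.dump(node)}")
    return _eval(ast.parse(expr, mode="eval"))

# Instead of asking "what is 17.4% of 8,243?", ask the model for the
# expression and compute the number yourself:
print(safe_eval("8243 * 0.174"))  # the number comes from code, not tokens
```

The same idea scales up: a full code-interpreter tool is just a more general version of this, where the model writes the whole computation and your runtime produces the numbers.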
2. Let your agent write code — but watch what it does.
Giving the model the ability to generate Python is powerful. It’s more flexible than, say, SQL, but it comes with hazards. The model will happily make its code “work” by, for example, converting a pandas DataFrame column with errors="coerce" instead of first fixing inconsistent date formats. Code that runs successfully isn’t a good enough success metric. SQL has fewer gotchas because it’s less flexible and your data is already typed, but handling large, poorly documented schemas using, e.g., RAG introduces its own headaches. You’ll only notice these horrors if you log your analysis traces.
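Here is the coerce trap in miniature, with made-up data: the conversion “succeeds”, but most of the rows silently vanish.

```python
# Mixed date formats don't raise with errors="coerce" -- they silently
# become NaT, and downstream aggregates quietly drop those rows.
import pandas as pd

raw = pd.Series(["2024-01-05", "05/01/2024", "Jan 5 2024", "n/a"])

coerced = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(coerced.isna().sum())  # 3 of the 4 rows are now NaT

# The agent's code "ran successfully", but most of the data is gone.
# Safer: fail loudly, then fix the inconsistent formats before converting.
try:
    pd.to_datetime(raw, format="%Y-%m-%d", errors="raise")
except ValueError as err:
    print(f"Inconsistent formats need fixing first: {err}")
```

Nothing in the coerced path raises or warns, which is exactly why you only catch it by reading the generated code in your traces.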
3. Combing through your agent logs is table stakes.
Which brings me to logging your agent calls. I spent a lot of time reviewing agent traces to uncover blind spots and subtle bugs; looking only at the end result isn’t enough. You can log every request and response to whatever database you’re already using, or you can use LangSmith or Langfuse (open-source) to capture traces. It also helped enormously that I was a domain expert in the process I was trying to model with an agent.
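If you’d rather not adopt a tracing platform yet, the do-it-yourself version can be very small. This is a sketch under my own assumptions: `log_call`, the field names, and the JSONL file are all hypothetical, stand-ins for whatever store you actually use.

```python
# Minimal trace logging: append every request/response pair as one JSON
# line, so traces can be grepped, diffed, and replayed later.
import json
import time
import uuid

TRACE_FILE = "agent_traces.jsonl"  # swap for your database of choice

def log_call(model: str, messages: list, response: str) -> None:
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": model,
        "messages": messages,
        "response": response,
    }
    with open(TRACE_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")

# Wrap every model call, not just the one that produces the final answer:
log_call(
    "some-model",
    [{"role": "user", "content": "Plot revenue by region"}],
    "<generated pandas code>",
)
```

The important design choice is logging intermediate calls, not just final outputs: the coerce-style bugs above only show up in the middle of a trace.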
4. Don’t obsess over benchmarked solutions.
For working with data in data warehouses, you can look at the text-to-SQL benchmarks Spider 2.0 and BIRD for ideas, but the published solutions tend to overfit to the benchmark environment and even the hidden test set. The benchmarks I’m aware of don’t come with query logs, very large schemas, or a dbt manifest, so those solutions have nothing to say about using such resources to handle more realistic setups.
5. Over-prescribed instructions will bite you.
Sometimes logs reveal issues you thought you’d already covered, only for you to find that your prescriptive instructions introduced subtle contradictions that tripped up the model. Even without contradictions, editing these instructions sometimes works and sometimes turns into a game of whack-a-mole. It’s almost always better to re-architect your agent, or to define instructions around end goals, than to tightly prescribe every step.
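To make the contrast concrete, here are two invented system prompts for the same task. Both are illustrations of the failure mode, not prompts I’m recommending verbatim; note how the prescriptive version manages to contradict itself in only four lines.

```python
# Two hypothetical system prompts for the same data-analysis task.

# Step-by-step rules accumulate until they collide: rule 2 tells the
# model to mutate the data that rule 4 forbids it from touching.
PRESCRIPTIVE = """\
1. Always load the data with pandas.
2. Always drop rows with null values before analysis.
3. Always report totals for every numeric column.
4. Never modify the raw data.
"""

# A goal-oriented prompt states the end state and the hard constraints,
# and leaves the intermediate steps to the model.
GOAL_ORIENTED = """\
Produce a chart a stakeholder can act on. Every number must come from
executed code, and any rows you exclude must be listed with a reason.
"""
```

The goal-oriented version is also easier to maintain: fixing a bad behavior means tightening one constraint, not rebalancing a list of steps against each other.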