I Built a ChatBI. Then I Realized It Wasn't Enough

Two years ago, my team built a ChatBI — a tool that uses LLMs to translate natural language into SQL, so people who can’t write code can query data on their own.

Two weeks after launch, daily active users surpassed a self-service query tool we’d been running for three years. Accuracy held above 90% on complex datasets involving 8-table joins and thousands of fields. Product managers, operations staff, even people in finance were using it.

It looked like a success.

But when I look back, the biggest thing I got out of it wasn’t “we shipped a good product.” It was finally understanding where ChatBI’s ceiling is — and what we should actually be building.

Why We Built It

The data team — engineers, BI analysts — was drowning in ad-hoc query requests. A business stakeholder asks a question, an analyst writes a SQL query, pastes the result in a chat, waits for the next question. Repeat five to eight times a day.

We tried everything: more dashboards, self-service query tools, documentation. All of it helped a little. None of it solved the real problem.

Because the core tension was never addressed — the speed at which business asks questions will always outpace the speed at which analysts can write SQL.

By the second half of 2024, Text-to-SQL with LLMs had matured enough to use in production. We decided to build.

The Hardest Part Wasn’t the Model — It Was the Knowledge

Before we started, I assumed the hard part would be the model: which LLM to use, how to fine-tune, how to write prompts. After we shipped, I realized the model was the easiest part.

The hard part was knowledge engineering.

Think about what a person actually does when writing SQL. They find the right tables and fields, understand what each column means and its valid values. They handle time semantics — does “this week” start on Monday or Sunday? They translate company-specific jargon — what’s a “highly engaged user”? What exactly does “immersive DAU” — a custom engagement metric — measure? And they know SQL dialect differences — MySQL’s WEEKDAY and StarRocks’ DAYOFWEEK_ISO are not the same thing.

All of this knowledge lives in analysts’ heads, in internal docs, in historical SQL scripts. The LLM knows none of it.

So we spent most of our time on knowledge management. We wrote descriptions for every table and field, documented dimension relationships and join rules — not a one-time task, but ongoing maintenance. We built a rule library: default to yesterday’s data, use StarRocks syntax not MySQL, when computing MAU do a count distinct across the full time window not a sum of daily aggregates. Each rule looks trivial. Each one is a real bug we hit. We built synonym mappings. We built an example library of canonical question-to-SQL pairs — not to let the model copy, but to teach it how to reason about a class of problem.

Looking back, 70% of the work went into knowledge engineering, 30% into the model and infrastructure. The opposite of what most people expect.

A Few Architectural Choices

The final product had three interfaces: a web app, an internal IM bot, and an open API. Behind all three was a single data intelligence engine — multiple models at the base (no single-vendor lock-in), a knowledge and tooling layer in the middle (RAG, conversation memory, syntax checking, permission validation), a set of specialized agents above that (intent recognition, dataset selection, NL2SQL, chart generation, error correction), and product features on top.

The architecture itself wasn’t unusual. What’s worth talking about are a few decisions we only got right after taking the wrong path first.

First, RAG can’t be just vector retrieval. We used three retrieval methods simultaneously — vector recall, relational recall, and LLM-based recall — then reranked the results. A single retrieval method isn’t reliable enough, especially for synonyms and fuzzy phrasing. When a user says “fiction channel,” the knowledge base stores terms like “male-oriented web novels” and “female-oriented web novels.” Pure vector retrieval often fails to connect them.

Second, agents need to be decomposed. Early on we tried a single agent doing everything. It worked poorly. Once we split into multiple specialized agents — each with a focused prompt and single responsibility — accuracy jumped meaningfully.

Third, end-to-end tracing is non-negotiable. Every query, from intent recognition through dataset selection through SQL generation through execution, is fully traceable: what RAG retrieved, what prompt the LLM received, token cost, latency at each stage. Without tracing, when something goes wrong you’re guessing. With it, you can locate the problem. Build this capability as early as possible.

Evaluation

The question that worried me most before launch: how do we know if it’s right?

SQL errors come in two types. Syntax errors fail to run — those are actually fine, you see them immediately. Semantic errors are the dangerous ones: the query runs, returns a result, but the result is wrong. A user gets a number that looks reasonable, makes a decision, and the number was off by 30%.

Our approach: build a question bank from real business queries, pair each one with a hand-written “golden” SQL. After every model upgrade, knowledge base update, or prompt change, run the full set and compare results. The key — don’t compare SQL text (there are many correct ways to write the same query). Compare the data the queries return.

For production traffic, we also had humans label every query result: correct or not, and if not, why. Those labels fed back into the knowledge base and prompt improvements, closing the loop.

This evaluation system took real effort to maintain. But without it, shipping to production is just gambling.

A Few Things I Didn’t Know Before

Here’s a concrete example. We spent time debating whether to upgrade to a more expensive model — the expected accuracy gain was around 5%. That same week, an analyst added one business rule: “when computing MAU, count distinct across the full time window, don’t aggregate by day first then sum.” It wiped out an entire class of errors. Knowledge ROI is much higher than model ROI. When resources are tight, that priority order matters.

Another unexpected finding: the quality of user questions determines the quality of answers. No model is smart enough to fix a vague question. We put real effort into guiding users at the product layer — surfacing popular questions, showing dataset information, nudging users to specify dimensions. These interactions look low-tech. Their impact on accuracy was as large as model work.

One more thing. When UV surpassed the old tool two weeks after launch, I thought we’d reached a milestone. The real value came later. User behavior started pulling data governance in new directions. Which tables got queried most. Which metrics kept getting misunderstood. Which dimensions were missing entirely. These signals were more precise and actionable than any dedicated data governance project we’d run.

The Ceiling

After running for most of last year, the ceiling became visible.

A product manager asked: “Why did DAU drop last week?” ChatBI could tell you what last week’s DAU was. It couldn’t tell you why it dropped. Answering “why” requires looking at the aggregate first, then slicing by dimension, locating the anomaly, and generating a root-cause conclusion — not one SQL query, but an analyst’s full workflow.

ChatBI answers What. What the business actually wants to know is Why and How.

Going from Text-to-SQL to an Analysis Agent isn’t a feature upgrade — it’s a paradigm shift. The former simulates a “SQL-writing tool.” The latter simulates “an analyst who can reason” — multi-step inference, a code interpreter, analytical reasoning chains. A completely different thing from writing SQL.

Thinking this through changed how I looked at the team’s direction. ChatBI was a first step, but only a first step. The real goal isn’t to make machines write SQL. It’s to make machines understand data.

If I were doing this over, a few things I’d do earlier: build the evaluation system on day one, don’t backfill it; set up end-to-end tracing from the start, because without it you can’t debug; put the knowledge base ahead of model selection in priority order.

And the most important one — don’t treat Text-to-SQL as the destination. Getting to “query data in natural language” is 30% of the job. What the business wants isn’t SQL. It’s insight.

We’ve since taken another step forward and built a new Data Agent system. That’s a different story, for next time.