I Built a ChatBI. Then I Realized It Wasn't Enough
Two years ago, my team built a ChatBI — a tool that uses LLMs to translate natural language into SQL, so people who can’t write code can query data on their own.
Two weeks after launch, daily active users surpassed a self-service query tool we’d been running for three years. Accuracy held above 90% on complex datasets involving 8-table joins and more than a thousand fields. Product managers, operations staff, even people in finance were using it.
It looked like a success.
But when I look back, the biggest thing I got out of it wasn’t “we shipped a good product.” It was finally understanding where ChatBI’s ceiling is — and what we should actually be building.
Why We Built It
The data team — engineers, BI analysts — was drowning in ad-hoc query requests. A business stakeholder asks a question, an analyst writes a SQL query, pastes the result in a chat, waits for the next question. Repeat five to eight times a day.
We tried everything: more dashboards, self-service query tools, documentation. All of it helped a little. None of it solved the real problem.
Because the core tension was never addressed — the speed at which business asks questions will always outpace the speed at which analysts can write SQL.
By the second half of 2024, Text-to-SQL with LLMs had matured enough to use in production. We decided to build.
The Hardest Part Wasn’t the Model — It Was the Knowledge
Before we started, I assumed the hard part would be the model: which LLM to use, how to fine-tune, how to write prompts. After we shipped, I realized the model was the easiest part.
The hard part was knowledge engineering.
Think about what a person actually does when writing SQL. They find the right tables and fields and understand what each column means and which values it can take. They handle time semantics: does “this week” start on Monday or Sunday? They translate company-specific jargon: what counts as a “highly engaged user”? What exactly does “immersive DAU”, a custom engagement metric, measure? And they know SQL dialect differences: MySQL’s WEEKDAY and StarRocks’ DAYOFWEEK_ISO are not the same thing.
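To make the dialect point concrete, here is a minimal sketch (Python, with illustrative names) of the kind of fact the knowledge layer has to state explicitly, because the model will not guess it reliably:

```python
# The "which number is Monday" trap across dialects.
# MySQL's WEEKDAY() counts Monday as 0; StarRocks' dayofweek_iso()
# follows the ISO convention, which counts Monday as 1.
def monday_filter(dialect: str, col: str = "dt") -> str:
    """Return a WHERE-clause fragment that matches Mondays."""
    if dialect == "mysql":
        return f"WEEKDAY({col}) = 0"        # MySQL: Mon=0 .. Sun=6
    if dialect == "starrocks":
        return f"dayofweek_iso({col}) = 1"  # ISO: Mon=1 .. Sun=7
    raise ValueError(f"unknown dialect: {dialect}")
```

Mix the two conventions and “this week” quietly starts on the wrong day.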
All of this knowledge lives in analysts’ heads, in internal docs, in historical SQL scripts. The LLM knows none of it.
So we spent most of our time on knowledge management. We wrote descriptions for every table and field and documented dimension relationships and join rules: not a one-time task, but ongoing maintenance. We built a rule library: default to yesterday’s data; use StarRocks syntax, not MySQL; when computing MAU, count distinct across the full time window rather than summing daily aggregates. Each rule looks trivial, and each one is a real bug we hit. We built synonym mappings between the names users say and the names the catalog stores (in our case, several colloquial variants of the same “parent-child” content category all had to resolve to one canonical term); without them, the model is simply lost. And we built an example library of canonical question-to-SQL pairs: not for the model to copy, but to teach it how to reason about each class of problem.
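The MAU rule is worth a worked example, because the wrong version looks so plausible. A toy sketch of the failure mode:

```python
# Summing daily distinct counts double-counts anyone active on more
# than one day; MAU needs one distinct count over the whole window.
daily_active = {
    "2024-06-01": {"alice", "bob"},
    "2024-06-02": {"alice", "carol"},
}

wrong_mau = sum(len(users) for users in daily_active.values())  # 4: alice counted twice
right_mau = len(set().union(*daily_active.values()))            # 3: alice counted once

# The same mistake in SQL (illustrative):
#   wrong: SELECT SUM(dau) FROM (SELECT dt, COUNT(DISTINCT uid) AS dau FROM events GROUP BY dt) t
#   right: SELECT COUNT(DISTINCT uid) FROM events WHERE dt BETWEEN start_dt AND end_dt
```

Three users, not four. On real traffic the gap scales with your repeat-visit rate, which is exactly the kind of error that looks reasonable and gets acted on.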
Looking back, 70% of the work went into knowledge engineering, 30% into the model and infrastructure. The opposite of what most people expect.
A Few Architectural Choices
The final product had three interfaces: a web app, an internal IM bot, and an open API. Behind all three was a single data intelligence engine — multiple models at the base (no single-vendor lock-in), a knowledge and tooling layer in the middle (RAG, conversation memory, syntax checking, permission validation), a set of specialized agents above that (intent recognition, dataset selection, NL2SQL, chart generation, error correction), and product features on top.
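For orientation, a compressed sketch of that layering; the class and attribute names here are illustrative, not our actual module names:

```python
# One engine behind all three interfaces (web, IM bot, API).
class DataIntelligenceEngine:
    def __init__(self, models, knowledge, tools, agents):
        self.models = models        # several LLM providers, no single-vendor lock-in
        self.knowledge = knowledge  # table/field docs, rules, synonyms, examples
        self.tools = tools          # RAG, conversation memory, syntax check, permissions
        self.agents = agents        # intent, dataset selection, NL2SQL, chart, repair

    def answer(self, user, question: str):
        self.tools.permissions.check(user, question)
        intent = self.agents.intent.run(question)
        dataset = self.agents.dataset.run(intent, self.knowledge)
        sql = self.agents.nl2sql.run(intent, dataset, self.tools.rag)
        return self.agents.chart.run(dataset.execute(sql))
```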
The architecture itself wasn’t unusual. What’s worth talking about are a few decisions we only got right after taking the wrong path first.
First, RAG can’t be just vector retrieval. We used three retrieval methods simultaneously — vector recall, relational recall, and LLM-based recall — then reranked the results. A single retrieval method isn’t reliable enough, especially for synonyms and fuzzy phrasing. When a user says “fiction channel,” the knowledge base stores terms like “male-oriented web novels” and “female-oriented web novels.” Pure vector retrieval often fails to connect them.
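A sketch of the three-way recall plus rerank; the retriever objects and their methods are stand-ins, not a real library’s API:

```python
# Merge three recall paths, dedupe, then rerank the pooled candidates.
def retrieve(query: str, vector_index, synonym_graph, llm, reranker, k: int = 10):
    candidates = []
    candidates += vector_index.search(query, top_k=k)  # embedding similarity
    candidates += synonym_graph.lookup(query)          # curated term relations, e.g.
                                                       # "fiction channel" -> web-novel entities
    candidates += llm.suggest_terms(query)             # LLM expansion for fuzzy phrasing
    unique = list(dict.fromkeys(candidates))           # dedupe while preserving order
    return reranker.rerank(query, unique)[:k]          # cross-encoder or LLM-as-judge
```

The synonym path is what rescues the “fiction channel” case: a curated edge fires even when the embedding distance is too large.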
Second, agents need to be decomposed. Early on we tried a single agent doing everything. It worked poorly. Once we split into multiple specialized agents — each with a focused prompt and single responsibility — accuracy jumped meaningfully.
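The contrast in practice looks roughly like this; the prompts are paraphrased illustrations, not our production prompts:

```python
# One agent means one sprawling prompt that must cover everything at once,
# and it keeps growing as rules accumulate:
MONOLITH_PROMPT = """You are a data assistant. Classify the user's intent,
pick the right dataset, write StarRocks SQL, check permissions, repair your
own errors, and decide whether to draw a chart. Rules: ..."""

# Decomposed agents each carry a short prompt with a single job:
INTENT_PROMPT  = "Classify the question: metric_query | comparison | other."
DATASET_PROMPT = "Given the intent and these dataset summaries, choose one dataset."
NL2SQL_PROMPT  = "Write StarRocks SQL for this question against this schema."
CHART_PROMPT   = "Given this result table, choose a chart type and axes."
```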
Third, end-to-end tracing is non-negotiable. Every query, from intent recognition through dataset selection through SQL generation through execution, is fully traceable: what RAG retrieved, what prompt the LLM received, token cost, latency at each stage. Without tracing, when something goes wrong you’re guessing. With it, you can locate the problem. Build this capability as early as possible.
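A sketch of the per-stage trace record; the field names are illustrative:

```python
from dataclasses import dataclass

# Every query emits one of these per pipeline stage, so a wrong answer can
# be replayed stage by stage instead of guessed at.
@dataclass
class StageTrace:
    query_id: str
    stage: str             # "intent" | "dataset" | "nl2sql" | "execute"
    retrieved: list[str]   # what RAG recalled for this stage
    prompt: str            # exactly what the LLM received
    output: str            # exactly what it produced
    tokens: int
    latency_ms: float
```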
Evaluation
The question that worried me most before launch: how do we know if it’s right?
SQL errors come in two types. Syntax errors fail to run; those are actually the easy case, because you see them immediately. Semantic errors are the dangerous ones: the query runs, returns a result, but the result is wrong. A user gets a number that looks reasonable, makes a decision, and the number was off by 30%.
Our approach: build a question bank from real business queries, pair each one with a hand-written “golden” SQL. After every model upgrade, knowledge base update, or prompt change, run the full set and compare results. The key — don’t compare SQL text (there are many correct ways to write the same query). Compare the data the queries return.
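A minimal sketch of result-based grading, assuming a run_sql helper that executes a query and returns rows as tuples:

```python
def normalize(rows):
    """Canonicalize a result set: round floats, then sort rows."""
    return sorted(
        (tuple(round(v, 6) if isinstance(v, float) else v for v in row) for row in rows),
        key=repr,
    )

def grade(conn, generated_sql: str, golden_sql: str) -> bool:
    """Two queries count as equivalent iff they return the same data."""
    return normalize(run_sql(conn, generated_sql)) == normalize(run_sql(conn, golden_sql))
```

Accuracy on the bank is then just the fraction of questions that grade true, recomputed after every change.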
For production traffic, we also had humans label every query result: correct or not, and if not, why. Those labels fed back into the knowledge base and prompt improvements, closing the loop.
This evaluation system took real effort to maintain. But without it, shipping to production is just gambling.
A Few Things I Didn’t Know Before
Here’s a concrete example. We spent time debating whether to upgrade to a more expensive model — the expected accuracy gain was around 5%. That same week, an analyst added one business rule: “when computing MAU, count distinct across the full time window, don’t aggregate by day first then sum.” It wiped out an entire class of errors. Knowledge ROI is much higher than model ROI. When resources are tight, that priority order matters.
Another unexpected finding: the quality of user questions determines the quality of answers. No model is smart enough to fix a vague question. We put real effort into guiding users at the product layer — surfacing popular questions, showing dataset information, nudging users to specify dimensions. These interactions look low-tech. Their impact on accuracy was as large as model work.
One more thing. When UV surpassed the old tool two weeks after launch, I thought we’d reached a milestone. The real value came later. User behavior started pulling data governance in new directions. Which tables got queried most. Which metrics kept getting misunderstood. Which dimensions were missing entirely. These signals were more precise and actionable than any dedicated data governance project we’d run.
The Ceiling
After most of a year in production, the ceiling became visible.
A product manager asked: “Why did DAU drop last week?” ChatBI could tell you what last week’s DAU was. It couldn’t tell you why it dropped. Answering “why” requires looking at the aggregate first, then slicing by dimension, locating the anomaly, and generating a root-cause conclusion — not one SQL query, but an analyst’s full workflow.
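The shape of that workflow, as a sketch; query_total and query_by_dim are hypothetical helpers, each wrapping one generated SQL query:

```python
# Answer "why did the metric drop?" by slicing the total across candidate
# dimensions and ranking each segment's contribution to the change.
def attribute_drop(metric, cur_week, prev_week, dims=("channel", "region", "os")):
    total_delta = query_total(metric, cur_week) - query_total(metric, prev_week)
    findings = []
    for dim in dims:  # one comparison query per candidate dimension
        for segment, cur, prev in query_by_dim(metric, dim, cur_week, prev_week):
            findings.append((dim, segment, cur - prev))
    findings.sort(key=lambda f: f[2])  # biggest drops first
    return total_delta, findings[:5]   # raw material for a narrative conclusion
```

Several queries, an intermediate ranking step, and a judgment call at the end: a workflow, not a statement.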
ChatBI answers What. What the business actually wants to know is Why and How.
Going from Text-to-SQL to an Analysis Agent isn’t a feature upgrade — it’s a paradigm shift. The former simulates a “SQL-writing tool.” The latter simulates “an analyst who can reason” — multi-step inference, a code interpreter, analytical reasoning chains. A completely different thing from writing SQL.
Thinking this through changed how I looked at the team’s direction. ChatBI was a first step, but only a first step. The real goal isn’t to make machines write SQL. It’s to make machines understand data.
If I were doing this over, a few things I’d do earlier: build the evaluation system on day one, don’t backfill it; set up end-to-end tracing from the start, because without it you can’t debug; put the knowledge base ahead of model selection in priority order.
And one more, maybe the most important: don’t treat Text-to-SQL as the destination. Getting to “query data in natural language” is only 30% of the job. What the business wants isn’t SQL. It’s insight.
We’ve since taken another step forward and built a new Data Agent system. That’s a different story, for next time.