Six-Layer Context Stack for Data Agents
The previous post covered why we're shifting the data team from writing SQL to training Agents, and mentioned that we're building a six-layer Context Stack. This post takes each layer apart and explains how we put it into practice.
The Big Picture
Signal density ↑ Acquisition cost ↑
┌─────────────────────────┐
L6 │ Runtime Exploration │ Agent probes live
├─────────────────────────┤
L5 │ Agent Memory │ Interactive correction → self-evolving
├─────────────────────────┤
L4 │ Domain Knowledge │ Organizational knowledge → analysis frameworks
├─────────────────────────┤
L3 │ Code-Level Semantics │ ETL code → real semantics
├─────────────────────────┤
L2 │ Curated Semantics │ Expert annotations → metrics/relationships
├─────────────────────────┤
L1 │ Schema & Usage │ Metadata + query patterns
└─────────────────────────┘
From bottom to top, signal density increases, and so does the cost of acquiring it. Below I walk through the layers one by one: what we did at each, and why we made the choices we did.
L1: Catalog First, Not RAG
We started by dumping all table schemas and query history into embeddings and retrieving via RAG at query time. Once we crossed a thousand tables and a few hundred metrics, the recall accuracy of pure semantic retrieval stopped being stable.
Then we added a lightweight structure called Manifest. Each manifest file stores only each entry's id, name, aliases, and ref, and all manifests together fit in a few thousand tokens of context. The Agent loads the manifests directly into context at startup, like a “table of contents” page. Once it hits a match, it pulls the details by ref.
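For illustration, a single manifest entry might look something like this; the field names come from above, while the values are hypothetical:

```yaml
# One entry in a metrics manifest, loaded into context at startup.
# Just enough to route on; the details live behind the ref.
- id: metric.dau.playback
  name: playback DAU
  aliases: [vv_dau, playing_dau]
  ref: semantics/metrics/playback_dau.yaml  # fetched only after a hit
```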
Precise routing first, fetch details on demand. RAG is the fallback, not the main path.
L2: Metrics Must Be Structured, Natural Language Isn’t Enough
We have metric clusters like “DAU”, “eDAU”, and “playback DAU”: under the same name, different business lines use different definitions. If you rely only on hand-written natural language descriptions, the Agent will mix them up.
We wrote the metrics, dimensions, and table JOIN graphs as structured yaml. Each metric explicitly spells out its source_tables, filter, cross_day_rule, and related_metrics. Once structured, the Agent no longer picks arbitrarily among similar metrics.
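A sketch of what one structured definition could look like, using the fields named above; the table name, filter, and rule values are made up for illustration:

```yaml
# A structured metric definition in the L2 layer (values are hypothetical).
id: metric.dau.playback
name: playback DAU
definition: distinct users with at least one valid playback per day
source_tables: [dwd_playback_log_daily]
filter: "play_duration_sec >= 5"            # what counts as a valid playback
cross_day_rule: attribute_to_session_start  # sessions that span midnight
related_metrics: [metric.dau.app, metric.dau.effective]
```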
This layer is the most labor-intensive in the stack. But looking back, it carries 80% of daily queries.
L3: A Table’s Meaning Lives in the ETL Code
A schema tells you which columns a table has, but it won't tell you that the table only includes App-side traffic and excludes in-car and third-party channels. That information lives in the ETL code that produces the table.
We have our own task scheduling system that stores the full SQL, scheduling config, and upstream/downstream dependencies for every output table. We built a tool wrapper on top of it, so when L2 isn’t enough, the Agent can pull a table’s production SQL and reverse-engineer its real meaning and lineage.
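The wrapper can stay thin. A minimal sketch, assuming a hypothetical scheduler client; `get_job_for_table` and the field names are stand-ins, not our real API:

```python
from dataclasses import dataclass

@dataclass
class TableProvenance:
    table: str
    etl_sql: str            # the full SQL that produces this table
    upstream: list[str]     # tables the job reads from
    downstream: list[str]   # jobs that consume this table

def get_table_provenance(table: str, scheduler) -> TableProvenance:
    """Fetch a table's production SQL and lineage so the Agent can
    reverse-engineer its real semantics when L2 comes up short."""
    job = scheduler.get_job_for_table(table)  # hypothetical endpoint
    return TableProvenance(
        table=table,
        etl_sql=job["sql"],
        upstream=job["upstream_tables"],
        downstream=job["downstream_jobs"],
    )
```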
ETL code is already describing “how this table is computed.”
After plugging this layer in, the Agent’s ability to distinguish between two similarly named but semantically different tables improved significantly.
L4: Domain Knowledge Must Be Independent of Skills
Early on we went the route of “write one skill per business domain”: one skill for push analysis, one for membership analysis, each stuffed with routing, domain knowledge, SQL templates, and execution logic. After writing a few, things got messy: the same metric was defined differently across two skills; domain knowledge couldn't be exhaustively enumerated; skills mixed “knowledge” with “execution flow”; and any change meant touching a whole chunk.
The fix was to extract the domain knowledge out of the skills and maintain it separately as yaml and md files.
After the extraction, topic-analysis skills slimmed down to “lightweight routing + special flow orchestration.” In most scenarios a generic analysis Agent that reads the domain knowledge can handle the request, so there's no longer a need to write a separate skill per business domain.
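To make the split concrete, here is a sketch of what a slimmed-down topic skill might reduce to; the file paths and keys are hypothetical:

```yaml
# A topic skill after extraction: routing and flow only.
# Domain knowledge is referenced, not embedded, so any agent can share it.
skill: push_analysis
route_on: [push, notification, delivery_rate]
knowledge:                      # maintained independently as the L4 layer
  - domains/push/metrics.yaml
  - domains/push/playbook.md
flow: default_analysis          # override only when the domain needs it
```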
L5: Memory Has to Expire
Reusing corrections is table stakes: when a user fixes something once, the Agent shouldn't get it wrong the next time. But memory can't be write-only; otherwise, half a year in, it's just a pile of noise.
We added a hotness_score to each memory entry: hit frequency normalized with a sigmoid, multiplied by a time decay with a 7-day half-life. A memory from 30 days ago carries about 5% of its weight, so no manual cleanup is needed; memories simply expire on their own.
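In code, the scoring could look like the sketch below. The 7-day half-life and the roughly 5% weight at 30 days come straight from the text; the exact sigmoid scaling of hit frequency is our assumption:

```python
import math

HALF_LIFE_DAYS = 7.0

def hotness_score(hit_count: int, days_since_last_hit: float) -> float:
    """Sigmoid-normalized hit frequency times a 7-day half-life decay."""
    freq = 1.0 / (1.0 + math.exp(-hit_count))  # logistic squash into (0, 1)
    decay = 0.5 ** (days_since_last_hit / HALF_LIFE_DAYS)
    return freq * decay

# A memory untouched for 30 days keeps about 5% of its weight:
# 0.5 ** (30 / 7) ≈ 0.051
```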
Meanwhile, high-hotness memories enter a review queue. Those confirmed valuable get promoted back into the source files in L2 or L4, becoming “solidified knowledge.” L5 is essentially a temporary buffer: its value lies not in long-term storage but in promptly pushing high-frequency corrections down into the lower layers.
L6: Probe Results Should Be Persisted
The fallback layer. When L1 through L5 have no answer, the Agent queries the warehouse live, verifying schema, sampling data, tracing lineage.
We added one more step: every valuable finding from an L6 probe gets suggested for promotion back to the lower layers. Table structure → L1, field meaning → L2's glossary, business understanding → L5 memory. On-the-fly probing isn't cheap; if a finding can be persisted, don't make the Agent probe for it again.
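The routing rule in the text is simple enough to state as a lookup table; a sketch, with illustrative type names:

```python
# Route each L6 finding to the layer it belongs in, per the mapping above.
PROMOTION_TARGETS = {
    "table_structure": "L1",            # schema facts -> catalog
    "field_meaning": "L2/glossary",     # column semantics -> curated glossary
    "business_understanding": "L5",     # soft knowledge -> agent memory
}

def suggest_promotion(finding_type: str, payload: dict) -> dict:
    """Suggest, don't write: findings go through review before landing."""
    target = PROMOTION_TARGETS.get(finding_type)
    if target is None:
        return {"action": "discard", "payload": payload}
    return {"action": "promote", "target": target, "payload": payload}
```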
The whole stack is a feedback loop: ephemeral knowledge from the upper layers settles into the lower ones, ready to be used directly the next time.
Why Not Ontology
Banks and financial institutions building data Q&A often go the ontology route, formalizing concepts, relationships, and rules entirely, letting AI reason on a deterministic knowledge graph.
Ontology's appeal is determinism. Concepts have precise definitions, relationships have clear boundaries, and reasoning paths are auditable. It makes sense in finance: the definition of “Return on Equity” hasn't changed in ten years, regulators demand explainability, and the cost of an error is extremely high. Spending two years building a formal knowledge system is worth it there.
But the ontology route rests on two implicit assumptions: that domain knowledge can be exhaustively formalized, and that the maintenance cost is sustainable.
For fast-iterating businesses, neither holds. Today there's a new “immersive DAU”, tomorrow a new experiment metric, and the day after, some table's definition quietly changes. An ontology can't keep up with that pace of change.
L2 is essentially a lightweight ontology already: metrics have an id, a definition, source_tables, and related_metrics; dimensions have value ranges and a sql_snippet; tables have JOIN graphs. It just doesn't pursue exhaustiveness. Structure things until they're useful, and hand whatever can't be managed off to the upper layers as fallback.
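A dimension entry in the same lightweight-ontology style might look like this; the values and snippet are invented for illustration:

```yaml
# An L2 dimension entry: a value range plus a canonical SQL snippet.
id: dim.platform
name: platform
values: [app, web, car, third_party]
sql_snippet: "lower(platform_code)"   # canonical derivation in queries
```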
A pure ontology approach tries to compress everything into one layer. The advantage of layering is that you don’t need to formalize all knowledge upfront. Structure the high-frequency 80% into L2 first, let L3 through L6 gradually cover and persist the rest.
Still in Phase 1
To be honest, this Stack has only completed Phase 1: the L1 + L2 manifest routing plus detail lookup. L3 is partially connected, L4 is still being extracted, and L5 and L6 aren't really activated yet.
Phase 1 runs noticeably more stably than the pure RAG retrieval of the ChatBI era. The “shape” of the errors has changed too: before, the Agent would confidently pick a wrong table; now it knows when it's uncertain and goes to fetch details or falls back to semantic retrieval.
Once at an internal sharing session, an executive asked: “How many AB experiments does the company have? Which ones have been running for over 6 months, don’t conform to the company’s global-optimum experiment spec, and have negative xx metrics? List them.”
We had never tested this kind of question. I was nervous and thought it wouldn't produce an answer. The Agent ran for 40 minutes and delivered the result. The user was very satisfied.
Throughout, the Agent showed strong goal-driven behavior, doing whatever it took to complete the task instead of passing the buck. Our direction is right.
A Few Takeaways
First, Context must be layered, and the layers must have a clear division of labor. Stuffing everything into one big prompt, one big RAG index, or one big skill doesn't work well.
Second, the layering isn’t arbitrary. The six layers hold up because each has clean boundaries. L1 is physical fact, L2 is human-curated semantics, L3 is truth reverse-engineered from code, L4 is organizational consensus, L5 is the Agent’s own learning, L6 is on-the-fly probing. Clear boundaries enable independent iteration.
Third, the biggest payoff of layering is that skills become lightweight. Once-bloated topic skills slimmed down to lightweight routers, and most of the domain knowledge moved into L4. We write less skill code, and the quality went up.
Finally, this road is far from finished. L5's promotion mechanism, L6's probing protocol, and the cross-layer ID system are all still iterating. The next thing to truly validate: once the Stack runs stably, can the Agent reliably stay auditable and correctable, and drive more insights and the strategies that follow from them?