The Expert System Trap, on LLMs

Over the past two years, a lot of data and business teams have been doing the same thing: rebuilding strategy with LLMs. It started as assistance. The deeper we went, the bigger the dividend.

At first it was simple. Have the LLM help break down KPIs, propose business directions, write first drafts of plans. The efficiency was immediate. Then segment-level operations: each segment with its own messaging, perks, and cadence, far sharper than “one plan for everyone.” A step further was personalization at the user level. For every user, every visit, the model would compute on the spot the content, the pricing, the push action that fit them. This is the thing the internet industry has wanted to do for the past decade and a half but never quite could.

Each finer step brought positive feedback. AI looked omnipotent. The granularity we had never reached before was suddenly within range. Even complex multi-scenario decisions seemed like things we could gradually hand off to the model.

I was one of the optimists on this path, until I was recently pulled into discussions about a new direction. After a few rounds of meetings I kept having a “something is off” feeling, but couldn’t name it. One day I was staring at a whiteboard packed with rule nodes and state transition arrows, and it hit me.

We were rebuilding an expert system, on top of LLMs.

Nobody talks about expert systems anymore. But from the 60s through the 80s, “expert system” was synonymous with AI. MYCIN helped doctors set antibiotic dosages. DENDRAL inferred molecular structures. XCON configured VAX servers for DEC’s customers and saved DEC tens of millions of dollars a year. Back then the story of AI was rules plus an inference engine.

Most of them were written in Lisp or in rule languages from the same tradition (XCON, for one, was written in OPS5). McCarthy invented Lisp in 1958 for AI, when AI was essentially synonymous with symbolic computation. A few of Lisp’s features fit expert systems unusually well:

Homoiconicity: code itself is data. One rule can read, modify, or synthesize another. The inference engine is not “running code” but “operating on data,” where the data happens to be code.
Macros: the macro system lets you write a DSL close to the business language, expanded into executable logic at compile time. Every expert system is essentially a domain-specific language.
The REPL: a knowledge engineer can add a rule, change a rule, and see the inference result immediately, on a running system. No restart, no recompile.
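
To make “rules plus an inference engine” concrete, here is a minimal sketch in Python rather than Lisp. The rule contents and the names forward_chain and add_rule are made up for illustration and are not taken from MYCIN, XCON, or any real system. It only shows the two ideas above: rules stored as plain data, and a rule base that can be extended while the system is running.

```python
# Hypothetical configuration rules, invented for illustration only.
# Each rule is plain data: a set of premises and one conclusion.
RULES = [
    {"if": {"has_cpu", "no_power_supply"}, "then": "add_power_supply"},
    {"if": {"add_power_supply"}, "then": "check_cabinet_space"},
]


def forward_chain(facts, rules):
    """Fire every rule whose premises all hold, until no new fact appears."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule["if"] <= known and rule["then"] not in known:
                known.add(rule["then"])
                changed = True
    return known


def add_rule(rules, premises, conclusion):
    """Rules are data, so the rule base can grow without a restart."""
    rules.append({"if": set(premises), "then": conclusion})


order = {"has_cpu", "no_power_supply"}
derived = forward_chain(order, RULES)

# A knowledge engineer adds one rule to the live rule base and re-runs.
add_rule(RULES, {"check_cabinet_space"}, "notify_field_engineer")
derived_after = forward_chain(order, RULES)
```

Note that the new rule fires only because an older rule produced its premise. That chaining is exactly what becomes hard to predict once the rule base is large.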

Eventually even the shovels (dedicated hardware) showed up. Symbolics, LMI, and the other Lisp Machine companies formed a whole industry of their own.

Then it collapsed.

Where did it collapse?

First, rule explosion. A few hundred rules are manageable. Past a thousand, no one can fully predict the interactions between rules. A new rule can trigger chain reactions in existing rules in ways you didn’t anticipate. Changing a rule can break a scenario that had been working fine. Maintenance cost grows super-linearly. MYCIN had about 600 rules, which already gives 600 × 599 / 2 = 179,700 rule pairs, so roughly 180,000 potential pairwise conflicts.
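
The pairwise figure is just “n choose 2”. A small sketch of how fast it grows follows; the function name pairwise_interactions is mine. Pairs are only a lower bound on the problem, since real chain reactions can involve three or more rules.

```python
from math import comb


def pairwise_interactions(rule_count):
    """Number of unordered rule pairs: n * (n - 1) / 2."""
    return comb(rule_count, 2)


for n in (100, 600, 1000, 5000):
    print(n, pairwise_interactions(n))
# 600 rules -> 179700 pairs; 1000 rules -> 499500 pairs
```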

Second, the business rules themselves keep changing. New diseases, new products, new regulations, new exceptions. Each change requires recalibrating the rule set. Knowledge engineers’ throughput could never keep up with the business iteration speed.

Third, it was slow. Extracting rules from domain experts one at a time, formalizing them, and going from zero to usable typically took two to three years.

XCON ran at DEC for ten years and saved the company $40 million annually. The maintenance team eventually grew to 55 people working full-time on the rule base. It was finally retired, not because it stopped working, but because the maintenance cost had passed the savings.

When the AI winter arrived, no one was surprised.

Forty years later, LLMs have solved some of these problems.

Knowledge acquisition is no longer manual. The models have ingested most of human domain knowledge. You no longer need a knowledge engineer to extract and formalize rules one by one.

Fuzzy matching is no longer a problem either. Where expert systems used to need “if patient age > 60 and body temperature > 38.5 then…”, an LLM can read a natural language description and surface the relevant rules, or directly generate the executable rule code.

Both are progress. Two problems remain.

Rule explosion is still there. Hanging a structured rule graph and a state machine off an LLM does not make rule interactions controllable just because the substrate is now a Transformer. Past a certain threshold, debugging and regression testing become a nightmare just the same.

The business rules keep changing, and the LLM can’t solve that either. No matter how large the model, it doesn’t know what new product you’ll ship next quarter or how a new strategy will cascade through old rules. That has to be maintained by humans. The work is not fundamentally different from thirty years ago.

Strapping an LLM onto a state machine is like using a supercar engine to push an ox cart. The engine is good. The axles and wheels are still from thirty years ago.

Does that mean the expert-system path is wrong for every scenario? Not quite.

Where the rule count is stable, business rules rarely change, and compliance demands interpretability and auditability, expert systems never really died. They just changed names. Bank risk-control engines, clinical decision support, aerospace fault-tree reasoning: all of these follow the same paradigm. Adding an LLM here is reasonable: let the LLM handle knowledge acquisition and fuzzy matching, while the rule set itself stays stable. The rule explosion problem is naturally sealed in by the business boundary.

Two questions are enough to tell whether this path is worth taking: how many times a year do your business rules change, and by what factor will the rule count grow over the next two years?

If the answers are more than ten changes a year and a rule count that triples in two years, you are walking right back into the pit the 90s teams fell into.

Here is an example from my own work. User profile × user state × decision action forms a huge strategy table. Cold-start users and high-frequency users. New users and returning users. Active hours and silent hours. For every combination, what to recommend, what perks to offer, what strategy to run. If you cover all of this with rules, the rule set starts in the thousands. Each new piece of business introduces interactions no one can fully explain. This is rule explosion from the 90s, dressed in modern skin.
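
A rough way to see where “thousands” comes from is to multiply the dimensions out. The sketch below uses invented dimension names and option counts (10 content options, 5 perks, 4 strategies, and a later “channel” dimension are all assumptions, not numbers from the project). It counts the cells a rule set would have to cover, which is an upper bound on the rule count rather than the rule count itself.

```python
from itertools import product
from math import prod

# Hypothetical dimensions and option counts, invented for illustration.
CONTEXT = {
    "frequency": ["cold_start", "high_frequency"],
    "lifecycle": ["new", "returning"],
    "time_window": ["active_hours", "silent_hours"],
}
ACTIONS = {
    "content": [f"content_{i}" for i in range(10)],
    "perk": [f"perk_{i}" for i in range(5)],
    "strategy": [f"strategy_{i}" for i in range(4)],
}


def table_size(*dimension_groups):
    """Cells in the full cross product of every dimension in every group."""
    return prod(len(values) for group in dimension_groups for values in group.values())


def enumerate_cells(*dimension_groups):
    """Yield every cell of the strategy table as a dict of dimension -> value."""
    merged = {}
    for group in dimension_groups:
        merged.update(group)
    names = list(merged)
    for combo in product(*(merged[name] for name in names)):
        yield dict(zip(names, combo))


base = table_size(CONTEXT, ACTIONS)  # 2 * 2 * 2 * 10 * 5 * 4
# One new line of business adds a single three-valued dimension.
with_channel = table_size(CONTEXT, ACTIONS, {"channel": ["push", "feed", "email"]})
print(base, with_channel)
```

The point of the sketch is the multiplication: one extra three-valued dimension triples the table, whatever the starting size was.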

Recommendation ranking solved this problem ten years ago. Ranking doesn’t ask anyone to write “if user age between 25 and 35 and active days in last 7 ≥ 3 and preferred category is A, boost.” Ranking stuffs all of those signals into a model and lets the model learn the optimal ordering from interaction data. Rule explosion was replaced by a model.

The ranking paradigm is: don’t enumerate rules, learn strategy from data. The cost is reduced interpretability. The benefit is no upper bound on rule count.

The ranking paradigm has kept moving. For the past decade, ranking has no longer been “predict click-through rate, sort by probability.” It is multi-objective optimization: CTR, completion, conversion, retention, revenue, several objectives at once. The business sets weights like 0.3, 0.4, 0.2, 0.1 by phase. If this quarter is about retention, the weight shifts toward retention. If next quarter the goal is revenue, it gets dialed back. The whole machinery is built on explicit loss functions, gradient-based optimizers, and daily feedback ingestion.
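
As a minimal sketch of the weight-dialing step only (not the training loop), the code below fuses per-objective predictions with explicit weights and re-ranks when the weights change. Everything in it is an assumption for illustration: the four objectives, which weight goes with which objective, the phase names, and the hard-coded predictions, which in a real ranker would come from a trained multi-task model.

```python
# Hypothetical objectives, weights, and per-item predictions, for illustration.
PREDICTIONS = {
    "A": {"ctr": 0.9, "completion": 0.6, "retention": 0.1, "revenue": 0.3},
    "B": {"ctr": 0.3, "completion": 0.5, "retention": 0.9, "revenue": 0.2},
    "C": {"ctr": 0.5, "completion": 0.4, "retention": 0.4, "revenue": 0.9},
}

# Two business phases expressed as explicit, auditable numbers.
ENGAGEMENT_PHASE = {"ctr": 0.3, "completion": 0.4, "retention": 0.2, "revenue": 0.1}
RETENTION_PHASE = {"ctr": 0.2, "completion": 0.2, "retention": 0.5, "revenue": 0.1}


def fused_score(prediction, weights):
    """Weighted sum of the per-objective predictions for one item."""
    return sum(weights[name] * prediction[name] for name in weights)


def rank(predictions, weights):
    """Item ids ordered by fused score, best first."""
    return sorted(
        predictions,
        key=lambda item: fused_score(predictions[item], weights),
        reverse=True,
    )


print(rank(PREDICTIONS, ENGAGEMENT_PHASE))
print(rank(PREDICTIONS, RETENTION_PHASE))
```

Changing four numbers reorders the list deterministically, which is the property a prompt-level instruction does not give you.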

So can you feed “user profile × user state” into an LLM, have it generate the decision action directly, and solve multi-objective optimization that way?

My read is no, at least not as a current strength of LLMs. The objective an LLM is trained on is predicting the next token. Its “preferences” are baked into the post-RLHF weights; they are not something you can dial in by writing 0.3, 0.4, 0.2, 0.1 in a prompt and expect the model to honor. Tell it to “balance clicks and retention” and it will produce something that sounds reasonable, but it won’t distinguish 0.3 from 0.35 with any precision. The bigger problem is the feedback loop. Ranking can retrain overnight on yesterday’s impression-click data. An LLM can’t be retrained on that cadence, so the gradient from business KPIs never makes it back into the model.

What an LLM can do here is qualitative work: explain why a piece of content fits a user, identify boundaries to handle carefully, fill in the long tail where data is sparse. These augment ranking; they don’t replace it. Multi-objective optimization remains ML’s home turf.

The LLM-era division of labor has three layers.

The ranking model decides “what to recommend”: the combinatorial strategy space and the multi-objective weight scheduling. This is ML’s home turf.
The LLM handles “who the user is and what the content is,” translating open-ended semantics, cross-domain integration, and the long tail that rules can’t enumerate into features the ranking model can use.
The rule engine guards “what we can’t do”: commercial licenses, compliance boundaries, hard constraints that shouldn’t be left to a probabilistic model.

The state machine, if it still has a place, lives at the third layer as a hard constraint. Its role is constraint, not decision.
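
Here is one way the three layers could be wired together, as a sketch under heavy assumptions. The LLM layer is replaced by a word-overlap stub so the code runs without any external service, the ranking model is a stand-in that just reads that one feature, and the licensing data is invented. Only the shape matters: features in, score out, hard constraints applied deterministically.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    item_id: str
    licensed_regions: frozenset
    description: str


def llm_features(user_text, candidate):
    """LLM layer stand-in: a real system would call an LLM or embedding model.
    This stub uses word overlap so the sketch stays self-contained."""
    user_words = set(user_text.lower().split())
    item_words = set(candidate.description.lower().split())
    return {"semantic_match": len(user_words & item_words) / max(len(item_words), 1)}


def rank_score(features):
    """Ranking layer stand-in for a trained multi-objective model."""
    return features["semantic_match"]


def allowed(candidate, region):
    """Rule layer: a hard constraint that no score can override."""
    return region in candidate.licensed_regions


def decide(user_text, region, candidates):
    eligible = [c for c in candidates if allowed(c, region)]
    ranked = sorted(
        eligible,
        key=lambda c: rank_score(llm_features(user_text, c)),
        reverse=True,
    )
    return [c.item_id for c in ranked]


# Hypothetical catalog, invented for illustration.
CATALOG = [
    Candidate("cooking_basic", frozenset({"US", "EU"}), "quick weeknight cooking recipes"),
    Candidate("cooking_pro", frozenset({"US"}), "weeknight cooking recipes"),
    Candidate("travel", frozenset({"US", "EU"}), "budget travel guides"),
]

query = "looking for weeknight cooking recipes"
us_result = decide(query, "US", CATALOG)
eu_result = decide(query, "EU", CATALOG)
```

In this toy catalog the best-scoring item is not licensed in the EU, so it never reaches the EU list no matter how well it scores. That is the sense in which the rule layer constrains rather than decides.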

Feeding an LLM into a state machine and letting it run rules shuts down what the LLM is good at and routes around what ML solved long ago. At bottom, it is the wrong choice of paradigm.

One last footnote. While we were discussing the technical design, Scala practically leapt off the page. It felt purpose-built for a rules-plus-state system: code as rules, declarative transformations, a functional-first style, and so on.

Later it hit me that none of those properties were Scala inventions. Lisp had them forty years ago. That generation of expert systems was written in Lisp. We thought we had found a handy new tool. In reality, the problem had led us back forty years.
