Over the past two years, many data and business teams have been doing the same thing: rebuilding their strategy work on top of LLMs. It started as assistance. The deeper we went, the bigger the dividend.

At first it was simple: have the LLM help break down KPIs, propose business directions, write first drafts of plans. The efficiency gain was immediate. Then came segment-level operations: each segment with its own messaging, perks, and cadence, far sharper than “one plan for everyone.” A step further was personalization at the user level. For every user, on every visit, the model would compute on the spot the content, the pricing, and the push action that fit them. This is the thing the internet industry has wanted to do for a decade and a half but never quite could.

Every step toward finer granularity brought positive feedback. AI looked omnipotent. Granularity we had never reached before was suddenly within range. Even complex multi-scenario decisions seemed like things we could gradually hand off to the model.

I was one of the optimists on this path, until I was recently pulled into discussions about a new direction. After a few rounds of meetings I kept having a “something is off” feeling but couldn’t name it. One day I was staring at a whiteboard packed with rule nodes and state-transition arrows, and it hit me.

We were rebuilding an expert system, on top of LLMs.


Nobody talks about expert systems anymore. But from the 60s through the 80s, “expert system” was synonymous with AI. MYCIN helped doctors set antibiotic dosages. DENDRAL inferred molecular structures. XCON configured VAX servers for DEC’s customers and saved DEC tens of millions of dollars a year. Back then the story of AI was rules plus an inference engine.

Almost all of them were written in Lisp. McCarthy invented Lisp in 1958 for AI, when AI was essentially synonymous with symbolic computation. A few of Lisp’s features fit expert systems unusually well:

  • Homoiconic: code itself is data. One rule can read, modify, or synthesize another. The inference engine is not “running code” but “operating on data,” where the data happens to be code.
  • The macro system lets you write a DSL close to the business language, expanded into executable logic at compile time. Every expert system is essentially a domain-specific DSL.
  • REPL: a knowledge engineer can add a rule, change a rule, and see the inference result immediately, on a running system. No restart, no recompile.
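That rules-plus-inference-engine loop is small enough to sketch. Below is a minimal forward-chaining engine in Python rather than Lisp, with invented medical-flavored rules, but the rules-as-data idea the list above describes carries over directly:

```python
# Minimal forward-chaining inference engine. Rules are plain data:
# (condition-facts, conclusion) pairs. The engine loops until no
# rule can add a new fact. Rule contents are invented examples.

def forward_chain(facts, rules):
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for condition, conclusion in rules:
            # A rule fires when all its condition facts are known
            # and its conclusion has not been derived yet.
            if condition <= facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True
    return facts

# Because rules are just data, adding or rewriting one at runtime
# is a list operation -- the property Lisp made natural.
rules = [
    (frozenset({"fever", "age>60"}), "high-risk"),
    (frozenset({"high-risk"}), "escalate"),
]

print(forward_chain({"fever", "age>60"}, rules))
# derives "high-risk", which in turn derives "escalate"
```

The second rule fires only because the first one added a fact, which is exactly the chain-reaction behavior that becomes hard to predict at scale.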

Eventually even the shovel-sellers showed up. Symbolics, LMI, the Lisp Machine companies were a whole industry of their own.

Then it collapsed.


Where did it collapse?

First, rule explosion. A few hundred rules are manageable. Past a thousand, no one can fully predict the interactions between rules. A new rule could trigger chain reactions in existing rules in ways you didn’t anticipate; changing a rule could break a scenario that had been working fine. Maintenance cost grows super-linearly. MYCIN had about 600 rules; the pairwise conflict space alone is nearly 180,000 pairs.
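The super-linear growth is just the pair count: with n rules there are n(n-1)/2 potential pairwise interactions, before you even consider chains of three or more. A quick check of the arithmetic:

```python
# Potential pairwise interactions among n rules: C(n, 2) = n*(n-1)/2.
# Doubling the rule count roughly quadruples the interaction space.
def pair_count(n: int) -> int:
    return n * (n - 1) // 2

print(pair_count(600))   # 179700 -- roughly MYCIN's scale
print(pair_count(1200))  # 719400 -- double the rules, ~4x the pairs
```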

Second, the business rules themselves keep changing. New diseases, new products, new regulations, new exceptions. Each change requires recalibrating the rule set. Knowledge engineers’ throughput could never keep up with the business iteration speed.

Third, it was slow. Extracting rules from domain experts one at a time and formalizing them meant that going from zero to a usable system typically took two to three years.

XCON ran at DEC for ten years and saved the company $40 million annually. The maintenance team eventually grew to 55 people working full-time on the rule base. It was finally retired not because it stopped working, but because the maintenance cost had overtaken the savings.

When AI Winter arrived, no one was surprised.


Forty years later, LLMs solved some of these problems.

Knowledge acquisition is no longer manual. The models have already ingested a huge share of recorded domain knowledge. You no longer need a knowledge engineer to extract and formalize rules one by one.

Fuzzy matching is no longer a problem either. Where expert systems used to need “if patient age > 60 and body temperature > 38.5 then…”, an LLM can read a natural language description and surface the relevant rules, or directly generate the executable rule code.
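What generating executable rule code might look like in practice: ask the model for a predicate, then load it. Everything below is a stand-in; `call_llm` and its canned response are invented so the sketch is self-contained, not a real API.

```python
# Sketch of LLM-assisted rule acquisition. `call_llm` is a stub
# standing in for a real model call; it returns a canned response
# here so the example runs on its own.
def call_llm(prompt: str) -> str:
    # A real system would send `prompt` to a model endpoint.
    return (
        "def rule(patient):\n"
        "    return patient['age'] > 60 and patient['temp'] > 38.5\n"
    )

source = call_llm(
    "Write a Python predicate for: elderly patient with high fever."
)
namespace = {}
exec(source, namespace)          # in production: review before exec!
rule = namespace["rule"]

print(rule({"age": 65, "temp": 39.0}))   # True
print(rule({"age": 30, "temp": 39.0}))   # False
```

The natural-language description goes in; the old hand-formalized `if age > 60 and temp > 38.5` comes out as code, no knowledge engineer in the loop.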

Both are progress. Two problems remain.

Rule explosion is still there. Hanging a structured rule graph and a state machine off an LLM does not make rule interactions controllable just because the substrate is now a Transformer. Past a certain threshold, debugging and regression testing become a nightmare just the same.

The business rules keep changing, and the LLM can’t solve that either. No matter how large the model, it doesn’t know what new product you’ll ship next quarter or how a new strategy will cascade through old rules. That has to be maintained by humans. The work is not fundamentally different from thirty years ago.

Strapping an LLM onto a state machine is like using a supercar engine to push an ox cart. The engine is good. The axles and wheels are still from thirty years ago.


Does that mean the expert-system path is wrong for every scenario? Not quite.

Where the rule count is stable, business rules rarely change, and compliance demands interpretability and auditability, expert systems never really died. They just changed names. Bank risk-control engines, clinical decision support, aerospace fault-tree reasoning, all of these are the same paradigm. Adding an LLM here is reasonable: let the LLM handle knowledge acquisition and fuzzy matching, while the rule set itself stays stable. The rule explosion problem is naturally contained by the business boundary.

Two questions are enough to tell whether this path is worth taking: how many times a year do your business rules change, and by what factor will the rule count grow over the next two years?

If the answer is more than ten changes a year and a tripling in two years, you are walking straight back into the pit the teams of the 90s fell into.


Here is an example from my own work. User profile × user state × decision action forms a huge strategy table. Cold-start users and high-frequency users. New users and returning users. Active hours and silent hours. For every combination, what to recommend, what perks to offer, what strategy to run. If you cover all of this with rules, the rule set starts in the thousands. Each new piece of business introduces interactions no one can fully explain. This is rule explosion from the 90s, dressed in modern skin.
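The scale is easy to check: even modest dimension sizes multiply out fast. The cell counts below are invented, but the arithmetic is the point:

```python
from itertools import product

# Invented dimension sizes: 6 profile segments x 8 user states x
# 5 decision actions is already 240 strategy cells; each cell needs
# at least one rule, and most need several.
profiles = range(6)
states = range(8)
actions = range(5)

cells = list(product(profiles, states, actions))
print(len(cells))        # 240

# Add one more 10-way dimension (say, time-of-day buckets)
# and the table is in the thousands.
print(len(cells) * 10)   # 2400
```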

Recommendation ranking solved this problem ten years ago. Ranking doesn’t ask anyone to write “if user age between 25 and 35 and active days in last 7 ≥ 3 and preferred category is A, boost.” Ranking stuffs all of those signals into a model and lets the model learn the optimal ordering from interaction data. Rule explosion was replaced by a model.

The ranking paradigm is: don’t enumerate rules, learn strategy from data. The cost is reduced interpretability. The benefit is no upper bound on rule count.

The ranking paradigm has kept moving. For the past decade, ranking has no longer been “predict click-through rate, sort by probability.” It is multi-objective optimization: CTR, completion, conversion, retention, revenue, several objectives at once. The business sets weights like 0.3, 0.4, 0.2, 0.1 by phase. This quarter is about retention, so push the weight toward retention. Next quarter the goal is revenue, so dial it back. The whole machinery is built on explicit loss functions, differentiable optimizers, and daily feedback ingestion.
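The weight dial is worth making concrete. A toy version of multi-objective scoring follows; the candidates, predicted scores, and weights are all invented, and a real system would fuse calibrated model outputs, but the scheduling knob looks like this:

```python
# Toy multi-objective ranking: each candidate carries per-objective
# predicted scores; business-set weights combine them into one
# number. All numbers here are invented for illustration.
candidates = {
    "item_a": {"ctr": 0.20, "completion": 0.50, "conversion": 0.08, "retention": 0.20},
    "item_b": {"ctr": 0.08, "completion": 0.60, "conversion": 0.02, "retention": 0.60},
}

def rank(weights):
    def score(preds):
        return sum(weights[k] * preds[k] for k in weights)
    return sorted(candidates, key=lambda c: score(candidates[c]), reverse=True)

# Retention quarter: weight mass on retention -> item_b wins.
retention_weights = {"ctr": 0.2, "completion": 0.2, "conversion": 0.1, "retention": 0.5}
print(rank(retention_weights))  # ['item_b', 'item_a']

# Revenue quarter: same models, different dial -> item_a wins.
revenue_weights = {"ctr": 0.2, "completion": 0.1, "conversion": 0.6, "retention": 0.1}
print(rank(revenue_weights))    # ['item_a', 'item_b']
```

Same predictions, same candidates; only the weights change, and the ordering flips. That precise, auditable dial is what a prompt cannot replicate.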

So can you feed “user profile × user state” into an LLM, have it generate the decision action directly, and solve multi-objective optimization that way?

My read is no, at least not with LLMs’ current strengths. The objective in LLM inference is predicting the next token. Its “preferences” are baked into the post-RLHF weights, not something you can dial in by writing 0.3, 0.4, 0.2, 0.1 in a prompt and expecting the model to honor them. Tell it to “balance click and retention” and it will produce something that sounds reasonable, but it won’t distinguish a weight of 0.3 from 0.35 with any precision. The bigger problem is the feedback loop. A ranking model can retrain overnight on yesterday’s impression-click data. An LLM can’t, and the gradient from business KPIs never makes it back into the model.

What an LLM can do here is qualitative work: explain why a piece of content fits a user, identify boundaries to handle carefully, fill in the long tail where data is sparse. These augment ranking, they don’t replace it. Multi-objective optimization remains ML’s home turf.

The LLM-era division of labor is three layers:

  • The ranking model decides “what to recommend”: the combinatorial strategy space and the multi-objective weight scheduling. This is ML’s home turf.
  • The LLM handles “who the user is and what the content is”: translating open-ended semantics, cross-domain integration, and the long tail rules can’t enumerate into features the ranking model can use.
  • The rule engine guards “what we can’t do”: commercial licenses, compliance boundaries, hard constraints that shouldn’t be left to a probabilistic model. The state machine, if it still has a place, lives here as a hard constraint. Its role is constraint, not decision.
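The three layers can be wired together in a few lines. Every function body below is a placeholder (the names `hard_constraints`, `llm_features`, `ranking_score` are mine, and the logic is stubbed); the shape of the pipeline is the point:

```python
# Three-layer decision pipeline (all logic stubbed for illustration):
# rules constrain, the LLM featurizes, the ranking model decides.

def hard_constraints(candidates):
    # Layer: rule engine / state machine as a filter, not a decider.
    # Stub: drop anything without a license flag.
    return [c for c in candidates if c.get("licensed")]

def llm_features(user, candidate):
    # Layer: stand-in for LLM-derived semantic features, e.g.
    # "how well does this content fit this user's interests?"
    fit = 0.7 if candidate["topic"] in user["interests"] else 0.2
    return {"semantic_fit": fit}

def ranking_score(user, candidate):
    # Layer: stand-in for the learned multi-objective model,
    # consuming both behavioral and LLM-derived features.
    feats = llm_features(user, candidate)
    return 0.6 * candidate["pred_ctr"] + 0.4 * feats["semantic_fit"]

def decide(user, candidates):
    allowed = hard_constraints(candidates)
    return max(allowed, key=lambda c: ranking_score(user, c))

user = {"interests": {"cooking"}}
candidates = [
    {"id": "a", "topic": "cooking", "pred_ctr": 0.05, "licensed": True},
    {"id": "b", "topic": "sports", "pred_ctr": 0.30, "licensed": False},
    {"id": "c", "topic": "sports", "pred_ctr": 0.10, "licensed": True},
]
print(decide(user, candidates)["id"])  # "a"
```

Note that "b" never reaches the ranking layer at all: the rule engine vetoes it outright, while everything it allows is scored, not re-ruled.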

Feeding an LLM into a state machine and letting it run rules shuts down what the LLM is good at and routes around what ML solved long ago. At the bottom of it, the paradigm is wrong.


One last footnote. During those technology discussions, Scala came up. It felt purpose-built for a rules-plus-state system: rules as code, declarative transformations, first-class functions, and so on.

Later it hit me that none of those properties were Scala inventions. Lisp had them forty years ago. That generation of expert systems was written in Lisp. We thought we had found a handy new tool. In reality, the problem had led us back forty years.