从标签到记忆 - 重建互联网 App 对用户的理解
From Labels to Memory - Rebuilding How Internet Apps Understand Users
用户打开 app,说了一句话:“我最近很累,想听点不用动脑子的。”
旧系统里,这句话会被拆解成几个关键词,去匹配内容标签,然后返回一批”轻松”或”放松”的内容。有时候还不错,有时候完全跑偏。
新系统里,背后多了一件事:这句话不是在真空里被处理的。系统知道这个用户这周连续听了三个深度财经播客,每天通勤时段;知道他上个月收听时长突然断崖,后来又回来了;知道他两年来对历史类内容有持续的偏好,但最近在探索一些新的东西。这些上下文合在一起,才能理解”累了、想放松”真正意味着什么,该推什么。
这两者的差距,不是算法好不好的问题。是「用户理解」的基础设施不同。
旧世界:便利贴人
过去很多年,互联网 app 理解用户的方式,本质上都是一套「贴标签」的路子。
用户是男性,年龄在 31-40 岁,点击过悬疑类内容,最近七天有过付费行为,来自二线城市。这些是特征,也是标签 — 离散的、静态的枚举值。搭上协同过滤和机器学习,这套路子跑了很多年,也创造了很大的商业价值。
但它有一个根本上的局限:标签描述的是人的截面,不是人的轨迹。
用户为什么这周突然开始听英语课?他最近的收听序列里藏了什么信号?上周他搜了”助眠”,这和他一直以来的内容偏好是什么关系?这些问题,标签体系很难回答,因为它设计上就不是为了这个。
更麻烦的是,在大多数公司里,“推荐”、“搜索”、“推送”、“定价”是不同的团队,各自维护着各自的用户理解。用户在搜索里表达了某种偏好,推荐引擎并不知道。这些碎片化的理解不能复用、不能积累,每个系统都在各自的信息孤岛里猜测同一个人。
新世界:叙事流
大模型出现之后,一件事变得可能:用户可以被理解为一段连续的、有因果关系的行为叙事,而不是标签的集合。
这是质变。
统计模型擅长在大量数据里找到规律,但它处理的是特征,不是意义。大模型不一样,它能推理叙事。同样是”最近听了很多财经播客”,统计模型看到的是一个特征,大模型能推断出:这个用户可能在经历某种职业焦虑,或者在做某个重要决策,或者只是找到了一个喜欢的主播。
这种推理能力,让「叙事式的用户理解」在工程上第一次说得通了。
但随之来了一个新的架构问题:用户的叙事要存在哪里?谁来维护?谁都能用?
两层分离:Context 的架构设计
在工程上,我们把 Agent 的上下文(Context)拆成了两层。
Application Context,业务自管。每一次任务的”规则和边界”:这个 Agent 的目标是什么,可以调用哪些工具,当前设备是手机还是车机,现在是晚上还是通勤时段,这次对话里用户说了什么。这些是局部的、短暂的,生命周期是一次会话。
Platform Context,平台统一提供。跨会话、可复用的用户理解能力:这个用户是谁,长期喜欢什么,最近在关注什么,内容库里有哪些东西跟他当前状态匹配。这些是全局的、持久的,不属于任何一个业务功能,属于整个平台。
Platform Context 里有三件事。其中最稳定的是长期记忆 — 回答”用户是谁”。不是”点击过悬疑内容”,而是”历史文化爱好者,喜欢有深度、能串联背景知识的叙述方式”。这种偏好跨越几个月甚至几年都不变。落到工程上,这是一段自然语言的用户摘要,加上可向量化的兴趣标签。
短期记忆是另一层,回答”用户最近在做什么”。这里有一个容易搞混的区分:短期记忆不等于会话历史。会话历史是某个 Agent 局部的交互记录;短期记忆是跨场景、跨入口的全局视角 — 用户最近七天的播放序列、搜索记录、跳出行为、收藏动作,全都在里面。时效性强,可衰减,更新频繁。
第三件是内容知识,把用户映射到内容空间。用户的记忆再丰富,最终要落到可以推荐、可以播放的具体内容上。内容侧需要同样丰富的理解:这个专辑的核心主题是什么,情感基调是什么,适合什么场景,哪类用户最容易被它打动。用户理解和内容理解做向量化之后,便能做语义匹配的召回。
这不只是 chatbot 的事
现在很多人讲 Context 和 Memory,默认的语境是”让你的 AI 助手更好用”。这个想法太窄了。
推荐、搜索、推送、定价、冷启动 — 每一个需要「理解用户」的地方,都在各自维护一套理解。用户在搜索里表达了某种偏好,推荐引擎不知道。用户在对话里表达了情绪,推送不知道。各做各的,就永远是便利贴的路子,只是换了一套更复杂的便利贴。
真正的转变,是把用户理解变成平台级的共享基础设施。用户在 app 上发生的所有互动,都沉淀进同一个 Memory,不消失在各自系统的日志里。
我们在实践里验证了这件事的效果。AI 搜索接入用户记忆摘要之后,有效播放率达到 82%,次日留存 +1.68%。不是因为搜索算法变好了,而是因为搜索第一次知道了用户是谁。内容推荐接入向量化的用户摘要做召回之后,在低频用户和新用户群体上改善明显 — 恰恰是标签体系最弱的地方,语义记忆的优势最大。
规模化的墙
理想很清晰,工程上有一道绕不过去的墙:成本。
对全量用户做实时大模型推理,成本是天价。五千万用户,每个都要 LLM 推理一遍,如果用云端 API,每天光这一项就能把预算打穿,更何况还有合规限制,不可能用云端 API。
我们趟过了几条弯路:试过把原始行为日志直接丢给大模型,噪音太大,模型迷失在 context 里;试过全量实时推理,撑不住;试过用 embedding 替代传统统计特征,发现两者是互补关系,不是替代关系。
大模型推理最贵,但对高活跃用户(行为数据已经很丰富)边际收益有限。所以做了分层分级 — 把推理资源集中在最需要它的地方:新用户(冷启动,数据稀疏,推理收益最大)和召回用户(信号断档,需要重新建立理解)。
第二个关键是增量更新,不是每次都从零重建用户记忆,而是像 Git 一样只更新发生变化的部分。用户今天多听了几集历史类内容,只需要更新短期记忆,不需要重跑长期画像。这让实时触发成为可能,成本降到可接受的范围。
最后是小模型蒸馏。大模型推理产出 label,拿这些 label 训练专有小模型,用小模型承接高并发的线上场景。大模型做理解,小模型做执行。
优化到最后,日峰值处理 token 量在 500 亿量级,完成五千万用户的大模型理解和向量化,推理成本是云端 API 的 1/30。
记忆是石油,然后呢
Memory 基础设施建好之后,一个更大的问题浮出来了:有了真正懂用户的基础设施,产品形态会怎么变?
现在的互联网 app,产品形态基本是固定的 — 固定的 tab、固定的信息流、固定的内容卡片。个性化发生在内容层(推什么),但界面和结构是静态的。
如果 app 对用户的理解精细到这种程度,会不会有一天,当用户说”我最近很累,想听点不用动脑子的”,系统给他生成的不只是一张推荐列表,而是一个动态组合的「解压专区」 — 特定的色调、精选的片段、配合他当前情绪的引导语?
UI 动态编排、生成式推荐,这些现在听起来还有点科幻。但记忆基础设施是前提。没有这一层,讨论生成式产品形态都是空的。
工程上还有很多关键点没有突破:端侧的实时意图推理、供给侧内容和素材的对齐、幻觉的架构层护栏。
A user opens the app and says: “I’ve been exhausted lately. I just want something that doesn’t require any thinking.”
In the old system, that sentence gets broken into keywords and matched against content labels, and a batch of “light” or “relaxing” content comes back. Sometimes that works. Sometimes it misses completely.
In the new system, something else happens first. The sentence isn’t processed in a vacuum. The system knows this user has listened to three in-depth finance podcasts this week, during the commute every day. It knows his listening time dropped sharply last month before recovering. It knows he’s had a steady preference for history content for two years, but has been exploring something new recently. That context is what makes it possible to understand what “exhausted, want to unwind” actually means — and what to surface.
The gap between these two isn’t about having a better algorithm. It’s about having different infrastructure for understanding users.
Old world: the sticky-note person
For many years, the way internet apps understood users came down to one thing: labels.
Male, 31-40, clicked on mystery content, made a purchase in the last seven days, lives in a second-tier city. These are features — discrete, static enumerated values. Paired with collaborative filtering and machine learning, this approach ran for years and created real business value.
But it has a fundamental limitation: labels describe a cross-section of a person, not their trajectory.
Why did this user suddenly start listening to English lessons this week? What signals are buried in his recent listening sequence? He searched for “sleep aid” last week — what does that have to do with his longer content preferences? Labels can’t answer these questions. They weren’t designed to.
To make matters worse, in most companies recommendations, search, push notifications, and pricing are run by different teams, each maintaining its own version of user understanding. Something a user signals in search never reaches the recommendation engine. These fragmented pictures can’t be shared or compounded. Every system is guessing at the same person from its own isolated corner.
New world: narrative stream
LLMs made something possible: a user can be understood as a continuous, causal narrative of behavior — not a collection of labels.
That’s a real shift.
Statistical models are good at finding patterns in large datasets, but they work with features, not meaning. LLMs are different — they can reason about narrative. “Listened to a lot of finance podcasts lately” is a feature to a statistical model. An LLM can infer: this person might be going through some career anxiety, or making an important decision, or just found a host they like.
That reasoning capability is what makes narrative-style user understanding viable in engineering terms for the first time.
But a new architectural question follows: where does the user’s narrative live? Who maintains it? Who gets to use it?
Two layers: the Context architecture
In practice, we split the Agent’s context into two layers.
Application Context is managed by each product team. It holds the “rules and boundaries” for a single task: what the Agent is trying to do, which tools it can call, whether the user is on a phone or in a car, whether it is nighttime or commute time, and what the user has said in this session. It is local, short-lived, and scoped to one session.
Platform Context is provided centrally by the platform. Cross-session, reusable user understanding: who this user is, what they’ve liked over time, what they’re paying attention to lately, what in the content library matches their current state. Global, persistent, not owned by any single feature — owned by the platform.
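As a rough illustration of the split, the two layers can be modeled as two separate data structures that are merged only when an Agent runs. This is a minimal sketch, not our production schema: every class, field, and function name below (`ApplicationContext`, `PlatformContext`, `assemble_context`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationContext:
    """Per-session context owned by one product team (hypothetical fields)."""
    agent_goal: str
    allowed_tools: list[str]
    device: str       # e.g. "phone" or "car"
    time_slot: str    # e.g. "night" or "commute"
    session_utterances: list[str] = field(default_factory=list)


@dataclass(frozen=True)
class PlatformContext:
    """Cross-session user understanding served by the platform (hypothetical fields)."""
    user_id: str
    long_term_summary: str
    recent_activity: tuple[str, ...]
    matching_content: tuple[str, ...]


def assemble_context(app: ApplicationContext, platform: PlatformContext) -> str:
    """Merge both layers into one text block that an Agent could receive."""
    lines = [
        f"Goal: {app.agent_goal}",
        f"Tools: {', '.join(app.allowed_tools)}",
        f"Device: {app.device}; time slot: {app.time_slot}",
        f"User summary: {platform.long_term_summary}",
        f"Recent activity: {'; '.join(platform.recent_activity)}",
        f"Candidate content: {'; '.join(platform.matching_content)}",
    ]
    lines += [f"User said: {u}" for u in app.session_utterances]
    return "\n".join(lines)


app = ApplicationContext(
    agent_goal="suggest something to listen to",
    allowed_tools=["search_content", "play"],
    device="phone",
    time_slot="night",
    session_utterances=["I've been exhausted lately."],
)
platform = PlatformContext(
    user_id="u1",
    long_term_summary="history and culture enthusiast",
    recent_activity=("3 finance podcasts this week",),
    matching_content=("light history stories",),
)
prompt = assemble_context(app, platform)
```

Making `PlatformContext` read-only (`frozen=True`) is one way to express that a feature team consumes it but does not own it.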
Platform Context has three parts. The most stable is long-term memory — answering “who is this user.” Not “clicked on mystery content,” but “a history and culture enthusiast who prefers deep, well-contextualized storytelling.” That kind of preference doesn’t change much over months or years. In engineering terms, it’s a natural language user summary plus vectorizable interest tags.
Short-term memory is the next layer, answering “what has this user been doing lately.” There’s a distinction worth keeping clear: short-term memory is not the same as conversation history. Conversation history is a local log for one Agent. Short-term memory is a global view across contexts and entry points — the user’s playback sequence, search queries, drop-off behavior, and saved items from the past seven days, all in one place. It decays over time and updates frequently.
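One way to picture “decays over time” is an exponential weight on each event inside the seven-day window. The sketch below is only an assumption about how such a decay could look: the two-day half-life, the `Event` fields, and the decision to weight all event types equally are illustrative choices, not details from our system.

```python
from collections import defaultdict
from dataclasses import dataclass

WINDOW_DAYS = 7.0      # window length mentioned in the text
HALF_LIFE_DAYS = 2.0   # assumed: an event loses half its weight every 2 days


@dataclass
class Event:
    source: str      # e.g. "playback", "search", "save", "drop_off"
    topic: str
    age_days: float  # how long ago the event happened


def topic_weights(events, half_life=HALF_LIFE_DAYS, window=WINDOW_DAYS):
    """Sum exponentially decayed weights per topic, ignoring events outside the window."""
    weights = defaultdict(float)
    for e in events:
        if e.age_days > window:
            continue  # too old for short-term memory
        weights[e.topic] += 0.5 ** (e.age_days / half_life)
    return dict(weights)


events = [
    Event("playback", "finance", 0.0),   # weight 1.0
    Event("search", "finance", 2.0),     # weight 0.5
    Event("playback", "history", 4.0),   # weight 0.25
    Event("search", "sleep", 10.0),      # outside the 7-day window, dropped
]
weights = topic_weights(events)
```

A real system would likely treat a drop-off differently from a save; here they are counted the same to keep the decay itself visible.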
The third part is content knowledge — mapping the user into content space. However rich the user’s memory is, it has to land on actual content that can be recommended and played. The content side needs equally rich understanding: what is this album really about, what’s its emotional tone, what situations does it fit, which users are most likely to connect with it. Once user understanding and content understanding are both vectorized, semantic retrieval becomes possible.
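Once both sides are vectors, retrieval can be as simple as ranking content by cosine similarity to the user vector. The sketch below uses hand-written three-dimensional toy vectors and hypothetical content IDs. Real embeddings would come from a model and have far more dimensions, and a production system would use an approximate nearest-neighbor index rather than a full sort.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(user_vec, content_vecs, k=2):
    """Return the IDs of the k content items most similar to the user vector."""
    ranked = sorted(
        content_vecs.items(),
        key=lambda item: cosine(user_vec, item[1]),
        reverse=True,
    )
    return [content_id for content_id, _ in ranked[:k]]


# Toy axes (assumed for illustration): history, finance, relaxation
user_vec = [0.8, 0.3, 0.5]
content_vecs = {
    "dynasty_stories": [0.9, 0.0, 0.4],
    "market_deep_dive": [0.1, 0.9, 0.0],
    "rain_sounds": [0.0, 0.0, 1.0],
}
top = retrieve(user_vec, content_vecs, k=2)
```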
This isn’t just about chatbots
Most people talking about Context and Memory frame it as “making your AI assistant more useful.” That’s too narrow.
Recommendations, search, push notifications, pricing, cold start: every function that needs to understand users is maintaining its own separate picture. What a user signals in search never reaches the recommendation engine. What they express in a conversation never reaches push. When each team does its own thing, the sticky-note approach never really goes away. It just gets re-implemented as a more complicated set of sticky notes.
The real shift is turning user understanding into shared platform-level infrastructure. Every interaction a user has in the app accumulates into the same Memory, instead of disappearing into each system’s own logs.
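In code terms, the idea is that every surface writes to, and reads from, one store keyed by user, rather than its own log. The sketch below is an in-memory stand-in with hypothetical names (`SharedMemory`, `Interaction`). It says nothing about how such a store is actually persisted or served at scale.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    surface: str  # e.g. "search", "recommendation", "conversation", "push"
    kind: str     # e.g. "query", "play", "utterance"
    detail: str


class SharedMemory:
    """One per-user interaction log shared by every surface (in-memory sketch)."""

    def __init__(self):
        self._by_user = defaultdict(list)

    def record(self, user_id, interaction):
        self._by_user[user_id].append(interaction)

    def recent(self, user_id, limit=20):
        # .get avoids creating empty entries for users we have never seen
        return list(self._by_user.get(user_id, [])[-limit:])


memory = SharedMemory()
# The search surface writes what the user asked for...
memory.record("u1", Interaction("search", "query", "sleep aid"))
# ...the conversational Agent writes what the user said...
memory.record("u1", Interaction("conversation", "utterance", "I've been exhausted lately"))
# ...and the recommender reads both, although it produced neither.
seen_by_recommender = memory.recent("u1")
```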
We’ve validated this in practice. After connecting AI search to the user memory summary, effective play rate reached 82% and next-day retention improved by 1.68pp. Not because the search algorithm got better — because search finally knew who the user was. After connecting content recommendations to vectorized user summaries for retrieval, we saw clear improvements for low-frequency and new users — exactly where label-based systems are weakest, and where semantic memory has the most to offer.
The scaling wall
The vision is clear. The engineering problem is cost.
Running real-time LLM inference across all users is prohibitively expensive. With fifty million users each requiring an LLM pass, cloud API pricing alone would blow through the budget every day. On top of that, compliance requirements rule out cloud APIs entirely.
We went down several wrong paths. Feeding raw behavior logs directly to the LLM: too much noise, the model gets lost in context. Running full real-time inference across all users: not sustainable. Replacing traditional statistical features with embeddings: turned out they’re complementary, not substitutes.
LLM inference is the most expensive step, but for highly active users, whose behavioral data is already rich, the marginal return is limited. So we tiered it, concentrating inference where it matters most: new users (cold start, sparse data, highest return from inference) and win-back users returning after a lapse (a gap in signals, so the understanding has to be rebuilt).
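The tiering rule can be sketched as a small routing function. The cutoffs below (7 days for “new”, 30 days of inactivity for “lapsed”) and all names are assumptions made for illustration. The text also does not specify what low-priority users get instead, so the sketch only labels them.

```python
from dataclasses import dataclass
from enum import Enum

NEW_USER_MAX_AGE_DAYS = 7   # assumed cutoff for "new user"
LAPSE_MIN_GAP_DAYS = 30     # assumed inactivity gap that counts as a lapse


class Tier(Enum):
    PRIORITY_LLM = "priority_llm"   # spend LLM inference here first
    LOW_PRIORITY = "low_priority"   # rich behavioral data, limited marginal return


@dataclass
class UserActivity:
    account_age_days: int
    gap_before_last_visit_days: int  # inactivity before the most recent visit


def choose_tier(user: UserActivity) -> Tier:
    """Route new users and users returning after a lapse to the LLM tier."""
    is_new = user.account_age_days <= NEW_USER_MAX_AGE_DAYS
    is_returning = user.gap_before_last_visit_days >= LAPSE_MIN_GAP_DAYS
    return Tier.PRIORITY_LLM if (is_new or is_returning) else Tier.LOW_PRIORITY


new_user = UserActivity(account_age_days=2, gap_before_last_visit_days=0)
lapsed_user = UserActivity(account_age_days=400, gap_before_last_visit_days=45)
daily_user = UserActivity(account_age_days=400, gap_before_last_visit_days=1)
```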
The second piece was incremental updates. Instead of rebuilding user memory from scratch each time, we update only what changed, much like Git. If a user listened to a few more history episodes today, we update short-term memory without rerunning the long-term profile. This made real-time triggering practical and brought costs into an acceptable range.
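A minimal sketch of that update path: new events always land in short-term memory, and the expensive long-term rebuild runs only when a trigger fires. The trigger used here (a count of events since the last rebuild) and the threshold of 50 are assumptions. The real condition could equally be time-based or drift-based, and `rebuild_long_term` stands in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable

REBUILD_AFTER_EVENTS = 50  # assumed threshold, not a figure from our system


@dataclass
class UserMemory:
    long_term_summary: str
    short_term_events: list[str] = field(default_factory=list)
    events_since_rebuild: int = 0


def apply_events(
    memory: UserMemory,
    new_events: list[str],
    rebuild_long_term: Callable[[UserMemory], str],
    threshold: int = REBUILD_AFTER_EVENTS,
) -> bool:
    """Append events to short-term memory; rebuild long-term only past the threshold.

    Returns True if the long-term summary was rebuilt.
    """
    memory.short_term_events.extend(new_events)
    memory.events_since_rebuild += len(new_events)
    if memory.events_since_rebuild < threshold:
        return False  # cheap path: long-term profile untouched
    memory.long_term_summary = rebuild_long_term(memory)
    memory.events_since_rebuild = 0
    return True


rebuild_calls = []


def fake_rebuild(mem: UserMemory) -> str:
    """Stand-in for the expensive LLM summarization step."""
    rebuild_calls.append(len(mem.short_term_events))
    return "history enthusiast, recently exploring finance"


mem = UserMemory(long_term_summary="history enthusiast")
rebuilt_first = apply_events(mem, ["history ep 1", "history ep 2"], fake_rebuild)
rebuilt_second = apply_events(mem, ["finance ep 1"], fake_rebuild, threshold=3)
```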
The third was model distillation. LLM inference generates labels; those labels train a smaller purpose-built model; the small model handles high-concurrency production traffic. Large model for understanding, small model for execution.
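The shape of that pipeline can be shown with a toy example: an expensive “teacher” labels a batch of texts offline, and a cheap “student” is trained on those labels and serves requests. Everything here is a stand-in. `teacher_label` is a keyword rule pretending to be an LLM, and `StudentModel` is a tiny word-frequency classifier, not the kind of model one would actually deploy.

```python
from collections import Counter, defaultdict


def teacher_label(text: str) -> str:
    """Stand-in for an expensive LLM call (a hypothetical keyword rule)."""
    return "relax" if any(w in text for w in ("tired", "sleep", "unwind")) else "learn"


class StudentModel:
    """Tiny word-frequency classifier trained on teacher-produced labels."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())

    def predict(self, text):
        words = text.lower().split()

        def score(label):
            counts = self.word_counts[label]
            total = sum(counts.values()) or 1
            # Sum of relative word frequencies under this label
            return sum(counts[w] / total for w in words)

        return max(self.word_counts, key=score)


texts = [
    "so tired after work",
    "help me sleep tonight",
    "want to unwind with music",
    "learn english grammar",
    "finance news analysis",
    "history of rome lecture",
]
labels = [teacher_label(t) for t in texts]  # offline, expensive step

student = StudentModel()
student.fit(texts, labels)                  # train the cheap model once

pred_relax = student.predict("tired and want music")     # online, cheap step
pred_learn = student.predict("english history lecture")
```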
After all the optimization, peak daily volume is on the order of 50 billion tokens, covering LLM understanding and vectorization for 50 million users at 1/30th the cost of cloud APIs.
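To put those figures in proportion, here is some back-of-the-envelope arithmetic. It assumes the peak volume is spread evenly across all 50 million users, which the figures above do not actually state. The API price is a placeholder parameter, not a real quote.

```python
PEAK_TOKENS_PER_DAY = 50_000_000_000  # "on the order of 50 billion" from the text
USERS = 50_000_000                    # from the text
COST_RATIO = 1 / 30                   # self-hosted cost relative to cloud API, from the text

# Assumption: tokens are spread evenly over all users on the peak day
tokens_per_user = PEAK_TOKENS_PER_DAY / USERS


def daily_costs(api_price_per_million_tokens: float):
    """Return (cloud API cost, self-hosted cost) for one peak day.

    The price is a hypothetical input in whatever currency unit you choose.
    """
    api_cost = PEAK_TOKENS_PER_DAY / 1_000_000 * api_price_per_million_tokens
    return api_cost, api_cost * COST_RATIO


# Placeholder price of 1.0 currency unit per million tokens
api_cost, self_hosted_cost = daily_costs(1.0)
```

Under the even-spread assumption this works out to roughly a thousand tokens per user per day, which is consistent with feeding the model a compact summary rather than raw behavior logs.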
Memory as infrastructure. Then what?
Once the Memory infrastructure is in place, a bigger question surfaces: if you actually understand your users this well, how does the product change?
Today’s internet apps are mostly fixed in form — fixed tabs, fixed feeds, fixed content cards. Personalization happens at the content layer (what to show), but the interface and structure are static.
If an app understands a user precisely enough, could there be a day when “I’ve been exhausted lately, I just want something mindless” generates not just a recommendation list, but a dynamically assembled “decompression space” — a specific color palette, curated clips, copy written to match the user’s current mood?
Dynamic UI composition, generative recommendations — these still sound a bit like science fiction. But the Memory infrastructure is the prerequisite. Without it, the conversation about generative product forms is empty.
There’s still a lot to work out on the engineering side: real-time intent inference on-device, aligning the supply side’s content and assets, and guardrails at the architecture level for hallucination.