From Labels to Memory - Rebuilding How Internet Apps Understand Users
A user opens the app and says: “I’ve been exhausted lately. I just want something that doesn’t require any thinking.”
In the old system, that sentence gets broken into keywords, matched against content labels, and returns a batch of “light” or “relaxing” content. Sometimes it works. Often it doesn’t.
In the new system, something else happens first. The sentence isn’t processed in a vacuum. The system knows this user has listened to three in-depth finance podcasts this week, each during his daily commute. It knows his listening time dropped sharply last month before recovering. It knows he’s had a steady preference for history content for two years, but has been exploring something new recently. That context is what makes it possible to understand what “exhausted, nothing that requires thinking” actually means, and what to surface.
The gap between these two isn’t about having a better algorithm. It’s about having different infrastructure for understanding users.
Old world: the sticky-note person
For many years, the way internet apps understood users came down to one thing: labels.
Male, 31-40, clicked on mystery content, made a purchase in the last seven days, lives in a second-tier city. These are features — discrete, static enumerated values. Paired with collaborative filtering and machine learning, this approach ran for years and created real business value.
But it has a fundamental limitation: labels describe a cross-section of a person, not their trajectory.
Why did this user suddenly start listening to English lessons this week? What signals are buried in his recent listening sequence? He searched for “sleep aid” last week — what does that have to do with his longer content preferences? Labels can’t answer these questions. They weren’t designed to.
Making it worse: in most companies, recommendations, search, push notifications, and pricing are run by different teams, each maintaining their own version of user understanding. Something a user signals in search never reaches the recommendation engine. These fragmented pictures can’t be shared or compounded. Every system is guessing at the same person from its own isolated corner.
New world: narrative stream
LLMs made something possible: a user can be understood as a continuous, causal narrative of behavior — not a collection of labels.
That’s a real shift.
Statistical models are good at finding patterns in large datasets, but they work with features, not meaning. LLMs are different — they can reason about narrative. “Listened to a lot of finance podcasts lately” is a feature to a statistical model. An LLM can infer: this person might be going through some career anxiety, or making an important decision, or just found a host they like.
That reasoning capability is what makes narrative-style user understanding viable as an engineering problem for the first time.
But a new architectural question follows: where does the user’s narrative live? Who maintains it? Who gets to use it?
Two layers: the Context architecture
In practice, we split the Agent’s context into two layers.
Application Context is managed by each product team. The “rules and boundaries” for each task: what is this Agent trying to do, what tools can it call, is the user on a phone or in a car, is it nighttime or commute time, what did the user say in this session. Local, short-lived, scoped to one conversation.
Platform Context is provided centrally by the platform. Cross-session, reusable user understanding: who this user is, what they’ve liked over time, what they’re paying attention to lately, what in the content library matches their current state. Global, persistent, not owned by any single feature — owned by the platform.
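As a rough illustration of the split, here is a minimal Python sketch of the two layers as plain data structures, plus a helper that merges them into one block of text an Agent could receive. Every class, field, and function name here is hypothetical and chosen for illustration; the text does not specify a concrete schema.

```python
from dataclasses import dataclass, field


@dataclass
class ApplicationContext:
    """Local, short-lived context owned by one product team (hypothetical fields)."""
    task: str
    tools: list[str]
    device: str
    time_of_day: str
    session_turns: list[str] = field(default_factory=list)


@dataclass
class PlatformContext:
    """Cross-session user understanding served by the platform (hypothetical fields)."""
    user_summary: str
    recent_focus: list[str]
    matching_content: list[str]


def assemble_context(app: ApplicationContext, platform: PlatformContext) -> str:
    """Merge both layers into one text block, platform lines first."""
    lines = [
        "[platform] user: " + platform.user_summary,
        "[platform] recent focus: " + ", ".join(platform.recent_focus),
        "[platform] candidate content: " + ", ".join(platform.matching_content),
        "[app] task: " + app.task,
        "[app] tools: " + ", ".join(app.tools),
        f"[app] situation: {app.device}, {app.time_of_day}",
    ]
    lines += ["[app] user said: " + turn for turn in app.session_turns]
    return "\n".join(lines)


app = ApplicationContext(
    task="suggest something to listen to",
    tools=["search_catalog", "start_playback"],
    device="car",
    time_of_day="evening commute",
    session_turns=["I've been exhausted lately."],
)
platform = PlatformContext(
    user_summary="long-time history listener",
    recent_focus=["finance podcasts"],
    matching_content=["light history stories"],
)
prompt = assemble_context(app, platform)
print(prompt)
```

The point of the shape is ownership: a product team can change anything in ApplicationContext without touching the platform, while PlatformContext arrives the same way for every feature that asks for it.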
Platform Context has three parts. The most stable is long-term memory — answering “who is this user.” Not “clicked on mystery content,” but “a history and culture enthusiast who prefers deep, well-contextualized storytelling.” That kind of preference doesn’t change much over months or years. In engineering terms, it’s a natural language user summary plus vectorizable interest tags.
Short-term memory is the second part, answering “what has this user been doing lately.” There’s a distinction worth keeping clear: short-term memory is not the same as conversation history. Conversation history is a local log for one Agent. Short-term memory is a global view across contexts and entry points: the user’s playback sequence, search queries, drop-off behavior, and saved items from the past seven days, all in one place. It decays over time and updates frequently.
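One way to picture "decays over time" is an exponential weight on each recent event, so that yesterday's plays count more than last week's. The sketch below is an assumption, not a description of the production system: the two-day half-life, the Event fields, and the function name are all invented for illustration. Only the seven-day window comes from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

WINDOW_DAYS = 7        # the seven-day window mentioned in the text
HALF_LIFE_DAYS = 2.0   # hypothetical: an event loses half its weight every 2 days


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "play", "search", "save", "drop_off"
    topic: str
    at: datetime


def decayed_topic_weights(events: list[Event], now: datetime) -> dict[str, float]:
    """Sum exponentially decayed weights per topic over the recent window."""
    weights: dict[str, float] = {}
    for event in events:
        age_days = (now - event.at).total_seconds() / 86400
        if age_days < 0 or age_days > WINDOW_DAYS:
            continue  # ignore future-dated events and anything outside the window
        weight = 0.5 ** (age_days / HALF_LIFE_DAYS)
        weights[event.topic] = weights.get(event.topic, 0.0) + weight
    return weights


now = datetime(2024, 1, 8, 9, 0)
events = [
    Event("play", "finance", now),
    Event("play", "history", now - timedelta(days=2)),
    Event("search", "sleep aid", now - timedelta(days=8)),  # outside the window
]
weights = decayed_topic_weights(events, now)
```

Because events from every entry point (playback, search, saves) go through the same function, the result is a single cross-context view, which is the distinction from a per-Agent conversation log.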
The third part is content knowledge — mapping the user into content space. However rich the user’s memory is, it has to land on actual content that can be recommended and played. The content side needs equally rich understanding: what is this album really about, what’s its emotional tone, what situations does it fit, which users are most likely to connect with it. Once user understanding and content understanding are both vectorized, semantic retrieval becomes possible.
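To make "semantic retrieval" concrete, here is a small pure-Python sketch that ranks content by cosine similarity to a user vector. The three-dimensional vectors and content IDs are toy values made up for the example; real embeddings would come from a model and have far more dimensions, and a production system would typically use an approximate nearest-neighbor index in place of a full sort.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(user_vec: list[float], content_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the IDs of the k content items most similar to the user vector."""
    ranked = sorted(
        content_vecs.items(),
        key=lambda item: cosine(user_vec, item[1]),
        reverse=True,
    )
    return [content_id for content_id, _ in ranked[:k]]


# Toy vectors: hypothetical axes are (history, finance, sleep).
user_vec = [0.9, 0.1, 0.0]
content_vecs = {
    "history_deep_dive": [1.0, 0.0, 0.0],
    "finance_weekly": [0.5, 0.5, 0.0],
    "ambient_sleep": [0.0, 0.0, 1.0],
}
top = retrieve(user_vec, content_vecs, k=2)
```

Retrieval only works because both sides live in the same vector space, which is why the content side needs the same depth of understanding as the user side.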
This isn’t just about chatbots
Most people talking about Context and Memory frame it as “making your AI assistant more useful.” That’s too narrow.
Recommendations, search, push notifications, pricing, cold start: every function that needs to understand users maintains its own separate picture. What a user expresses in a conversation never reaches push, just as what they signal in search never reaches recommendations. When each team does its own thing, the label approach never really goes away; it just gets re-implemented with more complexity.
The real shift is turning user understanding into shared platform-level infrastructure. Every interaction a user has in the app accumulates into the same Memory, instead of disappearing into each system’s own logs.
We’ve validated this in practice. After connecting AI search to the user memory summary, effective play rate reached 82% and next-day retention improved by 1.68 percentage points. The gain came not from a better search algorithm but from search finally knowing who the user was. After connecting content recommendations to vectorized user summaries for retrieval, we saw clear improvements for low-frequency and new users, which is exactly where label-based systems are weakest and where semantic memory has the most to offer.
The scaling wall
The vision is clear. The engineering problem is cost.
Running real-time LLM inference across all users is expensive. With fifty million users each requiring an LLM pass, cloud API prices alone would blow through the budget in a day. On top of that, compliance requirements rule out cloud APIs entirely.
We went down several wrong paths. Feeding raw behavior logs directly to the LLM: too much noise, the model gets lost in context. Running full real-time inference across all users: not sustainable. Replacing traditional statistical features with embeddings: turned out they’re complementary, not substitutes.
The first fix was tiering. LLM inference is the most expensive step in the pipeline, but for highly active users, whose behavioral data is already rich, its marginal return is limited. So we concentrate inference where it matters most: new users (cold start, sparse data, highest return from inference) and returning users (a gap in signal, so user understanding has to be rebuilt).
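The tiering rule can be sketched as a small routing function. The thresholds below (seven days for "new", a thirty-day gap for "returning") and the name of the fallback tier are assumptions made for the example; the text does not state the actual cutoffs or what active users are routed to.

```python
from enum import Enum


class Tier(Enum):
    LLM_INFERENCE = "llm_inference"        # spend LLM budget on this user
    DEFAULT_PIPELINE = "default_pipeline"  # hypothetical name for the cheaper path


NEW_USER_MAX_AGE_DAYS = 7    # hypothetical cutoff for "new user"
RETURNING_MIN_GAP_DAYS = 30  # hypothetical gap that marks a "returning user"


def choose_tier(account_age_days: int, gap_before_this_visit_days: int) -> Tier:
    """Route a user to LLM inference only when the expected return is high."""
    is_new = account_age_days <= NEW_USER_MAX_AGE_DAYS
    is_returning = gap_before_this_visit_days >= RETURNING_MIN_GAP_DAYS
    if is_new or is_returning:
        return Tier.LLM_INFERENCE
    # Highly active users already have rich behavioral data,
    # so the marginal return from an LLM pass is limited.
    return Tier.DEFAULT_PIPELINE


brand_new = choose_tier(account_age_days=2, gap_before_this_visit_days=0)
came_back = choose_tier(account_age_days=400, gap_before_this_visit_days=45)
daily_listener = choose_tier(account_age_days=400, gap_before_this_visit_days=1)
```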
The second piece was incremental updates. Instead of rebuilding user memory from scratch each time, we only update what changed — like Git. A user listened to a few more history episodes today: update short-term memory, no need to rerun the long-term profile. This made real-time triggering practical and brought costs into range.
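A minimal sketch of the incremental idea, assuming memory is stored as a long-term summary plus a bounded list of recent events. The structure, the event tuples, and the size cap are hypothetical; the point is only that new behavior touches short-term memory and leaves the expensive long-term profile alone.

```python
from dataclasses import dataclass, field


@dataclass
class UserMemory:
    long_term_summary: str                                   # expensive to rebuild
    short_term_events: list[tuple[str, str]] = field(default_factory=list)


def apply_delta(memory: UserMemory, new_events: list[tuple[str, str]], max_events: int = 500) -> int:
    """Append only unseen events to short-term memory; return how many were added.

    The long-term summary is deliberately not touched here.
    """
    seen = set(memory.short_term_events)
    added = 0
    for event in new_events:
        if event in seen:
            continue  # already recorded, nothing changed
        memory.short_term_events.append(event)
        seen.add(event)
        added += 1
    overflow = len(memory.short_term_events) - max_events
    if overflow > 0:
        del memory.short_term_events[:overflow]  # drop the oldest entries
    return added


memory = UserMemory(long_term_summary="history and culture enthusiast")
first = apply_delta(memory, [("play", "history_ep_1"), ("play", "history_ep_2")])
second = apply_delta(memory, [("play", "history_ep_2"), ("search", "sleep aid")])
```

In this sketch a long-term refresh would be a separate, rarer job triggered by its own condition, which is what keeps the frequent path cheap enough to run in real time.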
The third was model distillation. LLM inference generates labels; those labels train a smaller purpose-built model; the small model handles high-concurrency production traffic. Large model for understanding, small model for execution.
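The label-then-train loop can be shown with a deliberately tiny stand-in. Here the "teacher" is a plain function standing in for an expensive LLM call, and the "student" is a nearest-centroid classifier trained only on the teacher's labels. Both are assumptions picked to keep the example self-contained; the text does not say what the small model is or how it is trained.

```python
Vector = tuple[float, float]  # hypothetical features: (finance share, history share)


def teacher_label(features: Vector) -> str:
    """Stand-in for an expensive LLM judgment (hypothetical rule, not a real model)."""
    finance_share, history_share = features
    return "finance" if finance_share > history_share else "history"


def train_student(samples: list[Vector]) -> dict[str, list[float]]:
    """Fit one centroid per teacher-assigned label."""
    sums: dict[str, list[float]] = {}
    counts: dict[str, int] = {}
    for features in samples:
        label = teacher_label(features)  # the only place the teacher is called
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def student_predict(centroids: dict[str, list[float]], features: Vector) -> str:
    """Cheap inference: pick the label whose centroid is closest."""
    def squared_distance(label: str) -> float:
        return sum((c - x) ** 2 for c, x in zip(centroids[label], features))
    return min(centroids, key=squared_distance)


samples = [(0.9, 0.1), (0.8, 0.2), (0.1, 0.9), (0.2, 0.8)]
centroids = train_student(samples)
prediction = student_predict(centroids, (0.7, 0.3))
```

After training, production traffic only ever calls student_predict, so the cost of the teacher is paid once per training sample and not once per request.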
After all the optimization: peak daily token volume in the 50-billion range, covering 50 million users with LLM understanding and vectorization, at 1/30th the cost of cloud APIs.
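As a back-of-the-envelope check on those two figures, dividing one by the other gives the average token budget per covered user on a peak day. This is only an average under the assumption that the peak volume is attributed to all 50 million users; with tiering and incremental updates the real distribution is presumably uneven.

```python
peak_daily_tokens = 50_000_000_000  # "50-billion range" from the text
covered_users = 50_000_000          # users covered, from the text

# Average tokens per covered user on a peak day (an average, not a per-user quota).
avg_tokens_per_user = peak_daily_tokens // covered_users
print(avg_tokens_per_user)
```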
Memory as infrastructure. Then what?
Once the Memory infrastructure is in place, a bigger question surfaces: if you actually understand your users this well, how does the product change?
Today’s internet apps are mostly fixed in form — fixed tabs, fixed feeds, fixed content cards. Personalization happens at the content layer (what to show), but the interface and structure are static.
If an app understands a user precisely enough, could there be a day when “I’ve been exhausted lately, I just want something mindless” generates not just a recommendation list, but a dynamically assembled “decompression space” — a specific color palette, curated clips, copy written to match the user’s current mood?
Dynamic UI composition, generative recommendations — these still sound a bit like science fiction. But the Memory infrastructure is the prerequisite. Without it, the conversation about generative product forms is empty.
There’s still a lot to work out on the engineering side: real-time intent inference on-device, aligning supply-side content and assets, and architecture-level guardrails against hallucination.