New Views on the Future of Data
Six years ago I wrote two pieces on the future of data: one on three directions for data development technology, one on three directions for data products. Both were broad directional bets, made against the technology stack I could see in 2020.
Looking back today, almost all the directions came true, but none of the implementation paths matched what I had imagined. More importantly, all six directions rested on one unspoken premise: the consumer of data is human. That premise has been shaken.
Time to revise.
Looking Back
In the order of the two original pieces.
Three Directions for Data Engineering
Stream-batch unification. Came true, and no longer discussed as a standalone topic. With Lakehouse plus table formats like Iceberg / Hudi / Paimon, the physical boundary between streaming and batch was erased at the storage layer, and storage unification is mature. On the code side, as coding agents grow more capable, whether stream and batch code are unified no longer matters much.
Code automation. The direction was right, but the path I imagined back then was off. I was looking at the Dataphin route: visual modeling plus config-driven code generation. Today people write SQL and code with coding agents, and auto-optimization is increasingly pushed down into the engines themselves. Low-code didn’t die, but it’s no longer mainstream. Even low-code vendors have added AI assistants, replacing “drag to build a data warehouse” with “describe what you need in natural language.”
The decline of OLAP Cubes. Came true. Lakehouse plus MPP columnar engines became the de facto standard, and precomputed Cubes were retired in most scenarios. As engines like StarRocks / Doris matured, querying detail tables directly became fast enough that precomputed aggregation stopped being worth its cost in most cases. My earlier concern that “this would be hard on the business side, needing BI tools to evolve” turned out to be overcautious: Agents absorbed that layer directly.
Three Directions for Data Products
BI / low-code for building data products. Still valid today, but the BI entry point is being further absorbed by Agents. Needs that used to become dashboards are increasingly resolved in conversation. Over the last two years, the headline updates from Tableau and Power BI have been about adding AI Copilots, and there is little innovation on BI itself. That’s a signal in itself. BI isn’t dead, but it has moved from main entry point to fallback.
Data products and business products merging into one. Came true, and pushed further than I had imagined. Back then I pictured “embedding diagnosis and SOPs inside the product.” Today Agents fetch the data, draw the conclusion, and call the business tools themselves; the “data product” shell often no longer exists as a separate layer. Claude Code and Cursor, this wave of coding agents, are the earliest examples of the pattern. Engineers no longer open three panels (code search, docs, Slack) to decide what to write. They just ask the AI. The pattern is spreading from coding to every scenario where data informs a decision.
Interactive and conversational analysis. Came true, but by a path I had not imagined. My plan had three layers: first, a natural language understanding layer to translate spoken questions into structured queries; second, a controlled domain vocabulary to force user questions onto predefined metrics and entities; third, on top of those, a semi-structured semantic layer. What actually happened was that LLMs collapsed the first two layers directly. The ontology-style semantic layer was replaced by layered context engineering, which is much more flexible than a rigid ontology.
In short, almost all six directions came true. But there was one thing I completely failed to imagine. Every path I sketched assumed the consumer of data was human, so the work was about making data fast to query, easy to read, and workable even when a person had to fill in the gaps. After LLMs arrived, data gained a new kind of consumer: the Agent.
New Directions
A fresh look at where things are heading.
Data Engineering
First, context engineering becomes the foundational work of data construction. The skeleton of data construction used to be “data warehouse modeling + metadata governance + metric systems,” all in service of making queries fast and accurate for humans. In the Agent era, the core work of data construction is making the business intuition that used to live inside people’s heads explicit, as semantic assets that Agents can read. There’s no standard methodology for this yet; many teams are still figuring it out. It will come together over time.
Second, the data interface expands from query language to capability unit. The data team used to expose interfaces like SQL, APIs, and dashboards. Today, what Agents call are Skills and Tools, where a single call carries query capability plus business semantics. Standardization, cross-scenario reuse, and version management for Skills / Tools will become the core engineering work of the data platform. Since we opened the data platform tools as a CLI this year, we have accumulated several high-frequency Skills (query, segmentation, profile, behavior sequence, A/B evaluation, and others), which are now reused at scale across multiple business lines. This paradigm is already the de facto standard.
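To make "version management for Skills" concrete, here is a minimal sketch of a registry that stores several versions of a Skill and lets a caller either take the latest or pin an older one. Everything in it is an assumption made for illustration: the `Skill` and `SkillRegistry` names, the tuple-based version scheme, and the toy `segmentation` handlers are hypothetical, not the platform described above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple

Version = Tuple[int, int, int]


@dataclass(frozen=True)
class Skill:
    name: str
    version: Version
    description: str  # business semantics the Agent reads before calling
    handler: Callable[..., Any]


class SkillRegistry:
    """Hypothetical in-memory registry; a real platform would persist this."""

    def __init__(self) -> None:
        self._skills: Dict[str, Dict[Version, Skill]] = {}

    def register(self, skill: Skill) -> None:
        versions = self._skills.setdefault(skill.name, {})
        if skill.version in versions:
            raise ValueError(f"{skill.name} {skill.version} is already registered")
        versions[skill.version] = skill

    def resolve(self, name: str, version: Optional[Version] = None) -> Skill:
        versions = self._skills[name]
        # Default to the highest version; callers may pin an older one.
        return versions[version if version is not None else max(versions)]

    def call(self, name: str, version: Optional[Version] = None, **kwargs: Any) -> Any:
        return self.resolve(name, version).handler(**kwargs)

    def catalog(self) -> List[Dict[str, str]]:
        """Latest version of each Skill, in a shape an Agent could be shown."""
        entries = []
        for name in sorted(self._skills):
            skill = self.resolve(name)
            entries.append({
                "name": name,
                "version": ".".join(str(part) for part in skill.version),
                "description": skill.description,
            })
        return entries


# Toy data and handlers, invented for this example.
USERS = [{"id": 1, "sessions": 5}, {"id": 2, "sessions": 1}, {"id": 3, "sessions": 9}]


def segment_v1(min_sessions: int) -> List[int]:
    return [u["id"] for u in USERS if u["sessions"] >= min_sessions]


def segment_v1_1(min_sessions: int, limit: Optional[int] = None) -> List[int]:
    ids = segment_v1(min_sessions)
    return ids if limit is None else ids[:limit]


registry = SkillRegistry()
registry.register(Skill("segmentation", (1, 0, 0), "Users with at least N sessions.", segment_v1))
registry.register(Skill("segmentation", (1, 1, 0), "Same, with an optional limit.", segment_v1_1))

latest = registry.call("segmentation", min_sessions=3, limit=1)
pinned = registry.call("segmentation", version=(1, 0, 0), min_sessions=3)
print(latest, pinned)
```

Keeping the description next to the handler is the point of the sketch: the unit being versioned is the query capability together with the semantics the Agent needs in order to call it correctly.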
Third, evaluation and feedback loops are promoted from R&D byproduct to infrastructure. Data quality used to rely on validating deliverables plus monitoring alerts, with people discovering and resolving problems. In the Agent era, quality is a production-system problem. How do you build the eval set? How do you fix the Agent when it errs? How do you fold that fix back into the context so the error doesn’t repeat? How do you share that learning across scenarios and users? OpenAI’s Evals framework and Anthropic’s evals toolchain are early industrial forms of this path. What data teams used to do as “data quality monitoring” has to become “Agent behavior evaluation,” and the two differ all the way from concept to toolchain.
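The loop described here (run the eval set, collect failures, fold corrections back into context, run again) can be sketched in a few lines. This is a toy under stated assumptions, not any vendor's eval API: `toy_agent` is a stand-in that looks up a context note and otherwise guesses, the two eval cases are invented, and the "correction" step simply copies the expected answer where a real team would have a person review each failure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Context = Dict[str, str]
Agent = Callable[[str, Context], str]


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_metric: str


def toy_agent(question: str, context: Context) -> str:
    """Stand-in for a real Agent: uses a context note if present, else guesses 'dau'."""
    return context.get(question, "dau")


def run_evals(agent: Agent, cases: List[EvalCase], context: Context) -> List[EvalCase]:
    """Return the cases the agent currently gets wrong."""
    return [case for case in cases if agent(case.question, context) != case.expected_metric]


def fold_back(context: Context, corrections: Context) -> Context:
    """Merge reviewed corrections into a new context; the old one is left untouched."""
    merged = dict(context)
    merged.update(corrections)
    return merged


cases = [
    EvalCase("daily active users", "dau"),
    EvalCase("deeply engaged daily users", "immersion_dau"),
]
context_v1: Context = {}
failures_v1 = run_evals(toy_agent, cases, context_v1)

# Assumption: in practice a person reviews each failure and writes the correction.
corrections = {case.question: case.expected_metric for case in failures_v1}
context_v2 = fold_back(context_v1, corrections)
failures_v2 = run_evals(toy_agent, cases, context_v2)

print(len(failures_v1), len(failures_v2))  # prints: 1 0
```

Returning a new context instead of mutating the old one keeps each context version available for comparison, which matters once the same eval set is rerun after every change.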
Data Products
First, Agents become the primary form of data products. BI, dashboards, and reports will continue to exist, but the main entry point will give way to a conversational Agent. “Looking at a number” will become the exception; “asking an Agent” will become the default. The core UX of a data product is no longer about chart layout, but about the Agent’s conversational ability and the long-term trustworthiness of its memory. Six months after our conversational data Agent went live, simple queries went from 30 minutes to 2 minutes, and complex analyses went from 2-3 days to 30 minutes. But more telling than “fast” is this: users have started asking questions they never asked before, because once the cost of asking drops, the density of “hypotheses” goes up.
Second, the boundary of data products extends into business systems. The Agent closes its own loop. “Data products and business products merging into one” was the early form of this path. Today, Agents already run the closed loop themselves: looking at the data, drawing the conclusion, calling the business system. Agent-ification on the business side, in growth, marketing, customer support, will accelerate. The data team’s deliverables will increasingly land directly in business execution, not stop at an analysis report. We can already see it: growth team Agents running their own “look at data → produce strategy → call the ad system → check results” loop; customer support Agents handling most standardized tickets automatically. These didn’t exist as mature products in 2024 and started going into production in 2026. Change is getting closer.
Third, data becomes a core runtime component of the product, not just something in the back office. The data team’s deliverables used to serve the company internally; what users perceived was the recommendation result behind the product UI. AI entry points let users talk to an Agent directly, and what the Agent invokes is the user memory, content understanding, and behavior narratives built by the data team. The data team I lead today runs a user memory pipeline that processes millions of users per day and consumes hundreds of billions of tokens, all on local inference. Data shifts from “the object of after-the-fact analysis” to “the input of real-time reasoning.” The data team’s deliverables now face the user experience directly.
Core Capabilities
Each of the two original pieces ended with three capabilities (for data development: business understanding, depth in working with data, holistic view of the pipeline; for data products: a sense of the business’s goal-evaluation system, the ability to abstract analytical frameworks and action points, an obsession with iteration efficiency). Looking at it today, this list needs revisiting.
Business understanding, from “knowing” to “writing it out.” Saying a data person “understands the business” used to mean they could answer when asked in a group chat, and the report they wrote could close a business decision. In the Agent era, “understanding the business” means something else: can you write that understanding into semantic assets that Agents can read. A concrete example: it used to be enough for a senior analyst to answer “what’s the difference between immersion DAU and DAU”; today they have to be able to write that definition into a structured file so the Agent can automatically pick the right metric. Business understanding used to live in heads; now it has to land in files.
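As one possible shape for "landing in files," the sketch below writes two metric definitions to a JSON file, reads them back, and renders them as a text block that could be placed in an Agent's context. The field names (`name`, `definition`, `use_when`) are assumptions, and the definition text is a labeled placeholder: the actual difference between DAU and immersion DAU is not given here and has to come from the business.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict

Metrics = Dict[str, Dict[str, str]]

# All definition text below is a HYPOTHETICAL placeholder, not the real business definition.
METRICS: Metrics = {
    "dau": {
        "name": "DAU",
        "definition": "HYPOTHETICAL: distinct users active on a given day.",
        "use_when": "The question is about overall reach.",
    },
    "immersion_dau": {
        "name": "immersion DAU",
        "definition": "HYPOTHETICAL: distinct users meeting an engagement threshold on a given day.",
        "use_when": "The question is about depth of engagement.",
    },
}


def save_metrics(path: Path, metrics: Metrics) -> None:
    path.write_text(json.dumps(metrics, indent=2), encoding="utf-8")


def load_metrics(path: Path) -> Metrics:
    return json.loads(path.read_text(encoding="utf-8"))


def render_context(metrics: Metrics) -> str:
    """One line per metric, in stable key order, for inclusion in an Agent prompt."""
    lines = []
    for key in sorted(metrics):
        metric = metrics[key]
        lines.append(
            f"[{key}] {metric['name']}: {metric['definition']} Use when: {metric['use_when']}"
        )
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metrics.json"
    save_metrics(path, METRICS)
    context_block = render_context(load_metrics(path))

print(context_block)
```

The file, not the rendered string, is the asset: it can be reviewed, diffed, and versioned like code, while the rendering step can change as the Agent's context format changes.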
Evaluation ability, from “is what I made correct” to “can the system catch and not repeat its errors.” Data quality used to rely on validating the final deliverable: was the SQL correct, was the metric correct, was the report correct. In the Agent era, quality is a production-system problem. I covered the specific evaluation and feedback mechanisms in the previous section; here I’m only looking at the capability itself: can you shift from policing single deliverables to building an evaluation loop? A concrete scene: when our team first defined an eval set for a data Agent, the hardest part wasn’t designing the questions. It was turning “how an analyst judges whether an analysis report is right” from an implicit standard into a computable metric. As far as I know, no one has done this systematically before, so every team has to reinvent it now. The strength of this capability decides whether your Agent system can ship to production and whether it can earn long-term trust from the business.
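One way to picture turning an implicit standard into a computable metric is a weighted rubric: each thing an analyst would check becomes a small function, and the score is the share of weight that passes. The three checks, their weights, and the report fields below are all invented for illustration; a real rubric would come from sitting with analysts and writing down what they actually look for.

```python
from typing import Any, Callable, Dict, List, Tuple

Report = Dict[str, Any]
Check = Tuple[str, int, Callable[[Report], bool]]  # (name, weight, predicate)


def states_time_range(report: Report) -> bool:
    return bool(report.get("time_range"))


def names_metric_definition(report: Report) -> bool:
    return bool(report.get("metric_id"))


def segments_sum_to_total(report: Report) -> bool:
    segments = report.get("segments") or {}
    total = report.get("total")
    return bool(segments) and total is not None and sum(segments.values()) == total


# Hypothetical rubric: checks and weights are assumptions, not a standard.
RUBRIC: List[Check] = [
    ("states_time_range", 2, states_time_range),
    ("names_metric_definition", 3, names_metric_definition),
    ("segments_sum_to_total", 5, segments_sum_to_total),
]


def score_report(report: Report, rubric: List[Check]) -> Tuple[float, Dict[str, bool]]:
    """Return the weighted pass rate in [0, 1] and the per-check results."""
    results = {name: check(report) for name, _, check in rubric}
    total_weight = sum(weight for _, weight, _ in rubric)
    earned = sum(weight for name, weight, _ in rubric if results[name])
    return earned / total_weight, results


report = {
    "time_range": "last 7 days",
    "metric_id": "immersion_dau",
    "segments": {"ios": 40, "android": 50},
    "total": 100,  # segments add up to 90, so the reconciliation check fails
}
score, results = score_report(report, RUBRIC)
print(score, results)
```

The per-check results matter as much as the score: they say which part of the analyst's judgment the Agent missed, which is what feeds a correction back into context.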
Agentic thinking, from “made for humans to read” to “designed for the Agent first, made usable for humans too.” People who built data products used to talk about “user experience,” and the user there was assumed to be human. Today, you have to carry an Agent in your head: when designing any data asset or service, the default starting point is “what will the Agent take away when it reads this,” and “what humans see” comes second. The same semantic document must be precise enough for the Agent to call correctly and clear enough for a human to read. For example, a metric document used to follow a template of “metric name + business meaning + calculation formula + things to watch out for.” Now you have to add “synonyms / applicable scenarios / non-applicable scenarios / historical definition changes.” Those additions aren’t mainly for human readers. They’re there to keep the Agent from going wrong in ambiguous situations.
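The extended template can be sketched as a data structure plus a small resolver that uses the new fields to disambiguate a term. This is an illustration under assumptions: the `MetricDoc` class, the `resolve_metric` function, the scenario labels, and every field value are hypothetical, and returning `None` stands in for "the Agent should ask a clarifying question."

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MetricDoc:
    # The older, human-oriented template
    name: str
    business_meaning: str
    formula: str
    caveats: List[str] = field(default_factory=list)
    # Agent-oriented additions
    synonyms: List[str] = field(default_factory=list)
    applicable_scenarios: List[str] = field(default_factory=list)
    non_applicable_scenarios: List[str] = field(default_factory=list)
    definition_changes: List[str] = field(default_factory=list)


def resolve_metric(term: str, scenario: str, docs: List[MetricDoc]) -> Optional[MetricDoc]:
    """Return the single metric that fits, or None when the term stays ambiguous."""
    wanted = term.strip().lower()
    candidates = [
        doc for doc in docs
        if wanted == doc.name.lower() or wanted in [s.lower() for s in doc.synonyms]
    ]
    usable = [doc for doc in candidates if scenario not in doc.non_applicable_scenarios]
    preferred = [doc for doc in usable if scenario in doc.applicable_scenarios]
    pool = preferred or usable
    return pool[0] if len(pool) == 1 else None


# All values below are HYPOTHETICAL placeholders.
DOCS = [
    MetricDoc(
        name="DAU",
        business_meaning="HYPOTHETICAL: users active on a given day",
        formula="count(distinct user_id)",
        synonyms=["daily actives", "active users"],
        applicable_scenarios=["growth_report"],
    ),
    MetricDoc(
        name="immersion DAU",
        business_meaning="HYPOTHETICAL: users with deep engagement on a given day",
        formula="count(distinct user_id) where deep_session",
        synonyms=["engaged users", "active users"],
        applicable_scenarios=["content_quality_review"],
        non_applicable_scenarios=["growth_report"],
        definition_changes=["HYPOTHETICAL: threshold revised once"],
    ),
]

growth = resolve_metric("Active Users", "growth_report", DOCS)
unclear = resolve_metric("active users", "ad_hoc_question", DOCS)
print(growth.name if growth else None, unclear)
```

In this sketch the synonym list is what creates the ambiguity ("active users" matches both metrics) and the scenario fields are what resolve it, which is why the two kinds of addition belong in the template together.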