Can Coding Agents Do Data Mining?
If you’d asked me a few months ago, “Can a Coding Agent do data mining?”, I would have said no. Data mining takes business intuition, statistical grounding, and the kind of feel you only get from getting burned a few times. However strong the model, it’s still just a tool.
A few real projects over the past few months shook that view. Two of them are below.
Churn Diagnosis
The first was a churn diagnosis on lapsed members. The analyst team had already done a round.
They picked 5 typical churned users and went deep. The story looked clean: former active VIPs, listening to history and finance content in the months before leaving, all narrowing down to “just a few favorite hosts” right before they churned. The conclusion almost wrote itself: churned members are knowledge-type users, and the platform’s knowledge supply isn’t keeping up.
Was the conclusion right?
We had the Agent rerun it. It wasn’t starting from scratch. Over the past year we’d wrapped a few CLIs around the data platform and distilled out 5 skills:
- query: natural language in, structured results out, handles table selection, SQL generation, query execution
- segmentation: from a natural-language description, generates segmentation criteria (tags + behavior + time window), returns the matching user set, supports stratified sampling
- user profile: given a user_id, returns the tag set, long-term and short-term memory, and an AI-generated summary
- user behavior: given a user_id and time window, returns the full event sequence
- A/B test evaluation: given an experiment id, returns a full report and diagnostic conclusion
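To make the shape of these five skills concrete, here is a minimal sketch of what the calling surface might look like. Every name in it (`Skills`, `segment`, `user_behavior`, `InMemorySkills`, `deep_dive`, the field names) is hypothetical and not our actual CLI; the in-memory class is a toy stand-in so the interface can be exercised without a data platform.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class UserProfile:
    user_id: str
    tags: list[str] = field(default_factory=list)
    summary: str = ""  # stands in for the AI-generated summary


@dataclass
class Event:
    user_id: str
    timestamp: str  # ISO 8601 date string, so plain string comparison orders it
    name: str


class Skills(Protocol):
    """Hypothetical surface the Agent calls; names are illustrative only."""

    def query(self, question: str) -> list[dict]: ...
    def segment(self, description: str, sample_size: int) -> list[str]: ...
    def user_profile(self, user_id: str) -> UserProfile: ...
    def user_behavior(self, user_id: str, start: str, end: str) -> list[Event]: ...
    def evaluate_experiment(self, experiment_id: str) -> dict: ...


class InMemorySkills:
    """Toy implementation backed by Python lists instead of a data platform."""

    def __init__(self, profiles: list[UserProfile], events: list[Event]):
        self._profiles = {p.user_id: p for p in profiles}
        self._events = events

    def query(self, question: str) -> list[dict]:
        return []  # a real skill would pick tables, generate SQL, and run it

    def segment(self, description: str, sample_size: int) -> list[str]:
        # Toy rule: treat the description as a single tag to match.
        matched = [uid for uid, p in self._profiles.items() if description in p.tags]
        return matched[:sample_size]

    def user_profile(self, user_id: str) -> UserProfile:
        return self._profiles[user_id]

    def user_behavior(self, user_id: str, start: str, end: str) -> list[Event]:
        return [
            e for e in self._events
            if e.user_id == user_id and start <= e.timestamp < end
        ]

    def evaluate_experiment(self, experiment_id: str) -> dict:
        return {"experiment_id": experiment_id, "report": None}


def deep_dive(skills: Skills, description: str, n: int, start: str, end: str) -> dict[str, int]:
    """Segment users, then count each user's events inside the window."""
    return {
        uid: len(skills.user_behavior(uid, start, end))
        for uid in skills.segment(description, n)
    }
```

The point of the sketch is the division of labor: the Agent composes calls like `deep_dive`, and everything below the interface (SQL, scheduling, cleaning) stays out of its way.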
The Agent calls these skills directly to pull structured data. No SQL writing, no job scheduling, no dirty-data wrangling. All its attention goes to “looking at the data.”
Round one: 20 people, stratified, 10 knowledge-type + 10 entertainment-type. Pure knowledge-type came in at 10%. Round two: 100 people, stratified by region × activity level × payment history. Knowledge-type share dropped to between 2% and 7%.
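For readers who haven't done stratified sampling by hand, a minimal sketch follows. It assumes each user is a plain dict and allocates sample slots to each stratum in proportion to its size; the function name, the field names, and the proportional allocation rule are assumptions for illustration, not what the segmentation skill actually does internally.

```python
import random
from collections import defaultdict


def stratified_sample(users, strata_keys, n, seed=0):
    """Draw about n users, giving each stratum slots in proportion to its size."""
    rng = random.Random(seed)  # fixed seed so a rerun picks the same users

    # Group users by the combination of their stratum fields.
    strata = defaultdict(list)
    for user in users:
        strata[tuple(user[k] for k in strata_keys)].append(user)

    total = len(users)
    sample = []
    for members in strata.values():
        # At least one per non-empty stratum, so small cells are not dropped.
        k = max(1, round(n * len(members) / total))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample
```

Because of rounding and the one-per-stratum floor, the result can come out slightly above or below `n`. A call in the spirit of round two would be `stratified_sample(users, ["region", "activity", "payment_history"], 100)`.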
When the 100-person sample came out, the first number that surprised me was 45%. Of these “churned” members, 45% were still active on the platform 1 to 2 days a month. They hadn’t left the platform. They had just stopped paying.
The second surprise was 3.5 years. We’d pictured churn as a matter of weeks or months. Nobody had stretched the view past 3 years. When the Agent ran the lifecycle chart, it pulled 10 years of history along the way, and the 3.5-year pattern surfaced on its own.
With those two numbers on the table, the “all knowledge-type” finding from the 5-person round was clearly sampling bias. Thinking back, if we’d stopped at the 5-person round (which we very likely would have), the entire win-back strategy would have rested on a wrong premise. The Agent didn’t steer us around that mistake on its own. What it did was drive the cost of expanding the sample to near zero, which made “try one more round” a cheap choice.
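One way to see how little a 5-person round can say is to put a confidence interval around each round's proportion. The sketch below uses the Wilson score interval; the counts are illustrative stand-ins for the three rounds (5 of 5, 2 of 20, and 7 of 100 as the upper end of the 2% to 7% range). It also assumes simple random draws, which none of our rounds were: the first 5 were hand-picked and the later rounds were stratified, so the intervals are indicative only.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts, not exact figures from the project.
for hits, n in [(5, 5), (2, 20), (7, 100)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: {low:.2f} to {high:.2f}")
```

Even if the 5 users had been drawn at random, a 5-of-5 result would still be compatible with a true share not far above one half, while the 100-person round narrows the range to a band a fraction of that width.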
Kids Business
The second project was a diagnosis on the kids business. The team wanted to know whether the product should change, and in which direction.
By default, an analyst would pull behavior data and look at what users are listening to. The most likely result: nursery rhymes, stories, BabyBus, Mi Xiao Quan, Peppa Pig. That makes a “what users are listening to” report. The business team politely says “thanks” and walks away not knowing what to do with it.
We let the Agent run with no constraint on where to start. Its first move was to pull playback distribution by hour. An analyst would do that too, but the Agent went further. For each hour, it pulled playback count, completion rate, session length, whether the listener switched content, and whether a search happened.
One thing came out of that I didn’t understand at first. From 20:00 to 23:00, playback hit 35% of the day’s volume, the daily peak. But completion rate in that window was very low, while individual session length was very long (30 to 60 minutes).
Peak, long sessions, low completion. Each number alone reads fine. Together they don’t. The Agent put them next to each other in the report, and I sat with it for a moment before it clicked: users were falling asleep mid-playback. This isn’t a “content consumption” scenario. It’s a “sleep aid” scenario.
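The three-way reading above can be turned into a simple check. This is a minimal sketch, assuming each playback record carries a start hour, a duration, and a completion flag; the `Play` fields and all three thresholds are hypothetical and would need tuning against real data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Play:
    hour: int        # local hour of day when playback started, 0-23
    seconds: float   # how long the session lasted
    completed: bool  # whether the item played to the end


def hourly_profile(plays):
    """Per-hour share of plays, completion rate, and mean session minutes."""
    buckets = defaultdict(list)
    for play in plays:
        buckets[play.hour].append(play)

    total = len(plays)
    profile = {}
    for hour, items in sorted(buckets.items()):
        profile[hour] = {
            "share": len(items) / total,
            "completion_rate": sum(p.completed for p in items) / len(items),
            "mean_minutes": sum(p.seconds for p in items) / len(items) / 60,
        }
    return profile


def sleep_like_hours(profile, min_share=0.08, max_completion=0.3, min_minutes=30):
    """Hours that are busy, long-running, and rarely completed.

    The thresholds are guesses for illustration, not values from the project.
    """
    return [
        hour for hour, m in profile.items()
        if m["share"] >= min_share
        and m["completion_rate"] <= max_completion
        and m["mean_minutes"] >= min_minutes
    ]
```

The useful part is less the thresholds than the fact that all three metrics sit in one row per hour, so the combination is visible at a glance.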
Once that thread was pulled, independent evidence stacked up. 42.7% of user profiles had tags like “sleep,” “putting to sleep,” “bedtime story.” Top search terms had “bedtime story” in the top five. App store reviews kept showing “play it for the kid every night.” All pointed to bedtime.
The original thinking: build a content app kids love using. After the diagnosis: build a sleep tool parents can use smoothly at their kid’s bedtime. Product form, interaction, content arrangement, all different.
Same project, another thing I hadn’t seen coming. I almost skipped past it in the report. In the search-behavior section, the Agent had spent a paragraph on misspellings such as “小猪佩琪” for “小猪佩奇” (Peppa Pig). I thought it had gone off track. Reading on, I saw it was chasing a hypothesis: a misspelling signals that the kid is operating the account directly, because parents rarely misspell a well-known IP name like that. That signal can be used inside shared accounts to separate “parent operating” from “kid operating.” Since 86% of kids accounts are shared between parent and kid, you need to split the behavior before you can optimize for each side.
“Misspelling = kid signal” is the kind of hypothesis an analyst probably wouldn’t run unprompted, because the ROI looks bad. The Agent has no such concern. It runs cheap, so long-shot angles get run too. No signal, drop it. Signal, escalate.
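A rough way to flag such near-miss searches is character-level edit distance against a list of known titles. This is a sketch under stated assumptions, not how the Agent did it: the function names and the distance threshold are hypothetical, and for Chinese homophone swaps a pinyin-based comparison might work better than raw character distance.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]


def near_miss(query: str, known_titles: list[str], max_distance: int = 1) -> Optional[str]:
    """Return the known title the query almost matches, or None."""
    if query in known_titles:
        return None  # an exact match is not a misspelling
    for title in known_titles:
        if edit_distance(query, title) <= max_distance:
            return title
    return None


titles = ["小猪佩奇", "米小圈"]
print(near_miss("小猪佩琪", titles))  # one wrong character in a four-character title
```

Sessions whose searches trip `near_miss` could then be tagged as candidate “kid operating” sessions and compared against the rest, which is where the hypothesis either shows signal or gets dropped.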
Looking Back
After two projects, I started asking where the Agent does better than traditional mining. A few things came out.
One is sample size. Analysts who’ve been at it a while develop a reflex: “look at 5 first, then decide.” That’s reasonable, because human cost is high. The Agent drives that cost to near zero. You can just run 100 people, or 1,000. A lot of bias in mining work traces back to small samples. Small samples trace back to time pressure. Time pressure traces back to engineering overhead eating the budget. Untangle that chain and the floor of the work lifts.
One is hypothesis density. Analysts write code slowly, so they only run high-confidence hypotheses. Take “misspelling = kid signal”. Even when it crosses an analyst’s mind, they shut it down because the expected payoff doesn’t justify the effort. The Agent runs cheap, so it runs those too. Most produce nothing and get dropped. A few produce something and get escalated.
One is cross-metric combinations. An analyst who’s spent years on one business has a familiar toolkit (churn analysis, funnel, cohort, retention matrix). The templates aren’t wrong, but they narrow attention. The Agent doesn’t carry that baggage. It combines metrics in pairs and triples more freely. The kids “peak + long session + low completion” combination came out of that. It isn’t in any standard template. Once it surfaced, the whole business positioning flipped.
The last one is not being trapped by business priors. An analyst who’s worked one business for years knows the common sense well, but common sense can be baggage. In the churn project, the business’s prior was “churn = leaving.” If the analyst takes that for granted, they might never look at churned users’ free-content data. The Agent doesn’t carry that prior. It runs on whatever it’s given. The 45% finding came out of that exact opening.
Boundaries
The Agent doesn’t solve everything. Seeing the boundaries clearly is what makes it usable.
Scenarios first. Agent-driven mining fits complex, long-horizon problems whose cause is unclear. Churn and kids both qualify. Simple metric monitoring (where did the 3% DAU drop go?) is faster with traditional BI. If the data foundation is too thin (behavior logs and profiles incomplete), the skills return nothing for the Agent to work with, and no matter how strong the model, there’s no material.
Even in a fitting scenario, several things still need a human.
Defining the business question still needs a human. “What’s going on with our members?” is vague when the business team says it. It takes the analyst and the business team together to translate it into a target the Agent can run on.
Interpreting the final conclusion still needs a human. The Agent surfaces 45% still listening and 35% of playback falling in the bedtime window. What those numbers mean, whether to change the product based on them, what to change to: the Agent doesn’t know.
Lastly, capturing business know-how still needs a human. Take the “everyone knows but nobody wrote down” kind of thing. If it isn’t written down somewhere the Agent can reach (Context), the hypotheses it runs are limited to common sense already visible in public data.
Roles like data mining engineer, senior analyst, and data scientist aren’t disappearing, but the work is shifting. Writing scripts, looking at distributions, applying templates: skills + Agent handle those more fluently now. The human’s time moves toward setting direction, judging anomalies, and finding cross-domain combinations.