
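I'm about to join an online discussion on event tracking and analytics, so over the past couple of days I've written up my thinking on a few questions.

Why does a company need a complete platform for tracking and analyzing user behavior?

From its first sprout to full maturity, an internet product depends on deep insight into user behavior.

In a product's early days, you analyze the behavior of your earliest "angel" users to improve it, and sometimes that behavior yields new ideas or discoveries that change the product's direction. During the growth phase, multi-dimensional analysis of user behavior, segmentation of users, and analysis and comparison of each segment's behavioral traits guide product design and operational campaigns, and let you evaluate the effectiveness of marketing channels.

Paired with an A/B testing platform, this speeds up product iteration and gets real user feedback sooner. The accumulated data also helps with building the business's data warehouse and with data-intelligence applications. Real-time recommendation, for example, needs as much detailed user behavior data as possible, as quickly as possible; machine learning tasks such as user classification and intent prediction need cleaned, normalized, structured data for training.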
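To make "normalized, structured" concrete, here is a minimal sketch in Java of what a single tracked event might look like. The class name `TrackedEvent` and its fields (`userId`, `eventName`, `timestampMillis`, `properties`) are assumptions made up for illustration; they are not the schema of any particular tracking platform.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical event record; every name here is an illustrative assumption.
final class TrackedEvent {
    final String userId;                  // who did it
    final String eventName;               // what they did, in normalized form
    final long timestampMillis;           // when, as epoch milliseconds
    final Map<String, String> properties; // free-form details, e.g. an A/B group

    TrackedEvent(String userId, String eventName, long timestampMillis,
                 Map<String, String> properties) {
        if (userId == null || userId.isEmpty()) {
            throw new IllegalArgumentException("userId is required");
        }
        if (eventName == null || eventName.trim().isEmpty()) {
            throw new IllegalArgumentException("eventName is required");
        }
        this.userId = userId;
        // Normalize the name so "Add To Cart" and "add_to_cart" count as one event.
        this.eventName = eventName.trim().toLowerCase(Locale.ROOT).replace(' ', '_');
        this.timestampMillis = timestampMillis;
        // Defensive copy so later changes to the caller's map cannot alter the event.
        this.properties = Collections.unmodifiableMap(new LinkedHashMap<>(properties));
    }
}
```

Validating and normalizing at the point of capture is one way to keep downstream consumers (a warehouse, a recommender, a training pipeline) from each having to re-clean the same raw data.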

Read more »

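I've recently been reading Stonebraker's "Readings in Database Systems", and it has opened up a lot of new lines of thought.

I've worked on big data for many years, but apart from papers on data mining algorithms and distributed systems design, it never occurred to me to read database papers. Looking back, big data and databases share many requirements and scenarios, and academia may well have had solutions to our problems for many years.

This post mainly covers the "Interactive Analytics" part.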

What is Interactive Analytics

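Suppose you are an analyst at an e-commerce company. If 1 million raw user transaction records were printed out and placed in front of you, and you were asked to work out what they mean, what would you do?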

Read more »

We have a legacy system: a web service that receives HTTP POST requests from clients, parses the data, and stores it in a file.

The system's function is simple, it has already been through functional and performance testing, and it is stable. Over time, it was copied and pasted into several other projects, with only the data-parsing logic changed.

I had a similar requirement recently, so I dug into the legacy code to see whether it would work for me, rather than reinventing the wheel.

WTF

The first thing I noticed was the code below, in an HttpServlet class: it allocates more than 1 MB of memory for each HTTP POST request.

Read more »

A long time ago I wrote a post about doing TDD in Objective-C. Swift has been really eye-catching since Apple's WWDC 2014, so I think I should write a new one to follow the trend.

XCTest is used as the unit test framework, and Xcode 6 is required.

TDD Workflow

  1. Add a test for a use case or a user story
  2. Run all tests and see if the new one fails
  3. Write some code that causes the test to pass
  4. Run tests, change production code until all test cases pass
  5. Refactor the production code
  6. Refactor the test code
  7. Return to 1, and repeat

Steps 5 and 6 are optional; do them only when needed, but never do both at the same time. When you refactor production code, leave the test code unchanged until every test passes again; only then can you be confident that the refactoring preserved behavior. After that you can refactor the test code, and this time the production code must stay unchanged.
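The post itself goes on to use Swift and XCTest; the loop is language-agnostic, though, so here is a minimal sketch of steps 1 to 3 in plain Java, with no test framework. The `Cart` and `CartTest` names are hypothetical and exist only for this illustration.

```java
// Steps 1 and 2: the tests are written first. Before Cart exists they
// do not even compile, which counts as the initial failure.
final class CartTest {
    static void testTotalOfEmptyCartIsZero() {
        check(new Cart().total() == 0, "empty cart should total 0");
    }

    static void testTotalSumsItemPrices() {
        Cart cart = new Cart();
        cart.add(250);
        cart.add(100);
        check(cart.total() == 350, "total should be the sum of item prices");
    }

    // Runs every test; the first failing check throws an AssertionError.
    static void runAll() {
        testTotalOfEmptyCartIsZero();
        testTotalSumsItemPrices();
    }

    private static void check(boolean ok, String message) {
        if (!ok) {
            throw new AssertionError(message);
        }
    }
}

// Step 3: just enough production code to make the tests pass.
final class Cart {
    private int totalCents;

    void add(int priceCents) {
        totalCents += priceCents;
    }

    int total() {
        return totalCents;
    }
}
```

Calling `CartTest.runAll()` after each change is the "run all tests" step. A later refactoring of `Cart` would leave `CartTest` untouched, and a refactoring of `CartTest` would leave `Cart` untouched.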

Read more »

In the last blog post, the job Client was created and initialized.

This post discusses how the Client deploys the job to the Hadoop cluster.

This post is full of code snippets. To avoid confusion, all comments I added begin with //** instead of // or /*. The code can be cloned from the Apache Git repository; the commit id is 2e01e27e5ba4ece19650484f646fac42596250ce.

The Resource Manager Proxy

//** org.apache.hadoop.yarn.applications.distributedshell.Client.java L330
public boolean run() throws IOException, YarnException {

  LOG.info("Running Client");
  yarnClient.start();

  ...
Read more »

In the last blog post, a Hadoop distribution was built to run a YARN job.

$ bin/hadoop jar share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar \
org.apache.hadoop.yarn.applications.distributedshell.Client -jar \
share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.2.0.jar \
-shell_command 'date' -shell_args "-u" -num_containers 2

The script above executes the date -u command in the Hadoop cluster. From it we might conclude that there is a dispatcher named Client in "hadoop-yarn-applications-distributedshell-2.2.0.jar", responsible for deploying a jar to the cluster with parameters such as the shell command and its arguments, and for notifying the cluster to execute the shell command.

To see what's down the rabbit hole, let's step into the Client source code.

This post is full of code snippets. To avoid confusion, all comments I added begin with //** instead of // or /*. The code can be cloned from the Apache Git repository; the commit id is 2e01e27e5ba4ece19650484f646fac42596250ce.

Read more »

The Old MapReduce

The Hadoop 0.x MapReduce system is composed of a JobTracker and TaskTrackers.

The JobTracker is responsible for resource management, tracking resource usage, and job life-cycle management, e.g. scheduling job tasks, tracking progress, and providing fault tolerance for tasks.

The TaskTracker is the per-node slave of the JobTracker: it takes orders from the JobTracker to launch or tear down tasks, and periodically reports task status to the JobTracker.
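The periodic status reporting can be pictured with a toy model. The sketch below is not Hadoop code: `ToyJobTracker`, its methods, and the timeout rule for declaring a tracker lost are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only; names and the timeout rule are illustrative assumptions.
final class ToyJobTracker {
    private final long timeoutMillis;
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    ToyJobTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Called by a (toy) TaskTracker to report that it is still alive.
    void heartbeat(String trackerId, long nowMillis) {
        lastHeartbeat.put(trackerId, nowMillis);
    }

    // Trackers whose last report is older than the timeout are treated as
    // lost, so that their tasks could be rescheduled on other nodes.
    List<String> lostTrackers(long nowMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastHeartbeat.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                lost.add(entry.getKey());
            }
        }
        return lost;
    }
}
```

The point of the sketch is that a single central component holds the state of every node, which is the design the old MapReduce architecture is built around.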

Read more »

Story

We programmers build apps for people to use; sometimes we can benefit from our users, too.

We can collect anonymous data from users by recording how they use our app. By analyzing that data, we can find the app's most popular features to plan future development, uncover hidden user needs that suggest new features or new apps, cluster users and apply a different marketing strategy to each group, and so on.

This post will be an example of how I do user clustering.

Imagine I have a music player app with 2 million users.
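This excerpt does not say which clustering method the full post uses. As a rough illustration of what "clustering users" can mean, here is a minimal k-means sketch in Java over a single made-up feature, such as minutes listened per day. The `ToyKMeans` class, the feature, and the choice of k-means are all assumptions.

```java
import java.util.Arrays;

// Illustrative 1-D k-means; not necessarily the method used in the post.
final class ToyKMeans {
    // Groups the values into k clusters and returns the final centroids.
    static double[] cluster(double[] values, int k, int iterations) {
        if (values.length == 0 || k < 1) {
            throw new IllegalArgumentException("need at least one value and k >= 1");
        }
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        double[] centroids = new double[k];
        // Deterministic start: spread the initial centroids across the sorted data.
        for (int c = 0; c < k; c++) {
            centroids[c] = sorted[c * (sorted.length - 1) / Math.max(1, k - 1)];
        }
        for (int it = 0; it < iterations; it++) {
            double[] sum = new double[k];
            int[] count = new int[k];
            // Assignment step: each value joins its nearest centroid.
            for (double v : values) {
                int n = nearest(centroids, v);
                sum[n] += v;
                count[n]++;
            }
            // Update step: each centroid moves to the mean of its members.
            for (int c = 0; c < k; c++) {
                if (count[c] > 0) {
                    centroids[c] = sum[c] / count[c];
                }
            }
        }
        return centroids;
    }

    private static int nearest(double[] centroids, double v) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(v - centroids[c]) < Math.abs(v - centroids[best])) {
                best = c;
            }
        }
        return best;
    }
}
```

With the values 1, 2, 3, 100, 101, 102 and k = 2, the centroids settle at 2 and 101, separating light listeners from heavy ones. Real user data would have many features per user and would need scaling before distances are meaningful.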

Read more »