# Colorful Season

– Taken on November 30, 2013

# An Approach for User Behavior Clustering

## Story

We programmers build apps for people to use; sometimes, we can benefit from our users, too.

We can collect anonymous data by recording how users behave in our app. By analyzing that data, we can find the app's most popular features and plan future development around them, uncover hidden user needs that suggest new features or even new apps, cluster the users and apply a different marketing strategy to each group, and so on.

This post is an example of how I cluster users.

Imagine I have a music player app with 2 million users.

All the data I have is how many times each user played, downloaded, purchased, and shared songs, plus his active days (if a user opens the app on a given day, the active-day count increments by one), as follows:

100035 7 53 0 0 4
150079 45 312 3 8 63
199972 114 2425 82 25 205

## k-means Algorithms

k-means clustering aims to partition n observations into k clusters, with each cluster represented by its center.

Euclidean distance can be used to measure how far each point is from a center.

Given cluster centers, we can simply assign each point to its nearest center. Similarly, if we know the assignment of points to clusters, we can compute the centers by their means.

This introduces a chicken-and-egg problem.

The general computer science answer to chicken-and-egg problems is iteration. We will start with a guess of the cluster centers, for example, randomly choose k points as cluster centers. Based on that guess, we will assign each data point to its closest center. Then we can recompute the cluster centers on these new assignments.

Repeat this process until the cluster centers stop moving.
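
The iteration described above can be sketched in a few lines of numpy (a minimal illustration, not production code):

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means: alternate between assigning points to the
    nearest center and recomputing each center as its cluster's mean."""
    rng = np.random.default_rng(seed)
    # Start with a guess: k randomly chosen points as centers.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each center as the mean of its points;
        # keep the old center if a cluster ends up empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Stop when the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated blobs should come out as two clusters.
data = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centers, labels = kmeans(data, k=2)
```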


## Solution

### Identify the features

Apart from the user ID in the first column, the data has five fields: "played", "downloaded", "purchased", "shared", and "active days".

The first four are user behaviors, and we believe all are important, so those four will be our features.

### Normalize the data

But the data is not "balanced": some values are hundreds of times bigger than others. Luckily, we have "active days". Simply divide each feature value by the user's "active days", and the values become comparable per-active-day rates.

### Clustering

We will use scipy; believe me, it's a great tool.

First, import the packages and load the data.

Then try k = 4 and get the centers of each cluster.

Next, assign each subject (observation) a code, i.e. its cluster.

Finally, calculate each cluster's size.
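
Pieced together, the steps above might look like this. Since the real log file is not included here, the rows below are randomly generated stand-ins in the same format (user id, played, downloaded, purchased, shared, active days):

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Stand-in data, randomly generated in the post's format.
rng = np.random.default_rng(42)
n = 200
raw = np.column_stack([
    np.arange(100000, 100000 + n),      # user id
    rng.integers(0, 3000, size=n),      # played
    rng.integers(0, 500, size=n),       # downloaded
    rng.integers(0, 100, size=n),       # purchased
    rng.integers(0, 50, size=n),        # shared
    rng.integers(1, 300, size=n),       # active days (at least 1)
]).astype(float)

# Features: the four behavior columns, each divided by active days.
features = raw[:, 1:5] / raw[:, 5:6]

# Try k = 4; kmeans returns the cluster centers (and the distortion).
centers, _ = kmeans(features, 4, seed=7)

# Assign each subject (observation) the code of its nearest center.
codes, _ = vq(features, centers)

# Calculate each cluster's size.
sizes = np.bincount(codes, minlength=4)
```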

Then I cried. Look at the biggest cluster, the fourth one: the numbers of songs played, downloaded, purchased, and shared per active day are all nearly 0.

The sad truth is that although I have 2 million users, nearly all of them are zombie users.

(Disclaimer: the data mentioned in this post is fake.)

# An Arithmetic Issue in Hive SQL

Last week, a requirement came to me: we needed to know how often users performed a certain kind of action in one of our products.

It was a one-off task, so I didn't want to write a MapReduce program. Instead, I worked out a Hive SQL query that is a little complicated (it joins 3 tables and has a long where clause).

It worked. I got the results; we knew how many times people performed the action on their 1st day and in their 1st, 2nd, and 3rd weeks.

But in the 4th week, I got 0!

It’s weird.

Then I counted the 5th week: a dramatically smaller number compared with the 3rd week.

The only conclusion I could draw was that nearly all the people who come to our product leave in their 4th week.

It’s unbelievable.

I asked a colleague of mine to redo the statistics in Python. The results for the first 3 weeks were the same, but for the 4th and 5th weeks, his numbers were much bigger than mine.

There might be something wrong with my hive SQL.

Was I using too many joins?

Had I mistakenly put a condition in the where clause that belonged in a join's on clause?

Those questions plagued me for days, but none of them was the cause.

The where clause below is what I used to extract the data for the 4th week.
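
I no longer have the exact query, but it was shaped roughly like this (a reconstruction; every name except persistedtime is a placeholder, and 24*60*60*1000*N converts days to milliseconds):

```sql
-- Hypothetical sketch: actions in a user's 4th week
where persistedtime >= firsttime + 24*60*60*1000*21
  and persistedtime <  firsttime + 24*60*60*1000*28
```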

I noticed that my persistedtime column's type is string; maybe the string-to-int auto conversion went wrong during comparison. But if it had, why would only the 4th week be affected? It seemed unreasonable.

I ran a test: I created a test table and changed the column type to bigint. Same result.

The only thing that changes across the whole query is the arithmetic expression,

24*60*60*1000*7/14/21/28

Int overflow?

No! Hive is not that stupid.

The biggest int32 is 2147483647, and 60*60*24*1000*28 equals 2419200000. If the int overflowed, the where clause would become "greater than a big number and less than a small (wrapped-around) number", which matches ZERO rows for any data!
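
A quick Python simulation of two's-complement int32 arithmetic (this mimics the overflow, it is not Hive itself) shows what happens: the 4th week's lower bound (21 days) still fits in an int32, but the upper bound (28 days) wraps around to a negative number, so the range can match nothing.

```python
def as_int32(n):
    """Wrap a Python integer to signed 32-bit, the way an overflowing
    int32 multiplication result would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

week4_lower = as_int32(60 * 60 * 24 * 1000 * 21)  # fits in int32
week4_upper = as_int32(60 * 60 * 24 * 1000 * 28)  # exceeds 2147483647

print(week4_lower)  # 1814400000, unchanged
print(week4_upper)  # -1875767296, wrapped negative
```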

I replaced 60*60*24*1000*28 in the where clause with the literal 2419200000, and I got the right result.

So the multiplication operation keeps its result as int32 (and silently overflows), while the addition operation is not affected.

We aren’t gonna stop, are we?

Dig deeper.

Digging into Hive's source shows that the multiplication operator (*) simply gives its result the same type as its operands; in our case, (int, int) -> (int).

How do you tell Hive that a number literal is a long? I made a guess: l. Failed. Then I tried L. Bingo!

So if you want to use the multiplication operator (*) in Hive SQL, it's better to add L to one of the operands whenever the result might exceed the maximum value of int32.
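
For example (a sketch; Hive evaluates left to right, so once an operand is bigint the rest of the chain is computed as bigint):

```sql
select 60*60*24*1000*28;   -- int arithmetic, wraps around to a negative number
select 60*60*24*1000*28L;  -- 2419200000, computed as bigint
```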

Why doesn't the addition operator have this problem?

Because in our case its left operand is already a bigint, so the result is computed as bigint.

At last, should I blame Hive?

Absolutely not! Because I didn’t RTFM :(

From the Hive manual:

| Operator | Operand types | Description |
|----------|---------------|-------------|
| A * B | all number types | Gives the result of multiplying A and B. The type of the result is the same as the common parent (in the type hierarchy) of the types of the operands. Note that if the multiplication causes overflow, you will have to cast one of the operators to a type higher in the type hierarchy. |