An Approach for User Behavior Clustering
Story
We programmers build apps for people to use, and sometimes we can benefit from our users, too.
We can collect anonymous data by recording how users behave in our app. By analyzing that data, we can find the most popular features to guide future development, uncover hidden user needs that suggest new features or new apps, cluster the users and apply a different marketing strategy to each group, and so on.
This post will be an example of how I do user clustering.
Imagine I have a music player app with 2 million users.
All the data I have is how many times each user played, downloaded, purchased, and shared songs, as well as the user's active days (if a user opens the app on a given day, the active-day count increments by one), as follows.
User id | Downloaded | Played | Purchased | Shared | Active days |
---|---|---|---|---|---|
100035 | 7 | 53 | 0 | 0 | 4 |
150079 | 45 | 312 | 3 | 8 | 63 |
... | ... | ... | ... | ... | ... |
199972 | 114 | 2425 | 82 | 25 | 205 |
k-means Algorithm
k-means clustering aims to partition n observations into k clusters, and each cluster is represented by its cluster center.
Euclidean distance can be used to measure how far each point is from a cluster center.
Given cluster centers, we can simply assign each point to its nearest center. Similarly, if we know the assignment of points to clusters, we can compute the centers by their means.
This introduces a chicken-and-egg problem.
The general computer science answer to chicken-and-egg problems is iteration. We start with a guess of the cluster centers, for example by randomly choosing k points as cluster centers. Based on that guess, we assign each data point to its closest center. Then we recompute the cluster centers from these new assignments. We repeat this process until the cluster centers stop moving.
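The iteration described above can be sketched in a few lines of NumPy. This is only a minimal illustration, not the implementation scipy uses: the function names `nearest_center` and `kmeans_sketch`, the `max_iter` cap, and the fixed random seed are my own assumptions.

```python
import numpy as np

def nearest_center(points, centers):
    """Index of the closest center (Euclidean distance) for each point."""
    # Pairwise distances between points and centers, shape (n, k).
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans_sketch(points, k, max_iter=100, seed=0):
    """Guess centers, assign points, recompute centers, repeat."""
    rng = np.random.default_rng(seed)
    # Initial guess: k distinct data points chosen at random.
    chosen = rng.choice(len(points), size=k, replace=False)
    centers = points[chosen].astype(float)
    for _ in range(max_iter):
        labels = nearest_center(points, centers)
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            # A center that lost all its points is left where it was.
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # the centers stopped moving
        centers = new_centers
    return centers, nearest_center(points, centers)

# Two obvious groups of three points each (made-up toy data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(points, k=2)
```

On this toy input, the first three points end up in one cluster and the last three in the other. Which cluster gets label 0 depends on the random initial guess.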
Solution
Identify the features
From the data, there are five columns, "played", "downloaded", "purchased", "shared", and "active days".
The first four are user behaviors, and we believe all are important, so those four will be our features.
Normalize the data
But the data is not "balanced": some values are hundreds of times bigger than others. Luckily, we have "active days". Simply divide each feature value by the user's "active days", and the values become "balanced" per-day rates.
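As a rough sketch of this normalization step, here are the three sample rows from the table above, divided by their active days. The variable names and the column order (played, downloaded, purchased, shared) are my own choices for illustration, and treating users with zero active days as all-zero rows is an assumption.

```python
import numpy as np

# Sample rows from the table; columns are
# played, downloaded, purchased, shared, active days.
raw = np.array([
    [53, 7, 0, 0, 4],
    [312, 45, 3, 8, 63],
    [2425, 114, 82, 25, 205],
], dtype=float)

counts = raw[:, :4]
active_days = raw[:, 4]

# Assumption: a user with 0 active days gets all-zero features
# instead of triggering a division by zero.
safe_days = np.where(active_days > 0, active_days, 1.0)
per_day = counts / safe_days[:, None]
per_day[active_days == 0] = 0.0

print(per_day[0])  # first user: 13.25 plays and 1.75 downloads per active day
```

After this step every feature is a per-active-day rate, so a heavy user with 205 active days and a light user with 4 active days can be compared on the same scale.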
Clustering
We will use scipy. Believe me, it's a great tool.
First, import the packages and load the data.
    import numpy as np
    from scipy.cluster import vq
Then try k = 4,
    centers, dist = vq.kmeans(subject, 4)
and get the centers of each cluster.
    array([[5.42879071e+00, 1.37091994e+00, 1.04975836e-01, 6.75508656e-02],
The code below assigns a code (cluster) to each subject (observation).
    code, distance = vq.vq(subject, centers)
Then calculate the size of each cluster.
    In [266]: a.shape, b.shape, c.shape, d.shape
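For readers who want to try the whole flow, here is a small self-contained sketch on synthetic data. It is not the data or the exact code from this post: the two groups ("quiet" and "busy"), their sizes, and the use of `np.bincount` to count cluster sizes are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster import vq

rng = np.random.default_rng(0)

# Synthetic stand-in for per-active-day features (4 columns):
# 300 nearly inactive users and 100 active ones.
quiet = rng.normal(loc=0.1, scale=0.05, size=(300, 4))
busy = rng.normal(loc=5.0, scale=0.5, size=(100, 4))
subject = np.vstack([quiet, busy])

# Find the cluster centers, then assign each observation to its nearest one.
centers, distortion = vq.kmeans(subject, 2)
code, distance = vq.vq(subject, centers)

# Count how many observations fall into each cluster.
sizes = np.bincount(code, minlength=len(centers))
print(centers)
print(sizes)
```

Note that `vq.kmeans` can return fewer centers than requested if a cluster ends up empty, which is why the sketch sizes the count array from `len(centers)` rather than from the requested k.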
Then I cried. Look at the biggest cluster in my results, the fourth one: the numbers of songs played, downloaded, purchased, and shared per active day by its users are all nearly 0.
The final truth is that although I have 2 million users, nearly all of them are zombie users.
(Disclaimer: the data mentioned in this post is fake.)