HTTP Cookie, Browser Identity, and Privacy

A cookie, also known as an HTTP cookie, web cookie, or browser cookie, is a small piece of data sent from a website and stored in a user's web browser while a user is browsing a website. When the user browses the same website in the future, the data stored in the cookie can be retrieved by the website to notify the website of the user's previous activity. (refer)

HTTP is a stateless protocol: the HTTP server is not required to retain information or status about a client across multiple requests. When an HTTP server receives multiple requests, it cannot tell by itself whether those requests come from the same browser or not.

HTTP cookies are designed to be a reliable mechanism for websites to remember state or the activity the user has taken in the past. When a user visits a website, the user's browser sends the site's cookies to the HTTP server inside HTTP header fields. The HTTP server can set cookies for its own purposes and send them back to the browser, e.g., set a value that identifies the user.
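To make the header exchange concrete, here is a minimal sketch using Python's standard `http.cookies` module. The cookie name `uid` and its value are made up for illustration; a real site would pick its own names and attributes.

```python
from http.cookies import SimpleCookie

# Server side: build the value of a Set-Cookie response header.
# "uid" and "abc123" are hypothetical values, not from any real site.
outgoing = SimpleCookie()
outgoing["uid"] = "abc123"
outgoing["uid"]["path"] = "/"
set_cookie_value = outgoing["uid"].OutputString()

# Browser side: on later requests to the same site, the browser sends the
# stored pairs back in a Cookie request header, e.g. "Cookie: uid=abc123; lang=en".
incoming = SimpleCookie()
incoming.load("uid=abc123; lang=en")

# The server reads the identifier and can link this request to earlier ones.
returning_uid = incoming["uid"].value

print("Set-Cookie:", set_cookie_value)
print("Recognized user:", returning_uid)
```

The server never stores anything about the connection itself; the identifier travels with every request, which is how a stateless protocol ends up carrying state.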

With cookies, a website can recognize its users. On the one hand, users don't need to set their language preferences, log in, etc. every time they visit the site again; on the other hand, the website can provide its users with customized content or services.

But business is business. Although some sites provide free services, they need revenue, so they insert ads into their web pages. With cookies, they can customize the ads to suit their users' needs; those ads are likely to be less annoying and to get a better click-through rate (CTR).

Since cookies are isolated by domain, different sites can't share cookies. This is a compromise between convenience and privacy; from my point of view, it's a win-win situation.

Some advertising companies build systems that can "share" cookies across multiple domains. When a user visits a website, a user-tracking cookie is set not only on that site's domain but also on the domain of an advertising company. When the user visits another site that also uses that advertising company's system, the company recognizes the user and displays ads according to the user's previous activities.

Imagine the two sites are Google and Amazon: one day you search for some porn on Google, and a few days later you visit Amazon and it shows porn video recommendations right at the top of the home page.

These are called third-party cookies. They go too far, and the situation is no longer a win-win.
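The mechanism can be sketched as a toy simulation. The `AdNetwork` class, the cookie format, and the site names below are all hypothetical; the point is only that one cookie, scoped to the advertiser's own domain, is presented on every site that embeds the advertiser's content.

```python
from collections import defaultdict
from itertools import count


class AdNetwork:
    """Toy third-party tracker embedded on several unrelated sites."""

    def __init__(self):
        self._ids = count(1)
        self.history = defaultdict(list)  # tracker cookie -> sites visited

    def serve_ad(self, site, tracker_cookie=None):
        # First contact: issue a cookie scoped to the ad network's own domain.
        if tracker_cookie is None:
            tracker_cookie = f"t{next(self._ids)}"
        # Every embedding site reports the visit under the same cookie.
        self.history[tracker_cookie].append(site)
        return tracker_cookie


network = AdNetwork()
# The browser stores one cookie for the ad network's domain and sends it
# whenever any page embeds that network's content.
cookie = network.serve_ad("shop.example")
cookie = network.serve_ad("news.example", cookie)
print(network.history[cookie])  # ['shop.example', 'news.example']
```

Neither site ever reads the other's cookies, so the per-domain isolation is technically intact; the browsing history is assembled on the advertiser's side.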

Luckily, nearly all modern web browsers allow you to disable third-party cookies; the current version of Safari even disables them by default.

I strongly recommend disabling them.

Some people who are extremely concerned about their privacy disable all cookies to stop any website from tracking them, but doing so might make them easier to track by other means.

Entropy

Entropy is a measure of unpredictability or information content. (refer)

For a random variable \[X\] with \[n\] outcomes \[\{x_1, \ldots, x_n\}\], the entropy, denoted by \[H(X)\], is defined as

\[ H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i) \]

\[b\] is the base of the logarithm; normally it is \[2\], which gives the entropy in bits.

If there is a set of \[n\] possible events \[\{x_1, \ldots, x_n\}\] with equal probability \[p(x_i)=1/n\], its entropy would be

\[ H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i) = -n \frac{1}{n} \log_2 \frac{1}{n} = \log_2 n \]

From this we can conclude that the entropy is the number of bits needed to specify which one of the equally likely events occurred.

If there is another set of \[m\] possible events \[\{y_1, \ldots, y_m\}\] with equal probability \[p(y_j)=1/m\], and the two sets are independent, then \[p(x_i,y_j)=\frac{1}{nm}\] and the joint entropy would be

\[ H(X,Y) = -\sum_{i,j} p(x_i,y_j) \log_2 p(x_i,y_j) = -nm \frac{1}{nm} \log_2 \frac{1}{nm} = \log_2 n + \log_2 m \]

Thus, we can conclude that if we combine independent sources of uncertainty, the overall uncertainty is the sum of the individual uncertainties.
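Both results are easy to check numerically. The following is a small sketch; the function name `entropy` and the sizes `n = 8` and `m = 4` are arbitrary choices for illustration.

```python
import math
from itertools import product


def entropy(probs, base=2):
    """Shannon entropy of a discrete distribution given as probabilities."""
    # Outcomes with zero probability contribute nothing to the sum.
    return -sum(p * math.log(p, base) for p in probs if p > 0)


n, m = 8, 4
uniform_x = [1 / n] * n
uniform_y = [1 / m] * m

# Independence: the joint probability is the product of the marginals.
joint = [px * py for px, py in product(uniform_x, uniform_y)]

h_x = entropy(uniform_x)   # expected: log2(8) = 3 bits
h_y = entropy(uniform_y)   # expected: log2(4) = 2 bits
h_xy = entropy(joint)      # expected: 3 + 2 = 5 bits

print(round(h_x, 6), round(h_y, 6), round(h_xy, 6))  # 3.0 2.0 5.0
```

Additivity is what makes fingerprinting effective: each independent attribute a site can observe adds its own bits to the total.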

We could design a browser fingerprinting algorithm \[f(x)\] which assigns an ID to every browser \[x\] that visits. Its outputs \[f_i\], \[i \in \{0, \ldots, N\}\], follow a discrete probability distribution

\[ P(f_i) = P(f(x)=f_i), \quad \sum_{i=0}^{N} P(f_i) = 1 \]

The entropy is

\[ H(F) = -\sum_{i=0}^{N} P(f_i) \log_2 P(f_i) \]

Given \[X\] different browsers, each with equal probability of visiting the website, the entropy of the visiting browser is \[\log_2 X\].

The uniform distribution on the interval [a, b] is the maximum entropy distribution among all continuous distributions which are supported in the interval [a, b] (which means that the probability density is 0 outside of the interval). (refer)

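The same holds in the discrete case: among \[X\] outcomes, the uniform distribution has the maximum entropy, \[\log_2 X\]. So if the entropy of the browser fingerprinting algorithm reaches \[\log_2 X\], every browser gets its own fingerprint and the browsers are uniquely recognizable.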
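As an illustration, the entropy of a fingerprinting algorithm can be estimated from the fingerprints it assigns to a sample of visitors. The sketch below uses made-up single-letter fingerprints; `fingerprint_entropy` and `surprisal` are hypothetical helper names.

```python
import math
from collections import Counter


def fingerprint_entropy(fingerprints):
    """Entropy in bits of the observed fingerprint distribution."""
    counts = Counter(fingerprints)
    total = len(fingerprints)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def surprisal(fingerprints, value):
    """Bits of identifying information revealed by one fingerprint value."""
    counts = Counter(fingerprints)
    return -math.log2(counts[value] / len(fingerprints))


# Eight browsers, every fingerprint distinct: entropy reaches log2(8) = 3 bits.
all_unique = ["a", "b", "c", "d", "e", "f", "g", "h"]
# Eight browsers, but several share a fingerprint: entropy drops to 1.75 bits.
some_shared = ["a", "a", "a", "a", "b", "b", "c", "d"]

print(fingerprint_entropy(all_unique))
print(fingerprint_entropy(some_shared))
print(surprisal(some_shared, "a"), surprisal(some_shared, "d"))
```

In the second sample, a browser with the common fingerprint "a" hides among four and reveals only 1 bit, while the browser with fingerprint "d" reveals 3 bits and is unique.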

Browser Fingerprinting Algorithm

According to this paper,

In particular, a fingerprint that carries no more than 15-20 bits of identifying information will in almost all cases be sufficient to uniquely identify a particular browser, given its IP address, its subnet, or even just its Autonomous System Number.

Below are the metrics it used for browser fingerprinting:

  • User Agent
  • HTTP ACCEPT headers
  • Cookies enabled?
  • Screen resolution
  • Timezone
  • Browser plugins, plugin versions, and MIME types
  • System fonts

These metrics can easily be obtained by simple scripts running in the user's browser. And the results are striking:

In general, modern desktop browsers fare very poorly, and around 90% of these are unique. The least unique desktop browsers often have JavaScript disabled (perhaps via NoScript). iPhone and Android browsers are significantly more uniform and harder to fingerprint than desktop browsers.
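As a rough sketch of how the metrics listed above could be turned into a single identifier, the attributes can be serialized in a canonical order and hashed. This is an assumption for illustration, not the paper's method; the function name `fingerprint`, the attribute keys, and all the values below are hypothetical.

```python
import hashlib
import json


def fingerprint(attributes):
    """Hash a dict of browser attributes into a short hex identifier."""
    # Sort keys so the same attributes always serialize the same way.
    canonical = json.dumps(attributes, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


# Hypothetical values, as a collecting script might report them.
browser_a = {
    "user_agent": "ExampleBrowser/1.0",
    "accept": "text/html",
    "cookies_enabled": True,
    "screen": "1920x1080x24",
    "timezone_offset": -480,
    "plugins": ["ExamplePDF 1.2"],
    "fonts": ["Arial", "Courier New"],
}
# The same browser with one extra font installed.
browser_b = dict(browser_a, fonts=["Arial", "Courier New", "Comic Sans MS"])

print(fingerprint(browser_a))
print(fingerprint(browser_b))
```

A single extra font is enough to produce a different identifier, which hints at why long plugin and font lists make desktop browsers so distinguishable.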

How can we defend against fingerprinting algorithms?

There are two ways. First, hide the browser's information, e.g., by disabling JavaScript. Second, make the browser look similar to other browsers so that a fingerprinting algorithm cannot tell it apart from them.

But hiding information about the browser makes it less common, and making it common means disclosing its information. It's a paradox.

The Mobile Internet Era

Smartphones and Apps are all around us, and when privacy is violated there, the consequences might be worse than with web browsers.

How do Apps track device identity?

In iOS, the UDID could be used to identify devices; luckily, that was banned by Apple. Since iOS 6, however, Apple has provided the Advertising Identifier, which is similar to the UDID except that the user can switch it off. The MAC address can be used too.

In Android, the IMEI, the Android device ID, and the MAC address can all be used to identify devices.

Those identifiers are different from cookies: cookies can be deleted or disabled, but MAC addresses, IMEIs, etc. can't be changed, and there is no way to prevent Apps from getting that information.

Apps can share your device ID with each other. Here's how some mobile App marketing systems work:

An App displays ads for other Apps supplied by a marketing system. When a user taps an ad to install the advertised App, the first App sends a request to the marketing system with the user's device ID. When the user opens the newly installed App, that App sends another request to the marketing system with the same device ID. By matching the two requests, the marketing system knows the click and activation rates of an ad.
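The matching step can be sketched as follows. This is a toy model under simplifying assumptions (one ad per advertised App, no timestamps); the class `MarketingSystem`, its method names, and the identifiers are all hypothetical.

```python
class MarketingSystem:
    """Toy attribution service that joins clicks and activations by device ID."""

    def __init__(self):
        self.clicks = {}        # (promoted_app, device_id) -> ad_id
        self.activated = set()  # (promoted_app, device_id) that opened the App

    def record_click(self, ad_id, promoted_app, device_id):
        # Sent by the App that displayed the ad.
        self.clicks[(promoted_app, device_id)] = ad_id

    def record_activation(self, promoted_app, device_id):
        # Sent by the newly installed App on first launch.
        key = (promoted_app, device_id)
        if key in self.clicks:
            self.activated.add(key)
            return self.clicks[key]  # the ad credited with this install
        return None

    def activation_rate(self, ad_id):
        clicked = [k for k, a in self.clicks.items() if a == ad_id]
        if not clicked:
            return 0.0
        return sum(k in self.activated for k in clicked) / len(clicked)


system = MarketingSystem()
system.record_click("ad-1", "game.app", "device-A")
system.record_click("ad-1", "game.app", "device-B")
credited = system.record_activation("game.app", "device-A")
rate = system.activation_rate("ad-1")
print(credited, rate)  # ad-1 0.5
```

The join only works because both Apps report the same stable device ID, which is exactly the property that makes these identifiers a privacy concern.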

There are also mobile analytics systems, e.g., Google Analytics, that require their SDK to be embedded into the target Apps. If all Apps used the same SDK, and the system had the ability and motivation to analyze all the data it collected, the activities of every user across all Apps could be monitored.

How can we defend against those privacy threats?

We just can't.