Here's what you need to know about FLoC: Google's alternative to individual tracking

Google made some waves earlier this week when it boasted that it will soon stop tracking individuals via ads and their browsing activities. Many have understandably been wary about this announcement and believe that there must be a loophole which will still allow Google to track you and present you targeted ads.

As usual, it is important to look past the headlines, as the devil is in the details. In this piece, we will take a look at what Google is proposing as an alternative to its usual tracking capabilities.

How does Google usually track individuals?

Google tracks your online activity in a multitude of ways, including your browsing habits, the metadata and telemetry it extracts from cookies, as well as its software and hardware products, which reveal details such as your location, your search history, your interactions with Google Assistant, your Google account, Gmail, YouTube, and more.

The company then builds a unique ad profile for you which can be used by ad providers to show you targeted ads. As can be seen from my own ad profile in the screenshot above (you can view yours here), Google correctly knows that I'm a 24-year-old male, and that my interests include movies, gaming, mobile platforms, cybersecurity, banking, and so on. Most of these are accurate and, as you'll be able to grasp, this is only a portion of what Google thinks are interests relevant to me. The list is in alphabetical order and actually contains hundreds of topics that the company has assigned to my profile.

Through this, it is extremely simple for ad providers to identify which ads they want to show me. Usama is interested in audio software? Let's start showing him ads for Spotify premium subscriptions. He's interested in antivirus software? Let's start showing him ads for Norton AntiVirus. The list goes on and on, and as you can see, there is huge value in this data for ad providers at the cost of individual privacy.

However, Google says that it will soon stop tracking your individual interests in this manner. This is great news on paper but does this mean that the company won't track you at all? Not quite.

Enter FLoC

A flock of birds — *A FLoC(k) of birds | Photo by Aleksandar Pasaric from Pexels*

Federated Learning of Cohorts (FLoC) is a technique that Google proposed back in January 2020 as a part of its Privacy Sandbox initiative. This API, which is currently in testing, enables a privacy-preserving mechanism in the sense that rather than generating unique ad profiles for individuals, it assigns them to a group of people with similar interests. I will interchangeably use the terms "group", "cluster", and "cohort" in this article.

The idea is for browsers to locally analyze the user's habits without sending them to a server, and then assign them a cohort ID. For example, I may belong to "Cohort A", which will contain thousands of individuals with browsing habits similar to mine, while other cohorts (Cohort B, C, D...) will contain users whose interests are significantly different than mine.

As such, when a cohort is developed and ready to be used to show ads, ad providers will have to present similar ads to the entire cohort rather than showing me ads specific to my interests. So, if I am assigned to a cohort consisting predominantly of people with interests in tech, I'll see ads related to tech, but not all of them will fully match my interests. Similarly, the same cohort may also be somewhat skewed towards people who like football alongside tech, so I'll be intermittently shown ads related to football, even though it's not a super-interesting topic to me.

How does FLoC work?

A blackboard with mathematical equations on it — *Photo by Gabby K from Pexels*

As FLoC is based on the idea of grouping users with similar interests, the key area of interest is cohort assignment. From a bird's eye view, Google is testing several algorithms to generate a p-dimensional binary vector based on a user's browsing history. All users who have a similar hash value will share the same cohort.

Since the main reason behind this initiative is ensuring user privacy, Google says that a cohort must have at least "k" distinct users. Intuitively, the higher the value of "k", the greater privacy you will have across the web. The company refers to this property as "k-anonymity", saying that it is similar to hiding in a crowd.

The principles behind FLoC include the fact that the algorithms used for cohort assignment should be unsupervised because providers will be using their own optimization functions and will not be dependent on labeled data. Similarly, a cohort should consist only of users with similar activity, and the ID should restrict individual cross-site tracking. In terms of the algorithm itself, it should be simple to compute because it will be run locally in browsers, and the parameters used should be easy to understand.

Google tested multiple algorithms and judged them according to three metrics: privacy, utility, and centralization. There is an inherent tradeoff between the first two. The more privacy you have (that is, the larger the cluster you belong to), the less personalized your ads will be, and vice versa. The goal is to have cohorts with a large number of users, all with similar interests. Lastly, centralization refers to whether any information needs to be sent to a centralized server to calculate a cohort ID.

Algorithms

A laptop display with a coding IDE open and spectacles in the front — *Photo by Kevin Ku from Pexels*

All algorithms that Google has tested so far are clustering algorithms, which makes sense given the aim of the initiative. Although we won't go into the mathematical details of each algorithm since those are full-fledged topics altogether, we will be discussing them at a high level along with their characteristics.

First up is SimHash, which maps an input vector to a p-dimensional binary vector, that is, a p-bit hash. The closer two input vectors are, the more likely they are to be hashed to the same cohort. One of the inputs to the algorithm in this case can be a user's browsing history. If two users have similar browsing habits, the angle between their vectors will be very small (their cosine similarity will be high) and they will likely be assigned the same cohort, and vice versa.
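To make the idea concrete, here is a minimal sketch of SimHash over a browsing history. It is not Google's or Chrome's implementation: the function names, the choice of SHA-256 to seed the per-domain random values, the visit-count weights, and the `.example` domains are all assumptions made for illustration. Each output bit records which side of a pseudo-random hyperplane the history vector falls on.

```python
import hashlib
import random
from typing import Dict, List


def domain_projection(domain: str, p: int) -> List[float]:
    """Derive p pseudo-random Gaussian values from a domain name.

    Seeding from the domain means every browser computes the same values
    without talking to a server (an assumption of this sketch).
    """
    digest = hashlib.sha256(domain.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(p)]


def simhash(history: Dict[str, float], p: int = 8) -> int:
    """Return a p-bit hash of a {domain: weight} browsing history."""
    totals = [0.0] * p
    for domain in sorted(history):  # fixed order keeps the sums reproducible
        weight = history[domain]
        for i, g in enumerate(domain_projection(domain, p)):
            totals[i] += weight * g  # dot product with hyperplane i
    bits = 0
    for total in totals:
        bits = (bits << 1) | (1 if total > 0 else 0)  # keep only the sign
    return bits


# Hypothetical history: visit counts per domain.
history = {"news.example": 3.0, "tech.example": 5.0, "games.example": 1.0}
cohort_hash = simhash(history, p=8)
```

Because only the sign of each projection is kept, the hash depends on the direction of the history vector and not on its length: doubling every visit count yields the same hash.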

The advantage of SimHash is that it does not depend upon other users to generate a cohort ID. A cohort ID can be generated without knowing about other users' vectors, because it depends solely upon your own browsing history. A drawback of this approach is that a minimum cluster size cannot be enforced without a central server. Without a minimum size, there may be edge cases where some clusters will contain hundreds of users while others will contain only a couple. To tackle this problem, a cohort server can be set up that tracks cluster sizes and does not allow the API to return the ID of a cluster that does not meet the minimum requirements. In order to function, this server will require a "small bit length hash" of the input, which is the user's browsing history.
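The cohort server described above can be sketched roughly as follows. The `CohortServer` class and its method names are hypothetical, and the sketch leaves out everything a real deployment would need, such as de-duplicating reports from the same browser. It only shows the core rule: an ID is returned only once at least "k" reports share the same hash.

```python
from collections import Counter
from typing import Optional


class CohortServer:
    """Toy server that only sees short hashes, never raw browsing history."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.counts: Counter = Counter()

    def report(self, short_hash: int) -> None:
        """Record that one more browser computed this hash."""
        self.counts[short_hash] += 1

    def cohort_id(self, short_hash: int) -> Optional[int]:
        """Return the hash as a cohort ID only if the cohort is big enough."""
        if self.counts[short_hash] >= self.k:
            return short_hash
        return None


server = CohortServer(k=3)
for h in (0b1010, 0b1010, 0b1010, 0b0001):
    server.report(h)

print(server.cohort_id(0b1010))  # three reports, so the ID is exposed
print(server.cohort_id(0b0001))  # a single report, so nothing is exposed
```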

Secondly, we have the SortingLSH algorithm, but first it is important to understand why it is needed. The value of "p" in p-dimensional output vectors generated by SimHash is crucial in determining the potential sizes of cohorts. A smaller value may result in large cohorts while a large value may result in small cohorts, through which we circle back to the privacy-utility tradeoff.
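A quick back-of-the-envelope calculation shows why "p" matters so much. A p-bit hash can take at most 2^p distinct values, so if hashes were spread evenly (an idealized assumption, since real browsing data is skewed), the average cohort size would be the number of users divided by 2^p. The user count below is made up for illustration.

```python
def average_cohort_size(num_users: int, p: int) -> float:
    """Average users per cohort if all 2**p hash values were equally likely."""
    return num_users / (2 ** p)


users = 1_000_000  # hypothetical population
print(average_cohort_size(users, 8))   # 3906.25 users per cohort
print(average_cohort_size(users, 16))  # roughly 15 users per cohort
```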

SortingLSH solves this problem by post-processing the results of SimHash. It does this by sorting the hash values in lexicographic order and then assigning them to cohorts in such a way that each cohort necessarily has at least "k" users.
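The sorting step can be sketched as follows, assuming the hashes are available as bit strings keyed by a user ID. The function name, the tie-breaking rule, and the choice to fold leftover users into the last cohort are my own assumptions rather than details confirmed by Google.

```python
from typing import Dict


def sorting_lsh(user_hashes: Dict[str, str], k: int) -> Dict[str, int]:
    """Assign users to cohorts of at least k members by sorting their hashes.

    user_hashes maps a user ID to a bit-string hash such as "0110".
    Returns a mapping of user ID to cohort index.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort lexicographically by hash; break ties by user ID for determinism.
    ordered = sorted(user_hashes, key=lambda user: (user_hashes[user], user))
    num_cohorts = len(ordered) // k
    if num_cohorts == 0:
        return {}  # too few users to form even one cohort of size k
    assignment = {}
    for index, user in enumerate(ordered):
        # Contiguous runs of k users share a cohort; leftovers join the last one.
        assignment[user] = min(index // k, num_cohorts - 1)
    return assignment


hashes = {
    "u1": "0001", "u2": "0010", "u3": "0011",
    "u4": "1000", "u5": "1001", "u6": "1110", "u7": "1111",
}
print(sorting_lsh(hashes, k=3))
```

Users with neighboring hashes end up together, and no cohort is smaller than "k", at the cost of needing every user's hash in one place before the sort can run.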

Of course, there is a drawback in this approach too, which may be immediately visible to some of our readers. Since the hashes have to be sorted before being assigned to a cluster, they depend upon other users. As such, a centralized server needs to do the sorting and obviously needs some information in order to do this processing, a disadvantage also present in the vanilla SimHash approach.

A piece of paper with equations written on it and a pen on top

In order to further enhance the privacy-utility tradeoff, Google also tested a technique called Affinity hierarchical clustering with centroids. It builds a graph in which users are nodes and the edges connecting them depict their similarity. Two nodes very close together will have a very short edge, showing their affinity. Then, this algorithm performs a bottom-up hierarchical agglomerative clustering by calculating new centroids and merging smaller clusters into bigger ones. The minimum size of a cluster is explicitly defined at the start of the clustering process.
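The following is a heavily simplified sketch of the bottom-up merging idea, not Google's Affinity clustering itself, which works on a similarity graph. Here, each user is assumed to be a small numeric vector, distance is plain Euclidean distance between centroids, and undersized clusters are repeatedly merged into their nearest neighbor until every cluster reaches the minimum size. All names and the sample data are hypothetical.

```python
import math
from typing import List


def centroid(points: List[List[float]]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of equal-length vectors."""
    return [sum(p[i] for p in points) / len(points) for i in range(len(points[0]))]


def cluster_with_min_size(vectors: List[List[float]], k: int) -> List[List[int]]:
    """Merge clusters bottom-up until each one has at least k members.

    Returns clusters as lists of indices into `vectors`. If there are fewer
    than k users in total, a single undersized cluster is returned.
    """
    clusters = [[i] for i in range(len(vectors))]  # start with singletons
    while len(clusters) > 1:
        undersized = [c for c in clusters if len(c) < k]
        if not undersized:
            break
        target = min(undersized, key=len)  # smallest cluster merges first
        target_center = centroid([vectors[i] for i in target])
        others = [c for c in clusters if c is not target]
        nearest = min(
            others,
            key=lambda c: math.dist(target_center, centroid([vectors[i] for i in c])),
        )
        clusters = [c for c in clusters if c is not target and c is not nearest]
        clusters.append(sorted(target + nearest))  # merged cluster, new centroid
    return clusters


# Two obvious groups of users in a made-up 2-D "interest space".
users = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
print(sorted(cluster_with_min_size(users, k=3)))  # [[0, 1, 2], [3, 4, 5]]
```

Note that the function needs every user's vector at once, which is exactly the centralization concern raised about this family of algorithms.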

Although this results in better clusters because the information of one user is utilized to proactively find similar users, we have a disadvantage where a central server is generating graphs based on raw browsing history rather than hashes. In 2020, Google stated that this is due to the baseline naive algorithm it has built, and this concern could be resolved with "federated learning technology". We don't have the latest from Google regarding how successful its efforts have been at this point in time.

How good are these algorithms?

An A4 paper next to a laptop and stationery showing graphs and charts — *Photo by Lukas from Pexels*

Just because something sounds good on paper doesn't mean that it will perform similarly well in the real world, so it is important to look at both aspects when evaluating the performance of an algorithm. When Google tested these algorithms on public datasets containing information about the preferences of users with regards to movies and music, the fully centralized algorithm, Affinity Clustering, outperformed decentralized methods such as SimHash and SortingLSH. But it is interesting to note that the latter two algorithms still achieved results that were 85% of the quality of Affinity Clustering.

Additionally, Google used Word Clouds to visualize the semantic meaning behind each cohort and noticed that decentralized methods typically produced large clusters which were somewhat well-defined whereas Affinity Clustering generated smaller clusters which were defined even better.

A hand holding a phone showing the Google Ads logo on the screen

After performing validation on public datasets, Google proceeded to test the algorithms on its own dataset. It contained proprietary data including de-identified URLs from publishers in the Google Display Network collected across seven days.

As such, the input features fed to the algorithms were encoded URLs, domains, and topic categories. All clusters with a size below a minimum of "k" members were dropped from the evaluation process. Google notes that with SimHash, it noticed a 350% improvement in recall and a 70% improvement in precision compared with random clustering, even at very high anonymity levels.

Although Google has not reported the results of SortingLSH and Affinity Clustering so far, it claims that the performance was similar to that observed on public datasets.

What's next?

The results of the testing done so far have been very promising, showing that there is a potential way forward in protecting user privacy without compromising on utility. In this vein, Google has previously claimed that advertisers can expect to make at least 95% as many conversions per dollar spent compared to cookie-based advertising, so ads can live on and be effective without violating user privacy.

The company has confirmed that it will be phasing out support for third-party cookies by 2022 and will no longer build identifiers in its products to track individual activity. The firm always uses the term "individual" to indicate the distinction between the two tracking techniques. To be clear, it isn't giving up on getting metadata from you; it's just making it difficult for providers to identify you based on your metadata.

Privacy-preserving APIs using FLoC-based cohorts will enter public testing next month, and will also be tested with Google Ads advertisers in Q2 of this year. Google is welcoming feedback on its approaches and recommends that ad-tech providers evaluate the proposed algorithms on their own proprietary datasets too. The process will likely go through several iterations of enhancements and feedback before it becomes the norm within the next couple of years, provided that everything goes smoothly.

tl;dr

All in all, Google will still be using your data to show you ads, but similar adverts will now be presented to the entire cohort rather than personalized to you alone. Provided that a cluster size is large enough, your activity won't be traceable back to you unless it is combined with other signals. Similarly, your data won't be shared with a centralized server in raw format. This is in stark contrast to current methodologies where Google has built a complete ad profile unique to you that it can share with ad partners. Provided that the initiative is successful, Google hopes that browsers and ad partners will adopt it as the de facto standard for ads in the future.

Coming back to the original question: Does this mean that the company won't track you at all? Kind of. It won't track your individual activity, but the activity of a group of similar users will be utilized to show ads, obviously at a lower granularity level than what we have right now, thus preserving privacy to some extent. Google will need to have strong governance and security procedures in place which ensure that users cannot be re-identified by combining their data with other signals, something that the company is already considering.