DIT — enabling de-identified data collection on WhatsApp

At WhatsApp, privacy is our DNA. That’s why we rolled out end-to-end encryption in 2016, so that only you and your intended recipients can see the messages you send. But securing messages and calls is just one part of how we minimize the information we collect in the process of providing a global service.

We’re always looking for ways to improve privacy while maintaining a reliable network that supports more than 100 billion messages and 1 billion calls per day. We’re excited to share that we’ve completed the global rollout of a new method for gathering usage, reliability, and performance data, called De-identified Telemetry (DIT), and are now testing it everywhere to ensure that it can support our scale. DIT (formerly known as PrivateStats) aims to further minimize any metadata tied to a specific person or phone number, and ultimately makes WhatsApp even more private.

In order to provide a reliable network at our scale, we need to understand how our service is functioning. To do this we need metrics such as whether messages are delivered and how many people are using various operating systems. DIT is built on a proprietary Anonymous Credential System (ACS) that is designed to authenticate data without our server ever learning where the information is gathered from. To date, we rely on data deletion and secure storage protocols to prevent usage information from being tied back to people, but we want to go even further with our privacy protection measures.

Combined with other techniques, we believe that DIT will eventually allow us to obtain usage, reliability, and performance data about our service in a de-identified, privacy-protective way, effectively making WhatsApp even more private. For example, we would be able to understand things like how many people have outdated operating system software or which version of WhatsApp they are running without knowing who those people are. We could also understand whether messages have been sent successfully without knowing who sent them. These sorts of insights help us better operate, support, and develop WhatsApp’s service, and we’re excited to be testing a way to gather them without tying them to a specific user.

We began building these technologies in early 2020 and, although we are still testing them today, we believe that the underlying techniques can be applied to other products and use cases beyond messaging. To facilitate that, below is a detailed overview of how DIT and the ACS work in their current form, so that the engineering community can benefit from these developments. More information can also be found in our whitepaper.

How can we make authentication de-identified?

The idea behind DIT is to collect de-identified analytics data from client applications (or “clients”) in a way that is also authenticated, which may sound counterintuitive. To start, we have to explain what types of data can be gathered from a client. This includes usage, performance, and reliability information such as app versions and whether or not a message was sent successfully. We collect a minimal amount of information in order to operate our service, and we take steps to reduce access to it through secure storage and data deletion. Even so, this performance, usage, and reliability metadata could ordinarily be associated with an individual in some way due to authentication requirements. With DIT, we’re aiming to change that.

To gather analytics data in a de-identified and authenticated way, the logging requests from WhatsApp clients cannot contain anyone’s identity or any identifiable information, such as the IP address of the client. To ensure that we are doing this in a secure way, we have to enable this technology while simultaneously ensuring that only logging requests from legitimate WhatsApp clients are accepted.

At a high level, DIT addresses this conundrum by splitting the logging workflow into two distinct steps. First, WhatsApp clients use an authenticated connection to the server to obtain an anonymous token (also referred to as an anonymous credential) in advance. Then, whenever the clients need to upload logs, they send the anonymous token along with the logs in an unauthenticated connection to the server. The anonymous token serves as proof that the client is legitimate. The ACS is the component that supports this workflow.

The new logging workflow

Here is how the new logging workflow functions:

For the first step:

1.) Initially, the WhatsApp mobile client obtains a batch of tokens from our servers using a Verifiable Oblivious Pseudorandom Function (VOPRF) scheme. Each token is an evaluation of the VOPRF, with a random string that the client chooses as the input.

2.) The client then sends a network request with a token.

3.) When a request hits our servers, the authentication server verifies the legitimacy of the request. The ACS, which manages keys for several applications, then evaluates the VOPRF using its secret key.

4.) The result is returned as the credential to the mobile client via the application server.
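
The blinding idea behind the first step can be sketched in a few lines. This is a minimal illustration, not our production implementation: it uses a tiny toy group (the safe prime 2039) instead of Curve25519, it omits the proof that makes the evaluation verifiable, and all function names are hypothetical. It only shows why the server can evaluate the function on a client's input without ever seeing that input.

```python
import hashlib
import secrets

# Toy group parameters (assumption, NOT secure): safe prime P = 2*Q + 1.
# A real deployment would use an elliptic curve group instead.
P = 2039
Q = 1019


def hash_to_group(data: bytes) -> int:
    """Map bytes to a non-identity element of the order-Q subgroup."""
    digest = int.from_bytes(hashlib.sha256(data).digest(), "big")
    # Squaring a value in [2, P-2] yields a quadratic residue other than 1.
    return pow(2 + digest % (P - 3), 2, P)


def client_blind(token_input: bytes) -> tuple[int, int]:
    """Client: hide the hashed input behind a random exponent r."""
    r = secrets.randbelow(Q - 1) + 1  # r in [1, Q-1]
    return r, pow(hash_to_group(token_input), r, P)


def server_evaluate(secret_key: int, blinded: int) -> int:
    """Server: apply its secret key to the blinded element only."""
    return pow(blinded, secret_key, P)


def client_unblind(r: int, evaluated: int) -> int:
    """Client: strip r to recover H(input)^secret_key, the token."""
    return pow(evaluated, pow(r, -1, Q), P)


def direct_evaluate(secret_key: int, token_input: bytes) -> int:
    """What the server computes later, when the input is revealed."""
    return pow(hash_to_group(token_input), secret_key, P)
```

Because the blinded element is a uniformly random group element from the server's point of view, the server cannot connect the value it evaluated during issuance with the input it sees at redemption time.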

For the second step:

1.) When the WhatsApp mobile client logs telemetry data, it attaches the input associated with the token to the logging request and binds the request with an HMAC applied to the data with a key derived from the token.

2.) The application server forwards the request to the ACS, which validates the token and limits the number of times it can be used, then derives the HMAC secret and returns it to the application server.

3.) The application server verifies the integrity of the log and decides whether to proceed with it.
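
The second step can be sketched as follows. This is a simplified illustration under stated assumptions: `evaluate_token` is a stand-in that uses a keyed hash in place of the VOPRF evaluation, the key derivation label `dit-log-binding` is invented for this example, and the request is modeled as a plain dictionary. The point is the shape of the exchange: the client binds the log to the token with an HMAC, the ACS re-derives the HMAC key from the revealed input, and the application server checks the log's integrity.

```python
import hashlib
import hmac


def evaluate_token(acs_secret: bytes, token_input: bytes) -> bytes:
    """Stand-in (assumption) for the VOPRF evaluation of token_input."""
    return hmac.new(acs_secret, token_input, hashlib.sha256).digest()


def derive_mac_key(token: bytes) -> bytes:
    """Derive the HMAC key from the token; the label is hypothetical."""
    return hashlib.sha256(b"dit-log-binding" + token).digest()


def client_build_request(token_input: bytes, token: bytes, payload: bytes) -> dict:
    """Client: attach the token input and an HMAC over the log payload."""
    mac = hmac.new(derive_mac_key(token), payload, hashlib.sha256).digest()
    return {"token_input": token_input, "payload": payload, "mac": mac}


def acs_redeem(acs_secret: bytes, token_input: bytes) -> bytes:
    """ACS: recompute the token from the input and return the HMAC key."""
    return derive_mac_key(evaluate_token(acs_secret, token_input))


def app_server_verify(request: dict, mac_key: bytes) -> bool:
    """Application server: constant-time check of the log's integrity."""
    expected = hmac.new(mac_key, request["payload"], hashlib.sha256).digest()
    return hmac.compare_digest(expected, request["mac"])
```

Note that the token itself never travels with the log. Only its input does, so a party that lacks the ACS secret key cannot forge a valid HMAC for a modified payload.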

Figure: The logging workflow of the De-identified Telemetry (DIT) system

The pseudorandomness of the VOPRF evaluation ensures that tokens cannot be linked across the two steps, thereby decoupling a person’s identity from their log data. The verifiability helps clients ensure that their tokens were produced with a valid key rather than a maliciously crafted one.

Our decision to use VOPRFs for de-identified interactions was inspired by the Privacy Pass protocol and blind signatures. While Privacy Pass uses VOPRFs to prevent service abuse from third-party browsers, we’ve shown that the same construction can also be useful in first-party data minimization.

Deploying DIT at scale

There are several practical considerations and challenges when deploying DIT and the ACS at scale. Here is how we addressed some significant ones in testing:

The Curve Choice: Deciding which curve to use is an important part of the protocol setup. We compared RSA-based and elliptic curve (EC)-based VOPRF algorithms and decided to use an EC-based algorithm similar to Privacy Pass, mainly due to Privacy Pass’ path to standardization. Regarding the choice of the EC group, we initially intended to use Ristretto to instantiate the EC-VOPRF, but switched to Curve25519, the curve already bundled with the app for end-to-end encryption, because WhatsApp has stringent app size requirements. To guard against potential static DH attacks on Curve25519, we’ve also incorporated additional mitigations, such as more frequent key rotations.

Unlinkability guarantees: If DIT proves to be both reliable and effective at scale, it will eventually allow WhatsApp to understand, for example, how many people have experienced an app crash without knowing which people were impacted by the crash. To facilitate such aggregations, DIT has a pseudonymous identifier for each client that is rotated periodically and sent with the log payload. This lets clients control their pseudonymity while providing useful aggregate information linked by ephemeral identifiers. Along these lines, we accept slightly weaker unlinkability guarantees by allowing tokens to be reused a small number of times before they become invalid, which improves the system’s reliability and efficiency. We currently have the limit set at 64 times per day, which allows the vast majority of our clients to go up to an entire day without having to fetch a new token. The reuse of these tokens has no impact on the keys that enable and protect WhatsApp’s end-to-end encryption.
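
On the client side, the reuse limit and the rotating identifier could be tracked roughly as in the sketch below. The class and its method names are hypothetical, and the rotation period is an assumption chosen only for illustration (the text above does not state one). Only the limit of 64 uses comes from the description above.

```python
import secrets
from typing import Callable

MAX_TOKEN_USES = 64              # reuse limit quoted in the text
ID_ROTATION_SECONDS = 7 * 86400  # hypothetical rotation period


class ClientLoggingState:
    """Tracks the current token, its use count, and the pseudonymous ID."""

    def __init__(self, fetch_token: Callable[[], bytes], now: float):
        self._fetch_token = fetch_token  # performs the authenticated fetch
        self._token = fetch_token()
        self._uses = 0
        self._pseudonym = secrets.token_hex(16)
        self._pseudonym_created = now

    def next_credentials(self, now: float) -> tuple[bytes, str]:
        """Return the (token, pseudonym) pair to attach to the next log."""
        # Rotate the pseudonymous identifier once it is old enough.
        if now - self._pseudonym_created >= ID_ROTATION_SECONDS:
            self._pseudonym = secrets.token_hex(16)
            self._pseudonym_created = now
        # Fetch a fresh token once the current one is used up.
        if self._uses >= MAX_TOKEN_USES:
            self._token = self._fetch_token()
            self._uses = 0
        self._uses += 1
        return self._token, self._pseudonym
```

With this shape, a client that logs fewer than 64 events between fetches never needs a second authenticated round trip, which is the efficiency gain that the reuse allowance is meant to provide.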

Re-identifiability: We reduce the re-identification risk of a VOPRF token by actively measuring the re-identification and joinability potential of the data that’s collected and raising an alert if that potential exceeds a particular threshold. This allows us to stop gathering telemetry data that has high re-identification potential. We have also added further protections to mitigate this risk, including removing the IP address that would have been associated with the anonymous requests at our edge servers so that the logging server does not have access to it. Since we are actively testing DIT, we are still exploring the impact and tradeoffs of this approach, and may end up adjusting it prior to fully deploying and relying on DIT.
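
To make the idea of a measurable threshold concrete, here is one simple way such a check could look. This is an assumption for illustration, not the metric we actually use: it applies a k-anonymity-style uniqueness measure, counting the fraction of records whose combination of field values is shared by fewer than `k` records. The function names, the default `k`, and the default threshold are all hypothetical.

```python
from collections import Counter
from typing import Mapping, Sequence


def reidentification_risk(
    records: Sequence[Mapping[str, object]],
    fields: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of records whose `fields` values occur fewer than k times."""
    if not records:
        return 0.0
    keys = [tuple(record.get(f) for f in fields) for record in records]
    counts = Counter(keys)
    risky = sum(1 for key in keys if counts[key] < k)
    return risky / len(records)


def should_alert(
    records: Sequence[Mapping[str, object]],
    fields: Sequence[str],
    k: int = 5,
    threshold: float = 0.05,
) -> bool:
    """Alert when too many records sit in small, easily singled-out groups."""
    return reidentification_risk(records, fields, k) > threshold
```

A field combination that triggers the alert is a candidate for being dropped or coarsened before collection continues.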

Rate limiting: Since we cannot rate limit people during the anonymous redemption of the tokens, we use key rotation to rate limit them. We do this by limiting the number of tokens a single client can request per public key, and rotating the public key to expire the tokens. For redemption requests, the logging server also tracks the number of times a unique credential has been redeemed and rejects the logging request if the credential has already been redeemed more times than a preset threshold.
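
The two limits described above can be sketched with a pair of counters. This is a minimal in-memory illustration with hypothetical class names and limits. A real deployment would need shared, persistent state across servers, which is not shown here.

```python
from collections import defaultdict


class IssuanceLimiter:
    """Caps tokens per client per public key (issuance is authenticated)."""

    def __init__(self, max_tokens_per_key: int):
        self.max_tokens_per_key = max_tokens_per_key
        self._issued = defaultdict(int)  # (key_id, client_id) -> count

    def allow(self, key_id: str, client_id: str, batch_size: int) -> bool:
        issued = self._issued[(key_id, client_id)]
        if issued + batch_size > self.max_tokens_per_key:
            return False
        self._issued[(key_id, client_id)] = issued + batch_size
        return True


class RedemptionTracker:
    """Counts anonymous redemptions per credential under the current key."""

    def __init__(self, current_key_id: str, max_redemptions: int):
        self.current_key_id = current_key_id
        self.max_redemptions = max_redemptions
        self._counts = defaultdict(int)  # credential -> redemption count

    def rotate_key(self, new_key_id: str) -> None:
        """Rotating the key expires every credential issued under the old one."""
        self.current_key_id = new_key_id
        self._counts.clear()

    def redeem(self, key_id: str, credential: bytes) -> bool:
        if key_id != self.current_key_id:
            return False  # issued under an expired key
        if self._counts[credential] >= self.max_redemptions:
            return False  # redeemed too many times already
        self._counts[credential] += 1
        return True
```

The issuance limiter can key on a client identity because that request is authenticated, while the redemption tracker only ever sees the credential, which keeps the anonymous path free of identifiers.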

Communication cost: Compared with WhatsApp’s existing procedures, DIT’s workflow takes extra steps to fetch the credential prior to the actual logging request and communicates with the ACS in the middle of each step. To save time and reduce the number of round trips to the server, we allow tokens to be reused a few times. We also deploy ACS servers locally in relation to WhatsApp application servers to reduce the latency from cross-region traffic.

What’s next for DIT

Our ethos at WhatsApp has always been to provide a simple, reliable service at scale that preserves the privacy of the people who choose to use it. We believe that additional privacy-preserving techniques, both at the time of collection (e.g., local differential privacy) and after collection (e.g., global differential privacy), can further strengthen our privacy guarantees. There is a long road from testing this technology to fully utilizing it without any redundancies in place, but we are excited to be on this journey. We’re looking forward to seeing how our testing performs and making any necessary refinements.

DIT is part of a broader initiative across Facebook to build and deploy features and infrastructure that can further enhance user privacy and minimize data collection. More information about other privacy-preserving technologies in development can be found here.