Google’s $2 billion acquisition of Fitbit last month has been met with concern from privacy advocates worried about how the company will use personal fitness data. The reaction prompted Google to clarify that the acquisition is “about devices, not data.”
The deal has brought to light a larger issue that we all seem to gloss over: Every day, millions of people share seemingly innocuous personal health information with many stakeholders, including employers, insurance companies and providers, and even post it publicly on the Internet.
This becomes especially concerning at a time when hundreds of clinical studies, some with hundreds of thousands of participants, may request permission to use the same fitness-tracker data to study everything from obesity to COVID-19 symptoms. In the service of public health, many of these datasets are then made publicly available so that other researchers can reproduce the findings or perform new research. But this is not risk-free.
Examples of fine-grained step data shared on public social networks: Garmin Connect platform (left), Fitbit steps shared automatically on Twitter (right).
In a world where “anonymized” study participants can be individually re-identified simply by using a genealogy database, it’s not a huge leap to imagine malicious actors figuring out the true identity of a study participant by triangulating something as simple as their step count.
Consider that fitness data such as step counts is just a sequence of numbers, much like DNA is a sequence of the nucleotides C, G, T and A. As the length of the sequence grows, the likelihood of someone else having exactly that sequence over the same dates decreases exponentially.
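To get a feel for how quickly that likelihood shrinks, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each day's step count is drawn independently and uniformly from a fixed number of equally likely values. Real step counts are neither uniform nor independent, so the numbers below are not a measurement; the function name and the figure of 1,000 values per day are hypothetical choices.

```python
def expected_coincidental_matches(population, values_per_day, days):
    """Expected number of OTHER people who share one person's exact sequence.

    Illustrative assumption: every daily step count is independent and
    uniformly distributed over `values_per_day` equally likely values.
    The chance that one other person matches on all days is then
    values_per_day ** -days.
    """
    return (population - 1) / values_per_day ** days


# Hypothetical parameters: 100 million people, 1,000 distinct daily values.
for days in range(1, 8):
    matches = expected_coincidental_matches(100_000_000, 1_000, days)
    print(f"{days} day(s): about {matches:.3g} expected coincidental matches")
```

Under this toy model, each additional day divides the expected number of coincidental matches by the number of possible daily values, which is the exponential decrease described above.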
Just six days of step counts are enough to uniquely identify you among 100 million other people. Step counts are a unique key that can be used to match the weekly step log from your latest tweet to the “anonymized” step count in a research dataset, a dataset that may also list other sensitive information, like a mental health diagnosis. Without a course correction, exposing such data through these kinds of re-identification attempts will become increasingly easy, as it has for other complex datasets in the past.
Schematic of a re-identification attack based on wearable data. A person with a heart condition decides to participate in a research study that collects physical-activity information through a wearable device, in addition to information about his condition (1). The participant also uses a social network to share the outcomes of his physical activity and set weekly goals (2). At the end of the study, the research data is anonymized and made publicly available (3). A malicious actor can retrieve the anonymized data set and the data published on the social network and match them on the physical-activity time-series (4). The malicious actor can re-identify the study participant and link his social network identity to the medical condition (5).
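The matching step (4) needs very little machinery, which is part of what makes it worrying. The sketch below is a minimal illustration on invented toy data, not a description of any real attack or dataset: the function name, the record layout, the usernames and the participant IDs are all hypothetical, and it assumes the two sources report identical daily totals for the same dates.

```python
def reidentify(public_logs, study_records, min_days=6):
    """Link public usernames to study records with identical step sequences.

    public_logs:   {username: {date: steps}}  (hypothetical social-network data)
    study_records: {participant_id: {"steps": {date: steps}, "diagnosis": str}}
    A link is reported only when exactly one study record matches on every
    shared date and at least `min_days` dates are shared.
    """
    links = {}
    for username, log in public_logs.items():
        candidates = []
        for participant_id, record in study_records.items():
            shared = log.keys() & record["steps"].keys()
            if len(shared) >= min_days and all(
                log[day] == record["steps"][day] for day in shared
            ):
                candidates.append(participant_id)
        if len(candidates) == 1:  # ambiguous or missing matches are ignored
            links[username] = candidates[0]
    return links


# Invented toy data for six consecutive days.
days = [f"2020-03-0{d}" for d in range(1, 7)]
public_logs = {
    "@runner_jo": dict(zip(days, [8123, 10450, 6310, 12002, 9544, 7781])),
}
study_records = {
    "P-001": {"steps": dict(zip(days, [4020, 5120, 6310, 3999, 7200, 8100])),
              "diagnosis": "none recorded"},
    "P-002": {"steps": dict(zip(days, [8123, 10450, 6310, 12002, 9544, 7781])),
              "diagnosis": "heart condition"},
}

for username, participant_id in reidentify(public_logs, study_records).items():
    print(username, "->", participant_id,
          "->", study_records[participant_id]["diagnosis"])
```

An exact comparison is the simplest possible choice here. Removing names and IDs from the study records does nothing to stop it, because the step sequence itself acts as the identifier.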
To reduce these risks, we would ideally see fundamental changes in the business models of companies gathering fitness data. In the meantime, we need to educate research participants about the risks of their wearable data leaking through other channels. If someone is enrolling in a study that involves using their own personal wearable, researchers should warn them to turn off public dashboards and unlink other apps using their data if the person is concerned about their privacy.
Researchers should also make sure that datasets are not naively released into the public domain, but instead limited in use to qu