
Recent news reports described studies, and opposing views, about the Sturgis motorcycle rally and its potential contribution to the spread of Coronavirus. One study blamed the rally for over 250,000 new cases in four weeks, or some 20% of the national total during that time, causing $12 billion in treatment costs. Others question that conclusion, noting, for example, that the paper did not undergo peer review.
The studies relied on cellphone data, which immediately raised questions: How did the authors get access to cellphone data? Is it just floating around for anyone to use? One would think all such information proprietary, expensive, and held close to the vest. Were such data freely available, everyone could easily spy on everyone else.
Location Data Can Easily Be Linked to Identifiable Individuals
One of the studies, by the Covid Alliance, used data from a company called X-Mode. X-Mode collects location data about cellphone users and, customarily, sells it to “partners in advertising, trading, market research, real estate, college safety, and nightlife.” X-Mode claims to map over 5% of the U.S. population monthly, with plans to expand to over one-third of the country within 3-5 years. It claims over 20 million U.S. users and over 60 million globally. A year-long subscription costs $600,000 for U.S.-only data and $900,000 for global data, which excludes the European Union because of privacy regulations. Among other features, X-Mode boasts that it “collects a high accuracy (70% accurate within 20 meters), dense (150+ data pts per daily user, see users on average 15+ days out of the month) data panel that includes mobility metrics (speed, bearing, altitude, vertical accuracy) and other detection capabilities (IoT, Wi-Fi, and Beacon)” and that records include items such as an “anonymized” id, location with venue name and category, dwell time, longitude, latitude, and altitude, wifi SSID, and device model and carrier.
Ostensibly “anonymized,” the location data can easily be linked to identifiable individuals in many if not most cases. Each record identifies the places that the phone owner has been over time. Most people spend the night in their homes, the owners or occupants of which can frequently be identified through public records, search engines, or otherwise. Even in the case of multi-resident dwellings, a person’s place of work or study can be combined to associate a record with an individual. Once the record is associated with a known individual, it discloses all the places that person has been, at least with his or her phone turned on with location services enabled. Apartment residences, itinerant or nocturnal work, and other factors may of course complicate the linking process.
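To make the linking risk concrete, here is a minimal sketch of how night-time pings alone can suggest where a device's owner sleeps. Everything in it is an assumption made for illustration: the record layout, the device ids, the made-up coordinates, the 10 p.m. to 6 a.m. window, and the function names. It does not reflect X-Mode's actual schema or any method used in the studies.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records: (device_id, ISO timestamp, latitude, longitude).
# The ids and coordinates are invented for this illustration.
PINGS = [
    ("dev-1", "2020-08-10T02:15:00", 44.08020, -103.23110),
    ("dev-1", "2020-08-10T13:40:00", 44.41010, -103.50900),
    ("dev-1", "2020-08-11T03:05:00", 44.08030, -103.23120),
    ("dev-1", "2020-08-11T23:30:00", 44.08010, -103.23090),
    ("dev-2", "2020-08-10T01:00:00", 43.54990, -96.70030),
    ("dev-2", "2020-08-10T15:00:00", 44.41010, -103.50900),
]


def is_night(timestamp, start_hour=22, end_hour=6):
    """Return True when the timestamp falls in the assumed sleeping window."""
    hour = datetime.fromisoformat(timestamp).hour
    return hour >= start_hour or hour < end_hour


def likely_home(pings, device_id, precision=3):
    """Return the most frequent night-time grid cell for one device.

    Coordinates are rounded to `precision` decimal places (about 100 meters
    at 3 places) so that nearby pings fall into the same cell.
    """
    cells = Counter(
        (round(lat, precision), round(lon, precision))
        for dev, timestamp, lat, lon in pings
        if dev == device_id and is_night(timestamp)
    )
    if not cells:
        return None  # no night-time pings for this device
    return cells.most_common(1)[0][0]


print(likely_home(PINGS, "dev-1"))
```

An "anonymized" id offers little protection once a single cell like this can be matched against property or other public records, which is the step described above.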
Nothing Legally Constrains Those with Access Not to Spy
X-Mode says it offers its data free to Covid-19 researchers. Does this mean the researchers, along with X-Mode, its employees, its subscribers, and anyone else with access to its data, can spy on their enemies, neighbors, friends, family, celebrities with publicly known addresses, or anyone else? According to the Amazon page on which X-Mode makes the data available, “Subscriber use cases that include re-identification or otherwise associating content with an identifiable individual will be rejected as it is a violation of the Terms of Service.” The same page includes a notification at the top, evidently from Amazon, that reads as follows: “X-Mode has confirmed that this product contains sensitive information (e.g. health, financial, race), but that information cannot be used to identify a person.” This appears to be, at best, misleading, if not deceptive. While the data itself may not include personally identifiable information such as names, as X-Mode’s website states, such information can readily be divined in many if not most cases. Asked in a CNN interview whether its technology can identify individuals, X-Mode’s CEO stated that “it could, but we don’t allow that.” See especially the 2:10 through 2:20 marks of the video.
Yet, it seems, nothing but the terms and conditions applicable to the X-Mode data legally constrains those with access from spying. The legal and practical consequences of violating those terms and conditions are not clear. Nor are there any apparent mechanisms in place to detect misuse or prevent leakage. Should this data become public, anyone with moderate technical competence could spy on the movements of anyone captured in the data.
Different Approaches to Anonymization Yield Substantial Differences in Privacy Protection
At least one study relied on a different dataset whose owner, SafeGraph, also offers it free of charge to Coronavirus researchers. SafeGraph data appears to offer more meaningful anonymization. Rather than show all locations of a given device, SafeGraph aggregates visits to particular places, along with the home census block group (a small geographic unit made up of several census blocks) of the devices that visited each place. To understand the difference, consider that the X-Mode data might show that a particular device spends most of its nights at address A in City B, and that the cellphone visited location X during the Sturgis rally. The SafeGraph data would show the number of visits to location X from devices that dwell in a given census block group that includes address A, but without identifying address A. As a further privacy protection, SafeGraph does not identify census block groups from which only a single cellphone visited a given location. This enables researchers to track traffic while not disclosing all movements of a particular device.
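The aggregation described above can be sketched in a few lines. This is an illustration under stated assumptions, not SafeGraph's actual pipeline: the record layout, the block-group labels, and the function name are hypothetical, and the sketch counts distinct devices rather than individual visits. The threshold of two devices follows the description of suppressing block groups from which only a single cellphone visited.

```python
from collections import defaultdict

# Hypothetical raw visits: (device_id, home_block_group, place).
VISITS = [
    ("dev-1", "CBG-A", "location X"),
    ("dev-2", "CBG-A", "location X"),
    ("dev-3", "CBG-B", "location X"),
    ("dev-1", "CBG-A", "location X"),  # repeat visit by the same device
]


def aggregate_visits(visits, min_devices=2):
    """Count distinct devices per (place, home block group).

    Pairs seen from fewer than `min_devices` devices are dropped, so a lone
    visitor from a block group never appears in the output.
    """
    devices = defaultdict(set)
    for device_id, block_group, place in visits:
        devices[(place, block_group)].add(device_id)
    return {
        key: len(ids)
        for key, ids in devices.items()
        if len(ids) >= min_devices
    }


print(aggregate_visits(VISITS))
```

The output carries no device ids and no addresses. The single device from CBG-B is dropped entirely, so a researcher can see that two devices from CBG-A visited location X but cannot follow any one of them anywhere else.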
The two different approaches to anonymization yield substantial differences in privacy protection. There are likely similar datasets available either for free or for purchase. Does the existence of such data and paucity of constraints influence consumers’ willingness to use location services? Please comment below or contact me directly to share the details of any additional datasets and their privacy regimes, your reaction to those described above, and relevant laws and regulations on the books or pending.
Bruce Ellis Fein, Legal Director and Co-Founder of Dagger Analytics, Inc., leads Dagger’s legal predictive coding operations.