IoT sensors, big data and advanced analytics to ensure optimal growth - Part IV
In this blog entry, we will write about a topic that is very relevant these days: GDPR and its impact on analytics. We will go through some of the most relevant aspects of GDPR in the context of mobility/location data. We write about the limitations that the regulations introduce, but also about some techniques that can enable us to use very sensitive location data. We will present an example of a technique that can potentially enable malicious parties to recover personal information by using what could be considered fully anonymized data.
Understanding a few key concepts
Let us start by defining a few concepts that are key to understand GRPR in our context.
Personal data. According to the European Commission, personal data is “any information that relates to an identified or identifiable living individual.
Personally identifiable information (PII). Also known as sensitive personal information (SPI), is information that can be used on its own or in combination with different sources to identify a person. PII is a subset of personal data. Different techniques can be used to recover personal data from PII, a simple example would be that of “re-identification from inference”, where possession of additional information can enable one to infer the identity of the data subject.
Identifier (ID). In our context, an identifier is simply a name that identifies a unique entity. We are interested in unique identifiers since we are interested in a person’s identity and a way to protect it.
Quasi-identifier. This is a piece of information that is not a unique identifier in itself, but that is possible to correlate with other data sources and quasi-identifiers to create or recover a unique identifier. Quasi-identifiers can this, when combined, become PII. This process is called re-identification. (See for example the case where Latanya Sweeney showed that even though gender, birth date and postal codes do not uniquely identify an individual, the combination of these three quasi-identifiers is sufficient to identify nearly 90% of individuals in the U.S.A. )
Pseudonymized data. This is data that has been subject of pseudonymization, a method to substitute PII with a consistent value. This consistent value can be, for example, a string of characters (as for the case of data pseudonymization by hashing), that is in general not so simple to trace back to the data subject (not so simple, but possible!).
Anonymized data. This is personal data that has been rendered anonymous, data that cannot be traced back to individuals by any mechanism or technology. It is very important that, for data to be truly anonymized, the anonymization must be irreversible.
Examples of personal data
- Name and surname
- Home address
- E-mail address such as firstname.lastname@example.org
- An identification card number
- Location data (for example, GPS location data generated by a mobile phone)
- An internet protocol (IP) address
- Cookie ID
- Advertising identifier of your mobile phone
- Data kept by a hospital or doctor, which could be used to uniquely identify a person
Example of pseudonymized data
In figure 1, we present an example, perhaps an oversimplified one, of pseudonymized data: a hashed name, middle name and surname string. In the code snippet below you can see a simple hashing procedure. We use the SHA-256 cryptographic hash algorithm to mask my name. The result is a character string that cannot be re-identified in a straightforward manner (See however about rainbow tables and full hash tables).
Figure 1. Simple example showing the output of the SHA-256 cryptographic hash algorithm.
Anonymized location data (The naïve way)
Before getting to what anonymized location data is, let’s consider what PII data is in this context. The headers of a very simple table with PII location data would look something like the following,
(e.g. Name-Surname, Phone number or IMSI, etc.)
(e.g. Device brand/model, etc.)
(e.g. location as calculated by GPS or Bluetooth, etc.)
Time of measurement
(i.e. Time when the location of the user was estimated.)
In the table above, we have that the column Identifier contains personal data while the columns Pseudo-identifiers, Location and Time of measurement are quasi-identifiers.
Why are Location and Time of measurement quasi-identifiers? As it happens our worldlines as individuals are highly cyclic, highly repetitive. We have very well eweekstablished routines: we visit our work locations during weekconstraindays and dwell there for about 8 hours, we perhaps have our preferred gym in the city, the store we really love to buy our supplies. It has been shown that all it takes to re-identify individuals in a data sample stemming from the movement of half a million persons over a period of a week , are a few spatio-temporal points (See: Not so unique in the crowd: a simple and effective algorithm for anonymizing location data).
In an attempt to anonymize location data as the one presented in the table above we could aggregate at a geographical level. Typically, this is done by generating a grid like the one in figure 2.
Figure 2. Generating a spatial grid to aggregate groups of persons and avoid the use of identifiers. This is a naïve approach to anonymization.
What we would see in a dataset aggregated like this are measurements that tell the number of persons per cell at time t. The trajectory of the user is illustrated in figure 3 below.
Figure 3. Example of a trajectory for a user up until time t_4. The trajectory is represented as a set of locations within the grid. The location vector of such a user would be given by S_i,t=[5,2,3,8].
What we will show is that even this kind of aggregated data is not fully anonymized: we can recover individual trajectories from such a data set! The problem consists of determining the most likely location for each user at time t+1, subject to an overall constraintim that we know how many users in total there must be at each location at time t+1.
This problem can be formulated as one of completing a decision matrix. This has N rows (one for each user trajectory to date), and N columns. Let’s suppose that from the “anonymized” data set, we know that there must be two users in location 1 and four users in location 2. Then the first two columns in the matrix will represent the two available occupancy slots available in location 1. Let’s exemplify this in figure 4 below:
Figure 4. Example of a decision matrix with N locations and N trajectories. The N trajectories correspond to N users.
In this decision matrix Xt, x_(i,j)t=1 if next location for trajectory i is the location identified by column j, and zero otherwise. The matrix element with a red underscore represents the trajectory of user #2 being assigned to location 4 at the next time step. A valid completion of the matrix has every row adding up to one. This is because each trajectory is assigned one and only one next location. Every column must add up to one, this meaning that each occupancy slot is filled by one and only one user.
In other words, the decision matrix above indicates that the next location for the trajectory of user 2 has been assigned to location 4 as can be seen in figure 5:
Figure 5. Illustration of the trajectory generated by user 2 until to time t_5.
When we complete the decision matrix Xt, we need to do it in a way that is not randomly assigning occupancy to the slots in the cell. In order to avoid this randomness, we create a cost matrix Cc . This will be also NxN, with the same row and column structure as Xt. A difference is that Cc is not being filled with 0s and 1s but with values representing the cost of moving to the location for slot j given the trajectory so far for user i. Take for example the trajectory in figure 6, it finishes at location 12. We can use as the cost of moving to each potential next location, simply the number of hops in the grid to get there (red numbers). Of course, the cost function would have a more complex set of values following a different cost distribution, but this one serves as a good illustration.
Figure 6. Illustration of the trajectory generated by user 2 until to time t_5. In this figure, we assign the cost for user 2 to move from cell 8 to any cell in the grid. This cost is written in red.
A row in the cost matrix Cc would be filled according to the costs in figure 6, as illustrated in figure 7.
Figure 7. The cost matrix: we show how the cost matrix is to be filled depending on the costs associated with user 2 moving away from location 8.
Without going into further details about how to calculate the cost functions, we can say that this problem is an optimization problem subject to a constraint. We want to
In the paper Trajectory Recovery From Ash: User Privacy Is NOT Preserved in Aggregated Mobility Data, it is shown that by using an algorithm similar to the one described above, one is able to recover users’ trajectories with accuracies of up to 93%. This would indicate severe privacy leakage in such data sets… and nobody wants to be in such a position with GDPR being around the corner!
Is everything lost? Can’t we use location data even at this aggregated level? The answer is yes, with the use of privacy enhancing techniques. In the next blog post, we will discuss different techniques that can be used to fully anonymized location data. Until then, happy GDPR, oh sorry! Happy commute