Authors: Thyne Boonmark and James Carney
Note from the authors
This is not intended to be an exhaustive discussion or even a practical guide on how to implement differential privacy, but rather a way to convey lessons learned about implementation details and the explainability of privacy guarantees. We hope that this post makes clear some of the considerations that should be made when implementing a differentially private system. The takeaways outlined here are not tied to any specific implementation of differential privacy and are meant to be considered when creating any differentially private system.
The following are all hypothetical examples that use synthetic data we have created; any mentions of data or information are fictional and not actual personal information. This is done to give a semi-realistic example of why implementing privacy systems needs special consideration.
The Department of Mental Health and Substance Abuse (DMHSA) is a government agency responsible for gathering information and setting policy related to sensitive public health issues, in particular mental health and substance abuse. DMHSA conducts an annual nationwide survey on the use of illegal substances and mental disorders. The survey, aptly named the National Survey on Health and Drug Use (NSHDU), aggregates face-to-face interviews with approximately 70,000 Americans ages 12 and older. DMHSA funds and organizes the survey and releases summary aggregations from it to policy makers, research institutions, and public health organizations. The survey is incredibly influential in helping policy makers understand the state of mental health and substance abuse in the country and is used to drive decisions such as:
- Targeted intervention programs for mental health in schools
- Which school districts get prioritized funding for teen substance abuse psychological services
- Policy decisions around criminality of substance abuse
- Research around prediction of later risk / patterns of substance abuse
These are just some examples, but there are many more.
Survey participants were asked to answer questions about their mental health history and their substance use. These questions may include things like “How many times have you considered suicide in the past month?” and “On average, how many times have you used methamphetamine in the past month?”. Interviewers did not record survey participants’ names but did record demographic information such as age, ethnicity, gender, household income, school grade (if applicable), state, zip code, and the date the survey was taken.
Policy makers want more data
Alex has recently been hired as a research fellow with DMHSA and has been given an exciting new opportunity by the Director of DMHSA. Previously, the NSHDU survey data was only released to policy makers (and other qualified public health related research organizations) as a basic summary of aggregated statistics with limited granularity. These partner organizations have long requested more access to the survey data via an interactive database that they can query so that they can get more targeted analysis on the needs of specific areas or demographics. This year, DMHSA has decided to try out this new method of releasing the survey and the Director has tasked Alex with setting up this system.
Alex’s primary task is to design and implement an interactive database for the NSHDU data which will allow for partner organizations to query the database. It is of utmost importance to both DMHSA as well as Alex personally that this new system does not violate the privacy of any of the people who participated in the survey. Specifically, the interactive survey database cannot reveal any sensitive attributes of specific people in the survey.
Why not just release non noisy statistics?
Alex’s first thought is that if the survey data itself doesn’t record participant names, then it is already anonymous, so she can simply allow direct queries to the database. This is actually a very common thought and on the surface seems perfectly reasonable.
Suppose that Alex allowed partner research organizations to query the database and ask aggregate statistical queries in a granular fashion. For example,
What is the average cocaine use within zip code 94701?
How many Hispanic people under the age of 16 living in zip code 94701 received mental health services in the past 6 months?
Although the query responses are aggregated statistics, they are in fact not anonymous and this setup could lead to unacceptable privacy loss of participants in the data.
Suppose that, using public information such as Facebook or people-search sites, a malicious or even unintentionally malicious actor (in this case, a malicious actor could even just be a curious employee at a partner organization) knows that there is exactly one non-Black 16-year-old female who was plausibly surveyed within the zip code 57106.
SELECT AVG(how_many_times_used_heroin) FROM data WHERE zip_code == 57106 AND gender like "Female" AND age == 16 AND race NOT like "Black"
# OUTPUT -> 6
This query shows that the one non-Black 16-year-old female self-reports using heroin 6 times per month.
A simple fix Alex could make here is that the query mechanism should not give an output if the number of individuals included in the query result is below some threshold. However, even with this fix implemented, there are still issues with the system design.
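Such a minimum-query-set-size guard might be sketched as follows (the threshold of 10, the `guarded_avg` name, and the row format are all illustrative, not part of Alex's actual system):

```python
# Hypothetical guard: refuse to answer queries over too few individuals.
MIN_QUERY_SET_SIZE = 10

def guarded_avg(rows, column, min_size=MIN_QUERY_SET_SIZE):
    # Suppress the output when fewer than min_size individuals match the query
    if len(rows) < min_size:
        return None
    return sum(r[column] for r in rows) / len(rows)
```

As the next queries show, suppressing small result sets alone does not prevent an attacker from differencing two large result sets.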
Suppose a malicious actor now writes the following queries:
SELECT AVG(how_many_times_used_heroin) FROM data WHERE zip_code == 57106 AND gender like "Female" AND age == 16
# OUTPUT -> 1.5
SELECT AVG(how_many_times_used_heroin) FROM data WHERE zip_code == 57106 AND gender like "Female" AND age == 16 AND race like "Black"
# OUTPUT -> 0
The first query includes all 16-year-old females in the given zip code, including the one non-Black individual. The second query excludes the non-Black individual but also reveals information about her heroin use: specifically, the malicious actor now knows that this one individual self-reports using heroin enough times to change the mean from 0 to 1.5. If the attacker can plausibly infer the number of 16-year-old girls in the zip code, the individual’s exact self-reported heroin use is revealed.
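The arithmetic behind this differencing attack can be made explicit. Assuming, hypothetically, that the attacker infers there are four 16-year-old girls surveyed in the zip code, three of whom are Black:

```python
# Hypothetical counts the attacker infers from public records
n_total = 4       # all 16-year-old girls surveyed in the zip code
n_black = 3       # those matching the second, narrower query
avg_all = 1.5     # first query result
avg_black = 0.0   # second query result

# The excluded individual's value falls out of the two averages:
# sum over everyone minus sum over the narrower group
target_value = avg_all * n_total - avg_black * n_black
print(target_value)  # 6.0, matching the earlier direct query
```

No single query was about one person, yet the combination pins down her exact answer.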
Alex decides to add noise
Alex realizes that there is a real problem with her current system design and decides to explore alternatives. Alex has heard about differential privacy and knows that it is a way of adding noise to query results that gives privacy to the data subjects (essentially through the plausible deniability generated by noisy outputs). What this means is that instead of users making a query and receiving a definitive result, the system takes the result and adds a random value pulled from a probability distribution.
Alex reads about how differential privacy is a privacy guarantee and is very excited to be able to show her team at DMHSA this clever way of providing privacy whilst also making partner organization’s happy.
Alex has found a helpful Python library that seems to implement exactly what she is looking for. She finds IBM’s Python-based differential privacy library and determines that she can use it in her query design. This particular API uses the Laplace distribution to create noisy means.
Her new system design might look something like the figure above.
Alex reads some primer material on differential privacy and she thinks she understands the general idea. For each of the supported query types, Alex will use the differential privacy library to make a “noisy” version of the query (noisy mean, noisy counts, etc.)
Intuition of Differential Privacy
From her reading, Alex knows that a randomized algorithm M (in her case, a query such as a mean) satisfies epsilon-differential privacy if the algorithm M behaves similarly regardless of whether an individual joined a study or not. That is, the algorithm only weakly depends on any one individual’s data. In other words, the epsilon in epsilon-differential privacy represents a bound on how much any one individual’s data can change the output.
Intuitively speaking, the algorithm does not “leak too much” information about you when it does a computation. This guarantee holds not just for a specific individual, but for everyone.
Here is the formal definition of epsilon-differential privacy for those who are curious:
A randomized algorithm M satisfies epsilon-differential privacy if, for all datasets D and D′ that differ in exactly one element, and for all events S ⊆ Range(M):

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]
Implementing IBM’s differential privacy lib
Rather than using pure SQL, Alex decides to implement some Python logic that will allow her to return differentially private versions of her queries. Specifically, she starts by using the diffprivlib.tools.mean() function to return noisy means when users query averages. Alex keeps the default parameters of this function, and her code runs without error.
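To build intuition for what a Laplace-based noisy mean does under the hood, here is a hand-rolled sketch (this is illustrative only, not diffprivlib's actual implementation; the function names and the bounds-clipping step are assumptions):

```python
import math
import random

def laplace_noise(scale):
    # Draw from Laplace(0, scale) via inverse-CDF sampling
    u = random.random() - 0.5
    sign = 1.0 if u >= 0 else -1.0
    return -scale * sign * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, epsilon, lower, upper):
    # Clip values to the stated bounds so the query's sensitivity is known
    clipped = [min(max(v, lower), upper) for v in values]
    true_mean = sum(clipped) / len(clipped)
    # Sensitivity of a mean over n values bounded in [lower, upper]
    sensitivity = (upper - lower) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon)
```

Note that the caller must supply bounds and an epsilon; as the next section shows, leaving such parameters at defaults is where Alex's troubles begin.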
While the code runs and appears to work, privacy leaks can still occur, since no work has been done to tune the hyperparameters of Alex’s differentially private implementation. Without properly setting the epsilon parameter, malicious actors can still exploit the system to violate the privacy of individuals included in the dataset.
The Epsilon Parameter
Within the NSHDU database, the variable `how_often_do_you_have_a_drink_of_alcohol` is a quantitative score representing how often an individual drinks alcohol. A score of 0 represents an individual who does not drink and a score of 9 represents an individual who drinks multiple times every day.
The image above shows the distribution of `how_often_do_you_have_a_drink_of_alcohol` query results for two neighboring datasets using differentially private mechanisms. Neighboring datasets are two datasets that differ in only one individual of interest, i.e., one dataset in which a specific individual’s survey responses are included and one in which they are excluded. As you adjust the value of epsilon, you can see the effect that epsilon has on the distributions of a query’s output on neighboring datasets.
Suppose Alex chose an epsilon of 2. We can see that the distributions of query results are significantly different and are centered around different points. This means that an attacker can identify not only that the individual of interest is within the dataset, but also that the individual’s `how_often_do_you_have_a_drink_of_alcohol` score was higher than average. This is similar to earlier, when we showed how two carefully constructed queries could reveal an individual’s effect on the mean. If the goal of this implementation is to protect the privacy of a sensitive attribute for all individuals included in the data, this choice of epsilon for this query would be inappropriate.
On the flip side, as we lower the value of epsilon we can see that the query results become less accurate, as shown in the figure below. Choosing a lower epsilon value will decrease the accuracy, or widen the distribution of your query results (because intuitively you are adding more noise to get more privacy).
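This trade-off is visible directly in the scale of the Laplace noise, which grows as epsilon shrinks. A small sketch, assuming a hypothetical mean query over a score bounded in [0, 9] with 1,000 respondents:

```python
import math

# Hypothetical query: mean alcohol-use score (range 0-9) over 1,000 respondents
sensitivity = 9 / 1000

def laplace_scale(sensitivity, epsilon):
    # Scale parameter of the Laplace noise the mechanism adds
    return sensitivity / epsilon

for eps in (0.1, 1.0, 2.0):
    b = laplace_scale(sensitivity, eps)
    # Standard deviation of Laplace(0, b) is b * sqrt(2)
    print(f"epsilon={eps}: noise std dev ~ {b * math.sqrt(2):.4f}")
```

Halving epsilon doubles the noise scale, so privacy and accuracy pull directly against each other.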
There is almost always a trade-off between privacy and accuracy to keep in mind, and these are fundamental questions Alex needs to ask the stakeholders of the project, such as:
- How accurate do you need your query results to be?
- Is there a certain amount of privacy that you’re willing to risk in return for more accurate results for the people using your system?
It is important for Alex, and anyone implementing systems with privacy guarantees, to think about these questions, figure out the goals of the data system, and weigh the potential consequences of various implementations of differential privacy.
The Privacy Budget
In addition to selecting the right epsilon per mechanism, there are other considerations that Alex must take into account in her system design. In the previous example we saw that if an inappropriate epsilon is chosen, an attacker who makes a large number of queries can reconstruct the noise distribution around any given query result.
With a conservative enough epsilon, privacy can still be preserved to a point, but this is still only one part of the complete story. If Alex doesn’t read her differential privacy primer material carefully, she might miss a very important point:
Privacy loss is cumulative
This means that the privacy loss from every query result produced by a specific mechanism (that is, every time a user gets a result from a specific query) accumulates into a total privacy loss for that mechanism. Even if Alex’s system uses appropriate epsilons, if it allows users unlimited query power, malicious actors could theoretically reconstruct the dataset and expose sensitive information.
To combat the privacy loss associated with querying, Alex should implement a privacy budget. The idea of a privacy budget is that there is a predetermined maximum privacy loss that is not to be exceeded (i.e., a maximum accumulated epsilon). This budget serves as a ceiling that caps the number of queries that can be performed. As each query is made, its epsilon is added to an epsilon counter. Users can perform queries as long as adding the epsilon associated with a query does not push the counter over the budget.
For example, if Alex were to set the IBM privacy accountant to an epsilon budget of 10 and each query had an epsilon of 1, then users could only make 10 queries before the accumulated loss reaches the maximum acceptable privacy loss. Once they have exhausted the privacy budget, no user can make additional queries.
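A toy version of this bookkeeping might look like the following (a simplified stand-in for a real accountant such as diffprivlib's `BudgetAccountant`; the class and method names here are illustrative):

```python
class PrivacyAccountant:
    """Tracks cumulative epsilon spent against a fixed budget."""

    def __init__(self, budget):
        self.budget = budget
        self.spent = 0.0

    def can_spend(self, epsilon):
        # True if a query with this epsilon still fits within the budget
        return self.spent + epsilon <= self.budget

    def spend(self, epsilon):
        if not self.can_spend(epsilon):
            raise RuntimeError("Privacy budget exhausted")
        self.spent += epsilon
```

With a budget of 10 and per-query epsilon of 1, the eleventh call to `spend` raises, matching the scenario above.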
The privacy budget needs to be enforced or else you no longer actually have a privacy guarantee.
Preventing users from getting any information after the privacy budget is spent may not be ideal. To alleviate this issue and let people querying the data still get some information, a cache can be implemented. The cache keeps track of previous query results and simply returns those values if the same query is run again. This reduces how quickly the privacy budget is spent, since redundant queries do not spend more of the budget, and it means that once the budget has run out users can still retrieve the results of previously made queries.
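A sketch of a query engine that combines a budget with such a cache (all names are illustrative; `run_noisy_query` stands in for any differentially private query, such as a noisy mean):

```python
class CachedDPQueryEngine:
    """Budget-limited query engine that replays cached results for free."""

    def __init__(self, budget, run_noisy_query):
        self.budget = budget
        self.spent = 0.0
        self.run_noisy_query = run_noisy_query  # e.g. a noisy-mean function
        self.cache = {}

    def query(self, key, epsilon):
        # Replaying a stored result reveals nothing new, so it costs no budget
        if key in self.cache:
            return self.cache[key]
        if self.spent + epsilon > self.budget:
            raise RuntimeError("Privacy budget exhausted")
        self.spent += epsilon
        result = self.run_noisy_query(key)
        self.cache[key] = result
        return result
```

Returning the cached value is safe because it is post-processing of an already-released noisy output; it leaks no additional information about the data subjects.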
Just because records in the data do not have names does not make them anonymous, nor does it give any firm privacy guarantee.
Differential privacy is a guarantee on a mechanism and not on a dataset. Designing an entire system that is differentially private requires careful consideration of all types of queries that will be supported.
Hyperparameters are important. Simply adding noise to a statistic isn’t enough to preserve differential privacy. It is important to tune the epsilon parameters to your use case. This decision should take into account the needs of the people using the system, as well as the possible consequences should a privacy leak occur. Is there a certain level of risk you’re willing to put onto participants in the dataset in return for more accurate statistics?
Privacy loss is cumulative over all queries made with a mechanism. In order to limit this loss over time it is important to use a privacy budget.
We would like to thank our instructors Paul Laskowski and Nitin Kohli.