Data Anonymization through Dimensionality Reduction #79

MerlinSchaefer · 2021-01-03T11:36:42Z

Hey everyone,

After viewing the part on how anonymization is flawed (great course so far btw!), I was wondering how this applies to data that was anonymized through dimensionality reduction techniques (PCA, t-SNE, LDA, etc.).
Some well-known datasets (e.g. about credit card fraud) use these techniques.
I am aware that under some circumstances you can reconstruct the data, but from my understanding this usually requires specific information about how the data was structured before the dimensionality reduction and how the reduction was done.

Does someone know how "safe" these techniques are?
Are they a viable option in some context or should they be avoided for that purpose?
How hard is it to reconstruct the data sufficiently?

Thanks :)

yemikifouly · 2021-01-03T12:58:15Z

Maintainer

Hi Merlin! Great questions! Let me get you more specialized help. 🙂

@iamtrask or @mcleonard Can you help with this?

0 replies

MariiaDen · 2021-01-03T14:15:01Z

I am also a bit skeptical when I hear that anonymization is broken 😊 If I told that to anyone in cybersecurity, they would tell me I’m wrong.

Speaking from the cybersecurity perspective, no single system is considered to be “completely secure”. When we talk about risks, we talk about reducing them, not about bringing them to zero. I see anonymization the same way. Of course, you might find some way to connect the data to real people if you get some other piece of information. But what is the probability of you getting that information? Especially when we are talking about finances. If we quantify it, we will find that the probability is so low that the risk can be accepted. And this is something any business would do, instead of redesigning the whole information flow and replacing best practices.

For this reason I find it problematic to answer some of the questions on the videos, because the right answers feel a lot like a personal point of view rather than a single truth, in other words a fact. If we say that anonymized data can be connected to real people in 2% of cases, it doesn't mean that anonymization can't protect personal data.

2 replies

em-blue Jan 3, 2021

Hey @MariiaDen, these are all great points, and yeah, there's a lot of nuance here re: anonymization - I assume you're talking about Lesson 2: safe data networks, Quiz question 1, where it asks if anonymization itself protects people?

The important question is: does it protect them from harm from that data use? No, not necessarily.

The important point here is that today, a lot of people assume that as soon as data is anonymized, it can't do harm. So we're trying to dispel that myth. Remember that harm from data use goes far beyond the individual - when thinking about reducing risk from data use we can think about the individual, the groups that individual is part of (perhaps race, gender, preferences, hobbies, communities, etc.), and also the national level (security). Anonymization is a part of this whole process - but anonymization is not a perfect tool to protect society from the harm of data use. As Helen Nissenbaum pointed out in Lesson 2: Data is Fire, actually identifying the person hardly matters as long as you have a way to reach them. Too often we think of privacy as "preventing someone from being identified" - but that's only one of the sources of harm. In Ramesh Raskar's Strava example, which led to a massive national security risk, the actual identities of those people were completely irrelevant!

Re: the probability of getting that dataset to do a linkage attack etc, this is what the field of differential privacy works on and is covered in Lesson 4. :-) stay tuned.
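
For readers who have not seen one, a linkage attack can be sketched in a few lines. This is a minimal illustration, not something taken from the course: the records, the field names, and the `link` helper are all invented, and the only assumption is that the "anonymized" table and some public table share a few quasi-identifiers.

```python
# Toy records, invented purely for illustration.
anonymized = [
    {"zip": "02138", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02138", "birth_year": 1990, "sex": "M", "diagnosis": "flu"},
]
public = [
    {"name": "Alice Example", "zip": "02138", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

# Hypothetical choice of columns that both tables happen to share.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def link(anonymized_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Pair each public name with a diagnosis when its quasi-identifiers
    match exactly one anonymized row."""
    index = {}
    for row in anonymized_rows:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)

    matches = []
    for person in public_rows:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # a unique match means re-identification
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


reidentified = link(anonymized, public)
print(reidentified)
```

No names appear in the anonymized table, yet two of the three rows are re-identified because the combination of quasi-identifiers is unique. How likely an adversary is to hold such a public table is exactly the unknown discussed in this thread.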

Thoughts? Would love to hear your feedback on these points (I totally agree with you re: no system is completely secure and it's all about balancing probabilities - which hopefully we get across in Lesson 4!)

iamtrask Jan 3, 2021
Maintainer

I agree with your framing - and under this framing I believe there's still a very strong case to say that anonymization is broken. Namely, anonymization doesn't give any formal limit (including from a security standpoint) on the probability that the data can be deanonymized, whereas formal cryptographic techniques such as Differential Privacy (or other encryption technologies) do have this ability.

The nature of the uncertainty from anonymization is that you have no idea what other datasets might be available to an adversary. This is why anonymization has - on many occasions - created a false sense of security and is "broken". One can use it - but one has no idea how much security one is getting when using it. And in practice - with the availability of advanced statistical techniques - the guarantee has often been found to be surprisingly low, especially given the increasing availability of data. As mentioned in the course - there are businesses who are SO good at deanonymization that it's the core of their business model. They buy anonymized data - and deanonymize it enough to sell targeted insights to health insurance companies.

While whether to call something "broken" or not is certainly up for debate - it is largely accepted within the privacy technology community that anonymization is broken (see Cynthia Dwork and Aaron Roth's interviews) - and if there were any other security technique so weak that reliable businesses could be built on breaking it, it would also likely be considered broken.
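
For reference, the kind of formal limit meant here is usually written as the standard (ε, δ)-differential privacy condition. The symbols below are the conventional ones and do not come from this thread: a randomized mechanism M, two datasets D and D' that differ in one individual's record, and any set S of possible outputs.

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta
```

The bound holds regardless of what auxiliary datasets an adversary has, which is the property plain anonymization cannot offer.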

But the good news is - Differential Privacy is a lot like anonymization except that it is robust. So there is an alternative :)

iamtrask · 2021-01-03T17:02:12Z

Maintainer

@MerlinSchaefer - great question! While every case is probably a little different, dimensionality reduction combines the dimensions of each individual's data into "latent features". With machine learning, there's very little structure you need to know in order to de-anonymize such data. The most straightforward way would be to find the true records for a small percentage of the data and then learn a linear classifier (for linear techniques) or a non-linear classifier (for more advanced compression techniques) that maps the latent features back to the original attributes.
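
A rough sketch of that idea for the linear case, under assumptions that are not from the thread: the data is synthetic, PCA is done with plain NumPy, the attacker knows the true records for 5% of the rows, and the "linear classifier" is swapped for an ordinary least-squares map because the toy attributes are continuous.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "original" data: 500 records, 6 correlated attributes (illustration only).
n, d, k = 500, 6, 3
latent_true = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latent_true @ mixing + 0.05 * rng.normal(size=(n, d))

# "Anonymize" by releasing only the top-k PCA scores.
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
Z = (X - mean) @ Vt[:k].T  # released data, shape (n, k)

# Assumption: the attacker has the true records for 5% of the rows.
known = rng.choice(n, size=n // 20, replace=False)
Z_known = np.hstack([Z[known], np.ones((len(known), 1))])  # intercept column
W, *_ = np.linalg.lstsq(Z_known, X[known], rcond=None)

# Apply the learned linear map to every released row.
X_hat = np.hstack([Z, np.ones((n, 1))]) @ W

# Measure how well the rows the attacker never saw are reconstructed.
unknown = np.setdiff1d(np.arange(n), known)
rel_err = np.linalg.norm(X_hat[unknown] - X[unknown]) / np.linalg.norm(X[unknown])
print(f"relative reconstruction error on unseen rows: {rel_err:.3f}")
```

The attacker never sees the PCA components or the column means; the known pairs are enough to fit the inverse map. How well this works on real data depends on how much variance the released components retain.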

Worth mentioning - many dimensionality reduction techniques have fantastic differentially private alternatives.
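
As one illustration of what such an alternative can look like, here is a hedged sketch of PCA computed from a noisy second-moment matrix using the Gaussian mechanism. It is not a method named in this thread. The function name `dp_pca` and its parameters are made up, and the sketch assumes add-or-remove-one-row neighbouring datasets, rows clipped to L2 norm at most 1, and epsilon below 1.

```python
import numpy as np


def dp_pca(X, k, epsilon, delta, rng):
    """Return k noisy principal directions of X (sketch, not a vetted implementation)."""
    # Clip each row to L2 norm <= 1 so that adding or removing one record
    # changes X^T X by at most 1 in Frobenius norm.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_clipped = X / np.maximum(norms, 1.0)

    d = X.shape[1]
    # Gaussian-mechanism noise scale for sensitivity 1 (assumes epsilon < 1).
    sigma = np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

    # Symmetric noise: draw the upper triangle and mirror it.
    upper = np.triu(rng.normal(scale=sigma, size=(d, d)))
    noise = upper + np.triu(upper, 1).T

    noisy_moment = X_clipped.T @ X_clipped + noise
    eigvals, eigvecs = np.linalg.eigh(noisy_moment)
    top = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, top]  # shape (d, k)


rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
components = dp_pca(X, k=2, epsilon=0.5, delta=1e-5, rng=rng)
print(components.shape)
```

Only the returned directions carry the privacy guarantee in this sketch. Projecting individual records onto them and releasing the per-record scores is not covered and would need its own analysis. For real use, a maintained differential privacy library is a safer choice than hand-rolled noise.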

1 reply

MerlinSchaefer Jan 4, 2021
Author

Thank you for the response!
In combination with the information above about the availability of data and the lack of knowledge about its distribution, that makes perfect sense.

Are these techniques covered in the upcoming courses? If not, is there a "guide" on where to learn them, potentially after completing the course?
