Good Research

Eric Khumalo and Jessica Traynor

August 2023

The Three C's of Privacy Engineering

Privacy engineering is a relatively new field. People working in privacy have tended to come from legal and policy backgrounds and, until recently, haven’t worked directly on designing, building, and shipping products. Privacy engineers are an important part of implementing “privacy by design.” We are responsible for tackling privacy issues early in the software development lifecycle. But there’s far more to being a privacy engineer. Privacy is not just a technical problem. It’s nuanced, contextual, and ever-changing.

I have written about my experience and education in privacy engineering and how I’ve applied these concepts to my teaching. Being a privacy engineer is a wide-ranging role that I didn’t learn about in school. I am now beginning to understand how much interpretation is required, what questions to ask, and when I need to draw from other disciplines. After working in the field for more than two years, I have summarized what I do as a privacy engineer into three categories: classify, contextualize, and communicate.

At Good Research, one of the services we offer is analyzing websites and mobile apps for “bad” or unexpected behavior regarding personal or sensitive data. To describe these three C’s in more detail, I’ll use the example of being given a dataset of network traffic generated by a mobile app, with the goal of identifying whether there are potential privacy violations.

Classify

Classifying involves organizing and categorizing the data well enough to get an initial understanding of what information has been exchanged between the app and other parties. I run some scripts and use other tools to group and classify what I’m seeing. Classification inherently requires some interpretation, but not much at this step. I’m not assigning a value (e.g. dangerous, useful) or flagging a behavior (e.g. unexpected, privacy violation). The main goal is to group things together, like all of the transmissions that went to the same place, or all of the messages that weren’t encrypted. Essentially, I’m trying to figure out who receives what information. I do need to decide what information matters, depending on why I’m looking at this dataset in the first place, so there is some interpretation, but more on that in a later post.
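As a rough illustration of this grouping step, here is a minimal Python sketch. It is not the tooling used at Good Research. It assumes each captured request has already been reduced to a URL plus a list of the data field names found in it; the record layout, host names, and field names are all made up for the example.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical captured requests; the record layout and values are assumptions.
records = [
    {"url": "https://api.example.com/v1/login", "fields": ["email", "device_id"]},
    {"url": "http://ads.example.net/track", "fields": ["ad_id", "lat", "lon"]},
    {"url": "https://ads.example.net/sync", "fields": ["ad_id"]},
]


def classify(records):
    """Group observed fields by destination host and list unencrypted requests."""
    fields_by_host = defaultdict(set)
    unencrypted = []
    for rec in records:
        parts = urlsplit(rec["url"])
        # Who receives what: collect every field seen going to this host.
        fields_by_host[parts.hostname].update(rec["fields"])
        # Anything not sent over HTTPS is grouped separately, with no judgment attached.
        if parts.scheme != "https":
            unencrypted.append(rec["url"])
    return dict(fields_by_host), unencrypted


fields_by_host, unencrypted = classify(records)
for host, fields in sorted(fields_by_host.items()):
    print(host, sorted(fields))
print("unencrypted:", unencrypted)
```

Note that the sketch only groups. It does not label anything as a violation, which matches the intent of this step.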

Contextualize

Once I have a decent understanding of who is getting what information, I add context. I find out why this data is, or is not, needed. For example, if I observe a third party receiving personal information, then I need to know more about both the information and the third party. As you can imagine, the third party might have a perfectly legitimate reason for receiving that data. But out of context that might not be so obvious. Alternatively, if I am assessing whether an app is sending geolocation information, I can look for something like latitude and longitude, but there are many other ways to figure out where an app is being used. Three different pieces of information might not reveal location individually, but together they can. For example, information like Wi-Fi network name, signal strength, and other nuanced identifiers can be pieced together and matched against an existing catalog to identify the location of the device, sometimes with surprising accuracy.
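The point about combinations can be sketched in code. The Python below assumes the output of a classification step is a mapping from recipient to the set of field names it received, and flags recipients that received every member of a location-revealing combination. The combinations and field names (including `wifi_bssid`) are assumptions chosen for illustration, not a vetted list.

```python
# Assumed combinations: each set is treated as location-revealing only when
# ALL of its members reach the same recipient. Not an authoritative list.
LOCATION_COMBOS = [
    {"lat", "lon"},
    {"wifi_ssid", "wifi_bssid", "signal_strength"},
]


def location_exposure(fields_by_recipient):
    """Return, per recipient, the complete location combinations it received."""
    findings = {}
    for recipient, fields in fields_by_recipient.items():
        received = set(fields)
        matched = [sorted(combo) for combo in LOCATION_COMBOS if combo <= received]
        if matched:
            findings[recipient] = matched
    return findings


# Hypothetical input: one recipient gets the full Wi-Fi combination, one does not.
observed = {
    "a.example": {"wifi_ssid", "wifi_bssid", "signal_strength", "ad_id"},
    "b.example": {"wifi_ssid"},
}
exposure = location_exposure(observed)
print(exposure)
```

A match here is only a prompt for further questions. Whether the recipient has a legitimate reason to receive those fields still has to be worked out from context.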

At other times I look for potential instances of ID bridging, which is the practice of linking multiple identifiers for the same user on different devices and across multiple apps. These identifiers are not only collected and shared but also bridged, or linked together, by the developer or a third party so that they know the identifiers belong to the same individual. ID bridging permits tracking across apps, across devices, and over time, allowing third parties to assemble comprehensive advertising profiles, often without the user’s consent. Developers and third parties use a variety of identifiers, and I need to know which ones are persistent and which ones are changeable, who receives them, and under what conditions. Yes, context matters!
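One way to look for bridging opportunities can be sketched as follows. The Python below flags recipients that received a persistent identifier and a changeable one in the same transmission, since that pairing is what would let them link the two. Which identifiers count as persistent or changeable, and the transmission layout, are assumptions made for the example. Traffic alone cannot show that a recipient actually links the identifiers, so the result is a list of candidates, not findings.

```python
# Assumed categories for illustration; real classifications depend on platform and OS version.
PERSISTENT_IDS = {"android_id", "imei"}
CHANGEABLE_IDS = {"advertising_id"}


def bridging_candidates(transmissions):
    """Map each recipient to identifiers it saw together in a single transmission,
    when that transmission mixed persistent and changeable identifiers."""
    flagged = {}
    for t in transmissions:
        ids = set(t["identifiers"])
        persistent = ids & PERSISTENT_IDS
        changeable = ids & CHANGEABLE_IDS
        if persistent and changeable:
            flagged.setdefault(t["recipient"], set()).update(persistent | changeable)
    return flagged


# Hypothetical transmissions extracted from captured traffic.
transmissions = [
    {"recipient": "ads.example.net", "identifiers": ["advertising_id", "android_id"]},
    {"recipient": "api.example.com", "identifiers": ["advertising_id"]},
    {"recipient": "api.example.com", "identifiers": ["android_id"]},
]
candidates = bridging_candidates(transmissions)
print(candidates)
```

The sketch deliberately ignores identifiers sent in separate transmissions to the same recipient, which could also be linked on the recipient's side. Covering that case would mean comparing across requests, sessions, and conditions, which is where the context described above comes in.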

Communicate

Communication is the hardest part of being a privacy engineer. Classifying the data is relatively straightforward. Adding context is more subjective. Communication, however, relies heavily on my interpretations. I often have to share my findings with a nontechnical audience and make sure all the nuance, complexities, and context are clear and accessible. Before writing a report or preparing a presentation, I ask: who is the audience, what is their goal, and what are my main points? Importantly, once the report or presentation is delivered, I want to make sure the audience understands it and that I can address their questions and comments.

Despite how this process sounds, it’s not linear. After adding context, I almost always go back and reclassify the data. And I rarely feel like the contextualize step is totally complete. There’s never enough context! Finally, good communication requires a feedback loop.

Privacy engineering draws from many different disciplines. I use my experience and refer to conversations, blogs, and many other sources to understand whether what I am observing is expected behavior, in violation of a regulation, contradicting a privacy policy, or … other. There’s a lot of judgment needed, and as I continue to learn more, I will be sure to share it with you!

Thanks to Cassia Artanegara and Will Monge.