Microsoft shares more details about its Privacy Preserving Machine Learning initiative

Ensuring customer privacy and maintaining their trust, at least on paper, has become a priority for many tech firms, especially since Facebook's Cambridge Analytica scandal back in 2018. Striking a balance between protecting the privacy of individuals and showing them personalized content - sometimes from third parties - has become an exceedingly daunting task. To tackle this problem, Microsoft Research kickstarted the Privacy Preserving Machine Learning (PPML) initiative in collaboration with other Microsoft product teams, and today, it has shared more details on the effort.

Microsoft has noted that when training large-capacity models, the data fed to these systems sometimes contains personally identifiable information (PII), which is a privacy concern. Models can also memorize data present in the training set, which directly compromises privacy. Microsoft says that it has taken a methodical approach to this problem: it first understands the risks through the lens of the law and the integrity of machine learning (ML) pipelines, then quantifies them through frameworks, and finally mitigates them to reduce risk and stay compliant with local laws. The company already applies this in practice to its own models by layering PPML techniques, removing PII, and sampling data with caution.
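As a rough illustration of the "removing PII" step mentioned above, the following Python sketch redacts email addresses and phone-like numbers from text before it would reach a training set. It is not Microsoft's tooling: the function name `scrub_pii`, the two patterns, and the placeholder tokens are all assumptions made for this example, and real PII detection covers far more than two regular expressions.

```python
import re

# Hypothetical patterns for illustration only; production PII detection
# needs far broader coverage (names, addresses, IDs, and so on).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    sample = "Contact jane.doe@example.com or +1 (425) 555-0100 today"
    print(scrub_pii(sample))
```

Pattern-based scrubbing like this is cheap but easy to evade, which is one reason it would be layered with other techniques rather than relied on alone.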

Differential Privacy is a core component of the PPML initiative and is something that Microsoft pioneered back in 2006. It is already used as the industry standard in multiple domains, such as government and Windows telemetry. Using the concept of a "privacy budget", it essentially adds noise to the training process, which limits the ability of the model to learn about specific entities who contributed to the data while offering a mathematical guarantee on how much can be learned about any one of them. The company has also published several research papers in this area that you can find in its blog post.
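To make the "privacy budget" idea more concrete, here is a minimal Python sketch of the classic Laplace mechanism applied to a counting query. It is a textbook illustration, not Microsoft's training pipeline: the `PrivacyBudget` class, the `private_count` helper, and the sample data are hypothetical, and the budget tracking assumes simple sequential composition, where the epsilon values of successive queries add up.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


class PrivacyBudget:
    """Hypothetical tracker: the epsilon of each query adds to the total spent."""

    def __init__(self, total_epsilon: float) -> None:
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def spend(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def private_count(records, predicate, epsilon, budget, rng):
    """Return a noisy count of matching records.

    Adding or removing one record changes a count by at most 1
    (sensitivity 1), so the Laplace noise scale is 1 / epsilon.
    """
    budget.spend(epsilon)
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    budget = PrivacyBudget(total_epsilon=1.0)
    ages = [23, 35, 41, 52, 67, 29, 38]  # made-up data
    noisy = private_count(ages, lambda a: a >= 40, 0.5, budget, rng)
    print(f"noisy count: {noisy:.2f}, epsilon spent: {budget.spent}")
```

A smaller epsilon means more noise and stronger privacy, and once the budget is used up, further queries are refused. Differentially private model training applies the same principle to gradient updates rather than to a single count.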

Another technique that Microsoft uses is threat modeling, through which it studies potential real-world attack vectors and methodologies such as model update, tab, and poisoning attacks in a black-box scenario. These cover novel approaches like abstract leakage and attribute inference. The knowledge gained from this activity is used to define privacy metrics and mitigation techniques. Once again, you can read more about this process in detail via the research papers highlighted in the blog post.

Moving forward, Microsoft says that the PPML initiative will focus on the following areas:

Following regulations around privacy and confidentiality

Proving privacy properties for each step of the training pipeline

Making privacy technology more accessible to product teams

Applying decentralized learning

Investigating training algorithms for private federated learning, combining causal and federated learning, using federated reinforcement learning principles, federated optimization, and more

Using weakly supervised learning technologies to enable model development without direct access to the data

Microsoft also talked about how decentralized and federated learning are being proposed - by companies such as Google - to preserve privacy, but the Redmond tech giant emphasized that these won't be enough on their own to tackle the risks of leakage, and you'd still need to layer Differential Privacy on top. It also highlighted its own Azure Confidential Computing implementation powered by Trusted Execution Environments (TEEs), alongside cryptographic approaches such as multi-party computation (MPC) and fully homomorphic encryption (FHE), but noted that these have a computational overhead and are not suitable for non-experts.
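The point about layering Differential Privacy on top of federated learning can be sketched in Python as follows. This is a simplified illustration of a common pattern (clip each client's model update, sum the clipped updates, add Gaussian noise, then average), not Microsoft's or Google's implementation. The function names and the `max_norm` and `noise_multiplier` parameters are assumptions for this example, and the sketch does not compute the resulting privacy guarantee.

```python
import math
import random


def clip_update(update, max_norm):
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm or norm == 0.0:
        return list(update)
    factor = max_norm / norm
    return [x * factor for x in update]


def private_federated_average(client_updates, max_norm, noise_multiplier, rng):
    """Average clipped client updates, adding Gaussian noise to the sum.

    Clipping bounds how much any single client can move the model; the
    noise (standard deviation noise_multiplier * max_norm per coordinate)
    then masks each individual contribution.
    """
    n = len(client_updates)
    dim = len(client_updates[0])
    clipped = [clip_update(u, max_norm) for u in client_updates]
    summed = [sum(u[i] for u in clipped) for i in range(dim)]
    sigma = noise_multiplier * max_norm
    noisy_sum = [s + rng.gauss(0.0, sigma) for s in summed]
    return [s / n for s in noisy_sum]


if __name__ == "__main__":
    rng = random.Random(42)
    # Made-up updates from three clients for a two-parameter model.
    updates = [[0.2, -0.1], [3.0, 4.0], [-0.3, 0.05]]
    print(private_federated_average(updates, 1.0, 0.5, rng))
```

Raw data never leaves the clients in this setup, but without the clipping and noise steps the individual updates themselves could still leak information, which is the gap the article describes.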

For now, Microsoft will continue exploring the potential of PPML and will develop tools that translate technology ethics into practice while ensuring that the results are beneficial to society. There are also plenty of questions yet to be answered about privacy-utility tradeoffs, training using synthetic data, and ingraining privacy guarantees into the design of deep learning models. You can find out more about the company's efforts in its blog post.