Privacy in Natural Language Processing

Master's Thesis

Motivation

Recent advances in the development and deployment of large language models (LLMs) promise a profound impact on society and the economy. However, LLMs with billions of parameters tend to memorize and reproduce significant portions of the vast amounts of data used to train them, which poses serious challenges from a security and privacy point of view.

Goal

The goal of this timely thesis is to explore the various subtle ways in which LLMs can leak private training data, to propose suitable defense mechanisms, and to evaluate the proposed attacks and defenses in practice using publicly available LLMs. Initially, the most recent literature in this field, such as [1,2], will be comprehensively reviewed and systematized. The primary objective is to assess how critical the information leakage is and to identify a practical mitigation strategy. Finally, a comprehensive evaluation of the defense's effectiveness, encompassing the implementation of a prototype and benchmarking, must be carried out.
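To illustrate the kind of leakage test the thesis would study, the sketch below shows a loss-based membership inference attack: samples that were part of the training data typically receive lower perplexity from the model than unseen samples. A toy character-bigram model stands in for an LLM here; the model, corpus, and threshold logic are purely illustrative assumptions, not part of the thesis specification.

```python
import math
from collections import defaultdict

# Toy stand-in for an LLM: a smoothed character-bigram language model.
# The attack principle (training members score lower perplexity) is the
# same one exploited against real LLMs in membership inference work.

def train_bigram(corpus):
    counts = defaultdict(lambda: defaultdict(int))
    for text in corpus:
        for a, b in zip(text, text[1:]):
            counts[a][b] += 1
    return counts

def perplexity(counts, text, vocab_size=128, alpha=1.0):
    # Add-alpha smoothed per-character negative log-likelihood.
    nll = 0.0
    for a, b in zip(text, text[1:]):
        total = sum(counts[a].values())
        p = (counts[a][b] + alpha) / (total + alpha * vocab_size)
        nll -= math.log(p)
    return math.exp(nll / max(len(text) - 1, 1))

# Hypothetical training corpus containing a "private" record.
train_set = ["the secret code is 1234", "alice paid bob ten dollars"]
model = train_bigram(train_set)

member = perplexity(model, "the secret code is 1234")       # seen in training
non_member = perplexity(model, "completely unrelated text")  # never seen

# A lower score for the member suggests it was in the training set.
print(member < non_member)
```

In the full attack setting, the perplexity gap is turned into a decision rule by calibrating a threshold on known member/non-member samples; defenses such as differentially private training aim to shrink exactly this gap.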

Requirements

  • Solid understanding of state-of-the-art NLP techniques
  • Interest in cryptography and secure computation
  • Programming experience
  • High motivation + ability to work independently
  • Proficiency in English and familiarity with Git, LaTeX, etc. are taken for granted

References

Supervisors