The availability of vast amounts of data has led to the recent surge of interest in machine learning (ML). ML applications range from business analytics and fraud detection to genome analytics and medical diagnostics. However, if data from several sources is needed to create high-quality results, strong privacy protection is indispensable for customer data, business secrets, and sensitive personal information. This has led to intensive research in the area of privacy-preserving supervised ML [1-3]. In contrast, privacy-preserving unsupervised ML techniques have been mostly neglected so far. Clustering is a well-known unsupervised ML technique that groups similar elements into clusters, while elements in different clusters should be maximally different. The first private clustering protocols mostly focus on a simple clustering algorithm called K-means [4-6]. However, K-means has several limitations that make it impractical for real-world privacy-preserving clustering.
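For orientation, the plaintext K-means algorithm that the protocols in [4-6] build on can be sketched as follows. This is only an illustrative, non-private sketch: the function names `squared_distance` and `kmeans`, the `max_iters` parameter, and the choice to initialize the centroids with the first `k` points are assumptions made for this example and are not taken from any of the cited works. The sketch also shows that the number of clusters `k` has to be supplied up front, and choosing it well usually requires some knowledge about the data.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], k: int,
           max_iters: int = 100) -> Tuple[List[int], List[Point]]:
    """Plaintext K-means (Lloyd's algorithm), not privacy-preserving.

    Assumption for this sketch: the first k points serve as the
    initial centroids, which keeps the result deterministic.
    """
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")

    centroids: List[Point] = [tuple(p) for p in points[:k]]
    assignment: List[int] = [-1] * len(points)

    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed its cluster
        assignment = new_assignment

        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(
                    sum(coords) / len(members) for coords in zip(*members)
                )

    return assignment, centroids
```

For example, calling `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], k=2)` groups the first two and the last two points into separate clusters.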
This work seeks to design, implement, and benchmark a practical privacy-preserving clustering protocol usable for multi-party computation and outsourcing. The protocol should not only be efficient in terms of runtime and communication costs for large datasets, but also have practical memory requirements and achieve good clustering quality for datasets with different characteristics. In addition, it should require either no input parameters or only mostly data-independent parameters that do not leak information.
The student is expected to review existing clustering algorithms, assess their suitability for privacy-preserving clustering, and decide on the new protocol design based on these insights. The final implementation should be benchmarked and compared to state-of-the-art privacy-preserving clustering protocols such as [4-6].
- Good programming skills in C/C++
- At least basic knowledge of cryptography and machine learning
- High motivation and the ability to work independently
- Knowledge of the English language, Git, LaTeX, etc. goes without saying
- [1] Fabian Boemer, Rosario Cammarota, Daniel Demmler, Thomas Schneider, and Hossein Yalame. MP2ML: A mixed-protocol machine learning framework for private inference. In ARES, 2020.
- [2] Chiraag Juvekar, Vinod Vaikuntanathan, and Anantha Chandrakasan. Gazelle: A low latency framework for secure neural network inference. In USENIX Security, 2018.
- [3] Jian Liu, Mika Juuti, Yao Lu, and N. Asokan. Oblivious neural network predictions via MiniONN transformations. In CCS, 2017.
- [4] Payman Mohassel, Mike Rosulek, and Ni Trieu. Practical privacy-preserving K-means clustering. In PETS, 2020.
- [5] Wei Wu, Jian Liu, Huimei Wang, Jialu Hao, and Ming Xian. Secure and efficient outsourced K-means clustering using fully homomorphic encryption with ciphertext packing technique. In IEEE Transactions on Knowledge and Data Engineering, 2020.
- [6] Angela Jäschke and Frederik Armknecht. Unsupervised machine learning on encrypted data. In SAC, 2019.