Crypto-oriented Practical Privacy-Preserving Clustering

Bachelor Thesis, Master Thesis


The availability of vast amounts of data has fueled the recent hype around machine learning (ML). ML applications range from business analytics and fraud detection to genome analytics and medical diagnostics. However, when data from several sources is needed to obtain high-quality results, strong privacy protection is indispensable for customer data, business secrets, and sensitive personal information. This has led to intensive research on privacy-preserving supervised ML [1-3]. Privacy-preserving unsupervised ML techniques, in contrast, have been mostly neglected so far. Clustering is a well-known unsupervised ML technique that groups similar elements into clusters, while elements in different clusters should be maximally different. The first private clustering protocols mostly focus on a simple clustering algorithm called K-means [4-6]. However, K-means has several limitations that make it impractical for real-world privacy-preserving clustering: for example, the number of clusters must be fixed in advance, and the algorithm copes poorly with outliers and non-spherical clusters.
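For orientation, the following is a minimal plaintext sketch of K-means (Lloyd's algorithm) in C++. It is not a privacy-preserving protocol and not the construction used in [4-6]; the names `Point`, `squared_distance`, and `kmeans`, as well as the choice of the first k points as initial centroids, are assumptions made for illustration only. The sketch shows that the caller has to supply the number of clusters `k`, and that every iteration consists of distance computations, comparisons, and divisions, all of which a private variant would have to evaluate without revealing the data.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical type alias: one data point is a vector of coordinates.
using Point = std::vector<double>;

// Squared Euclidean distance between two points of equal dimension.
double squared_distance(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        const double diff = a[d] - b[d];
        sum += diff * diff;
    }
    return sum;
}

// Plaintext K-means (Lloyd's algorithm); returns one cluster index per point.
// Assumption for this sketch: the first k points serve as initial centroids.
std::vector<std::size_t> kmeans(const std::vector<Point>& points,
                                std::size_t k,
                                std::size_t max_iterations) {
    assert(k > 0 && points.size() >= k);
    const std::size_t dim = points[0].size();
    std::vector<Point> centroids(points.begin(),
                                 points.begin() + static_cast<std::ptrdiff_t>(k));
    std::vector<std::size_t> assignment(points.size(), 0);

    for (std::size_t iter = 0; iter < max_iterations; ++iter) {
        bool changed = false;

        // Assignment step: attach every point to its nearest centroid.
        for (std::size_t i = 0; i < points.size(); ++i) {
            std::size_t best = 0;
            double best_dist = std::numeric_limits<double>::max();
            for (std::size_t c = 0; c < k; ++c) {
                const double dist = squared_distance(points[i], centroids[c]);
                if (dist < best_dist) {
                    best_dist = dist;
                    best = c;
                }
            }
            if (assignment[i] != best) {
                assignment[i] = best;
                changed = true;
            }
        }

        // Update step: move every centroid to the mean of its assigned points.
        std::vector<Point> sums(k, Point(dim, 0.0));
        std::vector<std::size_t> counts(k, 0);
        for (std::size_t i = 0; i < points.size(); ++i) {
            for (std::size_t d = 0; d < dim; ++d) {
                sums[assignment[i]][d] += points[i][d];
            }
            ++counts[assignment[i]];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0) continue;  // keep an empty cluster's centroid
            for (std::size_t d = 0; d < dim; ++d) {
                centroids[c][d] = sums[c][d] / static_cast<double>(counts[c]);
            }
        }

        // Stable assignments after the first pass mean the centroids are fixed.
        if (!changed && iter > 0) break;
    }
    return assignment;
}
```

Calling `kmeans(points, 2, 100)` on two well-separated groups of points assigns each group its own cluster index. The result depends on the chosen `k` and on the initial centroids, both of which are inputs that a practical private protocol would ideally avoid or make data-independent.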


This work seeks to design, implement, and benchmark a practical privacy-preserving clustering protocol that is usable for multi-party computation and outsourcing. The protocol should not only be efficient in terms of runtime and communication costs for large datasets, but also have practical memory requirements and achieve good clustering quality for datasets with different characteristics. In addition, it should require either no input parameters or only parameters that are mostly data-independent and therefore do not leak information.

The student is expected to review existing clustering algorithms, assess their suitability for privacy-preserving clustering, and decide on the design of the new protocol based on these insights. The final implementation should be benchmarked and compared to state-of-the-art privacy-preserving clustering protocols such as [4-6].


  • Good programming skills in C/C++
  • At least basic knowledge of cryptography and machine learning
  • High motivation and the ability to work independently
  • Knowledge of the English language, Git, LaTeX, etc. goes without saying