Evaluating Attacks on Encrypted Databases with Real-World Query Data

Evaluating Attacks on Encrypted Databases with Real-World Query Data

Bachelor Thesis, Master Thesis

Motivation

The cryptographic community has created a number of structured encryption techniques that allow one or multiple clients to query outsourced encrypted databases (EDBs). These techniques transform the query into a token that the server can then use to compute the corresponding response. The existing schemes vary in functionality and in security, as they result in different kinds of information leakage about the queries. Current efficient schemes usually leak the access pattern (which documents satisfy a token) and the search pattern (whether tokens concern the same query). During this decade, a plethora of attacks, e.g., the ones by [IKK12, LZWT14, CGPR15, ZKP16, vRMÖ17], have emerged in the literature that showed how adversaries can abuse these leakage profiles to obtain information about a client's queries.

These attacks usually require some sort of prior knowledge about the EDB or the underlying queries. Even though this knowledge has to be rather large to obtain significant fractions of the plaintext, the attacks demonstrate that, at least in theory, common leakage profiles display significant client information. Practically, the attacks are evaluated on public datasets like, e.g., the entire Wikipedia corpus, acting as EDB. However, evaluations largely depend on assumptions about the query distribution, often times generating user queries uniformly at random, manually generating dummy queries, or taking the most frequent keywords of the EDB. This is due to a lack of public datasets of real-world client queries. Only [LZWT14] simulate queries from a real-world source by using an obfuscated version of Google Trends, but Google Trends is only an indication of how the queries of all clients are distributed. In this thesis, the student's task is to simulate keyword queries using public information of individuals, namely by crawling public IRC logs and posts of public, pseudonymous social media like, e.g., Reddit and Twitter. Then, a selection of existing attacks should be evaluated with the generated queries in order to assess and compare the real-world performances of known attacks.

Goal

The goal of this thesis is to generate a wide range of individual, simulated query datasets based on public information of individuals. This data is to be used to gain a sense of how real-world queries are distributed across selected domains and to reduce assumptions in the empirical evaluation of EDB attacks. Evaluations of certain attacks should be carried out on varying simulated queries across different domains.

Tasks

The student should study the literature to gain detailed knowledge about the mentioned attacks. Then, according to domain and client specifications carefully selected in consultation with the advisor, public client data should be collected, e.g., by crawling Reddit or Twitter. A suitable pre-processing procedure that creates single or multiple keywords from the posts needs to be developed in order to generate query datasets from the client data, e.g., by employing Natural Language Processing techniques. As a starting point, the attack of [LZWT14] should be evaluated on the generated keywords. Then, more sophisticated attacks can be evaluated like, e.g., the attack of [vRMÖ17] against multi-client EDBs. As a result, a comparison of the evaluated attacks needs to be carried out across different query datasets. In order to enhance the success of the evaluated attacks, varying degrees of knowledge about the domain can be assumed. The developed tools to evaluate the attacks and to generate queries from public social media or chat data need to be accessible for future use with different data sources.

Requirements

Applicants should have outstanding academic records and be able to work independently. Prior knowledge in the domain of Data Mining and Natural Language Processing is a big plus.

References

  • [CGPR15] David Cash, Paul Grubbs, Jason Perry, and Thomas Ristenpart: Leakage-Abuse Attacks against Searchable Encryption. In CCS'15.
  • [IKK12] Mohammad Saiful Islam, Mehmet Kuzu, and Murat Kantarcioglu: Access pattern disclosure on searchable encryption: Ramification, attack and mitigation. In NDSS'12.
  • [LZWT14] Chang Liu, Liehuang Zhu, Mingzhong Wang, and Yu-An Tan: Search Pattern Leakage in Searchable Encryption: Attacks and New Construction. In Information Sciences 2014.
  • [vRMÖ17] Cédric Van Rompay, Refik Molva, and Melek Önen: A Leakage-Abuse Attack against Multi-User Searchable Encryption. In PETS'17.
  • [ZKP16] Yupeng Zhang, Jonathan Katz, and Charalampos Papamanthou: All Your Queries Are Belong to Us: The Power of File-Injection Attacks on Searchable Encryption. In USENIX Security'16.