SIGMOD 2023 Submission: Multiple SJA under Differential Privacy

About The Project

This project is to demo our SIGMOD 2023 submission "Better than Composition: How to Answer Multiple Relational Queries under Differential Privacy".

The file structure is as below

Project
└───Code
└───Data
│   └───Graph
│   └───TPCH
└───Figure
└───Information
│   └───Graph
│   └───TPCH
└───Profile
└───Query
└───Result
│   └───Graph
│   └───TPCH
└───Script
└───Temp

./Code stores the codes.

./Data stores graph and TPCH datasets.

./Figure stores the figures used in the paper.

./Information stores the relationships between base table tuples and joined results for all queries.

./Store stores program licence and primary private relation combinations.

./Query stores the queries used in the experiments of the paper. Q1~Q8 are TPCH queries while Q9,Q10,Q14 are 1-line path, 2-line path, and 3-line path counting queries respectively.

./Result stores the experimental results.

./Script stores the scripts used in the experiments of the paper.

./Temp stores temporary files generated by programs.

Prerequisites

Tools

Before running this project, please install below tools

Also install the dependencies matplotlib, numpy, sys, getopt, os, math, psycopg2, time, copy, random, glob.

Mosek

Install in python.
Store the licence under ./Profile.

Download Data

Download two data packages (Graph and TPCH) and unzip them in ./Data. stackoverflow_c2a_10p.tsv and stackoverflow_a2q_10p.tsv store the graphs after removing the nodes with top 10% maximum degree.

Download two packages containing the relationships between base table tuples and join results (Graph and TPCH) and unzip them in ./Information.

Create PostgreSQL Database

To create an empty PostgreSQL database, for example, named "stackoverflow−a2q ", run

createdb stackoverflow−a2q;

Here, we need two databases for the graph dataset: stackoverflow−a2q, and stackoverflow−c2a. For TPCH dataset, we use 7 different scales: 0.125,0.25,..,8 which are marked as _0,_1,..,_6, For each scale, we create three databases: sc, psc and pso. sc has Customer and Supplier as the primary private relations. psc has PartSupp and Customer as the primary private relations. pso has PartSupp and Orders as the primary private relations. For example, for scale _3, we create sc_3, psc_3 and pso_3. The reason why we need to consider these three cases separately is that, we write the queries with multiple primary private relations as one self-join queries with single primary private relation where we create a new ID table and assign the IDs for the tuples in the designated primary private relations.

Import and Clean Data

To import/clean graph data into PostgreSQL database, go to ./Script and run ProcessGraphData.py. There are three parameters

-d: the name of graph dataset;
-D: the name of PostgreSQL database;
-m: the option of importing(0)/cleaning(1) data in the database;

To import/clean TPCH data into PostgreSQL database, go to ./Script and run ProcessTPCHData.py. There are four parameters

-d: the name of TPCH dataset, only indicating the scale;
-D: the name of PostgreSQL database;
-m: the option of importing(0)/cleaning(1) data in the database;
-r: the path of file containing the name(s) of the primary private relation(s). The ones used in the paper can be found in ./Profile;

For example, to import TPCH dataset with scale _0 into sc_0 database, run

python ProcessTPCHData.py -d _0 -D sc_0 -m 0 -r ../Profile/sc.txt

To import dataset data stackoverflow_a2q_10p.tsv into stackoverflow−a2q database, run

python ProcessGraphData.py -d stackoverflow_a2q_10p -D stackoverflow−a2q -m 0

Demo of Algorithms

PMSJA

We implement our algorithm for both self-join-free queries and self-join queries in ./Code/PMSJASJF.py and ./Code/PMSJA.py. Both these two programs have the input as the information of relationships between the base table's tuples and join results, which can be collected by ./Code/ExtractInfoMultiple.py.

The ExtractInfoMultiple.py has five parameters

-D: the name of PostgreSQL database;
-Q: the path of input query file;
-P: the name of primary private relation;
-K: the path of the file containing the primary key of the primary private relation; Here, we also provide the ones used in the paper in ./Query;
-O: output file prefix, each dimension's results are stored in one file. For the output files, the first column is the function value for the join result and the other columns are the IDs of base table tuples contributing to that join result.

For example,

python ExtractInfoMultiple.py -D stackoverflow−a2q -Q ../Query/Q9.txt -P node -K ../Query/Q9_key.txt -O ../Information/Graph/Q9_stackoverflow_a2q

Such information for the queries used in the paper have already been collected and store in ./Information/Graph and ./Information/TPCH. One notice is that, as reported in our paper, we only select some dimensions for some queries. We have already done it and the processed files are marked by _processed_. Please note that such information are also required by R2T and OptMean.

The PMSJA.py.py has four parameters

-I: input file prefix, each dimension's results are stored in one file.
-e: privacy budget epsilon;
-d: privacy budget delta;
-b: the parameter beta, which controls the probability of large error happening;

For example,

python PMSJA.py -I ../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001

The PMSJASJF.py has the same parameters.

We also give all scripts collecting experimental results, which are stored in ./script.

R2T

The R2T is implemented by ./Code/R2T.py.

The R2T.py has six parameters

-I: input file prefix, each dimension's results are stored in one file.
-e: privacy budget epsilon;
-d: privacy budget delta;
-b: the parameter beta, which controls the probability of large error happening;
-G: the predefined global sensitivity.
-p: the number of threads used.

For example,

python R2T.py -I .../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001 -b 0.1 -p 24  -G 1000000

OptMean

The OptMean is implemented by ./Code/OptMean.py.

OptMean.py has five parameters

-I: input file prefix, each dimension's results are stored in one file.
-e: privacy budget epsilon;
-d: privacy budget delta;
-b: the parameter beta, which controls the probability of large error happening;
-G: the predefined global sensitivity.

For example,

python OptMean.py -I .../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001 -b 0.1 -G 1000000

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

SIGMOD 2023 Submission: Multiple SJA under Differential Privacy

Table of Contents

About The Project

Prerequisites

Tools

Download Data

Create PostgreSQL Database

Import and Clean Data

Demo of Algorithms

PMSJA

R2T

OptMean

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
Code		Code
Data		Data
Figure		Figure
Profile		Profile
Query		Query
Result		Result
Script		Script
README.md		README.md

hkustDB/PMSJA

Folders and files

Latest commit

History

Repository files navigation

SIGMOD 2023 Submission: Multiple SJA under Differential Privacy

Table of Contents

About The Project

Prerequisites

Tools

Download Data

Create PostgreSQL Database

Import and Clean Data

Demo of Algorithms

PMSJA

R2T

OptMean

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages