This project is to demo our SIGMOD 2023 submission "Better than Composition: How to Answer Multiple Relational Queries under Differential Privacy".
The file structure is as below
Project
└───Code
└───Data
│ └───Graph
│ └───TPCH
└───Figure
└───Information
│ └───Graph
│ └───TPCH
└───Profile
└───Query
└───Result
│ └───Graph
│ └───TPCH
└───Script
└───Temp
./Code
stores the codes.
./Data
stores graph and TPCH datasets.
./Figure
stores the figures used in the paper.
./Information
stores the relationships between base table tuples and joined results for all queries.
./Store
stores program licence and primary private relation combinations.
./Query
stores the queries used in the experiments of the paper. Q1~Q8
are TPCH queries while Q9,Q10,Q14
are 1-line path, 2-line path, and 3-line path counting queries respectively.
./Result
stores the experimental results.
./Script
stores the scripts used in the experiments of the paper.
./Temp
stores temporary files generated by programs.
Before running this project, please install below tools
- Also install the dependencies
matplotlib
,numpy
,sys
,getopt
,os
,math
,psycopg2
,time
,copy
,random
,glob
.
- Install in python.
- Store the licence under
./Profile
.
Download two data packages (Graph and TPCH) and unzip them in ./Data
.
stackoverflow_c2a_10p.tsv
and stackoverflow_a2q_10p.tsv
store the graphs after removing the nodes with top 10% maximum degree.
Download two packages containing the relationships between base table tuples and join results (Graph and TPCH) and unzip them in ./Information
.
To create an empty PostgreSQL database, for example, named "stackoverflow−a2q ", run
createdb stackoverflow−a2q;
Here, we need two databases for the graph dataset: stackoverflow−a2q
, and stackoverflow−c2a
. For TPCH dataset, we use 7
different scales: 0.125,0.25,..,8
which are marked as _0,_1,..,_6
, For each scale, we create three databases: sc
, psc
and pso
. sc
has Customer
and Supplier
as the primary private relations. psc
has PartSupp
and Customer
as the primary private relations. pso
has PartSupp
and Orders
as the primary private relations. For example, for scale _3
, we create sc_3
, psc_3
and pso_3
. The reason why we need to consider these three cases separately is that, we write the queries with multiple primary private relations as one self-join queries with single primary private relation where we create a new ID table and assign the IDs for the tuples in the designated primary private relations.
To import/clean graph data into PostgreSQL database, go to ./Script
and run ProcessGraphData.py
. There are three parameters
-d
: the name of graph dataset;-D
: the name of PostgreSQL database;-m
: the option of importing(0)/cleaning(1) data in the database;
To import/clean TPCH data into PostgreSQL database, go to ./Script
and run ProcessTPCHData.py
. There are four parameters
-d
: the name of TPCH dataset, only indicating the scale;-D
: the name of PostgreSQL database;-m
: the option of importing(0)/cleaning(1) data in the database;-r
: the path of file containing the name(s) of the primary private relation(s). The ones used in the paper can be found in./Profile
;
For example, to import TPCH dataset with scale _0
into sc_0
database, run
python ProcessTPCHData.py -d _0 -D sc_0 -m 0 -r ../Profile/sc.txt
To import dataset data stackoverflow_a2q_10p.tsv
into stackoverflow−a2q
database, run
python ProcessGraphData.py -d stackoverflow_a2q_10p -D stackoverflow−a2q -m 0
We implement our algorithm for both self-join-free queries and self-join queries in ./Code/PMSJASJF.py
and ./Code/PMSJA.py
. Both these two programs have the input as the information of relationships between the base table's tuples and join results, which can be collected by ./Code/ExtractInfoMultiple.py
.
The ExtractInfoMultiple.py
has five parameters
-D
: the name of PostgreSQL database;-Q
: the path of input query file;-P
: the name of primary private relation;-K
: the path of the file containing the primary key of the primary private relation; Here, we also provide the ones used in the paper in./Query
;-O
: output file prefix, each dimension's results are stored in one file. For the output files, the first column is the function value for the join result and the other columns are the IDs of base table tuples contributing to that join result.
For example,
python ExtractInfoMultiple.py -D stackoverflow−a2q -Q ../Query/Q9.txt -P node -K ../Query/Q9_key.txt -O ../Information/Graph/Q9_stackoverflow_a2q
Such information for the queries used in the paper have already been collected and store in ./Information/Graph
and ./Information/TPCH
. One notice is that, as reported in our paper, we only select some dimensions for some queries. We have already done it and the processed files are marked by _processed_
. Please note that such information are also required by R2T and OptMean.
The PMSJA.py.py
has four parameters
-I
: input file prefix, each dimension's results are stored in one file.-e
: privacy budget epsilon;-d
: privacy budget delta;-b
: the parameter beta, which controls the probability of large error happening;
For example,
python PMSJA.py -I ../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001
The PMSJASJF.py
has the same parameters.
We also give all scripts collecting experimental results, which are stored in ./script
.
The R2T is implemented by ./Code/R2T.py
.
The R2T.py
has six parameters
-I
: input file prefix, each dimension's results are stored in one file.-e
: privacy budget epsilon;-d
: privacy budget delta;-b
: the parameter beta, which controls the probability of large error happening;-G
: the predefined global sensitivity.-p
: the number of threads used.
For example,
python R2T.py -I .../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001 -b 0.1 -p 24 -G 1000000
The OptMean is implemented by ./Code/OptMean.py
.
OptMean.py
has five parameters
-I
: input file prefix, each dimension's results are stored in one file.-e
: privacy budget epsilon;-d
: privacy budget delta;-b
: the parameter beta, which controls the probability of large error happening;-G
: the predefined global sensitivity.
For example,
python OptMean.py -I .../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001 -b 0.1 -G 1000000