Skip to content

hkustDB/PMSJA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

SIGMOD 2023 Submission: Multiple SJA under Differential Privacy

Table of Contents

About The Project

This project is to demo our SIGMOD 2023 submission "Better than Composition: How to Answer Multiple Relational Queries under Differential Privacy".

The file structure is as below

Project
└───Code
└───Data
│   └───Graph
│   └───TPCH
└───Figure
└───Information
│   └───Graph
│   └───TPCH
└───Profile
└───Query
└───Result
│   └───Graph
│   └───TPCH
└───Script
└───Temp

./Code stores the codes.

./Data stores graph and TPCH datasets.

./Figure stores the figures used in the paper.

./Information stores the relationships between base table tuples and joined results for all queries.

./Store stores program licence and primary private relation combinations.

./Query stores the queries used in the experiments of the paper. Q1~Q8 are TPCH queries while Q9,Q10,Q14 are 1-line path, 2-line path, and 3-line path counting queries respectively.

./Result stores the experimental results.

./Script stores the scripts used in the experiments of the paper.

./Temp stores temporary files generated by programs.

Prerequisites

Tools

Before running this project, please install below tools

  • Also install the dependencies matplotlib, numpy, sys, getopt, os, math, psycopg2, time, copy, random, glob.
  • Install in python.
  • Store the licence under ./Profile.

Download Data

Download two data packages (Graph and TPCH) and unzip them in ./Data. stackoverflow_c2a_10p.tsv and stackoverflow_a2q_10p.tsv store the graphs after removing the nodes with top 10% maximum degree.

Download two packages containing the relationships between base table tuples and join results (Graph and TPCH) and unzip them in ./Information.

Create PostgreSQL Database

To create an empty PostgreSQL database, for example, named "stackoverflow−a2q ", run

createdb stackoverflow−a2q;

Here, we need two databases for the graph dataset: stackoverflow−a2q, and stackoverflow−c2a. For TPCH dataset, we use 7 different scales: 0.125,0.25,..,8 which are marked as _0,_1,..,_6, For each scale, we create three databases: sc, psc and pso. sc has Customer and Supplier as the primary private relations. psc has PartSupp and Customer as the primary private relations. pso has PartSupp and Orders as the primary private relations. For example, for scale _3, we create sc_3, psc_3 and pso_3. The reason why we need to consider these three cases separately is that, we write the queries with multiple primary private relations as one self-join queries with single primary private relation where we create a new ID table and assign the IDs for the tuples in the designated primary private relations.

Import and Clean Data

To import/clean graph data into PostgreSQL database, go to ./Script and run ProcessGraphData.py. There are three parameters

  • -d: the name of graph dataset;
  • -D: the name of PostgreSQL database;
  • -m: the option of importing(0)/cleaning(1) data in the database;

To import/clean TPCH data into PostgreSQL database, go to ./Script and run ProcessTPCHData.py. There are four parameters

  • -d: the name of TPCH dataset, only indicating the scale;
  • -D: the name of PostgreSQL database;
  • -m: the option of importing(0)/cleaning(1) data in the database;
  • -r: the path of file containing the name(s) of the primary private relation(s). The ones used in the paper can be found in ./Profile;

For example, to import TPCH dataset with scale _0 into sc_0 database, run

python ProcessTPCHData.py -d _0 -D sc_0 -m 0 -r ../Profile/sc.txt

To import dataset data stackoverflow_a2q_10p.tsv into stackoverflow−a2q database, run

python ProcessGraphData.py -d stackoverflow_a2q_10p -D stackoverflow−a2q -m 0

Demo of Algorithms

PMSJA

We implement our algorithm for both self-join-free queries and self-join queries in ./Code/PMSJASJF.py and ./Code/PMSJA.py. Both these two programs have the input as the information of relationships between the base table's tuples and join results, which can be collected by ./Code/ExtractInfoMultiple.py.

The ExtractInfoMultiple.py has five parameters

  • -D: the name of PostgreSQL database;
  • -Q: the path of input query file;
  • -P: the name of primary private relation;
  • -K: the path of the file containing the primary key of the primary private relation; Here, we also provide the ones used in the paper in ./Query;
  • -O: output file prefix, each dimension's results are stored in one file. For the output files, the first column is the function value for the join result and the other columns are the IDs of base table tuples contributing to that join result.

For example,

python ExtractInfoMultiple.py -D stackoverflow−a2q -Q ../Query/Q9.txt -P node -K ../Query/Q9_key.txt -O ../Information/Graph/Q9_stackoverflow_a2q

Such information for the queries used in the paper have already been collected and store in ./Information/Graph and ./Information/TPCH. One notice is that, as reported in our paper, we only select some dimensions for some queries. We have already done it and the processed files are marked by _processed_. Please note that such information are also required by R2T and OptMean.

The PMSJA.py.py has four parameters

  • -I: input file prefix, each dimension's results are stored in one file.
  • -e: privacy budget epsilon;
  • -d: privacy budget delta;
  • -b: the parameter beta, which controls the probability of large error happening;

For example,

python PMSJA.py -I ../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001

The PMSJASJF.py has the same parameters.

We also give all scripts collecting experimental results, which are stored in ./script.

R2T

The R2T is implemented by ./Code/R2T.py.

The R2T.py has six parameters

  • -I: input file prefix, each dimension's results are stored in one file.
  • -e: privacy budget epsilon;
  • -d: privacy budget delta;
  • -b: the parameter beta, which controls the probability of large error happening;
  • -G: the predefined global sensitivity.
  • -p: the number of threads used.

For example,

python R2T.py -I .../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001 -b 0.1 -p 24  -G 1000000

OptMean

The OptMean is implemented by ./Code/OptMean.py.

OptMean.py has five parameters

  • -I: input file prefix, each dimension's results are stored in one file.
  • -e: privacy budget epsilon;
  • -d: privacy budget delta;
  • -b: the parameter beta, which controls the probability of large error happening;
  • -G: the predefined global sensitivity.

For example,

python OptMean.py -I .../Information/Graph/Q9_stackoverflow_a2q -e 2 -d 0.0000001 -b 0.1 -G 1000000

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages