This is official repo for paper "GTV: Generating Tabular Data via Vertical Federated Learning"

zhao-zilong/gtv

GTV: Generating Tabular Data via Vertical Federated Learning

This repo is the testbed for the paper "GTV: Generating Tabular Data via Vertical Federated Learning", which was accepted at the 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). The demo shown here runs GTV on the Loan dataset. The code contains three parts: D_2_0_G_0_2, D_2_0_G_2_0 and Evaluation. D_2_0_G_0_2 and D_2_0_G_2_0 are two configurations of GTV; Evaluation contains the code for evaluating the quality of the synthetic data. The training process below uses D_2_0_G_0_2 as an example; the same process applies to D_2_0_G_2_0.

Training process

Inside the folder D_2_0_G_0_2 there are three folders: Server, Client1 and Client2. The following example shows how to run GTV with one server and two clients.

Server

Since we use PyTorch RPC as our communication method, a few parameters are needed to set up GTV. For example, to run the demo on the Loan dataset with two clients, first enter the Server folder and run the following command:

python3 server.py -ip x.x.x.x -rank 0 -epochs 300 -batch_size 500 -n_critic 5 -world_size 3 -dataset Loan

server.py contains the server-side code of GTV; there is no training data on the server side. The command takes several parameters. ip is the IP of the server. rank indicates the rank of the current process; it must be set to 0 when running the server. epochs is the number of training epochs, but be careful: an epoch here corresponds to a round in the paper. world_size is the number of clients plus one for the server. For instance, setting world_size to 3 means there are two clients. dataset is the name of the training dataset used on the client side.
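As a rough illustration of how these flags relate to each other, the sketch below parses the same flags as the server command with argparse and checks rank against world_size. It is not taken from server.py: build_parser and check_args are hypothetical names, and the defaults are simply the values from the example command.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser that mirrors the flags of the example command."""
    parser = argparse.ArgumentParser(description="GTV launch flags (sketch)")
    parser.add_argument("-ip", required=True, help="IP of the server")
    parser.add_argument("-rank", type=int, required=True,
                        help="0 for the server, 1..N for the clients")
    # Defaults below are assumptions copied from the example command.
    parser.add_argument("-epochs", type=int, default=300,
                        help="training epochs (rounds in the paper)")
    parser.add_argument("-batch_size", type=int, default=500)
    parser.add_argument("-n_critic", type=int, default=5)
    parser.add_argument("-world_size", type=int, required=True,
                        help="number of clients plus one server")
    parser.add_argument("-dataset", default="Loan")
    return parser


def check_args(args: argparse.Namespace, is_server: bool) -> int:
    """Validate rank against world_size and return the number of clients."""
    n_clients = args.world_size - 1
    if n_clients < 1:
        raise ValueError("world_size must be at least 2 (one server, one client)")
    if is_server and args.rank != 0:
        raise ValueError("the server must run with rank 0")
    if not is_server and not 1 <= args.rank <= n_clients:
        raise ValueError(f"client rank must be between 1 and {n_clients}")
    return n_clients


if __name__ == "__main__":
    demo = "-ip 10.0.0.1 -rank 0 -world_size 3".split()  # placeholder IP
    parsed = build_parser().parse_args(demo)
    print(check_args(parsed, is_server=True), "clients expected")
```

Checking the rank before the RPC group is initialised makes a mistyped rank fail immediately on the machine where it was typed.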

Clients

Under the Client1 and Client2 folders, we have the script client_loan.py. For a client to join GTV, run the following under the Client1 folder:

python3 client_loan.py -ip x.x.x.x -rank 1 -epochs 300 -batch_size 500 -n_critic 5 -world_size 3

And under the Client2 folder, run:

python3 client_loan.py -ip x.x.x.x -rank 2 -epochs 300 -batch_size 500 -n_critic 5 -world_size 3

The only difference between the two commands is -rank. Each node in GTV must have its own unique rank. For N clients, the ranks run from 1 to N without repetition. In both commands, ip is the server's IP.
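The rank rule can be made concrete with a small helper that prints the client command for each rank. This is only a sketch: client_commands is a hypothetical name, the flag values are those of the example above, and the repository itself ships only the Client1 and Client2 folders.

```python
def client_commands(server_ip: str, n_clients: int, epochs: int = 300,
                    batch_size: int = 500, n_critic: int = 5,
                    script: str = "client_loan.py") -> list[str]:
    """Build one launch command per client, with ranks 1..n_clients."""
    world_size = n_clients + 1  # all clients plus the single server
    commands = []
    for rank in range(1, n_clients + 1):
        commands.append(
            f"python3 {script} -ip {server_ip} -rank {rank} "
            f"-epochs {epochs} -batch_size {batch_size} "
            f"-n_critic {n_critic} -world_size {world_size}"
        )
    return commands


if __name__ == "__main__":
    # x.x.x.x stands for the server's IP, as in the commands above.
    for command in client_commands("x.x.x.x", n_clients=2):
        print(command)
```

Each printed command is meant to be run inside the matching client folder on that client's machine.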

Network configuration

  1. We tested our code under Ubuntu 20.04. In /etc/hosts, change the IP 127.0.1.1 to the machine's current IP in the subnet. This needs to be done on both the server and the client machines.

  2. The server and the clients also need to allow all connections from each other, using the following commands:

Server : sudo ufw allow from [Client IP]
Client : sudo ufw allow from [Server IP]

Be careful: even though our code specifies port 7788 for RPC, the PyTorch RPC library opens more ports in the background. Allowing only port 7788 is not enough for the server and clients to connect.
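To check whether the /etc/hosts change from step 1 is still pending on a machine, a short script can list the hostnames that are still mapped to 127.0.1.1. This is a minimal sketch and not part of the repository; loopback_hostnames is a hypothetical helper name.

```python
def loopback_hostnames(hosts_text: str, loopback_ip: str = "127.0.1.1") -> list[str]:
    """Return the hostnames that a hosts file maps to loopback_ip."""
    names = []
    for line in hosts_text.splitlines():
        # Drop trailing comments, then split into "<ip> <name> [<alias> ...]".
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == loopback_ip:
            names.extend(fields[1:])
    return names


if __name__ == "__main__":
    with open("/etc/hosts", encoding="utf-8") as hosts_file:
        pending = loopback_hostnames(hosts_file.read())
    if pending:
        print("Still mapped to 127.0.1.1:", ", ".join(pending))
    else:
        print("No 127.0.1.1 entries found.")
```

An empty result suggests the entry has already been replaced with the subnet IP (or was never present).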

Possible Training error

On the Linux system where we ran our experiments, one possible error is

RuntimeError: ECONNREFUSED: connection refused

This is because we use Gloo as the backend. By default, the Gloo backend tries to find the right network interface to use, but sometimes it picks the wrong one. In that case, we need to override it with:

export GLOO_SOCKET_IFNAME=eth0

Here eth0 is the name of the network interface; you may need to change it to match your machine. Use ifconfig (or ip addr) in the terminal to check the name.
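If ifconfig is not installed, the interface names can also be listed from Python with the standard library. This is a small sketch for Linux; interface_names is a hypothetical helper name, and which interface to choose still depends on your network setup.

```python
import socket


def interface_names() -> list[str]:
    """Return the names of the network interfaces known to the OS."""
    # socket.if_nameindex() yields (index, name) pairs on Unix-like systems.
    return [name for _index, name in socket.if_nameindex()]


if __name__ == "__main__":
    for name in interface_names():
        print(name)
```

Pick the interface that carries the subnet IP used for -ip and export its name as GLOO_SOCKET_IFNAME before starting the server and the clients.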

Output

The training process above writes its output to the Server/generation/ folder. The outputs have names of the form sample_x.csv. For instance, sample_299.csv is the synthetic data generated by GTV at the 300th epoch. One such output is generated every 10 rounds.
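Since several outputs accumulate during training, it can help to pick the most recent one automatically. The sketch below assumes only the sample_x naming described above; latest_sample is a hypothetical helper and not part of the repository.

```python
import os
import re

# Matches names such as "sample_299.csv" and captures the number.
SAMPLE_RE = re.compile(r"^sample_(\d+)")


def latest_sample(names):
    """Return the name with the highest sample index, or None if none match."""
    best_index, best_name = -1, None
    for name in names:
        match = SAMPLE_RE.match(name)
        if match and int(match.group(1)) > best_index:
            best_index, best_name = int(match.group(1)), name
    return best_name


if __name__ == "__main__":
    generation_dir = os.path.join("Server", "generation")
    if os.path.isdir(generation_dir):
        print(latest_sample(os.listdir(generation_dir)))
```

Comparing the parsed integers rather than the raw strings avoids sorting sample_99 after sample_299.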

Evaluation of the result

We provide an IPython notebook, Loan_Evaluation_Demo.ipynb, under the Evaluation folder. Put the generated data into the corresponding folder under Evaluation/Fake_Datasets/, check that the real data is under the Evaluation/Real_Datasets/ folder, and then run the notebook to conduct the test. Machine learning utility and statistical similarity are both shown within the demo.

Bibtex

To cite this paper, you can use this BibTeX entry:

@inproceedings{zhao2025gtv,
  title={{GTV}: Generating tabular data via vertical federated learning},
  author={Zhao, Zilong and Wu, Han and Van Moorsel, Aad and Chen, Lydia Y},
  booktitle={2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)},
  pages={33--46},
  year={2025},
  organization={IEEE}
}
