Description
Dataset : https://drive.google.com/file/d/1NNn4rvijmrJnwb_D7-3KweOcqkv99-8j/view?usp=sharing : Lending Loan Club Dataset.
I was comparing training with and without the non_categorical_columns parameter, looking at both training time and the quality of the generated data.
As I understand it, any column listed in non_categorical_columns must also be listed in categorical_columns.
So that's what I did:
synthesizer = CTABGAN(
    raw_csv_path=real_path,
    test_ratio=0.20,
    categorical_columns=cat,
    log_columns=[],
    mixed_columns={},
    general_columns=[],
    non_categorical_columns=high_cardinality_cols,
    integer_columns=num,
    problem_type={None: None})
where:
high_cardinality_cols refers to categorical columns with more than 10k unique values (namely 'emp_title')
cat refers to all categorical columns (including the high-cardinality ones)
num refers to all numeric or integer columns
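For reference, the three lists described above could be built along these lines. This is only a minimal sketch with pandas: the helper name split_columns, the toy data, and the threshold of 5 are hypothetical and are not part of CTAB-GAN. It also treats every non-numeric column as categorical, which is an assumption that may not hold for every dataset.

```python
import pandas as pd


def split_columns(df: pd.DataFrame, high_cardinality_threshold: int = 10_000):
    """Hypothetical helper: return (cat, high_cardinality_cols, num) for a frame."""
    # Assumption: every non-numeric column is treated as categorical.
    cat = df.select_dtypes(exclude="number").columns.tolist()
    num = df.select_dtypes(include="number").columns.tolist()
    # High-cardinality columns are the categorical ones with many distinct values.
    high = [c for c in cat if df[c].nunique() > high_cardinality_threshold]
    return cat, high, num


# Toy stand-in for the real dataset; the threshold is lowered so the example is small.
toy = pd.DataFrame({
    "emp_title": [f"title_{i}" for i in range(6)],
    "grade": ["A", "B", "A", "C", "B", "A"],
    "loan_amnt": [1000, 2500, 1200, 8000, 500, 3000],
})
cat, high_cardinality_cols, num = split_columns(toy, high_cardinality_threshold=5)
print(cat, high_cardinality_cols, num)
```

Note that the unique-value count depends on how many rows are loaded, so a column that exceeds the threshold on the full file may not exceed it on a small subset.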
I trained on the first 100 rows for 150 epochs. Training completed in 2m20s, but sample generation then fails at this line:
syn = synthesizer.generate_samples()
The error I encountered is:
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
160 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
161 if len(diff):
--> 162 raise ValueError("y contains previously unseen labels: %s" % str(diff))
163 y = np.asarray(y)
164 return self.classes_[y]
ValueError: y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91
92 93 94 98 99 101 103]
The error clearly comes from the inverse transformation step: the label encoder is asked to decode integer labels that fall outside the range it saw during fitting.
@zhao-zilong Can you explain the reason for this error, and suggest how to decide which columns should go into the non_categorical_columns parameter?
Also, could you please give a formal definition of these three parameters:
general_columns, mixed_columns, log_columns
Thanks in advance!
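The same ValueError can be reproduced in isolation with scikit-learn's LabelEncoder. The sketch below assumes the generated values for the column are continuous and are rounded to integer codes before decoding; whether CTAB-GAN does exactly this internally is an assumption, not something confirmed here. The example labels and generated values are made up. Clipping the codes to the valid range is shown only as a possible workaround, not as the library's intended fix.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on four labels, so the only valid integer codes are 0..3.
encoder = LabelEncoder()
encoder.fit(["analyst", "driver", "nurse", "teacher"])

# Hypothetical generator output for the encoded column (continuous values).
generated = np.array([-2.3, 0.4, 1.6, 3.2, 5.7])
codes = np.rint(generated).astype(int)  # some codes land outside 0..3

try:
    encoder.inverse_transform(codes)
    raised = False
except ValueError as err:
    # Same failure mode as in the traceback: codes outside the fitted range.
    raised = True
    print(err)

# Possible workaround: clip codes into the valid range before decoding.
clipped = np.clip(codes, 0, len(encoder.classes_) - 1)
decoded = encoder.inverse_transform(clipped)
print(decoded)
```

Clipping pushes every out-of-range value onto the first or last class, so it distorts the distribution of the generated column and only hides the symptom.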