
Query regarding "non_Categorical_columns" #8

@yash-rathore-arya


Dataset: https://drive.google.com/file/d/1NNn4rvijmrJnwb_D7-3KweOcqkv99-8j/view?usp=sharing (Lending Club loan dataset).
I was running a comparative analysis of using the non_categorical_columns parameter versus not using it, with respect to training time and the quality of the generated data.

As I understand it, to include a column in non_categorical_columns, you also need to add that column to categorical_columns.
So that's what I did:

synthesizer = CTABGAN(
    raw_csv_path=real_path,
    test_ratio=0.20,
    categorical_columns=cat,
    log_columns=[],
    mixed_columns={},
    general_columns=[],
    non_categorical_columns=high_cardinality_cols,
    integer_columns=num,
    problem_type={None: None},
)

where:
high_cardinality_cols refers to the categorical columns with more than 10k unique values (namely 'emp_title'),
cat refers to all categorical columns (including the high-cardinality ones),
num refers to all numeric or integer columns.
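
For reference, a minimal sketch of how the three lists above could be derived with pandas. The helper name split_columns, the demo frame, and the 'grade' and 'loan_amnt' columns are illustrative assumptions, not part of CTAB-GAN; the threshold is lowered to 3 so that the tiny demo frame has a high-cardinality column.

```python
import pandas as pd

def split_columns(df: pd.DataFrame, cardinality_threshold: int):
    """Return (cat, high_cardinality_cols, num) for a DataFrame.

    Hypothetical helper: every non-numeric column is treated as categorical,
    which also sweeps in datetime or bool columns if the frame has any.
    """
    cat = df.select_dtypes(exclude=["number"]).columns.tolist()
    num = df.select_dtypes(include=["number"]).columns.tolist()
    # A categorical column is "high cardinality" when its number of
    # distinct values exceeds the chosen threshold.
    high = [c for c in cat if df[c].nunique() > cardinality_threshold]
    return cat, high, num

# Tiny illustrative frame, not the real dataset.
demo = pd.DataFrame({
    "emp_title": ["teacher", "nurse", "driver", "engineer"],
    "grade": ["A", "B", "A", "B"],
    "loan_amnt": [1000, 2500, 4000, 1200],
})

cat, high_cardinality_cols, num = split_columns(demo, cardinality_threshold=3)
print(cat, high_cardinality_cols, num)
```

Note that the cardinality is computed on whatever rows are passed in, so a column with more than 10k unique values in the full file can have at most 100 unique values in a 100-row subset.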

I trained on the first 100 rows for 150 epochs. Training completed in 2m20s, but the following call fails:
syn = synthesizer.generate_samples()
The error I encountered is:
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
160 diff = np.setdiff1d(y, np.arange(len(self.classes_)))
161 if len(diff):
--> 162 raise ValueError("y contains previously unseen labels: %s" % str(diff))
163 y = np.asarray(y)
164 return self.classes_[y]

ValueError: y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10 -9 -8 -6 -5 -4 -3 -2 -1 91
92 93 94 98 99 101 103]
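
The same exception can be reproduced in isolation with scikit-learn's LabelEncoder, whose inverse_transform rejects any integer outside the range of codes seen during fit. The sketch below assumes a four-class encoder and made-up generated codes. The np.clip step is only a hypothetical workaround for inspecting the output: it is not CTAB-GAN's behaviour and it distorts the distribution at the boundary classes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on four classes, so the only valid integer codes are 0..3.
encoder = LabelEncoder()
encoder.fit(["driver", "engineer", "nurse", "teacher"])

# Made-up "generated" codes; -2 and 5 fall outside the fitted range.
generated = np.array([-2, 0, 3, 5])

try:
    encoder.inverse_transform(generated)
    raised = False
    message = ""
except ValueError as exc:
    raised = True
    message = str(exc)

print(raised, message)

# Hypothetical workaround (an assumption, not the library's fix):
# clamp the codes into the valid range before decoding.
clipped = np.clip(generated, 0, len(encoder.classes_) - 1)
decoded = encoder.inverse_transform(clipped)
print(decoded.tolist())
```

The labels in the traceback above (negative values and values of 91 and higher) fit this pattern: integer codes were produced for the column that lie outside the range the encoder was fitted on.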

The error clearly comes from the inverse transformation step.
@zhao-zilong Can you explain the reason for this error, and suggest some ways to determine which columns should go into the non_categorical_columns parameter?

Also, please give a formal definition of these three parameters:
general_columns, mixed_columns, log_columns

Thanks in advance!!
