Query regarding "non_Categorical_columns" #8

Dataset: https://drive.google.com/file/d/1NNn4rvijmrJnwb_D7-3KweOcqkv99-8j/view?usp=sharing (Lending Club loan dataset).
I was comparing training time and the quality of the generated data with and without the non_categorical_columns parameter.

As I understand it, a column listed in "non_categorical_columns" also has to be listed in "categorical_columns".
So that's what I did:

synthesizer = CTABGAN(
    raw_csv_path=real_path,
    test_ratio=0.20,
    categorical_columns=cat,
    log_columns=[],
    mixed_columns={},
    general_columns=[],
    non_categorical_columns=high_cardinality_cols,
    integer_columns=num,
    problem_type={None: None},
)
              
where:
high_cardinality_cols is the list of categorical columns with more than 10k unique values (namely 'emp_title')
cat is the list of all categorical columns (including the high-cardinality ones)
num is the list of all numeric or integer columns
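
For reference, the three lists described above could be built along these lines. This is only a sketch: `split_columns` and `high_cardinality_threshold` are names made up for illustration, and treating every non-numeric column as categorical is a heuristic, not something CTAB-GAN prescribes.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def split_columns(df: pd.DataFrame, high_cardinality_threshold: int = 10_000):
    """Return (cat, num, high_cardinality_cols) as lists of column names."""
    # Numeric columns, excluding booleans, are treated as numeric/integer.
    num = [
        c for c in df.columns
        if is_numeric_dtype(df[c]) and not is_bool_dtype(df[c])
    ]
    # Heuristic: everything else is treated as categorical.
    cat = [c for c in df.columns if c not in num]
    # Count distinct values per categorical column (NaN is not counted).
    n_unique = df[cat].nunique()
    high_cardinality_cols = n_unique[n_unique > high_cardinality_threshold].index.tolist()
    return cat, num, high_cardinality_cols
```

The counts are taken on whatever rows are passed in, so on a 100-row sample no column can have more than 100 unique values; the threshold would need to be evaluated on the full dataset or lowered for a small sample.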

I trained on the first 100 rows for 150 epochs. Training completed in 2m20s, but the following line then fails:
syn = synthesizer.generate_samples()
The error is:
/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_label.py in inverse_transform(self, y)
    160         diff = np.setdiff1d(y, np.arange(len(self.classes_)))
    161         if len(diff):
--> 162             raise ValueError("y contains previously unseen labels: %s" % str(diff))
    163         y = np.asarray(y)
    164         return self.classes_[y]

ValueError: y contains previously unseen labels: [-25 -24 -22 -21 -17 -14 -12 -11 -10  -9  -8  -6  -5  -4  -3  -2  -1  91
  92  93  94  98  99 101 103]
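
The message above is what scikit-learn's `LabelEncoder.inverse_transform` raises when it receives integer codes outside `0 .. len(classes_) - 1`. The sketch below reproduces that behaviour in isolation and shows one possible workaround, clipping the codes before decoding. It is an assumption that the generated values reach the encoder this way inside CTAB-GAN; the `generated` array is a made-up stand-in for model output, and clipping is not the library's own fix.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on four labels, so the only valid integer codes are 0, 1, 2, 3.
encoder = LabelEncoder()
encoder.fit(["analyst", "driver", "nurse", "teacher"])

# Hypothetical stand-in for generator output: continuous values rounded to
# integers, some of which fall outside the valid code range.
generated = np.array([-2.4, 0.2, 1.7, 3.1, 5.9]).round().astype(int)

try:
    encoder.inverse_transform(generated)
    failed = False
except ValueError as exc:
    # Same kind of error as in the traceback: previously unseen labels.
    failed = True
    print(exc)

# Workaround sketch: force every code into the valid range before decoding.
clipped = np.clip(generated, 0, len(encoder.classes_) - 1)
decoded = encoder.inverse_transform(clipped)
print(decoded.tolist())
```

Clipping maps every out-of-range value to the first or last class, so it distorts the distribution of the decoded column; it only makes the decoding step run.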
  
Clearly the error comes from the inverse transformation step.
@zhao-zilong Can you explain the reason for the error, and suggest how to decide which columns should go into the non_categorical_columns parameter?

Also, please give a formal definition of these 3 parameters:
general_columns, mixed_columns, log_columns

Thanks in advance!!
