Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets)
Description
Environment Details
- SDV version: 1.20.0
- Python version: 3.11
- Operating System: Linux (Google Colab)
Error Description
When setting up my PARSynthesizer, I would like to add a context column whose sdtype is 'id'. This is to signify that:
(a) the value in this column is constant within a given sequence, and
(b) the possible values in this column should be generated from scratch (e.g. using my provided regex format).
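To make (a) and (b) concrete, here is a minimal sketch of the output I would expect for such a column. It is only an illustration of the desired behavior, not SDV's implementation or a workaround: the helper names `random_source_id` and `context_column_for` are hypothetical, the generator is hand-written for the single pattern 'source-[A-Z]{3,5}', and the sequence keys are example values.

```python
import random
import re
import string

SOURCE_PATTERN = r'source-[A-Z]{3,5}'  # the regex format I would supply


def random_source_id(rng):
    """Hypothetical generator, hand-written for 'source-[A-Z]{3,5}' only."""
    length = rng.randint(3, 5)  # randint is inclusive on both ends
    letters = ''.join(rng.choice(string.ascii_uppercase) for _ in range(length))
    return f'source-{letters}'


def context_column_for(sequence_keys, rng):
    """Draw one fresh value per sequence key and repeat it on every row."""
    value_by_key = {}
    column = []
    for key in sequence_keys:
        if key not in value_by_key:
            value_by_key[key] = random_source_id(rng)
        column.append(value_by_key[key])
    return column


rng = random.Random(0)  # seeded only so the sketch is repeatable
keys = ['event-000'] * 5 + ['event-001'] * 2 + ['event-002'] * 3
event_source = context_column_for(keys, rng)

# (a) exactly one value per sequence, (b) every value matches the format
values_per_key = {k: {v for kk, v in zip(keys, event_source) if kk == k} for k in set(keys)}
print(all(len(vals) == 1 for vals in values_per_key.values()))
print(all(re.fullmatch(SOURCE_PATTERN, v) for v in event_source))
```

Note that this sketch does not try to make the generated values unique across sequences; it only shows the per-sequence constancy and the regex-based generation.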
However, in practice, the synthesizer crashes during fit. (This happens for any sdtype that is not modeled, including other PII types such as 'ssn', 'credit_card_number', etc.)
Steps to reproduce
The example below was created as a hypothetical based on a conversation with a Slack user.
import numpy as np
import pandas as pd
from sdv.metadata import Metadata
from sdv.sequential import PARSynthesizer
# create data and metadata where 'event_id' is the sequence key and
# 'event_source' is a context column of sdtype 'id'
data = pd.DataFrame(data={
    'event_id': ['event-000']*5 + ['event-001']*2 + ['event-002']*3,
    'event_source': ['source-AAA']*5 + ['source-BBB']*2 + ['source-CCC']*3,
    'column_A': np.random.randint(low=0, high=10, size=10),
    'column_B': np.random.choice(['Yes', 'No', 'Maybe'], size=10)
})
metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'sequence_key': 'event_id',
            'columns': {
                'event_id': {'sdtype': 'id', 'regex_format': 'event-[0-9]{3,4}'},
                'event_source': {'sdtype': 'id', 'regex_format': 'source-[A-Z]{3,5}'},
                'column_A': {'sdtype': 'numerical'},
                'column_B': {'sdtype': 'categorical'}
            }
        }
    }
})
# supply the 'event_source' id column as a context column
synthesizer = PARSynthesizer(metadata, epochs=1, context_columns=['event_source'])
synthesizer.fit(data)
KeyError: "['event_source'] not in index"
See below for the full stack trace.
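As a sanity check that the input data itself is not the problem, the snippet below confirms that 'event_source' is constant within each 'event_id' and that both columns match their regex formats. It uses only the standard library, with the key and context values copied from the example above; the helper names are hypothetical and it does not call SDV.

```python
import re
from collections import defaultdict

# Key and context values copied from the reproduction data above.
event_id = ['event-000'] * 5 + ['event-001'] * 2 + ['event-002'] * 3
event_source = ['source-AAA'] * 5 + ['source-BBB'] * 2 + ['source-CCC'] * 3


def context_values_per_sequence(keys, values):
    """Map each sequence key to the set of context values seen with it."""
    seen = defaultdict(set)
    for key, value in zip(keys, values):
        seen[key].add(value)
    return dict(seen)


def all_match(pattern, values):
    """Return True if every value fully matches the regex pattern."""
    return all(re.fullmatch(pattern, value) is not None for value in values)


per_sequence = context_values_per_sequence(event_id, event_source)
constant_within_sequence = all(len(v) == 1 for v in per_sequence.values())
ids_match = all_match(r'event-[0-9]{3,4}', event_id)
sources_match = all_match(r'source-[A-Z]{3,5}', event_source)

print(constant_within_sequence, ids_match, sources_match)
```

All three checks pass for this data, so the KeyError does not appear to come from an inconsistent context column or from values that violate the regex formats.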