In PARSynthesizer, I cannot apply a context column that is sdtype id (or another PII type) #2466

@npatki

Environment Details

  • SDV version: 1.20.0
  • Python version: 3.11
  • Operating System: Linux (Google Colab)

Error Description

When setting up my PARSynthesizer, I would like to add a context column that is an sdtype id. This is to signify that:
(a) the value in this column is constant within a given sequence and
(b) the possible values in this column should be generated from scratch (e.g. using my provided regex format)
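
To make expectations (a) and (b) concrete, here is a minimal sketch in plain pandas and the standard library. It does not use SDV at all: `make_source_id` is a hypothetical helper written only for illustration, and the hard-coded pattern `source-[A-Z]{3,5}` is the regex format from the metadata in the reproduction steps.

```python
import random
import re
import string

import pandas as pd

# Toy data shaped like the reproduction example: 'event_id' is the
# sequence key and 'event_source' is the intended context column.
data = pd.DataFrame({
    'event_id': ['event-000'] * 5 + ['event-001'] * 2 + ['event-002'] * 3,
    'event_source': ['source-AAA'] * 5 + ['source-BBB'] * 2 + ['source-CCC'] * 3,
})

# (a) A context column holds exactly one distinct value per sequence.
values_per_sequence = data.groupby('event_id')['event_source'].nunique()
is_constant = bool((values_per_sequence == 1).all())

# (b) Hypothetical helper (not part of SDV): build a fresh value that
# follows the pattern 'source-[A-Z]{3,5}' instead of reusing real values.
def make_source_id(rng):
    length = rng.randint(3, 5)  # randint includes both endpoints
    letters = ''.join(rng.choices(string.ascii_uppercase, k=length))
    return 'source-' + letters

rng = random.Random(0)
new_id = make_source_id(rng)
matches_format = re.fullmatch(r'source-[A-Z]{3,5}', new_id) is not None
```

Here `is_constant` confirms that the input data already satisfies (a), and `matches_format` confirms that a freshly generated value satisfies the declared format. The expectation in this issue is that synthetic sequences would preserve both properties.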

However, in practice, the synthesizer crashes during fit. (This happens for any sdtype that is not modeled, including other PII types such as 'ssn', 'credit_card_number', etc.)

Steps to reproduce

The example below was created as a hypothetical based on a conversation with a Slack user.

import numpy as np
import pandas as pd

from sdv.metadata import Metadata
from sdv.sequential import PARSynthesizer

# create data and metadata where 'event_id' is the sequence key and
# 'event_source' is a context column of sdtype 'id'
data = pd.DataFrame(data={
    'event_id': ['event-000']*5 + ['event-001']*2 + ['event-002']*3,
    'event_source': ['source-AAA']*5 + ['source-BBB']*2 + ['source-CCC']*3,
    'column_A': np.random.randint(low=0, high=10, size=10),
    'column_B': np.random.choice(['Yes', 'No', 'Maybe'], size=10)
})

metadata = Metadata.load_from_dict({
    'tables': {
        'table': {
            'sequence_key': 'event_id',
            'columns': {
                'event_id': {'sdtype': 'id', 'regex_format': 'event-[0-9]{3,4}'},
                'event_source': { 'sdtype': 'id', 'regex_format': 'source-[A-Z]{3,5}'},
                'column_A': { 'sdtype': 'numerical' },
                'column_B': { 'sdtype': 'categorical'}
            }
        }
    }
})

# supply the 'event_source' id column as a context column
synthesizer = PARSynthesizer(metadata, epochs=1, context_columns=['event_source'])
synthesizer.fit(data)

KeyError: "['event_source'] not in index"

See below for the full stack trace.

stack_trace.txt

Labels: bug (Something isn't working), data:sequential (Related to timeseries datasets)