Description
I am working with auto-claims data at the incident level, where each incident has multiple daily snapshots with time-varying covariates. My goal is to predict how long an incident will remain open (i.e., time to closure) from these daily snapshots. I initially used CoxTimeVaryingFitter, but struggled to convert partial hazards into survival probabilities or expected durations. I am now using PiecewiseExponentialRegressionFitter with fixed breakpoints, assuming the hazard stays constant between consecutive breakpoints.
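To make the piecewise-constant hazard assumption concrete, here is a minimal sketch of how a survival probability and an expected duration follow from per-interval hazards. It uses plain NumPy rather than lifelines, and the function names, breakpoints, and hazard values are hypothetical placeholders for illustration, not fitted results.

```python
import numpy as np


def piecewise_survival(t, breakpoints, hazards):
    """S(t) = exp(-cumulative hazard) under a piecewise-constant hazard.

    `hazards` needs len(breakpoints) + 1 positive entries, one per interval
    [0, b1), [b1, b2), ..., [bk, inf).
    """
    edges = np.concatenate(([0.0], np.asarray(breakpoints, dtype=float), [np.inf]))
    widths = edges[1:] - edges[:-1]
    # Time spent inside each interval up to t (0 for intervals after t).
    exposure = np.clip(t - edges[:-1], 0.0, widths)
    return float(np.exp(-np.sum(np.asarray(hazards, dtype=float) * exposure)))


def expected_duration(breakpoints, hazards):
    """E[T] as the integral of S(t), summed interval by interval."""
    edges = np.concatenate(([0.0], np.asarray(breakpoints, dtype=float), [np.inf]))
    total = 0.0
    for start, end, h in zip(edges[:-1], edges[1:], hazards):
        s_start = piecewise_survival(start, breakpoints, hazards)
        # Closed-form integral of an exponential decay over [start, end).
        total += s_start * (1.0 - np.exp(-h * (end - start))) / h
    return float(total)


# Hypothetical values, chosen only to illustrate the calculation.
example_breaks = [2, 5]
example_hazards = [0.05, 0.20, 0.10]
print(piecewise_survival(3.0, example_breaks, example_hazards))
print(expected_duration(example_breaks, example_hazards))
```

This ignores covariates entirely. In a regression setting each interval's hazard would additionally depend on the covariate values, so the sketch only shows the shape of the calculation.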
Snapshot of my data and code
Each row represents a daily interval for an incident:
| incident_id | days-elapsed | covariate_1 | covariate_2 | covariate_3 | start | stop | event | 
|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 
| 1 | 1 | 1 | 0 | 0 | 1 | 2 | 0 | 
| 1 | 2 | 1 | 0 | 0 | 2 | 3 | 0 | 
| 1 | 3 | 1 | 0 | 1 | 3 | 4 | 1 | 
| 2 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 
| 2 | 1 | 0 | 1 | 0 | 1 | 2 | 0 | 
| 2 | 2 | 0 | 1 | 1 | 2 | 3 | 0 | 
| 2 | 3 | 1 | 1 | 1 | 3 | 4 | 1 | 
```python
from lifelines import PiecewiseExponentialRegressionFitter

# Breakpoints picked from the sharp drops in the KM survival curve
breaks = [1, 2, 3, 4, 60, 150, 200, 250, 300]

# Fit model on the daily start/stop rows
pef = PiecewiseExponentialRegressionFitter(breakpoints=breaks)
pef.fit(df=df_sample_preprocessed, duration_col="stop", event_col="event", entry_col="start")
```
When I run the code above, I get:
<lifelines.PiecewiseExponentialRegressionFitter: fitted with 61 total observations, 59 right-censored observations>
The model treats each daily observation as censored unless an event occurs, which seems to imply that each row is independent even though multiple rows belong to the same incident.
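To show what I mean by row-level versus incident-level counts, here is a minimal pandas sketch using only the eight sample rows from the table above. The variable names are my own, and the real dataset has more rows, so the numbers will not match the fitter summary.

```python
import pandas as pd

columns = ["incident_id", "days-elapsed", "covariate_1", "covariate_2",
           "covariate_3", "start", "stop", "event"]
rows = [
    (1, 0, 0, 1, 0, 0, 1, 0),
    (1, 1, 1, 0, 0, 1, 2, 0),
    (1, 2, 1, 0, 0, 2, 3, 0),
    (1, 3, 1, 0, 1, 3, 4, 1),
    (2, 0, 0, 0, 0, 0, 1, 0),
    (2, 1, 0, 1, 0, 1, 2, 0),
    (2, 2, 0, 1, 1, 2, 3, 0),
    (2, 3, 1, 1, 1, 3, 4, 1),
]
df = pd.DataFrame(rows, columns=columns)

# Row-level view: every daily interval counts as one observation.
n_rows = len(df)
n_censored_rows = int((df["event"] == 0).sum())

# Incident-level view: one record per incident.
per_incident = df.groupby("incident_id").agg(
    closed=("event", "max"),   # 1 if the incident ever closed
    duration=("stop", "max"),  # last observed day
)
n_incidents = len(per_incident)
n_closed = int(per_incident["closed"].sum())

print(f"rows: {n_rows}, censored rows: {n_censored_rows}")
print(f"incidents: {n_incidents}, closed incidents: {n_closed}")
```

On the sample, 6 of the 8 rows have no event even though both incidents eventually close, which is the same pattern as the 59 of 61 "right-censored observations" in the summary.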
Some questions:
- Is it appropriate for the model to treat each daily snapshot as independent when predicting survival probability at any point in time?
- Should I keep the incident ID column in my features to keep the grouping structure or is it better to exclude it?
- Can this format of data be used to estimate the total number of days an incident will survive, based on its daily covariates?