Description
Background
Each synthetic person in the synthetic population generated by scripts/create_synthetic_population has a homeId, which is a 17-digit number. The first 11 digits are based on the person's state and PUMA (will be based on state + county + census tract after #281 is merged in). The next 6 digits are a household-level identifier (the house_number) that is defined state-wide as a contiguous set of integers.
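As a rough illustration of the layout described above, here is a minimal sketch of how a 17-digit homeId could be composed and split. The function names, the string representation, and the zero-padding are assumptions made for illustration; they are not taken from scripts/create_synthetic_population.

```python
from typing import Tuple

PREFIX_DIGITS = 11  # geographic component (state + PUMA), per the description above
HOUSE_DIGITS = 6    # household-level component (house_number)


def make_home_id(geo_prefix: str, house_number: int) -> str:
    """Hypothetical helper: join an 11-digit prefix and a zero-padded house_number."""
    if len(geo_prefix) != PREFIX_DIGITS or not geo_prefix.isdigit():
        raise ValueError("geo_prefix must be exactly 11 digits")
    if not 0 <= house_number < 10 ** HOUSE_DIGITS:
        raise ValueError("house_number does not fit in 6 digits")
    return f"{geo_prefix}{house_number:0{HOUSE_DIGITS}d}"


def split_home_id(home_id: str) -> Tuple[str, int]:
    """Hypothetical helper: recover (geo_prefix, house_number) from a homeId."""
    return home_id[:PREFIX_DIGITS], int(home_id[PREFIX_DIGITS:])
```

Under these assumptions, `make_home_id("12345678901", 42)` gives `"12345678901000042"`, and a house_number of 1,000,000 or more is rejected rather than silently altered.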
What's the problem?
Six digits allow only 1,000,000 distinct values (000000 to 999999). Because the synthetic population could contain 1 million or more households, restricting the household-level identifier to 6 digits could lead to identifiers de facto being reused. In practice, most homeId values will still be unique because of the PUMA / census tract component.
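One way to check whether reuse actually produces duplicate homeId values is sketched below. It assumes, purely for illustration, that a house_number too large for 6 digits is reduced to its last 6 digits; how the real script behaves on overflow is not stated in this issue. `truncated_suffix` and `find_collisions` are hypothetical names.

```python
from collections import defaultdict

HOUSE_DIGITS = 6
CAPACITY = 10 ** HOUSE_DIGITS  # 1,000,000 distinct 6-digit values


def truncated_suffix(house_number: int) -> str:
    """Assumed overflow behaviour: keep only the last 6 digits."""
    return f"{house_number % CAPACITY:0{HOUSE_DIGITS}d}"


def find_collisions(households):
    """Map each duplicated homeId to the house_numbers that produced it.

    households: iterable of (geo_prefix, house_number) pairs.
    """
    seen = defaultdict(list)
    for geo_prefix, house_number in households:
        seen[geo_prefix + truncated_suffix(house_number)].append(house_number)
    return {home_id: nums for home_id, nums in seen.items() if len(nums) > 1}


households = [
    ("12345678901", 5),
    ("12345678901", 1_000_005),  # same prefix, same last 6 digits: collides
    ("12345678902", 1_000_005),  # different prefix: homeId stays unique
]
collisions = find_collisions(households)
```

Here only the two households that share a geographic prefix collide, which matches the observation that the PUMA / census tract component keeps most homeId values unique.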
A similar issue may arise with workplaces and, theoretically, with schools.
What are possible solutions?
One solution would be to lengthen the household-identifier component of homeId to, say, 8 digits, which would allow for up to 100 million households (00000000 to 99999999) in the synthetic population. This would be trivially easy to change in the code, but we'd have to check for any downstream consequences on how the homeId values are parsed. A more principled solution would be to create household-level identifiers that are sequential and nested within census tracts, though this would require more changes to the code to implement.
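The nested alternative could look roughly like the sketch below: each census tract gets its own counter, so the household component only needs to cover the households in one tract rather than the whole state. The function name, the 0-based numbering, and the input format (one 11-digit tract prefix per household) are assumptions for illustration, not the project's implementation.

```python
from collections import defaultdict


def assign_nested_home_ids(tract_prefixes, house_digits=6):
    """Assign sequential household numbers that restart within each tract.

    tract_prefixes: one geographic prefix string per household, in any order.
    Returns the homeId strings in the same order as the input.
    """
    next_number = defaultdict(int)  # per-tract counter, starting at 0
    home_ids = []
    for prefix in tract_prefixes:
        number = next_number[prefix]
        if number >= 10 ** house_digits:
            raise OverflowError(f"tract {prefix} has too many households")
        home_ids.append(f"{prefix}{number:0{house_digits}d}")
        next_number[prefix] = number + 1
    return home_ids


ids = assign_nested_home_ids(["12345678901", "12345678901", "12345678902"])
```

With this scheme a 6-digit component overflows only if a single tract holds a million or more households, and the homeId length stays at 17 digits, so existing parsing code would not need to change.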