Normalization Logic

SteveScott edited this page Jul 21, 2022 · 5 revisions

What is normalization?

All counts in SafeGraph are based on observed devices. However, not everyone in the city has a device. How do we estimate the actual number of people visiting a place if not everyone has a SafeGraph-enabled device?

This is where normalization comes in. There are a few tables we can use to estimate the population.

  1. Home Panel Summary

The home panel summary (HPS), provided by SafeGraph, gives the number of devices that reside in a given census block group (CBG). Each time period has its own home panel summary: weekly data is published as a weekly patterns file together with a home panel summary for that week, and monthly patterns are likewise released with a monthly HPS.

  2. Census

SafeGraph provides complete census data annually (via the American Community Survey or the decennial census).

  3. Patterns

The patterns tables (weekly or monthly) contain two different kinds of counts. One is a point of interest's (POI's) visitor count. The other is a breakdown of where those visitors came from: a JSON object that maps each home census block group to the number of devices that visited from it. If too few people visit from a block group, that block group is not recorded.
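As a minimal sketch of reading that breakdown, assuming the field arrives as a JSON string that maps CBG IDs to device counts. The helper name and the sample value are hypothetical and are not taken from the project code:

```python
import json


def parse_visitor_home_cbg(raw):
    """Return {cbg_id: device_count}; an empty dict if the field is missing or empty."""
    if not raw:
        return {}
    parsed = json.loads(raw)
    # Normalize key and value types so later lookups behave consistently.
    return {str(cbg): int(count) for cbg, count in parsed.items()}


# Hypothetical sample value; a real row may list many more block groups.
sample = '{"360610001001": 12, "360470002002": 5}'
home_cbgs = parse_visitor_home_cbg(sample)
print(home_cbgs)
```

Returning an empty dictionary for a missing value lets the caller treat "no breakdown available" the same way as "no block groups listed".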

How is normalization calculated?

The general strategy is to multiply the visits by a population multiplier, defined as population / devices residing. The multiplier could be calculated once on a large, citywide scale, but here we decided to create a multiplier for each POI based on the census block groups its visitors came from.

The population multiplier is calculated separately for each point of interest. Once all multipliers are found, any value for that time period, such as visits per hour, weekly visits, visitor_cbg, or any other field, can be multiplied by the POI's multiplier to convert, or normalize, device counts to a population estimate.

To calculate the multiplier for a point of interest, the program loops through the census block groups listed in the visitor_home_cbg column of the weekly patterns table. For each CBG listed, the program looks up the population of the CBG from the census table and the devices residing from the home panel summary. Each of these two values is then multiplied by the number of devices from that CBG seen at the POI in the patterns table, and running sums are kept of the weighted population and the weighted device counts. After all visitor home CBGs have been taken into account, the sum of the weighted population is divided by the sum of the weighted devices. Because both sums use the same weights, the total weight cancels in the division, and the result is the ratio of the visit-weighted average population to the visit-weighted average device count: the weighted multiplier.

Pseudo Code
multiplier_list = new list
for each POI
        pop_count = 0
        device_count = 0

        ### visitor_home_cbg is a dictionary
        ### with census block groups as the keys
        ### and the number of devices seen from each block group as the values

        for each cbg, number_device_visits in POI['visitor_home_cbg']
            cbg_population = lookup census population by cbg
            cbg_devices = lookup devices residing by cbg in the home panel summary
            pop_count = pop_count + (cbg_population * number_device_visits)
            device_count = device_count + (cbg_devices * number_device_visits)

        ### If visitor_home_cbg is empty, device_count is 0:
        ### use the citywide default multiplier instead of dividing.
        if device_count > 0
            poi_multiplier = pop_count / device_count
        else
            poi_multiplier = default_multiplier
        append poi_multiplier to multiplier_list

        normalized_visits = poi_multiplier * observed_visits
        normalized_visitors = poi_multiplier * observed_visitors
        normalized_visits_per_day = poi_multiplier * each value in visits_per_day_list
        ### etc. for all values you wish to normalize.
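The pseudo code above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the project's actual implementation: the lookup tables are plain dictionaries, the function name, the `visits_per_day` key, and all sample numbers are hypothetical, and CBGs missing from either lookup table are simply skipped.

```python
def poi_multiplier(visitor_home_cbg, cbg_population, cbg_devices, default_multiplier):
    """Visit-weighted population-per-device ratio for a single POI."""
    pop_count = 0
    device_count = 0
    for cbg, number_device_visits in visitor_home_cbg.items():
        # Assumption of this sketch: ignore CBGs absent from either lookup table.
        if cbg not in cbg_population or cbg not in cbg_devices:
            continue
        pop_count += cbg_population[cbg] * number_device_visits
        device_count += cbg_devices[cbg] * number_device_visits
    if device_count == 0:
        # No usable home CBGs: fall back to the citywide multiplier.
        return default_multiplier
    return pop_count / device_count


# Hypothetical lookup tables and POI row.
cbg_population = {"A": 1000, "B": 2000}
cbg_devices = {"A": 100, "B": 100}
poi = {
    "visitor_home_cbg": {"A": 10, "B": 30},
    "raw_visit_counts": 80,
    "raw_visitor_counts": 50,
    "visits_per_day": [10, 20, 50],
}

multiplier = poi_multiplier(
    poi["visitor_home_cbg"], cbg_population, cbg_devices, default_multiplier=12.0
)
normalized_visits = multiplier * poi["raw_visit_counts"]
normalized_visitors = multiplier * poi["raw_visitor_counts"]
# List-valued fields are scaled element by element.
normalized_visits_per_day = [multiplier * v for v in poi["visits_per_day"]]
print(multiplier, normalized_visits, normalized_visitors, normalized_visits_per_day)
```

With these sample numbers the weighted sums are 70,000 people and 4,000 devices, so the multiplier is 17.5. CBG B pulls the result toward its own ratio of 20 people per device because it contributes three times as many visitors as CBG A.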

One might ask, "Why do we not take the count directly from visitor_home_cbg?" The number to normalize, such as raw_visit_counts or raw_visitor_counts, is often larger than the sum of the counts in the visitor_home_cbg field. There are several reasons. If a CBG has too few people visiting, it is omitted for privacy. If the visitor lives outside the US, the CBG does not appear, even for Canada, where the CBG is known but the census data is not available. And, as a later change due to processing limitations, CBGs outside NYC are also null, so only NYC visitors are counted in visitor_home_cbg. The multiplier is therefore the best estimate based on the information available. The data being normalized, by contrast, is aggregated and does not omit device counts (unless there are fewer than four visitors to a POI, in which case the count defaults to 4, or to 0 if there were none). The raw count includes international travelers, and it includes people even if they are the only visitors from their CBG. Multiplying the raw count by the multiplier gives the best results and ensures all devices are included in the normalization.

Sometimes there are no values for visitor_home_cbg, usually because there are too few visitors. In this case, a default multiplier is applied. This multiplier is the sum of the population of all NYC block groups divided by the sum of the devices residing in all NYC block groups, that is, the population per device for the entire city.
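A minimal sketch of that citywide default, assuming the census and home panel summary tables have already been reduced to dictionaries keyed by CBG. The function name and the sample numbers are hypothetical:

```python
def citywide_default_multiplier(cbg_population, cbg_devices):
    """Population per device for the whole city: total population / total devices residing."""
    total_devices = sum(cbg_devices.values())
    if total_devices == 0:
        raise ValueError("home panel summary reports no devices residing")
    return sum(cbg_population.values()) / total_devices


# Hypothetical citywide tables with three block groups.
nyc_population = {"A": 1000, "B": 2000, "C": 600}
nyc_devices = {"A": 100, "B": 100, "C": 40}
default_multiplier = citywide_default_multiplier(nyc_population, nyc_devices)
print(default_multiplier)
```

Here 3,600 people over 240 devices gives 15 people per device, which would be applied to any POI whose visitor home breakdown is empty.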
