Skip to content

Deal with changing format

Tammy Yang edited this page Feb 19, 2020 · 4 revisions

Why it has to be a bit complicated?

Since Facebook may change the json structure in the downloaded archive, we developed a series of methods to deal with the problem. The following is a simple instruction to get a table with columns you want.
The core idea is, "Put everything into one table first, then get the columns you want from the table." In order to get the data you need from changing Facebook data format, the following three steps are required:

  1. Create tmp tables (df_list) from json files
  2. Merge sub-DataFrames under the same top_df to generate wanted_df
  3. Select only the columns we need (wanted_columns) to generate the final DataFrame

Assuming that you want to get the "creation_timestamp", "description", "title", "uri", and "upload_ip" of photos of posts from your Facebook data and put them in one flat table, here you go:

from fbjson2table.func_lib import parse_fb_json                              
from fbjson2table.table_class import TempDFs                                 
                                                                             
                                                                             
json_content = parse_fb_json($PATH_OF_JSON)                                  
temp_dfs = TempDFs(json_content)                                             
one_to_one_df, _ = temp_dfs.temp_to_wanted_df(                               
                   route_by_table_name='media',                              
                   strat_by='uri'                                            
                   )                                                         
print(one_to_one_df.columns)                                                 

The output will be:

Index(['creation_timestamp', 'description', 'id_0', 'id_attachments_1',      
    'id_data_2', 'id_media_3', 'title', 'uri', 'media_metadata_id_0',        
    'media_metadata_id_attachments_1', 'media_metadata_id_data_2',           
    'media_metadata_id_media_metadata_4', 'photo_metadata_id_0',             
    'photo_metadata_id_attachments_1', 'photo_metadata_id_data_2',           
    'photo_metadata_id_media_metadata_4',                                    
    'photo_metadata_id_photo_metadata_5', 'photo_metadata_upload_ip'],       
   dtype='object')                                                           

Here, the wanted_columns list should look like:

wanted_columns = ['creation_timestamp', 'description', 'title', 'uri',       
                  'photo_metadata_upload_ip']                                
photo_df, _ = temp_dfs.temp_to_wanted_df(                                    

Finally, photo_df is all you need.

Step-by-step explanation

Let's take a look of: link

The first thing we need to do is get_routed_dfs.

=> Since "df" in "df_list" map may diverge to different directions, we have to get the DataFrames that in the same path (route) toward "top_df".

Second, do get_start_peeling.

=> Find the index of top_df in the df_list. This index will be used for the next step.

Then, do merge_one_to_one_sub_df.

=> Merge all one-to-one sub_dfs under the same top_df.

Finally, do get_wanted_columns to create the final DataFrame

=> Extract the columns we need and return NaN if the columns are missing.

Closer look of temp_to_wanted_df

In order to run temp_to_wanted_df function, you need to specify four parameters:

  1. wanted_columns
  2. route_by_table_name
  3. start_by
  4. regex

The wanted_columns is a list of column names we want. Use the empty list [] to return all columns.

The route_by_table_name is the last table with the suffix of top_df which will be used when we want to choose one DataFrame as top_df where that df is in a diverging branch of df_list.

Take example_df_list for example,

If we plot the relationship of df_list:

Image of relation of df_list

We can find out that there are two paths (routes) in the picture. If we want to merge temp__attachments__data__media and its sub_dfs, we have to extract them from df_list. Otherwise, temp__data may cause problems when we do the join operation.

If we set route_by_table_name as "media", we will get the path (route) like:

Image of route media

The start_by is the unique and preserved column name of top_df in the routed_df_list.

When we do join operation, we have to choose one df in df_list as our top_df. Take the example_df_list, if we want to set "temp__attachments__data__media" as top_df, we have to find the unique column name of it such as "uri".

Please note, it is NOT allowed to use title as a column name since it is preserved for the tmp tables.

Finally, use regex to specify whether you want to use "regex" when finding start_by.

Clone this wiki locally