Deal with changing format
Since Facebook may change the JSON structure in the downloaded archive, we developed a series of methods to deal with the problem. The following are simple instructions for getting a table with the columns you want.
The core idea is, "Put everything into one table first, then get the columns you want from the table." In order to get the data you need from the changing Facebook data format, the following three steps are required:
- Create tmp tables (`df_list`) from JSON files
- Merge sub-DataFrames under the same `top_df` to generate `wanted_df`
- Select only the columns we need (`wanted_columns`) to generate the final DataFrame
Assuming that you want to get the "creation_timestamp", "description", "title", "uri", and "upload_ip" of the photos in posts from your Facebook data and put them into one flat table, here you go:
```python
from fbjson2table.func_lib import parse_fb_json
from fbjson2table.table_class import TempDFs

json_content = parse_fb_json($PATH_OF_JSON)
temp_dfs = TempDFs(json_content)
one_to_one_df, _ = temp_dfs.temp_to_wanted_df(
    route_by_table_name='media',
    start_by='uri'
)
print(one_to_one_df.columns)
```
The output will be:
```
Index(['creation_timestamp', 'description', 'id_0', 'id_attachments_1',
       'id_data_2', 'id_media_3', 'title', 'uri', 'media_metadata_id_0',
       'media_metadata_id_attachments_1', 'media_metadata_id_data_2',
       'media_metadata_id_media_metadata_4', 'photo_metadata_id_0',
       'photo_metadata_id_attachments_1', 'photo_metadata_id_data_2',
       'photo_metadata_id_media_metadata_4',
       'photo_metadata_id_photo_metadata_5', 'photo_metadata_upload_ip'],
      dtype='object')
```
Here, the `wanted_columns` list should look like:

```python
wanted_columns = ['creation_timestamp', 'description', 'title', 'uri',
                  'photo_metadata_upload_ip']
```
```python
photo_df, _ = temp_dfs.temp_to_wanted_df(
    wanted_columns=wanted_columns,
    route_by_table_name='media',
    start_by='uri'
)
```

Finally, `photo_df` is all you need.
Let's take a look at what happens inside `temp_to_wanted_df`:
The first thing we need to do is `get_routed_dfs`.
=> Since the DataFrames in `df_list` may diverge in different directions, we have to get the DataFrames that lie on the same path (route) toward `top_df`.
Second, do `get_start_peeling`.
=> Find the index of `top_df` in `df_list`; this index will be used in the next step.
Then, do `merge_one_to_one_sub_df`.
=> Merge all one-to-one sub_dfs under the same `top_df`.
Finally, do `get_wanted_columns` to create the final DataFrame.
=> Extract the columns we need, returning `NaN` for columns that are missing.
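The last two steps can be sketched in plain pandas. Everything here is illustrative: the DataFrames, column names, and the `merge`/`reindex` calls are stand-ins for what fbjson2table does internally, not its actual implementation.

```python
import pandas as pd

# Hypothetical top_df and a one-to-one sub_df sharing the id column "id_0".
top_df = pd.DataFrame({
    "id_0": [0, 1],
    "uri": ["photos/a.jpg", "photos/b.jpg"],
    "title": ["A", "B"],
})
sub_df = pd.DataFrame({
    "id_0": [0, 1],
    "upload_ip": ["1.2.3.4", "5.6.7.8"],
})

# Merge the one-to-one sub_df under the same top_df.
merged = top_df.merge(sub_df, on="id_0", how="left")

# Keep only the wanted columns; a missing column ("description" here)
# comes back filled with NaN.
wanted = ["uri", "upload_ip", "description"]
final_df = merged.reindex(columns=wanted)
print(final_df)
```

Note that `reindex(columns=...)` is what gives the "return NaN if the columns are missing" behaviour: absent columns are created and filled with `NaN` instead of raising an error.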
In order to run the `temp_to_wanted_df` function, you need to specify four parameters:
- `wanted_columns`
- `route_by_table_name`
- `start_by`
- `regex`
The `wanted_columns` is a list of the column names we want. Use the empty list `[]` to return all columns.
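As a sketch of that behaviour (the `select_columns` helper is hypothetical, not part of fbjson2table; it only mirrors the documented rule that an empty list means "all columns"):

```python
import pandas as pd

df = pd.DataFrame({"uri": ["a.jpg"], "title": ["A"]})

def select_columns(df, wanted_columns):
    # Empty list: return every column unchanged.
    if not wanted_columns:
        return df
    # Otherwise keep only the requested columns (NaN for missing ones).
    return df.reindex(columns=wanted_columns)

print(select_columns(df, []).columns.tolist())       # ['uri', 'title']
print(select_columns(df, ["uri"]).columns.tolist())  # ['uri']
```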
The `route_by_table_name` is the suffix of the table name of `top_df`. It is used when we want to choose, as `top_df`, a DataFrame that sits on a diverging branch of `df_list`.
Take `example_df_list` for example. If we plot the relationships in `df_list`:
We can see that there are two paths (routes) in the picture. If we want to merge `temp__attachments__data__media` and its sub_dfs, we have to extract them from `df_list`. Otherwise, `temp__data` may cause problems when we do the `join` operation.
If we set `route_by_table_name` to `"media"`, we will get the path (route) like:
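That route selection can be sketched as follows. The table names mirror the example above, and `route_by_suffix` is a hypothetical helper, not the library's actual implementation:

```python
# Hypothetical tmp-table names, modelled on the example_df_list above.
table_names = [
    "temp",
    "temp__attachments",
    "temp__attachments__data",
    "temp__attachments__data__media",
    "temp__data",
]

def route_by_suffix(names, suffix):
    # Pick the table whose name ends with the given suffix as top_df,
    # then keep only its prefixes (ancestors) and its sub-tables.
    top = next(n for n in names if n.endswith("__" + suffix))
    return [n for n in names if top.startswith(n) or n.startswith(top)]

print(route_by_suffix(table_names, "media"))
```

The diverging branch `temp__data` is dropped, so it can no longer interfere with the later `join`.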
The `start_by` is a column name that is unique to `top_df` within the routed `df_list`.
When we do the `join` operation, we have to choose one df in `df_list` as our `top_df`. Taking the `example_df_list`, if we want to set `temp__attachments__data__media` as `top_df`, we have to find a column name unique to it, such as `"uri"`.
Please note, it is NOT allowed to use `title` as `start_by`, since that name is reserved for the tmp tables.
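A sketch of how a unique column identifies `top_df` (the contents of `df_list` and the `find_top_df_index` helper here are hypothetical):

```python
import pandas as pd

# Hypothetical routed df_list; the real one comes from fbjson2table.
df_list = [
    pd.DataFrame({"id_0": [0], "timestamp": [1577836800]}),
    pd.DataFrame({"id_0": [0], "id_media_3": [0], "uri": ["photos/a.jpg"]}),
]

def find_top_df_index(dfs, start_by):
    # Return the index of the first DataFrame containing the start_by column.
    for i, df in enumerate(dfs):
        if start_by in df.columns:
            return i
    raise ValueError(f"no DataFrame has column {start_by!r}")

print(find_top_df_index(df_list, "uri"))  # the DataFrame holding "uri"
```

Because `"uri"` appears in only one DataFrame, it pins down `top_df` unambiguously; a column like `"id_0"` that appears everywhere would not.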
Finally, use `regex` to specify whether regular-expression matching should be used when finding `start_by`.
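A sketch of the difference `regex` makes (the `match_start_by` helper is hypothetical; the column names are taken from the output shown earlier):

```python
import re

columns = ["uri", "media_metadata_id_0", "photo_metadata_upload_ip"]

def match_start_by(columns, start_by, regex=False):
    # With regex=True, start_by is treated as a regular expression;
    # otherwise an exact column-name match is required.
    if regex:
        return [c for c in columns if re.search(start_by, c)]
    return [c for c in columns if c == start_by]

print(match_start_by(columns, "uri"))                      # ['uri']
print(match_start_by(columns, r"upload_ip$", regex=True))  # ['photo_metadata_upload_ip']
```

This is handy when column names carry generated suffixes, so an exact name is hard to know in advance.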