This Python script fetches data from Google Analytics 4 (GA4) and saves it to Google BigQuery. The data is sorted, filtered, stored in an `output.csv` file, and also sent to a specific BigQuery table.
- Fetches active users and event data based on a specified date range.
- Provides options to fetch data from yesterday or from a wider initial date range.
- Checks and prevents data duplication in BigQuery.
- Supports dynamic table creation in BigQuery based on data date.
- Uses `run_report_with_pagination` to handle GA4's 10,000-row limit per API request (see the sketch after this list).
- Outputs fetched data to `output.csv`.
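As a rough illustration of the pagination feature, here is a minimal sketch of how `run_report_with_pagination` can page through results with the `google-analytics-data` client. The dimension and metric names are assumptions based on the schema described later, not the script's exact code:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

PAGE_SIZE = 10_000  # GA4 returns at most 10,000 rows per runReport request

def run_report_with_pagination(client: BetaAnalyticsDataClient,
                               property_id: str,
                               start_date: str, end_date: str):
    """Fetch every row in the date range, one 10,000-row page at a time."""
    all_rows = []
    offset = 0
    while True:
        request = RunReportRequest(
            property=f"properties/{property_id}",
            dimensions=[Dimension(name="eventName"),
                        Dimension(name="date"),
                        Dimension(name="sessionDefaultChannelGroup")],
            metrics=[Metric(name="eventCount")],
            date_ranges=[DateRange(start_date=start_date, end_date=end_date)],
            limit=PAGE_SIZE,
            offset=offset,
        )
        response = client.run_report(request)
        all_rows.extend(response.rows)
        offset += PAGE_SIZE
        if offset >= response.row_count:  # every page has been consumed
            break
    return all_rows
```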
- Google Analytics Data API V1 Beta
- Google OAuth 2.0
- Google BigQuery Client
- Google Cloud exceptions
- argparse
- datetime
- sys
- json
- os
- pickle
Before running the script, ensure you have a `config.json` file in the same directory with the following structure:
```json
{
    "CLIENT_SECRET_FILE": "<Path to your client secret file>",
    "SERVICE_ACCOUNT_FILE": "<Path to your service account JSON file>",
    "SCOPES": ["https://www.googleapis.com/auth/analytics.readonly"],
    "TABLE_PREFIX": "<Prefix for the BigQuery tables>",
    "PROPERTY_ID": "<Google Analytics Property ID>",
    "DATASET_ID": "<BigQuery Dataset ID>",
    "INITIAL_FETCH_FROM_DATE": "2022-01-01"
}
```
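For illustration, a minimal way a script like this might load those settings (the key names match the structure above; the loader itself is a sketch, not the script's exact code):

```python
import json

# Read settings from config.json in the script's directory.
with open("config.json") as f:
    config = json.load(f)

PROPERTY_ID = config["PROPERTY_ID"]
DATASET_ID = config["DATASET_ID"]
TABLE_PREFIX = config["TABLE_PREFIX"]
SCOPES = config["SCOPES"]
```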
Run the script with one of the following flags:

- `--yesterday`: To fetch data from yesterday only.
- `--initial_fetch`: To fetch data from the initial date specified in the configuration up to today.
For example:

```
python ga4script.py --yesterday
```
Note: Using the `--initial_fetch` flag might result in duplicated records in BigQuery, so the script will ask for confirmation before proceeding.
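A hedged sketch of how the two flags and the resulting date range can be wired up with `argparse` (this mirrors the documented behaviour, not necessarily the script's exact parser; `config` is the loaded configuration from above):

```python
import argparse
from datetime import date, timedelta

# Two mutually exclusive modes, matching the documented flags.
parser = argparse.ArgumentParser(description="Fetch GA4 data into BigQuery.")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--yesterday", action="store_true",
                   help="Fetch data from yesterday only.")
group.add_argument("--initial_fetch", action="store_true",
                   help="Fetch data from INITIAL_FETCH_FROM_DATE up to today.")
args = parser.parse_args()

if args.yesterday:
    start = end = date.today() - timedelta(days=1)
else:
    start = date.fromisoformat(config["INITIAL_FETCH_FROM_DATE"])
    end = date.today()
```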
The script will output the fetched data to `output.csv`. The data is also saved to BigQuery, with tables dynamically created based on the data's month and year.
- Ensure you have set up the correct permissions for your Google service account to read data from Google Analytics and to write to BigQuery.
- Keep your `config.json`, client secret, and service account JSON files confidential.
- Handle the `token.pickle` file (generated after authentication) with care, as it contains your authentication token.
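For context, `token.pickle` typically comes from the standard Google OAuth caching pattern sketched below. This is an assumption based on the dependencies listed above, not necessarily the script's exact code:

```python
import os
import pickle

from google.auth.transport.requests import Request
from google_auth_oauthlib.flow import InstalledAppFlow

def get_credentials(client_secret_file: str, scopes: list):
    """Load cached credentials from token.pickle, refreshing or re-authing as needed."""
    creds = None
    if os.path.exists("token.pickle"):
        with open("token.pickle", "rb") as f:
            creds = pickle.load(f)
    if not creds or not creds.valid:
        if creds and creds.expired and creds.refresh_token:
            creds.refresh(Request())  # silently refresh an expired token
        else:
            flow = InstalledAppFlow.from_client_secrets_file(client_secret_file, scopes)
            creds = flow.run_local_server(port=8080)  # opens a browser consent screen
        with open("token.pickle", "wb") as f:
            pickle.dump(creds, f)  # cache the credentials for the next run
    return creds
```

This pattern is also why the troubleshooting advice below suggests deleting `token.pickle` when a token has expired or been revoked.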
The data fetched by the script is stored in Google BigQuery, a fully-managed and serverless data warehouse. BigQuery allows super-fast SQL queries using the processing power of Google's infrastructure.
The script references a specific dataset, represented by the `DATASET_ID` in the configuration. Within this dataset, tables are dynamically created based on the data's month and year.
The tables created have a name based on a prefix and the month and year of the data. This naming convention is structured as:

`<TABLE_PREFIX><YEAR><MONTH>01`

- `<TABLE_PREFIX>`: A constant prefix set in the configuration (`config.json`). It aids in distinguishing tables related to this script from others in the same dataset.
- `<YEAR>` and `<MONTH>`: Derived from the date of the fetched data, ensuring data is grouped by month in separate tables.

For instance, if your `TABLE_PREFIX` is `GAData_` and the data is from September 2023, the table name would be `GAData_20230901`.
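In code, the derivation can be as simple as the following sketch (`table_name_for` is an illustrative helper, not a function from the script):

```python
from datetime import date

def table_name_for(table_prefix: str, event_date: date) -> str:
    # e.g. "GAData_" + "202309" + "01" -> "GAData_20230901"
    return f"{table_prefix}{event_date.strftime('%Y%m')}01"

print(table_name_for("GAData_", date(2023, 9, 14)))  # GAData_20230901
```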
Each table has a specific schema, with the following fields:

- `Event_Name`: STRING
- `Event_Date`: INTEGER
- `Event_Count`: INTEGER
- `Is_Conversion`: BOOLEAN
- `Channel`: STRING
When the script processes the fetched data, it checks whether a table for the respective month already exists. If not, it creates a new table with the aforementioned schema. As the script parses through the data, it structures each record to match this schema before insertion.
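A minimal sketch of that create-if-missing step with the `google-cloud-bigquery` client (`ensure_table` is an illustrative name; the schema matches the fields listed above):

```python
from google.cloud import bigquery
from google.cloud.exceptions import NotFound

SCHEMA = [
    bigquery.SchemaField("Event_Name", "STRING"),
    bigquery.SchemaField("Event_Date", "INTEGER"),
    bigquery.SchemaField("Event_Count", "INTEGER"),
    bigquery.SchemaField("Is_Conversion", "BOOLEAN"),
    bigquery.SchemaField("Channel", "STRING"),
]

def ensure_table(client: bigquery.Client, table_id: str) -> bigquery.Table:
    """Return the monthly table, creating it with the expected schema if missing."""
    try:
        return client.get_table(table_id)  # raises NotFound when the table is absent
    except NotFound:
        return client.create_table(bigquery.Table(table_id, schema=SCHEMA))
```

Here `table_id` is the fully qualified `project.dataset.table` string built from the naming convention above.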
The script avoids data duplication: before inserting a new record, it checks whether an identical record already exists in the table, and if one is found, it skips the insertion for that particular data point.
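One way such a check can be expressed (a sketch using a parameterized query; `record_exists` is an illustrative helper, and the matched fields are an assumption based on the schema above):

```python
from google.cloud import bigquery

def record_exists(client: bigquery.Client, table_id: str, record: dict) -> bool:
    """Return True if an identical record is already stored in the table."""
    query = f"""
        SELECT COUNT(*) AS n
        FROM `{table_id}`
        WHERE Event_Name = @name AND Event_Date = @event_date
          AND Event_Count = @event_count AND Channel = @channel
    """
    job = client.query(query, job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("name", "STRING", record["Event_Name"]),
            bigquery.ScalarQueryParameter("event_date", "INT64", record["Event_Date"]),
            bigquery.ScalarQueryParameter("event_count", "INT64", record["Event_Count"]),
            bigquery.ScalarQueryParameter("channel", "STRING", record["Channel"]),
        ],
    ))
    return next(iter(job.result())).n > 0
```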
503 Getting metadata from plugin failed with error: ('invalid_grant: Token has been expired or revoked.', {'error': 'invalid_grant', 'error_description': 'Token has been expired or revoked.'})
Solution: Delete the `token.pickle` file from your directory and run the script again to refresh it.
Access blocked: This app’s request is invalid, Error 400: redirect_uri_mismatch
Solution: Click on "error details" when you see this message. You'll be provided with a URL, typically in the format `http://localhost:8080/`. Copy this URL, go to your Google Cloud Console, navigate to `APIs and services > Credentials > OAuth 2.0 Client IDs`, and add this URL under both "Authorised redirect URIs" and "Authorised JavaScript origins".
Error 400: redirect_uri_mismatch. You can't sign in to this app because it doesn't comply with Google's OAuth 2.0 policy.
Solution: When you see this error in your console, copy the URL provided and paste it directly into your browser's address bar.
When you run the script with the `--yesterday` flag, you might see an output similar to the following:
```
λ python ga4script.py --yesterday
Starting fetching data from 2023-09-14 to 2023-09-14.
Authentication successful!
..record already exists in BigQuery (1)
..record already exists in BigQuery (1)
..record already exists in BigQuery (1)
Data saved to output.csv!
```
This output reflects the script's operations. The "..record already exists in BigQuery" messages indicate that the data for those records is already present in your BigQuery table, so no duplicate entries are added. At the end of the run, the fetched data is saved to `output.csv`.
When you run the script with the `--initial_fetch` flag, you might see an output similar to the following:
```
λ python ga4script.py --initial_fetch
Using the initial_fetch might result in duplicated records. Do you want to proceed? (yes/no): yes
Starting fetching data from 2022-01-01 to 2023-09-15.
Authentication successful!
Data saved to output.csv!
Table test_backfill_v4_20221201 created.
Data saved to BigQuery for 12/2022!
Table test_backfill_v4_20230101 created.
Data saved to BigQuery for 01/2023!
Table test_backfill_v4_20230201 created.
Data saved to BigQuery for 02/2023!
Table test_backfill_v4_20230301 created.
Data saved to BigQuery for 03/2023!
Table test_backfill_v4_20230401 created.
Data saved to BigQuery for 04/2023!
Table test_backfill_v4_20230501 created.
Data saved to BigQuery for 05/2023!
Table test_backfill_v4_20230601 created.
Data saved to BigQuery for 06/2023!
Table test_backfill_v4_20230701 created.
Data saved to BigQuery for 07/2023!
Table test_backfill_v4_20230801 created.
Data saved to BigQuery for 08/2023!
Data saved to BigQuery for 09/2023!
```
- Introduced the `run_report_with_pagination` function to address the Google Analytics 4 API constraint of a maximum of 10,000 rows per request. This enhancement ensures complete data retrieval by systematically iterating through paginated result sets.
- Added an `event_type = "Traffic"` column for enriched data categorization.
- Replaced the `activeUsers` metric with `sessionDefaultChannelGroup` for more accurate session channel grouping.
- Sorted events by date to facilitate easier data analysis and reporting.
- Initial release
After the script executes, the data will appear in BigQuery.
Read more on our site: https://www.createit.com/blog/ga4-data-extraction-to-bigquery/