This document contains preparation questions for the Airflow Fundamentals Certification, covering key concepts, components, and configurations in Apache Airflow. Each question includes the correct answer(s) marked with a +.
Which of the following are most likely the main requirements for running Airflow for production in a multi-team environment? (Select all that apply)
- Multi-tenancy
- Scalability
- Low cost
- High availability
Which of the following is the most likely way a solo developer will run Airflow for the purposes of experimentation and learning?
- On a local machine or a free virtual machine (VM)
- On a paid cloud virtual machine (VM) service (e.g., EC2)
- On a multi-deployment cloud solution (e.g., EKS)
- On a highly secure on-premises architecture
Which of the following are the most likely ways Airflow will be run in production by a single team? (Select all that apply)
- On a container cloud offering (e.g., EKS)
- On a fully-managed Airflow service (e.g., Astro)
- On cloud virtual machines (e.g., EC2)
- On a single developer’s machine
Which of the following is the most likely way a developer or team is going to run Airflow if they have to adhere to strict security or regulatory guidelines?
- On a cloud service
- On-premises
- On a single developer’s machine
Which of the following are most likely the main requirements for running Airflow for production as a single developer? (Select all that apply)
- High levels of scalability
- Low infrastructure overhead
- Low cost
- Multi-tenancy
What is a task in Airflow?
- A configuration file for how Airflow is set up on a machine
- The entire data pipeline
- The direction the data pipeline executes
- A single unit of work in a DAG
Assume you are building a DAG and want to run a Python script. What type of Operator would you use to accomplish this goal?
- Sensor Operator
- Transfer Operator
- Action Operator
- There are no Operators that allow a DAG to run a Python script
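For context, running Python code in a DAG is done with an action operator such as the PythonOperator, or its TaskFlow equivalent, the @task decorator. Below is a minimal sketch assuming Airflow 2.x-style imports; the DAG id and function names are illustrative.

```python
# Minimal sketch: executing Python logic with an action operator via the @task decorator.
from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def run_python_example():  # illustrative DAG id
    @task
    def run_script():
        # The Python logic executed by the task (wrapped by an action operator under the hood).
        print("Running my Python logic")

    run_script()


run_python_example()
```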
Fill in the blank: In Airflow, an Operator is the specific work a _____ does.
- Answer: Task
Assume you are building a DAG and want to move data from one system or data source to another. What type of Operator would you use to accomplish this goal?
- Transfer Operator
- Sensor Operator
- There are no Operators that allow a DAG to move data from one system or data source to another
- Action Operator
Assume you are building a DAG and want to wait for a file to load before executing the next task. What type of Operator would you use to accomplish this goal?
- Action Operator
- There are no Operators that allow a DAG to wait for something to load
- Sensor Operator
- Transfer Operator
Which of the following are characteristics of an Airflow DAG? (Select all that apply)
- Graph based
- Represents multiple data pipelines
- Composed of Tasks
- Cyclical
- Directed
Which of the following scenarios would be a valid use-case for Airflow? (Select all that apply)
- Periodic analysis on sensor data from manufacturing equipment to forecast potential failures and schedule maintenance automatically +
- As a dedicated system for processing real-time, high-volume data streams for a stock trading application
- A healthcare application that schedules email reminders for patients for their appointments
- An AI chatbot that gives tips and solutions based on customer errors and actions
The benefit of using a modern data orchestration tool like Airflow is that it offers the ability to... (Select all that apply)
- Integrate with hundreds of external systems/software
- Schedule pipelines based on events in addition to time
- Author pipelines as configuration files
- Use a rich feature set with features like built-in observability
- Store, secure, and preserve large quantities of data permanently
Which of the following best describes the role of the scheduler in Airflow?
- To define connections to external systems (e.g., Data Warehouses)
- To store metadata about the state of tasks
- To execute the instructions defined in tasks
- To determine which tasks need to be executed and when
Which Airflow component is in charge of executing the logic of a DAG's tasks?
- The scheduler
- The worker
- The metadata database
- The executor
Which of the following best describes the role of an executor in Airflow?
- To determine how and where tasks are executed
- To visualize the pipeline runs
- To execute the instructions of tasks
- To store tasks ready to be executed
After creating your DAG in a Python file, which action do you need to take in order for Airflow to start the process of detecting and running your DAG?
- Launch the API server
- Restart Airflow and make sure the DAG is visible on the UI
- Wait for the DAG File Processor to pick up the tasks defined in the DAG
- Put the Python file in the dags directory
By default, how long can it take for the Airflow DAG File Processor to detect a new DAG file in the dags directory?
- 5 minutes
- 1 minute
- 30 seconds
- 5 seconds
Do the workers have direct access to the Airflow metadata database?
- No
- Yes
In which part of a DAG file would you specify how often you want to trigger the DAG?
- In the DAG's dependencies
- In the import statements
- In the DAG's tasks
- In the DAG object definition
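For reference, the trigger interval is passed to the DAG object itself. A minimal sketch, assuming Airflow 2.x-style imports; the DAG id and schedule value are illustrative.

```python
# Sketch: the schedule is declared in the DAG object definition (illustrative values).
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from pendulum import datetime

with DAG(
    dag_id="example_schedule",      # illustrative DAG id
    start_date=datetime(2025, 1, 1),
    schedule="@daily",              # how often the DAG is triggered
    catchup=False,
):
    EmptyOperator(task_id="placeholder")
```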
Assume a DAG has 3 tasks: task_extract, task_transform, and task_load. task_transform is downstream of task_extract, and task_load is downstream of task_transform. Which of the following correctly represents these dependencies? (A sketch follows the options.)
- task_extract << task_load << task_transform
- task_extract >> task_transform >> task_load
- task_extract >> task_load >> task_transform
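A minimal sketch of the intended chain, assuming Airflow 2.x-style imports and EmptyOperator stand-ins for the three tasks; the DAG id is illustrative.

```python
# Sketch: task_transform downstream of task_extract, task_load downstream of task_transform.
from airflow import DAG
from airflow.operators.empty import EmptyOperator
from pendulum import datetime

with DAG(dag_id="etl_example", start_date=datetime(2025, 1, 1), schedule=None, catchup=False):
    task_extract = EmptyOperator(task_id="task_extract")
    task_transform = EmptyOperator(task_id="task_transform")
    task_load = EmptyOperator(task_id="task_load")

    # Equivalent to: task_load << task_transform << task_extract
    task_extract >> task_transform >> task_load
```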
Which of the following Airflow components does the metadata database communicate with?
- The queue
- The scheduler
- The API server
- The worker(s)
What Astro CLI command will start running an Airflow project?
- Answer: astro dev start
What is the purpose of the airflow_settings.yaml file in a new Airflow project generated by the Astro CLI?
- For storing Python files corresponding to data pipelines
- For specifying additional Python packages to install, along with their versions, to extend the functionality of the Airflow environment
- For storing configurations such as connections and variables to prevent loss when recreating the local development environment
- For configuring Airflow instances via environment variables
What Astro CLI command will generate a new Airflow project?
- Answer: astro dev init
Which of the following are benefits of using the Astro CLI for local development? (Select all that apply)
- Its ability to dynamically generate new DAGs using AI
- A variety of useful commands for managing an Airflow instance
- Its ability to generate a standard project directory
- A streamlined experience for installing Airflow with limited prerequisites
What is the purpose of the include directory in a new Airflow project generated by the Astro CLI?
- For storing configurations such as connections and variables to prevent loss when recreating the local development environment
- For configuring Airflow instances via environment variables
- For storing files like SQL queries, bash scripts, or Python functions needed in data pipelines to keep them clean and organized
- For the customization of the Airflow instance by adding new operators or modifying the UI
What's the best view to use to visualize your assets?
- Asset View
- DAGs View
- Graph View
What's the best view to check the historical states of the DAG Runs and Task Instances for a given DAG?
- Grid View
- Graph View
- DAGs View
What's the best view to get overall metrics of your Airflow instance?
- The Home View
- Code View
- Asset View
What's the best view to know if code updates have been applied?
- Home View
- Code View
- Graph View
What does the last column on the DAGs View show?
- DAG run states of current and previous DAG Runs for a given DAG
- Task's states of the current or latest DAG Run for a given DAG
What's the role of the start date?
- Define when the DAG starts being scheduled
- Define the trigger interval
- Avoid running past non-triggered DAG Runs
What happens if you don't define a start date?
- Nothing, it's optional
- An error is raised
What's the role of tags?
- They allow better organizing DAGs
- They allow filtering DAGs
- They prevent DAGs that do not belong to the current user from running
How can you avoid assigning the dag object to every task you create?
- with DAG(...)
- dag = ...
- @dag(...)
- You can't. You must assign the dag object to every task
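For reference, a minimal sketch of two common ways to attach tasks to a DAG without passing the dag object explicitly, assuming Airflow 2.x-style imports; DAG ids and task names are illustrative.

```python
# Sketch: avoiding dag=... on every task (illustrative ids).
from airflow import DAG
from airflow.decorators import dag, task
from airflow.operators.empty import EmptyOperator
from pendulum import datetime

# 1. Context manager: tasks created inside the block are attached to the DAG automatically.
with DAG(dag_id="ctx_example", start_date=datetime(2025, 1, 1), schedule=None, catchup=False):
    EmptyOperator(task_id="a") >> EmptyOperator(task_id="b")


# 2. @dag decorator: tasks defined inside the decorated function belong to that DAG.
@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def decorator_example():
    @task
    def a():
        return "done"

    a()


decorator_example()
```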
What happens when two DAGs share the same DAG id?
- The two DAGs appear on the UI
- You get an error
- One DAG randomly shows up on the UI
Is task_a >> task_b >> task_c equivalent to task_c << task_b << task_a?
- Yes
- No
A DAG runs every day at midnight with start_date=datetime(2025, 1, 1). When will the 3rd DAG run be triggered?
- 2025-01-01 00:00
- 2025-01-02 00:00
- 2025-01-03 00:00
- 2025-01-04 00:00
A DAG with a "@daily" schedule is triggered at 00:00 on 2025-02-02. What's the logical date value for this DAG Run?
- 2025-02-01 00:00
- 2025-02-02 00:00
- 2025-02-03 00:00
A DAG with a "@daily" schedule is triggered at 00:00 on 2025-02-02. What's the data_interval_end value for this DAG Run (assuming data intervals are used)?
- 2025-02-01
- 2025-02-02
- 2025-02-03
- 2025-02-04
With catchup=False, what happens when you run your DAG for the first time, and it has a start_date defined as 30 days ago?
- Nothing
- The latest non-triggered DAG Run is triggered
- All non-triggered DAG Runs get triggered
Is logical_date = data_interval_start = data_interval_end by default?
- Yes
- No
Select the 4 factors that define the uniqueness of an XCOM
- key
- value
- timestamp
- dag_id
- task_id
- logical_date
Is it possible to push an XCOM without explicitly specifying a key?
- Yes
- No
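For context, a minimal sketch of pushing XComs with and without an explicit key, assuming the TaskFlow API: when a task simply returns a value, Airflow stores it under the default key return_value. DAG id, task names, and values are illustrative.

```python
# Sketch: XCom push/pull with and without an explicit key (illustrative names).
from airflow.decorators import dag, task
from pendulum import datetime


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def xcom_example():
    @task
    def push_implicit():
        # Returned values are pushed as an XCom with the default key "return_value".
        return 42

    @task
    def push_explicit(ti=None):
        # Explicit push with a custom key via the injected task instance.
        ti.xcom_push(key="my_key", value="hello")

    @task
    def pull(value):
        print(value)

    pull(push_implicit())
    push_explicit()


xcom_example()
```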
An XCOM is pushed into...
- The scheduler
- The worker
- The webserver
- The database
With Postgres, can you share 2 GB of data between 2 tasks with an XCOM?
- Yes
- No
How does the Scheduler know which XCOM to choose for a given DAG Run when multiple XCOMs have the same key, dag_id, and task_id?
- It selects one XCOM randomly
- It selects the XCOM based on the logical date
- You can't have multiple XCOMs with the same key, dag_id, and task_id
This DAG doesn’t show up on the UI. Why?
```python
from airflow.decorators import dag, task
from pendulum import datetime


@dag(
    'test_dag',
    start_date=datetime(2023, 3, 1)
)
def test_dag():
    @task
    def test_task():
        return 'Hello World'

    test_task()
```
- The schedule parameter is missing
- The start_date is in the past
- test_dag() is not called
When you manually trigger this DAG, you don't see any task execution. Why?
```python
from airflow.decorators import dag, task
from pendulum import datetime


@dag(
    'test_dag',
    start_date=datetime(3030, 3, 1)
)
def test_dag():
    @task
    def test_task():
        return 'Hello World'

    test_task()


test_dag()
```
- There is no end_date
- The start_date is in the future
- Because there is only one task and no proper pipeline
How many running DAG runs will you get as soon as you unpause this DAG?
```python
from airflow.decorators import dag, task
from pendulum import datetime


@dag(
    'test_dag',
    start_date=datetime(2023, 1, 1),
    schedule='@daily',
    catchup=False
)
def test_dag():
    @task
    def my_task():
        return 'Hello World'

    my_task()


test_dag()
```
- 0
- 1
- 16
- 32
You just finished writing your DAG and saved the file in the dags folder. How long will it take to appear on the UI?
- By default it may take up to 5 minutes or more.
- It will be added instantly
- It may take up to 30 seconds.
Is it possible to run tasks that are dependent on different versions of Python in the same DAG?
- Yes
- No
You can't find the connection type Amazon Web Services. What should you do?
- Install the apache-airflow-providers-amazon
- Install boto3
- Use the http connection type
If the file never arrives in the S3 bucket, when will the S3KeySensor time out?
- In 24 hours
- In 7 days
- Never
Does the Sensor instantly detect the file when it arrives in the bucket?
- Yes
- No, it depends on the poke_interval
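For reference, a minimal sketch of an S3KeySensor with an explicit poke_interval and timeout, assuming a recent Amazon provider package and an existing aws_default connection; the bucket, key, and interval values are illustrative. (By default, a sensor times out after 7 days.)

```python
# Sketch: S3KeySensor poking every 30 seconds and giving up after 1 hour (illustrative values).
from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from pendulum import datetime

with DAG(dag_id="s3_wait_example", start_date=datetime(2025, 1, 1), schedule=None, catchup=False):
    S3KeySensor(
        task_id="wait_for_file",
        bucket_name="my-bucket",          # illustrative bucket
        bucket_key="data/data_1.csv",     # illustrative key
        aws_conn_id="aws_default",
        poke_interval=30,                 # the file is only detected at the next poke
        timeout=60 * 60,                  # fail after 1 hour if the file never arrives
        mode="reschedule",                # free the worker slot between pokes
    )
```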
Manually trigger the DAG first_dag, wait until 3 tasks are running, then trigger the DAG second_dag. What's the state of the task runme in second_dag?
- running
- queued
- scheduled
How many worker slots are used?
- 3
- 2
- 1
Turn off the schedule of both DAGs and delete the two DAG runs by going to Browse > DAG Runs. Now, open the file first_dag.py, add a new parameter mode='reschedule' in partial(), save the file, and manually trigger first_dag. What is the status of the three waiting_for_files tasks?
- scheduled
- queued
- running
- up_for_reschedule
How many worker slots are in use?
- 1
- 0
- 3
Manually trigger the DAG second_dag. Does the task runme run?
- Yes
- No
Create a new empty file data_1.csv in the folder include. Go back to the Airflow UI and wait a minute. What do you see?
- Nothing
- 3 tasks are still up_for_reschedule
- 1 sensor has been successfully executed
A Sensor can be used for (Choose all that apply):
- waiting for files to appear in an S3 bucket
- waiting for a task in another DAG to complete
- waiting for data to be present in a SQL table
- waiting for a specified date and time
You have a sensor that waits for a file to arrive in an S3 bucket. Your DAG runs every 10 mins, and it takes 8 mins to complete. What is the most appropriate timeout duration for the sensor? (in seconds)
- 60 * 60 * 24 * 7
- 60 * 60
- 60 * 5
What mode doesn't take a worker slot while a Sensor waits?
- poke
- reschedule
- none
What Sensor(s) can be used to apply logic conditions? (Choose all that apply)
- PythonSensor
- @task.sensor
- S3KeySensor
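For context, a minimal sketch of a sensor that applies a custom logic condition with the @task.sensor decorator, assuming Airflow 2.5+; the file path and interval values are illustrative.

```python
# Sketch: custom condition sensor with the @task.sensor decorator (illustrative path).
import os

from airflow.decorators import dag, task
from airflow.sensors.base import PokeReturnValue
from pendulum import datetime


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def custom_sensor_example():
    @task.sensor(poke_interval=30, timeout=60 * 5, mode="reschedule")
    def wait_for_file() -> PokeReturnValue:
        # Any Python logic can decide whether the condition is met.
        return PokeReturnValue(is_done=os.path.exists("/tmp/data_1.csv"))

    wait_for_file()


custom_sensor_example()
```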
What parameter can be useful to check for data to be present in a database without putting too much workload on each poke?
- timeout
- exponential_backoff
- mode
What is the purpose of airflow db init and when might you use it?
- Creates a new user in Airflow
- Generates a report of all DAGs currently in your Airflow environment
- Starts the Airflow web server
- Initializes the Airflow metadata database, typically used when setting up Airflow for the first time
What is the use case for airflow config get-value and how might you use it to troubleshoot issues?
- Lists all of the DAG files that have failed to import
- Retrieves the value of a specific configuration option, useful for verifying that configuration settings are correctly set
- Verifies the validity of the Airflow metadata database schema
- Reserializes a DAG file to ensure that it can be properly loaded by Airflow
You've recently upgraded your Airflow installation, but now your DAGs are not showing up in the UI. Which commands could you use to identify any import/parsing errors that might be preventing the DAG from being loaded?
- airflow dags list-import-errors
- airflow dags show
- airflow dags report
- airflow dags backfill
Why would you use Airflow's backfill functionality?
- To test DAGs before deploying them to a production environment
- To reschedule a DAG to run at a different frequency
- To fill in missing historical data that was not previously captured
- To remove all past DAG runs and start fresh with a new schedule
What does the airflow db check command do?
- Upgrades the Airflow database schema to the latest version
- Checks the connection to the Airflow database
- Removes all data from the Airflow database
- Initializes the Airflow database
Can you create two connections with identical IDs?
- Yes
- No
In Airflow, what is a connection?
- A connection is a way to define the order in which tasks in a DAG should be executed
- A connection is a type of sensor that waits for a specific condition to be met before proceeding with a task
- A connection is a way to store and reuse credentials or configuration settings for external systems that are used in Airflow tasks
What is the purpose of defining a connection in Airflow?
- The purpose of defining a connection in Airflow is to avoid hard-coding sensitive information like passwords or API keys into Airflow DAGs, and to allow for easy management and reuse of connection information across multiple DAGs
- The purpose of defining a connection in Airflow is to store output data from tasks for later analysis
- The purpose of defining a connection in Airflow is to specify the order in which tasks should be executed within a DAG
What are the different types of connections available in Airflow?
- Airflow only supports one type of connection, which is the SSH connection type
- Airflow comes with several built-in connection types, including HTTP, MySQL, Postgres, SSH, and more. It also allows you to define custom connection types if needed
- Airflow supports a very limited set of connection types, including MongoDB, Cassandra, and Redis
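As an illustration, a minimal sketch of referencing a stored connection by its id instead of hard-coding credentials, assuming the Postgres provider is installed and a connection named my_postgres has been created; the connection id and table name are illustrative.

```python
# Sketch: using a connection id inside a task (illustrative conn id and query).
from airflow.decorators import dag, task
from airflow.providers.postgres.hooks.postgres import PostgresHook
from pendulum import datetime


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def connection_example():
    @task
    def count_rows():
        # Credentials live in the "my_postgres" connection, not in the DAG code.
        hook = PostgresHook(postgres_conn_id="my_postgres")
        return hook.get_first("SELECT COUNT(*) FROM my_table")[0]

    count_rows()


connection_example()
```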
You have 3 DAGs that fetch data from the same API. Does it make sense to store this API endpoint in a variable?
- Yes
- No
A variable can be stored in...
- The metadata database
- A secret backend
- An environment variable
- The metaverse
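For reference, a minimal sketch of supplying a variable through an environment variable (using the AIRFLOW_VAR_ prefix convention) and reading it in a task; the variable name and value are illustrative.

```python
# Sketch: a variable defined via an environment variable and read with Variable.get.
# The env var would be set outside Airflow, e.g.:
#   AIRFLOW_VAR_API_ENDPOINT=https://example.com/api   (illustrative name and value)
from airflow.decorators import dag, task
from airflow.models import Variable
from pendulum import datetime


@dag(start_date=datetime(2025, 1, 1), schedule=None, catchup=False)
def variable_example():
    @task
    def fetch():
        # Resolved from the AIRFLOW_VAR_API_ENDPOINT environment variable,
        # without storing the value in the metadata database.
        endpoint = Variable.get("api_endpoint")
        print(endpoint)

    fetch()


variable_example()
```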
You create a variable with an environment variable. Does Airflow generate a connection to fetch the value of this variable?
- Yes
- No