
First steps

Once you have followed the steps in the getting_started/installation.adoc section to install the Operator and its prerequisites, deploy an Airflow cluster and the services it depends on. Afterwards you can verify that it works by running and tracking an example DAG.

Setup

With the external dependencies required by Airflow (PostgreSQL and Redis) installed, you can now install the Airflow Stacklet itself.

Supported versions for PostgreSQL and Redis can be found in the Airflow documentation.

Airflow secrets

Secrets are required for the mandatory metadata database connection and the Airflow admin user. When using the Celery executor, you must also provide connection information for the Celery result backend database and broker. Create a file called airflow-credentials.yaml:

link:example$getting_started/code/airflow-credentials.yaml[role=include]
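As a rough sketch of what this file contains, the manifest below creates the three Secrets described in this section. All key names and values here are illustrative assumptions; the included airflow-credentials.yaml is authoritative.

```shell
# Hypothetical sketch only -- key names and values are assumptions;
# the linked airflow-credentials.yaml is authoritative.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: airflow-postgresql-credentials
stringData:
  username: airflow            # metadata DB / Celery result backend user
  password: airflow            # example value only
---
apiVersion: v1
kind: Secret
metadata:
  name: airflow-redis-credentials
stringData:
  password: redis              # example value only
---
apiVersion: v1
kind: Secret
metadata:
  name: airflow-admin-user-credentials
stringData:
  adminUser.username: airflow
  adminUser.firstname: Airflow
  adminUser.lastname: Admin
  adminUser.email: airflow@example.com
  adminUser.password: airflow  # example value only
EOF
```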

And apply it:

link:example$getting_started/code/getting_started.sh[role=include]

airflow-postgresql-credentials contains credentials for the SQL database storing the Airflow metadata. In this example we will use the same database for both the Airflow job metadata as well as the Celery broker metadata.

airflow-redis-credentials contains credentials for the Redis instance used for queuing the jobs submitted to the Airflow executor(s).

airflow-admin-user-credentials contains the adminUser.* fields used to create the initial admin user.

Note
The admin user is disabled if you use a non-default authentication mechanism like LDAP.

Airflow

An Airflow cluster is made up of several components, two of which are optional:

  • webserver: this provides the main UI for user interaction

  • executors: the CeleryExecutor or KubernetesExecutor nodes over which the job workload is distributed by the scheduler

  • scheduler: responsible for triggering jobs and persisting their metadata to the backend database

  • dagProcessors: (Optional) responsible for monitoring, parsing and preparing DAGs for processing. If this role is not specified then the process will be started as a scheduler subprocess (Airflow 2.x), or as a standalone process in the same container as the scheduler (Airflow 3.x+)

  • triggerers: (Optional) DAGs making use of deferrable operators can be used together with one or more triggerer processes to free up worker slots. This deferral process is also useful for providing a measure of high availability

Create a file named airflow.yaml with the following contents:

link:example$getting_started/code/airflow.yaml[role=include]

And apply it:

link:example$getting_started/code/getting_started.sh[role=include]

Where:

  • metadata.name: contains the name of the Airflow cluster.

  • spec.clusterConfig.metadataDatabase: specifies one of the supported database types (in this case, postgresql) along with references to the host, database and the secret containing the connection credentials.

  • spec.image.productVersion: the product version of the Docker image provided by Stackable.

  • spec.celeryExecutors: deploy executors managed by Airflow’s Celery engine. Alternatively you can use kubernetesExecutors, which use Airflow’s Kubernetes engine for executor management. For more information see https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html#executor-types.

  • spec.celeryExecutors.celeryResultBackend: specifies one of the supported database types (in this case, postgresql) along with references to the host, database and the secret containing the connection credentials.

  • spec.celeryExecutors.celeryBrokerUrl: specifies one of the supported queue/broker types (in this case, redis) along with references to the host and the secret containing the connection credentials.

  • spec.clusterConfig.loadExamples: this key is optional and defaults to false. It is set to true here as the example DAGs are used when verifying the installation.

  • spec.clusterConfig.exposeConfig: this key is optional and defaults to false. It is set to true only as an aid to verify the configuration and should never be used as such in anything other than test or demo clusters.

  • spec.clusterConfig.credentialsSecret: specifies the secret containing the Airflow admin user information.
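Putting the fields described above together, a minimal cluster definition might look something like the sketch below. The apiVersion, role names, and sub-field names are assumptions based on the bullet list; the included airflow.yaml is authoritative.

```shell
# Hypothetical sketch assembled from the fields described above; the
# linked airflow.yaml is authoritative for exact field names and values.
kubectl apply -f - <<'EOF'
apiVersion: airflow.stackable.tech/v1alpha1
kind: AirflowCluster
metadata:
  name: airflow                        # name of the Airflow cluster
spec:
  image:
    productVersion: 2.9.3              # desired Apache Airflow version (example)
  clusterConfig:
    loadExamples: true                 # load the bundled example DAGs
    exposeConfig: true                 # test/demo clusters only
    credentialsSecret: airflow-admin-user-credentials
    metadataDatabase:                  # sub-field names below are assumptions
      dbType: postgresql
      host: airflow-postgresql
      database: airflow
      credentialsSecret: airflow-postgresql-credentials
  webservers:
    roleGroups:
      default:
        replicas: 1
  schedulers:
    roleGroups:
      default:
        replicas: 1
  celeryExecutors:
    celeryResultBackend:               # sub-field names are assumptions
      dbType: postgresql
      host: airflow-postgresql
      database: airflow
      credentialsSecret: airflow-postgresql-credentials
    celeryBrokerUrl:                   # sub-field names are assumptions
      brokerType: redis
      host: airflow-redis-master
      credentialsSecret: airflow-redis-credentials
    roleGroups:
      default:
        replicas: 1
EOF
```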

Note
The version you specify in spec.image.productVersion is the desired version of Apache Airflow. You can optionally pin spec.image.stackableVersion to a specific release such as 23.11.0, but it is recommended to leave it out and use the default provided by the operator. Check our image registry for a list of available versions; information on how to browse the registry can be found here. It is generally safe to use the latest available version.
Note
Refer to usage-guide/db-connect.adoc for more information about database/broker connections.

This creates the actual Airflow cluster.

After a while, all the Pods in the StatefulSets should be ready:

kubectl get statefulset

The output should show all pods ready, including the external dependencies:

NAME                            READY   AGE
airflow-postgresql              1/1     16m
airflow-redis-master            1/1     16m
airflow-redis-replicas          1/1     16m
airflow-scheduler-default       1/1     11m
airflow-webserver-default       1/1     11m
airflow-celery-executor-default 1/1     11m
airflow-dagprocessor-default    1/1     11m
airflow-triggerer-default       1/1     11m

When the Airflow cluster has been created and the database is initialized, Airflow can be opened in the browser: the webserver UI port (8080 by default) can be forwarded to the local host:

link:example$getting_started/code/getting_started.sh[role=include]
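The underlying command in the script is along these lines; the Service name is an assumption, so check kubectl get svc for the actual name in your cluster:

```shell
# Forward the webserver UI (default port 8080) to localhost:8080 in the
# background. The Service name airflow-webserver is an assumption.
kubectl port-forward svc/airflow-webserver 8080:8080 > /dev/null 2>&1 &
```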

Verify that it works

The Webserver UI can now be opened in the browser with http://localhost:8080. Enter the admin credentials from the Kubernetes secret:

Airflow login screen

Since the examples were loaded in the cluster definition, they appear under the DAGs tab:

Example Airflow DAGs

Select one of these DAGs by clicking on its name in the left-hand column, e.g. example_trigger_target_dag. Click on the arrow in the top right of the screen and select "Trigger DAG"; the DAG nodes are highlighted automatically as the job works through its phases.

Airflow DAG in action

Great! You have set up an Airflow cluster, connected to it and run your first DAG!

Alternative: Using the command line

If you prefer to interact directly with the API instead of using the web interface, you can use the following commands. The DAG used here is one of the example DAGs, named example_trigger_target_dag. Triggering a DAG run via the API requires an initial extra step to ensure that the DAG is not in a paused state:

link:example$getting_started/code/getting_started.sh[role=include]
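The unpause step uses the PATCH endpoint of the Airflow stable REST API. A sketch, assuming the port-forward above and admin credentials airflow/airflow (replace with the values from your credentials Secret):

```shell
# Un-pause the DAG via the stable REST API (PATCH /api/v1/dags/{dag_id}).
# Credentials airflow:airflow are an assumption -- use your admin user.
curl -s -u airflow:airflow \
  -H 'Content-Type: application/json' \
  -X PATCH 'http://localhost:8080/api/v1/dags/example_trigger_target_dag?update_mask=is_paused' \
  -d '{"is_paused": false}'
```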

A DAG run can then be triggered by providing the DAG name (in this case, example_trigger_target_dag). The response contains the identifier of the run, which can be parsed out of the JSON like this:

link:example$getting_started/code/getting_started.sh[role=include]
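As a sketch of the trigger-and-parse step (again assuming admin credentials airflow/airflow and the local port-forward):

```shell
# Trigger a DAG run (POST /api/v1/dags/{dag_id}/dagRuns) and extract the
# run identifier from the JSON response with jq.
response=$(curl -s -u airflow:airflow \
  -H 'Content-Type: application/json' \
  -X POST 'http://localhost:8080/api/v1/dags/example_trigger_target_dag/dagRuns' \
  -d '{}' || true)
# The variable name dag_id matches the text below; it holds the dag_run_id.
dag_id=$(echo "$response" | jq -r '.dag_run_id')
echo "$dag_id"
```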

If this identifier is stored in a variable such as dag_id (replaced manually in the command), you can run this command to check the status of the DAG run:

link:example$getting_started/code/getting_started.sh[role=include]
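A sketch of the status query, assuming dag_id holds the run identifier from the previous step and the same admin credentials:

```shell
# Query the state of the run (GET /api/v1/dags/{dag_id}/dagRuns/{dag_run_id}).
# $dag_id is assumed to hold the run identifier extracted earlier;
# credentials airflow:airflow are an assumption.
curl -s -u airflow:airflow \
  "http://localhost:8080/api/v1/dags/example_trigger_target_dag/dagRuns/${dag_id}" \
  | jq -r '.state'
```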

What’s next

Look at the usage-guide/index.adoc to find out more about configuring your Airflow Stacklet and loading your own DAG files.