Why Setting an SLA for Your DAGs Might Be a Very Good Idea

While I felt I'd already taken plenty of steps to be notified when a DAG fails, I recently discovered that I hadn't done anything to catch those scenarios where a DAG takes longer (or, in my case, a lot longer) than normal to complete or fail. Hence the need to learn how to set an SLA (service level agreement, in support parlance) to account for these rare occurrences.

Configuring the SLA

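First things first: there is NOT a DAG-level SLA setting in the DAG definition. The SLA is a task-level argument (although, like any other operator argument, it can also be passed through the default_args dictionary so that it applies to every task). For a single task, it looks like this: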

...[imports, default_args dict, etc.]...

with DAG(
    dag_id="daily_email_report",
    catchup=False,
    default_args=default_args,
    schedule_interval="@daily",
) as dag:

    ...[other tasks]...

    send_email_report = PythonOperator(
        task_id="send_email_report",
        python_callable=send_email_report,
        sla=timedelta(hours=1),  # <=== SLA definition
        provide_context=True,
        on_failure_callback=send_slack_message_failed,
    )
  • Make sure that datetime.timedelta has been imported at the top of your DAG!

Yes, it really is that easy to define the SLA! You just assign the amount of time the task has to complete, measured from the start of the DAG run (so in the case above, the run starts at midnight and the task therefore has until 1:00am to complete).
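To make the arithmetic concrete, here is a minimal sketch of the deadline calculation described above. It does not use Airflow at all; the helper names (sla_deadline, missed_sla) and the example date are made up for illustration, and it simply assumes the clock starts at the beginning of the DAG run.

```python
from datetime import datetime, timedelta, timezone


def sla_deadline(run_start, sla):
    """Return the moment by which the task should have finished."""
    return run_start + sla


def missed_sla(run_start, sla, now):
    """Return True if `now` is later than the SLA deadline."""
    return now > sla_deadline(run_start, sla)


# Hypothetical daily run that starts at midnight UTC, with a 1-hour SLA.
run_start = datetime(2019, 9, 10, 0, 0, tzinfo=timezone.utc)
sla = timedelta(hours=1)

print(sla_deadline(run_start, sla).isoformat())  # 2019-09-10T01:00:00+00:00

# Still running at 00:45: within the SLA.
print(missed_sla(run_start, sla, run_start + timedelta(minutes=45)))  # False

# Still running at 01:30: the SLA has been missed.
print(missed_sla(run_start, sla, run_start + timedelta(minutes=90)))  # True
```

The point is only that the SLA is a duration added to the start of the run, not a wall-clock time you type in directly.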

However, assigning an SLA on its own only results in an email being sent to whoever is listed in the default_args/DAG definition. The body of that email looks like this:

Here's a list of tasks that missed their SLAs:
send_email_report on 2019-09-09T00:00:00+00:00
Blocking tasks:
wait_for_legacy_etl_complete on 2019-09-09T00:00:00+00:00

      =,             .=
     =.|    ,---.    |.=
     =.| "-(:::::)-" |.=
      \\__/`-.|.-'\__//
       `-| .::| .::|-'      Pillendreher
        _|`-._|_.-'|_       (Scarabaeus sacer)
      /.-|    | .::|-.\
     // ,| .::|::::|. \\
    || //\::::|::' /\\ ||
    /'\|| `.__|__.' ||/'\
   ^    \\         //    ^
        /'\       /'\
       ^             ^

That email (including the ASCII bug) tells you which task didn’t complete on time and the name of the task that was blocking it (i.e., still active) at the time of the SLA miss. That’s nice to know, but wouldn’t it be better if you could get a little more information and/or be notified another way?

Configuring sla_miss_callback

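Much like on_retry_callback, on_failure_callback, etc., you can configure an sla_miss_callback. Note that, unlike those per-task callbacks, sla_miss_callback is set on the DAG itself, so it covers every task in that DAG that has an SLA. (An obvious idea is to send a Slack message, much like I showed in this post.)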
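Here is a rough sketch of what such a callback could look like. Airflow calls it with five positional arguments (the DAG, a newline-separated string of late tasks, a newline-separated string of blocking tasks, the SLA miss records, and the blocking task instances). The function names notify_sla_miss and format_sla_miss are my own, and the print call is only a stand-in for whatever Slack helper you already have.

```python
def format_sla_miss(dag_id, task_list, blocking_task_list):
    """Build a short, chat-friendly summary of an SLA miss."""
    lines = ["SLA missed in DAG '{}'".format(dag_id), "Late tasks:"]
    lines += ["  - " + t for t in task_list.splitlines() if t.strip()]

    blockers = [b for b in blocking_task_list.splitlines() if b.strip()]
    if blockers:
        lines.append("Blocking tasks:")
        lines += ["  - " + b for b in blockers]
    return "\n".join(lines)


def notify_sla_miss(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Callback with the five-argument signature Airflow uses for SLA misses."""
    message = format_sla_miss(dag.dag_id, task_list, blocking_task_list)
    # Stand-in for a real notification (e.g. your own Slack helper).
    print(message)
    return message


# Wiring it up (shown as a comment so this sketch runs without Airflow):
#
# with DAG(
#     dag_id="daily_email_report",
#     default_args=default_args,
#     schedule_interval="@daily",
#     sla_miss_callback=notify_sla_miss,
# ) as dag:
#     ...
```

Keeping the message formatting in its own function makes it easy to try out with plain strings before a real SLA miss ever happens.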

Warning About Configuring an SLA and/or sla_miss_callback

Setting an SLA does NOT cause the DAG (or task) to fail. It’s just there to bring attention to the fact that a DAG is taking longer than intended or expected to complete.
