
Commit a862d54

Merge remote-tracking branch 'upstream/main' into pydabs-patterns-airflow
2 parents 285ffde + 1cf3dba commit a862d54

41 files changed

Lines changed: 25897 additions & 0 deletions

Lines changed: 25 additions & 0 deletions
DATABRICKS INGESTION MONITORING DABS
====================================

0.3.4

- Add support for pipeline discovery using pipeline tags
- Enhance AI/BI dashboards to support pipeline selection using tags

0.3.3

- Add support for monitoring expectation check results
- Extend `table_events_metrics` with a new column `num_expectation_dropped_records` that contains the number of rows dropped by expectations
- Add a table `table_events_expectation_checks` which contains the number of rows that passed or failed specific expectation checks
- Update the generic SDP dashboard to expose metrics/visualizations about expectation failures
- Fix bugs in the Datadog sink

0.3.2

- All monitoring ETL pipelines are now configured to write their event logs to the monitoring schema so that the monitoring pipelines can also be monitored. For example, the CDC Monitoring ETL pipeline will write its event log into `{monitoring_catalog}.{monitoring_schema}.cdc_connector_monitoring_etl_event_log` and the Generic SDP monitoring ETL pipeline will write its event log into `{monitoring_catalog}.{monitoring_schema}.generic_sdp_monitoring_etl_event_log`.
- Fix an issue that would cause the monitoring ETL pipelines to periodically get stuck on the `flow_targets` update

0.3.1

- Fix an issue with the pipeline execution time graph across DABs
Lines changed: 137 additions & 0 deletions
# Common Configuration Guide

This document describes common configuration parameters shared among monitoring DABs (Databricks Asset Bundles).

Configuration is done through variables in a DAB deployment target.

## Required: Specify Monitoring Catalog and Schema

Configure `monitoring_catalog` and `monitoring_schema` to specify where the monitoring tables will be created. The catalog must already exist, but the schema will be created automatically if it doesn't exist.
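For example, a deployment target could set these two variables as follows (the target name and values here are illustrative, not prescribed by the DABs):

```
targets:
  prod:
    variables:
      monitoring_catalog: main
      monitoring_schema: ingestion_monitoring
```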
## Required: Specify Pipelines to Monitor

Configuring which pipelines to monitor involves two steps:
1. Choose the method to extract pipeline event logs
2. Identify which pipelines to monitor

### Event Log Extraction Methods

There are two methods to extract a pipeline's event logs:

**Ingesting (Preferred)**
- Extracts event logs directly from a Delta table where the pipeline writes its logs
- Available for pipelines configured with the `event_log` field ([see documentation](https://docs.databricks.com/api/workspace/pipelines/update#event_log))
- Any UC-enabled pipeline using `catalog` and `schema` fields can be configured to store its event log in a Delta table
- Lower cost and better performance than importing
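As an illustration, a pipeline defined in a DAB can be pointed at a Delta table for its event log via the `event_log` field (pipeline and table names below are hypothetical; see the linked API documentation for the authoritative field reference):

```
resources:
  pipelines:
    my_pipeline:
      catalog: main
      schema: sales
      event_log:
        catalog: main
        schema: sales
        name: my_pipeline_event_log
```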
**Importing (Alternative)**
- First imports the pipeline's event log into a Delta table, then extracts from there
- More expensive than ingesting
- Use only for UC pipelines that use the legacy `catalog`/`target` configuration style
- Requires configuring dedicated import jobs (see ["Optional: Configure Event Log Import Job(s)"](#optional-configure-event-log-import-jobs))

### Pipeline Identification Methods

For both ingested and imported event logs, you can identify pipelines using:

**1. Direct Pipeline IDs**
- Use `directly_monitored_pipeline_ids` for ingested event logs
- Use `imported_pipeline_ids` for imported event logs
- Format: comma-separated list of pipeline IDs

**2. Pipeline Tags**
- Use `directly_monitored_pipeline_tags` for ingested event logs
- Use `imported_pipeline_tags` for imported event logs
- Format: semicolon-separated lists of comma-separated `tag[:value]` pairs
- **Semicolons (`;`)** = OR logic - pipelines matching ANY list will be selected
- **Commas (`,`)** = AND logic - pipelines matching ALL tags in the list will be selected
- `tag` without a value is equivalent to `tag:` (empty value)

**Example:**
```
directly_monitored_pipeline_tags: "tier:T0;team:data,tier:T1"
```
This selects pipelines with either:
- Tag `tier:T0`, OR
- Tags `team:data` AND `tier:T1`
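The selector semantics above can be sketched in Python. This is a hypothetical illustration of the matching rules, not the DABs' actual implementation; the function name and the dict-of-tags representation of a pipeline are assumptions.

```python
def matches_selector(pipeline_tags: dict[str, str], selector: str) -> bool:
    """Illustrative sketch of tag-selector matching.

    Semicolons separate alternatives (OR); commas separate required
    tags within one alternative (AND). A bare `tag` is equivalent to
    `tag:` (empty value).
    """
    for alternative in selector.split(";"):
        required = []
        for pair in alternative.split(","):
            # `partition` yields ("tag", "", "") for a bare tag, so a
            # missing value becomes the empty string, matching `tag:`.
            tag, _, value = pair.strip().partition(":")
            required.append((tag, value))
        # AND within one alternative: every required tag must match.
        if all(pipeline_tags.get(tag) == value for tag, value in required):
            return True  # OR across alternatives: first match wins
    return False
```

With the example selector `"tier:T0;team:data,tier:T1"`, a pipeline tagged `tier:T0` matches via the first alternative, and a pipeline tagged both `team:data` and `tier:T1` matches via the second.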
**Combining Methods:**
All pipeline identification methods can be used together. Pipelines matching any of the criteria will be included.

> **Performance Tip:** For workspaces with hundreds or thousands of pipelines, enable pipeline tags indexing to significantly speed up tag-based discovery. See ["Optional: Configure Pipelines Tags Indexing Job"](#optional-configure-pipelines-tags-indexing-job) for more information.
## Optional: Monitoring ETL Pipeline Configuration

**Schedule Configuration:**
- Customize the monitoring ETL pipeline schedule using the `monitoring_etl_cron_schedule` variable
- Default: runs hourly
- Trade-off: higher frequency increases data freshness but also increases DBU costs

For additional configuration options, refer to the `variables` section in the `databricks.yml` file for the DAB containing the monitoring ETL pipeline.
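For example, to run the monitoring ETL every two hours instead of hourly, override the variable in a deployment target (the target name is illustrative; the expression assumes the Quartz cron syntax used by Databricks job schedules):

```
targets:
  prod:
    variables:
      monitoring_etl_cron_schedule: "0 0 */2 * * ?"  # every two hours
```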
## Optional: Configure Event Log Import Job(s)

> **Note:** Only needed if you're using the "Importing" event log extraction method.

**Basic Configuration:**
1. Set `import_event_log_schedule_state` to `UNPAUSED`
   - Default schedule: hourly (configurable via `import_event_log_cron_schedule`)

2. Configure the `imported_event_log_tables` variable in the monitoring ETL pipeline
   - Specify the table name(s) where imported logs are stored
   - You can reference `${var.imported_event_logs_table_name}`
   - Multiple tables can be specified as a comma-separated list
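Putting the two steps together, a target configuration could look like this (the target name is illustrative):

```
targets:
  prod:
    variables:
      import_event_log_schedule_state: UNPAUSED
      imported_event_log_tables: ${var.imported_event_logs_table_name}
```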
**Handling Pipeline Ownership:**
- If monitored pipelines have a different owner than the DAB owner:
  - Edit `common/resources/import_event_logs.job.yml`
  - Uncomment the `run_as` principal lines
  - Specify the appropriate principal

- If multiple sets of pipelines have different owners:
  - Duplicate the job definition in `common/resources/import_event_logs.job.yml`
  - Give each job a unique name
  - Configure the `run_as` principal for each job as needed
  - All jobs can share the same target table (`imported_event_logs_table_name`)

See [common/vars/import_event_logs.vars.yml](common/vars/import_event_logs.vars.yml) for detailed configuration variable descriptions.
## Optional: Configure Pipelines Tags Indexing Job

> **When to use:** For large-scale deployments with hundreds or thousands of pipelines using tag-based identification.

**Why indexing matters:**
Tag-based pipeline discovery requires fetching metadata for every pipeline via the Databricks API on each event log import and monitoring ETL execution. For large deployments, this can be slow and expensive. The tags index caches this information to significantly improve performance.

**Configuration Steps:**

1. **Enable the index:**
   - Set `pipeline_tags_index_enabled` to `true`

2. **Enable the index refresh job:**
   - Set `pipeline_tags_index_schedule_state` to `UNPAUSED`
   - This job periodically refreshes the index to keep it up to date

3. **Optional: Customize the refresh schedule**
   - Configure `pipeline_tags_index_cron_schedule` (default: daily)
   - If you change the schedule, consider adjusting `pipeline_tags_index_max_age_hours` (default: 48 hours)
   - When the index is older than the max-age threshold, the system falls back to API-based discovery
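The three steps above could be combined in a target like this (target name, schedule, and max-age value are illustrative, not defaults):

```
targets:
  prod:
    variables:
      pipeline_tags_index_enabled: true
      pipeline_tags_index_schedule_state: UNPAUSED
      pipeline_tags_index_cron_schedule: "0 0 6 * * ?"  # daily at 06:00
      pipeline_tags_index_max_age_hours: 36
```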
See [common/vars/pipeline_tags_index.vars.yml](common/vars/pipeline_tags_index.vars.yml) for detailed configuration variable descriptions.

> **Notes:**
> 1. The system gracefully falls back to API-based discovery if the index is disabled, unavailable, or stale.
> 2. If a recently created or tagged pipeline is missing from the monitoring ETL output, the index may be stale. Run the corresponding `Build *** pipeline tags index` job to refresh the index, then re-run the monitoring ETL pipeline.
## Optional: Configure Third-Party Monitoring Integration

You can export monitoring data to third-party monitoring platforms such as Datadog, Splunk, New Relic, or Azure Monitor.

See [README-third-party-monitoring.md](README-third-party-monitoring.md) for detailed configuration instructions.
Lines changed: 51 additions & 0 deletions
DB license

Copyright (2022) Databricks, Inc.

Definitions.

Agreement: The agreement between Databricks, Inc., and you governing the use of the Databricks Services, which shall be, with respect to Databricks, the Databricks Terms of Service located at www.databricks.com/termsofservice, and with respect to Databricks Community Edition, the Community Edition Terms of Service located at www.databricks.com/ce-termsofuse, in each case unless you have entered into a separate written agreement with Databricks governing the use of the applicable Databricks Services.

Software: The source code and object code to which this license applies.

Scope of Use. You may not use this Software except in connection with your use of the Databricks Services pursuant to the Agreement. Your use of the Software must comply at all times with any restrictions applicable to the Databricks Services, generally, and must be used in accordance with any applicable documentation. You may view, use, copy, modify, publish, and/or distribute the Software solely for the purposes of using the code within or connecting to the Databricks Services. If you do not agree to these terms, you may not view, use, copy, modify, publish, and/or distribute the Software.

Redistribution. You may redistribute and sublicense the Software so long as all use is in compliance with these terms. In addition:

You must give any other recipients a copy of this License;
You must cause any modified files to carry prominent notices stating that you changed the files;
You must retain, in the source code form of any derivative works that you distribute, all copyright, patent, trademark, and attribution notices from the source code form, excluding those notices that do not pertain to any part of the derivative works; and
If the source code form includes a "NOTICE" text file as part of its distribution, then any derivative works that you distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the derivative works.
You may add your own copyright statement to your modifications and may provide additional license terms and conditions for use, reproduction, or distribution of your modifications, or for any such derivative works as a whole, provided your use, reproduction, and distribution of the Software otherwise complies with the conditions stated in this License.

Termination. This license terminates automatically upon your breach of these terms or upon the termination of your Agreement. Additionally, Databricks may terminate this license at any time on notice. Upon termination, you must permanently delete the Software and all copies thereof.

DISCLAIMER; LIMITATION OF LIABILITY.

THE SOFTWARE IS PROVIDED “AS-IS” AND WITH ALL FAULTS. DATABRICKS, ON BEHALF OF ITSELF AND ITS LICENSORS, SPECIFICALLY DISCLAIMS ALL WARRANTIES RELATING TO THE SOURCE CODE, EXPRESS AND IMPLIED, INCLUDING, WITHOUT LIMITATION, IMPLIED WARRANTIES, CONDITIONS AND OTHER TERMS OF MERCHANTABILITY, SATISFACTORY QUALITY OR FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT. DATABRICKS AND ITS LICENSORS TOTAL AGGREGATE LIABILITY RELATING TO OR ARISING OUT OF YOUR USE OF OR DATABRICKS’ PROVISIONING OF THE SOURCE CODE SHALL BE LIMITED TO ONE THOUSAND ($1,000) DOLLARS. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
Lines changed: 17 additions & 0 deletions
Copyright (2025) Databricks, Inc.

This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file.

__________
This Software contains code from the following open source projects, licensed under the Apache 2.0 license (https://www.apache.org/licenses/LICENSE-2.0):

requests - https://pypi.org/project/requests/
Copyright 2019 Kenneth Reitz

tenacity - https://pypi.org/project/tenacity/
Copyright Julien Danjou

pyspark - https://pypi.org/project/pyspark/
Copyright 2014 and onwards The Apache Software Foundation.