Commit c4cddde

add job backfill example

1 parent 66d3de8

6 files changed, 159 additions & 0 deletions
README.md — 67 additions & 0 deletions
# pydabs_job_backfill_data

This example demonstrates a Databricks Asset Bundles (DABs) job that runs a SQL task with a date parameter for backfilling data.

The job consists of:

1. **run_daily_sql** — A SQL task that runs `src/my_query.sql` with a `run_date` job parameter. The query inserts data from a source table into a target table filtered by `event_date = run_date`, so you can backfill or reprocess specific dates.

* `src/`: SQL and notebook source code for this project.
  * `src/my_query.sql`: Daily insert query that uses the `:run_date` parameter to filter by event date.
* `resources/`: Resource configurations (jobs, pipelines, etc.).
  * `resources/backfill_data.py`: PyDABs job definition with a parameterized SQL task.
## Job parameters

| Parameter  | Default      | Description                                   |
|------------|--------------|-----------------------------------------------|
| `run_date` | `2024-01-01` | Date used to filter data (e.g. `event_date`). |

Before deploying, set `warehouse_id` in `resources/backfill_data.py` to your SQL warehouse ID, and adjust the catalog/schema/table names in `src/my_query.sql` to match your environment.
## Getting started

Choose how you want to work on this project:

(a) Directly in your Databricks workspace; see
    https://docs.databricks.com/dev-tools/bundles/workspace.

(b) Locally with an IDE like Cursor or VS Code; see
    https://docs.databricks.com/vscode-ext.

(c) With command-line tools; see
    https://docs.databricks.com/dev-tools/cli/databricks-cli.html.

If you're developing with an IDE, install this project's dependencies using uv:

* Make sure you have the uv package manager installed.
  It's an alternative to tools like pip: https://docs.astral.sh/uv/getting-started/installation/.
* Run `uv sync --dev` to install the project's dependencies.
## Using this project with the CLI

The Databricks workspace and IDE extensions provide a graphical interface for working
with this project. You can also use the CLI:

1. Authenticate to your Databricks workspace, if you have not done so already:

   ```
   $ databricks configure
   ```

2. To deploy a development copy of this project, run:

   ```
   $ databricks bundle deploy --target dev
   ```

   (Note: "dev" is the default target, so `--target` is optional.)

   This deploys everything defined for this project, including the job
   `[dev yourname] sql_backfill_example`. You can find it under **Workflows** (or **Jobs & Pipelines**) in your workspace.

3. To run the job with the default `run_date`:

   ```
   $ databricks bundle run sql_backfill_example
   ```

4. To run the job for a specific date (e.g. a backfill):

   ```
   $ databricks bundle run sql_backfill_example --parameters run_date=2024-02-01
   ```
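Step 4 triggers one date at a time; backfilling a range means one run per date. A minimal sketch of a driver loop, using only the Python standard library (the actual `databricks` CLI invocation is commented out so the date generation can be tried standalone; the date range shown is an arbitrary example):

```python
from datetime import date, timedelta
import subprocess

def date_range(start, end):
    """Yield ISO date strings from start to end, inclusive."""
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(days=1)

for run_date in date_range(date(2024, 2, 1), date(2024, 2, 3)):
    print(f"backfilling {run_date}")
    # Uncomment to actually trigger the job (same command as step 4):
    # subprocess.run(
    #     ["databricks", "bundle", "run", "sql_backfill_example",
    #      "--parameters", f"run_date={run_date}"],
    #     check=True,
    # )
```

Each invocation is a separate job run, so a failed date can be retried individually without re-running the whole range.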
databricks.yml — 21 additions & 0 deletions

```yaml
# This is a Databricks asset bundle definition for pydabs_job_backfill_data.
# See https://docs.databricks.com/dev-tools/bundles/index.html for documentation.
bundle:
  name: pydabs_job_backfill_data

python:
  venv_path: .venv
  # Functions called to load resources defined in Python. See resources/__init__.py
  resources:
    - "resources:load_resources"

include:
  - resources/*.yml
  - resources/*/*.yml

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://myworkspace.databricks.com
```
pyproject.toml — 26 additions & 0 deletions

```toml
[project]
name = "pydabs_job_backfill_data"
version = "0.0.1"
authors = [{ name = "Databricks Field Engineering" }]
requires-python = ">=3.10,<=3.13"
dependencies = [
    # Any dependencies for jobs and pipelines in this project can be added here.
    # See also https://docs.databricks.com/dev-tools/bundles/library-dependencies
    #
    # LIMITATION: for pipelines, dependencies are cached during development;
    # add dependencies to the 'environment' section of the pipeline.yml file instead.
]

[dependency-groups]
dev = [
    "pytest",
    "databricks-connect>=15.4,<15.5",
    "databricks-bundles==0.275.0",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.black]
line-length = 125
```
resources/__init__.py — 16 additions & 0 deletions

```python
from databricks.bundles.core import (
    Bundle,
    Resources,
    load_resources_from_current_package_module,
)


def load_resources(bundle: Bundle) -> Resources:
    """
    The 'load_resources' function is referenced in databricks.yml and is responsible
    for loading bundle resources defined in Python code. It is called by the
    Databricks CLI during bundle deployment; after deployment, it is not used.
    """

    # The default implementation loads all Python files in the 'resources' directory.
    return load_resources_from_current_package_module()
```
resources/backfill_data.py — 24 additions & 0 deletions

```python
from databricks.bundles.jobs import (
    Job,
    JobParameterDefinition,
    SqlTask,
    SqlTaskFile,
    Task,
)

run_daily_sql = Task(
    task_key="run_daily_sql",
    sql_task=SqlTask(
        warehouse_id="<your_warehouse_id>",  # set to your SQL warehouse ID before deploying
        file=SqlTaskFile(path="src/my_query.sql"),
        # Forward the job-level 'run_date' parameter to the SQL task.
        parameters={"run_date": "{{job.parameters.run_date}}"},
    ),
)

sql_backfill_example = Job(
    name="sql_backfill_example",
    tasks=[run_daily_sql],
    parameters=[
        JobParameterDefinition(name="run_date", default="2024-01-01"),
    ],
)
```
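The task's `parameters` entry forwards the job-level parameter via the `{{job.parameters.run_date}}` reference. A toy resolver (entirely hypothetical; the real substitution is performed by the Jobs service at run time) illustrates how a run-time value flows through to the SQL task:

```python
# Hypothetical sketch of {{job.parameters.<name>}} substitution,
# mimicking what the Jobs service does at run time.
def resolve(template: str, job_params: dict) -> str:
    for name, value in job_params.items():
        template = template.replace("{{job.parameters.%s}}" % name, value)
    return template

task_parameters = {"run_date": "{{job.parameters.run_date}}"}

# A default run: the job parameter takes its declared default value.
resolved = {k: resolve(v, {"run_date": "2024-01-01"})
            for k, v in task_parameters.items()}
print(resolved)  # {'run_date': '2024-01-01'}
```

Passing `--parameters run_date=2024-02-01` on the CLI would simply change the value substituted here.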
src/my_query.sql — 5 additions & 0 deletions

```sql
-- Referenced by the sql_task; :run_date is supplied as a task parameter.
INSERT INTO catalog.schema.target_table
SELECT *
FROM catalog.schema.source_table
WHERE event_date = date(:run_date);
```
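One caveat when reprocessing a date: a plain `INSERT` is not idempotent, so re-running the job for a date that was already loaded duplicates those rows. A small Python simulation of the query's filter-and-insert behavior makes this visible (the tables and rows below are made-up stand-ins, not part of the example project):

```python
from datetime import date

# Toy stand-ins for catalog.schema.source_table / target_table.
source_table = [
    {"id": 1, "event_date": date(2024, 2, 1)},
    {"id": 2, "event_date": date(2024, 2, 1)},
    {"id": 3, "event_date": date(2024, 2, 2)},
]
target_table = []

def run_backfill(run_date):
    """Mimic: INSERT INTO target SELECT * FROM source WHERE event_date = :run_date."""
    target_table.extend(r for r in source_table if r["event_date"] == run_date)

run_backfill(date(2024, 2, 1))
print(len(target_table))  # 2 -- the two rows for 2024-02-01

run_backfill(date(2024, 2, 1))
print(len(target_table))  # 4 -- rerunning the same date appended duplicates
```

If reruns must be idempotent, delete the date's rows in the target first (or restructure the query as a `MERGE` or `INSERT OVERWRITE` on the date partition) before inserting.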
