Databricks Sink Node
Quick Reference
Databricks Table — Select an existing Databricks table data asset or create a new one.
Overview
The Databricks Sink Node writes processed data directly into Databricks Unity Catalog tables. Unlike direct file writers, this node routes the load through a Databricks SQL Warehouse, giving you ACID guarantees and Unity Catalog governance within the Databricks ecosystem.
How It Works
This node operates in a three-step "Stage and Load" process to maximize reliability and throughput:
- Stage: Data is converted to Parquet format locally.
- Upload: Parquet files are uploaded securely to a Databricks Volume (Unity Catalog storage).
- Load: A `COPY INTO` SQL command is executed on your designated SQL Warehouse, loading the data from the Volume into your target table transactionally (sketched below).
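For illustration, the Load step issues a statement along these lines. This is a minimal sketch: the catalog, schema, table, and staging path are hypothetical placeholders, and any options Fleak actually sets are not shown here.

```sql
-- Minimal sketch of the Load step (all names and the path are illustrative).
-- COPY INTO skips files it has already loaded, so retries do not duplicate rows.
COPY INTO prod.finance.revenue_reports
FROM '/Volumes/prod/finance/staging_volume/fleak_batch_0001/'
FILEFORMAT = PARQUET;
```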
Configuration
| UI Selection | Description |
|---|---|
| Databricks Table | Select an existing Databricks table data asset or create a new one (e.g., prod.finance.revenue_reports). You can also create/import tables in the Data Assets section. |
Prerequisites
Before using the Databricks Sink, you must have a Databricks Integration configured in Fleak. This integration provides the OAuth credentials needed to connect to your Databricks workspace.
Creating a New Databricks Table Data Asset
When creating a new data asset, you will need to provide the following (a SQL sketch for creating the underlying Databricks objects appears after this list):
- Integration — the Databricks integration to use
- Catalog — the Unity Catalog catalog name
- Schema — the schema (database) within the catalog
- SQL Warehouse — the warehouse used to execute the `COPY INTO` command
- Staging Volume — the volume where Parquet files are staged before loading
- Table — the target table to write data into
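If the staging volume or target table does not yet exist on the Databricks side, they can be created with standard Unity Catalog SQL such as the following. The names and columns are illustrative placeholders, not objects Fleak assumes:

```sql
-- Illustrative setup for the objects a data asset references.
-- Substitute your own catalog, schema, volume, table, and columns.
CREATE VOLUME IF NOT EXISTS prod.finance.staging_volume;

CREATE TABLE IF NOT EXISTS prod.finance.revenue_reports (
  report_date DATE,
  region      STRING,
  revenue     DECIMAL(18, 2)
);
```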
Advanced Settings
These settings tune the performance and behavior of the ingestion process; they do not change where the data is written.
| Field Name | Description | Default |
|---|---|---|
| Batch Size | Number of records to accumulate before triggering a "Stage and Load" operation. | 10,000 |
Schema Validation
At startup, the Databricks Sink validates the schema of your selected table data asset against the actual Databricks table schema. If the schemas do not match, the pipeline will not start, preventing data quality issues caused by schema drift between your pipeline and the target table.
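If startup fails with a schema mismatch, one way to diagnose it is to inspect the live table schema on the SQL Warehouse and compare it with your pipeline's output. The table name below is illustrative:

```sql
-- Show the target table's current column names and types.
DESCRIBE TABLE prod.finance.revenue_reports;
```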
When to use the Databricks Sink
- Choose Databricks Sink if you are building business-critical tables that analysts query immediately, and you want the safety and ease of Databricks Unity Catalog governance.
- Choose Delta Lake Sink if you are building a raw data landing zone, are sensitive to compute costs, or need to write data to storage that isn't strictly coupled to a Databricks workspace.