
Databricks Sink Node

Quick Reference

  • Table (Asset): Select or create a table from Databricks.
  • Select Warehouse: Select the Databricks SQL Warehouse to execute the load command.

Overview

The Databricks Sink Node allows you to ingest processed data directly into Databricks Unity Catalog tables. Unlike direct file writers, this node leverages the Databricks SQL Compute engine to ensure ACID compliance and governance within the Databricks ecosystem.

How It Works

This node operates in a three-step "Stage and Load" process to maximize reliability and throughput:

  1. Stage: Data is converted to Parquet format locally.
  2. Upload: Parquet files are uploaded securely to a Databricks Volume (Unity Catalog managed storage).
  3. Load: A COPY INTO SQL command is executed on your designated SQL Warehouse. This command loads the data from the Volume into your target table transactionally (see the sketch below).
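
For illustration, here is a minimal Python sketch of the same Stage and Load sequence, assuming the pyarrow, databricks-sdk, and databricks-sql-connector packages. The table, Volume path, warehouse, and credentials are placeholders, not values the node derives for you.

```python
import io

import pyarrow as pa
import pyarrow.parquet as pq
from databricks import sql
from databricks.sdk import WorkspaceClient

# Placeholder destinations: substitute your own catalog, schema, volume, and warehouse.
TARGET_TABLE = "prod.finance.revenue_reports"
STAGING_DIR = "/Volumes/prod/finance/staging"
STAGING_FILE = f"{STAGING_DIR}/revenue_batch_0001.parquet"
HOST = "<workspace-host>.cloud.databricks.com"
HTTP_PATH = "/sql/1.0/warehouses/<warehouse-id>"
TOKEN = "<personal-access-token>"

# 1. Stage: convert the accumulated batch to Parquet in memory.
batch = pa.table({"region": ["EMEA", "APAC"], "revenue": [1200.0, 950.0]})
buf = io.BytesIO()
pq.write_table(batch, buf)
buf.seek(0)

# 2. Upload: push the Parquet bytes into a Unity Catalog Volume.
w = WorkspaceClient(host=f"https://{HOST}", token=TOKEN)
w.files.upload(STAGING_FILE, buf, overwrite=True)

# 3. Load: run COPY INTO on the SQL Warehouse; the load is transactional.
with sql.connect(server_hostname=HOST, http_path=HTTP_PATH, access_token=TOKEN) as conn:
    with conn.cursor() as cur:
        cur.execute(
            f"""
            COPY INTO {TARGET_TABLE}
            FROM '{STAGING_DIR}'
            FILEFORMAT = PARQUET
            """
        )

# Mirrors the "Cleanup After Copy" setting: remove the staged file after a successful load.
w.files.delete(STAGING_FILE)
```

The node performs this sequence for you; the sketch is only meant to make the moving parts concrete.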

Configuration

  • Select Table (Asset): Choose a target table from the dropdown (e.g., prod.finance.revenue_reports). You can select an existing table or define a new one directly in the UI. You can also create/import tables in the Data Assets section.
  • Select Warehouse: Select the Databricks SQL Warehouse to execute the load command.
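
For reference, creating the example table from the dropdown above by hand on a SQL Warehouse could look like the following sketch; the column definitions are entirely hypothetical and the connection details are placeholders.

```python
from databricks import sql

# Hypothetical schema for the example target table shown above.
DDL = """
CREATE TABLE IF NOT EXISTS prod.finance.revenue_reports (
    region    STRING,
    revenue   DOUBLE,
    loaded_at TIMESTAMP
)
"""

with sql.connect(
    server_hostname="<workspace-host>.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)
```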

Advanced Settings

These settings control the performance and behavior of the ingestion process but do not affect the destination topology.

  • Batch Size: Number of records to accumulate before triggering a "Stage and Load" operation. Default: 10,000.
  • Flush Interval: Maximum time (ms) to wait before forcing a write, ensuring low latency for low-volume streams. Default: 30,000 (30 s).
  • Cleanup After Copy: Automatically deletes the temporary Parquet files from the Databricks Volume after a successful load. Default: True.
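
To make the two triggers concrete: a flush happens when either the batch fills up or the interval elapses, whichever comes first. Below is a minimal, hypothetical sketch of that logic; buffer, on_record, and stage_and_load are illustrative names, not the node's internals.

```python
import time

BATCH_SIZE = 10_000         # records per "Stage and Load" operation
FLUSH_INTERVAL_MS = 30_000  # maximum wait before forcing a write

buffer: list = []
last_flush = time.monotonic()

def stage_and_load(records: list) -> None:
    """Placeholder for the Parquet -> Volume -> COPY INTO sequence described above."""
    ...

def should_flush() -> bool:
    """Flush when the batch is full or the interval has elapsed, whichever comes first."""
    elapsed_ms = (time.monotonic() - last_flush) * 1000
    return len(buffer) >= BATCH_SIZE or elapsed_ms >= FLUSH_INTERVAL_MS

def on_record(record) -> None:
    """Accumulate a record and trigger a flush when either condition is met."""
    global last_flush
    buffer.append(record)
    if should_flush():
        stage_and_load(buffer)
        buffer.clear()
        last_flush = time.monotonic()
```

A real implementation would also evaluate the interval on a timer so that a quiet stream still flushes; this sketch only checks when a record arrives.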

When to use the Databricks Sink

  • Choose Databricks Sink if you are building business-critical tables that analysts query immediately, and you want the safety and ease of Databricks Unity Catalog governance.
  • Choose Delta Lake Sink if you are building a raw data landing zone, are sensitive to compute costs, or need to write data to storage that isn't strictly coupled to a Databricks workspace.