Databricks Sink Node

Quick Reference

Databricks Table — Select an existing Databricks table data asset or create a new one.

Overview

The Databricks Sink Node allows you to ingest processed data directly into Databricks Unity Catalog tables. Unlike direct file writers, this node leverages a Databricks SQL Warehouse to ensure ACID compliance and governance within the Databricks ecosystem.

How It Works

This node operates in a three-step "Stage and Load" process to maximize reliability and throughput:

  1. Stage: Data is converted to Parquet format locally.
  2. Upload: Parquet files are uploaded securely to a Databricks Volume (Unity Catalog storage).
  3. Load: A COPY INTO SQL command is executed on your designated SQL Warehouse. This command loads the data from the Volume into your target table transactionally.
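The three steps above can be sketched as follows. This is a minimal illustration, not the node's actual implementation: the volume path, table name, and helper function are hypothetical, though the `COPY INTO … FILEFORMAT = PARQUET` statement is standard Databricks SQL.

```python
def build_copy_into(table: str, volume_dir: str) -> str:
    """Step 3 (Load): the COPY INTO statement executed on the SQL Warehouse.

    COPY INTO loads the staged Parquet files from the Unity Catalog
    volume into the target table as a single transactional operation.
    """
    return (
        f"COPY INTO {table}\n"
        f"FROM '{volume_dir}'\n"
        f"FILEFORMAT = PARQUET"
    )

# Hypothetical table and staging-volume path for illustration only.
sql = build_copy_into(
    "prod.finance.revenue_reports",
    "/Volumes/prod/finance/staging/batch_0001/",
)
```

Steps 1 and 2 (writing Parquet locally and uploading it to the volume) happen before this statement runs, so the warehouse only ever sees fully staged files.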

Configuration

  • Databricks Table — Select an existing Databricks table data asset or create a new one (e.g., prod.finance.revenue_reports). You can also create/import tables in the Data Assets section.

Prerequisites

Before using the Databricks Sink, you must have a Databricks Integration configured in Fleak. This integration provides the OAuth credentials needed to connect to your Databricks workspace.

Creating a New Databricks Table Data Asset

When creating a new data asset, you will need to provide:

  • Integration — the Databricks integration to use
  • Catalog — the Unity Catalog catalog name
  • Schema — the schema (database) within the catalog
  • SQL Warehouse — the warehouse used to execute the COPY INTO command
  • Staging Volume — the volume where Parquet files are staged before loading
  • Table — the target table to write data into
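Conceptually, a new data asset bundles those six values. The dictionary below is a hypothetical sketch of that shape — the field names mirror the bullets above and are not a documented Fleak API:

```python
# Illustrative only: names and values are placeholders, not a real config schema.
databricks_asset = {
    "integration": "my-databricks-integration",  # Databricks integration in Fleak
    "catalog": "prod",                           # Unity Catalog catalog
    "schema": "finance",                         # schema (database) in the catalog
    "sql_warehouse": "ingest-warehouse",         # runs the COPY INTO command
    "staging_volume": "/Volumes/prod/finance/staging",  # Parquet staging area
    "table": "revenue_reports",                  # target table
}
```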

Advanced Settings

These settings control the performance and behavior of the ingestion process but do not affect the destination topology.

  • Batch Size — Number of records to accumulate before triggering a "Stage and Load" operation. Default: 10,000.
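The batching behavior can be sketched with a simple buffer: records accumulate until the batch size is reached, at which point one "Stage and Load" cycle runs. The class below is an illustrative model, not the node's real internals.

```python
class BatchBuffer:
    """Accumulate records; flush (one Stage and Load cycle) at batch_size."""

    def __init__(self, batch_size: int = 10_000):
        self.batch_size = batch_size
        self.records: list[dict] = []
        self.flushes = 0  # counts completed Stage and Load operations

    def add(self, record: dict) -> None:
        self.records.append(record)
        if len(self.records) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.records:
            # In the real node this would write Parquet, upload it to the
            # staging volume, and execute COPY INTO on the SQL Warehouse.
            self.flushes += 1
            self.records.clear()
```

A larger batch size means fewer, bigger COPY INTO operations (better throughput, higher latency); a smaller one trades throughput for fresher data.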

Schema Validation

The Databricks Sink validates the schema configured on your selected table data asset against the actual Databricks table schema at startup. If the schemas do not match, the pipeline will not start. This prevents data quality issues caused by schema drift between your pipeline and the target table.
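A fail-fast check of this kind might look like the following sketch, where schemas are modeled as simple column-name-to-type mappings (an assumption for illustration, not the node's actual representation):

```python
def validate_schema(pipeline_schema: dict[str, str],
                    table_schema: dict[str, str]) -> None:
    """Raise at startup if the pipeline and table schemas have diverged."""
    if pipeline_schema != table_schema:
        missing = set(table_schema) - set(pipeline_schema)  # in table only
        extra = set(pipeline_schema) - set(table_schema)    # in pipeline only
        raise ValueError(
            f"Schema mismatch: missing={sorted(missing)}, extra={sorted(extra)}"
        )
```

Raising before any data is written is what keeps schema drift from silently corrupting the target table.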

When to use the Databricks Sink

  • Choose Databricks Sink if you are building business-critical tables that analysts query immediately, and you want the safety and ease of Databricks Unity Catalog governance.
  • Choose Delta Lake Sink if you are building a raw data landing zone, are sensitive to compute costs, or need to write data to storage that isn't strictly coupled to a Databricks workspace.