Delta Sink Node
When using S3 as the cloud storage backend, Delta Lake does not support concurrent writes from multiple writers because S3 lacks a distributed lock implementation. Deploying a data pipeline job with multiple replicas writing Delta Lake data to the same S3 path is therefore likely to corrupt the table. For S3 deployments, ensure your pipeline runs with a single replica.
Quick Reference
| Name | Description |
|---|---|
| Use Credentials | Credentials used to authenticate with your Delta Lake storage. ex: AWS Prod Credentials |
| Delta Lake Table Path | The full path to the Delta Lake table you want to write to. ex: s3://my-bucket/data/events_delta |
| Select a Saved Table | Choose from previously saved Delta Lake table paths. ex: UserEventsProdDeltaTable |
| Partition Columns | Columns used to partition the Delta Lake table for optimized reads and writes. ex: event_date |
| Hadoop Configuration | Custom Hadoop properties applied during the Delta Lake write. ex: fs.s3a.endpoint = s3.eu-west-1.amazonaws.com |
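The Hadoop Configuration field accepts key/value pairs like the fs.s3a.endpoint example above. As a rough illustration of what such an override means, the sketch below shows how an equivalent property would be applied in a standalone PySpark job; the node applies your entries to its own underlying writer, so the session setup here is an illustrative assumption, not the node's literal implementation.

```python
from pyspark.sql import SparkSession

# Hypothetical standalone equivalent of the node's "Hadoop Configuration" field.
# Prefixing a property with "spark.hadoop." hands it to the underlying Hadoop
# FileSystem (here, the S3A connector) -- the same kind of override the node
# applies on your behalf.
spark = (
    SparkSession.builder
    .appName("delta-sink-hadoop-config-sketch")
    # Equivalent of the table entry: fs.s3a.endpoint = s3.eu-west-1.amazonaws.com
    .config("spark.hadoop.fs.s3a.endpoint", "s3.eu-west-1.amazonaws.com")
    .getOrCreate()
)
```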
Overview
The Delta Lake Sink Node enables you to write processed workflow data directly into a physical storage location in the Delta Lake table format.
Configuration
| Field Name | Description | Required? | Default |
|---|---|---|---|
| Use Credentials | Select a stored credential object (e.g., AWS Keys, GCP Service Account) to authenticate with the storage provider. | Yes (for Cloud) | None |
| Delta Lake Table Path | The full URI to the Delta table folder. Supported schemes include s3a://, gs://, abfs://, hdfs://, and file://. | Yes | N/A |
| Batch Size | The number of records to accumulate before committing a write transaction. The recommended maximum is 10,000. | No | 1000 |
| Partition Columns | A list of column names used to partition the data physically on storage. These columns must exist in the target table's schema. | No | None |
| Hadoop Configuration | Advanced Key-Value pairs to override underlying Hadoop file system settings. | No | None |
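To make these fields concrete, here is a minimal PySpark sketch of the kind of write the node performs for each batch: an append to an existing Delta table path, partitioned by the configured Partition Columns. It assumes the delta-spark package is available; the node's internal engine and exact commit behaviour may differ, and the path and column names are only examples.

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath; the options below
# enable Delta Lake support in a plain Spark session.
spark = (
    SparkSession.builder
    .appName("delta-sink-write-sketch")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A small illustrative batch (in the node, roughly "Batch Size" records).
batch = spark.createDataFrame(
    [("2024-01-01", "click", "user-1"), ("2024-01-01", "view", "user-2")],
    ["event_date", "event_type", "user_id"],
)

# Append one transaction to the pre-existing table at the configured path,
# physically partitioned by the configured Partition Columns.
(
    batch.write.format("delta")
    .mode("append")
    .partitionBy("event_date")            # Partition Columns
    .save("s3a://my-bucket/data/events_delta")  # Delta Lake Table Path
)
```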
Important Prerequisites
Pre-existing Tables Only: This node does not create new Delta tables automatically. The table must already exist at the specified Delta Lake Table Path with a defined schema. If the table is not found during initialization, the workflow will fail. You can create new Delta tables or import existing ones in the Data Assets section.
Schema Validation: Incoming data is strictly validated against the existing Delta table's schema.
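Assuming a Spark-based writer with the delta-spark package, a standalone job would check the same prerequisites roughly as follows. The path, error handling, and schema comparison are illustrative only; the node performs its own checks during initialization.

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

# Delta-enabled session, as in the earlier sketch.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

table_path = "s3a://my-bucket/data/events_delta"  # Delta Lake Table Path

# Pre-existing table check: the sink never creates the table itself.
if not DeltaTable.isDeltaTable(spark, table_path):
    raise RuntimeError(f"No Delta table found at {table_path}; create or import it in Data Assets first")

# Schema validation: incoming batches must match the existing table's schema.
expected_schema = spark.read.format("delta").load(table_path).schema
print(expected_schema.simpleString())
```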
Storage & Authentication
The node supports various storage backends. You must provide the correct credential type for your chosen storage path.
| Storage Provider | Path Scheme | Required Credential Type | Note |
|---|---|---|---|
| AWS S3 | s3:// or s3a:// | Username/Password | Username = Access Key ID; Password = Secret Access Key |
| Google Cloud (GCS) | gs:// | GCP Credential | Supports Service Account JSON Keyfile or Access Token. |
| Azure Blob (ADLS) | abfs:// | API Key | The API Key is the Storage Account Key. The system attempts to extract the account name from the URL. |
| HDFS | hdfs:// | None | Configure authentication via the Hadoop Configuration section directly. |
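For a Spark-based writer, the credential types above typically end up as filesystem properties similar to the following sketch. Exact property names vary with the Hadoop connector versions in use, so treat these as hedged examples rather than the node's literal behaviour; the placeholder values are not real credentials.

```python
from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("delta-sink-auth-sketch")

# AWS S3 (Username/Password credential): access key + secret key via the S3A connector.
builder = (
    builder
    .config("spark.hadoop.fs.s3a.access.key", "<ACCESS_KEY_ID>")
    .config("spark.hadoop.fs.s3a.secret.key", "<SECRET_ACCESS_KEY>")
)

# Google Cloud Storage (GCP credential): service-account keyfile
# (property name differs across GCS connector versions).
builder = builder.config(
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile",
    "/path/to/service-account.json",
)

# Azure ADLS Gen2 via abfs:// (API Key credential): storage account key,
# keyed by the storage account name extracted from the URL.
builder = builder.config(
    "spark.hadoop.fs.azure.account.key.<storage-account>.dfs.core.windows.net",
    "<STORAGE_ACCOUNT_KEY>",
)

spark = builder.getOrCreate()
```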
When to use the Delta Lake Sink
- Choose Databricks Sink if you are building business-critical tables that analysts query immediately, and you want the safety and ease of Databricks Unity Catalog governance.
- Choose Delta Lake Sink if you are building a raw data landing zone, are sensitive to compute costs, or need to write data to storage that isn't strictly coupled to a Databricks workspace.