Azure Blob Storage Sink Node

The azureblobsink node writes pipeline records to an Azure Blob Storage container as newline-delimited JSON (.jsonl) blobs. It batches records in memory and uploads each batch as a single blob, organized into date-partitioned virtual folders.

Typical use cases include archiving streaming events to Azure cold storage, producing date-partitioned data lake landings consumed by Synapse or Databricks, and regulatory retention of processed records.

Key Features

  • Batched NDJSON writes: records accumulate until the batch reaches batchSize, then are uploaded as a single .jsonl blob
  • Date-partitioned naming: every blob name embeds a UTC yyyy/MM/dd/HH-mm-ss path component plus a UUID, giving downstream tools a date-partitioned folder layout to prune on
  • Dual-mode auth: authenticate with either a full Azure Storage connection string or a credential containing the storage account name and key

Configuration

  • containerName (String, required): Azure Blob Storage container to write to.
  • connectionString (String): Full Azure Storage connection string. One of connectionString or credentialId is required; when both are set, connectionString takes priority.
  • credentialId (String): ID of a UsernamePasswordCredential in jobContext.otherProperties, where username = storage account name and password = storage account key. One of connectionString or credentialId is required.
  • blobNamePrefix (String, optional, default "events/"): Prefix prepended to every generated blob name. Include a trailing / so it acts as a virtual folder.
  • batchSize (Integer, optional, default 100, minimum 1): Number of records buffered in memory before flushing as a single blob.
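
A minimal configuration needs only the container and one form of authentication; every other field falls back to its default (blobNamePrefix: "events/", batchSize: 100). The values below are placeholders, not working credentials.

config:
  containerName: "my-container"
  connectionString: "DefaultEndpointsProtocol=https;AccountName=mystorage;AccountKey=...;EndpointSuffix=core.windows.net"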

Authentication

Either connectionString or credentialId must be provided. If both are set, connectionString wins.
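
If a DAG happens to set both, the credential is simply ignored, for example:

config:
  containerName: "workflow-output"
  connectionString: "DefaultEndpointsProtocol=https;AccountName=mystorage;AccountKey=..."
  credentialId: "azure-blob-cred" # ignored: connectionString takes priority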

Connection String

config:
  connectionString: "DefaultEndpointsProtocol=https;AccountName=mystorage;AccountKey=...;EndpointSuffix=core.windows.net"

Credential

jobContext:
  otherProperties:
    azure-blob-cred:
      username: "mystorage" # storage account name
      password: "<storage account key>"

The connector builds a connection string at runtime from these two values.
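
The exact string it assembles is an implementation detail, but it is reasonable to picture the standard account-key form with the credential's username and password substituted in (the protocol and endpoint suffix shown here are assumptions, not configurable fields):

DefaultEndpointsProtocol=https;AccountName=<username>;AccountKey=<password>;EndpointSuffix=core.windows.net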

Blob Naming

Every blob the sink writes follows the pattern:

{blobNamePrefix}{yyyy/MM/dd/HH-mm-ss}-{uuid}.jsonl

For example, with blobNamePrefix: "events/":

events/2026/04/27/12-34-56-3f1a8b9d-2c5e-4a8c-9d6e-7b1c2a3d4e5f.jsonl

The timestamp is in UTC, and each line in the blob is one record from the workflow batch.
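
Over successive flushes the date component groups blobs into per-day virtual folders under the prefix. An illustrative listing (UUIDs abbreviated to a placeholder) might look like:

events/2026/04/27/12-34-56-<uuid>.jsonl
events/2026/04/27/13-05-10-<uuid>.jsonl
events/2026/04/28/00-00-41-<uuid>.jsonl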

DAG Example

jobContext:
  otherProperties:
    azure-blob-cred:
      username: "mystorage"
      password: "<storage account key>"
  metricTags: {}
  dlqConfig:

dag:
  - id: "source"
    commandName: "kafkasource"
    config:
      broker: "kafka:9092"
      topic: "raw-events"
      groupId: "azure-archiver"
      encodingType: "JSON_OBJECT"
    outputs:
      - "sink"

  - id: "sink"
    commandName: "azureblobsink"
    config:
      containerName: "workflow-output"
      blobNamePrefix: "archive/orders/"
      credentialId: "azure-blob-cred"
      batchSize: 500
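
The same sink node can authenticate with a connection string instead of a registered credential; the value below is a placeholder:

  - id: "sink"
    commandName: "azureblobsink"
    config:
      containerName: "workflow-output"
      blobNamePrefix: "archive/orders/"
      connectionString: "DefaultEndpointsProtocol=https;AccountName=mystorage;AccountKey=...;EndpointSuffix=core.windows.net"
      batchSize: 500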

Tuning Batch Size

batchSize is a tradeoff between write frequency and memory:

  • Larger batches reduce blob writes and produce fewer, larger objects (cheaper to scan downstream).
  • Smaller batches lower memory usage and reduce time-to-availability of records in storage.

The default of 100 is conservative; for high-throughput pipelines 1,000 to 10,000 is typical.
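
As a rough sizing check, multiply batchSize by the average serialized record size: assuming roughly 1 KB per record (an illustrative figure, not a measurement), batchSize: 5000 buffers about 5 MB per flush and produces ~5 MB blobs. A high-throughput variant of the sink above might look like:

config:
  containerName: "workflow-output"
  blobNamePrefix: "archive/orders/"
  credentialId: "azure-blob-cred"
  batchSize: 5000 # ~5 MB per blob at ~1 KB per record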

Related Nodes

  • azureblobsource: Read blobs from an Azure Blob Storage container
  • gcssink: Write records to a Google Cloud Storage bucket
  • s3sink: Write records to AWS S3