Monitoring and Debugging

Grid provides comprehensive observability for distributed data pipeline execution through an integrated monitoring stack. Access real-time metrics, logs, and debugging tools whether running on-premises or through Fleak's managed platform.

Monitoring Stack Components

Grid's observability infrastructure consists of four integrated components deployed via Helm charts:

VictoriaLogs - Log Aggregation

VictoriaLogs provides centralized log collection and search capabilities:

Features:

  • Collects logs from all job tasks across worker nodes
  • Flexible querying with LogSQL syntax
  • Filter by job ID, task ID, time ranges, and custom fields
  • 10GB persistent storage (configurable via Helm values)
  • Efficient compression and indexing

Query Capabilities:

# Find all logs for a specific job
job_id:"job-12345"

# Search for errors in a task
task_id:"task-67890" AND level:"ERROR"

# Time-range filtered logs
job_id:"job-12345" AND _time:[now-1h, now]

Access:

  • Via Grafana dashboards (Explore view)
  • REST API: GET /api/v1/logs/search (see the example after this list)
  • On-prem UI: Built-in log viewer with tail functionality
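
A minimal sketch of calling the log search endpoint with curl, reusing the JobMaster host from the API examples later on this page; the query and time-range parameter names are assumptions, so check your API reference for the exact names:

# Parameter names (query, from, to) are hypothetical; verify against your API reference
curl -G "http://jobmaster:8080/api/v1/logs/search" \
  --data-urlencode 'query=job_id:"job-12345" AND level:"ERROR"' \
  --data-urlencode 'from=now-1h' \
  --data-urlencode 'to=now'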

InfluxDB v2 - Metrics Storage

InfluxDB stores time-series metrics for performance monitoring:

Collected Metrics:

  • Task Metrics: Input/output byte counts, event rates, error counts
  • Worker Metrics: CPU usage, memory utilization, task capacity
  • Job Metrics: Total processing time, task distribution, throughput

Query Language: Supports Flux for advanced time-series analysis:

from(bucket: "grid-metrics")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "task_metrics")
|> filter(fn: (r) => r.job_id == "job-12345")
|> aggregateWindow(every: 1m, fn: mean)
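
If you have CLI access to the InfluxDB instance, the same query can be run with the influx CLI; the organization and token below are placeholders, and you may also need --host to point at your InfluxDB URL:

influx query --org <your-org> --token <your-token> '
from(bucket: "grid-metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "task_metrics")
  |> filter(fn: (r) => r.job_id == "job-12345")
  |> aggregateWindow(every: 1m, fn: mean)
'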

Vector - Log Pipeline

Vector handles log ingestion and routing:

Pipeline Flow:

  1. Input: Listens on Logstash TCP port (default 9000); see the connectivity check after this list
  2. Transform: Enriches logs with metadata
    • job_id
    • task_id
    • worker_id
    • source identifier
  3. Output: Forwards to VictoriaLogs via HTTP
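
To confirm the listener is reachable from a worker node, a simple TCP check can help; the service name grid-vector below is a hypothetical example following the chart's naming pattern, so substitute your actual Vector service name:

# grid-vector is an assumed service name; substitute yours
nc -zv grid-vector 9000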

Reliability:

  • Automatic batching (1MB max batch size)
  • 5-second flush timeout
  • Built-in retry logic with exponential backoff
  • Memory and disk buffering

Configuration: Customize via Helm values:

vector:
  port: 9000
  batchMaxBytes: 1048576
  batchTimeout: 5

Grafana - Visualization Dashboard

Grafana provides unified observability across logs and metrics:

Pre-configured Dashboards:

  • Cluster Overview: Worker health, capacity, task distribution
  • Job Monitoring: Task states, throughput, error rates
  • Task Details: Individual task metrics and performance
  • System Health: Infrastructure metrics, resource usage

Data Sources:

  • InfluxDB v2 for metrics queries
  • VictoriaLogs for log exploration
  • Prometheus (optional) for Kubernetes metrics

Alerting:

  • Configure alerts on metric thresholds
  • Notification channels (email, Slack, PagerDuty)
  • Alert deduplication and grouping

Accessing Grafana Locally

For on-premises deployments, Grafana is deployed as part of the Grid Helm chart and can be accessed locally using Kubernetes port forwarding.

Port Forward to Grafana Service:

kubectl port-forward -n <namespace> svc/grid-grafana 3000:80

Replace <namespace> with your Grid deployment namespace (typically grid or default).

Access Grafana: Once port forwarding is active, open your browser to:

http://localhost:3000
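
To confirm the tunnel is working before logging in, Grafana's health endpoint can be queried directly:

curl -s http://localhost:3000/api/health
# A healthy instance reports "database": "ok"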

Default Credentials: The default Grafana admin credentials are configured in your Helm values. Check your deployment configuration or use:

kubectl get secret -n <namespace> grid-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
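
If the chart also stores the admin username in the same secret (the upstream Grafana chart uses an admin-user key; treat the key name as an assumption here), it can be decoded the same way:

# Key name admin-user is an assumption; verify with: kubectl describe secret -n <namespace> grid-grafana
kubectl get secret -n <namespace> grid-grafana -o jsonpath="{.data.admin-user}" | base64 --decode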

Pre-configured Data Sources: Grafana comes with pre-configured connections to:

  • InfluxDB v2: Access via "Data Sources" → "InfluxDB"
    • URL: http://grid-influxdb:8086
    • Organization and bucket are pre-configured
  • VictoriaLogs: Access via "Data Sources" → "VictoriaLogs" or "Loki"
    • URL: http://grid-victorialogs:9428

On-Premises Monitoring UI

Grid's on-premises deployment includes a browser-based monitoring interface accessible via the JobMaster service. The UI provides comprehensive visibility into cluster operations and job execution.

Cluster Overview

Monitor the health and capacity of your worker cluster:

Worker Status:

  • Worker ID and endpoint URL
  • Current task count vs. maximum capacity
  • State indicators: Active, Inactive, Pending
  • Last heartbeat timestamp
  • Registration and creation dates
  • Paginated view with configurable page size

Real-time Updates:

  • Automatic refresh every 5 seconds
  • Visual indicators for worker health
  • Capacity utilization bars
  • Unresponsive worker alerts

Job Management Dashboard

View and control all jobs in the system:

Job List Features:

  • Job ID with clickable links to detail view
  • External ID mapping for system integration
  • Task counters: Ready, Scheduled, Running, Completed, Failed
  • Status badges with color coding:
    • COMPLETED (green): All tasks finished successfully
    • RUNNING (blue): Job in progress
    • FAILED (red): One or more tasks failed
    • TIMEOUT (orange): Tasks exceeded time limits
    • MIXED (yellow): Combination of states
    • EMPTY (gray): No tasks in job
  • Creation date and sorting
  • Pagination controls

Bulk Operations:

  • Kill multiple jobs simultaneously
  • Retry failed jobs
  • Export job data

REST API for Programmatic Access

Grid exposes comprehensive REST APIs for integration with external monitoring tools and custom dashboards.

Metrics API

Query time-series metrics from InfluxDB:

Endpoint:

GET /api/v1/metrics

Query Parameters:

  • from: Start timestamp (ISO 8601 or relative like -1h)
  • to: End timestamp (ISO 8601 or now)
  • interval: Aggregation window (e.g., 1m, 5m, 1h)
  • limit: Maximum data points to return
  • tag.<key>=<value>: Filter by tag (e.g., tag.job_id=job-12345)

Response Format:

{
  "dataPoints": [
    {
      "timestamp": "2024-01-29T10:00:00Z",
      "metric": "task_throughput",
      "value": 1250.5,
      "tags": {
        "job_id": "job-12345",
        "task_id": "task-67890"
      }
    }
  ],
  "aggregation": "mean",
  "interval": "1m"
}

Example Query:

curl -X GET "http://jobmaster:8080/api/v1/metrics?from=-1h&to=now&interval=1m&tag.job_id=job-12345"
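
To pull just the timestamps and values out of the response shown above, the output can be piped through jq (assuming it is installed):

curl -s "http://jobmaster:8080/api/v1/metrics?from=-1h&to=now&interval=1m&tag.job_id=job-12345" \
  | jq '.dataPoints[] | {timestamp, value}'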

Job Status API

Manage and monitor jobs:

List All Jobs:

GET /api/v1/jobs?page=0&size=20

Get Job Details:

GET /api/v1/jobs/{jobId}

Response:

{
  "jobId": "job-12345",
  "externalId": "workflow-abc",
  "createdAt": "2024-01-29T09:00:00Z",
  "taskCounters": {
    "ready": 0,
    "scheduled": 2,
    "running": 5,
    "completed": 100,
    "failed": 3
  },
  "status": "RUNNING"
}

List Job Tasks:

GET /api/v1/jobs/{jobId}/tasks?page=0&size=50

Kill Job:

DELETE /api/v1/jobs/kill/{jobId}
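
A minimal shell sketch chaining these endpoints, reusing the JobMaster host from the metrics example above; the host, port, and job ID are placeholders for your deployment:

#!/usr/bin/env bash
BASE="http://jobmaster:8080/api/v1"
JOB_ID="job-12345"

# List the first page of jobs
curl -s "$BASE/jobs?page=0&size=20"

# Fetch details and the task list for one job
curl -s "$BASE/jobs/$JOB_ID"
curl -s "$BASE/jobs/$JOB_ID/tasks?page=0&size=50"

# Kill the job if it is stuck
curl -s -X DELETE "$BASE/jobs/kill/$JOB_ID"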

Task State Tracking

Grid maintains a comprehensive audit trail for every task through a well-defined state machine.

State Machine

READY → SCHEDULED → RUNNING → COMPLETED
                        ↘ FAILED → READY (if retries available)

State Descriptions:

  • READY: Task is queued and waiting for worker assignment
  • SCHEDULED: Task assigned to worker but not yet started
  • RUNNING: Task is actively executing on worker
  • COMPLETED: Task finished successfully
  • FAILED: Task encountered an error (may retry)

Retry Logic

Failed tasks are automatically retried based on configuration:

Retry Conditions:

  • Task has not exceeded max retry count
  • Failure is not marked as non-retryable
  • Worker is healthy and available

Retry Strategy:

  • Exponential backoff between attempts
  • Configurable max retries (default 3)
  • Retry delay: 2^(retry_count) seconds (see the sketch after this list)
  • Reset to READY state for rescheduling
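
A quick sketch of the resulting backoff schedule, assuming the retry count starts at 1:

# Delay before each attempt is 2^(retry_count) seconds: 2s, 4s, 8s for the default max of 3 retries
for retry_count in 1 2 3; do
  echo "retry $retry_count: waiting $((2 ** retry_count))s"
done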

Non-Retryable Failures:

  • Invalid DAG configuration
  • Authentication failures
  • Unrecoverable errors (as marked by task)
Tip:

Set up Grafana alerts to notify you proactively of issues rather than manually checking dashboards. This reduces mean time to detection (MTTD) significantly.

Info:

For questions about monitoring configuration or troubleshooting, contact support@fleak.ai.