Monitoring and Debugging

Grid provides comprehensive observability for distributed data pipeline execution through an integrated monitoring stack. Access real-time metrics, logs, and debugging tools whether running on-premises or through Fleak's managed platform.

Monitoring Stack Components

Grid's observability infrastructure consists of four integrated components deployed via Helm charts:

VictoriaLogs - Log Aggregation

VictoriaLogs provides centralized log collection and search capabilities:

Features:

  • Collects logs from all job tasks across worker nodes
  • Flexible querying with LogSQL syntax
  • Filter by job ID, task ID, time ranges, and custom fields
  • 10GB persistent storage (configurable via Helm values)
  • Efficient compression and indexing

Query Capabilities:

# Find all logs for a specific job
job_id:"job-12345"

# Search for errors in a task
task_id:"task-67890" AND level:"ERROR"

# Time-range filtered logs
job_id:"job-12345" AND _time:[now-1h, now]

Access:

  • Via Grafana dashboards (Explore view)
  • REST API: GET /api/v1/logs/search (see the example after this list)
  • On-prem UI: Built-in log viewer with tail functionality
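
A minimal sketch of calling the log search endpoint with curl, reusing the JobMaster host from the API examples later on this page; the query and time-range parameter names are assumptions, so check your API reference for the exact names:

# Parameter names (query, from, to) are hypothetical; verify against your API reference
curl -G "http://jobmaster:8080/api/v1/logs/search" \
  --data-urlencode 'query=job_id:"job-12345" AND level:"ERROR"' \
  --data-urlencode 'from=now-1h' \
  --data-urlencode 'to=now'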

InfluxDB v2 - Metrics Storage

InfluxDB stores time-series metrics for performance monitoring:

Collected Metrics:

  • Task Metrics: Input/output byte counts, event rates, error counts
  • Worker Metrics: CPU usage, memory utilization, task capacity
  • Job Metrics: Total processing time, task distribution, throughput

Query Language: Supports Flux for advanced time-series analysis:

from(bucket: "grid-metrics")
|> range(start: -1h)
|> filter(fn: (r) => r._measurement == "task_metrics")
|> filter(fn: (r) => r.job_id == "job-12345")
|> aggregateWindow(every: 1m, fn: mean)
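
If you have CLI access to the InfluxDB instance, the same query can be run with the influx CLI; the organization and token below are placeholders, and you may also need --host to point at your InfluxDB URL:

influx query --org <your-org> --token <your-token> '
from(bucket: "grid-metrics")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "task_metrics")
  |> filter(fn: (r) => r.job_id == "job-12345")
  |> aggregateWindow(every: 1m, fn: mean)
'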

Vector - Log Pipeline

Vector handles log ingestion and routing:

Pipeline Flow:

  1. Input: Listens on Logstash TCP port (default 9000); see the connectivity check after this list
  2. Transform: Enriches logs with metadata
    • job_id
    • task_id
    • worker_id
    • source identifier
  3. Output: Forwards to VictoriaLogs via HTTP
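
To confirm the listener is reachable from a worker node, a simple TCP check can help; the service name grid-vector below is a hypothetical example following the chart's naming pattern, so substitute your actual Vector service name:

# grid-vector is an assumed service name; substitute yours
nc -zv grid-vector 9000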

Reliability:

  • Automatic batching (1MB max batch size)
  • 5-second flush timeout
  • Built-in retry logic with exponential backoff
  • Memory and disk buffering

Configuration: Customize via Helm values:

vector:
  port: 9000
  batchMaxBytes: 1048576
  batchTimeout: 5

Grafana - Visualization Dashboard

Grafana provides unified observability across logs and metrics:

Pre-configured Dashboards:

  • Cluster Overview: Worker health, capacity, task distribution
  • Job Monitoring: Task states, throughput, error rates
  • Task Details: Individual task metrics and performance
  • System Health: Infrastructure metrics, resource usage

Data Sources:

  • InfluxDB v2 for metrics queries
  • VictoriaLogs for log exploration
  • Prometheus (optional) for Kubernetes metrics

Alerting:

  • Configure alerts on metric thresholds
  • Notification channels (email, Slack, PagerDuty)
  • Alert deduplication and grouping

Accessing Grafana Locally

For on-premises deployments, Grafana is deployed as part of the Grid Helm chart and can be accessed locally using Kubernetes port forwarding.

Port Forward to Grafana Service:

kubectl port-forward -n <namespace> svc/grid-grafana 3000:80

Replace <namespace> with your Grid deployment namespace (typically grid or default).

Access Grafana: Once port forwarding is active, open your browser to:

http://localhost:3000
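
To confirm the tunnel is working before logging in, Grafana's health endpoint can be queried directly:

curl -s http://localhost:3000/api/health
# A healthy instance reports "database": "ok"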

Default Credentials: The default Grafana admin credentials are configured in your Helm values. Check your deployment configuration or use:

kubectl get secret -n <namespace> grid-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
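
If the chart also stores the admin username in the same secret (the upstream Grafana chart uses an admin-user key; treat the key name as an assumption here), it can be decoded the same way:

# Key name admin-user is an assumption; verify with: kubectl describe secret -n <namespace> grid-grafana
kubectl get secret -n <namespace> grid-grafana -o jsonpath="{.data.admin-user}" | base64 --decode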

Pre-configured Data Sources: Grafana comes with pre-configured connections to:

  • InfluxDB v2: Access via "Data Sources" → "InfluxDB"
    • URL: http://grid-influxdb:8086
    • Organization and bucket are pre-configured
  • VictoriaLogs: Access via "Data Sources" → "VictoriaLogs" or "Loki"
    • URL: http://grid-victorialogs:9428

On-Premises Monitoring UI

Grid's on-premises deployment includes a browser-based monitoring interface accessible via the JobMaster service. The UI provides comprehensive visibility into cluster operations and job execution.

Cluster Overview

Monitor the health and capacity of your worker cluster:

Worker Status:

  • Worker ID and endpoint URL
  • Current task count vs. maximum capacity
  • State indicators: Active, Inactive, Pending
  • Last heartbeat timestamp
  • Registration and creation dates
  • Paginated view with configurable page size

Real-time Updates:

  • Automatic refresh every 5 seconds
  • Visual indicators for worker health
  • Capacity utilization bars
  • Unresponsive worker alerts

Job Management Dashboard

View and control all jobs in the system:

Job List Features:

  • Job ID with clickable links to detail view
  • External ID mapping for system integration
  • Task counters: Ready, Scheduled, Running, Completed, Failed
  • Status badges with color coding:
    • COMPLETED (green): All tasks finished successfully
    • RUNNING (blue): Job in progress
    • FAILED (red): One or more tasks failed
    • TIMEOUT (orange): Tasks exceeded time limits
    • MIXED (yellow): Combination of states
    • EMPTY (gray): No tasks in job
  • Creation date and sorting
  • Pagination controls

Bulk Operations:

  • Kill multiple jobs simultaneously
  • Retry failed jobs
  • Export job data

REST API for Programmatic Access

Grid exposes comprehensive REST APIs for integration with external monitoring tools and custom dashboards.

Metrics API

Query time-series metrics from InfluxDB:

Endpoint:

GET /api/v1/metrics

Query Parameters:

  • from: Start timestamp (ISO 8601 or relative like -1h)
  • to: End timestamp (ISO 8601 or now)
  • interval: Aggregation window (e.g., 1m, 5m, 1h)
  • limit: Maximum data points to return
  • tag.<key>=<value>: Filter by tag (e.g., tag.job_id=job-12345)

Response Format:

{
  "dataPoints": [
    {
      "timestamp": "2024-01-29T10:00:00Z",
      "metric": "task_throughput",
      "value": 1250.5,
      "tags": {
        "job_id": "job-12345",
        "task_id": "task-67890"
      }
    }
  ],
  "aggregation": "mean",
  "interval": "1m"
}

Example Query:

curl -X GET "http://jobmaster:8080/api/v1/metrics?from=-1h&to=now&interval=1m&tag.job_id=job-12345"
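
To pull just the timestamps and values out of the response shown above, the output can be piped through jq (assuming it is installed):

curl -s "http://jobmaster:8080/api/v1/metrics?from=-1h&to=now&interval=1m&tag.job_id=job-12345" \
  | jq '.dataPoints[] | {timestamp, value}'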

Job Status API

Manage and monitor jobs:

List All Jobs:

GET /api/v1/jobs?page=0&size=20

Get Job Details:

GET /api/v1/jobs/{jobId}

Response:

{
  "jobId": "job-12345",
  "externalId": "workflow-abc",
  "createdAt": "2024-01-29T09:00:00Z",
  "taskCounters": {
    "ready": 0,
    "scheduled": 2,
    "running": 5,
    "completed": 100,
    "failed": 3
  },
  "status": "RUNNING"
}

List Job Tasks:

GET /api/v1/jobs/{jobId}/tasks?page=0&size=50

Kill Job:

DELETE /api/v1/jobs/kill/{jobId}
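
A minimal shell sketch chaining these endpoints, reusing the JobMaster host from the metrics example above; the host, port, and job ID are placeholders for your deployment:

#!/usr/bin/env bash
BASE="http://jobmaster:8080/api/v1"
JOB_ID="job-12345"

# List the first page of jobs
curl -s "$BASE/jobs?page=0&size=20"

# Fetch details and the task list for one job
curl -s "$BASE/jobs/$JOB_ID"
curl -s "$BASE/jobs/$JOB_ID/tasks?page=0&size=50"

# Kill the job if it is stuck
curl -s -X DELETE "$BASE/jobs/kill/$JOB_ID"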

Task State Tracking

Grid maintains a comprehensive audit trail for every task through a well-defined state machine.

State Machine

READY → SCHEDULED → RUNNING → COMPLETED
                        ↘ FAILED → READY (if retries available)

State Descriptions:

  • READY: Task is queued and waiting for worker assignment
  • SCHEDULED: Task assigned to worker but not yet started
  • RUNNING: Task is actively executing on worker
  • COMPLETED: Task finished successfully
  • FAILED: Task encountered an error (may retry)

Retry Logic

Failed tasks are automatically retried based on configuration:

Retry Conditions:

  • Task has not exceeded max retry count
  • Failure is not marked as non-retryable
  • Worker is healthy and available

Retry Strategy:

  • Exponential backoff between attempts
  • Configurable max retries (default 3)
  • Retry delay: 2^(retry_count) seconds (see the sketch after this list)
  • Reset to READY state for rescheduling
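
A quick sketch of the resulting backoff schedule, assuming the retry count starts at 1:

# Delay before each attempt is 2^(retry_count) seconds: 2s, 4s, 8s for the default max of 3 retries
for retry_count in 1 2 3; do
  echo "retry $retry_count: waiting $((2 ** retry_count))s"
done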

Non-Retryable Failures:

  • Invalid DAG configuration
  • Authentication failures
  • Unrecoverable errors (as marked by task)
Tip:

Set up Grafana alerts to notify you proactively of issues rather than manually checking dashboards. This reduces mean time to detection (MTTD) significantly.

Info:

For questions about monitoring configuration or troubleshooting, contact support@fleak.ai.