DeltaFile Analytics Capability

DeltaFile analytics provide insight into the behavior, performance, and volumetrics of processed DeltaFiles, broken down by dataSource and by any applied annotations.

DeltaFile Analytics

The following events result in creation or updates to analytic events:

  • Creation of a DeltaFile
  • Egress of a DeltaFile
  • Annotation of a DeltaFile
  • Error events
  • Filter events
  • Cancellation of a DeltaFile

Data Retention

Raw analytics data is retained for 3 days, while continuous aggregate views maintain data for up to 30 days. This allows you to query historical trends even after the raw data has been purged.

DeltaFile Analytics Grafana Dashboards

DeltaFi ships with three default Grafana dashboards for analytics visualization:

  1. DeltaFile Analytics
  2. Error Analysis
  3. Filter Analysis

These dashboards are located in the Grafana "DeltaFi" folder. They serve as examples of how to visualize DeltaFi data, and you can customize or extend them as needed.

Survey API for Analytics

/api/v2/survey is the REST API for injecting "survey" data into the DeltaFi analytics database. It accepts an array of DeltaFile analytic entries and inserts them directly into the analytics database, without processing the data through DeltaFi dataSources. This allows hypothetical, observed, or externally processed data to appear alongside the analytic data that is recorded automatically for processed DeltaFiles. Survey data can be viewed in Grafana dashboards and is distinguished from other analytic data by the pattern of its did field (/^survey-.*/).

Endpoint: /api/v2/survey

Method: POST

Request Body

The request body should be a JSON array containing survey data. Each element in the array is an individual DeltaFile analytic entry. The following fields are supported in the analytic entries:
  • timestamp: Creation timestamp for the entry. Defaults to the current time. (optional)
  • dataSource: The name of the dataSource being surveyed. (required)
  • files: The number of files being surveyed. Must be greater than 0. (required)
  • ingressBytes: Ingress bytes for the survey entry. Must be greater than or equal to 0. (optional; ingress_bytes is also supported for backward compatibility)
Any additional fields will be stored as annotations on the survey entry, provided that the annotation key is allow-listed (see Analytics Configuration below).

Response

On success, the API returns HTTP 200 with no body.

If there are validation errors, it returns HTTP 400 with a JSON body containing a message for each invalid event:

json
  {
    "error": [
      {
        "message": "Error message describing the issue",
        "event": { "...": "..." }
      }
    ],
    "timestamp": "2023-08-21T10:00:00Z"
  }

If analytics are disabled, it returns HTTP 501 with:

json
  {
    "error": "Survey analytics are disabled",
    "timestamp": "2023-08-21T10:00:00Z"
  }

Example Survey Post

json
    [
      {
        "dataSource": "Sample",
        "files": 1,
        "timestamp": "2023-08-21T10:00:00Z",
        "ingressBytes": 100,
        "annotation1": "value1",
        "annotation2": "value2"
      },
      {
        "dataSource": "Sample",
        "files": 1,
        "timestamp": "2023-08-21T10:01:00Z",
        "ingressBytes": 200,
        "annotation1": "value1",
        "annotation2": "value3"
      }
    ]

You can post the survey to /api/v2/survey with a command like:

bash
# Example Usage
curl -X POST -H "Content-Type: application/json" -d '[
  {
    "dataSource": "Sample",
    "files": 1,
    "ingressBytes": 100,
    "annotation1": "value1"
  }
]' http://your-api-endpoint/api/v2/survey
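
To check the result from a script, curl can report the HTTP status code. A minimal sketch using standard curl options (the endpoint host is a placeholder):

bash
# Post a survey entry and print only the HTTP status code:
# 200 on success, 400 on validation errors, 501 if analytics are disabled
curl -s -o /dev/null -w "%{http_code}\n" -X POST \
  -H "Content-Type: application/json" \
  -d '[{"dataSource": "Sample", "files": 1}]' \
  http://your-api-endpoint/api/v2/survey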

Timestamp Format

The following formats can be used for timestamp fields:

  • ISO 8601 Format (with timezone or offset), for example: 2023-08-21T10:00:00Z or 2023-08-21T10:00:00+05:30
  • RFC 2822 Format, for example: Mon, 21 Aug 2023 10:00:00 +0530
  • Ruby Date-Time String, for example: Mon Aug 21 10:00:00 UTC 2023
  • Short Date Format, for example: 2023-08-21
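
For example, each of the following survey entries uses a different accepted timestamp format (entries are illustrative):

json
  [
    { "dataSource": "Sample", "files": 1, "timestamp": "2023-08-21T10:00:00Z" },
    { "dataSource": "Sample", "files": 1, "timestamp": "Mon, 21 Aug 2023 10:00:00 +0530" },
    { "dataSource": "Sample", "files": 1, "timestamp": "Mon Aug 21 10:00:00 UTC 2023" },
    { "dataSource": "Sample", "files": 1, "timestamp": "2023-08-21" }
  ]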

Analytics Configuration

Annotation Allow-List: By default, no annotations submitted with a DeltaFile or survey event are stored in the analytics system. If you wish to record annotations for grouping or filtering, you must explicitly specify a comma-separated list of keys in the allowedAnalyticsAnnotations property (configured via the GUI or TUI). Only those annotation keys present in this allow-list will be inserted into the analytics database.

Example configuration in the GUI or TUI:

allowedAnalyticsAnnotations="annotation1,annotation2,annotation3"

This configuration ensures that only the keys annotation1, annotation2, and annotation3 are recorded in analytics.
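
For example, given the allow-list above, a survey entry that also includes a key outside the allow-list would have that key dropped. In the illustrative entry below, annotation1 is stored while annotation4 (not allow-listed) is not recorded:

json
  [
    {
      "dataSource": "Sample",
      "files": 1,
      "annotation1": "recorded",
      "annotation4": "not recorded"
    }
  ]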

Analytics Group Name: Analytics supports an additional top-level group for classification, controlled by the analyticsGroupName property. When a new analytic event is recorded, if an annotation key matches the analyticsGroupName, its value is used as the event group. Otherwise, the event group is recorded as "Not Present".

Example configuration in the GUI or TUI:

analyticsGroupName="annotation1"

If an annotation named annotation1 is present, its value becomes the event group in the analytics table; if not, "Not Present" is used. This provides an extra grouping dimension on top of the data source.
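
Continuing the example, the first illustrative entry below would be recorded with event group groupA, while the second, which lacks annotation1, would be recorded with event group "Not Present":

json
  [
    { "dataSource": "Sample", "files": 1, "annotation1": "groupA" },
    { "dataSource": "Sample", "files": 1 }
  ]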

Note: Analytics configurations will only apply to events recorded after the change. They are not retroactively applied to existing records.

Querying Analytics Data

The get_analytics_data function provides a standardized way to query analytics data with proper handling of annotations and filtering:

sql
get_analytics_data(
  p_annotation_key_name text,        -- Annotation key name to filter by, use 'All' for all
  p_annotation_values_text text[],   -- Array of annotation value names, include 'All' for all values
  p_datasources_text text[],         -- Array of data source names, include 'All' for all
  p_groups_text text[],              -- Array of group names, include 'All' for all
  p_analytic_ingress_types text[],   -- Array of ingress types (DATA_SOURCE, CHILD, SURVEY), include 'All' for all
  p_start_time timestamptz,          -- Start time for the query
  p_end_time timestamptz,            -- End time for the query
  p_interval_str text                -- Time bucket interval (e.g., '5 minutes')
)
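
For example, the function can be called directly (outside Grafana) for a single data source over the last day in 5-minute buckets; the data source name and time range here are illustrative:

sql
SELECT *
FROM get_analytics_data(
  'All',                        -- no annotation key filter
  ARRAY['All'],                 -- all annotation values
  ARRAY['Sample'],              -- a single data source
  ARRAY['All'],                 -- all groups
  ARRAY['All'],                 -- all ingress types
  now() - interval '24 hours',  -- start time
  now(),                        -- end time
  '5 minutes'                   -- bucket interval
);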

Parameter Usage

  • To select all values for a parameter, use 'All' in the array
  • For annotation values, the function will automatically handle "No Annotation" data
  • For time range parameters in Grafana, use $__timeFrom() and $__timeTo()
  • For interval, use ($__interval_ms || ' ms') to adapt to zoom level
  • Missing or unknown groups are labeled as "No Group"

Return Values

The function returns a table with the following columns:

  • time: Timestamp bucket
  • datasource_name: Name of the data source
  • group_name: Name of the event group
  • annotation_value: Value of the annotation (or "No Annotation")
  • ingress_bytes: Total bytes ingressed
  • ingress_files: Count of files ingressed
  • egress_bytes: Total bytes egressed
  • egress_files: Count of files egressed
  • error_files: Count of files with errors
  • filter_files: Count of filtered files

This output format works directly with Grafana's time series and table visualizations.

Querying Error and Filter Data

The get_errors_filters_data function provides detailed analysis of errors and filters with additional dimensions:

sql
get_errors_filters_data(
  p_annotation_key_name text,        -- Annotation key name to filter by, use 'All' for all
  p_annotation_values_text text[],   -- Array of annotation value names, include 'All' for all values
  p_datasources_text text[],         -- Array of data source names, include 'All' for all
  p_groups_text text[],              -- Array of group names, include 'All' for all
  p_start_time timestamptz,          -- Start time for the query
  p_end_time timestamptz,            -- End time for the query
  p_interval_str text,               -- Time bucket interval (e.g., '5 minutes')
  p_event_type text DEFAULT 'BOTH'   -- Filter by 'ERRORS', 'FILTERS', or 'BOTH'
)
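
As with get_analytics_data, the function can be called directly; for example, hourly error counts for all data sources over the last week (time range is illustrative):

sql
SELECT *
FROM get_errors_filters_data(
  'All',          -- no annotation key filter
  ARRAY['All'],   -- all annotation values
  ARRAY['All'],   -- all data sources
  ARRAY['All'],   -- all groups
  now() - interval '7 days',
  now(),
  '1 hour',
  'ERRORS'        -- errors only
);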

Parameter Usage

  • To select all values for a parameter, use 'All' in the array
  • For annotation values, the function will automatically handle "No Annotation" data
  • For time range parameters in Grafana, use $__timeFrom() and $__timeTo()
  • For interval, use ($__interval_ms || ' ms') to adapt to zoom level
  • Missing or unknown groups are labeled as "No Group"

Return Values

The function returns a table with the following columns:

  • time: Timestamp bucket
  • datasource_name: Name of the data source
  • group_name: Name of the event group
  • annotation_value: Value of the annotation (or "No Annotation")
  • flow_name: Name of the flow where the error or filter occurred
  • action_name: Name of the action where the error or filter occurred
  • cause: Error or filter cause
  • error_files: Count of files with errors
  • filter_files: Count of filtered files

Grafana Integration

Both functions are designed to work with Grafana's visualization capabilities. For optimal performance in time series visualizations, you should:

  • Call the function directly for the data you need
  • Apply time_bucket_gapfill in your query if needed for visualization

Example Grafana Queries

sql
-- Time series visualization query with gap filling
SELECT
  time_bucket_gapfill(($__interval_ms || ' ms')::interval, "time") AS gapped_time,
  CASE
    WHEN $annotationKey = 'All' THEN datasource_name || ' - ' || group_name
    ELSE datasource_name || ' - ' || group_name || ' - ' || annotation_value
  END AS series_name,
  -- time_bucket_gapfill requires grouped, aggregated results
  SUM(ingress_bytes) AS ingress_bytes,
  SUM(ingress_files) AS ingress_files,
  SUM(egress_bytes) AS egress_bytes,
  SUM(egress_files) AS egress_files,
  SUM(error_files) AS error_files,
  SUM(filter_files) AS filter_files
FROM get_analytics_data(
  $annotationKey,
  ARRAY[$annotationValues],
  ARRAY[$dataSources],
  ARRAY[$groups],
  ARRAY['All'],  -- all ingress types
  $__timeFrom(),
  $__timeTo(),
  ($__interval_ms || ' ms')
)
GROUP BY gapped_time, series_name
ORDER BY gapped_time, series_name

-- Summary table query (no gap filling needed)
SELECT
  datasource_name AS "Data Source",
  group_name AS "Group",
  CASE
    WHEN $annotationKey != 'All' THEN annotation_value
    END AS "Annotation Value",
  SUM(ingress_bytes) AS "Ingress Bytes",
  SUM(ingress_files) AS "Ingress Files",
  SUM(egress_bytes) AS "Egress Bytes",
  SUM(egress_files) AS "Egress Files",
  SUM(error_files) AS "Errored",
  SUM(filter_files) AS "Filtered"
FROM get_analytics_data(
  $annotationKey,
  ARRAY[$annotationValues],
  ARRAY[$dataSources],
  ARRAY[$groups],
  ARRAY['All'],  -- all ingress types
  $__timeFrom(),
  $__timeTo(),
  '100 years' -- Using a very large interval for summary
)
GROUP BY
  datasource_name,
  group_name,
  CASE
    WHEN $annotationKey != 'All' THEN annotation_value
    END
ORDER BY "Data Source", "Group", "Annotation Value"

-- Error analysis with gap filling
SELECT
  time_bucket_gapfill(($__interval_ms || ' ms')::interval, "time") AS gapped_time,
  datasource_name || ' - ' || cause AS series_name,
  SUM(error_files) AS errors
FROM get_errors_filters_data(
  'All',
  ARRAY['All'],
  ARRAY[$dataSources],
  ARRAY['All'],
  $__timeFrom(),
  $__timeTo(),
  ($__interval_ms || ' ms'),
  'ERRORS'
)
GROUP BY gapped_time, datasource_name, cause
ORDER BY gapped_time, series_name

Performance Considerations

The analytics functions have been optimized for performance:

  • Time bucketing is applied early in the query pipeline to reduce data volume
  • Gap filling is left to the caller to avoid unnecessary processing for aggregate queries
  • Dimensions are joined efficiently to minimize row multiplication
  • "No Annotation" values are handled properly with minimal overhead

For best performance:

  • Use reasonable time intervals appropriate to your data density
  • Apply additional filtering (datasources, groups, annotations) when possible
  • For summary tables, use a large interval (e.g., '100 years') to avoid unnecessary bucketing
  • Only use gap filling for time series visualizations where it's needed
