DeltaFile Analytics Capability
DeltaFile analytics provide insight into the behavior, performance, and volumetrics of processed DeltaFiles, broken out by dataSource and any applied annotations.
DeltaFile Analytics
The following operations create or update analytic events:
- Creation of a DeltaFile
- Egress of a DeltaFile
- Annotation of a DeltaFile
- Error events
- Filter events
- Cancellation of a DeltaFile
Data Retention
Raw analytics data is retained for 3 days, while continuous aggregate views maintain data for up to 30 days. This allows you to query historical trends even after the raw data has been purged.
DeltaFile Analytics Grafana Dashboards
DeltaFi ships with three default Grafana dashboards for analytics visualization:
- DeltaFile Analytics
- Error Analysis
- Filter Analysis
These dashboards are located in the Grafana "DeltaFi" folder. They serve as examples of how to visualize DeltaFi data, and you can customize or extend them as needed.
Survey API for Analytics
/api/v2/survey is the REST API for injecting "survey" data into the DeltaFi analytics database. It accepts an array of DeltaFile analytic entries and inserts them directly into the analytics database, without processing the data through DeltaFi dataSources. This allows hypothetical, observed, or orthogonally processed data to be represented alongside the analytic data that is automatically recorded for DeltaFiles during normal DeltaFi processing. Survey data can be viewed in the Grafana dashboards and is distinguished from other analytic data by the pattern of its did field (/^survey-.*/).
- Endpoint
- /api/v2/survey
- Method
- POST
- Request Body
- The request body should be a JSON array containing survey data. Each element in the array is an individual DeltaFile analytic entry. The following fields are supported in the analytic entries:
- timestamp: Creation timestamp for the entry. Defaults to the current time.
- dataSource: The name of the dataSource being surveyed (required)
- files: The number of files being surveyed. Must be greater than 0. (required)
- ingressBytes: Ingress bytes for the survey entry. Must be greater than or equal to 0. (ingress_bytes is also supported for backward compatibility)
Any additional keys in an entry (for example, annotation1 in the example below) are treated as annotations; see Analytics Configuration for how annotation keys are recorded.
Response
On success, the API returns HTTP 200 with no body.
If there are validation errors, it returns HTTP 400 with a JSON body containing a message for each invalid event:
{
"error": [
{
"message": "Error message describing the issue",
"event": { "...": "..." }
}
],
"timestamp": "2023-08-21T10:00:00Z"
}
If analytics are disabled, it returns HTTP 501 with:
{
"error": "Survey analytics are disabled",
"timestamp": "2023-08-21T10:00:00Z"
}
Example Survey Post
[
{
"dataSource": "Sample",
"files": 1,
"timestamp": "2023-08-21T10:00:00Z",
"ingressBytes": 100,
"annotation1": "value1",
"annotation2": "value2"
},
{
"dataSource": "Sample",
"files": 1,
"timestamp": "2023-08-21T10:01:00Z",
"ingressBytes": 200,
"annotation1": "value1",
"annotation2": "value3"
}
]
You can post the survey to /api/v2/survey with a command like:
# Example Usage
curl -X POST -H "Content-Type: application/json" -d '[
{
"dataSource": "Sample",
"files": 1,
"ingressBytes": 100,
"annotation1": "value1"
}
]' http://your-api-endpoint/api/v2/survey
Timestamp Format
The following formats can be used for timestamp fields:
- ISO 8601 Format (with timezone or offset), for example: 2023-08-21T10:00:00Z or 2023-08-21T10:00:00+05:30
- RFC 2822 Format, for example: Mon, 21 Aug 2023 10:00:00 +0530
- Ruby Date-Time String, for example: Mon Aug 21 10:00:00 UTC 2023
- Short Date Format, for example: 2023-08-21
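For illustration, here is a survey payload (with arbitrary values) that mixes several of the accepted timestamp formats:
[
{
"dataSource": "Sample",
"files": 1,
"ingressBytes": 100,
"timestamp": "2023-08-21T10:00:00+05:30"
},
{
"dataSource": "Sample",
"files": 1,
"ingressBytes": 100,
"timestamp": "Mon, 21 Aug 2023 10:00:00 +0530"
},
{
"dataSource": "Sample",
"files": 1,
"ingressBytes": 100,
"timestamp": "2023-08-21"
}
]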
Analytics Configuration
Annotation Allow-List: By default, no annotations submitted with a DeltaFile or survey event are stored in the analytics system. If you wish to record annotations for grouping or filtering, you must explicitly specify a comma-separated list of keys in the allowedAnalyticsAnnotations property (configured via the GUI or TUI). Only those annotation keys present in this allow-list will be inserted into the analytics database.
Example configuration in the GUI or TUI:
allowedAnalyticsAnnotations="annotation1,annotation2,annotation3"
This configuration ensures that only the keys annotation1, annotation2, and annotation3 are recorded in analytics.
Analytics Group Name: Analytics supports an additional top-level group for classification, controlled by the analyticsGroupName property. When a new analytic event is recorded, if an annotation key matches the analyticsGroupName, its value is used as the event group. Otherwise, the event group is recorded as "Not Present".
Example configuration in the GUI or TUI:
analyticsGroupName="annotation1"
If an annotation named annotation1 is present, its value becomes the event group in the analytics table; if not, "Not Present" is used. This provides an extra grouping dimension on top of the data source.
Note: Analytics configurations will only apply to events recorded after the change. They are not retroactively applied to existing records.
Querying Analytics Data
The get_analytics_data function provides a standardized way to query analytics data with proper handling of annotations and filtering:
get_analytics_data(
p_annotation_key_name text, -- Annotation key name to filter by, use 'All' for all
p_annotation_values_text text[], -- Array of annotation value names, include 'All' for all values
p_datasources_text text[], -- Array of data source names, include 'All' for all
p_groups_text text[], -- Array of group names, include 'All' for all
p_analytic_ingress_types text[], -- Array of ingress types (DATA_SOURCE, CHILD, SURVEY), include 'All' for all
p_start_time timestamptz, -- Start time for the query
p_end_time timestamptz, -- End time for the query
p_interval_str text -- Time bucket interval (e.g., '5 minutes')
)
Parameter Usage
- To select all values for a parameter, use 'All' in the array
- For annotation values, the function will automatically handle "No Annotation" data
- For time range parameters in Grafana, use $__timeFrom() and $__timeTo()
- For interval, use ($__interval_ms || ' ms') to adapt to zoom level
- Missing or unknown groups are labeled as "No Group"
Return Values
The function returns a table with the following columns:
- time: Timestamp bucket
- datasource_name: Name of the data source
- group_name: Name of the event group
- annotation_value: Value of the annotation (or "No Annotation")
- ingress_bytes: Total bytes ingressed
- ingress_files: Count of files ingressed
- egress_bytes: Total bytes egressed
- egress_files: Count of files egressed
- error_files: Count of files with errors
- filter_files: Count of filtered files
This output format works directly with Grafana's time series and table visualizations.
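For ad-hoc exploration outside Grafana, the function can also be called with literal arguments. The sketch below is illustrative only: it assumes a hypothetical data source named passthrough and queries hourly totals for the last day; adjust the names, time range, and interval to your deployment.
-- Ad-hoc example: hourly totals for one (hypothetical) data source over the last day
SELECT
"time",
datasource_name,
ingress_files,
ingress_bytes,
egress_files,
error_files
FROM get_analytics_data(
'All',                      -- no annotation key filter
ARRAY['All'],               -- all annotation values
ARRAY['passthrough'],       -- hypothetical data source name
ARRAY['All'],               -- all groups
ARRAY['All'],               -- all ingress types (DATA_SOURCE, CHILD, SURVEY)
now() - interval '1 day',   -- start time
now(),                      -- end time
'1 hour'                    -- time bucket interval
)
ORDER BY "time", datasource_name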
Querying Error and Filter Data
The get_errors_filters_data function provides detailed analysis of errors and filters with additional dimensions:
get_errors_filters_data(
p_annotation_key_name text, -- Annotation key name to filter by, use 'All' for all
p_annotation_values_text text[], -- Array of annotation value names, include 'All' for all values
p_datasources_text text[], -- Array of data source names, include 'All' for all
p_groups_text text[], -- Array of group names, include 'All' for all
p_start_time timestamptz, -- Start time for the query
p_end_time timestamptz, -- End time for the query
p_interval_str text, -- Time bucket interval (e.g., '5 minutes')
p_event_type text DEFAULT 'BOTH' -- Filter by 'ERRORS', 'FILTERS', or 'BOTH'
)
Parameter Usage
- To select all values for a parameter, use 'All' in the array
- For annotation values, the function will automatically handle "No Annotation" data
- For time range parameters in Grafana, use $__timeFrom() and $__timeTo()
- For interval, use ($__interval_ms || ' ms') to adapt to zoom level
- Missing or unknown groups are labeled as "No Group"
Return Values
The function returns a table with the following columns:
- time: Timestamp bucket
- datasource_name: Name of the data source
- group_name: Name of the event group
- annotation_value: Value of the annotation (or "No Annotation")
- flow_name: Name of the flow where the error or filter occurred
- action_name: Name of the action where the error or filter occurred
- cause: Error or filter cause
- error_files: Count of files with errors
- filter_files: Count of filtered files
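This function can likewise be called directly with literal arguments. The sketch below is illustrative (the time range and bucket size are arbitrary); it totals errors by flow, action, and cause across all data sources for the last day.
-- Ad-hoc example: error counts by flow, action, and cause over the last day
SELECT
flow_name,
action_name,
cause,
SUM(error_files) AS error_files
FROM get_errors_filters_data(
'All',                      -- no annotation key filter
ARRAY['All'],               -- all annotation values
ARRAY['All'],               -- all data sources
ARRAY['All'],               -- all groups
now() - interval '1 day',   -- start time
now(),                      -- end time
'1 day',                    -- single bucket spanning the whole range
'ERRORS'                    -- errors only
)
GROUP BY flow_name, action_name, cause
ORDER BY error_files DESC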
Grafana Integration
Both functions are designed to work with Grafana's visualization capabilities. For optimal performance in time series visualizations, you should:
- Call the function directly for the data you need
- Apply time_bucket_gapfill in your query if needed for visualization
Example Grafana Queries
-- Time series visualization query with gap filling
SELECT
time_bucket_gapfill(($__interval_ms || ' ms')::interval, "time") AS gapped_time,
CASE
WHEN $annotationKey = 'All' THEN datasource_name || ' - ' || group_name
ELSE datasource_name || ' - ' || group_name || ' - ' || annotation_value
END AS series_name,
SUM(ingress_bytes) AS ingress_bytes,
SUM(ingress_files) AS ingress_files,
SUM(egress_bytes) AS egress_bytes,
SUM(egress_files) AS egress_files,
SUM(error_files) AS error_files,
SUM(filter_files) AS filter_files
FROM get_analytics_data(
$annotationKey,
ARRAY[$annotationValues],
ARRAY[$dataSources],
ARRAY[$groups],
ARRAY['All'], -- all ingress types (DATA_SOURCE, CHILD, SURVEY)
$__timeFrom(),
$__timeTo(),
($__interval_ms || ' ms')
)
GROUP BY gapped_time, series_name
ORDER BY gapped_time, series_name
-- Summary table query (no gap filling needed)
SELECT
datasource_name AS "Data Source",
group_name AS "Group",
CASE
WHEN $annotationKey != 'All' THEN annotation_value
END AS "Annotation Value",
SUM(ingress_bytes) AS "Ingress Bytes",
SUM(ingress_files) AS "Ingress Files",
SUM(egress_bytes) AS "Egress Bytes",
SUM(egress_files) AS "Egress Files",
SUM(error_files) AS "Errored",
SUM(filter_files) AS "Filtered"
FROM get_analytics_data(
$annotationKey,
ARRAY[$annotationValues],
ARRAY[$dataSources],
ARRAY[$groups],
ARRAY['All'], -- all ingress types
$__timeFrom(),
$__timeTo(),
'100 years' -- Using a very large interval for summary
)
GROUP BY
datasource_name,
group_name,
CASE
WHEN $annotationKey != 'All' THEN annotation_value
END
ORDER BY "Data Source", "Group", "Annotation Value"
-- Error analysis with gap filling
SELECT
time_bucket_gapfill(($__interval_ms || ' ms')::interval, "time") AS gapped_time,
datasource_name || ' - ' || cause AS series_name,
SUM(error_files) AS errors
FROM get_errors_filters_data(
'All',
ARRAY['All'],
ARRAY[$dataSources],
ARRAY['All'],
$__timeFrom(),
$__timeTo(),
($__interval_ms || ' ms'),
'ERRORS'
)
GROUP BY gapped_time, datasource_name, cause
ORDER BY gapped_time, series_name
Performance Considerations
The analytics functions have been optimized for performance:
- Time bucketing is applied early in the query pipeline to reduce data volume
- Gap filling is left to the caller to avoid unnecessary processing for aggregate queries
- Dimensions are joined efficiently to minimize row multiplication
- "No Annotation" values are handled properly with minimal overhead
For best performance:
- Use reasonable time intervals appropriate to your data density
- Apply additional filtering (datasources, groups, annotations) when possible
- For summary tables, use a large interval (e.g., '100 years') to avoid unnecessary bucketing
- Only use gap filling for time series visualizations where it's needed

