Metrics and Observability
Metrics and observability in SemanticKernel.Graph provide comprehensive insights into graph execution performance, resource usage, and operational health. This guide covers performance metrics collection, export capabilities, execution tracing, and monitoring dashboards.
What You'll Learn
- How to configure and enable comprehensive metrics collection
- Understanding node-level and path-level performance metrics
- Exporting metrics to various monitoring systems and dashboards
- Setting up execution tracing and correlation
- Monitoring system resources and performance indicators
- Best practices for production observability
Concepts and Techniques
GraphPerformanceMetrics: Comprehensive metrics collector that tracks node execution times, success rates, execution paths, and system resource usage.
NodeExecutionMetrics: Individual node performance tracking including execution counts, timing percentiles (p50, p95, p99), and success/failure rates.
ExecutionPathMetrics: Analysis of complete execution routes through the graph, including path frequency and performance characteristics.
Metrics Exporters: Specialized export capabilities for various monitoring systems including Prometheus, Grafana, and custom dashboards.
Execution Tracing: OpenTelemetry-based tracing with correlation between execution spans and streaming events.
Resource Monitoring: CPU and memory usage tracking with configurable sampling intervals.
Prerequisites
- First Graph Tutorial completed
- Basic understanding of graph execution concepts
- Familiarity with metrics and monitoring concepts
- Microsoft.Extensions.Logging configured (optional but recommended)
Enabling Metrics Collection
Basic Metrics Configuration
Enable metrics collection at the kernel level:
using Microsoft.SemanticKernel;
using SemanticKernel.Graph.Extensions;
// apiKey holds your OpenAI API key, loaded from configuration or an environment variable
var kernel = Kernel.CreateBuilder()
    .AddOpenAIChatCompletion("gpt-3.5-turbo", apiKey)
    .AddGraphSupport(options =>
    {
        options.EnableMetrics = true;
        options.EnableLogging = true;
    })
    .Build();
Graph-Level Metrics Configuration
Configure detailed metrics collection for specific graphs:
using SemanticKernel.Graph.Core;
// Create graph with metrics enabled
var graph = new GraphExecutor("PerformanceGraph", "High-performance workflow");
// Enable development metrics (detailed tracking, frequent sampling)
graph.EnableDevelopmentMetrics();
// Or use production metrics (optimized for performance)
// graph.EnableProductionMetrics();
// Or customize metrics options
var customMetricsOptions = new GraphMetricsOptions
{
    EnableResourceMonitoring = true,
    ResourceSamplingInterval = TimeSpan.FromSeconds(10),
    MaxSampleHistory = 5000,
    EnableDetailedPathTracking = true,
    EnablePercentileCalculations = true,
    MetricsRetentionPeriod = TimeSpan.FromDays(7)
};
graph.ConfigureMetrics(customMetricsOptions);
Preset Configurations
Use predefined configurations for common scenarios:
// Development environment (detailed tracking)
var devOptions = GraphMetricsOptions.CreateDevelopmentOptions();
graph.ConfigureMetrics(devOptions);
// Production environment (performance optimized)
var prodOptions = GraphMetricsOptions.CreateProductionOptions();
graph.ConfigureMetrics(prodOptions);
// High-performance scenario (minimal overhead)
var perfOptions = GraphMetricsOptions.CreatePerformanceOptions();
graph.ConfigureMetrics(perfOptions);
Performance Metrics Collection
Node-Level Metrics
Track individual node performance characteristics:
// Get metrics for a specific node
var nodeMetrics = graph.GetNodeMetrics("processing_node");
if (nodeMetrics != null)
{
    Console.WriteLine($"Node: {nodeMetrics.NodeName}");
    Console.WriteLine($"Total Executions: {nodeMetrics.TotalExecutions}");
    Console.WriteLine($"Success Rate: {nodeMetrics.SuccessRate:F1}%");
    Console.WriteLine($"Average Time: {nodeMetrics.AverageExecutionTime.TotalMilliseconds:F2}ms");
    // Get percentile performance
    var p50 = nodeMetrics.GetPercentile(50);
    var p95 = nodeMetrics.GetPercentile(95);
    var p99 = nodeMetrics.GetPercentile(99);
    Console.WriteLine($"P50: {p50.TotalMilliseconds:F2}ms");
    Console.WriteLine($"P95: {p95.TotalMilliseconds:F2}ms");
    Console.WriteLine($"P99: {p99.TotalMilliseconds:F2}ms");
    // Performance classification
    var rating = nodeMetrics.GetPerformanceClassification();
    Console.WriteLine($"Performance Rating: {rating}");
}
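To make the percentile figures above concrete, the following is a minimal sketch of a nearest-rank percentile over recorded durations. It is an illustration only: `NearestRankPercentile` is a hypothetical helper, and the source does not say which percentile method `GetPercentile` uses internally.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Nearest-rank percentile: sort the samples and take the value at the
// 1-based rank ceil(p / 100 * N). Hypothetical helper, not a library API.
static TimeSpan NearestRankPercentile(IReadOnlyCollection<TimeSpan> samples, double percentile)
{
    if (samples.Count == 0)
        throw new ArgumentException("At least one sample is required.", nameof(samples));
    if (percentile <= 0 || percentile > 100)
        throw new ArgumentOutOfRangeException(nameof(percentile));

    var sorted = samples.OrderBy(s => s).ToArray();
    var rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length);
    return sorted[rank - 1];
}

// Ten made-up node durations in milliseconds, one of them a slow outlier.
var durations = new double[] { 120, 80, 95, 110, 2300, 105, 90, 100, 85, 115 }
    .Select(ms => TimeSpan.FromMilliseconds(ms))
    .ToArray();

Console.WriteLine($"Average: {durations.Average(d => d.TotalMilliseconds):F0}ms");
Console.WriteLine($"P50: {NearestRankPercentile(durations, 50).TotalMilliseconds:F0}ms");
Console.WriteLine($"P95: {NearestRankPercentile(durations, 95).TotalMilliseconds:F0}ms");
```

With this sample the single 2300 ms outlier pulls the average to 320 ms while P50 stays at 100 ms, and P95 lands on the outlier itself. That gap is why tail percentiles are usually a better latency signal than the average.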
Execution Path Metrics
Analyze complete execution routes through the graph:
// Get all execution path metrics
var pathMetrics = graph.GetAllPathMetrics();
foreach (var path in pathMetrics.OrderByDescending(p => p.Value.ExecutionCount))
{
    var metrics = path.Value;
    Console.WriteLine($"Path: {metrics.PathKey}");
    Console.WriteLine($" Executions: {metrics.ExecutionCount}");
    Console.WriteLine($" Success Rate: {metrics.SuccessRate:F1}%");
    Console.WriteLine($" Average Time: {metrics.AverageExecutionTime.TotalMilliseconds:F2}ms");
    Console.WriteLine($" Path Length: {metrics.PathLength} nodes");
    Console.WriteLine($" Frequency: {metrics.ExecutionsPerHour:F2}/hour");
    // Get path-specific percentiles
    var p95 = metrics.GetPercentile(95);
    Console.WriteLine($" P95: {p95.TotalMilliseconds:F2}ms");
}
Overall Performance Summary
Get a comprehensive performance overview:
// Get performance summary for the last hour
var summary = graph.GetPerformanceSummary(TimeSpan.FromHours(1));
if (summary != null)
{
    Console.WriteLine("📊 PERFORMANCE SUMMARY");
    Console.WriteLine(new string('-', 50));
    Console.WriteLine($"Total Executions: {summary.TotalExecutions}");
    Console.WriteLine($"Success Rate: {summary.SuccessRate:F1}%");
    Console.WriteLine($"Average Execution Time: {summary.AverageExecutionTime.TotalMilliseconds:F2}ms");
    Console.WriteLine($"Min/Max Time: {summary.MinExecutionTime.TotalMilliseconds:F2}ms / {summary.MaxExecutionTime.TotalMilliseconds:F2}ms");
    Console.WriteLine($"Throughput: {summary.Throughput:F2} executions/second");
    Console.WriteLine($"Current CPU Usage: {summary.CurrentCpuUsage:F1}%");
    Console.WriteLine($"Available Memory: {summary.CurrentAvailableMemoryMB:F0} MB");
    // System health assessment
    var isHealthy = summary.IsHealthy();
    Console.WriteLine($"System Health: {(isHealthy ? "🟢 HEALTHY" : "🔴 NEEDS ATTENTION")}");
    if (!isHealthy)
    {
        var alerts = summary.GetPerformanceAlerts();
        Console.WriteLine("Performance Alerts:");
        foreach (var alert in alerts)
        {
            Console.WriteLine($" - {alert}");
        }
    }
}
Resource Monitoring
System Resource Tracking
Monitor CPU and memory usage during graph execution:
// Enable resource monitoring
var resourceOptions = new GraphMetricsOptions
{
    EnableResourceMonitoring = true,
    ResourceSamplingInterval = TimeSpan.FromSeconds(5)
};
graph.ConfigureMetrics(resourceOptions);
// Access current resource metrics
var metrics = graph.GetPerformanceMetrics();
if (metrics != null)
{
    Console.WriteLine($"Current CPU Usage: {metrics.CurrentCpuUsage:F1}%");
    Console.WriteLine($"Available Memory: {metrics.CurrentAvailableMemoryMB:F0} MB");
    Console.WriteLine($"Overall Throughput: {metrics.OverallThroughput:F2} executions/sec");
    Console.WriteLine($"Average Latency: {metrics.AverageExecutionLatency.TotalMilliseconds:F2}ms");
}
Resource Sampling Configuration
Configure resource monitoring behavior:
var resourceOptions = new GraphMetricsOptions
{
    EnableResourceMonitoring = true,
    ResourceSamplingInterval = TimeSpan.FromSeconds(10), // Sample every 10 seconds
    MaxSampleHistory = 10000, // Keep 10K samples
    MetricsRetentionPeriod = TimeSpan.FromDays(7) // Retain for 7 days
};
graph.ConfigureMetrics(resourceOptions);
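The sampling interval, sample cap, and retention period interact, so it can help to check the arithmetic before choosing values. The sketch below assumes that `MaxSampleHistory` caps the number of retained resource samples; the source does not confirm this, and `HistoryCoverage` is a hypothetical helper.

```csharp
using System;

// Assumption: MaxSampleHistory caps the number of retained resource samples,
// so the time they cover is samplingInterval * maxSampleHistory.
static TimeSpan HistoryCoverage(TimeSpan samplingInterval, int maxSampleHistory) =>
    TimeSpan.FromTicks(samplingInterval.Ticks * maxSampleHistory);

var interval = TimeSpan.FromSeconds(10);
var retention = TimeSpan.FromDays(7);
var coverage = HistoryCoverage(interval, 10_000);

Console.WriteLine($"Samples cover {coverage.TotalHours:F1} hours");
Console.WriteLine(coverage < retention
    ? "The sample cap is reached before the retention period elapses."
    : "The retention period is the limiting factor.");

// Number of samples needed to span the whole retention period at this interval.
var samplesNeeded = (long)Math.Ceiling(retention.TotalSeconds / interval.TotalSeconds);
Console.WriteLine($"Samples needed for full retention: {samplesNeeded}");
```

Under that assumption, 10,000 samples at a 10-second interval cover 100,000 seconds, or roughly 27.8 hours, well short of the 7-day retention period. Spanning all 7 days at the same interval would take 60,480 samples.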
Metrics Export and Integration
GraphMetricsExporter
Export metrics to various monitoring systems:
using SemanticKernel.Graph.Core;
var exporter = new GraphMetricsExporter(
    new GraphMetricsExportOptions
    {
        IndentedOutput = true,
        UseCamelCase = true,
        IncludePercentileData = true,
        IncludeTrendAnalysis = true,
        IncludeRecommendations = true
    }
);
// Export in different formats
var metrics = graph.GetPerformanceMetrics();
if (metrics != null)
{
    // JSON format for web dashboards
    var jsonMetrics = exporter.ExportMetrics(metrics, MetricsExportFormat.Json);
    // Prometheus format for monitoring systems
    var prometheusMetrics = exporter.ExportMetrics(metrics, MetricsExportFormat.Prometheus);
    // CSV format for spreadsheet analysis
    var csvMetrics = exporter.ExportMetrics(metrics, MetricsExportFormat.Csv);
    // XML format for legacy systems
    var xmlMetrics = exporter.ExportMetrics(metrics, MetricsExportFormat.Xml);
}
Dashboard Integration
Export metrics specifically formatted for popular dashboards:
// Export for Grafana
var grafanaMetrics = exporter.ExportForDashboard(metrics, DashboardType.Grafana);
// Export for Chart.js
var chartJsMetrics = exporter.ExportForDashboard(metrics, DashboardType.ChartJs);
// Export for custom dashboards
var customMetrics = exporter.ExportForDashboard(metrics, DashboardType.Custom);
Prometheus Integration
Export metrics in Prometheus format for monitoring systems:
// Export Prometheus metrics
var prometheusMetrics = exporter.ExportMetrics(metrics, MetricsExportFormat.Prometheus);
// Example output:
// # HELP graph_node_execution_total Total number of node executions
// # TYPE graph_node_execution_total counter
// graph_node_execution_total{node_id="processing_node",node_name="Processing"} 150
// graph_node_execution_total{node_id="decision_node",node_name="Decision"} 75
//
// # HELP graph_node_execution_duration_seconds Node execution duration in seconds
// # TYPE graph_node_execution_duration_seconds histogram
// graph_node_execution_duration_seconds_bucket{node_id="processing_node",le="0.1"} 45
// graph_node_execution_duration_seconds_bucket{node_id="processing_node",le="0.5"} 120
// graph_node_execution_duration_seconds_bucket{node_id="processing_node",le="1.0"} 150
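The `le` buckets in the sample output are cumulative: each bucket counts every observation at or below its upper bound, which is why the counts only ever grow from one bucket to the next. The sketch below shows how such lines can be derived from raw durations. It is not the exporter's implementation; `RenderHistogram` and the sample values are hypothetical. A complete histogram in the Prometheus text format also carries a `+Inf` bucket plus `_sum` and `_count` series, so the sketch emits those too.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

// Hypothetical helper: renders one histogram in Prometheus text exposition format.
static string RenderHistogram(string metricName, string nodeId, double[] durationsSeconds, double[] upperBounds)
{
    var sb = new StringBuilder();
    sb.Append($"# TYPE {metricName} histogram\n");
    foreach (var bound in upperBounds.OrderBy(b => b))
    {
        // Cumulative count: every observation less than or equal to this bound.
        var count = durationsSeconds.Count(d => d <= bound);
        var le = bound.ToString("0.0###", CultureInfo.InvariantCulture);
        sb.Append($"{metricName}_bucket{{node_id=\"{nodeId}\",le=\"{le}\"}} {count}\n");
    }
    // The +Inf bucket always equals the total number of observations.
    sb.Append($"{metricName}_bucket{{node_id=\"{nodeId}\",le=\"+Inf\"}} {durationsSeconds.Length}\n");
    var sum = durationsSeconds.Sum().ToString(CultureInfo.InvariantCulture);
    sb.Append($"{metricName}_sum{{node_id=\"{nodeId}\"}} {sum}\n");
    sb.Append($"{metricName}_count{{node_id=\"{nodeId}\"}} {durationsSeconds.Length}\n");
    return sb.ToString();
}

// Five made-up durations in seconds and the three bucket bounds used above.
var observed = new[] { 0.0625, 0.25, 0.25, 0.75, 2.0 };
var text = RenderHistogram("graph_node_execution_duration_seconds", "processing_node",
    observed, new[] { 0.1, 0.5, 1.0 });
Console.Write(text);
```

Because the buckets are cumulative, the number of observations in a range such as 0.1 s to 0.5 s is the difference between the two bucket counts, not the second count alone.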
Execution Tracing and Correlation
OpenTelemetry Integration
Enable distributed tracing with correlation:
using System.Diagnostics;
// Configure ActivitySource for tracing
var activitySource = new ActivitySource("SemanticKernel.Graph");
// Enable tracing in graph options
var graphOptions = new GraphOptions
{
    EnableMetrics = true,
    EnableLogging = true
};
// GraphExecutor automatically creates tracing spans
var graph = new GraphExecutor("TracedGraph", "Graph with tracing enabled");
// Wrap the execution in a parent span. StartActivity returns null when no
// listener is registered, so the graph must run whether or not a span exists.
using var activity = activitySource.StartActivity("Graph.Execute");
activity?.SetTag("graph.id", graph.GraphId);
activity?.SetTag("graph.name", graph.Name);
try
{
    var result = await graph.ExecuteAsync(kernel, arguments);
    activity?.SetTag("execution.success", true);
    activity?.SetTag("execution.result", result.GetValue<string>());
}
catch (Exception ex)
{
    activity?.SetTag("execution.success", false);
    activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
    throw;
}
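An `ActivitySource` only produces spans when something is listening to it; otherwise `StartActivity` returns null. The following minimal sketch shows this with a plain `ActivityListener` from `System.Diagnostics`. The source name `SemanticKernel.Graph` is taken from the example above; whether the library emits its own spans under that name is an assumption here.

```csharp
using System;
using System.Diagnostics;

var source = new ActivitySource("SemanticKernel.Graph");

// No listener yet: StartActivity returns null and nothing is recorded.
var first = source.StartActivity("Graph.Execute");
bool createdWithoutListener = first is not null;
first?.Dispose();

// Register a listener that samples every activity from this source.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "SemanticKernel.Graph",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
    ActivityStopped = a => Console.WriteLine($"{a.OperationName} finished in {a.Duration.TotalMilliseconds:F2}ms")
};
ActivitySource.AddActivityListener(listener);

// With the listener in place, the same call now yields a real activity.
var second = source.StartActivity("Graph.Execute");
bool createdWithListener = second is not null;
second?.SetTag("graph.name", "TracedGraph");
second?.Dispose();

Console.WriteLine($"Without listener: {createdWithoutListener}, with listener: {createdWithListener}");
```

In a production setup the OpenTelemetry SDK typically takes the place of this hand-written listener: adding the source name to the tracer provider subscribes to it and forwards the spans to the configured exporter.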
Span Correlation
Correlate execution spans with streaming events:
// Execute with streaming and tracing
var stream = streamingExecutor.ExecuteStreamAsync(kernel, arguments);
await foreach (var evt in stream)
{
    // Each event includes correlation information
    Console.WriteLine($"Event: {evt.EventType}");
    Console.WriteLine($"Execution ID: {evt.ExecutionId}");
    Console.WriteLine($"Node ID: {evt.NodeId}");
    Console.WriteLine($"Correlation ID: {evt.CorrelationId}");
    // Use correlation ID to link with tracing spans
    if (Activity.Current != null)
    {
        Activity.Current.SetTag("event.correlation_id", evt.CorrelationId);
        Activity.Current.SetTag("event.node_id", evt.NodeId);
    }
}
Custom Tracing
Add custom tracing to your graph nodes:
public class CustomTracingNode : IGraphNode
{
    // StartActivity is an instance method, so the node needs its own ActivitySource.
    // "MyApp.GraphNodes" is an example source name; choose one that fits your application.
    private static readonly ActivitySource ActivitySource = new("MyApp.GraphNodes");
    // Any other members required by IGraphNode are omitted here.
    public async Task<FunctionResult> ExecuteAsync(Kernel kernel, KernelArguments arguments)
    {
        using var activity = ActivitySource.StartActivity("CustomNode.Execute");
        activity?.SetTag("node.type", "CustomTracing");
        activity?.SetTag("node.custom_data", "example");
        try
        {
            // Node execution logic (ProcessDataAsync is a placeholder for your own code)
            var result = await ProcessDataAsync(arguments);
            activity?.SetTag("execution.success", true);
            return result;
        }
        catch (Exception ex)
        {
            activity?.SetTag("execution.success", false);
            activity?.SetTag("exception.type", ex.GetType().Name);
            activity?.SetTag("exception.message", ex.Message);
            throw;
        }
    }
}
Performance Analysis and Optimization
Identifying Performance Bottlenecks
Analyze node performance to find bottlenecks:
// Get all node metrics and identify slow nodes
var allNodeMetrics = graph.GetAllNodeMetrics();
var slowNodes = allNodeMetrics
    .Where(n => n.Value.AverageExecutionTime.TotalMilliseconds > 1000) // > 1 second
    .OrderByDescending(n => n.Value.AverageExecutionTime.TotalMilliseconds);
Console.WriteLine("🐌 SLOW NODES (>1s average)");
foreach (var node in slowNodes)
{
    var metrics = node.Value;
    Console.WriteLine($"Node: {metrics.NodeName}");
    Console.WriteLine($" Average Time: {metrics.AverageExecutionTime.TotalMilliseconds:F2}ms");
    Console.WriteLine($" P95 Time: {metrics.GetPercentile(95).TotalMilliseconds:F2}ms");
    Console.WriteLine($" Success Rate: {metrics.SuccessRate:F1}%");
    Console.WriteLine($" Total Executions: {metrics.TotalExecutions}");
}
Path Performance Analysis
Analyze execution path performance:
// Find the most frequently executed paths
var frequentPaths = graph.GetAllPathMetrics()
    .OrderByDescending(p => p.Value.ExecutionCount)
    .Take(5);
Console.WriteLine("🛤️ MOST FREQUENT EXECUTION PATHS");
foreach (var path in frequentPaths)
{
    var metrics = path.Value;
    Console.WriteLine($"Path: {metrics.PathKey}");
    Console.WriteLine($" Frequency: {metrics.ExecutionsPerHour:F2}/hour");
    Console.WriteLine($" Success Rate: {metrics.SuccessRate:F1}%");
    Console.WriteLine($" Average Time: {metrics.AverageExecutionTime.TotalMilliseconds:F2}ms");
    // Check if path has performance issues
    if (metrics.SuccessRate < 90 || metrics.AverageExecutionTime.TotalMilliseconds > 5000)
    {
        Console.WriteLine(" ⚠️ Performance issues detected!");
    }
}
Trend Analysis
Monitor performance trends over time:
// Get performance summary for different time windows
var timeWindows = new[]
{
    TimeSpan.FromMinutes(5),
    TimeSpan.FromMinutes(15),
    TimeSpan.FromHours(1),
    TimeSpan.FromHours(6)
};
Console.WriteLine("📈 PERFORMANCE TRENDS");
foreach (var window in timeWindows)
{
    var summary = graph.GetPerformanceSummary(window);
    if (summary != null)
    {
        Console.WriteLine($"\n{window} Window:");
        Console.WriteLine($" Executions: {summary.TotalExecutions}");
        Console.WriteLine($" Success Rate: {summary.SuccessRate:F1}%");
        Console.WriteLine($" Average Time: {summary.AverageExecutionTime.TotalMilliseconds:F2}ms");
        Console.WriteLine($" Throughput: {summary.Throughput:F2}/sec");
    }
}
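Printing several windows side by side shows a trend only if the numbers are compared. One simple approach, sketched here with made-up values, is to treat the longer window as a baseline and flag a regression when the short window is worse by more than a chosen margin. `PercentChange` and the 20% threshold are hypothetical choices, not part of the library.

```csharp
using System;

// Relative change of the current value against a baseline, in percent.
// Returns 0 when the baseline is 0 to avoid dividing by zero.
static double PercentChange(double baseline, double current) =>
    baseline == 0 ? 0 : (current - baseline) / baseline * 100.0;

// Made-up average latencies, as they might be read from two summaries.
var lastHourAverageMs = 800.0;     // baseline: 1-hour window
var lastFiveMinAverageMs = 1000.0; // recent: 5-minute window

var change = PercentChange(lastHourAverageMs, lastFiveMinAverageMs);
var regressed = change > 20.0; // hypothetical threshold

Console.WriteLine($"Latency change vs. 1-hour baseline: {change:F0}%");
Console.WriteLine(regressed ? "Latency regression detected" : "Latency is within the expected range");
```

Here the recent average is 25% above the baseline, so the check reports a regression. The same comparison works for success rate or throughput with the sign of the threshold reversed.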
Monitoring and Alerting
Health Checks
Implement automated health monitoring:
// HealthReport and HealthStatus are assumed here to be simple types exposing a
// Status and a Description. The HealthReport class in
// Microsoft.Extensions.Diagnostics.HealthChecks has a different constructor.
public class GraphHealthMonitor
{
    // No awaits are needed, so the result is wrapped with Task.FromResult
    public Task<HealthReport> CheckHealthAsync(GraphExecutor graph)
    {
        var summary = graph.GetPerformanceSummary(TimeSpan.FromMinutes(5));
        if (summary == null)
        {
            return Task.FromResult(new HealthReport(HealthStatus.Unhealthy, "No metrics available"));
        }
        var issues = new List<string>();
        // Check success rate
        if (summary.SuccessRate < 95)
        {
            issues.Add($"Low success rate: {summary.SuccessRate:F1}%");
        }
        // Check response time
        if (summary.AverageExecutionTime.TotalMilliseconds > 5000)
        {
            issues.Add($"High response time: {summary.AverageExecutionTime.TotalMilliseconds:F0}ms");
        }
        // Check throughput
        if (summary.Throughput < 1.0)
        {
            issues.Add($"Low throughput: {summary.Throughput:F2}/sec");
        }
        // Check system resources
        if (summary.CurrentCpuUsage > 80)
        {
            issues.Add($"High CPU usage: {summary.CurrentCpuUsage:F1}%");
        }
        if (summary.CurrentAvailableMemoryMB < 1000)
        {
            issues.Add($"Low memory: {summary.CurrentAvailableMemoryMB:F0} MB available");
        }
        var status = issues.Count == 0 ? HealthStatus.Healthy : HealthStatus.Unhealthy;
        return Task.FromResult(new HealthReport(status, string.Join("; ", issues)));
    }
}
// Usage
var healthMonitor = new GraphHealthMonitor();
var health = await healthMonitor.CheckHealthAsync(graph);
if (health.Status == HealthStatus.Unhealthy)
{
    Console.WriteLine($"🔴 Health Check Failed: {health.Description}");
    // Send alert, log error, etc.
}
else
{
    Console.WriteLine("🟢 Health Check Passed");
}
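Performance Alerts
Set up automated performance monitoring:
using System.Threading;
public sealed class PerformanceAlerting : IDisposable
{
    private readonly GraphExecutor _graph;
    private readonly Timer _alertingTimer;
    public PerformanceAlerting(GraphExecutor graph)
    {
        _graph = graph;
        // Check immediately, then once per minute
        _alertingTimer = new Timer(CheckPerformance, null, TimeSpan.Zero, TimeSpan.FromMinutes(1));
    }
    private void CheckPerformance(object? state)
    {
        try
        {
            var summary = _graph.GetPerformanceSummary(TimeSpan.FromMinutes(5));
            if (summary == null) return;
            var alerts = summary.GetPerformanceAlerts();
            foreach (var alert in alerts)
            {
                Console.WriteLine($"🚨 PERFORMANCE ALERT: {alert}");
                // Send notification, log alert, etc.
            }
        }
        catch (Exception ex)
        {
            // An unhandled exception in a timer callback would terminate the process
            Console.WriteLine($"Performance check failed: {ex.Message}");
        }
    }
    // Stops the timer so no further checks are scheduled
    public void Dispose() => _alertingTimer.Dispose();
}
// Usage: dispose the monitor on shutdown to stop the timer
using var alerting = new PerformanceAlerting(graph);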
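A check that runs every minute over a five-minute window will report the same problem repeatedly while it persists. One common mitigation is a per-alert cooldown, sketched below. `ShouldNotify` and the ten-minute cooldown are hypothetical and not part of the library. The sketch keys on the alert text, so it only suppresses repeats whose message is identical.

```csharp
using System;
using System.Collections.Generic;

var cooldown = TimeSpan.FromMinutes(10);
var lastFired = new Dictionary<string, DateTimeOffset>();

// Returns true only if this alert has not been notified within the cooldown.
// Not thread-safe: guard it with a lock if timer callbacks can overlap.
bool ShouldNotify(string alert, DateTimeOffset now)
{
    if (lastFired.TryGetValue(alert, out var previous) && now - previous < cooldown)
    {
        return false; // still cooling down, suppress the repeat
    }
    lastFired[alert] = now;
    return true;
}

var start = DateTimeOffset.UtcNow;
var firstOccurrence = ShouldNotify("High CPU usage", start);              // notified
var repeatAfterOneMinute = ShouldNotify("High CPU usage", start.AddMinutes(1));   // suppressed
var repeatAfterElevenMinutes = ShouldNotify("High CPU usage", start.AddMinutes(11)); // notified again

Console.WriteLine($"{firstOccurrence}, {repeatAfterOneMinute}, {repeatAfterElevenMinutes}");
```

If the alert strings embed changing values such as the current CPU percentage, a stable key (for example an alert category) would be needed instead of the full message.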
Best Practices
Metrics Configuration
- Development: Use CreateDevelopmentOptions() for detailed debugging
- Production: Use CreateProductionOptions() for performance optimization
- High-Throughput: Use CreatePerformanceOptions() for minimal overhead
- Resource Monitoring: Enable only when needed to avoid performance impact
Performance Monitoring
- Sampling Intervals: Balance accuracy with performance (5-30 seconds for resources)
- Retention Periods: Keep metrics long enough for trend analysis (7-30 days)
- Percentile Tracking: Focus on p95 and p99 for latency monitoring
- Path Analysis: Monitor execution paths for optimization opportunities
Export and Integration
- Prometheus: Use for Kubernetes and cloud-native monitoring
- Grafana: Export dashboard-ready metrics for visualization
- Custom Dashboards: Use JSON export for web-based monitoring
- Alerting: Set up automated alerts for critical performance issues
Tracing and Correlation
- Correlation IDs: Use stable IDs for linking spans and events
- Span Naming: Use descriptive names for better observability
- Tag Strategy: Add business context to tracing spans
- Sampling: Configure appropriate sampling rates for production
Troubleshooting
Common Issues
Metrics not collecting: Ensure EnableMetrics is true in graph options and metrics are properly configured.
High memory usage: Reduce MaxSampleHistory and MaxPathHistoryPerPath in metrics options.
Performance impact: Use production-optimized metrics options and disable resource monitoring if not needed.
Export failures: Check export format compatibility and ensure metrics data is available.
Performance Optimization
// Optimize metrics collection for high-throughput scenarios
var optimizedOptions = new GraphMetricsOptions
{
    EnableResourceMonitoring = false, // Disable if not needed
    ResourceSamplingInterval = TimeSpan.FromMinutes(5),
    MaxSampleHistory = 1000, // Reduce sample history
    EnableDetailedPathTracking = false, // Disable if not needed
    MaxPathHistoryPerPath = 100, // Reduce path history
    EnablePercentileCalculations = true, // Keep percentiles
    MetricsRetentionPeriod = TimeSpan.FromHours(2), // Shorter retention
    EnableRealTimeMetrics = false, // Disable for performance
    AggregationInterval = TimeSpan.FromMinutes(5) // Less frequent aggregation
};
graph.ConfigureMetrics(optimizedOptions);
See Also
- Debug and Inspection - Using metrics for debugging and analysis
- State Management - Understanding execution state and context
- Graph Execution - Execution lifecycle and performance
- Examples - Practical examples of metrics and monitoring