Error Policies¶

This document covers the comprehensive error handling system in SemanticKernel.Graph, including policy management, retry mechanisms, error handling nodes, and metrics collection. The system provides robust resilience patterns with configurable policies, automatic retry logic, and comprehensive error tracking.

ErrorPolicyRegistry¶

The ErrorPolicyRegistry provides centralized error handling policies across the graph execution system, supporting retry, circuit breaker, and budget policies with runtime policy resolution.

Overview¶

This registry manages error handling policies with versioning, runtime resolution, and integration with the graph execution context. It supports policy rules based on error types, node types, and custom conditions.

Key Features¶

Centralized Policy Management: Single source of truth for all error handling policies
Runtime Policy Resolution: Dynamic policy selection based on error context and execution state
Policy Versioning: Support for policy updates and rollbacks
Circuit Breaker Integration: Built-in circuit breaker policies per node
Budget Management: Resource budget policies with automatic enforcement
Thread Safety: All operations are thread-safe for concurrent access

Policy Registration¶

var registry = new ErrorPolicyRegistry(new ErrorPolicyRegistryOptions());

// Register retry policy for specific error types using PolicyRule
// (use RegisterPolicyRule so the rule is properly indexed and versioned)
registry.RegisterPolicyRule(new PolicyRule
{
    ContextId = "Examples",
    ErrorType = GraphErrorType.Network,
    RecoveryAction = ErrorRecoveryAction.Retry,
    MaxRetries = 3,
    RetryDelay = TimeSpan.FromSeconds(1),
    BackoffMultiplier = 2.0,
    Priority = 100,
    Description = "Retry network errors"
});

// Register circuit breaker policy for a node using the correct config properties
registry.RegisterNodeCircuitBreakerPolicy("api-node", new CircuitBreakerPolicyConfig
{
    Enabled = true,
    FailureThreshold = 5,
    OpenTimeout = TimeSpan.FromMinutes(1),
    FailureWindow = TimeSpan.FromMinutes(5)
});

Runnable example¶

The repository contains a tested, runnable example that exercises the snippets in this document:

Example source: semantic-kernel-graph-docs/examples/ErrorPoliciesExample.cs
Run the example from the repo root (requires .NET 8):

dotnet run --project semantic-kernel-graph-docs/examples/Examples.csproj -- error-policies

This example registers the sample policies, executes an ErrorHandlerGraphNode with a simulated HttpRequestException, and records a sample event in ErrorMetricsCollector so you can inspect the runtime output and verify the documented behavior.

Policy Resolution¶

Policies are resolved based on error context and node information:

var errorContext = new ErrorHandlingContext
{
    Exception = exception,
    ErrorType = GraphErrorType.Network,
    Severity = ErrorSeverity.Medium,
    AttemptNumber = 1,
    IsTransient = true
};

var policy = registry.ResolvePolicy(errorContext, executionContext);
if (policy?.RecoveryAction == ErrorRecoveryAction.Retry)
{
    // Apply retry logic with configured parameters
    var delay = CalculateRetryDelay(policy, errorContext.AttemptNumber);
    await Task.Delay(delay);
}

Policy Rule Configuration¶

public class PolicyRule
{
    public ErrorRecoveryAction RecoveryAction { get; set; }
    public int MaxRetries { get; set; }
    public TimeSpan RetryDelay { get; set; }
    public double BackoffMultiplier { get; set; }
    public TimeSpan MaxRetryDelay { get; set; }
    public int Priority { get; set; }
    public string? NodeTypePattern { get; set; }
    public ErrorSeverity? SeverityThreshold { get; set; }
    public Func<ErrorHandlingContext, GraphExecutionContext?, bool>? CustomCondition { get; set; }
}

RetryPolicyGraphNode¶

The RetryPolicyGraphNode wraps another node with automatic retry capabilities, handling transient failures with configurable retry policies and backoff strategies.

Overview¶

This specialized node provides automatic retry logic for wrapped nodes, supporting multiple retry strategies, error type filtering, and comprehensive retry statistics. It enhances graph resilience by automatically handling transient failures.

Key Features¶

Automatic Retry Logic: Configurable retry attempts with intelligent backoff
Multiple Retry Strategies: Fixed delay, exponential backoff, linear backoff, and custom strategies
Error Type Filtering: Retry only specific error types or use custom retry conditions
Jitter Support: Random jitter to prevent thundering herd problems
Retry Statistics: Comprehensive tracking of retry attempts and performance
Metadata Augmentation: Adds retry context to kernel arguments and results

Configuration¶

var retryConfig = new RetryPolicyConfig
{
    MaxRetries = 3,
    BaseDelay = TimeSpan.FromSeconds(1),
    MaxDelay = TimeSpan.FromSeconds(30),
    Strategy = RetryStrategy.ExponentialBackoff,
    BackoffMultiplier = 2.0,
    UseJitter = true,
    RetryableErrorTypes = new HashSet<GraphErrorType>
    {
        GraphErrorType.Network,
        GraphErrorType.ServiceUnavailable,
        GraphErrorType.Timeout
    }
};

var retryNode = new RetryPolicyGraphNode(wrappedNode, retryConfig);

Retry Strategies¶

public enum RetryStrategy
{
    None = 0,                    // No retry attempts
    FixedDelay = 1,             // Fixed delay between attempts
    ExponentialBackoff = 2,     // Exponential increase in delay
    LinearBackoff = 3,          // Linear increase in delay
    RandomJitter = 4,           // Random jitter added to delay
    Custom = 5                   // Custom retry logic
}

Usage Examples¶

Basic Retry Wrapper¶

// Wrap a function node with retry policy
var functionNode = new FunctionGraphNode(kernelFunction, "api-call");
var retryNode = new RetryPolicyGraphNode(functionNode, new RetryPolicyConfig
{
    MaxRetries = 3,
    BaseDelay = TimeSpan.FromSeconds(1),
    Strategy = RetryStrategy.ExponentialBackoff
});

// Connect retry node in graph
graph.AddNode(retryNode);
graph.AddEdge(previousNode, retryNode);

Custom Retry Condition¶

var retryNode = new RetryPolicyGraphNode(wrappedNode, new RetryPolicyConfig
{
    MaxRetries = 5,
    BaseDelay = TimeSpan.FromSeconds(2),
    CustomRetryCondition = (exception, attemptNumber) =>
    {
        // Only retry on specific exceptions
        if (exception is HttpRequestException httpEx)
        {
            return httpEx.StatusCode == System.Net.HttpStatusCode.TooManyRequests ||
                   httpEx.StatusCode == System.Net.HttpStatusCode.ServiceUnavailable;
        }
        return false;
    }
});

Retry Outcome Routing¶

// Add edge that only executes after retry attempts
retryNode.AddEdgeForRetryOutcome(
    targetNode: fallbackNode,
    onlyOnRetrySuccess: false,
    minRetryAttempts: 2
);

// Add edge for successful retry scenarios
retryNode.AddEdgeForRetryOutcome(
    targetNode: successNode,
    onlyOnRetrySuccess: true,
    minRetryAttempts: 1
);

Retry Statistics¶

The node provides comprehensive retry statistics:

var statistics = retryNode.RetryStatistics;
Console.WriteLine($"Total retry attempts: {statistics.TotalRetryAttempts}");
Console.WriteLine($"Successful retries: {statistics.SuccessfulRetries}");
Console.WriteLine($"Average retry delay: {statistics.AverageRetryDelay}");
Console.WriteLine($"Last retry error: {statistics.LastRetryError?.Message}");

ErrorHandlerGraphNode¶

The ErrorHandlerGraphNode is a specialized node for handling errors during graph execution, providing error categorization, recovery actions, and conditional routing based on error types.

Overview¶

This node implements sophisticated error handling logic with automatic error categorization, configurable recovery actions, and intelligent routing decisions. It serves as a central error handling hub in complex workflows.

Key Features¶

Automatic Error Categorization: Maps exceptions to GraphErrorType enum values
Configurable Recovery Actions: Retry, Skip, Fallback, Rollback, Halt, Escalate, Continue
Conditional Routing: Dynamic edge selection based on error handling outcomes
Fallback Node Support: Alternative execution paths for different error scenarios
Comprehensive Telemetry: Detailed error tracking and recovery metrics
Default Error Handlers: Pre-configured handling strategies for common error types

Error Categorization¶

The node automatically categorizes exceptions into error types:

// Automatic categorization based on exception type
ArgumentException → GraphErrorType.Validation
TimeoutException → GraphErrorType.Timeout
OperationCanceledException → GraphErrorType.Cancellation
HttpRequestException → GraphErrorType.Network
UnauthorizedAccessException → GraphErrorType.Authentication
OutOfMemoryException → GraphErrorType.ResourceExhaustion

Default Error Handlers¶

// Transient errors - retry automatically
_errorHandlers[GraphErrorType.Network] = ErrorRecoveryAction.Retry;
_errorHandlers[GraphErrorType.ServiceUnavailable] = ErrorRecoveryAction.Retry;
_errorHandlers[GraphErrorType.Timeout] = ErrorRecoveryAction.Retry;
_errorHandlers[GraphErrorType.RateLimit] = ErrorRecoveryAction.Retry;

// Authentication errors - halt execution
_errorHandlers[GraphErrorType.Authentication] = ErrorRecoveryAction.Halt;

// Validation errors - skip to next node
_errorHandlers[GraphErrorType.Validation] = ErrorRecoveryAction.Skip;

// Critical system errors - halt execution
_errorHandlers[GraphErrorType.ResourceExhaustion] = ErrorRecoveryAction.Halt;
_errorHandlers[GraphErrorType.GraphStructure] = ErrorRecoveryAction.Halt;

Usage Examples¶

Basic Error Handler¶

var errorHandler = new ErrorHandlerGraphNode(
    nodeId: "error-handler-1",
    name: "MainErrorHandler",
    description: "Handles errors in the main workflow",
    logger: graphLogger
);

// Add to graph
graph.AddNode(errorHandler);
graph.AddEdge(failingNode, errorHandler);

Custom Error Handling¶

var errorHandler = new ErrorHandlerGraphNode("custom-error", "CustomErrorHandler");

// Configure custom error handling for specific error types
errorHandler.ConfigureErrorHandler(GraphErrorType.Network, ErrorRecoveryAction.Retry);
errorHandler.ConfigureErrorHandler(GraphErrorType.Authentication, ErrorRecoveryAction.Escalate);

// Set fallback nodes for specific error types
errorHandler.SetFallbackNode(GraphErrorType.ServiceUnavailable, alternativeServiceNode);
errorHandler.SetFallbackNode(GraphErrorType.Validation, validationHelperNode);

Conditional Routing Based on Error Outcomes¶

// Route to different nodes based on recovery action
errorHandler.AddEdgeForRecoveryAction(
    targetNode: retryNode,
    recoveryAction: ErrorRecoveryAction.Retry
);

errorHandler.AddEdgeForRecoveryAction(
    targetNode: fallbackNode,
    recoveryAction: ErrorRecoveryAction.Fallback
);

errorHandler.AddEdgeForRecoveryAction(
    targetNode: escalationNode,
    recoveryAction: ErrorRecoveryAction.Escalate
);

Error Context and Recovery¶

The node processes error context from kernel arguments:

// Input parameters expected by the error handler
public IReadOnlyList<string> InputParameters { get; } = new[]
{
    "LastError",           // Exception that occurred
    "ErrorContext",        // Additional error context
    "ErrorType",           // Categorized error type
    "ErrorSeverity",       // Error severity level
    "AttemptCount"         // Current attempt number
}.AsReadOnly();

// Output parameters provided by the error handler
public IReadOnlyList<string> OutputParameters { get; } = new[]
{
    "ErrorHandled",        // Whether error was handled
    "RecoveryAction",      // Action taken to recover
    "ShouldRetry",         // Whether retry is recommended
    "RetryDelay",          // Suggested retry delay
    "FallbackExecuted",    // Whether fallback was used
    "EscalationRequired"   // Whether escalation is needed
}.AsReadOnly();

ErrorMetricsCollector¶

The ErrorMetricsCollector collects, aggregates, and analyzes error metrics across graph executions, providing trend analysis, performance insights, and anomaly detection.

Overview¶

This component provides comprehensive error tracking with real-time metrics, historical analysis, and performance insights. It integrates with the error handling system to provide actionable intelligence for system reliability.

Key Features¶

Real-time Metrics Collection: Immediate error event processing and aggregation
Multi-dimensional Analysis: Metrics by execution, node, error type, and time
Performance Insights: Error rates, recovery success rates, and trend analysis
Anomaly Detection: Automatic identification of unusual error patterns
Configurable Retention: Adjustable data retention and cleanup policies
Integration Ready: Easy integration with monitoring and alerting systems

Metrics Structure¶

// Execution-level metrics
public class ExecutionErrorMetrics
{
    public string ExecutionId { get; set; }
    public int TotalErrors { get; set; }
    public List<GraphErrorType> ErrorTypes { get; set; }
    public double RecoverySuccessRate { get; set; }
    public double AverageErrorSeverity { get; set; }
    public DateTimeOffset FirstError { get; set; }
    public DateTimeOffset LastError { get; set; }
    public double ErrorRate { get; set; }
    public GraphErrorType MostCommonErrorType { get; set; }
}

// Node-level metrics
public class NodeErrorMetrics
{
    public string NodeId { get; set; }
    public int TotalErrors { get; set; }
    public double ErrorRate { get; set; }
    public double AverageErrorSeverity { get; set; }
    public double RecoverySuccessRate { get; set; }
    public DateTimeOffset LastErrorTime { get; set; }
    public GraphErrorType MostCommonErrorType { get; set; }
    public List<GraphErrorType> ErrorTypes { get; set; }
    public int RecoveryAttempts { get; set; }
    public int SuccessfulRecoveries { get; set; }
}

Usage Examples¶

Basic Metrics Collection¶

var metricsCollector = new ErrorMetricsCollector(new ErrorMetricsOptions
{
    AggregationInterval = TimeSpan.FromMinutes(1),
    MaxEventQueueSize = 10000,
    EnableMetricsCleanup = true,
    MetricsRetentionPeriod = TimeSpan.FromDays(7)
});

// Record an error event
metricsCollector.RecordError(
    executionId: "exec-123",
    nodeId: "api-node",
    errorContext: errorContext,
    recoveryAction: ErrorRecoveryAction.Retry,
    recoverySuccess: true
);

Metrics Queries¶

// Get execution-specific metrics
var executionMetrics = metricsCollector.GetExecutionMetrics("exec-123");
if (executionMetrics != null)
{
    Console.WriteLine($"Total errors: {executionMetrics.TotalErrors}");
    Console.WriteLine($"Recovery success rate: {executionMetrics.RecoverySuccessRate:F2}%");
    Console.WriteLine($"Most common error: {executionMetrics.MostCommonErrorType}");
}

// Get node-specific metrics
var nodeMetrics = metricsCollector.GetNodeMetrics("api-node");
if (nodeMetrics != null)
{
    Console.WriteLine($"Node error rate: {nodeMetrics.ErrorRate:F2} errors/min");
    Console.WriteLine($"Recovery success rate: {nodeMetrics.RecoverySuccessRate:F2}%");
}

// Get overall statistics
var overallStats = metricsCollector.OverallStatistics;
Console.WriteLine($"Total errors recorded: {overallStats.TotalErrors}");
Console.WriteLine($"Current error rate: {overallStats.CurrentErrorRate:F2} errors/min");
Console.WriteLine($"Overall recovery success rate: {overallStats.RecoverySuccessRate:F2}%");

Batch Processing¶

// Record multiple error events at once
var errorEvents = new List<ErrorEvent>
{
    new ErrorEvent
    {
        ExecutionId = "exec-123",
        NodeId = "node-1",
        ErrorType = GraphErrorType.Network,
        Severity = ErrorSeverity.Medium,
        IsTransient = true,
        Timestamp = DateTimeOffset.UtcNow
    },
    new ErrorEvent
    {
        ExecutionId = "exec-123",
        NodeId = "node-2",
        ErrorType = GraphErrorType.Timeout,
        Severity = ErrorSeverity.High,
        IsTransient = true,
        Timestamp = DateTimeOffset.UtcNow
    }
};

metricsCollector.RecordErrorBatch(errorEvents);

Configuration Options¶

public class ErrorMetricsOptions
{
    public TimeSpan AggregationInterval { get; set; } = TimeSpan.FromMinutes(1);
    public int MaxEventQueueSize { get; set; } = 10000;
    public bool EnableMetricsCleanup { get; set; } = true;
    public TimeSpan MetricsRetentionPeriod { get; set; } = TimeSpan.FromDays(7);
}

Integration with Error Handling¶

The metrics collector integrates seamlessly with the error handling system:

// In ErrorHandlerGraphNode
public async Task<FunctionResult> ExecuteAsync(Kernel kernel, KernelArguments arguments, CancellationToken cancellationToken = default)
{
    try
    {
        // ... error handling logic ...

        // Record metrics for successful error handling
        _metricsCollector?.RecordError(
            executionId: arguments.GetExecutionId(),
            nodeId: NodeId,
            errorContext: errorContext,
            recoveryAction: recoveryAction,
            recoverySuccess: true
        );

        return result;
    }
    catch (Exception ex)
    {
        // Record metrics for failed error handling
        _metricsCollector?.RecordError(
            executionId: arguments.GetExecutionId(),
            nodeId: NodeId,
            errorContext: new ErrorHandlingContext { Exception = ex },
            recoveryAction: null,
            recoverySuccess: false
        );

        throw;
    }
}

Error Types and Recovery Actions¶

GraphErrorType Enum¶

public enum GraphErrorType
{
    Unknown = 0,           // Unknown or unspecified error type
    Validation = 1,        // Validation error before or during execution
    NodeExecution = 2,     // Node execution failed due to internal logic error
    Timeout = 3,           // Timeout occurred during node or graph execution
    Network = 4,           // Network-related error (transient, retryable)
    ServiceUnavailable = 5, // External service unavailable (potentially transient)
    RateLimit = 6,         // Rate limiting exceeded (potentially transient)
    Authentication = 7,     // Authentication or authorization failure
    ResourceExhaustion = 8, // Resource exhaustion (memory, disk, etc.)
    GraphStructure = 9,    // Graph structure or navigation error
    Cancellation = 10,     // Cancellation was requested
    CircuitBreakerOpen = 11, // Circuit breaker is open (operation short-circuited)
    BudgetExhausted = 12   // Resource budget exhausted (CPU, memory, or cost limits)
}

ErrorRecoveryAction Enum¶

public enum ErrorRecoveryAction
{
    Continue = 0,          // Continue execution without recovery
    Retry = 1,             // Retry the failed operation
    Skip = 2,              // Skip the failed node and continue
    Fallback = 3,          // Execute fallback logic or alternative path
    Rollback = 4,          // Rollback to previous known good state
    Halt = 5,              // Halt execution and propagate error
    Escalate = 6,          // Escalate to human intervention
    CircuitBreaker = 7     // Open circuit breaker for the node
}

ErrorSeverity Enum¶

public enum ErrorSeverity
{
    Low = 0,               // Low severity - continue execution with logging
    Medium = 1,            // Medium severity - attempt recovery or retry
    High = 2,              // High severity - halt current branch, try alternatives
    Critical = 3           // Critical severity - halt entire graph execution
}

Error Policies¶

ErrorPolicyRegistry¶

Overview¶

Key Features¶

Policy Registration¶

Runnable example¶

Policy Resolution¶

Policy Rule Configuration¶

RetryPolicyGraphNode¶

Overview¶

Key Features¶

Configuration¶

Retry Strategies¶

Usage Examples¶

Basic Retry Wrapper¶

Custom Retry Condition¶

Retry Outcome Routing¶

Retry Statistics¶

ErrorHandlerGraphNode¶

Overview¶

Key Features¶

Error Categorization¶

Default Error Handlers¶

Usage Examples¶

Basic Error Handler¶

Custom Error Handling¶

Conditional Routing Based on Error Outcomes¶

Error Context and Recovery¶

ErrorMetricsCollector¶

Overview¶

Key Features¶

Metrics Structure¶

Usage Examples¶

Basic Metrics Collection¶

Metrics Queries¶

Batch Processing¶

Configuration Options¶

Integration with Error Handling¶

Error Types and Recovery Actions¶

GraphErrorType Enum¶

ErrorRecoveryAction Enum¶

ErrorSeverity Enum¶

See Also¶