
Inference Server Monitoring

Monitor inference server performance, resource usage, and system health to keep model serving responsive and reliable.

Active Models

Track all active models deployed on inference servers; a brief summary sketch follows the lists below:

Model Information:

  • Model Name: Name of the deployed model
  • Instance Count: Number of instances serving the model
  • Status: Current deployment status
  • Deployment Time: When the model was deployed

Model Statistics:

  • Total Models: Number of models currently deployed
  • Active Models: Models currently serving requests
  • Total Instances: Total number of instances across all models
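
As a rough illustration of how the fields and statistics above fit together, the sketch below aggregates a list of deployments into those totals. The ModelDeployment record, its field names, and the sample models are assumptions for illustration, not the product's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ModelDeployment:
    """Hypothetical record for one deployed model."""
    name: str
    instance_count: int
    status: str          # e.g. "active", "deploying", "stopped"
    deployed_at: datetime

def summarize(deployments: list[ModelDeployment]) -> dict:
    """Aggregate the model statistics listed above."""
    active = [d for d in deployments if d.status == "active"]
    return {
        "total_models": len(deployments),
        "active_models": len(active),
        "total_instances": sum(d.instance_count for d in deployments),
    }

deployments = [
    ModelDeployment("llama-3-8b", 2, "active", datetime(2024, 5, 1, tzinfo=timezone.utc)),
    ModelDeployment("bert-base", 1, "deploying", datetime(2024, 5, 2, tzinfo=timezone.utc)),
]
print(summarize(deployments))
```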

Resource Usage

Monitor resource usage across inference servers; sketches of the utilization calculation and limit checks follow the lists below:

Resource Metrics:

  • Total GPUs: Total GPUs available in the cluster
  • Allocated GPUs: GPUs currently allocated to inference servers
  • Available GPUs: GPUs available for new deployments
  • Utilization Percentage: Allocated GPUs as a share of total GPUs
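
A minimal sketch of the utilization calculation, assuming the percentage is simply allocated GPUs over total GPUs; the function name and sample values are illustrative.

```python
def gpu_utilization_pct(total_gpus: int, allocated_gpus: int) -> float:
    """Utilization Percentage = allocated / total * 100 (0 when no GPUs exist)."""
    if total_gpus == 0:
        return 0.0
    return 100.0 * allocated_gpus / total_gpus

total, allocated = 16, 12
available = total - allocated                  # Available GPUs
print(f"{gpu_utilization_pct(total, allocated):.1f}% in use, {available} free")
```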

Resource Breakdown:

  • Per Model: GPU allocation per deployed model
  • Per Instance: GPU usage per instance
  • Total Usage: Overall GPU utilization

Resource Limits:

  • Maximum GPUs: Maximum GPUs that can be allocated
  • Per Model Limit: Maximum GPUs per model
  • Per Instance Limit: Maximum GPUs per instance
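
The limits above might be enforced with a simple pre-deployment check along these lines; the limit values, function name, and parameters are assumptions for illustration, not actual configuration keys.

```python
# Hypothetical limits; real values would come from server configuration.
MAX_TOTAL_GPUS = 16        # Maximum GPUs
MAX_GPUS_PER_MODEL = 8     # Per Model Limit
MAX_GPUS_PER_INSTANCE = 4  # Per Instance Limit

def can_deploy(requested_instances: int, gpus_per_instance: int,
               allocated_gpus: int) -> bool:
    """Reject a deployment that would exceed any of the limits above."""
    model_total = requested_instances * gpus_per_instance
    if gpus_per_instance > MAX_GPUS_PER_INSTANCE:
        return False
    if model_total > MAX_GPUS_PER_MODEL:
        return False
    if allocated_gpus + model_total > MAX_TOTAL_GPUS:
        return False
    return True

# 2 instances x 4 GPUs fits the per-model limit but exceeds the cluster total.
print(can_deploy(requested_instances=2, gpus_per_instance=4, allocated_gpus=10))  # False
```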

Performance (Response Time)

Track inference performance and response times; a percentile-computation sketch follows the metric list below:

Performance Metrics:

  • Average Response Time: Average time to process a request
  • P50 Latency: Median response time
  • P95 Latency: 95th percentile response time
  • P99 Latency: 99th percentile response time
  • Min/Max Latency: Minimum and maximum response times
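
These statistics can be derived from raw per-request response times. The sketch below uses a nearest-rank percentile; real monitoring backends may interpolate or use streaming estimates, and the function name and sample data are illustrative.

```python
import statistics

def latency_summary(response_times_ms: list[float]) -> dict:
    """Compute the latency metrics listed above from raw response times."""
    ordered = sorted(response_times_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile; monitoring systems often interpolate instead.
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]

    return {
        "avg_ms": statistics.mean(ordered),
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
    }

print(latency_summary([12.1, 15.4, 13.0, 98.2, 14.7, 16.3, 11.9, 250.0]))
```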

Performance Trends:

  • Over Time: How response times evolve across hours, days, and weeks
  • By Model: Performance per deployed model
  • By Instance: Performance per instance
  • Peak Times: Performance during peak usage

Performance Targets:

  • Target Latency: Desired response time
  • SLA Compliance: Meeting service level agreements
  • Performance Alerts: Notifications raised when latency exceeds configured thresholds (see the sketch below)
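
A simple sketch of an SLA check that raises an alert when P95 latency exceeds the target; the threshold value and function are illustrative assumptions, not the system's actual alerting rules.

```python
TARGET_P95_MS = 200.0   # Target Latency (illustrative threshold)

def check_sla(p95_ms: float, target_ms: float = TARGET_P95_MS) -> str:
    """Return an alert message when the P95 latency exceeds the target."""
    if p95_ms <= target_ms:
        return f"OK: p95 {p95_ms:.0f} ms within {target_ms:.0f} ms target"
    return f"ALERT: p95 {p95_ms:.0f} ms exceeds {target_ms:.0f} ms target"

print(check_sla(250.0))
```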

System Health (Uptime)

Monitor system health and availability; sketches of the status mapping and availability math follow the lists below:

Health Metrics:

  • Uptime: How long the service has been running
  • Availability: Percentage of time service is available
  • Health Status: Overall system health indicator
  • Incident History: History of service incidents

Health Indicators:

  • Healthy: All systems operating normally
  • Warning: Some issues detected but service operational
  • Critical: Service degradation or failures
  • Maintenance: System under maintenance
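
One plausible way to derive these indicators from underlying signals is sketched below; the input signals and thresholds are assumptions for illustration, not the system's actual health rules.

```python
def health_status(error_rate: float, gpu_util_pct: float,
                  in_maintenance: bool) -> str:
    """Map a few signals to the indicator levels above (thresholds are illustrative)."""
    if in_maintenance:
        return "maintenance"
    if error_rate > 0.05:                       # >5% failed requests: treat as critical
        return "critical"
    if error_rate > 0.01 or gpu_util_pct > 95:  # minor errors or saturation: warning
        return "warning"
    return "healthy"

print(health_status(error_rate=0.02, gpu_util_pct=80.0, in_maintenance=False))  # warning
```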

Availability Metrics:

  • Current Uptime: Duration of the current uninterrupted run
  • 30-Day Uptime: Uptime over the last 30 days
  • MTTR: Mean Time To Recovery
  • Incident Count: Number of incidents in a period
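
A small sketch of how the availability percentage and MTTR might be computed from incident durations, assuming downtime is the sum of incident durations within the window; the function names and sample numbers are illustrative.

```python
from datetime import timedelta

def availability_pct(window: timedelta, downtime: timedelta) -> float:
    """30-Day Uptime style metric: share of the window the service was up."""
    return 100.0 * (1 - downtime / window)

def mttr(recovery_times: list[timedelta]) -> timedelta:
    """Mean Time To Recovery: average duration of resolved incidents."""
    if not recovery_times:
        return timedelta(0)
    return sum(recovery_times, timedelta(0)) / len(recovery_times)

incidents = [timedelta(minutes=12), timedelta(minutes=45)]
print(availability_pct(timedelta(days=30), sum(incidents, timedelta(0))))  # ~99.87
print(mttr(incidents))                                                     # 0:28:30
```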

Next Steps