Inference Server Monitoring
Monitor inference server performance, resource usage, and system health to keep model serving fast and reliable.
Active Models
Track all active models deployed on inference servers:
Model Information:
- Model Name: Name of the deployed model
- Instance Count: Number of instances serving the model
- Status: Current deployment status
- Deployment Time: When the model was deployed
Model Statistics:
- Total Models: Number of models currently deployed
- Active Models: Models currently serving requests
- Total Instances: Total number of instances across all models
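As a rough illustration, the three Model Statistics values can be derived from the per-model information listed above. The `ModelDeployment` record, its field names, and the `"active"` status value are assumptions made for this sketch, not a documented schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ModelDeployment:
    # Hypothetical record mirroring the Model Information fields.
    name: str
    instance_count: int
    status: str            # assumed values, e.g. "active", "pending", "failed"
    deployed_at: datetime


def model_statistics(models):
    """Summarize deployments into the three Model Statistics values."""
    return {
        "total_models": len(models),
        # A model counts as active only if its status says so (assumption).
        "active_models": sum(1 for m in models if m.status == "active"),
        "total_instances": sum(m.instance_count for m in models),
    }


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    deployments = [
        ModelDeployment("model-a", 2, "active", now),
        ModelDeployment("model-b", 1, "pending", now),
    ]
    print(model_statistics(deployments))
```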
Resource Usage
Monitor resource usage across inference servers:
Resource Metrics:
- Total GPUs: Total GPUs available in the cluster
- Allocated GPUs: GPUs currently allocated to inference servers
- Available GPUs: GPUs available for new deployments
- Utilization Percentage: Percentage of GPUs in use
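The relationship between these four metrics can be sketched in a few lines. This example assumes that "in use" means "allocated to inference servers", so utilization is allocated GPUs divided by total GPUs; the function name `gpu_summary` is hypothetical.

```python
def gpu_summary(total_gpus, allocated_gpus):
    """Derive available GPUs and utilization from total and allocated counts."""
    if total_gpus < 0 or allocated_gpus < 0:
        raise ValueError("GPU counts must be non-negative")
    if allocated_gpus > total_gpus:
        raise ValueError("allocated GPUs cannot exceed total GPUs")
    available = total_gpus - allocated_gpus
    # Assumption: utilization = allocated / total. An empty cluster reports 0%.
    utilization = 100.0 * allocated_gpus / total_gpus if total_gpus else 0.0
    return {
        "total": total_gpus,
        "allocated": allocated_gpus,
        "available": available,
        "utilization_pct": utilization,
    }


if __name__ == "__main__":
    print(gpu_summary(16, 12))
```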
Resource Breakdown:
- Per Model: GPU allocation per deployed model
- Per Instance: GPU usage per instance
- Total Usage: Overall GPU utilization
Resource Limits:
- Maximum GPUs: Maximum GPUs that can be allocated
- Per Model Limit: Maximum GPUs per model
- Per Instance Limit: Maximum GPUs per instance
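One way to reason about how the three limits interact is to check a proposed deployment against each of them in turn. The sketch below is illustrative only: the `limits` dictionary keys, the `check_gpu_request` function, and the idea that a model's usage equals instances times GPUs per instance are assumptions, not documented behavior.

```python
def check_gpu_request(instances, gpus_per_instance, limits, allocated_elsewhere=0):
    """Return a list of limit violations for a proposed deployment.

    An empty list means the request fits within all three limits.
    `allocated_elsewhere` is the number of GPUs already held by other models.
    """
    problems = []
    model_total = instances * gpus_per_instance
    if gpus_per_instance > limits["per_instance"]:
        problems.append("per-instance limit exceeded")
    if model_total > limits["per_model"]:
        problems.append("per-model limit exceeded")
    if allocated_elsewhere + model_total > limits["maximum"]:
        problems.append("maximum GPU limit exceeded")
    return problems


if __name__ == "__main__":
    limits = {"maximum": 16, "per_model": 8, "per_instance": 4}  # example values
    print(check_gpu_request(2, 4, limits, allocated_elsewhere=4))
    print(check_gpu_request(3, 4, limits, allocated_elsewhere=8))
```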
Performance (Response Time)
Track inference performance and response times:
Performance Metrics:
- Average Response Time: Average time to process a request
- P50 Latency: Median response time
- P95 Latency: 95th percentile response time
- P99 Latency: 99th percentile response time
- Min/Max Latency: Minimum and maximum response times
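To make the percentile metrics concrete, here is a minimal sketch that computes them from a list of per-request latencies in milliseconds. It uses the nearest-rank method, which is only one of several common percentile definitions, so the values may differ slightly from what a given monitoring system reports. The function names are hypothetical.

```python
import math
import statistics


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for an integer pct in 1..100."""
    if not sorted_values:
        raise ValueError("no samples")
    # Smallest rank such that at least pct% of samples are at or below it.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def latency_summary(latencies_ms):
    """Compute average, P50, P95, P99, min, and max latency."""
    ordered = sorted(latencies_ms)
    return {
        "avg": statistics.fmean(ordered),
        "p50": percentile(ordered, 50),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "min": ordered[0],
        "max": ordered[-1],
    }


if __name__ == "__main__":
    print(latency_summary([12, 15, 11, 240, 14, 13, 18, 16, 17, 19]))
```

Comparing the average with P95 and P99 shows why percentiles matter: a single slow request (240 ms in the sample above) raises the average noticeably, while the median stays close to typical request time.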
Performance Trends:
- Over Time: How response times change across a time period
- By Model: Performance per deployed model
- By Instance: Performance per instance
- Peak Times: Performance during peak usage
Performance Targets:
- Target Latency: Desired response time
- SLA Compliance: Whether response times meet service level agreements (SLAs)
- Performance Alerts: Alerts when latency exceeds thresholds
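The targets above can be expressed as two simple checks. This sketch assumes SLA compliance is measured as the share of requests completing within the target latency, and that an alert fires when a chosen latency figure (for example P95) exceeds a threshold. Both definitions and both function names are assumptions for illustration.

```python
def sla_compliance_pct(latencies_ms, target_ms):
    """Percentage of requests that finished within the target latency."""
    if not latencies_ms:
        raise ValueError("no samples")
    within = sum(1 for value in latencies_ms if value <= target_ms)
    return 100.0 * within / len(latencies_ms)


def latency_alert(observed_ms, threshold_ms):
    """True when the observed latency (e.g. P95) exceeds the alert threshold."""
    return observed_ms > threshold_ms


if __name__ == "__main__":
    samples = [100, 200, 300, 400]
    print(sla_compliance_pct(samples, target_ms=300))
    print(latency_alert(observed_ms=350, threshold_ms=300))
```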
System Health (Uptime)
Monitor system health and availability:
Health Metrics:
- Uptime: How long the service has been running
- Availability: Percentage of time the service is available
- Health Status: Overall system health indicator
- Incident History: History of service incidents
Health Indicators:
- Healthy: All systems operating normally
- Warning: Some issues detected, but the service remains operational
- Critical: Service degradation or failures
- Maintenance: System under maintenance
Availability Metrics:
- Current Uptime: Duration of the current uninterrupted run
- 30-Day Uptime: Uptime over the last 30 days
- MTTR: Mean time to recovery, the average time taken to restore service after an incident
- Incident Count: Number of incidents in a period
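As a worked example of how these availability metrics relate, the sketch below derives an availability percentage and MTTR from a list of incident durations within a reporting window. It assumes each incident is fully described by its downtime duration; the function names are hypothetical.

```python
from datetime import timedelta


def availability_pct(window, downtimes):
    """Percentage of the window during which the service was up."""
    total_down = sum(downtimes, timedelta())
    return 100.0 * (1 - total_down / window)


def mean_time_to_recovery(downtimes):
    """Average downtime per incident; zero if there were no incidents."""
    if not downtimes:
        return timedelta(0)
    return sum(downtimes, timedelta()) / len(downtimes)


if __name__ == "__main__":
    window = timedelta(days=30)
    incidents = [timedelta(hours=2), timedelta(hours=4)]  # example data
    print(f"30-day availability: {availability_pct(window, incidents):.3f}%")
    print(f"MTTR: {mean_time_to_recovery(incidents)}")
    print(f"Incident count: {len(incidents)}")
```

With 6 hours of downtime in a 720-hour window, availability is 100 × (1 − 6/720), or about 99.17%, and MTTR is 3 hours.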
Next Steps
- Check the Usage Guide for best practices
- Review Deployment operations