
Inference Server Monitoring

Monitor inference server performance, resource usage, and system health to keep model serving responsive and reliable.

Active Models

Track all active models deployed on inference servers; a brief summary sketch follows the lists below:

Model Information:

  • Model Name: Name of the deployed model
  • Instance Count: Number of instances serving the model
  • Status: Current deployment status
  • Deployment Time: When the model was deployed

Model Statistics:

  • Total Models: Number of models currently deployed
  • Active Models: Models currently serving requests
  • Total Instances: Total number of instances across all models
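
As a rough illustration of how the fields and statistics above fit together, the sketch below aggregates a list of deployments into those totals. The ModelDeployment record, its field names, and the sample models are assumptions for illustration, not the product's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ModelDeployment:
    """Hypothetical record for one deployed model."""
    name: str
    instance_count: int
    status: str          # e.g. "active", "deploying", "stopped"
    deployed_at: datetime

def summarize(deployments: list[ModelDeployment]) -> dict:
    """Aggregate the model statistics listed above."""
    active = [d for d in deployments if d.status == "active"]
    return {
        "total_models": len(deployments),
        "active_models": len(active),
        "total_instances": sum(d.instance_count for d in deployments),
    }

deployments = [
    ModelDeployment("llama-3-8b", 2, "active", datetime(2024, 5, 1, tzinfo=timezone.utc)),
    ModelDeployment("bert-base", 1, "deploying", datetime(2024, 5, 2, tzinfo=timezone.utc)),
]
print(summarize(deployments))
```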

Resource Usage

Monitor resource usage across inference servers; sketches of the utilization calculation and limit checks follow the lists below:

Resource Metrics:

  • Total GPUs: Total GPUs available in the cluster
  • Allocated GPUs: GPUs currently allocated to inference servers
  • Available GPUs: GPUs available for new deployments
  • Utilization Percentage: Allocated GPUs as a share of total GPUs
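
A minimal sketch of the utilization calculation, assuming the percentage is simply allocated GPUs over total GPUs; the function name and sample values are illustrative.

```python
def gpu_utilization_pct(total_gpus: int, allocated_gpus: int) -> float:
    """Utilization Percentage = allocated / total * 100 (0 when no GPUs exist)."""
    if total_gpus == 0:
        return 0.0
    return 100.0 * allocated_gpus / total_gpus

total, allocated = 16, 12
available = total - allocated                  # Available GPUs
print(f"{gpu_utilization_pct(total, allocated):.1f}% in use, {available} free")
```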

Resource Breakdown:

  • Per Model: GPU allocation per deployed model
  • Per Instance: GPU usage per instance
  • Total Usage: Overall GPU utilization

Resource Limits:

  • Maximum GPUs: Maximum GPUs that can be allocated
  • Per Model Limit: Maximum GPUs per model
  • Per Instance Limit: Maximum GPUs per instance
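
The limits above might be enforced with a simple pre-deployment check along these lines; the limit values, function name, and parameters are assumptions for illustration, not actual configuration keys.

```python
# Hypothetical limits; real values would come from server configuration.
MAX_TOTAL_GPUS = 16        # Maximum GPUs
MAX_GPUS_PER_MODEL = 8     # Per Model Limit
MAX_GPUS_PER_INSTANCE = 4  # Per Instance Limit

def can_deploy(requested_instances: int, gpus_per_instance: int,
               allocated_gpus: int) -> bool:
    """Reject a deployment that would exceed any of the limits above."""
    model_total = requested_instances * gpus_per_instance
    if gpus_per_instance > MAX_GPUS_PER_INSTANCE:
        return False
    if model_total > MAX_GPUS_PER_MODEL:
        return False
    if allocated_gpus + model_total > MAX_TOTAL_GPUS:
        return False
    return True

# 2 instances x 4 GPUs fits the per-model limit but exceeds the cluster total.
print(can_deploy(requested_instances=2, gpus_per_instance=4, allocated_gpus=10))  # False
```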

Performance (Response Time)

Track inference performance and response times; a percentile-computation sketch follows the metric list below:

Performance Metrics:

  • Average Response Time: Average time to process a request
  • P50 Latency: Median response time
  • P95 Latency: 95th percentile response time
  • P99 Latency: 99th percentile response time
  • Min/Max Latency: Minimum and maximum response times
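
These statistics can be derived from raw per-request response times. The sketch below uses a nearest-rank percentile; real monitoring backends may interpolate or use streaming estimates, and the function name and sample data are illustrative.

```python
import statistics

def latency_summary(response_times_ms: list[float]) -> dict:
    """Compute the latency metrics listed above from raw response times."""
    ordered = sorted(response_times_ms)

    def pct(p: float) -> float:
        # Nearest-rank percentile; monitoring systems often interpolate instead.
        idx = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
        return ordered[idx]

    return {
        "avg_ms": statistics.mean(ordered),
        "p50_ms": pct(50),
        "p95_ms": pct(95),
        "p99_ms": pct(99),
        "min_ms": ordered[0],
        "max_ms": ordered[-1],
    }

print(latency_summary([12.1, 15.4, 13.0, 98.2, 14.7, 16.3, 11.9, 250.0]))
```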

Performance Trends:

  • Over Time: How response times evolve across hours, days, and weeks
  • By Model: Performance per deployed model
  • By Instance: Performance per instance
  • Peak Times: Performance during peak usage

Performance Targets:

  • Target Latency: Desired response time
  • SLA Compliance: Meeting service level agreements
  • Performance Alerts: Notifications raised when latency exceeds configured thresholds (see the sketch below)
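
A simple sketch of an SLA check that raises an alert when P95 latency exceeds the target; the threshold value and function are illustrative assumptions, not the system's actual alerting rules.

```python
TARGET_P95_MS = 200.0   # Target Latency (illustrative threshold)

def check_sla(p95_ms: float, target_ms: float = TARGET_P95_MS) -> str:
    """Return an alert message when the P95 latency exceeds the target."""
    if p95_ms <= target_ms:
        return f"OK: p95 {p95_ms:.0f} ms within {target_ms:.0f} ms target"
    return f"ALERT: p95 {p95_ms:.0f} ms exceeds {target_ms:.0f} ms target"

print(check_sla(250.0))
```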

System Health (Uptime)

Monitor system health and availability; sketches of the status mapping and availability math follow the lists below:

Health Metrics:

  • Uptime: How long the service has been running
  • Availability: Percentage of time service is available
  • Health Status: Overall system health indicator
  • Incident History: History of service incidents

Health Indicators:

  • Healthy: All systems operating normally
  • Warning: Some issues detected but service operational
  • Critical: Service degradation or failures
  • Maintenance: System under maintenance
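
One plausible way to derive these indicators from underlying signals is sketched below; the input signals and thresholds are assumptions for illustration, not the system's actual health rules.

```python
def health_status(error_rate: float, gpu_util_pct: float,
                  in_maintenance: bool) -> str:
    """Map a few signals to the indicator levels above (thresholds are illustrative)."""
    if in_maintenance:
        return "maintenance"
    if error_rate > 0.05:                       # >5% failed requests: treat as critical
        return "critical"
    if error_rate > 0.01 or gpu_util_pct > 95:  # minor errors or saturation: warning
        return "warning"
    return "healthy"

print(health_status(error_rate=0.02, gpu_util_pct=80.0, in_maintenance=False))  # warning
```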

Availability Metrics:

  • Current Uptime: Duration of the current uninterrupted run
  • 30-Day Uptime: Uptime over the last 30 days
  • MTTR: Mean Time To Recovery
  • Incident Count: Number of incidents in a period
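
A small sketch of how the availability percentage and MTTR might be computed from incident durations, assuming downtime is the sum of incident durations within the window; the function names and sample numbers are illustrative.

```python
from datetime import timedelta

def availability_pct(window: timedelta, downtime: timedelta) -> float:
    """30-Day Uptime style metric: share of the window the service was up."""
    return 100.0 * (1 - downtime / window)

def mttr(recovery_times: list[timedelta]) -> timedelta:
    """Mean Time To Recovery: average duration of resolved incidents."""
    if not recovery_times:
        return timedelta(0)
    return sum(recovery_times, timedelta(0)) / len(recovery_times)

incidents = [timedelta(minutes=12), timedelta(minutes=45)]
print(availability_pct(timedelta(days=30), sum(incidents, timedelta(0))))  # ~99.87
print(mttr(incidents))                                                     # 0:28:30
```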

Next Steps