Deployment Operations

Learn how to deploy models to inference servers, manage instances, and handle common deployment scenarios.

Deploying a Model

To deploy a model to an Inference Server:

  1. Navigate to Inference Server: Go to the Inference Server section
  2. Click "Deploy Model": Open the deployment workflow
  3. Select Model: Choose the model you want to deploy
  4. Configure Server:
    • Set instance count
    • Configure GPU allocation
    • Set resource limits
    • Configure auto-scaling (if available)
  5. Review Configuration: Review all settings
  6. Deploy: Submit the configuration to start the deployment
  7. Monitor Deployment: Track deployment progress
  8. Verify Serving: Test that the model is serving correctly

Deployment Configuration:

  • Model Selection: Choose from available trained models
  • Instance Count: Number of instances to deploy
  • GPU Count: GPUs per instance
  • Resource Limits: CPU and memory limits
  • Scaling Policy: Auto-scaling configuration
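
If your installation also exposes a management API, the same configuration can be submitted from a script. The sketch below is a minimal illustration in Python using the requests library; the base URL, endpoint path, and field names are hypothetical placeholders rather than a documented API, so adapt them to what your platform actually provides.

  import requests

  BASE_URL = "https://your-platform.example.com"        # hypothetical host
  HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # hypothetical auth

  # Hypothetical payload mirroring the configuration fields listed above.
  deployment = {
      "model_id": "my-trained-model",                    # Model Selection
      "instance_count": 2,                               # Instance Count
      "gpus_per_instance": 1,                            # GPU Count
      "resources": {"cpu": "4", "memory": "16Gi"},       # Resource Limits
      "autoscaling": {"min": 1, "max": 4},               # Scaling Policy
  }

  # Assumed endpoint, shown for illustration only.
  resp = requests.post(
      f"{BASE_URL}/api/v1/inference-servers/deploy",
      json=deployment,
      headers=HEADERS,
      timeout=30,
  )
  resp.raise_for_status()
  print("Deployment started:", resp.json())

After submitting, you would still monitor progress and verify serving as described in steps 7 and 8.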

Undeploying a Model

To undeploy a model from an Inference Server:

  1. Navigate to Inference Server: Go to the Inference Server section
  2. Select Server: Find the server with the model you want to undeploy
  3. Click "Undeploy": Start the undeployment process
  4. Confirm Undeployment: Acknowledge the confirmation prompt
  5. Monitor Undeployment: Track the undeployment process
  6. Verify Removal: Confirm the model is no longer serving

Undeployment Considerations:

  • Active Requests: Wait for in-flight requests to complete before undeploying
  • Graceful Shutdown: Shut instances down gracefully rather than terminating them abruptly
  • Resource Cleanup: Allocated GPU, CPU, and memory resources are released
  • Model Files: Model files remain available for redeployment
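
For scripted clean-up, the same considerations can be captured by an undeploy request followed by a status poll. As above, this is only a hedged sketch: the endpoint, the drain option, and the response fields are assumptions, not a documented API.

  import time
  import requests

  BASE_URL = "https://your-platform.example.com"        # hypothetical host
  HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # hypothetical auth
  SERVER_ID = "inference-server-123"                     # hypothetical server id

  # Ask for a graceful undeploy; the assumed "drain" flag lets active
  # requests finish before instances shut down.
  requests.post(
      f"{BASE_URL}/api/v1/inference-servers/{SERVER_ID}/undeploy",
      json={"drain": True},
      headers=HEADERS,
      timeout=30,
  ).raise_for_status()

  # Verify Removal: poll until the model is no longer serving.
  while True:
      state = requests.get(
          f"{BASE_URL}/api/v1/inference-servers/{SERVER_ID}",
          headers=HEADERS,
          timeout=30,
      ).json().get("state")
      if state == "undeployed":
          print("Model removed; model files remain available for redeployment.")
          break
      time.sleep(10)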

Managing Instances

Manage inference server instances:

Instance Operations:

  • View Instances: See all instances for a deployed model
  • Scale Instances: Adjust the number of instances
  • Monitor Instances: Track instance health and performance
  • Restart Instances: Restart instances if needed

Instance Configuration:

  • Instance Count: Current number of instances
  • GPU Allocation: GPUs per instance
  • Resource Limits: CPU and memory limits per instance
  • Scaling Policy: Auto-scaling configuration

Instance Status:

  • Active: Instance is running and serving requests
  • Starting: Instance is starting up
  • Stopping: Instance is shutting down
  • Error: Instance has encountered an error
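
The instance operations and status values above could be combined in a small maintenance script, for example to restart any instance stuck in the Error state and then scale the deployment. The endpoints, field names, and exact status strings below are assumptions for illustration.

  import requests

  BASE_URL = "https://your-platform.example.com"        # hypothetical host
  HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # hypothetical auth
  SERVER_ID = "inference-server-123"                     # hypothetical server id

  # View Instances: list every instance for the deployed model.
  instances = requests.get(
      f"{BASE_URL}/api/v1/inference-servers/{SERVER_ID}/instances",
      headers=HEADERS,
      timeout=30,
  ).json()

  for inst in instances:
      print(inst["id"], inst["status"])   # Active, Starting, Stopping, or Error
      if inst["status"] == "Error":
          # Restart Instances: recover an instance that hit an error.
          requests.post(
              f"{BASE_URL}/api/v1/inference-servers/{SERVER_ID}"
              f"/instances/{inst['id']}/restart",
              headers=HEADERS,
              timeout=30,
          ).raise_for_status()

  # Scale Instances: adjust the number of instances for the deployment.
  requests.patch(
      f"{BASE_URL}/api/v1/inference-servers/{SERVER_ID}",
      json={"instance_count": 3},
      headers=HEADERS,
      timeout=30,
  ).raise_for_status()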

Monitoring Performance

Monitor inference server performance:

Performance Metrics:

  • Request Rate: Requests per second
  • Latency: Average response time
  • Throughput: Predictions processed per second
  • Error Rate: Percentage of failed requests

Resource Metrics:

  • GPU Utilization: GPU usage per instance
  • CPU Usage: CPU utilization
  • Memory Usage: Memory consumption
  • Network I/O: Network traffic
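
If you have shell access to the nodes that run the instances, GPU utilization and memory can also be sampled directly with NVIDIA's nvidia-smi tool, independent of whatever metrics the platform itself exposes. A minimal sketch:

  import subprocess

  # Query per-GPU utilization and memory via nvidia-smi.
  out = subprocess.run(
      ["nvidia-smi",
       "--query-gpu=utilization.gpu,memory.used,memory.total",
       "--format=csv,noheader,nounits"],
      capture_output=True, text=True, check=True,
  ).stdout

  for i, line in enumerate(out.strip().splitlines()):
      util, mem_used, mem_total = (v.strip() for v in line.split(","))
      print(f"GPU {i}: {util}% utilization, {mem_used}/{mem_total} MiB memory")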

Health Metrics:

  • Instance Health: Health status of each instance
  • Service Availability: Overall service availability
  • Uptime: How long the service has been running
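
To act on these metrics automatically, a small watcher can poll them and flag threshold violations. The metrics endpoint and field names below are hypothetical placeholders; substitute whatever your installation actually exposes.

  import time
  import requests

  BASE_URL = "https://your-platform.example.com"        # hypothetical host
  HEADERS = {"Authorization": "Bearer YOUR_API_TOKEN"}   # hypothetical auth
  SERVER_ID = "inference-server-123"                     # hypothetical server id

  MAX_ERROR_RATE = 0.01     # alert above 1% failed requests
  MAX_LATENCY_MS = 500      # alert above 500 ms average latency

  while True:
      # Assumed metrics endpoint returning the fields listed above.
      m = requests.get(
          f"{BASE_URL}/api/v1/inference-servers/{SERVER_ID}/metrics",
          headers=HEADERS,
          timeout=30,
      ).json()

      if m["error_rate"] > MAX_ERROR_RATE:
          print(f"ALERT: error rate {m['error_rate']:.2%} exceeds threshold")
      if m["avg_latency_ms"] > MAX_LATENCY_MS:
          print(f"ALERT: average latency {m['avg_latency_ms']} ms exceeds threshold")

      time.sleep(60)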

Next Steps