Deployment Operations
Learn how to deploy models to inference servers, manage instances, and handle common deployment scenarios.
Deploying a Model
To deploy a model to an Inference Server:
- Navigate to Inference Server: Go to the Inference Server section
- Click "Deploy Model": Begin configuring a new deployment
- Select Model: Choose the model you want to deploy
- Configure Server:
  - Set instance count
  - Configure GPU allocation
  - Set resource limits
  - Configure auto-scaling (if available)
- Review Configuration: Review all settings before deploying
- Deploy: Start the deployment process
- Monitor Deployment: Track deployment progress
- Verify Serving: Send a test request to confirm the model responds correctly (one way to do this is sketched below)
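As a final check, you can send a test request directly to the serving endpoint. The sketch below is a minimal example using Python's `requests` library; the URL, model name, and payload format are assumptions and will depend on how your Inference Server exposes deployed models.

```python
import requests

# Hypothetical serving endpoint -- substitute your server's host and the
# route your deployment actually exposes.
INFERENCE_URL = "http://inference-server.example.com/v1/models/my-model:predict"

def verify_serving() -> None:
    """Send one test request and confirm the model answers with HTTP 200."""
    payload = {"inputs": [[0.1, 0.2, 0.3]]}  # input shape depends on your model
    response = requests.post(INFERENCE_URL, json=payload, timeout=10)
    response.raise_for_status()  # raises if the model is not serving correctly
    print("Model is serving. Sample output:", response.json())

if __name__ == "__main__":
    verify_serving()
```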
Deployment Configuration:
- Model Selection: Choose from available trained models
- Instance Count: Number of instances to deploy
- GPU Count: GPUs per instance
- Resource Limits: CPU and memory limits
- Scaling Policy: Auto-scaling configuration
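If your platform also exposes deployments programmatically, the configuration fields above map naturally onto a request payload. The endpoint and field names below are assumptions for illustration, not the platform's actual API; adapt them to whatever your server provides.

```python
import requests

API_URL = "http://inference-server.example.com/api/deployments"  # hypothetical endpoint

# Hypothetical payload mirroring the configuration fields listed above.
deployment_config = {
    "model": "my-trained-model",      # Model Selection
    "instance_count": 2,              # Instance Count
    "gpus_per_instance": 1,           # GPU Count
    "resources": {                    # Resource Limits
        "cpu": "4",
        "memory": "16Gi",
    },
    "scaling": {                      # Scaling Policy (if auto-scaling is available)
        "min_instances": 1,
        "max_instances": 4,
        "target_gpu_utilization": 0.7,
    },
}

response = requests.post(API_URL, json=deployment_config, timeout=30)
response.raise_for_status()
print("Deployment started:", response.json())
```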
Undeploying a Model
To undeploy a model from an Inference Server:
- Navigate to Inference Server: Go to the Inference Server section
- Select Server: Find the server with the model you want to undeploy
- Click "Undeploy": Start the undeployment process
- Confirm Undeployment: Confirm you want to undeploy the model
- Monitor Undeployment: Track the undeployment process
- Verify Removal: Confirm the model is no longer serving
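Where undeployment is scriptable, triggering it and verifying removal might look like the sketch below. The endpoints, response shape, and polling logic are assumptions; check your server's documentation for the real interface.

```python
import time
import requests

BASE_URL = "http://inference-server.example.com/api"   # hypothetical
DEPLOYMENT = "my-trained-model"

# Request undeployment (hypothetical endpoint).
requests.delete(f"{BASE_URL}/deployments/{DEPLOYMENT}", timeout=30).raise_for_status()

# Poll until the server no longer lists the deployment.
for _ in range(60):
    deployments = requests.get(f"{BASE_URL}/deployments", timeout=10).json()
    if DEPLOYMENT not in [d["model"] for d in deployments]:
        print("Model removed from the Inference Server.")
        break
    time.sleep(5)
else:
    print("Deployment still listed; check the server for errors.")
```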
Undeployment Considerations:
- Active Requests: Wait for in-flight requests to complete before shutting down (see the drain sketch after this list)
- Graceful Shutdown: Stop instances in an orderly fashion rather than terminating them abruptly
- Resource Cleanup: Allocated resources are released when instances stop
- Model Files: Model files remain available for redeployment
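One way to honor the first two points is to drain traffic before triggering undeployment: stop routing new requests, then wait until the in-flight count reaches zero. The sketch below assumes a hypothetical metrics endpoint that reports active requests per deployment; the URL and field name are illustrative only.

```python
import time
import requests

# Hypothetical per-deployment metrics endpoint.
METRICS_URL = "http://inference-server.example.com/api/deployments/my-trained-model/metrics"

def drain(timeout_s: int = 300, poll_s: int = 5) -> bool:
    """Wait until the deployment reports zero in-flight requests, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        metrics = requests.get(METRICS_URL, timeout=10).json()
        if metrics.get("active_requests", 0) == 0:
            return True           # safe to undeploy
        time.sleep(poll_s)
    return False                  # still busy -- decide whether to force shutdown

if drain():
    print("No active requests; proceed with undeployment.")
else:
    print("Timed out waiting for requests to drain.")
```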
Managing Instances
Manage inference server instances:
Instance Operations:
- View Instances: See all instances for a deployed model
- Scale Instances: Adjust the number of instances up or down (see the sketch after this list)
- Monitor Instances: Track instance health and performance
- Restart Instances: Restart instances if needed
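If instance management is also available through an API in your environment, scaling could be a single call like the hypothetical sketch below; the endpoint, HTTP method, and field name are assumptions.

```python
import requests

BASE_URL = "http://inference-server.example.com/api"   # hypothetical
DEPLOYMENT = "my-trained-model"

def scale(instance_count: int) -> None:
    """Adjust the number of instances for a deployed model (hypothetical API)."""
    response = requests.patch(
        f"{BASE_URL}/deployments/{DEPLOYMENT}",
        json={"instance_count": instance_count},
        timeout=30,
    )
    response.raise_for_status()
    print(f"Requested scale to {instance_count} instances.")

scale(3)
```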
Instance Configuration:
- Instance Count: Current number of instances
- GPU Allocation: GPUs per instance
- Resource Limits: CPU and memory limits per instance
- Scaling Policy: Auto-scaling configuration
Instance Status:
- Active: Instance is running and serving requests
- Starting: Instance is starting up
- Stopping: Instance is shutting down
- Error: Instance has encountered an error
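To watch these states programmatically, you could poll an instance-listing endpoint and flag anything that is not Active. The endpoint and response shape below are assumptions for illustration.

```python
import requests

# Hypothetical endpoint listing instances for one deployment.
INSTANCES_URL = "http://inference-server.example.com/api/deployments/my-trained-model/instances"

instances = requests.get(INSTANCES_URL, timeout=10).json()
for inst in instances:
    status = inst.get("status", "Unknown")        # e.g. Active, Starting, Stopping, Error
    marker = "OK " if status == "Active" else "!! "
    print(f"{marker}{inst.get('id', '?')}: {status}")

errored = [i for i in instances if i.get("status") == "Error"]
if errored:
    print(f"{len(errored)} instance(s) need attention (consider a restart).")
```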
Monitoring Performance
Monitor inference server performance:
Performance Metrics:
- Request Rate: Requests per second
- Latency: Average response time
- Throughput: Predictions processed per second
- Error Rate: Percentage of failed requests
Resource Metrics:
- GPU Utilization: GPU usage per instance
- CPU Usage: CPU utilization
- Memory Usage: Memory consumption
- Network I/O: Network traffic
Health Metrics:
- Instance Health: Health status of each instance
- Service Availability: Whether the service as a whole is reachable and answering requests
- Uptime: How long the service has been running
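A simple monitoring script can pull these metrics on a schedule and derive the error rate from raw counters. The sketch below assumes a hypothetical JSON metrics endpoint; adapt the field names to whatever your server actually exports, or use the built-in Monitoring views described in Next Steps.

```python
import requests

# Hypothetical per-deployment metrics endpoint.
METRICS_URL = "http://inference-server.example.com/api/deployments/my-trained-model/metrics"

def report_once() -> None:
    """Fetch a metrics snapshot and print the key performance indicators."""
    m = requests.get(METRICS_URL, timeout=10).json()
    total = m.get("request_count", 0)
    failed = m.get("error_count", 0)
    error_rate = (failed / total * 100) if total else 0.0   # Error Rate (%)
    print(f"request rate : {m.get('requests_per_second', 0):.1f} req/s")
    print(f"latency      : {m.get('avg_latency_ms', 0):.0f} ms")
    print(f"error rate   : {error_rate:.2f} %")
    print(f"gpu util     : {m.get('gpu_utilization', 0) * 100:.0f} %")

# Run once here; in practice this could be called on a schedule (e.g. every minute).
report_once()
```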
Next Steps
- Learn about Monitoring for detailed metrics
- Check the Usage Guide for best practices