Deployment Operations
Learn how to deploy models to inference servers, manage instances, and handle common deployment scenarios.
Deploying a Model
To deploy a model to an Inference Server:
- Navigate to Inference Server: Go to the Inference Server section
- Click "Deploy Model": Begin configuring a new deployment
- Select Model: Choose the model you want to deploy
- Configure Server:
  - Set instance count
  - Configure GPU allocation
  - Set resource limits
  - Configure auto-scaling (if available)
- Review Configuration: Review all settings before deploying
- Deploy: Start the deployment process
- Monitor Deployment: Track deployment progress
- Verify Serving: Send a test request to confirm the model responds correctly (one way to do this is sketched below)
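As a final check, you can send a test request directly to the serving endpoint. The sketch below is a minimal example using Python's `requests` library; the URL, model name, and payload format are assumptions and will depend on how your Inference Server exposes deployed models.

```python
import requests

# Hypothetical serving endpoint -- substitute your server's host and the
# route your deployment actually exposes.
INFERENCE_URL = "http://inference-server.example.com/v1/models/my-model:predict"

def verify_serving() -> None:
    """Send one test request and confirm the model answers with HTTP 200."""
    payload = {"inputs": [[0.1, 0.2, 0.3]]}  # input shape depends on your model
    response = requests.post(INFERENCE_URL, json=payload, timeout=10)
    response.raise_for_status()  # raises if the model is not serving correctly
    print("Model is serving. Sample output:", response.json())

if __name__ == "__main__":
    verify_serving()
```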
Deployment Configuration:
- Model Selection: Choose from available trained models
- Instance Count: Number of instances to deploy
- GPU Count: GPUs per instance
- Resource Limits: CPU and memory limits
- Scaling Policy: Auto-scaling configuration
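If your platform also exposes deployments programmatically, the configuration fields above map naturally onto a request payload. The endpoint and field names below are assumptions for illustration, not the platform's actual API; adapt them to whatever your server provides.

```python
import requests

API_URL = "http://inference-server.example.com/api/deployments"  # hypothetical endpoint

# Hypothetical payload mirroring the configuration fields listed above.
deployment_config = {
    "model": "my-trained-model",      # Model Selection
    "instance_count": 2,              # Instance Count
    "gpus_per_instance": 1,           # GPU Count
    "resources": {                    # Resource Limits
        "cpu": "4",
        "memory": "16Gi",
    },
    "scaling": {                      # Scaling Policy (if auto-scaling is available)
        "min_instances": 1,
        "max_instances": 4,
        "target_gpu_utilization": 0.7,
    },
}

response = requests.post(API_URL, json=deployment_config, timeout=30)
response.raise_for_status()
print("Deployment started:", response.json())
```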
Undeploying a Model
To undeploy a model from an Inference Server:
- Navigate to Inference Server: Go to the Inference Server section
- Select Server: Find the server with the model you want to undeploy
- Click "Undeploy": Start the undeployment process
- Confirm Undeployment: Confirm you want to undeploy the model
- Monitor Undeployment: Track the undeployment process
- Verify Removal: Confirm the model is no longer serving
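Where undeployment is scriptable, triggering it and verifying removal might look like the sketch below. The endpoints, response shape, and polling logic are assumptions; check your server's documentation for the real interface.

```python
import time
import requests

BASE_URL = "http://inference-server.example.com/api"   # hypothetical
DEPLOYMENT = "my-trained-model"

# Request undeployment (hypothetical endpoint).
requests.delete(f"{BASE_URL}/deployments/{DEPLOYMENT}", timeout=30).raise_for_status()

# Poll until the server no longer lists the deployment.
for _ in range(60):
    deployments = requests.get(f"{BASE_URL}/deployments", timeout=10).json()
    if DEPLOYMENT not in [d["model"] for d in deployments]:
        print("Model removed from the Inference Server.")
        break
    time.sleep(5)
else:
    print("Deployment still listed; check the server for errors.")
```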
Undeployment Considerations:
- Active Requests: Wait for in-flight requests to complete before shutting down (see the drain sketch after this list)
- Graceful Shutdown: Stop instances in an orderly fashion rather than terminating them abruptly
- Resource Cleanup: Allocated resources are released when instances stop
- Model Files: Model files remain available for redeployment
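One way to honor the first two points is to drain traffic before triggering undeployment: stop routing new requests, then wait until the in-flight count reaches zero. The sketch below assumes a hypothetical metrics endpoint that reports active requests per deployment; the URL and field name are illustrative only.

```python
import time
import requests

# Hypothetical per-deployment metrics endpoint.
METRICS_URL = "http://inference-server.example.com/api/deployments/my-trained-model/metrics"

def drain(timeout_s: int = 300, poll_s: int = 5) -> bool:
    """Wait until the deployment reports zero in-flight requests, or time out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        metrics = requests.get(METRICS_URL, timeout=10).json()
        if metrics.get("active_requests", 0) == 0:
            return True           # safe to undeploy
        time.sleep(poll_s)
    return False                  # still busy -- decide whether to force shutdown

if drain():
    print("No active requests; proceed with undeployment.")
else:
    print("Timed out waiting for requests to drain.")
```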
Managing Instances
Manage inference server instances:
Instance Operations:
- View Instances: See all instances for a deployed model
- Scale Instances: Adjust the number of instances up or down (see the sketch after this list)
- Monitor Instances: Track instance health and performance
- Restart Instances: Restart instances if needed
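If instance management is also available through an API in your environment, scaling could be a single call like the hypothetical sketch below; the endpoint, HTTP method, and field name are assumptions.

```python
import requests

BASE_URL = "http://inference-server.example.com/api"   # hypothetical
DEPLOYMENT = "my-trained-model"

def scale(instance_count: int) -> None:
    """Adjust the number of instances for a deployed model (hypothetical API)."""
    response = requests.patch(
        f"{BASE_URL}/deployments/{DEPLOYMENT}",
        json={"instance_count": instance_count},
        timeout=30,
    )
    response.raise_for_status()
    print(f"Requested scale to {instance_count} instances.")

scale(3)
```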
Instance Configuration:
- Instance Count: Current number of instances
- GPU Allocation: GPUs per instance
- Resource Limits: CPU and memory limits per instance
- Scaling Policy: Auto-scaling configuration
Instance Status:
- Active: Instance is running and serving requests
- Starting: Instance is starting up
- Stopping: Instance is shutting down
- Error: Instance has encountered an error
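To watch these states programmatically, you could poll an instance-listing endpoint and flag anything that is not Active. The endpoint and response shape below are assumptions for illustration.

```python
import requests

# Hypothetical endpoint listing instances for one deployment.
INSTANCES_URL = "http://inference-server.example.com/api/deployments/my-trained-model/instances"

instances = requests.get(INSTANCES_URL, timeout=10).json()
for inst in instances:
    status = inst.get("status", "Unknown")        # e.g. Active, Starting, Stopping, Error
    marker = "OK " if status == "Active" else "!! "
    print(f"{marker}{inst.get('id', '?')}: {status}")

errored = [i for i in instances if i.get("status") == "Error"]
if errored:
    print(f"{len(errored)} instance(s) need attention (consider a restart).")
```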
Monitoring Performance
Monitor inference server performance:
Performance Metrics:
- Request Rate: Requests per second
- Latency: Average response time
- Throughput: Predictions processed per second
- Error Rate: Percentage of failed requests
Resource Metrics:
- GPU Utilization: GPU usage per instance
- CPU Usage: CPU utilization
- Memory Usage: Memory consumption
- Network I/O: Network traffic
Health Metrics:
- Instance Health: Health status of each instance
- Service Availability: Whether the service as a whole is reachable and answering requests
- Uptime: How long the service has been running
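A simple monitoring script can pull these metrics on a schedule and derive the error rate from raw counters. The sketch below assumes a hypothetical JSON metrics endpoint; adapt the field names to whatever your server actually exports, or use the built-in Monitoring views described in Next Steps.

```python
import requests

# Hypothetical per-deployment metrics endpoint.
METRICS_URL = "http://inference-server.example.com/api/deployments/my-trained-model/metrics"

def report_once() -> None:
    """Fetch a metrics snapshot and print the key performance indicators."""
    m = requests.get(METRICS_URL, timeout=10).json()
    total = m.get("request_count", 0)
    failed = m.get("error_count", 0)
    error_rate = (failed / total * 100) if total else 0.0   # Error Rate (%)
    print(f"request rate : {m.get('requests_per_second', 0):.1f} req/s")
    print(f"latency      : {m.get('avg_latency_ms', 0):.0f} ms")
    print(f"error rate   : {error_rate:.2f} %")
    print(f"gpu util     : {m.get('gpu_utilization', 0) * 100:.0f} %")

# Run once here; in practice this could be called on a schedule (e.g. every minute).
report_once()
```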
Next Steps
- Learn about Monitoring for detailed metrics
- Check the Usage Guide for best practices