Training Operations
Learn how to view training jobs, analyze results, use checkpoints, and manage multiple training jobs.
Viewing Training Jobs
View and manage training jobs:
Training List:
- View all training jobs
- Filter by status, date, or model
- Search by name
- Sort by various criteria
Training Information:
- Training Name: The name assigned to the training job
- Status: Current state of the job (see Training Status below)
- Progress: Percentage of the training completed so far
- Duration: Elapsed training time
- Metrics: Most recently reported training metrics
Training Status:
- Running: Training is in progress
- Completed: Training finished successfully
- Failed: Training ended with an error
- Stopped: Training was stopped before it finished
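The list, filter, and sort operations above can be sketched in a few lines of Python. This is only an illustration: the `TrainingJob` record, the `Status` enum, and `filter_jobs` are hypothetical names chosen for the example, not part of the product's API, and the job data is made up.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Status(Enum):
    """The four training statuses listed above."""
    RUNNING = "Running"
    COMPLETED = "Completed"
    FAILED = "Failed"
    STOPPED = "Stopped"


@dataclass
class TrainingJob:
    """Hypothetical record mirroring the Training Information fields."""
    name: str
    status: Status
    progress: float  # percentage, 0 to 100
    duration: timedelta


def filter_jobs(jobs, status=None, name_contains=None):
    """Return jobs matching an optional status and a case-insensitive name substring."""
    result = []
    for job in jobs:
        if status is not None and job.status is not status:
            continue
        if name_contains is not None and name_contains.lower() not in job.name.lower():
            continue
        result.append(job)
    return result


# Illustrative data only.
jobs = [
    TrainingJob("resnet-baseline", Status.COMPLETED, 100.0, timedelta(hours=3)),
    TrainingJob("resnet-augmented", Status.RUNNING, 42.5, timedelta(hours=1)),
    TrainingJob("bert-finetune", Status.FAILED, 17.0, timedelta(minutes=20)),
]

running = filter_jobs(jobs, status=Status.RUNNING)
longest_first = sorted(jobs, key=lambda j: j.duration, reverse=True)
```

Keeping the status as an enum rather than a free-form string means a typo in a filter fails loudly instead of silently matching nothing.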
Analyzing Results
Analyze training results:
Result Analysis:
- Metrics Comparison: Compare training vs validation metrics
- Trend Analysis: Analyze metric trends over time
- Performance Evaluation: Evaluate model performance
- Error Analysis: Analyze prediction errors
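As a rough illustration of metrics comparison and trend analysis, the sketch below computes the per-epoch gap between validation and training loss and finds the epoch where validation loss was lowest. The function names and the loss values are assumptions made for the example; in practice the series would come from the training's recorded metrics.

```python
def generalization_gap(train_loss, val_loss):
    """Per-epoch difference between validation and training loss."""
    if len(train_loss) != len(val_loss):
        raise ValueError("metric series must have the same length")
    return [v - t for t, v in zip(train_loss, val_loss)]


def best_epoch(val_loss):
    """Zero-based index of the epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=lambda i: val_loss[i])


# Illustrative values only: training loss keeps falling while
# validation loss turns upward after the third epoch.
train_loss = [0.90, 0.60, 0.40, 0.30, 0.20]
val_loss = [0.95, 0.70, 0.55, 0.60, 0.70]

gaps = generalization_gap(train_loss, val_loss)
epoch = best_epoch(val_loss)

# True when every later epoch has a higher validation loss than the best one,
# which is a common sign of overfitting.
overfitting_after_best = all(
    val_loss[i] > val_loss[epoch] for i in range(epoch + 1, len(val_loss))
)
```

A widening gap together with rising validation loss suggests that an earlier checkpoint may generalize better than the final one.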
Analysis Tools:
- Metric Charts: Visualize metrics over time
- Comparison Views: Compare different trainings
- Performance Reports: Generate performance reports
- Export Results: Export results for further analysis
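Once results are exported, they can be processed with ordinary tools. The sketch below shows one way per-epoch metrics might be written as CSV using Python's standard `csv` module. The column names and the `export_metrics_csv` helper are hypothetical; the actual export format is not specified here.

```python
import csv
import io


def export_metrics_csv(rows, fieldnames):
    """Serialize a list of metric dictionaries to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative rows only.
rows = [
    {"epoch": 1, "train_loss": 0.9, "val_loss": 0.95},
    {"epoch": 2, "train_loss": 0.6, "val_loss": 0.7},
]

csv_text = export_metrics_csv(rows, ["epoch", "train_loss", "val_loss"])
```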
Using Checkpoints
Use checkpoints for evaluation and deployment:
Checkpoint Usage:
- Model Evaluation: Evaluate checkpoints on test data
- Model Comparison: Compare different checkpoints
- Best Checkpoint Selection: Select the best-performing checkpoint
- Model Deployment: Deploy checkpoints to inference servers
Checkpoint Operations:
- View Checkpoints: View all checkpoints for a training
- Compare Metrics: Compare checkpoint metrics
- Download: Download checkpoint files
- Deploy: Deploy a checkpoint to an inference server
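Selecting the best checkpoint usually comes down to comparing one metric across all checkpoints of a training. The following is a minimal sketch under that assumption; the `Checkpoint` record, its fields, and `select_best` are hypothetical and the values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    """Hypothetical checkpoint record with two evaluation metrics."""
    step: int
    val_loss: float
    val_accuracy: float


def select_best(checkpoints, metric="val_loss", higher_is_better=False):
    """Return the checkpoint with the best value of the given metric."""
    if not checkpoints:
        raise ValueError("no checkpoints to compare")

    def metric_value(checkpoint):
        return getattr(checkpoint, metric)

    if higher_is_better:
        return max(checkpoints, key=metric_value)
    return min(checkpoints, key=metric_value)


# Illustrative data only.
checkpoints = [
    Checkpoint(step=1000, val_loss=0.70, val_accuracy=0.78),
    Checkpoint(step=2000, val_loss=0.55, val_accuracy=0.84),
    Checkpoint(step=3000, val_loss=0.60, val_accuracy=0.85),
]

lowest_loss = select_best(checkpoints)
highest_accuracy = select_best(checkpoints, metric="val_accuracy", higher_is_better=True)
```

Note that the two metrics can disagree, as they do here, so it helps to decide which metric matters for deployment before comparing.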
Managing Multiple Training Jobs
Manage multiple training jobs efficiently:
Management Features:
- Bulk Operations: Select and manage multiple trainings
- Status Monitoring: Monitor the status of all trainings
- Resource Management: Manage GPU allocation across trainings
- Priority Management: Set training priorities
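To make the interaction between priorities and GPU allocation concrete, here is a simplified sketch of a greedy allocator: higher-priority trainings are served first, and a training that does not fit in the remaining GPUs stays queued. This is an assumed policy for illustration only; the scheduler actually used by the platform may behave differently, and `allocate_gpus` is a hypothetical function.

```python
def allocate_gpus(requests, total_gpus):
    """Greedily assign GPUs to trainings in descending priority order.

    requests: list of (name, priority, gpus_needed) tuples.
    Returns (allocation, free), where allocation maps each name to the
    number of GPUs granted (0 means the training stays queued).
    """
    allocation = {}
    free = total_gpus
    # sorted() is stable, so equal priorities keep their submission order.
    for name, _priority, needed in sorted(requests, key=lambda r: -r[1]):
        if needed <= free:
            allocation[name] = needed
            free -= needed
        else:
            allocation[name] = 0
    return allocation, free


# Illustrative requests only: (name, priority, gpus_needed).
requests = [
    ("nightly-sweep", 1, 4),
    ("release-candidate", 3, 2),
    ("ablation-study", 2, 4),
]

allocation, free = allocate_gpus(requests, total_gpus=8)
```

With 8 GPUs, the two higher-priority trainings are placed and the lowest-priority one waits, leaving 2 GPUs idle. A real scheduler might backfill those with a smaller job.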
Best Practices:
- Organize by Project: Group trainings by project
- Use Descriptive Names: Use clear naming conventions
- Monitor Resources: Monitor resource usage
- Clean Up: Remove completed or failed trainings
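The organization and clean-up practices above can be supported by a consistent naming scheme. The sketch below assumes a hypothetical `project/run-name` convention and plain status strings; both helpers are illustrative and not part of the product.

```python
from collections import defaultdict


def group_by_project(job_names, separator="/"):
    """Group run names by the project prefix before the separator."""
    groups = defaultdict(list)
    for name in job_names:
        project, found, run = name.partition(separator)
        if not found:
            # No prefix present: file the job under a catch-all group.
            project, run = "unassigned", name
        groups[project].append(run)
    return dict(groups)


def cleanup_candidates(statuses):
    """Return names of trainings that are Completed or Failed, sorted by name."""
    finished = {"Completed", "Failed"}
    return sorted(name for name, status in statuses.items() if status in finished)


# Illustrative data only.
grouped = group_by_project(
    ["vision/resnet-a", "vision/resnet-b", "nlp/bert", "scratch"]
)
to_remove = cleanup_candidates(
    {"vision/resnet-a": "Completed", "vision/resnet-b": "Running", "nlp/bert": "Failed"}
)
```

Before removing a completed training, it is worth confirming that any checkpoint you still need has been downloaded or deployed.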
Next Steps
- Check the Usage Guide for best practices
- Review Core Concepts to understand how training works