Training Usage Guide
Complete step-by-step guide to training models, recommended configurations, and troubleshooting.
Complete Step-by-Step Guide
Follow these steps to train a model, from dataset preparation through checkpoint selection:
Step 1: Prepare Datasets
- Ensure every dataset you plan to use has the READY status
- Verify features and targets are configured
- Check dataset health
Step 2: Create Training
- Navigate to Training section
- Click "New Training"
- Enter training name and description
Step 3: Select Datasets
- Choose datasets for training
- Configure features and targets
- Validate dataset selection
Step 4: Configure Data Split
- Choose split method (percentage or date range)
- Configure training/validation split
- Review split configuration
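The two split methods in Step 4 can be sketched in a few lines of Python. This is only an illustration of the idea, not the platform's implementation: the function names `split_by_percentage` and `split_by_date`, the row layout, and the assumption that rows are already ordered by time are all hypothetical.

```python
from datetime import date

def split_by_percentage(rows, train_fraction=0.8):
    """Put the first `train_fraction` of rows in training, the rest in validation."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def split_by_date(rows, cutoff, date_key="date"):
    """Rows dated before `cutoff` go to training; rows on or after it go to validation."""
    train = [r for r in rows if r[date_key] < cutoff]
    valid = [r for r in rows if r[date_key] >= cutoff]
    return train, valid

# Ten illustrative rows, one per day (made-up data).
rows = [{"date": date(2024, 1, d), "value": d} for d in range(1, 11)]

train_pct, valid_pct = split_by_percentage(rows, 0.8)
train_dt, valid_dt = split_by_date(rows, date(2024, 1, 8))

print(len(train_pct), len(valid_pct))
print(len(train_dt), len(valid_dt))
```

A date-range split keeps every validation row later than every training row, which is usually what you want for time-dependent data; a percentage split is simpler when row order carries no meaning.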
Step 5: Configure Architecture
- Select model architecture (NeoLDM or Transformer)
- Configure model size
- Customize YAML configuration if needed
- Set GPU count
Step 6: Start Training
- Review all configurations
- Start training job
- Monitor training progress
Step 7: Monitor and Evaluate
- Monitor training metrics
- Track checkpoints
- Evaluate model performance
- Select best checkpoint
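The guide does not state which criterion defines the "best" checkpoint. A common choice, assumed here, is the checkpoint with the lowest validation loss. The sketch below uses a hypothetical list of checkpoint records; the field names `step` and `val_loss` and the values are illustrative only.

```python
def select_best_checkpoint(checkpoints):
    """Return the checkpoint record with the lowest validation loss."""
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return min(checkpoints, key=lambda c: c["val_loss"])

# Made-up checkpoint records for illustration.
checkpoints = [
    {"step": 1000, "val_loss": 0.92},
    {"step": 2000, "val_loss": 0.71},
    {"step": 3000, "val_loss": 0.78},
]

best = select_best_checkpoint(checkpoints)
print(best["step"])
```

Note that the latest checkpoint is not necessarily the best one: in this example the validation loss rises again after step 2000.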
Recommended Configurations
Best practices for training configuration:
Small Models:
- Architecture: NeoLDM Small
- GPUs: 1-2
- Batch Size: 2048-4096
- Use Case: Development, testing, small datasets
Medium Models:
- Architecture: NeoLDM Medium
- GPUs: 2-4
- Batch Size: 4096-8192
- Use Case: Moderate datasets, production training
Large Models:
- Architecture: NeoLDM Large
- GPUs: 4 or more
- Batch Size: 8192+
- Use Case: Large datasets, high-performance requirements
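The three recommendations above can be kept as a small lookup table, which makes it easy to check which model sizes fit a given GPU allocation. The numbers come from the lists above; the dictionary layout and the helper name `tiers_for_gpu_count` are hypothetical. Because the GPU ranges overlap at 2 and 4, the helper returns every matching size rather than a single answer.

```python
# Recommended ranges copied from the guide; None means "no upper bound".
RECOMMENDED = {
    "NeoLDM Small":  {"gpus": (1, 2),    "batch_size": (2048, 4096)},
    "NeoLDM Medium": {"gpus": (2, 4),    "batch_size": (4096, 8192)},
    "NeoLDM Large":  {"gpus": (4, None), "batch_size": (8192, None)},
}

def tiers_for_gpu_count(gpu_count):
    """Return the model sizes whose recommended GPU range includes `gpu_count`."""
    matches = []
    for name, rec in RECOMMENDED.items():
        low, high = rec["gpus"]
        if gpu_count >= low and (high is None or gpu_count <= high):
            matches.append(name)
    return matches

print(tiers_for_gpu_count(1))
print(tiers_for_gpu_count(4))
print(tiers_for_gpu_count(8))
```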
Hyperparameter Guidelines:
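- Learning Rate: Start with 1e-5. Lower it if the loss diverges or oscillates; raise it if the loss decreases too slowly
- Batch Size: Larger batches give more stable gradient estimates but require more GPU memory
- Epochs: Stop adding epochs once validation loss stops improving; further training risks overfitting
- Dropout: Use for regularization, and increase it if the model overfits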
Interpreting Metrics
Understanding training metrics:
Loss Metrics:
- Training Loss: Should decrease over time
- Validation Loss: Should decrease along with training loss
- Gap: A large or widening gap between validation loss and training loss indicates overfitting
Accuracy Metrics:
- Training Accuracy: Model performance on training data
- Validation Accuracy: Model performance on validation data
- Improvement: Both accuracies should rise over epochs
Other Metrics:
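- Learning Rate: A diverging loss suggests the rate is too high; a loss that barely moves suggests it is too low
- Gradient Norm: Should be stable; sudden spikes suggest training instability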
- Resource Usage: Monitor GPU and memory usage
Red Flags:
- Validation loss increasing while training loss keeps decreasing (overfitting)
- Training loss not converging
- Metrics not improving over several epochs
- Resource exhaustion (GPU or memory)
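Some of these red flags can be checked automatically from the per-epoch loss history. The sketch below is one possible heuristic, not a feature of the platform: the function name `find_red_flags` and the thresholds `patience` and `gap_ratio` are hypothetical and should be tuned for your own runs. The loss values are made up for illustration.

```python
def find_red_flags(train_loss, val_loss, patience=3, gap_ratio=0.5):
    """Return a list of warnings derived from per-epoch loss histories."""
    flags = []

    # Validation loss has risen for `patience` consecutive epochs.
    recent = val_loss[-(patience + 1):]
    if len(recent) == patience + 1 and all(b > a for a, b in zip(recent, recent[1:])):
        flags.append("validation loss increasing")

    # Validation loss exceeds training loss by more than `gap_ratio` (relative).
    if train_loss and val_loss and train_loss[-1] > 0:
        gap = val_loss[-1] - train_loss[-1]
        if gap / train_loss[-1] > gap_ratio:
            flags.append("large train/validation gap")

    # Training loss has not reached a new minimum in the last `patience` epochs.
    if len(train_loss) > patience:
        if min(train_loss[-patience:]) >= min(train_loss[:-patience]):
            flags.append("training loss not improving")

    return flags

# An overfitting run: training loss keeps falling while validation loss climbs.
overfit = find_red_flags(
    train_loss=[1.0, 0.8, 0.6, 0.5, 0.4, 0.3],
    val_loss=[1.1, 0.9, 0.8, 0.85, 0.9, 0.95],
)
# A healthy run: both losses fall together.
healthy = find_red_flags(
    train_loss=[1.0, 0.8, 0.6, 0.5],
    val_loss=[1.1, 0.9, 0.7, 0.6],
)
print(overfit)
print(healthy)
```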
Common Troubleshooting
Issue: Training Fails to Start
- Symptom: Training job fails immediately
- Possible Causes:
  - Invalid configuration
  - Insufficient resources
  - Dataset issues
- Solutions:
  - Review configuration
  - Check resource availability
  - Verify dataset status
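The three causes above (configuration, resources, datasets) can be screened before a job is submitted. The sketch below is a hypothetical pre-flight check: the configuration layout, the field names, and the function `preflight_check` are assumptions made for illustration and do not describe the platform's own validation. The architecture names and the READY status are taken from this guide.

```python
KNOWN_ARCHITECTURES = {"NeoLDM", "Transformer"}

def preflight_check(config, available_gpus):
    """Return a list of problems that would likely stop a training job from starting."""
    problems = []

    # Invalid configuration.
    if not config.get("name"):
        problems.append("training name is missing")
    if config.get("architecture") not in KNOWN_ARCHITECTURES:
        problems.append("unknown architecture")

    # Insufficient resources.
    gpus = config.get("gpus", 0)
    if gpus < 1:
        problems.append("GPU count must be at least 1")
    elif gpus > available_gpus:
        problems.append("requested more GPUs than are available")

    # Dataset issues.
    datasets = config.get("datasets", [])
    if not datasets:
        problems.append("no datasets selected")
    for ds in datasets:
        if ds.get("status") != "READY":
            problems.append(f"dataset {ds.get('name')!r} is not READY")

    return problems

# Illustrative configuration with two deliberate problems.
config = {
    "name": "demo-run",
    "architecture": "NeoLDM",
    "gpus": 8,
    "datasets": [
        {"name": "sales", "status": "READY"},
        {"name": "returns", "status": "PROCESSING"},
    ],
}
problems = preflight_check(config, available_gpus=4)
print(problems)
```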
Issue: Training Stalls
- Symptom: Training stops making progress
- Possible Causes:
  - Resource constraints
  - Data loading issues
  - Network problems
- Solutions:
  - Check resource usage
  - Review data loading
  - Check network connectivity
Issue: Overfitting
- Symptom: Large gap between training and validation metrics
- Possible Causes:
  - Model too complex
  - Insufficient regularization
  - Small dataset
- Solutions:
  - Increase regularization
  - Reduce model complexity
  - Use more data
Issue: Underfitting
- Symptom: Both training and validation metrics are poor
- Possible Causes:
  - Model too simple
  - Insufficient training
  - Poor feature selection
- Solutions:
  - Increase model complexity
  - Train for more epochs
  - Improve feature selection
Best Practices
Training Best Practices:
- Start with small models for experimentation
- Monitor metrics closely during training
- Use appropriate data splits
- Regularize to prevent overfitting
- Track experiments systematically
Resource Management:
- Allocate GPUs appropriately
- Monitor resource usage
- Optimize batch sizes
- Use distributed training for large models
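When choosing batch sizes for multi-GPU runs, it helps to know how much work each GPU receives. The sketch below assumes data-parallel training in which a global batch is divided evenly across GPUs; the guide does not say whether its batch-size figures are global or per-GPU, so treat that as an assumption. The helper name `per_gpu_batch_size` is hypothetical.

```python
def per_gpu_batch_size(global_batch_size, gpu_count):
    """Split a global batch evenly across GPUs (data-parallel assumption)."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    if global_batch_size % gpu_count != 0:
        raise ValueError("global batch size must divide evenly across GPUs")
    return global_batch_size // gpu_count

# A global batch of 4096 on 2 GPUs and of 8192 on 4 GPUs give the same per-GPU load.
small = per_gpu_batch_size(4096, 2)
medium = per_gpu_batch_size(8192, 4)
print(small, medium)
```

If the per-GPU batch does not fit in memory, either lower the global batch size or add GPUs.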
Experiment Management:
- Use descriptive training names
- Document configurations
- Record the final metrics and selected checkpoint for every run
- Compare results systematically
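One lightweight way to follow these practices is to keep a structured record per run and rank runs by a shared metric. The sketch below is a generic illustration, not part of the platform: the helper names, the record layout, and the metric values are all made up.

```python
import json

def record_experiment(log, name, config, metrics):
    """Append one run, with its configuration and final metrics, to the log."""
    log.append({"name": name, "config": config, "metrics": metrics})

def compare_experiments(log, metric="val_loss"):
    """Return runs sorted from best to worst, where lower `metric` is better."""
    return sorted(log, key=lambda e: e["metrics"][metric])

log = []
# Descriptive names encode the architecture and the setting being varied.
record_experiment(
    log, "neoldm-small-lr1e-5",
    {"architecture": "NeoLDM Small", "learning_rate": 1e-5, "gpus": 1},
    {"val_loss": 0.74},
)
record_experiment(
    log, "neoldm-small-lr5e-5",
    {"architecture": "NeoLDM Small", "learning_rate": 5e-5, "gpus": 1},
    {"val_loss": 0.69},
)

ranked = compare_experiments(log)
print(json.dumps([e["name"] for e in ranked]))
```

Because each record is plain JSON-serializable data, the log can be written to a file and compared across sessions.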
Next Steps
- Learn about Benchmark to evaluate models
- Explore Inference Server to deploy models
- Check Datasets to prepare data