Training Usage Guide

Complete step-by-step guide to training models, recommended configurations, and troubleshooting.

Complete Step-by-Step Guide

Training a model involves the following seven steps:

Step 1: Prepare Datasets

Ensure every dataset has the READY status
Verify features and targets are configured
Check dataset health

Step 2: Create Training

Navigate to Training section
Click "New Training"
Enter training name and description

Step 3: Select Datasets

Choose datasets for training
Configure features and targets
Validate dataset selection

Step 4: Configure Data Split

Choose split method (percentage or date range)
Configure training/validation split
Review split configuration
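
The two split methods in Step 4 can be sketched in plain Python. This is an illustration only, not the platform's implementation: the function names `split_by_percentage` and `split_by_date` and the `(date, value)` record layout are assumptions made for the example.

```python
from datetime import date


def split_by_percentage(records, train_fraction=0.8):
    """Put the first train_fraction of records in training, the rest in validation."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def split_by_date(records, cutoff):
    """Records dated before cutoff go to training; the rest go to validation."""
    train = [r for r in records if r[0] < cutoff]
    validation = [r for r in records if r[0] >= cutoff]
    return train, validation


# Hypothetical records: (date, value) pairs, already sorted by date.
records = [(date(2024, 1, day), day * 1.5) for day in range(1, 11)]

train, validation = split_by_percentage(records, train_fraction=0.8)
print("percentage split:", len(train), "training,", len(validation), "validation")

train, validation = split_by_date(records, cutoff=date(2024, 1, 8))
print("date split:", len(train), "training,", len(validation), "validation")
```

For time-ordered data, a date-range split keeps every validation record later than every training record, which avoids evaluating on data that precedes what the model was trained on.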

Step 5: Configure Architecture

Select model architecture (NeoLDM or Transformer)
Configure model size
Customize YAML configuration if needed
Set GPU count

Step 6: Start Training

Review all configurations
Start training job
Monitor training progress

Step 7: Monitor and Evaluate

Monitor training metrics
Track checkpoints
Evaluate model performance
Select best checkpoint
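
One common way to select the best checkpoint is to pick the one with the lowest validation loss. The sketch below assumes that criterion and a hypothetical list of checkpoint records with `step` and `val_loss` fields; the platform may expose checkpoints differently, and the numbers are made up for illustration.

```python
def select_best_checkpoint(checkpoints):
    """Return the checkpoint record with the lowest validation loss."""
    if not checkpoints:
        raise ValueError("no checkpoints to choose from")
    return min(checkpoints, key=lambda c: c["val_loss"])


# Hypothetical checkpoint metadata, for illustration only.
checkpoints = [
    {"step": 1000, "val_loss": 0.92},
    {"step": 2000, "val_loss": 0.71},
    {"step": 3000, "val_loss": 0.64},
    {"step": 4000, "val_loss": 0.69},
]

best = select_best_checkpoint(checkpoints)
print("best checkpoint at step", best["step"])
```

Note that the latest checkpoint is not necessarily the best one: in this example the validation loss rises again after step 3000.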

Recommended Configurations

Best practices for training configuration:

Small Models:

Architecture: NeoLDM Small
GPUs: 1-2 GPUs
Batch Size: 2048-4096
Use Case: Development, testing, small datasets

Medium Models:

Architecture: NeoLDM Medium
GPUs: 2-4 GPUs
Batch Size: 4096-8192
Use Case: Moderate datasets, production training

Large Models:

Architecture: NeoLDM Large
GPUs: 4+ GPUs
Batch Size: 8192+
Use Case: Large datasets, high-performance requirements
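
The recommended ranges above can be encoded as a lookup table and used to sanity-check a planned run. This is a minimal sketch: the names `RECOMMENDED` and `check_config` are hypothetical, and the open-ended ranges ("4+", "8192+") are represented with `None` as the upper bound.

```python
# Recommended ranges from this guide; None means "no upper bound".
RECOMMENDED = {
    "small": {"gpus": (1, 2), "batch_size": (2048, 4096)},
    "medium": {"gpus": (2, 4), "batch_size": (4096, 8192)},
    "large": {"gpus": (4, None), "batch_size": (8192, None)},
}


def in_range(value, bounds):
    """True if value lies within the inclusive (low, high) bounds."""
    low, high = bounds
    return value >= low and (high is None or value <= high)


def check_config(size, gpus, batch_size):
    """Return a list of warnings; an empty list means the run is within range."""
    preset = RECOMMENDED[size]
    warnings = []
    if not in_range(gpus, preset["gpus"]):
        warnings.append(f"{gpus} GPUs is outside the recommended range for {size}")
    if not in_range(batch_size, preset["batch_size"]):
        warnings.append(
            f"batch size {batch_size} is outside the recommended range for {size}"
        )
    return warnings


print(check_config("small", gpus=1, batch_size=2048))
print(check_config("medium", gpus=8, batch_size=1024))
```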

Hyperparameter Guidelines:

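Learning Rate: Start with 1e-5; lower it if the loss diverges or oscillates, and raise it if the loss decreases too slowly
Batch Size: Larger batches give more stable gradient estimates but require more GPU memory
Epochs: Train until validation loss stops improving, and stop if it begins to rise (a sign of overfitting)
Dropout: Enable or increase it for regularization when the model overfits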

Interpreting Metrics

Understanding training metrics:

Loss Metrics:

Training Loss: Should decrease over time
Validation Loss: Should decrease along with training loss
Gap: A large or growing gap between validation loss and training loss indicates overfitting
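
The gap between validation and training loss can be computed per epoch and checked for a widening trend. The helper names `loss_gap` and `gap_is_widening`, the `window` parameter, and the loss values are assumptions made for this sketch.

```python
def loss_gap(train_losses, val_losses):
    """Per-epoch gap: validation loss minus training loss."""
    return [v - t for t, v in zip(train_losses, val_losses)]


def gap_is_widening(train_losses, val_losses, window=3):
    """True if the gap grew in each of the last `window` epoch-to-epoch steps."""
    gaps = loss_gap(train_losses, val_losses)
    if len(gaps) < window + 1:
        return False  # not enough history to judge a trend
    recent = gaps[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical loss curves: training keeps improving, validation turns upward.
train_losses = [1.0, 0.8, 0.6, 0.5, 0.4, 0.3]
val_losses = [1.1, 0.9, 0.75, 0.72, 0.74, 0.78]

print("widening gap:", gap_is_widening(train_losses, val_losses))
```

A widening gap is a trend signal rather than a hard threshold; what counts as a "large" gap depends on the loss function and the dataset.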

Accuracy Metrics:

Training Accuracy: Model performance on training data
Validation Accuracy: Model performance on validation data
Improvement: Both should rise over epochs, with validation accuracy staying close to training accuracy

Other Metrics:

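Learning Rate: Too high a rate makes the loss diverge or oscillate; too low a rate slows convergence
Gradient Norm: Should stay stable; sudden spikes suggest unstable training
Resource Usage: Monitor GPU utilization and memory usage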

Red Flags:

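Validation loss increasing while training loss keeps decreasing (overfitting)
Training loss not converging (oscillating or diverging)
Metrics flat across many epochs
Resource exhaustion (for example, GPU out-of-memory errors)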
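
Two of these red flags, rising validation loss and stalled metrics, can be detected from a per-epoch loss history. The sketch below is illustrative: the function names and the `patience` and `min_delta` parameters are assumptions, and suitable values depend on the run.

```python
def validation_loss_rising(val_losses, patience=3):
    """True if validation loss rose in each of the last `patience` epochs."""
    if len(val_losses) < patience + 1:
        return False
    recent = val_losses[-(patience + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


def has_plateaued(losses, patience=3, min_delta=1e-3):
    """True if the last `patience` epochs did not beat the earlier best loss
    by at least `min_delta`."""
    if len(losses) <= patience:
        return False
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta


# Hypothetical validation loss history.
val_losses = [0.9, 0.7, 0.6, 0.62, 0.66, 0.71]

print("validation loss rising:", validation_loss_rising(val_losses))
print("no recent improvement:", has_plateaued(val_losses))
```

The same plateau check is the basis of early stopping: training halts once the metric has failed to improve for `patience` epochs.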

Common Troubleshooting

Issue: Training Fails to Start

Symptom: Training job fails immediately
Possible Causes:
- Invalid configuration
- Insufficient resources
- Dataset issues
Solutions:
- Review configuration
- Check resource availability
- Verify dataset status

Issue: Training Stalls

Symptom: Training stops making progress
Possible Causes:
- Resource constraints
- Data loading issues
- Network problems
Solutions:
- Check resource usage
- Review data loading
- Check network connectivity

Issue: Overfitting

Symptom: Large gap between training and validation metrics
Possible Causes:
- Model too complex
- Insufficient regularization
- Small dataset
Solutions:
- Increase regularization
- Reduce model complexity
- Use more data

Issue: Underfitting

Symptom: Both training and validation metrics are poor
Possible Causes:
- Model too simple
- Insufficient training
- Poor feature selection
Solutions:
- Increase model complexity
- Train for more epochs
- Improve feature selection
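
The overfitting and underfitting symptoms described above can be turned into a simple rule of thumb. In this sketch, `diagnose_fit`, `target_loss`, and `max_gap` are hypothetical: both thresholds must be chosen per task, and the function only encodes the symptoms, not their causes.

```python
def diagnose_fit(train_loss, val_loss, target_loss, max_gap):
    """Classify a run from its final losses.

    overfitting:  validation loss exceeds training loss by more than max_gap
    underfitting: both losses are still above target_loss
    """
    if val_loss - train_loss > max_gap:
        return "overfitting"
    if train_loss > target_loss and val_loss > target_loss:
        return "underfitting"
    return "ok"


# Hypothetical final losses with illustrative thresholds.
print(diagnose_fit(train_loss=0.10, val_loss=0.60, target_loss=0.5, max_gap=0.2))
print(diagnose_fit(train_loss=0.90, val_loss=0.95, target_loss=0.5, max_gap=0.2))
print(diagnose_fit(train_loss=0.30, val_loss=0.35, target_loss=0.5, max_gap=0.2))
```

The gap check runs first, so a run that is both poor and has a large gap is reported as overfitting; swap the order if the opposite priority suits the workflow better.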

Best Practices

Training Best Practices:

Start with small models for experimentation
Monitor metrics closely during training
Use appropriate data splits
Regularize to prevent overfitting
Track experiments systematically

Resource Management:

Allocate GPUs appropriately
Monitor resource usage
Optimize batch sizes
Use distributed training for large models

Experiment Management:

Use descriptive training names
Document configurations
Record the metrics and outcome of every run
Compare results systematically
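
A lightweight way to document configurations and compare results is to keep one structured record per run. The `ExperimentRecord` fields, the run names, and the numbers below are hypothetical and stand in for whatever the platform actually reports.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ExperimentRecord:
    name: str
    architecture: str
    learning_rate: float
    batch_size: int
    best_val_loss: float


def rank_experiments(records):
    """Sort runs from lowest to highest best validation loss."""
    return sorted(records, key=lambda r: r.best_val_loss)


# Hypothetical runs, for illustration only.
runs = [
    ExperimentRecord("neoldm-small-lr1e-5", "NeoLDM Small", 1e-5, 2048, 0.71),
    ExperimentRecord("neoldm-small-lr3e-5", "NeoLDM Small", 3e-5, 2048, 0.66),
    ExperimentRecord("neoldm-medium-lr1e-5", "NeoLDM Medium", 1e-5, 4096, 0.69),
]

for run in rank_experiments(runs):
    # One JSON object per line is easy to append to a log and compare later.
    print(json.dumps(asdict(run)))
```

Encoding the key settings in the run name, as in these examples, keeps names descriptive and makes results easier to compare at a glance.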

Next Steps

Learn about Benchmark to evaluate models
Explore Inference Server to deploy models
Check Datasets to prepare data
