Training Usage Guide

A complete step-by-step guide to training models, with recommended configurations and troubleshooting tips.

Complete Step-by-Step Guide

Follow these steps to train a model from start to finish:

Step 1: Prepare Datasets

  • Ensure datasets are in the READY state
  • Verify features and targets are configured
  • Check dataset health

Step 2: Create Training

  • Navigate to Training section
  • Click "New Training"
  • Enter training name and description

Step 3: Select Datasets

  • Choose datasets for training
  • Configure features and targets
  • Validate dataset selection

Step 4: Configure Data Split

  • Choose split method (percentage or date range)
  • Configure training/validation split
  • Review split configuration
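As a concrete illustration of the two split methods, here is a minimal Python sketch. The helper names (`percentage_split`, `date_range_split`) are hypothetical, not the platform's API:

```python
from datetime import date

# Hypothetical helpers illustrating the two split methods described above.
# The names and signatures are illustrative, not the platform's actual API.

def percentage_split(rows, train_fraction=0.8):
    """Split rows into (train, validation) by fraction, preserving order."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

def date_range_split(rows, cutoff):
    """Rows dated before `cutoff` go to training, the rest to validation."""
    train = [r for r in rows if r["date"] < cutoff]
    val = [r for r in rows if r["date"] >= cutoff]
    return train, val

rows = [{"date": date(2024, m, 1)} for m in range(1, 11)]  # ten monthly rows

train, val = percentage_split(rows, train_fraction=0.8)
print(len(train), len(val))  # 8 2

train, val = date_range_split(rows, cutoff=date(2024, 7, 1))
print(len(train), len(val))  # 6 4
```

For time-series data, a date-range split avoids leaking future information into the training set.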

Step 5: Configure Architecture

  • Select model architecture (NeoLDM or Transformer)
  • Configure model size
  • Customize YAML configuration if needed
  • Set GPU count
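For reference, an architecture configuration along these lines might look like the YAML below. This is purely illustrative: the actual schema and every field name are assumptions, not the platform's format.

```yaml
# Illustrative only -- field names are assumptions, not the real schema.
architecture: neoldm   # or: transformer
size: medium           # small | medium | large
gpus: 2
```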

Step 6: Start Training

  • Review all configurations
  • Start training job
  • Monitor training progress

Step 7: Monitor and Evaluate

  • Monitor training metrics
  • Track checkpoints
  • Evaluate model performance
  • Select best checkpoint

Recommended Configurations

Best practices for training configuration:

Small Models:

  • Architecture: NeoLDM Small
  • GPUs: 1-2 GPUs
  • Batch Size: 2048-4096
  • Use Case: Development, testing, small datasets

Medium Models:

  • Architecture: NeoLDM Medium
  • GPUs: 2-4 GPUs
  • Batch Size: 4096-8192
  • Use Case: Moderate datasets, production training

Large Models:

  • Architecture: NeoLDM Large
  • GPUs: 4+ GPUs
  • Batch Size: 8192+
  • Use Case: Large datasets, high-performance requirements
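The three tiers above can be summarized as a lookup table. The ranges come straight from this guide; the dictionary name and structure are illustrative:

```python
# Recommended NeoLDM tiers from this guide, as a lookup table.
# `None` means no stated upper bound.
NEOLDM_TIERS = {
    "small":  {"gpus": (1, 2),    "batch_size": (2048, 4096)},
    "medium": {"gpus": (2, 4),    "batch_size": (4096, 8192)},
    "large":  {"gpus": (4, None), "batch_size": (8192, None)},
}

def batch_size_range(tier):
    """Return the (low, high) recommended batch-size range for a tier."""
    return NEOLDM_TIERS[tier]["batch_size"]

print(batch_size_range("medium"))  # (4096, 8192)
```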

Hyperparameter Guidelines:

  • Learning Rate: Start with 1e-5, adjust based on results
  • Batch Size: Larger for more stable gradients
  • Epochs: Monitor for overfitting
  • Dropout: Use for regularization
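A starting configuration following these guidelines might look like the sketch below. The values and the linear learning-rate scaling rule are common heuristics, not platform defaults:

```python
# Illustrative starting hyperparameters following the guidelines above.
# Key names and values are assumptions, not platform defaults.
hyperparams = {
    "learning_rate": 1e-5,  # conservative starting point; adjust from results
    "batch_size": 4096,     # larger batches give smoother gradient estimates
    "max_epochs": 50,       # upper bound; stop early if validation loss rises
    "dropout": 0.1,         # regularization against overfitting
}

def scale_learning_rate(base_lr, base_batch, new_batch):
    """Common linear-scaling rule of thumb: scale lr with batch size."""
    return base_lr * new_batch / base_batch

print(scale_learning_rate(1e-5, 4096, 8192))  # 2e-05
```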

Interpreting Metrics

Understanding training metrics:

Loss Metrics:

  • Training Loss: Should decrease over time
  • Validation Loss: Should track training loss
  • Gap: Large gap indicates overfitting
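The gap check can be sketched in a few lines of Python. The function name and the threshold value are illustrative:

```python
# Hypothetical check for the overfitting signal described above: a widening
# gap between validation and training loss. Threshold is illustrative.

def overfitting_gap(train_losses, val_losses, threshold=0.1):
    """Return the epochs where (val - train) loss gap exceeds `threshold`."""
    return [
        epoch
        for epoch, (t, v) in enumerate(zip(train_losses, val_losses))
        if v - t > threshold
    ]

train = [1.0, 0.7, 0.5, 0.35, 0.25]
val   = [1.0, 0.75, 0.6, 0.55, 0.52]
print(overfitting_gap(train, val))  # [3, 4]
```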

Accuracy Metrics:

  • Training Accuracy: Model performance on training data
  • Validation Accuracy: Model performance on validation data
  • Trend: Both should improve steadily over epochs

Other Metrics:

  • Learning Rate: Should be low enough for stable convergence, high enough to make progress
  • Gradient Norm: Should stay stable; sudden spikes signal training instability
  • Resource Usage: Monitor GPU and memory usage
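Gradient-norm stability is commonly enforced with global-norm clipping. A minimal sketch follows; the function names are illustrative, and clipping is an assumption about a common technique, not necessarily what the platform does internally:

```python
import math

# Minimal sketch of gradient-norm monitoring: compute the global norm and
# rescale the gradients when it spikes past a cap. Names are illustrative.

def global_norm(grads):
    """L2 norm over all gradient components."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale gradients so their global norm never exceeds `max_norm`."""
    norm = global_norm(grads)
    if norm <= max_norm:
        return grads
    scale = max_norm / norm
    return [g * scale for g in grads]

print(global_norm([3.0, 4.0]))               # 5.0
print(clip_by_global_norm([3.0, 4.0], 2.5))  # [1.5, 2.0]
```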

Red Flags:

  • Validation loss increasing (overfitting)
  • Training not converging
  • Metrics not improving
  • Resource exhaustion

Common Troubleshooting

Issue: Training Fails to Start

  • Symptom: Training job fails immediately
  • Possible Causes:
    • Invalid configuration
    • Insufficient resources
    • Dataset issues
  • Solutions:
    • Review configuration
    • Check resource availability
    • Verify dataset status

Issue: Training Stalls

  • Symptom: Training stops making progress
  • Possible Causes:
    • Resource constraints
    • Data loading issues
    • Network problems
  • Solutions:
    • Check resource usage
    • Review data loading
    • Check network connectivity

Issue: Overfitting

  • Symptom: Large gap between training and validation metrics
  • Possible Causes:
    • Model too complex
    • Insufficient regularization
    • Small dataset
  • Solutions:
    • Increase regularization
    • Reduce model complexity
    • Use more data
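One common mitigation, sketched below as an assumption rather than a description of the platform's mechanism, is early stopping on validation loss:

```python
# Early stopping sketch: halt when validation loss has not improved for
# `patience` consecutive epochs. Names and defaults are illustrative.

def early_stop_epoch(val_losses, patience=2):
    """Return the epoch at which to stop, or None to keep training."""
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            since_best = 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None

print(early_stop_epoch([0.9, 0.7, 0.6, 0.62, 0.65, 0.7]))  # 4
```

The checkpoint to keep is then the one from the best-validation-loss epoch, not the final one.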

Issue: Underfitting

  • Symptom: Both training and validation metrics are poor
  • Possible Causes:
    • Model too simple
    • Insufficient training
    • Poor feature selection
  • Solutions:
    • Increase model complexity
    • Train for more epochs
    • Improve feature selection

Best Practices

Training Best Practices:

  • Start with small models for experimentation
  • Monitor metrics closely during training
  • Use appropriate data splits
  • Regularize to prevent overfitting
  • Track experiments systematically

Resource Management:

  • Allocate GPUs appropriately
  • Monitor resource usage
  • Optimize batch sizes
  • Use distributed training for large models

Experiment Management:

  • Use descriptive training names
  • Document configurations
  • Track experiments
  • Compare results systematically

Next Steps