Cluster Usage Guide

Practical guide to viewing cluster information, interpreting metrics, and troubleshooting common issues.

Viewing Cluster Information

To view cluster information:

  1. Navigate to the Clusters section in the NeoSpace platform
  2. View the cluster overview dashboard
  3. Check cluster status, node information, and NCU allocation
  4. Review performance metrics and activity

The cluster dashboard shows:

  • Cluster name and basic information
  • Node type and IP address
  • Current uptime
  • NCU allocation (Total, Busy, Available)
  • NCU utilization and memory charts
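
The dashboard is the primary way to read these values. If your deployment also exposes them over an HTTP API, a short script can pull the same figures; the endpoint, authentication scheme, and JSON field names below are assumptions for illustration, not a documented NeoSpace API:

    # Hypothetical example: URL, auth, and field names are assumptions,
    # not a documented NeoSpace API. Adapt to your deployment.
    import requests

    API_URL = "https://neospace.example.com/api/v1/clusters/my-cluster"  # placeholder
    TOKEN = "YOUR_API_TOKEN"  # placeholder credential

    resp = requests.get(API_URL, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=10)
    resp.raise_for_status()
    cluster = resp.json()

    # Summarize the NCU allocation fields shown on the dashboard.
    ncu = cluster["ncu"]  # assumed field layout
    print(f"{cluster['name']}: {ncu['busy']}/{ncu['total']} NCUs busy, "
          f"{ncu['available']} available")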

Interpreting Charts and Graphs

The cluster dashboard includes several charts and graphs:

NCU Utilization Chart:

  • Shows average NCU utilization over time
  • Helps identify usage patterns
  • Indicates peak usage times
  • Useful for capacity planning

NCU Memory Chart:

  • Shows average memory usage per NCU
  • Helps identify memory-intensive workloads
  • Indicates memory pressure
  • Useful for resource allocation

Performance Metrics Charts:

  • CPU, Disk, RAM, and Network usage over 24 hours
  • Line charts showing trends
  • Helps identify performance issues
  • Useful for troubleshooting

How to Read the Charts:

  • X-axis: Time period (24 hours)
  • Y-axis: Metric value (percentage, bytes, etc.)
  • Lines: Different metrics or nodes
  • Colors: Different categories or states
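
As a concrete illustration of this layout, the sketch below plots a synthetic 24-hour series with the same axes the dashboard charts use; the numbers are invented purely for demonstration:

    # Illustrative only: synthetic 24-hour data plotted with the same
    # layout as the dashboard charts (time on X, metric value on Y).
    import random
    import matplotlib.pyplot as plt

    hours = list(range(24))
    cpu = [40 + random.uniform(-10, 30) for _ in hours]  # fake CPU % samples
    ram = [60 + random.uniform(-5, 15) for _ in hours]   # fake RAM % samples

    plt.plot(hours, cpu, label="CPU %")  # one line (and color) per metric
    plt.plot(hours, ram, label="RAM %")
    plt.xlabel("Time (hours)")           # X-axis: 24-hour period
    plt.ylabel("Utilization (%)")        # Y-axis: metric value
    plt.legend()
    plt.show()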

Common Troubleshooting

Issue: High NCU Utilization

  • Symptom: All or most NCUs are busy
  • Cause: High workload demand
  • Solution:
    • Wait for jobs to complete (a capacity-polling sketch follows this list)
    • Add more nodes to the cluster
    • Optimize job scheduling
    • Consider using smaller models
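
One way to act on the first solution is to have submission scripts poll for free capacity instead of failing outright. A minimal sketch, where get_available_ncus() is a placeholder for however you query availability (for example, the hypothetical API call shown earlier):

    # Minimal back-off loop: poll available NCUs before submitting work.
    import time

    def get_available_ncus() -> int:
        # Placeholder: replace with a real availability query.
        return 8  # stub value so the example runs

    def wait_for_capacity(needed: int, poll_seconds: int = 60) -> None:
        # Block until enough NCUs are free, checking periodically.
        while get_available_ncus() < needed:
            print(f"Fewer than {needed} NCUs free; retrying in {poll_seconds}s")
            time.sleep(poll_seconds)

    wait_for_capacity(needed=4)
    print("Capacity available; submit the job here")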

Issue: Low NCU Utilization

  • Symptom: Many NCUs are available but not being used
  • Cause: Too few jobs scheduled, or workloads too small to fill the cluster
  • Solution:
    • Review job scheduling policies
    • Consolidate smaller jobs
    • Consider reducing cluster size

Issue: High Memory Usage

  • Symptom: High memory utilization across NCUs
  • Cause: Memory-intensive workloads
  • Solution:
    • Optimize model architecture
    • Reduce batch sizes
    • Use gradient checkpointing (see the PyTorch sketch after this list)
    • Add more memory to nodes
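
Two of these mitigations translate directly into PyTorch. Assuming a PyTorch training job (the model below is a placeholder), this sketch reduces the batch size and wraps an expensive block in gradient checkpointing, which trades extra recomputation in the backward pass for lower activation memory:

    # Sketch of two memory mitigations in PyTorch; the model is a placeholder.
    import torch
    import torch.nn as nn
    from torch.utils.checkpoint import checkpoint

    class Net(nn.Module):
        def __init__(self):
            super().__init__()
            self.block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(),
                                       nn.Linear(1024, 1024), nn.ReLU())
            self.head = nn.Linear(1024, 10)

        def forward(self, x):
            # Checkpointing: activations inside `block` are recomputed
            # during backward instead of being stored.
            x = checkpoint(self.block, x, use_reentrant=False)
            return self.head(x)

    model = Net()
    batch_size = 16  # reduced (e.g. from 32) to ease memory pressure
    x = torch.randn(batch_size, 1024, requires_grad=True)
    model(x).sum().backward()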

Issue: Network Latency

  • Symptom: High network latency between nodes (a quick probe to confirm this is sketched after this list)
  • Cause: Network congestion or configuration issues
  • Solution:
    • Check network configuration
    • Optimize data transfer
    • Use data locality optimizations
    • Review network bandwidth
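
To confirm the symptom from a node with shell access, you can time TCP connections to peer nodes. This is a rough probe, not a substitute for proper network monitoring, and the hostnames are placeholders:

    # Rough TCP connect-time probe between nodes; hostnames are placeholders.
    import socket
    import time

    def connect_latency_ms(host: str, port: int = 22, timeout: float = 2.0) -> float:
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=timeout):
            pass  # connection established; we only care about the timing
        return (time.perf_counter() - start) * 1000

    for peer in ["node-01.internal", "node-02.internal"]:  # placeholder hosts
        try:
            print(f"{peer}: {connect_latency_ms(peer):.1f} ms")
        except OSError as err:
            print(f"{peer}: unreachable ({err})")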

Issue: Node Offline

  • Symptom: Node status shows as offline
  • Cause: Hardware failure or network issues
  • Solution:
    • Check node hardware status
    • Verify network connectivity (a reachability sweep is sketched after this list)
    • Review system logs
    • Contact support if persistent
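
For the connectivity check, a quick reachability sweep over the cluster's nodes helps separate network problems from hardware failures. The hostnames are placeholders, and the ping flags below are the Linux ones:

    # Reachability sweep using the system ping (Linux flags; placeholder hosts).
    import subprocess

    NODES = ["node-01.internal", "node-02.internal", "node-03.internal"]

    for node in NODES:
        # One ICMP echo with a 2-second timeout; return code 0 means reachable.
        result = subprocess.run(["ping", "-c", "1", "-W", "2", node],
                                capture_output=True)
        status = "reachable" if result.returncode == 0 else "OFFLINE?"
        print(f"{node}: {status}")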

Issue: Job Failures

  • Symptom: Jobs failing on cluster
  • Cause: Resource constraints or configuration issues
  • Solution:
    • Check resource availability
    • Review job configuration
    • Check logs for error messages (a log-scanning sketch follows this list)
    • Verify dataset and model accessibility
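
For the log check, scanning a job's log for common failure markers is usually the fastest first step. The log path and marker strings below are assumptions to adapt to your setup:

    # Scan a job log for common failure markers; path and markers are placeholders.
    from pathlib import Path

    MARKERS = ("ERROR", "Traceback", "CUDA out of memory", "Killed")
    log_path = Path("/var/log/jobs/job-1234.log")  # placeholder path

    for lineno, line in enumerate(log_path.read_text(errors="replace").splitlines(), 1):
        if any(marker in line for marker in MARKERS):
            print(f"{log_path.name}:{lineno}: {line.strip()}")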

Best Practices

Resource Management

  • Monitor NCU Utilization: Check NCU allocation and utilization regularly to catch saturation early
  • Balance Workloads: Distribute jobs evenly across available NCUs (a toy round-robin sketch follows this list)
  • Plan Capacity: Size the cluster based on expected workloads
  • Optimize Scheduling: Use efficient job scheduling to maximize utilization
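
As a toy illustration of even distribution, round-robin assignment spreads queued jobs across NCUs. Real schedulers also weigh job size and priority, so treat this only as a sketch of the idea:

    # Toy round-robin assignment of queued jobs to NCUs.
    from itertools import cycle

    jobs = ["train-a", "train-b", "eval-c", "train-d", "eval-e"]
    ncus = ["ncu-0", "ncu-1", "ncu-2"]

    assignment = {}
    for job, ncu in zip(jobs, cycle(ncus)):
        assignment.setdefault(ncu, []).append(job)

    for ncu, assigned in assignment.items():
        print(f"{ncu}: {assigned}")  # e.g. ncu-0: ['train-a', 'train-d']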

Performance Optimization

  • Use Distributed Training: Leverage multiple GPUs for faster training (see the PyTorch sketch after this list)
  • Optimize Data Loading: Use efficient data loading and preprocessing
  • Monitor Network: Keep network latency low for distributed training
  • Cache Data: Use data caching to reduce I/O overhead
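
The first two practices map directly onto standard PyTorch primitives. The sketch below, assuming a PyTorch job launched with torchrun, wraps a toy model in DistributedDataParallel and configures a DataLoader with worker processes and pinned memory:

    # Sketch: DistributedDataParallel plus an optimized DataLoader.
    # Launch with: torchrun --nproc_per_node=<num_gpus> train.py
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

    dist.init_process_group("nccl")  # torchrun supplies rank/world-size env vars
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(64, 2).cuda(), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    dataset = TensorDataset(torch.randn(1024, 64), torch.randint(0, 2, (1024,)))
    loader = DataLoader(
        dataset,
        batch_size=32,
        sampler=DistributedSampler(dataset),  # each rank gets a distinct shard
        num_workers=4,                        # parallel preprocessing workers
        pin_memory=True,                      # faster host-to-GPU copies
    )

    for x, y in loader:
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(
            model(x.cuda(non_blocking=True)), y.cuda(non_blocking=True))
        loss.backward()  # DDP averages gradients across ranks here
        opt.step()

    dist.destroy_process_group()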

Monitoring

  • Regular Monitoring: Check cluster metrics and performance on a consistent schedule
  • Set Alerts: Set up alerts for critical metrics (a minimal alert loop is sketched after this list)
  • Review Trends: Review historical trends to identify patterns
  • Capacity Planning: Use metrics for capacity planning
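
A basic version of alerting is a periodic check that compares current metrics against thresholds and notifies on a breach. Here read_metrics() and notify() are placeholders for your metric source and notification channel:

    # Toy threshold-alert loop; metric source and notifier are placeholders.
    import time

    THRESHOLDS = {"ncu_utilization": 0.90, "memory_utilization": 0.85}

    def read_metrics() -> dict:
        # Placeholder: pull these from your monitoring endpoint.
        return {"ncu_utilization": 0.93, "memory_utilization": 0.70}

    def notify(message: str) -> None:
        print(f"ALERT: {message}")  # swap in email/chat/pager integration

    while True:
        metrics = read_metrics()
        for name, limit in THRESHOLDS.items():
            if metrics.get(name, 0.0) > limit:
                notify(f"{name} at {metrics[name]:.0%} exceeds {limit:.0%}")
        time.sleep(300)  # check every 5 minutes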

Next Steps

  • Learn about Inference Servers for deploying models
  • Explore Training to understand how clusters are used for training
  • Check Datasets to see how data flows through the cluster