Cluster Usage Guide
Practical guide to viewing cluster information, interpreting metrics, and troubleshooting common issues.
Viewing Cluster Information
To view cluster information:
- Navigate to the Clusters section in the NeoSpace platform
- View the cluster overview dashboard
- Check cluster status, node information, and NCU allocation
- Review performance metrics and activity
The cluster dashboard shows:
- Cluster name and basic information
- Node type and IP address
- Current uptime
- NCU allocation (Total, Busy, Available)
- NCU utilization and memory charts
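If the platform exposes an HTTP API for cluster information (not confirmed here), the same overview can usually be pulled programmatically. The sketch below is an illustration only: the base URL, endpoint path, token variable, and response fields are assumptions, not documented NeoSpace API.

```python
# Hypothetical sketch: fetching the cluster overview over an HTTP API.
# The base URL, endpoint path, token variable, and response fields are
# assumptions for illustration; consult the NeoSpace API reference for
# the actual interface.
import os
import requests

API_BASE = os.environ.get("NEOSPACE_API_URL", "https://api.neospace.example")
TOKEN = os.environ["NEOSPACE_API_TOKEN"]  # assumed auth token variable

def get_cluster_overview(cluster_id: str) -> dict:
    """Return cluster name, status, uptime, and NCU allocation (assumed schema)."""
    resp = requests.get(
        f"{API_BASE}/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    overview = get_cluster_overview("my-cluster")
    print(overview.get("status"), overview.get("ncu", {}))
```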
Interpreting Charts and Graphs
The cluster dashboard includes several charts and graphs:
NCU Utilization Chart:
- Shows average NCU utilization over time
- Helps identify usage patterns
- Indicates peak usage times
- Useful for capacity planning
NCU Memory Chart:
- Shows average memory usage per NCU
- Helps identify memory-intensive workloads
- Indicates memory pressure
- Useful for resource allocation
Performance Metrics Charts:
- CPU, Disk, RAM, and Network usage over 24 hours
- Line charts showing trends
- Helps identify performance issues
- Useful for troubleshooting
How to Read the Charts:
- X-axis: Time period (24 hours)
- Y-axis: Metric value (percentage, bytes, etc.)
- Lines: Different metrics or nodes
- Colors: Different categories or states
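To turn a chart into numbers for capacity planning, it can help to summarize exported samples directly. A minimal sketch, assuming you have the 24-hour NCU utilization samples as a list of percentages (the values below are placeholders):

```python
# Minimal sketch: summarising 24 hours of NCU utilization samples the way the
# chart does (average and peak). The sample values here are placeholders.
from statistics import mean

utilization_samples = [42.0, 55.5, 61.0, 78.3, 90.1, 83.0, 47.2]  # % per interval

average_util = mean(utilization_samples)
peak_util = max(utilization_samples)

print(f"Average NCU utilization: {average_util:.1f}%")
print(f"Peak NCU utilization:    {peak_util:.1f}%")

# A sustained peak near 100% suggests adding nodes or rescheduling jobs;
# a low average with rare peaks suggests spare capacity.
```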
Common Troubleshooting
Issue: High NCU Utilization
- Symptom: All or most NCUs are busy
- Cause: High workload demand
- Solution:
  - Wait for running jobs to complete
  - Add more nodes to the cluster
  - Optimize job scheduling
  - Consider using smaller models
Issue: Low NCU Utilization
- Symptom: Many NCUs are available but not being used
- Cause: Jobs are not being scheduled onto the cluster, or individual workloads are too small to fill it
- Solution:
  - Review job scheduling policies
  - Consolidate smaller jobs
  - Consider reducing cluster size
Issue: High Memory Usage
- Symptom: High memory utilization across NCUs
- Cause: Memory-intensive workloads
- Solution:
  - Optimize the model architecture
  - Reduce batch sizes
  - Use gradient checkpointing (see the PyTorch sketch after this list)
  - Add more memory to nodes
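A minimal PyTorch sketch of two of the memory-reduction tactics above, a smaller batch size and gradient checkpointing, assuming the job is a torch-based training loop; the model, sizes, and data are placeholders:

```python
# Minimal PyTorch sketch of two memory-reduction tactics: a smaller batch size
# and gradient checkpointing. The model and sizes are placeholders; adapt them
# to the actual training job.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

batch_size = 16  # reduced from e.g. 64 to lower peak activation memory
inputs = torch.randn(batch_size, 1024)
targets = torch.randn(batch_size, 1024)

# checkpoint_sequential recomputes intermediate activations during backward
# instead of storing them, trading extra compute for lower memory use.
outputs = checkpoint_sequential(model, 4, inputs, use_reentrant=False)
loss = nn.functional.mse_loss(outputs, targets)
loss.backward()
optimizer.step()
```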
Issue: Network Latency
- Symptom: High network latency between nodes
- Cause: Network congestion or configuration issues
- Solution:
  - Check network configuration (a simple latency probe is sketched after this list)
  - Optimize data transfer
  - Use data locality optimizations
  - Review network bandwidth
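A quick way to put a number on inter-node latency is to time TCP connections to peer nodes. The sketch below is a rough probe, assuming you know the peer IPs and an open port (both are placeholders); it is not a substitute for proper network diagnostics:

```python
# Simple sketch: measure TCP connect round-trip time from this node to its
# peers to spot abnormal latency. Node addresses and the port are placeholders.
import socket
import time

PEER_NODES = ["10.0.0.11", "10.0.0.12"]  # placeholder peer node IPs
PORT = 22  # any port known to be open on the peers

for host in PEER_NODES:
    start = time.perf_counter()
    try:
        with socket.create_connection((host, PORT), timeout=2):
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{host}: connect RTT ~{elapsed_ms:.1f} ms")
    except OSError as exc:
        print(f"{host}: unreachable ({exc})")
```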
Issue: Node Offline
- Symptom: Node status shows as offline
- Cause: Hardware failure or network issues
- Solution:
  - Check node hardware status
  - Verify network connectivity
  - Review system logs
  - Contact support if the problem persists
Issue: Job Failures
- Symptom: Jobs fail on the cluster
- Cause: Resource constraints or configuration issues
- Solution:
  - Check resource availability
  - Review job configuration
  - Check logs for error messages
  - Verify dataset and model accessibility (see the pre-flight sketch after this list)
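One cheap guard against avoidable failures is a pre-flight check that the paths a job depends on are readable before submission. A minimal sketch with placeholder paths:

```python
# Pre-flight sketch: verify that the dataset and model artifacts a job needs
# are readable before submitting it, so missing paths surface early instead of
# as a job failure. The paths are placeholders.
from pathlib import Path

REQUIRED_PATHS = [
    Path("/data/my-dataset/train"),   # placeholder dataset location
    Path("/models/base-checkpoint"),  # placeholder model checkpoint
]

missing = [p for p in REQUIRED_PATHS if not p.exists()]
if missing:
    raise SystemExit(f"Missing or inaccessible paths: {', '.join(map(str, missing))}")
print("All required paths are accessible; safe to submit the job.")
```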
Best Practices
Resource Management
- Monitor NCU Utilization: Regularly check NCU allocation and utilization
- Balance Workloads: Distribute jobs evenly across available NCUs
- Plan Capacity: Size the cluster for expected workloads (a back-of-the-envelope sketch follows this list)
- Optimize Scheduling: Use efficient job scheduling to maximize utilization
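Capacity planning can start as simple arithmetic: expected NCU-hours divided by the hours a cluster can realistically deliver at your target utilization. A back-of-the-envelope sketch with assumed workload figures:

```python
# Back-of-the-envelope capacity planning sketch: estimate how many NCUs are
# needed to absorb an expected workload. The workload figures and target
# utilization are assumptions to illustrate the arithmetic.
import math

jobs_per_day = 120            # expected jobs submitted per day
ncu_hours_per_job = 2.5       # average NCU-hours each job consumes
target_utilization = 0.70     # leave ~30% headroom for bursts

required_ncu_hours = jobs_per_day * ncu_hours_per_job
ncus_needed = math.ceil(required_ncu_hours / (24 * target_utilization))
print(f"Estimated NCUs needed: {ncus_needed}")
```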
Performance Optimization
- Use Distributed Training: Leverage multiple GPUs for faster training (combined with efficient data loading in the sketch after this list)
- Optimize Data Loading: Use efficient data loading and preprocessing
- Monitor Network: Keep network latency low for distributed training
- Cache Data: Use data caching to reduce I/O overhead
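A minimal PyTorch sketch combining distributed training with an efficient, pinned-memory DataLoader. It assumes the script is launched with torchrun on CUDA-capable nodes; the dataset and model are placeholders:

```python
# Minimal PyTorch sketch: DistributedDataParallel across GPUs plus an efficient
# DataLoader (multiple workers, pinned memory). Assumes launch via
# `torchrun --nproc_per_node=<gpus> train.py`; dataset and model are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataset = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 1))  # placeholder data
    sampler = DistributedSampler(dataset)
    loader = DataLoader(
        dataset,
        batch_size=64,
        sampler=sampler,
        num_workers=4,        # overlap data loading with compute
        pin_memory=True,      # faster host-to-GPU copies
        persistent_workers=True,
    )

    model = DDP(nn.Linear(128, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```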
Monitoring
- Regular Monitoring: Check cluster metrics and performance on a routine schedule
- Set Alerts: Set up alerts for critical metrics (a minimal alert loop is sketched after this list)
- Review Trends: Review historical trends to identify patterns
- Capacity Planning: Use metrics for capacity planning
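If the platform does not provide built-in alerting for the metric you care about (not confirmed here), a simple polling loop can serve as a stopgap. The metrics lookup and notification hook below are placeholders you would need to wire up:

```python
# Sketch of a simple threshold alert on a cluster metric. get_ncu_utilization
# is a stand-in for however you export the dashboard metric (API, CSV,
# Prometheus, ...); the threshold, interval, and notify hook are assumptions.
import time

UTILIZATION_ALERT = 90.0  # percent
CHECK_INTERVAL_S = 300

def get_ncu_utilization() -> float:
    """Placeholder: replace with a real metrics lookup for average NCU utilization."""
    return 0.0  # stub value so the sketch runs as-is

def notify(message: str) -> None:
    """Placeholder notification hook (email, chat webhook, pager, ...)."""
    print(f"ALERT: {message}")

while True:
    util = get_ncu_utilization()
    if util >= UTILIZATION_ALERT:
        notify(f"NCU utilization at {util:.1f}% (threshold {UTILIZATION_ALERT}%)")
    time.sleep(CHECK_INTERVAL_S)
```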
Next Steps
- Learn about Inference Servers for deploying models
- Explore Training to understand how clusters are used for training
- Check Datasets to see how data flows through the cluster