Cluster Usage Guide
Practical guide to viewing cluster information, interpreting metrics, and troubleshooting common issues.
Viewing Cluster Information
To view cluster information:
- Navigate to the Clusters section in the NeoSpace platform
- View the cluster overview dashboard
- Check cluster status, node information, and NCU allocation
- Review performance metrics and activity
The cluster dashboard shows:
- Cluster name and basic information
- Node type and IP address
- Current uptime
- NCU allocation (Total, Busy, Available)
- NCU utilization and memory charts
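If the platform exposes an HTTP API for cluster information (not confirmed here), the same overview can usually be pulled programmatically. The sketch below is an illustration only: the base URL, endpoint path, token variable, and response fields are assumptions, not documented NeoSpace API.

```python
# Hypothetical sketch: fetching the cluster overview over an HTTP API.
# The base URL, endpoint path, token variable, and response fields are
# assumptions for illustration; consult the NeoSpace API reference for
# the actual interface.
import os
import requests

API_BASE = os.environ.get("NEOSPACE_API_URL", "https://api.neospace.example")
TOKEN = os.environ["NEOSPACE_API_TOKEN"]  # assumed auth token variable

def get_cluster_overview(cluster_id: str) -> dict:
    """Return cluster name, status, uptime, and NCU allocation (assumed schema)."""
    resp = requests.get(
        f"{API_BASE}/v1/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    overview = get_cluster_overview("my-cluster")
    print(overview.get("status"), overview.get("ncu", {}))
```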
Interpreting Charts and Graphs
The cluster dashboard includes several charts and graphs:
NCU Utilization Chart:
- Shows average NCU utilization over time
- Helps identify usage patterns
- Indicates peak usage times
- Useful for capacity planning
NCU Memory Chart:
- Shows average memory usage per NCU
- Helps identify memory-intensive workloads
- Indicates memory pressure
- Useful for resource allocation
Performance Metrics Charts:
- CPU, Disk, RAM, and Network usage over 24 hours
- Line charts showing trends
- Helps identify performance issues
- Useful for troubleshooting
How to Read the Charts:
- X-axis: Time period (24 hours)
- Y-axis: Metric value (percentage, bytes, etc.)
- Lines: Different metrics or nodes
- Colors: Different categories or states
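To turn a chart into numbers for capacity planning, it can help to summarize exported samples directly. A minimal sketch, assuming you have the 24-hour NCU utilization samples as a list of percentages (the values below are placeholders):

```python
# Minimal sketch: summarising 24 hours of NCU utilization samples the way the
# chart does (average and peak). The sample values here are placeholders.
from statistics import mean

utilization_samples = [42.0, 55.5, 61.0, 78.3, 90.1, 83.0, 47.2]  # % per interval

average_util = mean(utilization_samples)
peak_util = max(utilization_samples)

print(f"Average NCU utilization: {average_util:.1f}%")
print(f"Peak NCU utilization:    {peak_util:.1f}%")

# A sustained peak near 100% suggests adding nodes or rescheduling jobs;
# a low average with rare peaks suggests spare capacity.
```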
Common Troubleshooting
Issue: High NCU Utilization
- Symptom: All or most NCUs are busy
- Cause: High workload demand
- Solution:
  - Wait for running jobs to complete
  - Add more nodes to the cluster
  - Optimize job scheduling
  - Consider using smaller models
Issue: Low NCU Utilization
- Symptom: Many NCUs are available but not being used
- Cause: Jobs are not being scheduled onto the cluster, or individual workloads are too small to fill it
- Solution:
  - Review job scheduling policies
  - Consolidate smaller jobs
  - Consider reducing cluster size
Issue: High Memory Usage
- Symptom: High memory utilization across NCUs
- Cause: Memory-intensive workloads
- Solution:
  - Optimize the model architecture
  - Reduce batch sizes
  - Use gradient checkpointing (see the PyTorch sketch after this list)
  - Add more memory to nodes
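A minimal PyTorch sketch of two of the memory-reduction tactics above, a smaller batch size and gradient checkpointing, assuming the job is a torch-based training loop; the model, sizes, and data are placeholders:

```python
# Minimal PyTorch sketch of two memory-reduction tactics: a smaller batch size
# and gradient checkpointing. The model and sizes are placeholders; adapt them
# to the actual training job.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

batch_size = 16  # reduced from e.g. 64 to lower peak activation memory
inputs = torch.randn(batch_size, 1024)
targets = torch.randn(batch_size, 1024)

# checkpoint_sequential recomputes intermediate activations during backward
# instead of storing them, trading extra compute for lower memory use.
outputs = checkpoint_sequential(model, 4, inputs, use_reentrant=False)
loss = nn.functional.mse_loss(outputs, targets)
loss.backward()
optimizer.step()
```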
Issue: Network Latency
- Symptom: High network latency between nodes
- Cause: Network congestion or configuration issues
- Solution:
  - Check network configuration (a simple latency probe is sketched after this list)
  - Optimize data transfer
  - Use data locality optimizations
  - Review network bandwidth
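A quick way to put a number on inter-node latency is to time TCP connections to peer nodes. The sketch below is a rough probe, assuming you know the peer IPs and an open port (both are placeholders); it is not a substitute for proper network diagnostics:

```python
# Simple sketch: measure TCP connect round-trip time from this node to its
# peers to spot abnormal latency. Node addresses and the port are placeholders.
import socket
import time

PEER_NODES = ["10.0.0.11", "10.0.0.12"]  # placeholder peer node IPs
PORT = 22  # any port known to be open on the peers

for host in PEER_NODES:
    start = time.perf_counter()
    try:
        with socket.create_connection((host, PORT), timeout=2):
            elapsed_ms = (time.perf_counter() - start) * 1000
            print(f"{host}: connect RTT ~{elapsed_ms:.1f} ms")
    except OSError as exc:
        print(f"{host}: unreachable ({exc})")
```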
Issue: Node Offline
- Symptom: Node status shows as offline
- Cause: Hardware failure or network issues
- Solution:
  - Check node hardware status
  - Verify network connectivity
  - Review system logs
  - Contact support if the problem persists
Issue: Job Failures
- Symptom: Jobs fail on the cluster
- Cause: Resource constraints or configuration issues
- Solution:
  - Check resource availability
  - Review job configuration
  - Check logs for error messages
  - Verify dataset and model accessibility (see the pre-flight sketch after this list)
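One cheap guard against avoidable failures is a pre-flight check that the paths a job depends on are readable before submission. A minimal sketch with placeholder paths:

```python
# Pre-flight sketch: verify that the dataset and model artifacts a job needs
# are readable before submitting it, so missing paths surface early instead of
# as a job failure. The paths are placeholders.
from pathlib import Path

REQUIRED_PATHS = [
    Path("/data/my-dataset/train"),   # placeholder dataset location
    Path("/models/base-checkpoint"),  # placeholder model checkpoint
]

missing = [p for p in REQUIRED_PATHS if not p.exists()]
if missing:
    raise SystemExit(f"Missing or inaccessible paths: {', '.join(map(str, missing))}")
print("All required paths are accessible; safe to submit the job.")
```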
Best Practices
Resource Management
- Monitor NCU Utilization: Regularly check NCU allocation and utilization
- Balance Workloads: Distribute jobs evenly across available NCUs
- Plan Capacity: Size the cluster for expected workloads (a back-of-the-envelope sketch follows this list)
- Optimize Scheduling: Use efficient job scheduling to maximize utilization
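Capacity planning can start as simple arithmetic: expected NCU-hours divided by the hours a cluster can realistically deliver at your target utilization. A back-of-the-envelope sketch with assumed workload figures:

```python
# Back-of-the-envelope capacity planning sketch: estimate how many NCUs are
# needed to absorb an expected workload. The workload figures and target
# utilization are assumptions to illustrate the arithmetic.
import math

jobs_per_day = 120            # expected jobs submitted per day
ncu_hours_per_job = 2.5       # average NCU-hours each job consumes
target_utilization = 0.70     # leave ~30% headroom for bursts

required_ncu_hours = jobs_per_day * ncu_hours_per_job
ncus_needed = math.ceil(required_ncu_hours / (24 * target_utilization))
print(f"Estimated NCUs needed: {ncus_needed}")
```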
Performance Optimization
- Use Distributed Training: Leverage multiple GPUs for faster training (combined with efficient data loading in the sketch after this list)
- Optimize Data Loading: Use efficient data loading and preprocessing
- Monitor Network: Keep network latency low for distributed training
- Cache Data: Use data caching to reduce I/O overhead
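A minimal PyTorch sketch combining distributed training with an efficient, pinned-memory DataLoader. It assumes the script is launched with torchrun on CUDA-capable nodes; the dataset and model are placeholders:

```python
# Minimal PyTorch sketch: DistributedDataParallel across GPUs plus an efficient
# DataLoader (multiple workers, pinned memory). Assumes launch via
# `torchrun --nproc_per_node=<gpus> train.py`; dataset and model are placeholders.
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    dataset = TensorDataset(torch.randn(4096, 128), torch.randn(4096, 1))  # placeholder data
    sampler = DistributedSampler(dataset)
    loader = DataLoader(
        dataset,
        batch_size=64,
        sampler=sampler,
        num_workers=4,        # overlap data loading with compute
        pin_memory=True,      # faster host-to-GPU copies
        persistent_workers=True,
    )

    model = DDP(nn.Linear(128, 1).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle consistently across ranks
        for x, y in loader:
            x = x.cuda(local_rank, non_blocking=True)
            y = y.cuda(local_rank, non_blocking=True)
            loss = nn.functional.mse_loss(model(x), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```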
Monitoring
- Regular Monitoring: Check cluster metrics and performance on a routine schedule
- Set Alerts: Set up alerts for critical metrics (a minimal alert loop is sketched after this list)
- Review Trends: Review historical trends to identify patterns
- Capacity Planning: Use metrics for capacity planning
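If the platform does not provide built-in alerting for the metric you care about (not confirmed here), a simple polling loop can serve as a stopgap. The metrics lookup and notification hook below are placeholders you would need to wire up:

```python
# Sketch of a simple threshold alert on a cluster metric. get_ncu_utilization
# is a stand-in for however you export the dashboard metric (API, CSV,
# Prometheus, ...); the threshold, interval, and notify hook are assumptions.
import time

UTILIZATION_ALERT = 90.0  # percent
CHECK_INTERVAL_S = 300

def get_ncu_utilization() -> float:
    """Placeholder: replace with a real metrics lookup for average NCU utilization."""
    return 0.0  # stub value so the sketch runs as-is

def notify(message: str) -> None:
    """Placeholder notification hook (email, chat webhook, pager, ...)."""
    print(f"ALERT: {message}")

while True:
    util = get_ncu_utilization()
    if util >= UTILIZATION_ALERT:
        notify(f"NCU utilization at {util:.1f}% (threshold {UTILIZATION_ALERT}%)")
    time.sleep(CHECK_INTERVAL_S)
```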
Next Steps
- Learn about Inference Servers for deploying models
- Explore Training to understand how clusters are used for training
- Check Datasets to see how data flows through the cluster