Cluster Monitoring and Metrics
The NeoSpace platform provides comprehensive monitoring and metrics for clusters, enabling you to track performance, identify bottlenecks, and optimize resource utilization.
Performance Metrics (24h)
The Performance Metrics tab provides a 24-hour view of cluster performance across key dimensions:
CPU Usage:
- Real-time CPU utilization across all nodes
- Historical CPU usage trends
- Peak and average CPU usage
- CPU usage by node
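As a rough illustration of how the peak, average, and per-node figures relate to each other, the sketch below summarizes a set of CPU readings. The sample layout and the `summarize_cpu` helper are assumptions made for this example; they are not part of any NeoSpace API.

```python
from statistics import mean

def summarize_cpu(samples):
    """Return per-node and cluster-wide peak/average CPU utilization.

    `samples` maps a node name to a list of CPU utilization readings
    (percent). This layout is hypothetical, chosen only for illustration.
    """
    per_node = {
        node: {"peak": max(values), "average": mean(values)}
        for node, values in samples.items()
        if values  # skip nodes that reported no readings
    }
    all_values = [v for values in samples.values() for v in values]
    cluster = (
        {"peak": max(all_values), "average": mean(all_values)}
        if all_values
        else None
    )
    return per_node, cluster

# Hypothetical readings collected over the reporting window.
samples = {"node-1": [40.0, 60.0, 80.0], "node-2": [10.0, 20.0, 30.0]}
per_node, cluster = summarize_cpu(samples)
```

Here the cluster-wide average is taken over all readings, so nodes that report more samples weigh more heavily; averaging the per-node averages instead would be an equally reasonable choice.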
Disk Usage:
- Storage utilization across the cluster
- Disk I/O rates
- Storage capacity and availability
- Disk usage trends
Memory (RAM) Usage:
- Memory utilization across nodes
- Memory allocation per job
- Memory pressure indicators
- Memory usage trends
Network Usage:
- Network throughput
- Bandwidth utilization
- Network latency
- Data transfer rates
Compute Activity
The Compute Activity tab shows detailed information about active jobs running on the cluster:
Busy NCUs:
- List of all NCUs currently executing jobs
- Job information (training name, job ID)
- Job type (Pre-training, Post-training, Inference)
- Resource utilization per NCU (CPU, Disk, RAM, Network)
- Estimated time remaining
Available NCUs:
- List of NCUs ready for new jobs
- NCU specifications and capabilities
- Resource availability
Job Information:
- Job ID: Unique identifier for the job
- Training Name: Name of the training run or model
- Job Type: Type of job (Pre-training, Post-training, Inference)
- Status: Current job status
- Resource Usage: CPU, memory, disk, and network utilization
- Time Remaining: Estimated time to completion
Network Monitoring
The Network tab provides comprehensive network monitoring and diagnostics:
Network Overview:
- Latency: Average network latency between nodes
- Packet Loss: Percentage of lost packets
- Active Connections: Number of active network connections
- Current Throughput: Current network bandwidth usage
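The packet loss and latency figures above can be read as simple ratios and averages. The following is a minimal sketch, assuming you have raw packet counters and node-to-node latency measurements available; the function names and input shapes are hypothetical and do not reflect how the platform computes these values internally.

```python
from statistics import mean

def packet_loss_percent(packets_sent, packets_received):
    """Percentage of sent packets that never arrived."""
    if packets_sent <= 0:
        return 0.0  # nothing sent, so nothing could be lost
    lost = max(packets_sent - packets_received, 0)
    return 100.0 * lost / packets_sent

def average_latency_ms(pair_latencies_ms):
    """Mean latency over node-to-node measurements, in milliseconds."""
    return mean(pair_latencies_ms.values())

# Hypothetical measurements between node pairs.
latencies = {("node-1", "node-2"): 0.4, ("node-1", "node-3"): 0.6}
loss = packet_loss_percent(packets_sent=1000, packets_received=990)
avg_latency = average_latency_ms(latencies)
```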
Cluster Network:
- Cluster IP: Primary IP address of the cluster
- Node Count: Number of nodes in the cluster
- Status: Overall cluster network status
- Network Health: Health indicator (Good, Warning, Critical)
Bandwidth Usage:
- Current: Current bandwidth utilization
- Average: Average bandwidth over time
- Peak: Peak bandwidth usage
Network Nodes:
- List of all nodes in the cluster
- IP addresses of each node
- Node network status
Network Charts:
- Network throughput over time
- Disk usage trends
- Network performance trends
Uptime and Availability
Track cluster availability and uptime:
Uptime Metrics:
- Current Uptime: How long the cluster has been running
- Availability: Percentage of time the cluster is available
- Downtime Events: History of downtime incidents
- MTTR: Mean Time To Recovery, the average time taken to restore service after a downtime incident
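To make the relationship between downtime events, availability, and MTTR concrete, here is a small sketch. It assumes availability is the share of an observation window not covered by downtime and that MTTR is the mean duration of the recorded incidents; the platform's exact definitions may differ, and `availability_and_mttr` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def availability_and_mttr(window_start, window_end, downtime_events):
    """Compute availability (percent) and MTTR for an observation window.

    `downtime_events` is a list of (start, end) datetime pairs that are
    assumed to be non-overlapping and to lie inside the window.
    """
    window_seconds = (window_end - window_start).total_seconds()
    if window_seconds <= 0:
        raise ValueError("window_end must be after window_start")

    durations = [(end - start).total_seconds() for start, end in downtime_events]
    downtime_seconds = sum(durations)

    availability = 100.0 * (window_seconds - downtime_seconds) / window_seconds
    # MTTR is undefined when there were no incidents in the window.
    mttr = timedelta(seconds=downtime_seconds / len(durations)) if durations else None
    return availability, mttr

# Hypothetical 24-hour window with two incidents (30 and 6 minutes).
start = datetime(2024, 1, 1, 0, 0)
end = start + timedelta(hours=24)
events = [
    (start + timedelta(hours=2), start + timedelta(hours=2, minutes=30)),
    (start + timedelta(hours=10), start + timedelta(hours=10, minutes=6)),
]
availability, mttr = availability_and_mttr(start, end, events)
```

With these inputs, 36 minutes of downtime in a 24-hour window gives 97.5% availability and an MTTR of 18 minutes.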
Availability Monitoring:
- Real-time cluster status
- Node health status
- Service availability
- Automatic failover events
Cluster Information:
- Cluster Name: Name of the cluster
- Node Type: Type of nodes in the cluster
- IP Address: Cluster IP address
- Uptime: Current uptime duration
Interpreting Metrics
Understanding cluster metrics helps you optimize performance:
Key Metrics to Monitor:
- NCU Utilization: Should be balanced across all NCUs
- Memory Usage: Monitor for memory pressure
- Network Bandwidth: Ensure sufficient bandwidth for distributed training
- Disk I/O: Monitor for storage bottlenecks
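One way to quantify whether utilization is balanced across NCUs is to look at the spread between the busiest and the idlest NCU, together with the coefficient of variation. This is a sketch under the assumption that you have one utilization percentage per NCU; the `utilization_imbalance` helper and its input format are hypothetical.

```python
from statistics import mean, pstdev

def utilization_imbalance(utilization):
    """Summarize how evenly load is spread across NCUs.

    `utilization` maps an NCU name to its utilization in percent.
    Returns the average, the max-min spread, and the coefficient of
    variation (population standard deviation divided by the mean).
    """
    values = list(utilization.values())
    if not values:
        raise ValueError("at least one NCU reading is required")
    avg = mean(values)
    spread = max(values) - min(values)
    cv = pstdev(values) / avg if avg > 0 else 0.0
    return {"average": avg, "spread": spread, "cv": cv}

# Hypothetical readings: one NCU is far busier than the other.
report = utilization_imbalance({"ncu-0": 90.0, "ncu-1": 30.0})
```

A spread and coefficient of variation near zero suggest an even distribution; larger values suggest that some NCUs carry most of the load.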
Performance Indicators:
- High Utilization: May indicate a need for more resources
- Low Utilization: May indicate over-provisioning
- Network Latency: High latency can slow distributed training
- Memory Pressure: Can cause job failures or slowdowns
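The high and low utilization indicators can be turned into a simple rule of thumb. The sketch below labels an average utilization value; the 85% and 30% thresholds are illustrative assumptions, not values defined by the platform, and should be tuned to your own workloads.

```python
def classify_utilization(average_percent, high=85.0, low=30.0):
    """Label an average utilization value.

    The default thresholds are hypothetical and only serve as an example.
    """
    if average_percent >= high:
        return "high"    # may indicate a need for more resources
    if average_percent <= low:
        return "low"     # may indicate over-provisioning
    return "normal"

labels = [classify_utilization(v) for v in (92.0, 55.0, 12.0)]
```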
Optimization Tips:
- Balance workloads across NCUs
- Monitor resource utilization trends
- Identify and address bottlenecks
- Plan capacity based on usage patterns
Next Steps
- Check the Usage Guide for practical examples
- Review Core Concepts to understand cluster architecture