Cluster Monitoring and Metrics

The NeoSpace platform provides monitoring and metrics for clusters so that you can track performance, identify bottlenecks, and optimize resource utilization.

Performance Metrics (24h)

The Performance Metrics tab provides a 24-hour view of cluster performance across key dimensions:

CPU Usage:

Real-time CPU utilization across all nodes
Historical CPU usage trends
Peak and average CPU usage
CPU usage by node
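
To illustrate how the peak and average figures relate to per-node readings, here is a minimal sketch. The node names, the sample values, and the `summarize_cpu` helper are hypothetical; how the platform itself aggregates samples is not documented here.

```python
from statistics import mean


def summarize_cpu(samples_by_node):
    """Return (per_node, cluster) peak/average CPU utilization in percent.

    samples_by_node maps a node name to a non-empty list of utilization
    samples (0-100). This is an illustrative helper, not a platform API.
    """
    per_node = {
        node: {"peak": max(samples), "average": mean(samples)}
        for node, samples in samples_by_node.items()
    }
    # Cluster-wide figures pool every sample from every node.
    all_samples = [s for samples in samples_by_node.values() for s in samples]
    cluster = {"peak": max(all_samples), "average": mean(all_samples)}
    return per_node, cluster


# Hypothetical readings for two nodes.
samples = {"node-a": [40.0, 60.0, 80.0], "node-b": [20.0, 30.0, 40.0]}
per_node, cluster = summarize_cpu(samples)
```

With these sample values, node-a peaks at 80% and averages 60%, while the cluster as a whole averages 45%. Comparing each node against the cluster average is one way to spot a node that is carrying more than its share.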

Disk Usage:

Storage utilization across the cluster
Disk I/O rates
Storage capacity and availability
Disk usage trends

Memory (RAM) Usage:

Memory utilization across nodes
Memory allocation per job
Memory pressure indicators
Memory usage trends

Network Usage:

Network throughput
Bandwidth utilization
Network latency
Data transfer rates

Compute Activity

The Compute Activity tab shows detailed information about active jobs running on the cluster:

Busy NCUs:

List of all NCUs currently executing jobs
Job information (training name, job ID)
Job type (Pre-training, Post-training, Inference)
Resource utilization per NCU (CPU, Disk, RAM, Network)
Estimated time remaining

Available NCUs:

List of NCUs ready for new jobs
NCU specifications and capabilities
Resource availability

Job Information:

Job ID: Unique identifier for the job
Training Name: Name of the training or model
Job Type: Type of job (Pre-training, Post-training, Inference)
Status: Current job status
Resource Usage: CPU, memory, disk, and network utilization
Time Remaining: Estimated time to completion

Network Monitoring

The Network tab provides comprehensive network monitoring and diagnostics:

Network Overview:

Latency: Average network latency between nodes
Packet Loss: Percentage of lost packets
Active Connections: Number of active network connections
Current Throughput: Current network bandwidth usage

Cluster Network:

Cluster IP: Primary IP address of the cluster
Node Count: Number of nodes in the cluster
Status: Overall cluster network status
Network Health: Health indicator (Good, Warning, Critical)

Bandwidth Usage:

Current: Current bandwidth utilization
Average: Average bandwidth over time
Peak: Peak bandwidth usage

Network Nodes:

List of all nodes in the cluster
IP addresses of each node
Node network status

Network Charts:

Network throughput over time
Disk usage trends
Network performance trends

Uptime and Availability

Track cluster availability and uptime:

Uptime Metrics:

Current Uptime: How long the cluster has been running
Availability: Percentage of time the cluster is available
Downtime Events: History of downtime incidents
MTTR: Mean Time To Recovery, the average time taken to restore the cluster after a downtime event
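
The relationship between these metrics can be sketched as follows. This assumes the common definitions (availability as the share of a time window without downtime, MTTR as the average length of a downtime event); the platform's exact calculation is not stated here, and the `availability_and_mttr` helper and the sample events are hypothetical.

```python
from datetime import datetime, timedelta


def availability_and_mttr(window_start, window_end, downtime_events):
    """Return (availability_percent, mttr) for a time window.

    downtime_events is a list of (start, end) datetime pairs that fall
    inside the window and do not overlap. Illustrative helper only.
    """
    window = (window_end - window_start).total_seconds()
    if window <= 0:
        raise ValueError("window_end must be later than window_start")
    durations = [(end - start).total_seconds() for start, end in downtime_events]
    downtime = sum(durations)
    availability = 100.0 * (window - downtime) / window
    # With no downtime events there is nothing to recover from.
    mttr = timedelta(seconds=downtime / len(durations)) if durations else timedelta(0)
    return availability, mttr


# Hypothetical 24-hour window with two outages (30 minutes and 6 minutes).
start = datetime(2024, 1, 1)
end = start + timedelta(hours=24)
events = [
    (start + timedelta(hours=3), start + timedelta(hours=3, minutes=30)),
    (start + timedelta(hours=15), start + timedelta(hours=15, minutes=6)),
]
availability, mttr = availability_and_mttr(start, end, events)
```

For this example, 36 minutes of downtime in 24 hours gives 97.5% availability, and the two outages average out to an MTTR of 18 minutes.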

Availability Monitoring:

Real-time cluster status
Node health status
Service availability
Automatic failover events

Cluster Information:

Cluster Name: Name of the cluster
Node Type: Type of nodes in the cluster
IP Address: Cluster IP address
Uptime: Current uptime duration

Interpreting Metrics

Understanding cluster metrics helps you optimize performance:

Key Metrics to Monitor:

NCU Utilization: Should be balanced across all NCUs
Memory Usage: Monitor for memory pressure
Network Bandwidth: Ensure sufficient bandwidth for distributed training
Disk I/O: Monitor for storage bottlenecks
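
One way to put a number on "balanced across all NCUs" is to measure how far utilization readings spread. The sketch below is an assumption about how you might do this yourself from exported readings; the `utilization_imbalance` helper, the NCU names, and the values are hypothetical.

```python
from statistics import mean, pstdev


def utilization_imbalance(utilization_by_ncu):
    """Return (spread, cv) for per-NCU utilization percentages.

    spread is the gap between the busiest and idlest NCU, in percentage
    points. cv is the coefficient of variation (standard deviation divided
    by the mean); 0.0 means perfectly balanced. Illustrative helper only.
    """
    values = list(utilization_by_ncu.values())
    avg = mean(values)
    spread = max(values) - min(values)
    cv = pstdev(values) / avg if avg else 0.0
    return spread, cv


# Hypothetical readings: two busy NCUs and one that is nearly idle.
skewed = {"ncu-0": 90.0, "ncu-1": 85.0, "ncu-2": 20.0}
balanced = {"ncu-0": 50.0, "ncu-1": 50.0, "ncu-2": 50.0}

skewed_spread, skewed_cv = utilization_imbalance(skewed)
balanced_spread, balanced_cv = utilization_imbalance(balanced)
```

The skewed example has a spread of 70 percentage points, while the balanced one has a spread and coefficient of variation of zero. What counts as an acceptable spread depends on your workloads.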

Performance Indicators:

High Utilization: May indicate a need for more resources
Low Utilization: May indicate over-provisioning
Network Latency: High latency can slow distributed training
Memory Pressure: Can cause job failures or slowdowns
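
The high and low utilization indicators above can be expressed as a simple classification. The thresholds used here (85% and 20%) are illustrative assumptions, not values defined by the platform, and `classify_utilization` is a hypothetical helper.

```python
def classify_utilization(percent, high=85.0, low=20.0):
    """Label a utilization percentage as "high", "low", or "normal".

    The default thresholds are assumptions for illustration; tune them to
    your own workloads.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must be between 0 and 100")
    if percent >= high:
        return "high"    # may indicate a need for more resources
    if percent <= low:
        return "low"     # may indicate over-provisioning
    return "normal"


# Hypothetical average utilization figures for three resources.
readings = {"cpu": 92.0, "memory": 55.0, "disk": 12.0}
labels = {name: classify_utilization(value) for name, value in readings.items()}
```

Applying the labels to averages over a longer period, rather than to single readings, helps avoid reacting to short spikes.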

Optimization Tips:

Balance workloads across NCUs
Monitor resource utilization trends
Identify and address bottlenecks
Plan capacity based on usage patterns

Next Steps

Check the Usage Guide for practical examples
Review Core Concepts to understand cluster architecture
