Cluster Monitoring and Metrics

The NeoSpace platform provides monitoring and metrics for clusters so that you can track performance, identify bottlenecks, and optimize resource utilization.

Performance Metrics (24h)

The Performance Metrics tab provides a 24-hour view of cluster performance across key dimensions:

CPU Usage:

  • Real-time CPU utilization across all nodes
  • Historical CPU usage trends
  • Peak and average CPU usage
  • CPU usage by node
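The peak, average, and per-node figures above can all be derived from raw utilization samples. The following is a minimal sketch, assuming the samples are available as a mapping from node name to a list of CPU percentages. The `summarize_cpu` function and its input format are illustrative assumptions, not part of the NeoSpace platform.

```python
from statistics import mean


def summarize_cpu(samples):
    """Return per-node and cluster-wide average/peak CPU utilization.

    `samples` maps a node name to a list of CPU utilization percentages.
    This input format is hypothetical, not a NeoSpace API.
    """
    per_node = {
        node: {"average": mean(values), "peak": max(values)}
        for node, values in samples.items()
        if values  # skip nodes that reported no samples
    }
    # Flatten all samples to get the cluster-wide view.
    all_values = [v for values in samples.values() for v in values]
    cluster = (
        {"average": mean(all_values), "peak": max(all_values)}
        if all_values
        else None
    )
    return per_node, cluster
```

For example, `summarize_cpu({"node-a": [20.0, 40.0], "node-b": [60.0, 80.0]})` reports an average of 30.0 for `node-a` and a cluster-wide peak of 80.0.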

Disk Usage:

  • Storage utilization across the cluster
  • Disk I/O rates
  • Storage capacity and availability
  • Disk usage trends

Memory (RAM) Usage:

  • Memory utilization across nodes
  • Memory allocation per job
  • Memory pressure indicators
  • Memory usage trends

Network Usage:

  • Network throughput
  • Bandwidth utilization
  • Network latency
  • Data transfer rates

Compute Activity

The Compute Activity tab shows detailed information about active jobs running on the cluster:

Busy NCUs:

  • List of all NCUs currently executing jobs
  • Job information (training name, job ID)
  • Job type (Pre-training, Post-training, Inference)
  • Resource utilization per NCU (CPU, Disk, RAM, Network)
  • Estimated time remaining

Available NCUs:

  • List of NCUs ready for new jobs
  • NCU specifications and capabilities
  • Resource availability

Job Information:

  • Job ID: Unique identifier for the job
  • Training Name: Name of the training or model
  • Job Type: Type of job (Pre-training, Post-training, Inference)
  • Status: Current job status
  • Resource Usage: CPU, memory, disk, and network utilization
  • Time Remaining: Estimated time to completion
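If you export or script against this information, the fields above map naturally onto a small record type. The sketch below is one possible representation. The class names, field names, and units are assumptions made for illustration and do not describe an actual NeoSpace API; only the three job types come from the list above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class JobType(Enum):
    # The three job types listed in the Compute Activity tab.
    PRE_TRAINING = "Pre-training"
    POST_TRAINING = "Post-training"
    INFERENCE = "Inference"


@dataclass
class JobInfo:
    # Hypothetical field names and units, chosen for this example only.
    job_id: str
    training_name: str
    job_type: JobType
    status: str  # possible status values are not specified here
    cpu_percent: float
    memory_percent: float
    disk_percent: float
    network_mbps: float
    seconds_remaining: Optional[float] = None  # None if no estimate exists


def jobs_by_type(jobs, job_type):
    """Return only the jobs of the given type."""
    return [job for job in jobs if job.job_type is job_type]
```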

Network Monitoring

The Network tab provides comprehensive network monitoring and diagnostics:

Network Overview:

  • Latency: Average network latency between nodes
  • Packet Loss: Percentage of lost packets
  • Active Connections: Number of active network connections
  • Current Throughput: Current network bandwidth usage
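As a worked illustration of how two of these values are typically computed, the sketch below derives a packet loss percentage from sent and received counts, and an average throughput from two readings of a byte counter. The function names and inputs are hypothetical; how the platform itself collects these figures is not described here.

```python
def packet_loss_percent(packets_sent, packets_received):
    """Percentage of sent packets that were not received."""
    if packets_sent <= 0:
        return 0.0  # nothing was sent, so nothing could be lost
    lost = max(packets_sent - packets_received, 0)
    return 100.0 * lost / packets_sent


def throughput_mbps(bytes_start, bytes_end, seconds):
    """Average throughput in megabits per second between two counter readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    bits = (bytes_end - bytes_start) * 8
    return bits / seconds / 1_000_000
```

For example, 990 packets received out of 1,000 sent is a 1.0% loss, and 125,000,000 bytes transferred in 10 seconds is an average of 100.0 Mbps.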

Cluster Network:

  • Cluster IP: Primary IP address of the cluster
  • Node Count: Number of nodes in the cluster
  • Status: Overall cluster network status
  • Network Health: Health indicator (Good, Warning, Critical)

Bandwidth Usage:

  • Current: Current bandwidth utilization
  • Average: Average bandwidth over time
  • Peak: Peak bandwidth usage

Network Nodes:

  • List of all nodes in the cluster
  • IP addresses of each node
  • Node network status

Network Charts:

  • Network throughput over time
  • Disk usage trends
  • Network performance trends

Uptime and Availability

Track cluster availability and uptime:

Uptime Metrics:

  • Current Uptime: How long the cluster has been running
  • Availability: Percentage of time the cluster is available
  • Downtime Events: History of downtime incidents
  • MTTR: Mean Time To Recovery
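Availability and MTTR follow directly from the downtime history. The sketch below uses the standard definitions: availability is the share of an observation window during which the cluster was up, and MTTR is the total downtime divided by the number of incidents. The function names and the input format (a list of incident durations in seconds) are assumptions for illustration.

```python
def availability_percent(window_seconds, downtime_durations):
    """Percentage of the observation window during which the cluster was up."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    downtime = sum(downtime_durations)
    return 100.0 * (window_seconds - downtime) / window_seconds


def mttr_seconds(downtime_durations):
    """Mean Time To Recovery: average duration of a downtime incident."""
    if not downtime_durations:
        return None  # no incidents, so MTTR is undefined
    return sum(downtime_durations) / len(downtime_durations)
```

For example, two incidents of 10 and 30 seconds within a 1,000-second window give an availability of 96.0% and an MTTR of 20.0 seconds.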

Availability Monitoring:

  • Real-time cluster status
  • Node health status
  • Service availability
  • Automatic failover events

Cluster Information:

  • Cluster Name: Name of the cluster
  • Node Type: Type of nodes in the cluster
  • IP Address: Cluster IP address
  • Uptime: Current uptime duration

Interpreting Metrics

Understanding cluster metrics helps you optimize performance:

Key Metrics to Monitor:

  • NCU Utilization: Should be balanced across all NCUs
  • Memory Usage: Monitor for memory pressure
  • Network Bandwidth: Ensure sufficient bandwidth for distributed training
  • Disk I/O: Monitor for storage bottlenecks
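One simple way to check whether NCU utilization is balanced is to look at the gap between the busiest and the least busy NCU. The sketch below does this; the 20-percentage-point tolerance is an arbitrary assumption chosen for the example, not a NeoSpace recommendation, and the function names are hypothetical.

```python
def utilization_spread(ncu_utilization):
    """Gap, in percentage points, between the busiest and least busy NCU."""
    values = list(ncu_utilization.values())
    return max(values) - min(values) if values else 0.0


def is_balanced(ncu_utilization, max_spread=20.0):
    """True if the spread stays within the (assumed) tolerance."""
    return utilization_spread(ncu_utilization) <= max_spread
```

For example, `{"ncu-0": 90.0, "ncu-1": 85.0, "ncu-2": 30.0}` has a spread of 60.0 points, so `is_balanced` returns `False`, which suggests that work could be shifted toward `ncu-2`.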

Performance Indicators:

  • High Utilization: May indicate a need for more resources
  • Low Utilization: May indicate over-provisioning
  • Network Latency: High latency can slow distributed training
  • Memory Pressure: Can cause job failures or slowdowns
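The high and low utilization indicators above can be turned into a simple check. The sketch below averages a series of samples before classifying them, so that a short spike does not trigger a "high" result. The 30% and 85% thresholds are assumptions chosen for illustration; appropriate values depend on your workloads.

```python
from statistics import mean


def classify_utilization(samples, low=30.0, high=85.0):
    """Classify average utilization as "low", "normal", or "high".

    `samples` is a list of utilization percentages. The default
    thresholds are illustrative assumptions, not platform defaults.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    average = mean(samples)
    if average >= high:
        return "high"  # may indicate a need for more resources
    if average <= low:
        return "low"  # may indicate over-provisioning
    return "normal"
```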

Optimization Tips:

  • Balance workloads across NCUs
  • Monitor resource utilization trends
  • Identify and address bottlenecks
  • Plan capacity based on usage patterns

Next Steps