Cluster Core Concepts

Understanding the fundamental concepts of NeoSpace clusters is essential for effectively using the platform.

Cluster Architecture

A NeoSpace cluster consists of the following components:

  • Leader Node: Coordinates cluster operations and manages resources
  • Worker Nodes: Execute training and inference jobs
  • Network Infrastructure: High-bandwidth networking connecting all nodes
  • Storage: Shared storage for models, datasets, and checkpoints

Architecture Benefits:

  • Fault Tolerance: Automatic failover if a node fails
  • Load Balancing: Distribute workloads across available nodes
  • Resource Management: Efficient allocation of GPU resources
  • Network Optimization: Optimized data transfer between nodes

Node Types

Clusters support several node types, each optimized for a particular class of workload:

High-Performance GPU Nodes:

  • Latest generation GPUs (e.g., NVIDIA A100, H100)
  • High memory capacity for large models
  • Optimized for training workloads
  • Support for distributed training

Node Configuration:

  • CPU: High-core count CPUs for data preprocessing
  • Memory: Large RAM capacity for data caching
  • Storage: Fast NVMe storage for dataset access
  • Network: High-bandwidth network interfaces

Node Status:

  • Ready: Node is available and ready for jobs
  • Busy: Node is currently executing a job
  • Maintenance: Node is under maintenance
  • Offline: Node is unavailable
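To make these states concrete, here is a minimal Python sketch. The `NodeStatus` enum and `schedulable` helper are illustrative only, not part of the NeoSpace API; the state names mirror the list above.

```python
from enum import Enum

class NodeStatus(Enum):
    """Lifecycle states of a cluster node (mirrors the list above)."""
    READY = "ready"              # available and ready for jobs
    BUSY = "busy"                # currently executing a job
    MAINTENANCE = "maintenance"  # temporarily out of rotation
    OFFLINE = "offline"          # unavailable

def schedulable(status: NodeStatus) -> bool:
    """Only READY nodes can accept new jobs."""
    return status is NodeStatus.READY
```

A scheduler would typically filter the node pool with a check like `schedulable(...)` before placing a job.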

NCUs (NeoSpace Compute Units)

NCUs are the fundamental units of compute within a cluster. Each NCU represents a fixed slice of computational capacity that can be allocated to a training or inference job.

NCU Characteristics:

  • Resource Allocation: Each NCU provides a specific amount of GPU, CPU, and memory
  • Isolation: NCUs provide resource isolation between different jobs
  • Scalability: Allocate multiple NCUs for larger workloads
  • Monitoring: Track NCU utilization and performance

NCU States:

  • Available: NCU is free and ready to accept jobs
  • Busy: NCU is currently executing a job
  • Reserved: NCU is reserved for a specific job or project
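The following sketch shows how these states might drive allocation. `NCU` and `reserve_ncus` are hypothetical names for illustration; the real platform's allocation API may look quite different.

```python
from dataclasses import dataclass

@dataclass
class NCU:
    ncu_id: int
    state: str = "available"   # available | busy | reserved (see states above)

def reserve_ncus(ncus, count, job_id):
    """Reserve `count` available NCUs for `job_id` and return their IDs.

    Raises if the pool cannot satisfy the request.
    """
    free = [n for n in ncus if n.state == "available"]
    if len(free) < count:
        raise RuntimeError(
            f"job {job_id}: {count} NCUs requested, only {len(free)} available"
        )
    chosen = free[:count]
    for n in chosen:
        n.state = "reserved"
    return [n.ncu_id for n in chosen]

pool = [NCU(i) for i in range(4)]
reserved = reserve_ncus(pool, 2, job_id="train-01")
```

Reserving moves an NCU out of the `available` pool, which is what gives jobs the resource isolation described above.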

NCU Metrics:

  • Total NCUs: Total number of NCUs in the cluster
  • Busy NCUs: NCUs currently executing jobs
  • Available NCUs: NCUs ready for new jobs
  • Utilization: Average utilization across all NCUs
  • Memory Usage: Memory utilization per NCU
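The count-based metrics above can be derived directly from per-NCU states. This is a minimal sketch of that aggregation, assuming states are reported as simple strings; it is not a NeoSpace API.

```python
def ncu_metrics(states):
    """Summarize cluster-wide NCU counts from a list of per-NCU states."""
    total = len(states)
    busy = sum(s == "busy" for s in states)
    return {
        "total": total,
        "busy": busy,
        "available": sum(s == "available" for s in states),
        # fraction of NCUs actively executing jobs
        "utilization": busy / total if total else 0.0,
    }

summary = ncu_metrics(["busy", "busy", "available", "reserved"])
# -> 4 total, 2 busy, 1 available, utilization 0.5
```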

Scalability and Performance

Clusters are designed for horizontal scalability:

Scaling Capabilities:

  • Add Nodes: Add new nodes to increase cluster capacity
  • Remove Nodes: Remove nodes during low-demand periods
  • Auto-Scaling: Automatic scaling based on workload (future feature)
  • Resource Pooling: Share resources across multiple projects
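Horizontal scaling means total capacity is just the sum over current nodes. The toy `Cluster` class below illustrates that idea; the method names are hypothetical, not the platform's actual API.

```python
class Cluster:
    """Toy cluster whose capacity grows and shrinks as nodes join or leave."""

    def __init__(self):
        self.nodes = {}   # node_id -> NCU capacity of that node

    def add_node(self, node_id, ncus):
        self.nodes[node_id] = ncus

    def remove_node(self, node_id):
        self.nodes.pop(node_id, None)

    @property
    def total_ncus(self):
        return sum(self.nodes.values())

cluster = Cluster()
cluster.add_node("worker-1", 8)
cluster.add_node("worker-2", 8)
cluster.remove_node("worker-1")   # scale down during low demand
```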

Performance Optimizations:

  • Distributed Training: Train models across multiple GPUs and nodes
  • Data Parallelism: Split data across multiple workers
  • Model Parallelism: Split large models across multiple GPUs
  • Gradient Synchronization: Efficient gradient aggregation
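Data parallelism and gradient synchronization can be illustrated without any framework: each worker computes a gradient on its shard of the batch, and averaging the per-worker gradients reproduces the full-batch gradient (for a mean loss over equal-size shards). The sketch below uses a one-parameter model `w*x` with loss `0.5*(w*x - y)^2` purely for demonstration.

```python
def shard(batch, num_workers):
    """Data parallelism: give each worker an equal-size slice of the batch."""
    per = len(batch) // num_workers
    return [batch[i * per:(i + 1) * per] for i in range(num_workers)]

def local_gradient(samples, w):
    """Worker-local gradient of the mean loss 0.5*(w*x - y)^2 wrt w."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

def allreduce_mean(grads):
    """Gradient synchronization: average the per-worker gradients."""
    return sum(grads) / len(grads)

batch = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.5
synced = allreduce_mean([local_gradient(s, w) for s in shard(batch, 2)])
full = local_gradient(batch, w)   # same gradient, computed on one worker
```

In practice the all-reduce is a collective network operation across GPUs and nodes, but the arithmetic equivalence shown here is what makes distributed data-parallel training produce the same updates as single-device training.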

Performance Metrics:

  • Training Speed: Wall-clock time to complete a training run
  • Throughput: Jobs processed per hour
  • Resource Utilization: GPU, CPU, and memory usage
  • Network Bandwidth: Data transfer rates between nodes
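As a small worked example of the throughput metric, jobs per hour follows directly from a job count over a measurement window (the function name here is illustrative):

```python
def throughput(completed_jobs, elapsed_seconds):
    """Jobs processed per hour over a measurement window."""
    return completed_jobs * 3600.0 / elapsed_seconds

rate = throughput(completed_jobs=12, elapsed_seconds=1800)   # 12 jobs in 30 min -> 24.0/hour
```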

Next Steps