Cluster Core Concepts
Understanding the fundamental concepts behind NeoSpace clusters is essential for using the platform effectively.
Cluster Architecture
A NeoSpace cluster consists of:
Cluster Components:
- Leader Node: Coordinates cluster operations and manages resources
- Worker Nodes: Execute training and inference jobs
- Network Infrastructure: High-bandwidth networking connecting all nodes
- Storage: Shared storage for models, datasets, and checkpoints
Architecture Benefits:
- Fault Tolerance: Automatic failover if a node fails
- Load Balancing: Distribute workloads across available nodes
- Resource Management: Efficient allocation of GPU resources
- Network Optimization: Optimized data transfer between nodes
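To make these components concrete, here is a minimal sketch of fetching a cluster's topology (leader node, worker nodes, shared storage) over a hypothetical REST endpoint. The URL, token handling, and response fields below are illustrative assumptions, not the documented NeoSpace API:

```python
import requests

# Hypothetical endpoint and token; substitute your actual NeoSpace API details.
API_URL = "https://api.neospace.example/v1"
TOKEN = "YOUR_API_TOKEN"

def get_cluster_topology(cluster_id: str) -> dict:
    """Fetch the leader node, worker nodes, and shared storage for one cluster."""
    resp = requests.get(
        f"{API_URL}/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    topology = get_cluster_topology("my-cluster")
    print("Leader:", topology.get("leader_node"))
    for node in topology.get("worker_nodes", []):
        print("Worker:", node.get("name"), "-", node.get("status"))
```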
Node Types
Clusters support multiple node types, each optimized for a particular class of workload:
High-Performance GPU Nodes:
- Latest-generation GPUs (e.g., NVIDIA A100, H100)
- High memory capacity for large models
- Optimized for training workloads
- Support for distributed training
Node Configuration:
- CPU: High-core count CPUs for data preprocessing
- Memory: Large RAM capacity for data caching
- Storage: Fast NVMe storage for dataset access
- Network: High-bandwidth network interfaces
Node Status:
- Ready: Node is available and ready for jobs
- Busy: Node is currently executing a job
- Maintenance: Node is under maintenance
- Offline: Node is unavailable
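To illustrate how these statuses might drive scheduling decisions, the sketch below models node status as a small enum and selects only Ready nodes for new jobs. The Node fields and names are hypothetical, not part of the NeoSpace API:

```python
from dataclasses import dataclass
from enum import Enum

class NodeStatus(Enum):
    READY = "ready"
    BUSY = "busy"
    MAINTENANCE = "maintenance"
    OFFLINE = "offline"

@dataclass
class Node:
    name: str
    gpus: int
    status: NodeStatus

def schedulable(nodes: list[Node]) -> list[Node]:
    """Only Ready nodes can accept new jobs; Busy, Maintenance, and Offline nodes are skipped."""
    return [n for n in nodes if n.status is NodeStatus.READY]

nodes = [
    Node("gpu-node-01", gpus=8, status=NodeStatus.READY),
    Node("gpu-node-02", gpus=8, status=NodeStatus.BUSY),
    Node("gpu-node-03", gpus=8, status=NodeStatus.MAINTENANCE),
]
print([n.name for n in schedulable(nodes)])  # ['gpu-node-01']
```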
NCUs (NeoSpace Compute Units)
NCUs are the fundamental units of compute within a NeoSpace cluster. Each NCU represents a fixed slice of computational capacity (GPU, CPU, and memory) that can be allocated to a training or inference job.
NCU Characteristics:
- Resource Allocation: Each NCU provides a specific amount of GPU, CPU, and memory
- Isolation: NCUs provide resource isolation between different jobs
- Scalability: Allocate multiple NCUs for larger workloads
- Monitoring: Track NCU utilization and performance
NCU States:
- Available: NCU is free and ready to accept jobs
- Busy: NCU is currently executing a job
- Reserved: NCU is reserved for a specific job or project
NCU Metrics:
- Total NCUs: Total number of NCUs in the cluster
- Busy NCUs: NCUs currently executing jobs
- Available NCUs: NCUs ready for new jobs
- Utilization: Average utilization across all NCUs
- Memory Usage: Memory utilization per NCU
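These metrics can be derived from per-NCU samples. The sketch below aggregates a few hypothetical NCU records into cluster-level totals, average utilization, and average memory usage; the field names are illustrative, not the NeoSpace metrics schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class NCU:
    ncu_id: str
    state: str          # "available", "busy", or "reserved"
    utilization: float  # 0.0 - 1.0
    memory_used_gb: float
    memory_total_gb: float

def cluster_ncu_metrics(ncus: list[NCU]) -> dict:
    """Aggregate per-NCU samples into the cluster-level metrics listed above."""
    states = Counter(n.state for n in ncus)
    return {
        "total_ncus": len(ncus),
        "busy_ncus": states["busy"],
        "available_ncus": states["available"],
        "avg_utilization": sum(n.utilization for n in ncus) / len(ncus),
        "avg_memory_usage": sum(n.memory_used_gb / n.memory_total_gb for n in ncus) / len(ncus),
    }

sample = [
    NCU("ncu-0", "busy", 0.92, 70.0, 80.0),
    NCU("ncu-1", "available", 0.0, 1.0, 80.0),
    NCU("ncu-2", "reserved", 0.0, 2.0, 80.0),
]
print(cluster_ncu_metrics(sample))
```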
Scalability and Performance
Clusters are designed for horizontal scalability:
Scaling Capabilities:
- Add Nodes: Add new nodes to increase cluster capacity
- Remove Nodes: Remove nodes during low-demand periods
- Auto-Scaling: Automatic scaling based on workload (future feature)
- Resource Pooling: Share resources across multiple projects
Performance Optimizations:
- Distributed Training: Train models across multiple GPUs and nodes
- Data Parallelism: Split data across multiple workers
- Model Parallelism: Split large models across multiple GPUs
- Gradient Synchronization: Efficient gradient aggregation
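Distributed training itself runs on standard frameworks. As a minimal example of data parallelism with gradient synchronization, here is a PyTorch DistributedDataParallel sketch; it assumes PyTorch with the NCCL backend is available on the worker nodes and that the script is launched with torchrun:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)
    # DDP replicates the model on every GPU and all-reduces gradients each step.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        # Each rank trains on its own shard of the data (data parallelism).
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # gradients are synchronized across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single node this would be launched as `torchrun --nproc_per_node=8 train.py`; multi-node runs add `--nnodes`, `--node_rank`, and `--rdzv_endpoint`.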
Performance Metrics:
- Training Speed: Time to train models
- Throughput: Jobs processed per hour
- Resource Utilization: GPU, CPU, and memory usage
- Network Bandwidth: Data transfer rates between nodes
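Two of these metrics reduce to simple arithmetic over monitoring samples, as the small hypothetical helpers below show:

```python
def throughput_jobs_per_hour(completed_jobs: int, window_seconds: float) -> float:
    """Jobs processed per hour over a monitoring window."""
    return completed_jobs / window_seconds * 3600.0

def effective_bandwidth_gbps(bytes_transferred: int, seconds: float) -> float:
    """Effective node-to-node transfer rate in gigabits per second."""
    return bytes_transferred * 8 / seconds / 1e9

# Example: 42 jobs finished in a 30-minute window; a 10 GiB checkpoint copied in 4.2 s.
print(throughput_jobs_per_hour(42, 1800))                   # 84.0 jobs/hour
print(round(effective_bandwidth_gbps(10 * 2**30, 4.2), 1))  # ~20.5 Gb/s
```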
Next Steps
- Learn about Monitoring to track cluster performance
- Check the Usage Guide for practical examples