Cluster Core Concepts
Understanding the fundamental concepts behind NeoSpace clusters is essential for using the platform effectively.
Cluster Architecture
A NeoSpace cluster consists of:
Cluster Components:
- Leader Node: Coordinates cluster operations and manages resources
- Worker Nodes: Execute training and inference jobs
- Network Infrastructure: High-bandwidth networking connecting all nodes
- Storage: Shared storage for models, datasets, and checkpoints
Architecture Benefits:
- Fault Tolerance: Automatic failover if a node fails
- Load Balancing: Distribute workloads across available nodes
- Resource Management: Efficient allocation of GPU resources
- Network Optimization: Optimized data transfer between nodes
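To make these components concrete, here is a minimal sketch of fetching a cluster's topology (leader node, worker nodes, shared storage) over a hypothetical REST endpoint. The URL, token handling, and response fields below are illustrative assumptions, not the documented NeoSpace API:

```python
import requests

# Hypothetical endpoint and token; substitute your actual NeoSpace API details.
API_URL = "https://api.neospace.example/v1"
TOKEN = "YOUR_API_TOKEN"

def get_cluster_topology(cluster_id: str) -> dict:
    """Fetch the leader node, worker nodes, and shared storage for one cluster."""
    resp = requests.get(
        f"{API_URL}/clusters/{cluster_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    topology = get_cluster_topology("my-cluster")
    print("Leader:", topology.get("leader_node"))
    for node in topology.get("worker_nodes", []):
        print("Worker:", node.get("name"), "-", node.get("status"))
```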
Node Types
Clusters support multiple node types, each optimized for a particular class of workload:
High-Performance GPU Nodes:
- Latest-generation GPUs (e.g., NVIDIA A100, H100)
- High memory capacity for large models
- Optimized for training workloads
- Support for distributed training
Node Configuration:
- CPU: High-core count CPUs for data preprocessing
- Memory: Large RAM capacity for data caching
- Storage: Fast NVMe storage for dataset access
- Network: High-bandwidth network interfaces
Node Status:
- Ready: Node is available and ready for jobs
- Busy: Node is currently executing a job
- Maintenance: Node is under maintenance
- Offline: Node is unavailable
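To illustrate how these statuses might drive scheduling decisions, the sketch below models node status as a small enum and selects only Ready nodes for new jobs. The Node fields and names are hypothetical, not part of the NeoSpace API:

```python
from dataclasses import dataclass
from enum import Enum

class NodeStatus(Enum):
    READY = "ready"
    BUSY = "busy"
    MAINTENANCE = "maintenance"
    OFFLINE = "offline"

@dataclass
class Node:
    name: str
    gpus: int
    status: NodeStatus

def schedulable(nodes: list[Node]) -> list[Node]:
    """Only Ready nodes can accept new jobs; Busy, Maintenance, and Offline nodes are skipped."""
    return [n for n in nodes if n.status is NodeStatus.READY]

nodes = [
    Node("gpu-node-01", gpus=8, status=NodeStatus.READY),
    Node("gpu-node-02", gpus=8, status=NodeStatus.BUSY),
    Node("gpu-node-03", gpus=8, status=NodeStatus.MAINTENANCE),
]
print([n.name for n in schedulable(nodes)])  # ['gpu-node-01']
```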
NCUs (NeoSpace Compute Units)
NCUs are the fundamental units of compute within a NeoSpace cluster. Each NCU represents a fixed slice of computational capacity (GPU, CPU, and memory) that can be allocated to a training or inference job.
NCU Characteristics:
- Resource Allocation: Each NCU provides a specific amount of GPU, CPU, and memory
- Isolation: NCUs provide resource isolation between different jobs
- Scalability: Allocate multiple NCUs for larger workloads
- Monitoring: Track NCU utilization and performance
NCU States:
- Available: NCU is free and ready to accept jobs
- Busy: NCU is currently executing a job
- Reserved: NCU is reserved for a specific job or project
NCU Metrics:
- Total NCUs: Total number of NCUs in the cluster
- Busy NCUs: NCUs currently executing jobs
- Available NCUs: NCUs ready for new jobs
- Utilization: Average utilization across all NCUs
- Memory Usage: Memory utilization per NCU
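These metrics can be derived from per-NCU samples. The sketch below aggregates a few hypothetical NCU records into cluster-level totals, average utilization, and average memory usage; the field names are illustrative, not the NeoSpace metrics schema:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class NCU:
    ncu_id: str
    state: str          # "available", "busy", or "reserved"
    utilization: float  # 0.0 - 1.0
    memory_used_gb: float
    memory_total_gb: float

def cluster_ncu_metrics(ncus: list[NCU]) -> dict:
    """Aggregate per-NCU samples into the cluster-level metrics listed above."""
    states = Counter(n.state for n in ncus)
    return {
        "total_ncus": len(ncus),
        "busy_ncus": states["busy"],
        "available_ncus": states["available"],
        "avg_utilization": sum(n.utilization for n in ncus) / len(ncus),
        "avg_memory_usage": sum(n.memory_used_gb / n.memory_total_gb for n in ncus) / len(ncus),
    }

sample = [
    NCU("ncu-0", "busy", 0.92, 70.0, 80.0),
    NCU("ncu-1", "available", 0.0, 1.0, 80.0),
    NCU("ncu-2", "reserved", 0.0, 2.0, 80.0),
]
print(cluster_ncu_metrics(sample))
```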
Scalability and Performance
Clusters are designed for horizontal scalability:
Scaling Capabilities:
- Add Nodes: Add new nodes to increase cluster capacity
- Remove Nodes: Remove nodes during low-demand periods
- Auto-Scaling: Automatic scaling based on workload (future feature)
- Resource Pooling: Share resources across multiple projects
Performance Optimizations:
- Distributed Training: Train models across multiple GPUs and nodes
- Data Parallelism: Split data across multiple workers
- Model Parallelism: Split large models across multiple GPUs
- Gradient Synchronization: Efficient gradient aggregation
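Distributed training itself runs on standard frameworks. As a minimal example of data parallelism with gradient synchronization, here is a PyTorch DistributedDataParallel sketch; it assumes PyTorch with the NCCL backend is available on the worker nodes and that the script is launched with torchrun:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(1024, 1024).cuda(local_rank)
    # DDP replicates the model on every GPU and all-reduces gradients each step.
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        # Each rank trains on its own shard of the data (data parallelism).
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randn(32, 1024, device=local_rank)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()   # gradients are synchronized across ranks here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On a single node this would be launched as `torchrun --nproc_per_node=8 train.py`; multi-node runs add `--nnodes`, `--node_rank`, and `--rdzv_endpoint`.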
Performance Metrics:
- Training Speed: Time to train models
- Throughput: Jobs processed per hour
- Resource Utilization: GPU, CPU, and memory usage
- Network Bandwidth: Data transfer rates between nodes
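Two of these metrics reduce to simple arithmetic over monitoring samples, as the small hypothetical helpers below show:

```python
def throughput_jobs_per_hour(completed_jobs: int, window_seconds: float) -> float:
    """Jobs processed per hour over a monitoring window."""
    return completed_jobs / window_seconds * 3600.0

def effective_bandwidth_gbps(bytes_transferred: int, seconds: float) -> float:
    """Effective node-to-node transfer rate in gigabits per second."""
    return bytes_transferred * 8 / seconds / 1e9

# Example: 42 jobs finished in a 30-minute window; a 10 GiB checkpoint copied in 4.2 s.
print(throughput_jobs_per_hour(42, 1800))                   # 84.0 jobs/hour
print(round(effective_bandwidth_gbps(10 * 2**30, 4.2), 1))  # ~20.5 Gb/s
```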
Next Steps
- Learn about Monitoring to track cluster performance
- Check the Usage Guide for practical examples