Clusters Overview

Clusters provide the compute infrastructure for training and serving Large Data Models (LDMs). They consist of high-performance GPU nodes optimized for deep learning workloads, enabling you to scale your machine learning operations efficiently.

What are Clusters?

Clusters in NeoSpace are collections of compute nodes that work together to provide the computational power needed to train and serve LDMs. Each cluster consists of multiple nodes, each equipped with GPUs optimized for deep learning workloads.

Key Characteristics:

  • High-Performance GPUs: Latest generation GPUs for accelerated training and inference
  • Distributed Computing: Multiple nodes working together for parallel processing (see the training sketch after this list)
  • Scalable Infrastructure: Scale compute resources based on workload demands
  • Network Optimization: High-bandwidth networking for efficient distributed training
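
Concretely, a common way to put multiple GPUs and nodes to work on a single job is data-parallel training. The sketch below uses PyTorch's DistributedDataParallel with a placeholder model and random data; it is generic PyTorch, not a NeoSpace-specific API:

    # Minimal data-parallel training loop. Launch one process per GPU with:
    #   torchrun --nnodes=<num_nodes> --nproc_per_node=<gpus_per_node> train.py
    import os

    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each process.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        # Placeholder model and data; substitute your own network and loader.
        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

        for step in range(100):
            batch = torch.randn(32, 1024, device=local_rank)
            loss = model(batch).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()  # gradients are averaged across all processes
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Each process trains on its own slice of the data, and gradient averaging after every backward pass keeps all model replicas in sync, so the nodes behave like one large accelerator.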

Why Use Clusters?

Clusters are essential for:

  • Training Large Models: LDMs often require more compute and memory than any single node can provide
  • Parallel Processing: Distribute training across multiple GPUs for faster model training (see the speedup sketch after this list)
  • Scalability: Scale resources up or down based on your workload
  • Resource Isolation: Isolate compute resources for different projects and teams
  • High Availability: Redundant nodes help ensure continuous operation
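
To make the speedup and scalability claims concrete, the sketch below estimates training time as GPUs are added. The 90% scaling efficiency and the 240-hour single-GPU baseline are illustrative assumptions, not measured NeoSpace figures; real efficiency depends on model size, batch size, and interconnect bandwidth:

    # Back-of-the-envelope estimate of training time as GPUs are added.
    def estimated_speedup(num_gpus: int, efficiency: float = 0.90) -> float:
        # Perfect scaling would be num_gpus; efficiency discounts the
        # communication overhead that appears once GPUs must synchronize.
        return num_gpus if num_gpus == 1 else num_gpus * efficiency

    single_gpu_hours = 240.0  # hypothetical time to train on one GPU
    for n in (1, 8, 32, 64):
        hours = single_gpu_hours / estimated_speedup(n)
        print(f"{n:3d} GPUs -> ~{hours:.1f} hours")
    # Prints: 1 -> 240.0, 8 -> 33.3, 32 -> 8.3, 64 -> 4.2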

Use Cases

Clusters are used for:

  • Model Training: Training LDMs on large datasets
  • Model Inference: Serving predictions at scale (see the inference sketch after this list)
  • Data Processing: Processing and preparing datasets for training
  • Experimentation: Running multiple experiments in parallel
  • Production Workloads: Serving production models with high availability
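
As an illustration of the inference use case, the sketch below runs batched predictions on a single GPU node; in a cluster deployment, many such replicas would typically sit behind a load balancer. The model is a placeholder, and nothing here is a NeoSpace-specific API:

    # Batched inference on one GPU node; a cluster runs many replicas of this.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = torch.nn.Linear(1024, 10).to(device).eval()  # stand-in for a trained model

    def predict(batch: torch.Tensor) -> torch.Tensor:
        # Batching queued requests amortizes per-call overhead, which is
        # what makes serving predictions at scale economical.
        with torch.no_grad():
            return model(batch.to(device)).argmax(dim=-1)

    requests = torch.randn(64, 1024)  # 64 queued requests
    print(predict(requests).shape)    # torch.Size([64])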

Next Steps