Dataset Core Concepts

Understanding the fundamental concepts of datasets is essential for effectively preparing data for training.

Dataset Types

NeoSpace supports two main dataset structures:

EVENT_BASED

Event-based datasets are designed for time-series and event data where each row represents an event or observation at a specific point in time.

Characteristics:

  • Time-Series Data: Data with temporal ordering
  • Event Records: Each row represents an event
  • Temporal Features: Time-based features and patterns
  • Sequential Patterns: Captures sequential relationships

Use Cases:

  • Transaction data
  • User behavior events
  • Time-series predictions
  • Sequential pattern recognition

Example Data:

  • Customer transactions over time
  • User clickstream data
  • Sensor readings
  • Log events
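
As a rough illustration of the event-based shape, the sketch below groups event rows into one time-ordered sequence per customer. The column names (customer_id, timestamp, amount) and the helper group_into_sequences are hypothetical and are not part of any NeoSpace API.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical event rows: one row per event, several rows per customer.
events = [
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 5, 9, 30), "amount": 42.0},
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 3, 14, 0), "amount": 12.5},
    {"customer_id": "c2", "timestamp": datetime(2024, 1, 4, 8, 15), "amount": 99.9},
]


def group_into_sequences(rows):
    """Return {customer_id: [amount, ...]} with amounts in chronological order."""
    sequences = defaultdict(list)
    # Sort by timestamp first so each per-customer list preserves temporal order.
    for row in sorted(rows, key=lambda r: r["timestamp"]):
        sequences[row["customer_id"]].append(row["amount"])
    return dict(sequences)


sequences = group_into_sequences(events)
```

The point of the sketch is that the same entity appears in many rows, and the order of those rows carries information.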

FEATURE_BASED

Feature-based datasets are designed for tabular and structured data where each row represents an entity with multiple features.

Characteristics:

  • Tabular Data: Structured data in rows and columns
  • Feature Vectors: Each row is a feature vector
  • Static Features: Features that don't change over time
  • Cross-Sectional Data: Data at a single point in time

Use Cases:

  • Customer profiles
  • Product features
  • Classification tasks
  • Regression problems

Example Data:

  • Customer demographic data
  • Product attributes
  • Survey responses
  • Static feature sets
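
By contrast, a feature-based dataset has one row per entity, and each row maps directly to a fixed-length feature vector. The sketch below shows that mapping; the column names and the helper to_feature_vector are hypothetical and chosen only for illustration.

```python
# Hypothetical feature columns; the order defines the layout of each vector.
FEATURE_COLUMNS = ["age", "income", "is_homeowner"]

# One row per customer, with no time dimension.
customers = [
    {"customer_id": "c1", "age": 34, "income": 52000.0, "is_homeowner": True},
    {"customer_id": "c2", "age": 51, "income": 78000.0, "is_homeowner": False},
]


def to_feature_vector(row):
    """Convert one entity row into a list of floats, in FEATURE_COLUMNS order."""
    return [float(row[column]) for column in FEATURE_COLUMNS]


vectors = {row["customer_id"]: to_feature_vector(row) for row in customers}
```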

Data Domains

Data domains categorize datasets by their business domain or use case. A dataset can be assigned one or more domains to enable flexible organization and filtering.

Domain Format:

  • Array of Strings: data_domain is an array of strings (string[])
  • Multiple Domains: A dataset can belong to multiple domains simultaneously
  • Required Field: At least one domain must be specified when creating a dataset

Common Domains:

  • Credit: Credit scoring, risk assessment
  • Transactions: Transaction data, payment processing
  • Accounts: Account information, customer data
  • Users: User behavior and profile data
  • Custom: User-defined domains

Domain Benefits:

  • Organization: Organize datasets by business domain
  • Filtering: Filter datasets by one or more domains
  • Domain-Specific Features: Apply domain-specific processing
  • Best Practices: Follow best practices tailored to each domain
  • Multi-Domain Support: Tag datasets with multiple domains for cross-domain analysis

Example: A dataset might have data_domain: ["credit", "transactions"] to indicate it contains both credit and transaction data.
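
The domain rules above (an array of strings with at least one entry) can be checked on the client side before a dataset is created. This is a minimal sketch: is_valid_data_domain and filter_by_domains are hypothetical helpers, and matching on any shared domain is an assumption, since the text does not say whether filtering by several domains means "any" or "all".

```python
def is_valid_data_domain(data_domain):
    """True if data_domain is a non-empty list of non-blank strings."""
    if not isinstance(data_domain, list) or not data_domain:
        return False
    return all(isinstance(d, str) and d.strip() for d in data_domain)


def filter_by_domains(datasets, wanted):
    """Keep datasets that share at least one domain with `wanted` (assumed semantics)."""
    wanted_set = set(wanted)
    return [d for d in datasets if wanted_set & set(d["data_domain"])]


# Hypothetical dataset records, for illustration only.
datasets = [
    {"name": "loans", "data_domain": ["credit", "transactions"]},
    {"name": "profiles", "data_domain": ["users"]},
]
credit_datasets = filter_by_domains(datasets, ["credit"])
```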

Dataset Status

Datasets have different statuses throughout their lifecycle:

Status Types:

  • READY: Dataset is ready for use in training
  • PROCESSING: Dataset is currently being processed
  • RUNNING: Dataset has active jobs running
  • AWAITING_MODELING: Dataset needs modeling configuration
  • FAILED: Dataset processing failed

Status Flow:

  1. Created: Dataset is created
  2. AWAITING_MODELING: Waiting for modeling configuration
  3. PROCESSING: Processing dataset files (a dataset whose processing fails moves to FAILED instead of READY)
  4. READY: Ready for training
  5. RUNNING: Active in training jobs
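
The status flow can be modeled as a small state machine. The sketch below is an illustration, not NeoSpace's implementation: the enum members mirror the status names listed above, the PROCESSING to FAILED transition is inferred from the description of FAILED, and transitions out of RUNNING or FAILED are omitted because the text does not describe them.

```python
from enum import Enum


class DatasetStatus(Enum):
    AWAITING_MODELING = "AWAITING_MODELING"
    PROCESSING = "PROCESSING"
    READY = "READY"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# Transitions taken from the documented flow. PROCESSING -> FAILED is an
# inference; what follows RUNNING or FAILED is not documented, so it is left out.
ALLOWED_TRANSITIONS = {
    DatasetStatus.AWAITING_MODELING: {DatasetStatus.PROCESSING},
    DatasetStatus.PROCESSING: {DatasetStatus.READY, DatasetStatus.FAILED},
    DatasetStatus.READY: {DatasetStatus.RUNNING},
}


def can_transition(current, target):
    """True if `target` directly follows `current` in the sketched flow."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_ready_for_training(status):
    """Only READY is described as ready for use in training."""
    return status is DatasetStatus.READY
```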

Data Partitions

Datasets can be partitioned for training, validation, and (optionally) final testing:

Partition Types:

  • Training Partition: Data used for model training
  • Validation Partition: Data used for model validation
  • Test Partition: Data used for final evaluation (optional)

Partition Strategies:

  • Percentage Split: Split by percentage (e.g., 80% training, 20% validation)
  • Date Range Split: Split by date ranges
  • Random Split: Random sampling for partitions
  • Stratified Split: Maintain class distribution across partitions
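
To make the difference between a random percentage split and a stratified split concrete, here is a minimal sketch using only the Python standard library. The function names, the label column, and the 80/20 default are illustrative assumptions; how NeoSpace performs these splits internally is not described here.

```python
import random
from collections import defaultdict


def random_split(rows, train_percent=80, seed=0):
    """Shuffle rows and split them by percentage into (train, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


def stratified_split(rows, label_key, train_percent=80, seed=0):
    """Split each class separately so both partitions keep the class ratio."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, validation = [], []
    for class_rows in by_label.values():
        class_train, class_validation = random_split(class_rows, train_percent, seed)
        train.extend(class_train)
        validation.extend(class_validation)
    return train, validation


# Hypothetical imbalanced data: 20 positive rows and 80 negative rows.
rows = [{"id": i, "label": 1 if i < 20 else 0} for i in range(100)]
train, validation = stratified_split(rows, "label")
```

With a plain random split, the share of positive rows in each partition can drift from the overall 20%; the stratified version keeps it fixed by construction.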

Partition Considerations:

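  • Temporal Ordering: For time-series data, keep rows in temporal order so that validation data comes after training data
  • Data Leakage: Ensure that information from the validation or test partition does not leak into the training partition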
  • Class Balance: Maintain class balance in partitions
  • Size: Ensure sufficient data in each partition
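
For time-series data, a date range split is one way to respect temporal ordering and avoid leakage. The sketch below assumes a hypothetical event_date column and cutoff date; both helper functions are illustrative and are not part of NeoSpace.

```python
from datetime import date, timedelta


def date_range_split(rows, cutoff, time_key="event_date"):
    """Rows before `cutoff` go to training; rows on or after it go to validation."""
    train = [r for r in rows if r[time_key] < cutoff]
    validation = [r for r in rows if r[time_key] >= cutoff]
    return train, validation


def has_temporal_overlap(train, validation, time_key="event_date"):
    """True if any training row is dated on or after the earliest validation row."""
    if not train or not validation:
        return False
    latest_train = max(r[time_key] for r in train)
    earliest_validation = min(r[time_key] for r in validation)
    return latest_train >= earliest_validation


# Hypothetical daily readings for 1 to 10 January 2024.
rows = [{"event_date": date(2024, 1, 1) + timedelta(days=i), "value": i} for i in range(10)]
train, validation = date_range_split(rows, cutoff=date(2024, 1, 9))
```

A check such as has_temporal_overlap is also useful after a random split of event data, where it will usually report an overlap and signal that a date-based split is the safer choice.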
