Dataset Core Concepts
Understanding the fundamental concepts of datasets is essential for effectively preparing data for training.
Dataset Types
NeoSpace supports two main dataset structures:
EVENT_BASED
Event-based datasets are designed for time-series and event data where each row represents an event or observation at a specific point in time.
Characteristics:
- Time-Series Data: Data with temporal ordering
- Event Records: Each row represents an event
- Temporal Features: Time-based features and patterns
- Sequential Patterns: Captures sequential relationships
Use Cases:
- Transaction data
- User behavior events
- Time-series predictions
- Sequential pattern recognition
Example Data:
- Customer transactions over time
- User clickstream data
- Sensor readings
- Log events
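As a minimal sketch of what event-based rows can look like, the snippet below builds a few transaction events in plain Python and puts them in chronological order. The field names (`customer_id`, `event_time`, `amount`) and the helper functions are hypothetical and chosen only for illustration; they are not part of the NeoSpace schema.

```python
from datetime import datetime

# Hypothetical event records: one row per event, each carrying a timestamp.
events = [
    {"customer_id": "c1", "event_time": datetime(2024, 1, 3, 9, 30), "amount": 40.0},
    {"customer_id": "c2", "event_time": datetime(2024, 1, 1, 12, 0), "amount": 15.5},
    {"customer_id": "c1", "event_time": datetime(2024, 1, 2, 18, 45), "amount": 99.9},
]


def sort_events(rows):
    """Return the events in chronological order (oldest first)."""
    return sorted(rows, key=lambda row: row["event_time"])


def events_per_entity(rows):
    """Group chronologically ordered amounts by customer to expose each sequence."""
    sequences = {}
    for row in sort_events(rows):
        sequences.setdefault(row["customer_id"], []).append(row["amount"])
    return sequences


ordered = sort_events(events)
sequences = events_per_entity(events)
```

Sorting before grouping keeps each customer's sequence in time order, which is what sequential patterns depend on.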
FEATURE_BASED
Feature-based datasets are designed for tabular and structured data where each row represents an entity with multiple features.
Characteristics:
- Tabular Data: Structured data in rows and columns
- Feature Vectors: Each row is a feature vector
- Static Features: Features that don't change over time
- Cross-Sectional Data: Data at a single point in time
Use Cases:
- Customer profiles
- Product features
- Classification tasks
- Regression problems
Example Data:
- Customer demographic data
- Product attributes
- Survey responses
- Static feature sets
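For contrast, a feature-based dataset could be sketched as one row per entity with no time dimension. The column names and helpers below are hypothetical and serve only to illustrate the shape of the data.

```python
# Hypothetical feature-based rows: one row per entity, no timestamps.
customers = [
    {"customer_id": "c1", "age": 34, "region": "north", "is_premium": True},
    {"customer_id": "c2", "age": 51, "region": "south", "is_premium": False},
]

# Columns treated as features; the identifier column is deliberately excluded.
FEATURE_COLUMNS = ["age", "region", "is_premium"]


def to_feature_vector(row, columns=FEATURE_COLUMNS):
    """Turn one entity row into an ordered feature vector."""
    return [row[col] for col in columns]


def has_unique_entities(rows, key="customer_id"):
    """Check that each entity appears exactly once, as expected for this layout."""
    ids = [row[key] for row in rows]
    return len(ids) == len(set(ids))


vectors = [to_feature_vector(row) for row in customers]
```

A repeated `customer_id` would suggest the data is closer to the event-based layout than to a feature-based one.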
Data Domains
Data domains categorize datasets by their business domain or use case. A dataset can be assigned one or more domains to enable flexible organization and filtering.
Domain Format:
- Array of Strings: data_domain is an array of strings (string[])
- Multiple Domains: A dataset can belong to multiple domains simultaneously
- Required Field: At least one domain must be specified when creating a dataset
Common Domains:
- Credit: Credit scoring, risk assessment
- Transactions: Transaction data, payment processing
- Accounts: Account information, customer data
- Users: User behavior and profile data
- Custom: User-defined domains
Domain Benefits:
- Organization: Organize datasets by business domain
- Filtering: Filter datasets by one or more domains
- Domain-Specific Features: Apply domain-specific processing
- Best Practices: Apply best practices specific to each domain
- Multi-Domain Support: Tag datasets with multiple domains for cross-domain analysis
Example:
A dataset might have data_domain: ["credit", "transactions"] to indicate it contains both credit and transaction data.
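The two rules above (an array of strings, with at least one entry) and domain filtering can be sketched client-side as follows. The function names and the dataset dictionaries are hypothetical; this is not the NeoSpace API, only an illustration of the stated constraints.

```python
def validate_data_domain(data_domain):
    """Check the stated rules: a list of strings with at least one entry."""
    if not isinstance(data_domain, list) or not data_domain:
        raise ValueError("data_domain must be a non-empty list of strings")
    if not all(isinstance(d, str) and d.strip() for d in data_domain):
        raise ValueError("every domain must be a non-empty string")
    return data_domain


def filter_by_domains(datasets, wanted):
    """Return the datasets tagged with at least one of the wanted domains."""
    wanted_set = set(wanted)
    return [ds for ds in datasets if wanted_set & set(ds["data_domain"])]


# Hypothetical dataset records used only to exercise the helpers.
datasets = [
    {"name": "card_history", "data_domain": ["credit", "transactions"]},
    {"name": "profiles", "data_domain": ["users"]},
]
matches = filter_by_domains(datasets, ["transactions"])
```

Here `matches` contains only `card_history`, because it is the only dataset tagged with `transactions`.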
Dataset Status
Datasets have different statuses throughout their lifecycle:
Status Types:
- READY: Dataset is ready for use in training
- PROCESSING: Dataset is currently being processed
- RUNNING: Dataset has active jobs running
- AWAITING_MODELING: Dataset needs modeling configuration
- FAILED: Dataset processing failed
Status Flow (in order; a dataset moves to FAILED if processing fails):
- Created: Dataset is created
- AWAITING_MODELING: Waiting for modeling configuration
- PROCESSING: Processing dataset files
- READY: Ready for training
- RUNNING: Active in training jobs
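One way to model these statuses in client code is an enum plus a transition table. The sketch below reads its transitions off the flow above. The table itself is an assumption: in particular, FAILED being reachable only from PROCESSING, and RUNNING and FAILED having no outgoing transitions, are guesses, since the flow does not say what follows them.

```python
from enum import Enum


class DatasetStatus(Enum):
    AWAITING_MODELING = "AWAITING_MODELING"
    PROCESSING = "PROCESSING"
    READY = "READY"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


# Assumed transitions, inferred from the documented flow (not confirmed behavior).
ASSUMED_TRANSITIONS = {
    DatasetStatus.AWAITING_MODELING: {DatasetStatus.PROCESSING},
    DatasetStatus.PROCESSING: {DatasetStatus.READY, DatasetStatus.FAILED},
    DatasetStatus.READY: {DatasetStatus.RUNNING},
    DatasetStatus.RUNNING: set(),  # what follows RUNNING is not documented
    DatasetStatus.FAILED: set(),   # recovery from FAILED is not documented
}


def can_transition(current, target):
    """Return True if the assumed table allows moving from current to target."""
    return target in ASSUMED_TRANSITIONS[current]


def can_start_training(status):
    """Only READY is described as ready for use in training."""
    return status is DatasetStatus.READY
```

A client could call `can_start_training(DatasetStatus("READY"))` before submitting a job and wait or surface an error for any other status.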
Data Partitions
Datasets can be partitioned for training, validation, and, optionally, final testing:
Partition Types:
- Training Partition: Data used for model training
- Validation Partition: Data used for model validation
- Test Partition: Data used for final evaluation (optional)
Partition Strategies:
- Percentage Split: Split by percentage (e.g., 80% training, 20% validation)
- Date Range Split: Split by date ranges
- Random Split: Random sampling for partitions
- Stratified Split: Maintain class distribution across partitions
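To make the difference between a random percentage split and a stratified split concrete, here is a small standard-library sketch. The function names, the `label` field, and the 80/20 default are illustrative assumptions and do not describe how NeoSpace performs partitioning internally.

```python
import random
from collections import defaultdict


def percentage_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows, then cut at the requested fraction."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def stratified_split(rows, label_key, train_fraction=0.8, seed=0):
    """Split each class separately so both partitions keep the class ratio."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    train, validation = [], []
    for label in sorted(by_label):
        part_train, part_val = percentage_split(by_label[label], train_fraction, seed)
        train.extend(part_train)
        validation.extend(part_val)
    return train, validation


# 50 hypothetical rows: 10 positives (label 1) and 40 negatives (label 0).
rows = [{"id": i, "label": 1 if i < 10 else 0} for i in range(50)]
train, validation = stratified_split(rows, "label")
```

With this input, the stratified split places 8 positives in training and 2 in validation, so both partitions keep the original 20% positive rate. A plain random split gives no such guarantee on small or imbalanced data.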
Partition Considerations:
- Temporal Ordering: For time-series data, maintain temporal order
- Data Leakage: Avoid data leakage between partitions
- Class Balance: Maintain class balance in partitions
- Size: Ensure sufficient data in each partition
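Some of these considerations can be checked mechanically. The sketch below pairs a date range split with a few sanity checks for partition size, shared rows, and temporal order. The field names (`id`, `event_date`), the cutoff, and the `min_rows` threshold are hypothetical, and the checks cover only simple forms of leakage.

```python
from datetime import date


def date_range_split(rows, cutoff, time_key="event_date"):
    """Rows strictly before the cutoff go to training; the rest go to validation."""
    train = [r for r in rows if r[time_key] < cutoff]
    validation = [r for r in rows if r[time_key] >= cutoff]
    return train, validation


def check_partitions(train, validation, id_key="id", time_key="event_date", min_rows=1):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    if len(train) < min_rows or len(validation) < min_rows:
        problems.append("partition too small")
    if {r[id_key] for r in train} & {r[id_key] for r in validation}:
        problems.append("rows shared between partitions")
    if train and validation:
        latest_train = max(r[time_key] for r in train)
        earliest_validation = min(r[time_key] for r in validation)
        if latest_train >= earliest_validation:
            problems.append("training data is not strictly earlier than validation data")
    return problems


# Ten hypothetical daily rows, 2024-01-01 through 2024-01-10.
rows = [{"id": i, "event_date": date(2024, 1, i + 1)} for i in range(10)]
train, validation = date_range_split(rows, cutoff=date(2024, 1, 9))
problems = check_partitions(train, validation)
```

For this input `problems` is empty. Swapping the two partitions would trigger the temporal-order warning, which is the kind of mistake that lets a model train on data from the future.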
Next Steps
- Learn about Creating Datasets to get started
- Explore Modeling to prepare datasets