Dataset Usage Guide

A step-by-step guide to creating datasets, with best practices and troubleshooting for common issues.

Complete Step-by-Step Guide

Follow these six steps to create a dataset and use it in training:

Step 1: Create Dataset

  • Navigate to Datasets section
  • Click "New Dataset"
  • Enter basic information
  • Select structure and one or more data domains (required)
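
Before submitting the form, the required inputs from Step 1 can be checked locally. The sketch below is only an illustration: the function validate_new_dataset and its parameters are hypothetical and not part of the product. The only rules it encodes come from this guide: basic information must be entered, and a structure and at least one data domain are required.

```python
from typing import List, Sequence


def validate_new_dataset(name: str, structure: str, domains: Sequence[str]) -> List[str]:
    """Return a list of problems; an empty list means the Step 1 inputs look complete.

    Hypothetical helper: the platform's real naming rules are not covered here.
    """
    problems = []
    if not name.strip():
        problems.append("name is required")
    if not structure.strip():
        problems.append("structure is required")

    # Ignore blank entries, then require at least one real domain.
    cleaned = [d.strip() for d in domains if d.strip()]
    if not cleaned:
        problems.append("at least one data domain is required")
    elif len(set(cleaned)) != len(cleaned):
        problems.append("duplicate data domains")
    return problems
```

An empty result means the name, structure, and domains are all present; any platform-specific naming rules still have to be checked against the product itself.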

Step 2: File Analysis

  • Connect data source
  • Upload or select files
  • Run file analysis
  • Review analysis results

Step 3: Modeling Configuration

  • Select feature columns
  • Select target columns
  • Exclude unwanted features
  • Configure timestamp (if EVENT_BASED)
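
The choices in Step 3 interact: a column should not be both a feature and a target, an excluded column should not remain selected, and an EVENT_BASED dataset needs a timestamp. A minimal sketch of such a consistency check follows. ModelingConfig and validate_modeling_config are hypothetical names, and treating the timestamp as mandatory for EVENT_BASED is an assumption based on the step above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ModelingConfig:
    """Hypothetical container for the Step 3 choices; not the platform's schema."""
    structure: str
    feature_columns: List[str]
    target_columns: List[str]
    excluded_columns: List[str] = field(default_factory=list)
    timestamp_column: Optional[str] = None


def validate_modeling_config(cfg: ModelingConfig, available_columns: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the config is consistent."""
    problems = []
    available = set(available_columns)
    features = set(cfg.feature_columns)
    targets = set(cfg.target_columns)
    excluded = set(cfg.excluded_columns)

    if not features:
        problems.append("no feature columns selected")
    if not targets:
        problems.append("no target columns selected")
    for name in sorted((features | targets | excluded) - available):
        problems.append(f"unknown column: {name}")
    # A column used as both feature and target leaks the answer into the input.
    for name in sorted(features & targets):
        problems.append(f"column is both feature and target: {name}")
    for name in sorted(features & excluded):
        problems.append(f"excluded column still selected as feature: {name}")
    # Assumption: EVENT_BASED datasets always need a timestamp column.
    if cfg.structure == "EVENT_BASED" and not cfg.timestamp_column:
        problems.append("EVENT_BASED structure requires a timestamp column")
    if cfg.timestamp_column and cfg.timestamp_column not in available:
        problems.append(f"unknown timestamp column: {cfg.timestamp_column}")
    return problems
```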

Step 4: Data Partitioning

  • Choose partition method
  • Configure training/validation split
  • Set date ranges (if applicable)
  • Review partition configuration
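
To make the two partitioning choices in Step 4 concrete, the sketch below shows a ratio-based split and a date-based split over in-memory rows. It illustrates the general idea only and does not describe how the platform partitions data; the function names, the row layout, and the 0.8 default are assumptions.

```python
from datetime import date
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_ratio(rows: List[Row], train_fraction: float = 0.8) -> Tuple[List[Row], List[Row]]:
    """Split rows in their existing order: the first part trains, the rest validates."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1 (exclusive)")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def split_by_date(rows: List[Row], timestamp_key: str, validation_start: date) -> Tuple[List[Row], List[Row]]:
    """Rows before validation_start train; rows on or after it validate."""
    train = [r for r in rows if r[timestamp_key] < validation_start]
    validation = [r for r in rows if r[timestamp_key] >= validation_start]
    return train, validation
```

The ratio split keeps the input order, so time-ordered input yields a training set that precedes the validation set. The date split makes the boundary explicit, which is easier to review when date ranges matter.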

Step 5: Process Dataset

  • Review all configurations
  • Click "Process Dataset"
  • Monitor processing progress
  • Wait for READY status
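
If processing is monitored from a script rather than the UI, waiting for READY is usually a polling loop with a timeout. The sketch below assumes a caller-supplied get_status function that returns the current status string; how that status is fetched is outside this guide. Only READY comes from this guide. The "FAILED" status name and the timing defaults are assumptions.

```python
import time
from typing import Callable, Sequence


def wait_until_ready(
    get_status: Callable[[], str],
    poll_seconds: float = 5.0,
    timeout_seconds: float = 600.0,
    failure_statuses: Sequence[str] = ("FAILED",),  # assumed name, not from the docs
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status until it returns READY, a failure status, or the timeout passes."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status == "READY":
            return status
        if status in failure_statuses:
            raise RuntimeError(f"dataset processing ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError(f"dataset still {status} after {timeout_seconds} seconds")
        sleep(poll_seconds)
```

Passing sleep and clock as parameters keeps the loop testable without real waiting.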

Step 6: Use in Training

  • Confirm the dataset status is READY
  • Select the dataset in training jobs
  • Monitor dataset usage
  • Track dataset performance

Best Practices

Data Quality:

  • Ensure data quality before creating datasets
  • Clean and preprocess data as needed
  • Validate data formats and types
  • Check for missing values and outliers
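
The checks for missing values and outliers can be run before upload with a few lines of standard-library Python. This is a sketch under stated assumptions: rows are dictionaries, None and empty strings count as missing, and outliers are flagged with the common 1.5 × IQR rule, which is one convention among several and not a platform requirement.

```python
import statistics
from typing import Dict, List, Sequence


def count_missing(rows: Sequence[Dict[str, object]], columns: Sequence[str]) -> Dict[str, int]:
    """Count rows where a column is absent, None, or an empty string."""
    return {
        column: sum(1 for row in rows if row.get(column) is None or row.get(column) == "")
        for column in columns
    }


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[float]:
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]; needs at least two values."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]
```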

Feature Selection:

  • Select relevant features
  • Avoid data leakage
  • Consider feature interactions
  • Balance feature count and quality
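
One rough screen for data leakage is to flag numeric features that correlate almost perfectly with the target, since such a feature is often derived from the target itself. The sketch below uses Pearson correlation with an arbitrary 0.98 threshold; both are illustrative assumptions. A flagged feature deserves a manual review and is not proof of leakage.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def suspicious_features(
    features: Dict[str, Sequence[float]], target: Sequence[float], threshold: float = 0.98
) -> List[str]:
    """Names of features whose absolute correlation with the target meets the threshold."""
    return sorted(
        name for name, values in features.items()
        if abs(pearson(values, target)) >= threshold
    )
```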

Partitioning:

  • Use appropriate partition strategy
  • Maintain temporal order for time-series
  • Ensure sufficient data in each partition
  • Avoid data leakage between partitions
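
The partitioning guidelines above can be turned into a quick sanity check on a candidate split. The sketch below assumes each row carries an identifier and, for time-series data, a comparable timestamp. check_partitions and its min_rows parameter are hypothetical; what counts as sufficient data depends on the use case.

```python
from typing import Dict, List, Optional, Sequence

Row = Dict[str, object]


def check_partitions(
    train: Sequence[Row],
    validation: Sequence[Row],
    id_key: str,
    timestamp_key: Optional[str] = None,
    min_rows: int = 1,
) -> List[str]:
    """Return problems found in a train/validation split; empty means none found."""
    problems = []
    if len(train) < min_rows:
        problems.append(f"training partition has fewer than {min_rows} rows")
    if len(validation) < min_rows:
        problems.append(f"validation partition has fewer than {min_rows} rows")

    # The same record in both partitions leaks validation data into training.
    shared = {r[id_key] for r in train} & {r[id_key] for r in validation}
    if shared:
        problems.append(f"{len(shared)} id(s) appear in both partitions")

    # For time-series data, training rows should not be later than validation rows.
    if timestamp_key and train and validation:
        latest_train = max(r[timestamp_key] for r in train)
        earliest_validation = min(r[timestamp_key] for r in validation)
        if latest_train > earliest_validation:
            problems.append("training data extends past the start of validation data")
    return problems
```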

Organization:

  • Use descriptive names
  • Add clear descriptions
  • Organize by domain (a dataset can be assigned multiple domains)
  • Track dataset versions
  • Use consistent domain naming across datasets
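
Inconsistent domain naming tends to creep in as spelling variants such as "Customer Data" and "customer-data". The sketch below groups domain names that differ only in case, spacing, or punctuation so the variants can be reconciled. The input mapping of dataset name to domain list is an assumed shape, not an export format of the product.

```python
import re
from collections import defaultdict
from typing import Dict, List, Sequence


def find_inconsistent_domains(datasets: Dict[str, Sequence[str]]) -> Dict[str, List[str]]:
    """Map a normalized domain key to its spelling variants, where more than one exists."""
    variants = defaultdict(set)
    for domains in datasets.values():
        for domain in domains:
            # Normalize by lowercasing and dropping everything except letters and digits.
            key = re.sub(r"[^a-z0-9]+", "", domain.lower())
            variants[key].add(domain)
    return {key: sorted(names) for key, names in variants.items() if len(names) > 1}
```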

Performance:

  • Optimize dataset size
  • Use efficient data formats
  • Consider data sampling for large datasets
  • Monitor processing times
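
When a source is too large to process in full, a uniform random sample can be drawn in a single pass without loading everything into memory. The sketch below uses reservoir sampling (Algorithm R) as one way to do this; it is a general technique, not a feature of the platform, and the seed parameter is there only to make runs repeatable.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items chosen uniformly at random from a stream of unknown length."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    sample: List[T] = []
    for index, item in enumerate(items):
        if index < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (index + 1).
            slot = rng.randint(0, index)
            if slot < k:
                sample[slot] = item
    return sample
```

Because the input is consumed as an iterator, this works on a file read line by line as well as on an in-memory list.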

Common Troubleshooting

Issue: Dataset Creation Fails

  • Symptom: Cannot create dataset
  • Possible Causes:
    • Invalid dataset name
    • Missing required fields
    • Invalid file format
  • Solutions:
    • Check dataset name requirements
    • Verify all required fields
    • Check file format compatibility

Issue: File Analysis Fails

  • Symptom: File analysis does not complete
  • Possible Causes:
    • Invalid file format
    • Corrupted files
    • Insufficient permissions
  • Solutions:
    • Verify file format
    • Check file integrity
    • Verify file permissions
    • Check file size limits
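
Several of these checks can be run locally before uploading. The sketch below tests that a path is a regular file, has an expected extension, is readable, and is neither empty nor over a size limit. The allowed extensions and the size limit are parameters because this guide does not state the platform's actual values. The check does not detect corrupted content.

```python
import os
from pathlib import Path
from typing import Collection, List


def preflight_file(path: str, max_bytes: int, allowed_suffixes: Collection[str]) -> List[str]:
    """Return problems that would likely block file analysis; empty means none found."""
    file_path = Path(path)
    if not file_path.is_file():
        return [f"not a file: {file_path}"]

    problems = []
    if file_path.suffix.lower() not in allowed_suffixes:
        problems.append(f"unexpected file extension: {file_path.suffix or '(none)'}")
    if not os.access(file_path, os.R_OK):
        problems.append("file is not readable by the current user")

    size = file_path.stat().st_size
    if size == 0:
        problems.append("file is empty")
    elif size > max_bytes:
        problems.append(f"file is {size} bytes, over the {max_bytes} byte limit")
    return problems
```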

Issue: Processing Fails

  • Symptom: Dataset processing fails
  • Possible Causes:
    • Invalid configuration
    • Data quality issues
    • Insufficient resources
  • Solutions:
    • Review configuration
    • Check data quality
    • Verify resource availability
    • Check processing logs

Issue: Dataset Not Ready

  • Symptom: Dataset status is not READY
  • Possible Causes:
    • Incomplete configuration
    • Processing not completed
    • Processing errors
  • Solutions:
    • Complete all configurations
    • Wait for processing to complete
    • Check for errors
    • Review dataset status

Issue: Cannot Use in Training

  • Symptom: Dataset not available for training
  • Possible Causes:
    • Dataset not READY
    • Missing features or targets
    • Dataset in use elsewhere
  • Solutions:
    • Ensure dataset is READY
    • Verify feature/target selection
    • Check dataset dependencies
    • Review training requirements

Next Steps

  • Learn about Training to use datasets for model training
  • Explore Connectors to connect data sources
  • Check Clusters to understand compute infrastructure