Creating and Managing Datasets

Learn how to create datasets, analyze data files, configure features and targets, and set up data partitions.

Creating a Dataset

To create a new dataset:

  1. Navigate to Datasets: Go to the Datasets section
  2. Click "New Dataset": Start the dataset creation process
  3. Basic Information:
    • Enter dataset name (required, max 100 characters)
    • Enter description (optional, max 200 characters)
    • Select dataset structure (EVENT_BASED or FEATURE_BASED)
    • Select one or more data domains (required)
  4. File Analysis: Analyze data files to understand structure
  5. Modeling Configuration: Configure features and targets
  6. Data Partitioning: Set up training/validation splits
  7. Process Dataset: Run processing to make the dataset ready for use
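
These steps are performed in the UI. As a rough sketch, the information gathered in step 3 amounts to a definition like the one below; the field names and dict shape are illustrative assumptions, not a documented API:

```python
# Hypothetical dataset definition mirroring the Basic Information fields above.
# Field names are illustrative assumptions, not a documented API.
dataset_definition = {
    "name": "credit-applications-2024",                  # required, max 100 characters
    "description": "Monthly credit application events",  # optional, max 200 characters
    "structure": "EVENT_BASED",                           # EVENT_BASED or FEATURE_BASED
    "data_domains": ["credit", "transactions"],           # at least one domain, multiple allowed
}
```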

Dataset Requirements:

  • Name: Unique name for the dataset (required, max 100 characters)
  • Structure: Must select EVENT_BASED or FEATURE_BASED (required)
  • Data Domain: Must select at least one data domain (required, array of strings)
  • Files: Must have data files from a connector (required)

Data Domain Details:

  • Type: Array of strings (string[])
  • Required: At least one domain must be selected
  • Multiple Selection: You can assign multiple domains to a single dataset
  • Common Values: credit, transactions, users, or custom domain names
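
As a minimal sketch, the requirements above could be checked programmatically before creating the dataset; the validation helper below is illustrative and not part of the product:

```python
def validate_dataset_definition(d: dict) -> list[str]:
    """Illustrative checks mirroring the dataset requirements listed above."""
    errors = []
    name = d.get("name", "")
    if not name or len(name) > 100:
        errors.append("Name is required and must be at most 100 characters.")
    if len(d.get("description", "")) > 200:
        errors.append("Description must be at most 200 characters.")
    if d.get("structure") not in ("EVENT_BASED", "FEATURE_BASED"):
        errors.append("Structure must be EVENT_BASED or FEATURE_BASED.")
    domains = d.get("data_domains", [])
    if not isinstance(domains, list) or not domains:
        errors.append("Select at least one data domain (array of strings).")
    return errors

print(validate_dataset_definition({
    "name": "credit-applications-2024",
    "structure": "EVENT_BASED",
    "data_domains": ["credit"],
}))  # [] means the definition satisfies the checks
```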

File Analysis

File analysis automatically examines your data files to determine their structure:

Analysis Process:

  1. File Upload: Files are loaded from the configured connectors
  2. Schema Detection: Automatic detection of data schema
  3. Data Profiling: Analysis of data types, distributions, and quality
  4. Feature Detection: Identification of potential features
  5. Quality Checks: Validation of data quality
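
The platform runs this analysis automatically. Purely to illustrate what schema detection, profiling, and quality checks involve, here is a minimal pandas sketch over a local CSV file (file name assumed):

```python
import pandas as pd

# Illustrative local profiling pass; the platform performs the equivalent
# steps automatically on files pulled from a connector.
df = pd.read_csv("events.csv")                # load the file
schema = df.dtypes.to_dict()                  # schema detection: column names and types
stats = df.describe(include="all")            # data profiling: basic per-column statistics
missing = df.isna().mean()                    # quality check: fraction of missing values per column
candidate_features = [col for col in df.columns if missing[col] < 0.5]  # skip mostly-empty columns

print(schema)
print(missing.sort_values(ascending=False).head())
print("Candidate feature columns:", candidate_features)
```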

Analysis Results:

  • Schema Information: Column names, types, and structure
  • Data Statistics: Basic statistics for each column
  • Data Quality: Quality metrics and issues
  • Feature Suggestions: Suggested features for modeling

File Types Supported:

  • CSV files
  • Parquet files
  • JSON files
  • Other structured formats
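
If you want to inspect a file locally before analysis, each of these formats can be loaded with pandas (file names below are placeholders):

```python
import pandas as pd

df_csv = pd.read_csv("data.csv")
df_parquet = pd.read_parquet("data.parquet")      # requires pyarrow or fastparquet
df_json = pd.read_json("data.json", lines=True)   # lines=True for newline-delimited JSON

print(df_csv.head())
```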

Feature Selection

Select which features to use in your model:

Feature Types:

  • Feature Columns: Input features for the model
  • Target Columns: Target variables to predict
  • Excluded Columns: Columns to exclude from training
  • Timestamp Column: Column for temporal ordering (EVENT_BASED)
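
One way to picture this configuration is as a mapping of columns to roles. The shape below is an illustrative assumption, not the product's schema:

```python
# Hypothetical column-role configuration for an EVENT_BASED dataset.
modeling_config = {
    "feature_columns": ["amount", "merchant_category", "customer_age"],
    "target_columns": ["is_fraud"],
    "excluded_columns": ["internal_id", "free_text_notes"],
    "timestamp_column": "event_time",   # temporal ordering for EVENT_BASED datasets
}
```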

Feature Selection Process:

  1. Review Features: Review all available features
  2. Select Features: Choose features to include
  3. Select Targets: Choose target variables
  4. Exclude Features: Exclude irrelevant features
  5. Validate Selection: Ensure valid feature/target selection
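
Step 5 is essentially a consistency check. A minimal sketch of the kind of rules involved (illustrative, not the platform's actual validation):

```python
def validate_selection(config: dict, all_columns: set[str]) -> list[str]:
    """Illustrative consistency checks for a feature/target selection."""
    errors = []
    features = set(config.get("feature_columns", []))
    targets = set(config.get("target_columns", []))
    if not features:
        errors.append("Select at least one feature column.")
    if not targets:
        errors.append("Select at least one target column.")
    overlap = features & targets
    if overlap:
        errors.append(f"Columns cannot be both feature and target: {overlap}")
    unknown = (features | targets) - all_columns
    if unknown:
        errors.append(f"Selected columns not present in the dataset: {unknown}")
    return errors
```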

Feature Selection Guidelines:

  • Relevance: Select features relevant to the prediction task
  • Quality: Avoid features with poor data quality
  • Redundancy: Avoid highly correlated redundant features
  • Target Selection: Ensure target is appropriate for the task
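
For the redundancy guideline, one quick heuristic is a pairwise correlation check over the numeric features, for example with pandas:

```python
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return pairs of numeric columns whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    return [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]
```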

Data Partitioning

Configure how your dataset is split for training and validation:

Partition Methods:

Percentage Split:

  • Specify percentage for training (e.g., 80%)
  • Remaining percentage for validation (e.g., 20%)
  • Simple and commonly used
  • Good for most use cases
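
Conceptually, an 80/20 percentage split is the familiar random train/validation split. For illustration only (the platform performs the split for you), e.g. with scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")                 # placeholder file name
train_df, validation_df = train_test_split(
    df,
    test_size=0.2,       # 20% held out for validation
    random_state=42,     # fixed seed for a reproducible split
)
print(len(train_df), len(validation_df))
```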

Date Range Split:

  • Specify date ranges for training and validation
  • Maintains temporal ordering
  • Important for time-series data
  • Prevents data leakage
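
Conceptually, a date range split trains on older data and validates on newer data, split at a cutoff on the timestamp column. A pandas sketch (column and file names assumed):

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["event_time"])

# Everything before the cutoff trains the model; everything after validates it.
cutoff = pd.Timestamp("2024-01-01")
train_df = df[df["event_time"] < cutoff]
validation_df = df[df["event_time"] >= cutoff]
```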

Partition Configuration:

  • Training Split: Percentage or date range for training
  • Validation Split: Percentage or date range for validation
  • Timestamp Column: Column used for date-based splits
  • Random Seed: For reproducible random splits
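
Putting these settings together, the partition configuration can be thought of as something like the following (field names are illustrative, not the product's schema):

```python
# Hypothetical partition configuration combining the settings listed above.
partition_config = {
    "method": "PERCENTAGE",            # or "DATE_RANGE"
    "training_split": 0.8,             # 80% of rows for training
    "validation_split": 0.2,           # remaining 20% for validation
    "timestamp_column": "event_time",  # used only for date-based splits
    "random_seed": 42,                 # reproducible random splits
}
```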

Next Steps