Creating and Managing Datasets
Learn how to create datasets, analyze data files, configure features and targets, and set up data partitions.
Creating a Dataset
To create a new dataset:
- Navigate to Datasets: Go to the Datasets section
- Click "New Dataset": Start the dataset creation process
- Basic Information:
  - Enter dataset name (required, max 100 characters)
  - Enter description (optional, max 200 characters)
  - Select dataset structure (EVENT_BASED or FEATURE_BASED)
  - Select one or more data domains (required, can select multiple)
- File Analysis: Analyze data files to understand structure
- Modeling Configuration: Configure features and targets
- Data Partitioning: Set up training/validation splits
- Process Dataset: Process the dataset to make it ready
Dataset Requirements:
- Name: Unique name for the dataset (required, max 100 characters)
- Structure: Must select EVENT_BASED or FEATURE_BASED (required)
- Data Domain: Must select at least one data domain (required, array of strings)
- Files: Must have data files from a connector (required)
Data Domain Details:
- Type: Array of strings (string[])
- Required: At least one domain must be selected
- Multiple Selection: You can assign multiple domains to a single dataset
- Common Values: credit, transactions, users, or custom domain names
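The requirements above can be checked before a dataset is submitted. The following is a minimal sketch, not the product's API: DatasetDefinition, validate_definition, and all field names are hypothetical, and only the limits (100 and 200 characters, the two structure values, at least one domain, at least one file) come from this page. Name uniqueness is not checked here because it depends on the datasets that already exist.

```python
from dataclasses import dataclass, field

# Limits and allowed values taken from the requirements listed above.
ALLOWED_STRUCTURES = {"EVENT_BASED", "FEATURE_BASED"}
MAX_NAME_LENGTH = 100
MAX_DESCRIPTION_LENGTH = 200


@dataclass
class DatasetDefinition:
    # Hypothetical container; field names are illustrative only.
    name: str
    structure: str
    data_domains: list[str]
    description: str = ""
    files: list[str] = field(default_factory=list)


def validate_definition(definition: DatasetDefinition) -> list[str]:
    """Return a list of problems; an empty list means the definition passes."""
    problems = []
    if not definition.name.strip():
        problems.append("name is required")
    elif len(definition.name) > MAX_NAME_LENGTH:
        problems.append(f"name exceeds {MAX_NAME_LENGTH} characters")
    if len(definition.description) > MAX_DESCRIPTION_LENGTH:
        problems.append(f"description exceeds {MAX_DESCRIPTION_LENGTH} characters")
    if definition.structure not in ALLOWED_STRUCTURES:
        problems.append("structure must be EVENT_BASED or FEATURE_BASED")
    if not any(domain.strip() for domain in definition.data_domains):
        problems.append("at least one data domain is required")
    if not definition.files:
        problems.append("at least one data file from a connector is required")
    return problems
```

For example, a definition with name "loans", structure "EVENT_BASED", data_domains ["credit", "transactions"], and one file passes with an empty problem list.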
File Analysis
File analysis inspects your data files automatically to determine their structure:
Analysis Process:
- File Upload: Files are uploaded from connectors
- Schema Detection: Automatic detection of data schema
- Data Profiling: Analysis of data types, distributions, and quality
- Feature Detection: Identification of potential features
- Quality Checks: Validation of data quality
Analysis Results:
- Schema Information: Column names, types, and structure
- Data Statistics: Basic statistics for each column
- Data Quality: Quality metrics and issues
- Feature Suggestions: Suggested features for modeling
File Types Supported:
- CSV files
- Parquet files
- JSON files
- Other structured formats
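To give a feel for what schema detection and data profiling involve, here is a rough sketch for a CSV file using only the Python standard library. It is an illustration under simple assumptions, not the analysis the platform actually runs: infer_type and profile_csv are hypothetical names, and the type inference only distinguishes integers, floats, and strings.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    # Try the narrowest type first, then fall back to wider ones.
    for type_name, caster in (("integer", int), ("float", float)):
        try:
            for value in non_empty:
                caster(value)
            return type_name
        except ValueError:
            continue
    return "string"


def profile_csv(text):
    """Return per-column type, missing count, and distinct count for CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    profile = {}
    for column in reader.fieldnames or []:
        # Short rows yield None for missing fields; treat those as empty.
        values = [(row[column] or "").strip() for row in rows]
        profile[column] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
            "distinct": len({v for v in values if v != ""}),
        }
    return profile
```

Calling profile_csv on the contents of a small file returns one entry per column, which corresponds loosely to the Schema Information, Data Statistics, and Data Quality results described above.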
Feature Selection
Select which features to use in your model:
Feature Types:
- Feature Columns: Input features for the model
- Target Columns: Target variables to predict
- Excluded Columns: Columns to exclude from training
- Timestamp Column: Column for temporal ordering (EVENT_BASED)
Feature Selection Process:
- Review Features: Review all available features
- Select Features: Choose features to include
- Select Targets: Choose target variables
- Exclude Features: Exclude irrelevant features
- Validate Selection: Ensure valid feature/target selection
Feature Selection Guidelines:
- Relevance: Select features relevant to the prediction task
- Quality: Avoid features with poor data quality
- Redundancy: Avoid highly correlated redundant features
- Target Selection: Ensure target is appropriate for the task
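The Validate Selection step can be pictured as a set of consistency checks over the chosen columns. The sketch below is hypothetical: validate_selection is not a platform function, and the specific rules (every column has at most one role, at least one feature and one target, a timestamp column for EVENT_BASED datasets) are assumptions inferred from the lists above rather than documented behavior.

```python
def validate_selection(columns, features, targets, excluded, structure, timestamp=None):
    """Return a list of problems with a feature/target selection (empty if none)."""
    problems = []
    available = set(columns)
    features, targets, excluded = set(features), set(targets), set(excluded)

    # Every selected column must exist in the analyzed schema.
    for role, chosen in (("feature", features), ("target", targets), ("excluded", excluded)):
        unknown = chosen - available
        if unknown:
            problems.append(f"unknown {role} columns: {sorted(unknown)}")

    if not features:
        problems.append("select at least one feature column")
    if not targets:
        problems.append("select at least one target column")

    # Assumed rule: a column cannot play two roles at once.
    overlap = (features & targets) | (features & excluded) | (targets & excluded)
    if overlap:
        problems.append(f"columns assigned to more than one role: {sorted(overlap)}")

    # Assumed rule: EVENT_BASED datasets need a column for temporal ordering.
    if structure == "EVENT_BASED":
        if timestamp is None:
            problems.append("EVENT_BASED datasets need a timestamp column")
        elif timestamp not in available:
            problems.append(f"unknown timestamp column: {timestamp}")
    return problems
```

Returning a list of problems instead of raising on the first one lets a user fix every issue in a single pass.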
Data Partitioning
Configure how your dataset is split for training and validation:
Partition Methods:
Percentage Split:
- Specify percentage for training (e.g., 80%)
- Remaining percentage for validation (e.g., 20%)
- Simple and commonly used
- Good for most use cases
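A percentage split amounts to shuffling the rows and cutting the list at the training fraction. The following is a minimal sketch, assuming an in-memory list of rows; percentage_split is a hypothetical helper, and the seeded shuffle is one common way to make a random split reproducible, not necessarily how the platform does it.

```python
import random


def percentage_split(rows, train_fraction=0.8, seed=42):
    """Shuffle rows with a fixed seed and split them into (training, validation)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1 (exclusive)")
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

With 10 rows and train_fraction=0.8, this yields 8 training rows and 2 validation rows, and the same seed always produces the same split.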
Date Range Split:
- Specify date ranges for training and validation
- Maintains temporal ordering
- Important for time-series data
- Helps prevent data leakage, because the model is validated only on records outside the training period
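A date range split assigns each row to training or validation according to its timestamp. This sketch is illustrative only: date_range_split is a hypothetical helper, rows are assumed to be dictionaries holding datetime.date values, ranges are treated as inclusive, and the check that training ends before validation begins is an assumption about how leakage is avoided.

```python
from datetime import date


def date_range_split(rows, timestamp_key, train_range, validation_range):
    """Split rows into (training, validation) by inclusive date ranges."""
    train_start, train_end = train_range
    val_start, val_end = validation_range
    # Assumed rule: validation must start strictly after training ends,
    # so no validation-period record can influence training.
    if train_end >= val_start:
        raise ValueError("training range must end before the validation range starts")
    train = [r for r in rows if train_start <= r[timestamp_key] <= train_end]
    validation = [r for r in rows if val_start <= r[timestamp_key] <= val_end]
    return train, validation


# Example: six monthly records, January to April for training, May to June for validation.
records = [{"ts": date(2024, month, 1), "value": month} for month in range(1, 7)]
training, validation = date_range_split(
    records,
    "ts",
    train_range=(date(2024, 1, 1), date(2024, 4, 30)),
    validation_range=(date(2024, 5, 1), date(2024, 6, 30)),
)
```

Rows whose timestamps fall outside both ranges are simply left out of the split.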
Partition Configuration:
- Training Split: Percentage or date range for training
- Validation Split: Percentage or date range for validation
- Timestamp Column: Column used for date-based splits
- Random Seed: For reproducible random splits
Next Steps
- Learn about Modeling to prepare datasets
- Check Operations for managing datasets