Creating and Managing Datasets

Learn how to create datasets, analyze data files, configure features and targets, and set up data partitions.

Creating a Dataset

To create a new dataset:

  1. Navigate to Datasets: Go to the Datasets section
  2. Click "New Dataset": Start the dataset creation process
  3. Basic Information:
    • Enter dataset name (required, max 100 characters)
    • Enter description (optional, max 200 characters)
    • Select dataset structure (EVENT_BASED or FEATURE_BASED)
    • Select one or more data domains (required)
  4. File Analysis: Analyze data files to understand structure
  5. Modeling Configuration: Configure features and targets
  6. Data Partitioning: Set up training/validation splits
  7. Process Dataset: Run processing to make the dataset ready for use
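
These steps are performed in the UI. As a rough sketch, the information gathered in step 3 amounts to a definition like the one below; the field names and dict shape are illustrative assumptions, not a documented API:

```python
# Hypothetical dataset definition mirroring the Basic Information fields above.
# Field names are illustrative assumptions, not a documented API.
dataset_definition = {
    "name": "credit-applications-2024",                  # required, max 100 characters
    "description": "Monthly credit application events",  # optional, max 200 characters
    "structure": "EVENT_BASED",                           # EVENT_BASED or FEATURE_BASED
    "data_domains": ["credit", "transactions"],           # at least one domain, multiple allowed
}
```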

Dataset Requirements:

  • Name: Unique name for the dataset (required, max 100 characters)
  • Structure: Must select EVENT_BASED or FEATURE_BASED (required)
  • Data Domain: Must select at least one data domain (required, array of strings)
  • Files: Must have data files from a connector (required)

Data Domain Details:

  • Type: Array of strings (string[])
  • Required: At least one domain must be selected
  • Multiple Selection: You can assign multiple domains to a single dataset
  • Common Values: credit, transactions, users, or custom domain names
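
As a minimal sketch, the requirements above could be checked programmatically before creating the dataset; the validation helper below is illustrative and not part of the product:

```python
def validate_dataset_definition(d: dict) -> list[str]:
    """Illustrative checks mirroring the dataset requirements listed above."""
    errors = []
    name = d.get("name", "")
    if not name or len(name) > 100:
        errors.append("Name is required and must be at most 100 characters.")
    if len(d.get("description", "")) > 200:
        errors.append("Description must be at most 200 characters.")
    if d.get("structure") not in ("EVENT_BASED", "FEATURE_BASED"):
        errors.append("Structure must be EVENT_BASED or FEATURE_BASED.")
    domains = d.get("data_domains", [])
    if not isinstance(domains, list) or not domains:
        errors.append("Select at least one data domain (array of strings).")
    return errors

print(validate_dataset_definition({
    "name": "credit-applications-2024",
    "structure": "EVENT_BASED",
    "data_domains": ["credit"],
}))  # [] means the definition satisfies the checks
```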

File Analysis

File analysis automatically examines your data files to determine their structure:

Analysis Process:

  1. File Upload: Files are loaded from the configured connectors
  2. Schema Detection: Automatic detection of data schema
  3. Data Profiling: Analysis of data types, distributions, and quality
  4. Feature Detection: Identification of potential features
  5. Quality Checks: Validation of data quality
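
The platform runs this analysis automatically. Purely to illustrate what schema detection, profiling, and quality checks involve, here is a minimal pandas sketch over a local CSV file (file name assumed):

```python
import pandas as pd

# Illustrative local profiling pass; the platform performs the equivalent
# steps automatically on files pulled from a connector.
df = pd.read_csv("events.csv")                # load the file
schema = df.dtypes.to_dict()                  # schema detection: column names and types
stats = df.describe(include="all")            # data profiling: basic per-column statistics
missing = df.isna().mean()                    # quality check: fraction of missing values per column
candidate_features = [col for col in df.columns if missing[col] < 0.5]  # skip mostly-empty columns

print(schema)
print(missing.sort_values(ascending=False).head())
print("Candidate feature columns:", candidate_features)
```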

Analysis Results:

  • Schema Information: Column names, types, and structure
  • Data Statistics: Basic statistics for each column
  • Data Quality: Quality metrics and issues
  • Feature Suggestions: Suggested features for modeling

File Types Supported:

  • CSV files
  • Parquet files
  • JSON files
  • Other structured formats
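
If you want to inspect a file locally before analysis, each of these formats can be loaded with pandas (file names below are placeholders):

```python
import pandas as pd

df_csv = pd.read_csv("data.csv")
df_parquet = pd.read_parquet("data.parquet")      # requires pyarrow or fastparquet
df_json = pd.read_json("data.json", lines=True)   # lines=True for newline-delimited JSON

print(df_csv.head())
```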

Feature Selection

Select which features to use in your model:

Feature Types:

  • Feature Columns: Input features for the model
  • Target Columns: Target variables to predict
  • Excluded Columns: Columns to exclude from training
  • Timestamp Column: Column for temporal ordering (EVENT_BASED)
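
One way to picture this configuration is as a mapping of columns to roles. The shape below is an illustrative assumption, not the product's schema:

```python
# Hypothetical column-role configuration for an EVENT_BASED dataset.
modeling_config = {
    "feature_columns": ["amount", "merchant_category", "customer_age"],
    "target_columns": ["is_fraud"],
    "excluded_columns": ["internal_id", "free_text_notes"],
    "timestamp_column": "event_time",   # temporal ordering for EVENT_BASED datasets
}
```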

Feature Selection Process:

  1. Review Features: Review all available features
  2. Select Features: Choose features to include
  3. Select Targets: Choose target variables
  4. Exclude Features: Exclude irrelevant features
  5. Validate Selection: Ensure valid feature/target selection
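
Step 5 is essentially a consistency check. A minimal sketch of the kind of rules involved (illustrative, not the platform's actual validation):

```python
def validate_selection(config: dict, all_columns: set[str]) -> list[str]:
    """Illustrative consistency checks for a feature/target selection."""
    errors = []
    features = set(config.get("feature_columns", []))
    targets = set(config.get("target_columns", []))
    if not features:
        errors.append("Select at least one feature column.")
    if not targets:
        errors.append("Select at least one target column.")
    overlap = features & targets
    if overlap:
        errors.append(f"Columns cannot be both feature and target: {overlap}")
    unknown = (features | targets) - all_columns
    if unknown:
        errors.append(f"Selected columns not present in the dataset: {unknown}")
    return errors
```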

Feature Selection Guidelines:

  • Relevance: Select features relevant to the prediction task
  • Quality: Avoid features with poor data quality
  • Redundancy: Avoid highly correlated redundant features
  • Target Selection: Ensure target is appropriate for the task
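
For the redundancy guideline, one quick heuristic is a pairwise correlation check over the numeric features, for example with pandas:

```python
import pandas as pd

def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.95):
    """Return pairs of numeric columns whose absolute correlation exceeds the threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    return [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(cols)
        for b in cols[i + 1:]
        if corr.loc[a, b] > threshold
    ]
```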

Data Partitioning

Configure how your dataset is split for training and validation:

Partition Methods:

Percentage Split:

  • Specify percentage for training (e.g., 80%)
  • Remaining percentage for validation (e.g., 20%)
  • Simple and commonly used
  • Good for most use cases
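
Conceptually, an 80/20 percentage split is the familiar random train/validation split. For illustration only (the platform performs the split for you), e.g. with scikit-learn:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")                 # placeholder file name
train_df, validation_df = train_test_split(
    df,
    test_size=0.2,       # 20% held out for validation
    random_state=42,     # fixed seed for a reproducible split
)
print(len(train_df), len(validation_df))
```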

Date Range Split:

  • Specify date ranges for training and validation
  • Maintains temporal ordering
  • Important for time-series data
  • Prevents data leakage
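
Conceptually, a date range split trains on older data and validates on newer data, split at a cutoff on the timestamp column. A pandas sketch (column and file names assumed):

```python
import pandas as pd

df = pd.read_csv("events.csv", parse_dates=["event_time"])

# Everything before the cutoff trains the model; everything after validates it.
cutoff = pd.Timestamp("2024-01-01")
train_df = df[df["event_time"] < cutoff]
validation_df = df[df["event_time"] >= cutoff]
```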

Partition Configuration:

  • Training Split: Percentage or date range for training
  • Validation Split: Percentage or date range for validation
  • Timestamp Column: Column used for date-based splits
  • Random Seed: For reproducible random splits
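
Putting these settings together, the partition configuration can be thought of as something like the following (field names are illustrative, not the product's schema):

```python
# Hypothetical partition configuration combining the settings listed above.
partition_config = {
    "method": "PERCENTAGE",            # or "DATE_RANGE"
    "training_split": 0.8,             # 80% of rows for training
    "validation_split": 0.2,           # remaining 20% for validation
    "timestamp_column": "event_time",  # used only for date-based splits
    "random_seed": 42,                 # reproducible random splits
}
```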

Next Steps