
Leaderboard Core Concepts

Understanding how the leaderboard works, how checkpoints are evaluated, and how rankings are determined.

How the Leaderboard Works

The Leaderboard aggregates evaluation results from benchmarks and turns them into model rankings:

Leaderboard Process:

  1. Training: Models are trained and generate checkpoints
  2. Evaluation: Checkpoints are evaluated against benchmarks
  3. Aggregation: Evaluation results are aggregated in the leaderboard
  4. Ranking: Models are ranked by performance metrics
  5. Comparison: Models are compared across different benchmarks

Key Components:

  • Checkpoints: Model checkpoints from training runs
  • Evaluations: Benchmark evaluation results
  • Metrics: Performance metrics from evaluations
  • Rankings: Model rankings based on metrics
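
These components can be pictured with a small, hypothetical data model; the Python class and field names below are illustrative and not part of any specific API:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    training_id: str  # training run that produced the checkpoint
    step: int         # training step at which the model state was saved

@dataclass
class Evaluation:
    checkpoint: Checkpoint
    benchmark: str
    metrics: dict = field(default_factory=dict)  # e.g. {"accuracy": 0.86}

# The leaderboard can be thought of as the collection of all evaluations;
# rankings are derived from their metrics.
leaderboard = [
    Evaluation(Checkpoint("run-a", 1000), "benchmark-1", {"accuracy": 0.82}),
    Evaluation(Checkpoint("run-a", 2000), "benchmark-1", {"accuracy": 0.86}),
    Evaluation(Checkpoint("run-b", 1500), "benchmark-1", {"accuracy": 0.84}),
]
```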

Checkpoints and Evaluations

Understanding checkpoints and evaluations:

Checkpoints:

  • Definition: Saved model states during training
  • Creation: Automatically created during training
  • Evaluation: Checkpoints are evaluated against benchmarks
  • Selection: Best checkpoints can be selected for deployment

Evaluations:

  • Definition: Benchmark evaluation results for checkpoints
  • Execution: Evaluations run checkpoints against benchmarks
  • Results: Evaluation results include metrics for each benchmark
  • Comparison: Results enable model comparison

Evaluation Process:

  1. Select checkpoint to evaluate
  2. Select benchmark(s) to evaluate against
  3. Run evaluation
  4. Results appear in the leaderboard
  5. Compare with other evaluations
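
The process above can be sketched as follows, assuming a hypothetical run_benchmark() function that stands in for whatever actually scores a checkpoint on a benchmark:

```python
def run_benchmark(checkpoint_id: str, benchmark: str) -> dict:
    # Placeholder result: a real implementation would load the checkpoint,
    # run the benchmark's tasks, and compute the benchmark's metrics.
    return {"accuracy": 0.85}

def evaluate_checkpoint(checkpoint_id: str, benchmarks: list) -> list:
    """Steps 1-3: evaluate the selected checkpoint against each selected benchmark."""
    return [
        {"checkpoint": checkpoint_id, "benchmark": b, "metrics": run_benchmark(checkpoint_id, b)}
        for b in benchmarks
    ]

# Step 4: the resulting rows are what the leaderboard aggregates and displays.
rows = evaluate_checkpoint("run-a/step-3000", ["benchmark-1", "benchmark-2"])
for row in rows:
    print(row["benchmark"], row["metrics"])
```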

Metrics and Rankings

How metrics and rankings work:

Metrics:

  • Per Benchmark: Each benchmark provides specific metrics
  • Aggregated: Metrics can be aggregated across benchmarks
  • Comparable: Metrics enable fair model comparison
  • Tracked: Metrics are tracked over time
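
As a small example, aggregating a per-benchmark metric into a single score for one checkpoint might look like the following; the unweighted mean is an assumption, and a real leaderboard may weight benchmarks differently:

```python
from statistics import mean

# One checkpoint's accuracy on each benchmark (illustrative values).
per_benchmark_accuracy = {
    "benchmark-1": 0.86,
    "benchmark-2": 0.78,
    "benchmark-3": 0.91,
}

# Aggregate across benchmarks with an unweighted mean.
aggregated_accuracy = mean(per_benchmark_accuracy.values())
print(f"aggregated accuracy: {aggregated_accuracy:.3f}")  # 0.850
```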

Rankings:

  • Metric-Based: Rankings based on selected metrics
  • Direction: Ascending or descending, depending on whether lower or higher metric values are better
  • Best Checkpoint: The best checkpoint per training run is identified automatically
  • Comparison: Compare rankings across different metrics

Ranking Methods:

  • Single Metric: Rank by a single metric
  • Average: Rank by average across multiple metrics
  • Best Checkpoint Only: Show only best checkpoint per training
  • Custom: Custom ranking configurations
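
The single-metric, direction-aware, and average-based methods can be sketched as follows (the data and names are illustrative):

```python
from statistics import mean

# Illustrative leaderboard rows: one row per evaluated checkpoint.
rows = [
    {"checkpoint": "run-a/step-2000", "metrics": {"accuracy": 0.86, "f1": 0.83, "loss": 0.41}},
    {"checkpoint": "run-b/step-1500", "metrics": {"accuracy": 0.84, "f1": 0.88, "loss": 0.38}},
    {"checkpoint": "run-c/step-3000", "metrics": {"accuracy": 0.88, "f1": 0.80, "loss": 0.45}},
]

def rank_by_metric(rows, metric, higher_is_better=True):
    """Single-metric ranking; the direction flag encodes whether higher or lower is better."""
    return sorted(rows, key=lambda r: r["metrics"][metric], reverse=higher_is_better)

def rank_by_average(rows, metrics):
    """Rank by the unweighted average of several metrics (all assumed higher-is-better)."""
    return sorted(rows, key=lambda r: mean(r["metrics"][m] for m in metrics), reverse=True)

print([r["checkpoint"] for r in rank_by_metric(rows, "accuracy")])                      # higher is better
print([r["checkpoint"] for r in rank_by_metric(rows, "loss", higher_is_better=False)])  # lower is better
print([r["checkpoint"] for r in rank_by_average(rows, ["accuracy", "f1"])])
```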

Best Checkpoint Selection

The leaderboard can automatically identify the best-performing checkpoint from each training run:

Selection Criteria:

  • Metric Value: Based on selected ranking metric
  • Direction: Ascending (lower is better) or descending (higher is better)
  • Per Training: Best checkpoint selected per training run
  • Automatic: Selection happens automatically based on the configured metric and direction

Selection Process:

  1. Select ranking metric
  2. Determine direction (asc/desc)
  3. Group by training
  4. Select best checkpoint per training
  5. Display in the leaderboard
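
A minimal sketch of this selection logic, grouping by training run and keeping the best checkpoint in each group (illustrative data and names):

```python
from collections import defaultdict

# Illustrative evaluation rows: several checkpoints per training run.
rows = [
    {"training": "run-a", "checkpoint": "run-a/step-1000", "accuracy": 0.82},
    {"training": "run-a", "checkpoint": "run-a/step-2000", "accuracy": 0.86},
    {"training": "run-b", "checkpoint": "run-b/step-1500", "accuracy": 0.84},
    {"training": "run-b", "checkpoint": "run-b/step-3000", "accuracy": 0.83},
]

def best_per_training(rows, metric, higher_is_better=True):
    """Steps 3-4: group rows by training run, then keep the best checkpoint in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["training"]].append(row)
    pick = max if higher_is_better else min
    return {training: pick(group, key=lambda r: r[metric]) for training, group in groups.items()}

# Step 5: these per-training winners are what the leaderboard displays.
for training, row in best_per_training(rows, "accuracy").items():
    print(training, "->", row["checkpoint"], row["accuracy"])
# run-a -> run-a/step-2000 0.86
# run-b -> run-b/step-1500 0.84
```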

Use Cases:

  • Model Comparison: Compare best models from each training
  • Deployment Selection: Select best checkpoint for deployment
  • Performance Tracking: Track best performance over time
  • Experiment Analysis: Analyze best results from experiments

Next Steps