
Leaderboard Core Concepts

Understanding how the leaderboard works, how checkpoints are evaluated, and how rankings are determined.

How the Leaderboard Works

The Leaderboard aggregates evaluation results from benchmarks and turns them into model rankings:

Leaderboard Process:

  1. Training: Models are trained and generate checkpoints
  2. Evaluation: Checkpoints are evaluated against benchmarks
  3. Aggregation: Evaluation results are aggregated in the leaderboard
  4. Ranking: Models are ranked by performance metrics
  5. Comparison: Models are compared across different benchmarks

Key Components:

  • Checkpoints: Model checkpoints from training runs
  • Evaluations: Benchmark evaluation results
  • Metrics: Performance metrics from evaluations
  • Rankings: Model rankings based on metrics
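
These components can be pictured with a small, hypothetical data model; the Python class and field names below are illustrative and not part of any specific API:

```python
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    training_id: str  # training run that produced the checkpoint
    step: int         # training step at which the model state was saved

@dataclass
class Evaluation:
    checkpoint: Checkpoint
    benchmark: str
    metrics: dict = field(default_factory=dict)  # e.g. {"accuracy": 0.86}

# The leaderboard can be thought of as the collection of all evaluations;
# rankings are derived from their metrics.
leaderboard = [
    Evaluation(Checkpoint("run-a", 1000), "benchmark-1", {"accuracy": 0.82}),
    Evaluation(Checkpoint("run-a", 2000), "benchmark-1", {"accuracy": 0.86}),
    Evaluation(Checkpoint("run-b", 1500), "benchmark-1", {"accuracy": 0.84}),
]
```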

Checkpoints and Evaluations

Understanding checkpoints and evaluations:

Checkpoints:

  • Definition: Saved model states during training
  • Creation: Automatically created during training
  • Evaluation: Checkpoints are evaluated against benchmarks
  • Selection: Best checkpoints can be selected for deployment

Evaluations:

  • Definition: Benchmark evaluation results for checkpoints
  • Execution: Evaluations run checkpoints against benchmarks
  • Results: Evaluation results include metrics for each benchmark
  • Comparison: Results enable model comparison

Evaluation Process:

  1. Select checkpoint to evaluate
  2. Select benchmark(s) to evaluate against
  3. Run evaluation
  4. Results appear in the leaderboard
  5. Compare with other evaluations
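
The process above can be sketched as follows, assuming a hypothetical run_benchmark() function that stands in for whatever actually scores a checkpoint on a benchmark:

```python
def run_benchmark(checkpoint_id: str, benchmark: str) -> dict:
    # Placeholder result: a real implementation would load the checkpoint,
    # run the benchmark's tasks, and compute the benchmark's metrics.
    return {"accuracy": 0.85}

def evaluate_checkpoint(checkpoint_id: str, benchmarks: list) -> list:
    """Steps 1-3: evaluate the selected checkpoint against each selected benchmark."""
    return [
        {"checkpoint": checkpoint_id, "benchmark": b, "metrics": run_benchmark(checkpoint_id, b)}
        for b in benchmarks
    ]

# Step 4: the resulting rows are what the leaderboard aggregates and displays.
rows = evaluate_checkpoint("run-a/step-3000", ["benchmark-1", "benchmark-2"])
for row in rows:
    print(row["benchmark"], row["metrics"])
```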

Metrics and Rankings

How metrics and rankings work:

Metrics:

  • Per Benchmark: Each benchmark provides specific metrics
  • Aggregated: Metrics can be aggregated across benchmarks
  • Comparable: Metrics enable fair model comparison
  • Tracked: Metrics are tracked over time
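
As a small example, aggregating a per-benchmark metric into a single score for one checkpoint might look like the following; the unweighted mean is an assumption, and a real leaderboard may weight benchmarks differently:

```python
from statistics import mean

# One checkpoint's accuracy on each benchmark (illustrative values).
per_benchmark_accuracy = {
    "benchmark-1": 0.86,
    "benchmark-2": 0.78,
    "benchmark-3": 0.91,
}

# Aggregate across benchmarks with an unweighted mean.
aggregated_accuracy = mean(per_benchmark_accuracy.values())
print(f"aggregated accuracy: {aggregated_accuracy:.3f}")  # 0.850
```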

Rankings:

  • Metric-Based: Rankings based on selected metrics
  • Direction: Ascending or descending, depending on whether lower or higher metric values are better
  • Best Checkpoint: The best checkpoint per training run is identified automatically
  • Comparison: Compare rankings across different metrics

Ranking Methods:

  • Single Metric: Rank by a single metric
  • Average: Rank by average across multiple metrics
  • Best Checkpoint Only: Show only best checkpoint per training
  • Custom: Custom ranking configurations
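
The single-metric, direction-aware, and average-based methods can be sketched as follows (the data and names are illustrative):

```python
from statistics import mean

# Illustrative leaderboard rows: one row per evaluated checkpoint.
rows = [
    {"checkpoint": "run-a/step-2000", "metrics": {"accuracy": 0.86, "f1": 0.83, "loss": 0.41}},
    {"checkpoint": "run-b/step-1500", "metrics": {"accuracy": 0.84, "f1": 0.88, "loss": 0.38}},
    {"checkpoint": "run-c/step-3000", "metrics": {"accuracy": 0.88, "f1": 0.80, "loss": 0.45}},
]

def rank_by_metric(rows, metric, higher_is_better=True):
    """Single-metric ranking; the direction flag encodes whether higher or lower is better."""
    return sorted(rows, key=lambda r: r["metrics"][metric], reverse=higher_is_better)

def rank_by_average(rows, metrics):
    """Rank by the unweighted average of several metrics (all assumed higher-is-better)."""
    return sorted(rows, key=lambda r: mean(r["metrics"][m] for m in metrics), reverse=True)

print([r["checkpoint"] for r in rank_by_metric(rows, "accuracy")])                      # higher is better
print([r["checkpoint"] for r in rank_by_metric(rows, "loss", higher_is_better=False)])  # lower is better
print([r["checkpoint"] for r in rank_by_average(rows, ["accuracy", "f1"])])
```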

Best Checkpoint Selection

The leaderboard can automatically identify the best-performing checkpoint from each training run:

Selection Criteria:

  • Metric Value: Based on selected ranking metric
  • Direction: Ascending (lower is better) or descending (higher is better)
  • Per Training: Best checkpoint selected per training run
  • Automatic: Selection happens automatically based on the configured metric and direction

Selection Process:

  1. Select ranking metric
  2. Determine direction (asc/desc)
  3. Group by training
  4. Select best checkpoint per training
  5. Display in the leaderboard
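
A minimal sketch of this selection logic, grouping by training run and keeping the best checkpoint in each group (illustrative data and names):

```python
from collections import defaultdict

# Illustrative evaluation rows: several checkpoints per training run.
rows = [
    {"training": "run-a", "checkpoint": "run-a/step-1000", "accuracy": 0.82},
    {"training": "run-a", "checkpoint": "run-a/step-2000", "accuracy": 0.86},
    {"training": "run-b", "checkpoint": "run-b/step-1500", "accuracy": 0.84},
    {"training": "run-b", "checkpoint": "run-b/step-3000", "accuracy": 0.83},
]

def best_per_training(rows, metric, higher_is_better=True):
    """Steps 3-4: group rows by training run, then keep the best checkpoint in each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["training"]].append(row)
    pick = max if higher_is_better else min
    return {training: pick(group, key=lambda r: r[metric]) for training, group in groups.items()}

# Step 5: these per-training winners are what the leaderboard displays.
for training, row in best_per_training(rows, "accuracy").items():
    print(training, "->", row["checkpoint"], row["accuracy"])
# run-a -> run-a/step-2000 0.86
# run-b -> run-b/step-1500 0.84
```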

Use Cases:

  • Model Comparison: Compare best models from each training
  • Deployment Selection: Select best checkpoint for deployment
  • Performance Tracking: Track best performance over time
  • Experiment Analysis: Analyze best results from experiments

Next Steps