
Benchmark Metrics

Understanding the available metrics and when to use them is crucial for effective model evaluation.

Classification Metrics

Accuracy

Definition: Percentage of correct predictions.

When to Use:

  • Balanced datasets
  • Equal importance of all classes
  • General performance indicator

Interpretation:

  • Higher is better (0-1 or 0-100%)
  • 1.0 = perfect predictions
  • 0.5 ≈ random guessing (for balanced binary classification)
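
A minimal sketch of computing accuracy, assuming scikit-learn (an illustrative choice, not part of the benchmark itself):

```python
from sklearn.metrics import accuracy_score

y_true = [0, 1, 1, 0, 1]   # actual labels
y_pred = [0, 1, 0, 0, 1]   # model predictions

# fraction of predictions that match the labels
print(accuracy_score(y_true, y_pred))  # 0.8 (4 of 5 correct)
```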

Precision

Definition: Percentage of positive predictions that are correct.

When to Use:

  • When false positives are costly
  • Fraud detection (false alarms trigger costly manual reviews)
  • Confirming a diagnosis (avoiding unnecessary treatment)

Interpretation:

  • Higher is better
  • Measures prediction reliability
  • Trade-off with recall
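
A quick illustration on toy data, again assuming scikit-learn:

```python
from sklearn.metrics import precision_score

y_true = [0, 0, 1, 1, 1]
y_pred = [1, 0, 1, 1, 0]   # 2 true positives, 1 false positive, 1 false negative

# precision = TP / (TP + FP) = 2 / 3
print(precision_score(y_true, y_pred))  # ≈ 0.667
```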

Recall (Sensitivity)

Definition: Percentage of actual positives correctly identified.

When to Use:

  • When false negatives are costly
  • Disease screening (missing a case is worse than a false alarm)
  • Security screening

Interpretation:

  • Higher is better
  • Measures coverage of the positive class
  • Trade-off with precision
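
On the same toy data, a sketch of recall, which looks at how many actual positives were found:

```python
from sklearn.metrics import recall_score

y_true = [0, 0, 1, 1, 1]   # 3 actual positives
y_pred = [1, 0, 1, 1, 0]   # 2 of the 3 positives were found

# recall = TP / (TP + FN) = 2 / 3
print(recall_score(y_true, y_pred))  # ≈ 0.667
```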

F1 Score

Definition: Harmonic mean of precision and recall.

When to Use:

  • When precision and recall are equally important
  • Single metric for comparison
  • Imbalanced datasets

Interpretation:

  • Higher is better (0-1)
  • Balances precision and recall
  • Good for imbalanced classes
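
A small sketch of how F1 combines the two (scikit-learn assumed):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 1]
y_pred = [1, 0, 1, 1, 0]

# F1 = 2 * precision * recall / (precision + recall)
print(f1_score(y_true, y_pred))  # ≈ 0.667 (precision and recall are both 2/3 here)
```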

ROC AUC

Definition: Area under the receiver operating characteristic (ROC) curve, which plots true positive rate against false positive rate across classification thresholds.

When to Use:

  • Binary classification
  • Ranking problems
  • Threshold-independent evaluation

Interpretation:

  • Higher is better (0-1)
  • 1.0 = perfect classifier
  • 0.5 = random classifier
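
ROC AUC is computed from predicted probabilities or scores rather than hard labels; a brief sketch, assuming scikit-learn:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]   # predicted probability of the positive class

# threshold-independent: ranks all positives against all negatives
print(roc_auc_score(y_true, y_score))  # 0.75
```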

Log Loss (Cross-Entropy)

Definition: Logarithmic loss measuring prediction confidence.

When to Use:

  • Probabilistic predictions
  • Confidence matters
  • Multi-class classification

Interpretation:

  • Lower is better
  • Penalizes confident wrong predictions
  • Sensitive to prediction probabilities
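
Log loss likewise takes predicted probabilities; for example (scikit-learn assumed):

```python
from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
y_prob = [0.1, 0.9, 0.8, 0.3]   # predicted probability of the positive class

# confident mistakes (e.g., p = 0.99 for a true 0) would inflate this sharply
print(log_loss(y_true, y_prob))  # ≈ 0.20
```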

KS Statistic (Kolmogorov-Smirnov)

Definition: Maximum difference between the cumulative score distributions of the positive and negative classes.

When to Use:

  • Credit scoring
  • Risk assessment
  • Distribution comparison

Interpretation:

  • Higher is better (0-1)
  • Measures separation between classes
  • Common in financial services
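
One common way to compute KS is to compare the score distributions of the two classes directly; a sketch using SciPy as an illustrative choice:

```python
import numpy as np
from scipy.stats import ks_2samp

y_true  = np.array([0, 0, 0, 1, 1, 1])
y_score = np.array([0.2, 0.3, 0.6, 0.4, 0.7, 0.9])

# KS = largest gap between the cumulative score distributions
# of the positive and negative classes
result = ks_2samp(y_score[y_true == 1], y_score[y_true == 0])
print(result.statistic)  # ≈ 0.667
```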

Gini Coefficient

Definition: Measure of how strongly the model separates positives from negatives (rank-ordering power).

When to Use:

  • Credit scoring
  • Risk modeling
  • Financial services

Interpretation:

  • Higher is better (0-1)
  • Gini = 2 × ROC AUC − 1
  • Common in credit risk
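
Because Gini is a simple transform of AUC, it can be derived from an AUC value; a sketch assuming scikit-learn:

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
gini = 2 * auc - 1        # Gini = 2 * AUC - 1
print(gini)               # 0.5 for an AUC of 0.75
```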

Regression Metrics

MSE (Mean Squared Error)

Definition: Average of squared differences between predictions and actuals.

When to Use:

  • General regression evaluation
  • When large errors are costly
  • Standard regression metric

Interpretation:

  • Lower is better
  • Penalizes large errors more
  • Sensitive to outliers
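
A minimal example, with scikit-learn assumed purely for illustration:

```python
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# squaring means the single 1.0 error dominates the smaller ones
print(mean_squared_error(y_true, y_pred))  # 0.375
```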

MAE (Mean Absolute Error)

Definition: Average of absolute differences between predictions and actuals.

When to Use:

  • When all errors are equally important
  • Robust to outliers
  • Interpretable error magnitude

Interpretation:

  • Lower is better
  • Same units as target variable
  • Less sensitive to outliers than MSE
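
On the same toy data, a sketch of MAE, which reports the average error magnitude in the target's units:

```python
from sklearn.metrics import mean_absolute_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# average absolute error, in the same units as the target
print(mean_absolute_error(y_true, y_pred))  # 0.5
```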

RMSE (Root Mean Squared Error)

Definition: Square root of MSE.

When to Use:

  • When you want error in the same units as the target variable
  • When large errors matter
  • Standard regression metric

Interpretation:

  • Lower is better
  • Same units as target
  • More interpretable than MSE
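
RMSE can be obtained by taking the square root of MSE; a brief sketch assuming NumPy and scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # square root of MSE
print(rmse)  # ≈ 0.612, back in the target's units
```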

R-squared (R²)

Definition: Proportion of variance explained by the model.

When to Use:

  • Model fit evaluation
  • Variance explanation
  • Standard regression metric

Interpretation:

  • Higher is better (can be negative)
  • 1.0 = perfect fit
  • 0.0 = no better than predicting the mean
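
A short example, again with scikit-learn as an illustrative choice:

```python
from sklearn.metrics import r2_score

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]

# 1.0 = perfect fit; 0.0 = no better than predicting the mean
print(r2_score(y_true, y_pred))  # ≈ 0.949
```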

When to Use Each Metric

For Classification:

  • General: Accuracy, F1 Score
  • Imbalanced Data: F1, ROC AUC
  • Cost-Sensitive: Precision, Recall
  • Probabilistic: Log Loss, ROC AUC
  • Financial: KS, Gini

For Regression:

  • General: RMSE, R²
  • Data with Outliers: MAE (more robust than MSE/RMSE)
  • Large Errors Important: MSE, RMSE
  • Variance Explanation: R²

Interpreting Results

Good Performance Indicators:

  • High accuracy/F1 for classification
  • Low MSE/MAE for regression
  • High R² for regression
  • Balanced precision/recall

Red Flags:

  • Metrics near chance level or a naive baseline
  • Large gaps between training and validation
  • Inconsistent results
  • Metrics not improving

Next Steps