🤖 Foundation Model Metrics Lab

Interactive Learning Environment for NLP Evaluation Metrics

Welcome to the Foundation Model Metrics Lab!

This interactive lab will help you understand and experiment with key metrics used to evaluate foundation models and NLP systems. You'll learn about:

  • ROUGE - Recall-Oriented Understudy for Gisting Evaluation
  • BLEU - Bilingual Evaluation Understudy
  • BERTScore - Contextual embedding-based evaluation

🎯 Learning Objectives

  • Understand how each metric works mathematically
  • Learn when to use each metric
  • Practice calculating scores with real examples
  • Compare different metrics on the same text
  • Identify strengths and limitations of each approach

📊 Quick Comparison

Metric      Best For        Focus                  Range
ROUGE       Summarization   Recall of n-grams      0 to 1
BLEU        Translation     Precision of n-grams   0 to 1
BERTScore   General NLG     Semantic similarity    0 to 1

💡 Pro Tip

Start with the individual metric tabs to understand each one deeply, then use the Compare All tab to see how they differ on the same text!

ROUGE Metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is primarily used for evaluating automatic summarization and machine translation. It measures the overlap of n-grams between generated text and reference text.

Types of ROUGE:

  • ROUGE-N: Overlap of n-grams (ROUGE-1 for unigrams, ROUGE-2 for bigrams)
  • ROUGE-L: Longest Common Subsequence
  • ROUGE-S: Skip-bigram co-occurrence

ROUGE-N Formula:
ROUGE-N (recall) = Σ(matched n-grams) / Σ(n-grams in reference)

F1-Score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
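The recall, precision, and F1 formulas above can be sketched in a few lines of pure Python. This is a minimal illustration, not the official `rouge-score` package; tokenization here is plain lowercased whitespace splitting:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 over whitespace tokens."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_n("the cat sat on the mat", "the cat is on the mat")
# p = r = f = 5/6: five of the six unigrams on each side overlap
```

Because both sentences have the same length here, precision and recall coincide; with a longer candidate, precision would drop while recall stayed put.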

ROUGE Scores:

BLEU Metric

BLEU (Bilingual Evaluation Understudy) is primarily used for machine translation evaluation. It measures the precision of n-grams and applies a brevity penalty so that overly short translations are not rewarded.

Key Components:

  • N-gram Precision: How many n-grams in candidate appear in reference
  • Brevity Penalty: Penalizes overly short translations
  • Geometric Mean: Combines different n-gram precisions

BLEU Formula:
BLEU = BP × exp(Σ wn × log(pn))

Where:
BP = brevity penalty = min(1, exp(1 - r/c))
r = reference length, c = candidate length
pn = modified (clipped) precision for n-grams
wn = weights (typically uniform, wn = 1/N)
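The formula above can be sketched as a small pure-Python function. This is a simplified sentence-level BLEU for illustration, not the full corpus-level metric (and it has no smoothing, so any zero n-gram precision sends the score to 0):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    cand_toks, ref_toks = candidate.lower().split(), reference.lower().split()

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_n, ref_n = ngram_counts(cand_toks, n), ngram_counts(ref_toks, n)
        overlap = sum((cand_n & ref_n).values())   # clipped matches
        total = max(sum(cand_n.values()), 1)
        if overlap == 0:
            return 0.0                             # any zero precision -> BLEU = 0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: BP = min(1, exp(1 - r/c))
    r, c = len(ref_toks), len(cand_toks)
    bp = min(1.0, math.exp(1 - r / c))
    return bp * math.exp(sum(log_precisions) / max_n)   # uniform weights wn = 1/N

score = bleu("the cat sat on the mat", "the cat is on the mat")
# p1 = 5/6, p2 = 3/5, BP = 1, so score = sqrt(5/6 * 3/5) ~= 0.707
```

Note how the bigram precision (3/5) pulls the score well below the unigram precision (5/6): higher-order n-grams reward fluent word order, not just word choice.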

BLEU Analysis:

BERTScore (Simplified)

BERTScore leverages contextual embeddings to compute similarity between candidate and reference sentences. This is a simplified version using cosine similarity of word vectors.

How it Works:

  • Tokenization: Break text into tokens
  • Embedding: Convert tokens to vectors (simplified here)
  • Matching: Find best alignment between tokens
  • Aggregation: Compute precision, recall, and F1

BERTScore Components:
Precision = Σ max_similarity(candidate_token, reference) / |candidate|
Recall = Σ max_similarity(reference_token, candidate) / |reference|
F1 = 2 × (Precision × Recall) / (Precision + Recall)
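The components above can be illustrated with a toy sketch. The three-dimensional word vectors below are made-up stand-ins for real contextual BERT embeddings; the matching logic (each token greedily pairs with its most similar token on the other side) is the part that mirrors actual BERTScore:

```python
import math

# Hypothetical toy vectors standing in for contextual BERT embeddings
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0], "glad": [0.85, 0.15, 0.05],
    "sad": [-0.8, 0.1, 0.1], "am": [0.0, 1.0, 0.0],
    "i": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore(candidate, reference):
    """Greedy BERTScore: each token matches its most similar token on the other side."""
    cand = [TOY_VECTORS[t] for t in candidate.lower().split()]
    ref = [TOY_VECTORS[t] for t in reference.lower().split()]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)

bertscore("i am glad", "i am happy")   # near 1: "glad" aligns with "happy"
bertscore("i am sad", "i am happy")    # noticeably lower: no close match for "sad"
```

Even with zero word overlap between "glad" and "happy", the score stays high; an n-gram metric like ROUGE-1 or BLEU would penalize that substitution heavily.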

BERTScore Results:

Compare All Metrics

Compare how different metrics evaluate the same text pairs. This helps understand their strengths and use cases.

Metric Comparison:

Practice Exercises

Exercise 1: Understanding ROUGE

Given these texts, predict which candidate will have the higher ROUGE-1 score:

Reference: "The student completed the assignment on time."

Candidate A: "The assignment was finished by the student punctually."

Candidate B: "The student completed their work on time."

Exercise 2: BLEU vs ROUGE

For summarization tasks, which metric is typically preferred and why?

Exercise 3: BERTScore Advantages

Write two sentences that use different words but have similar meaning, then calculate their BERTScore to see how it captures semantic similarity.

Exercise 4: Metric Selection

Match each use case with the most appropriate metric:

1. Evaluating machine translation quality:

2. Assessing news article summarization:

3. Evaluating paraphrasing quality: