🤖 Foundation Model Metrics Lab

Interactive Learning Environment for NLP Evaluation Metrics

Welcome to the Foundation Model Metrics Lab!

This interactive lab will help you understand and experiment with key metrics used to evaluate foundation models and NLP systems. You'll learn about:

  • ROUGE - Recall-Oriented Understudy for Gisting Evaluation
  • BLEU - Bilingual Evaluation Understudy
  • BERTScore - Contextual embedding-based evaluation

🎯 Learning Objectives

  • Understand how each metric works mathematically
  • Learn when to use each metric
  • Practice calculating scores with real examples
  • Compare different metrics on the same text
  • Identify strengths and limitations of each approach

📊 Quick Comparison

Metric      Best For        Focus                  Range
ROUGE       Summarization   Recall of n-grams      0 to 1
BLEU        Translation     Precision of n-grams   0 to 1
BERTScore   General NLG     Semantic similarity    0 to 1

💡 Pro Tip

Start with the individual metric tabs to understand each one deeply, then use the Compare All tab to see how they differ on the same text!

ROUGE Metric

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is primarily used for evaluating automatic summarization and machine translation. It measures the overlap of n-grams between generated text and reference text.

Types of ROUGE:

  • ROUGE-N: Overlap of n-grams (ROUGE-1 for unigrams, ROUGE-2 for bigrams)
  • ROUGE-L: Longest Common Subsequence
  • ROUGE-S: Skip-bigram co-occurrence

ROUGE-N Formula:
ROUGE-N (recall) = Σ(matched n-grams) / Σ(n-grams in reference)

F1-Score:
F1 = 2 × (Precision × Recall) / (Precision + Recall)
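The recall, precision, and F1 formulas above can be sketched in a few lines of pure Python. This is a minimal illustration, not the official `rouge-score` package; tokenization here is plain lowercased whitespace splitting:

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 over whitespace tokens."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

p, r, f = rouge_n("the cat sat on the mat", "the cat is on the mat")
# p = r = f = 5/6: five of the six unigrams on each side overlap
```

Because both sentences have the same length here, precision and recall coincide; with a longer candidate, precision would drop while recall stayed put.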

ROUGE Scores:

BLEU Metric

BLEU (Bilingual Evaluation Understudy) is primarily used for machine translation evaluation. It measures the precision of n-grams and applies a brevity penalty so that overly short translations are not rewarded.

Key Components:

  • N-gram Precision: How many n-grams in candidate appear in reference
  • Brevity Penalty: Penalizes overly short translations
  • Geometric Mean: Combines different n-gram precisions

BLEU Formula:
BLEU = BP × exp(Σ wn × log(pn))

Where:
BP = brevity penalty = min(1, exp(1 - r/c))
r = reference length, c = candidate length
pn = modified (clipped) precision for n-grams
wn = weights (typically uniform, wn = 1/N)
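The formula above can be sketched as a small pure-Python function. This is a simplified sentence-level BLEU for illustration, not the full corpus-level metric (and it has no smoothing, so any zero n-gram precision sends the score to 0):

```python
import math
from collections import Counter

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions x brevity penalty."""
    cand_toks, ref_toks = candidate.lower().split(), reference.lower().split()

    def ngram_counts(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_n, ref_n = ngram_counts(cand_toks, n), ngram_counts(ref_toks, n)
        overlap = sum((cand_n & ref_n).values())   # clipped matches
        total = max(sum(cand_n.values()), 1)
        if overlap == 0:
            return 0.0                             # any zero precision -> BLEU = 0
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: BP = min(1, exp(1 - r/c))
    r, c = len(ref_toks), len(cand_toks)
    bp = min(1.0, math.exp(1 - r / c))
    return bp * math.exp(sum(log_precisions) / max_n)   # uniform weights wn = 1/N

score = bleu("the cat sat on the mat", "the cat is on the mat")
# p1 = 5/6, p2 = 3/5, BP = 1, so score = sqrt(5/6 * 3/5) ~= 0.707
```

Note how the bigram precision (3/5) pulls the score well below the unigram precision (5/6): higher-order n-grams reward fluent word order, not just word choice.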

BLEU Analysis:

BERTScore (Simplified)

BERTScore leverages contextual embeddings to compute similarity between candidate and reference sentences. This is a simplified version using cosine similarity of word vectors.

How it Works:

  • Tokenization: Break text into tokens
  • Embedding: Convert tokens to vectors (simplified here)
  • Matching: Find best alignment between tokens
  • Aggregation: Compute precision, recall, and F1

BERTScore Components:
Precision = Σ max_similarity(candidate_token, reference) / |candidate|
Recall = Σ max_similarity(reference_token, candidate) / |reference|
F1 = 2 × (Precision × Recall) / (Precision + Recall)
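The components above can be illustrated with a toy sketch. The three-dimensional word vectors below are made-up stand-ins for real contextual BERT embeddings; the matching logic (each token greedily pairs with its most similar token on the other side) is the part that mirrors actual BERTScore:

```python
import math

# Hypothetical toy vectors standing in for contextual BERT embeddings
TOY_VECTORS = {
    "happy": [0.9, 0.1, 0.0], "glad": [0.85, 0.15, 0.05],
    "sad": [-0.8, 0.1, 0.1], "am": [0.0, 1.0, 0.0],
    "i": [0.1, 0.9, 0.2],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore(candidate, reference):
    """Greedy BERTScore: each token matches its most similar token on the other side."""
    cand = [TOY_VECTORS[t] for t in candidate.lower().split()]
    ref = [TOY_VECTORS[t] for t in reference.lower().split()]
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    return 2 * precision * recall / (precision + recall)

bertscore("i am glad", "i am happy")   # near 1: "glad" aligns with "happy"
bertscore("i am sad", "i am happy")    # noticeably lower: no close match for "sad"
```

Even with zero word overlap between "glad" and "happy", the score stays high; an n-gram metric like ROUGE-1 or BLEU would penalize that substitution heavily.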

BERTScore Results:

Compare All Metrics

Compare how different metrics evaluate the same text pairs. This helps understand their strengths and use cases.

Metric Comparison:

Practice Exercises

Exercise 1: Understanding ROUGE

Given these texts, predict which candidate will have the higher ROUGE-1 score:

Reference: "The student completed the assignment on time."

Candidate A: "The assignment was finished by the student punctually."

Candidate B: "The student completed their work on time."

Exercise 2: BLEU vs ROUGE

For summarization tasks, which metric is typically preferred and why?

Exercise 3: BERTScore Advantages

Write two sentences that use different words but have similar meaning, then calculate their BERTScore to see how it captures semantic similarity.

Exercise 4: Metric Selection

Match each use case with the most appropriate metric:

1. Evaluating machine translation quality:

2. Assessing news article summarization:

3. Evaluating paraphrasing quality: