Intuition Behind BLEU Score

Feng Li
Apr 2023

1. Introduction

In this article, I aim to elucidate the underlying intuition behind the BLEU score. While numerous online resources touch upon its primary components—precision and brevity penalty—few delve into the rationale behind its formula, particularly the incorporation of exponential and logarithmic functions. Although the original paper on BLEU discusses aspects like the geometric mean, it falls short of offering a comprehensive explanation.

This article endeavors to provide a thorough understanding of the BLEU score and its formulaic underpinnings. Readers seeking clarification specifically on the logarithmic aspects of weighted precision and the brevity penalty can navigate directly to the sections on Weighted Precision and Brevity Penalty.

2. Define the Problem

Before delving into the BLEU score, let's revisit the problem at hand. Translation quality is inherently subjective and challenging to quantify due to the ambiguity of language. Directly comparing two machine translation results is practically impossible without human involvement. Therefore, we need to devise an effective method to assess translation quality. One promising approach is to compare the similarity between each machine translation and its corresponding human counterpart. By treating human translations as the gold standard, we can determine which machine translation performs better based on its similarity score: higher similarity indicates better translation quality.

3. Precision

One simplistic method for evaluating translation quality is precision, defined as the ratio of the number of words in the machine translation that appear in the human translation to the total number of words in the machine translation. This ratio offers insight into the accuracy of the machine-translated words.

Formally, given a machine translation token list $TL_{pred}$ and a human translation token list $TL_{ref}$ $$ TL_{pred} = [t^{1}_{pred}, t^{2}_{pred}, \cdots, t^{n}_{pred}]$$ $$ TL_{ref} = [t^{1}_{ref}, t^{2}_{ref}, \cdots, t^{m}_{ref}]$$ where $t^{i}_{pred}$ and $t^{i}_{ref}$ are the $i$-th token in the machine translation and the human translation, respectively.

The precision is then calculated as: $$ P= \frac{\sum_{i=1}^{n}{Occurrences(TL_{ref},t^{i}_{pred})}}{ Len(TL_{pred})}$$ Here, $Occurrences(TL_{ref},t^{i}_{pred})$ indicates whether the word $t^{i}_{pred}$ appears in the human translation $TL_{ref}$: if it does, $Occurrences(TL_{ref},t^{i}_{pred})=1$; otherwise, $Occurrences(TL_{ref},t^{i}_{pred})=0$. $Len(TL_{pred})$ denotes the total number of words in the machine translation $TL_{pred}$.

Below is a code snippet demonstrating the calculation of precision:

from typing import List

def precision(pred: List[str], refs: List[List[str]]) -> float:
    """
    Calculate the precision of the machine translation
    :param pred: List of words in the machine translation
    :param refs: List of reference sentences, each sentence is a list of words
    :return: precision
    """
    correct = 0
    for word in pred:
        # A word counts as correct if it appears in any reference
        if any(word in ref for ref in refs):
            correct += 1
    return correct / len(pred)

However, this method has its limitations. It considers only the precision of words in the machine translation and overlooks the recall of words in the human translation. Consequently, a low-quality machine translation may achieve a high score. Consider the following example:

Machine Translation:
the the cat the the the the the
Human Translation:
the cat is on the mat

Despite achieving a precision of $\frac{8}{8} = 1 $, this machine translation is clearly inadequate.
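
To make this concrete, here is a quick check using the `precision` function above (a hypothetical usage sketch; the token lists mirror the example):

pred = ["the"] * 8
refs = [["the", "cat", "is", "on", "the", "mat"]]
print(precision(pred, refs))  # 1.0, even though the translation is useless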

4. Modified Precision

To address the limitations of the basic precision metric, the original BLEU paper proposes a modified version. This modified precision clips the count of each word by the maximum number of times it appears in the human translation. In the previous example, "the" appears at most 2 times in the human translation, so its clipped count is $\min(8, 2) = 2$, and the modified precision drops to $\frac{2}{8} = 0.25$.

Formally, let's define the set of words in the machine translation as $S_{pred}$ and the set of words in the human translation as $S_{ref}$. $$ S_{pred} = \{ w^{1}_{pred}, w^{2}_{pred}, \cdots, w^{n}_{pred} \}$$ $$ S_{ref} = \{ w^{1}_{ref}, w^{2}_{ref}, \cdots, w^{m}_{ref} \}$$

The modified precision is then calculated as:

$$ P = \frac{ \sum_{w \in S_{pred}}{ \min(Count(S_{pred},w),\,Count(S_{ref},w))} }{\sum_{w \in S_{pred}}{Count(S_{pred}, w)}}$$ where $Count(S_{pred},w)$ is the number of times the word $w$ appears in the machine translation and $Count(S_{ref},w)$ is the number of times the word $w$ appears in the human translation. The denominator is the total number of words in the machine translation.

Code snippet for calculating modified precision is shown below:

from collections import Counter
from typing import List

def modified_precision(pred: List[str], refs: List[List[str]]) -> float:
    """
    Calculate the modified precision of the machine translation
    :param pred: List of words in the machine translation
    :param refs: List of reference sentences, each sentence is a list of words
    :return: modified precision
    """
    correct = 0
    total = 0
    pred_dict = Counter(pred)
    refs_dict = [Counter(ref) for ref in refs]
    # Iterate over distinct words so each word is clipped exactly once
    for word, count in pred_dict.items():
        correct += min(count, max(ref_dict[word] for ref_dict in refs_dict))
        total += count
    return correct / total
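
Running this on the earlier example (a hypothetical usage sketch) shows the clipping at work:

pred = ["the"] * 8
refs = [["the", "cat", "is", "on", "the", "mat"]]
print(modified_precision(pred, refs))  # 0.25, i.e. min(8, 2) / 8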

While the modified precision is an improvement over basic precision, it still disregards the word order in translations. For instance, consider two translations with the same modified precision:

Machine Translation 1:
the cat is on the mat
Machine Translation 2:
the mat is on the cat
Human Translation:
the cat is on the mat

5. Multi-gram Modified Precision

Considering word order in translation involves more than just individual words. One method to account for it is to compute a modified precision score for each n-gram size and then average them. The n-gram sizes typically range from 1 to 4. For the sentence "the cat is on the mat," the 2-grams, for example, are "the cat," "cat is," "is on," "on the," and "the mat."

For a given sentence, we calculate the n-gram set $ S_{pred}^{i}$, where $i$ represents the n-gram size.

The averaged modified precision over n-grams is defined as: $$ P_{avg} = \frac{1}{4} \cdot \sum_{i = 1}^{4} \frac{ \sum_{ g \in S_{pred}^{i}} {\min(Count(S_{ref}^{i},g),\, Count(S_{pred}^{i},g))}}{ \sum_{ g \in S_{pred}^{i}} {Count(S_{pred}^{i},g)}}$$ Here, $S_{ref}^{i}$ and $S_{pred}^{i}$ are the $i$-gram sets of the human translation and the machine translation, and $Count(S_{ref}^{i},g)$ and $Count(S_{pred}^{i},g)$ count the number of times the $i$-gram $g$ appears in the human translation and the machine translation, respectively.

A code snippet for calculating n-gram modified precision is provided below:

from collections import Counter
from typing import List, Tuple

def get_ngram(sentence: List[str], n: int) -> List[Tuple[str, ...]]:
    """
    Get the n-grams of the sentence
    :param sentence: List of words in the sentence
    :param n: n-gram size
    :return: List of n-grams
    """
    return [tuple(sentence[i:i+n]) for i in range(len(sentence)-n+1)]

def ngram_modified_precision(pred: List[str], refs: List[List[str]], n: int) -> float:
    """
    Calculate the n-gram modified precision of the machine translation
    :param pred: List of words in the machine translation
    :param refs: List of reference sentences, each sentence is a list of words
    :param n: maximum n-gram size
    :return: modified precision averaged over n-gram sizes 1..n
    """
    res = 0.0
    for i in range(1, n+1):
        pred_counts = Counter(get_ngram(pred, i))
        refs_counts = [Counter(get_ngram(ref, i)) for ref in refs]
        correct = 0
        total = 0
        # Clip each distinct n-gram by its highest count in any reference
        for ngram, count in pred_counts.items():
            correct += min(count, max(ref_counts[ngram] for ref_counts in refs_counts))
            total += count
        res += correct / total
    return res / n
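
Applied to the two candidate translations above (a hypothetical usage sketch), the n-gram version now separates them:

ref = "the cat is on the mat".split()
mt1 = "the cat is on the mat".split()
mt2 = "the mat is on the cat".split()
print(ngram_modified_precision(mt1, [ref], 4))  # 1.0
print(ngram_modified_precision(mt2, [ref], 4))  # 0.5125 = (1 + 4/5 + 1/4 + 0) / 4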

This approach considers word order by examining n-grams, i.e., sequences of n consecutive words in the translation. Achieving high scores on tri-gram or four-gram modified precision requires matching word sequences between the machine and human translations. However, averaging the n-gram modified precisions arithmetically treats all n-gram sizes equally, which may not be ideal: longer n-grams are harder to match and more significant.

6. Weighted Precision

In the original paper on the BLEU score, the authors proposed using the geometric mean of the n-gram modified precisions. This choice stems from the correlation among the n-gram modified precision scores. Typically, when averaging data points we employ the arithmetic mean under the assumption that they are independent. The n-gram modified precisions, however, are correlated, because every matching longer n-gram contains matching shorter ones. In essence, if a machine translation scores high on, say, four-gram modified precision, it is likely to score high on the lower-order precisions as well.

To address this correlation, the authors advocate for the use of the geometric mean, defined as:

$$ P_{geometric} = \sqrt[n]{P_1 \times P_2 \times \cdots \times P_n}$$

where $P_i$ is the modified precision for $i$-grams.

This geometric mean is what BLEU uses as its weighted precision, with each n-gram size receiving the uniform weight $\frac{1}{n}$.

However, during implementation it becomes apparent that the calculated results may not behave as expected. As $n$ increases, the n-gram modified precisions can take very small values, and multiplying them together risks floating-point underflow. To mitigate this issue, the authors suggest working in log space and exponentiating at the end; since the exp and log functions cancel, the value is unchanged while the numerical instability is avoided. $$ P = \exp( \log( \sqrt[n]{P_1 \times P_2 \times \cdots \times P_n }))$$ The expression $\sqrt[n]{x}$ can be rewritten as $x^{\frac{1}{n}}$. Consequently: $$ P = \exp\left( \frac{1}{n} \cdot \log\left( \prod_{i \in \{ 1 \ldots n\}}P_i\right)\right)$$ Since $\log(a \cdot b) = \log(a) + \log(b)$, we can further simplify it to: $$ P = \exp\left( \frac{1}{n} \cdot \sum_{i \in \{ 1 \ldots n\} }{ \log( P_i )}\right)$$
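
To see the underflow problem concretely, here is a small illustrative sketch (the precision values are artificially tiny, chosen only to trigger underflow in 64-bit floats):

from math import exp, log

precisions = [1e-200, 1e-200]  # artificial values, for illustration only

naive = 1.0
for p in precisions:
    naive *= p  # the product 1e-400 underflows to 0.0
print(naive ** (1 / len(precisions)))  # 0.0: the geometric mean is lost

# The log-space computation recovers the correct value
print(exp(sum(log(p) for p in precisions) / len(precisions)))  # 1e-200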

Below is a code snippet for calculating weighted precision:

from collections import Counter
from math import exp, log
from typing import List

def weighted_precision(pred: List[str], refs: List[List[str]], n: int) -> float:
    """
    Calculate the weighted precision of the machine translation
    :param pred: List of words in the machine translation
    :param refs: List of reference sentences, each sentence is a list of words
    :param n: maximum n-gram size
    :return: weighted precision
    """
    res = []
    for i in range(1, n+1):
        pred_counts = Counter(get_ngram(pred, i))
        refs_counts = [Counter(get_ngram(ref, i)) for ref in refs]
        correct = 0
        total = 0
        for ngram, count in pred_counts.items():
            correct += min(count, max(ref_counts[ngram] for ref_counts in refs_counts))
            total += count
        res.append(correct / total)
    # log(0) is undefined: if any n-gram precision is 0, the geometric mean is 0
    if any(p == 0 for p in res):
        return 0.0
    return exp(sum(log(p) for p in res) / n)
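
The example below (hypothetical usage, with n limited to 2 so that the two-word candidate still has n-grams at every size) shows how a very short translation can still score perfectly:

ref = "the cat is on the mat".split()
print(weighted_precision("the cat is on the mat".split(), [ref], 2))  # 1.0
print(weighted_precision("the cat".split(), [ref], 2))                # 1.0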

Despite its soundness, this solution still has its limitations. For instance, it remains vulnerable to gaming by generating exceedingly short translations. Consider the following two translations, both having the same weighted precision, yet the first one is evidently superior:

Machine Translation 1:
the cat is on the mat
Machine Translation 2:
the cat
Human Translation:
the cat is on the mat

7. Brevity Penalty

To address the issue of generating excessively short sentences to achieve high scores, the concept of a brevity penalty is introduced. The rationale behind the brevity penalty is to penalize machine translations that are shorter than their human counterparts: the weighted precision is multiplied by a factor less than 1 whenever the machine translation is too short; otherwise, the factor remains 1.

In essence, we require a factor that always falls within the range $(0,1]$ and decreases as the machine translation becomes shorter relative to the human translation. The exponential function is an ideal candidate: it always yields positive values, and it outputs numbers between 0 and 1 for negative inputs.

With this in mind, the factor is defined as $\exp(1-\frac{Len(TL_{ref})}{Len(TL_{pred})})$. When the machine translation is shorter than the human translation, $1-\frac{Len(TL_{ref})}{Len(TL_{pred})}$ becomes negative, and the exponential function maps it to a value between 0 and 1. As the machine translation shortens, $1-\frac{Len(TL_{ref})}{Len(TL_{pred})}$ approaches negative infinity, and the exponential function converges towards 0. Conversely, when the machine translation is longer than the human translation, no penalty should be imposed, so the factor remains 1. Thus, the brevity penalty is defined as: $$ BP = \begin{cases} 1 & \text{if } Len(TL_{pred}) > Len(TL_{ref}) \\ \exp(1- \frac{Len(TL_{ref})}{Len(TL_{pred})}) & \text{if } Len(TL_{pred}) \leq Len(TL_{ref}) \end{cases}$$ Below is a code snippet for calculating the brevity penalty:

from math import exp
from typing import List

def brevity_penalty(pred: List[str], refs: List[List[str]]) -> float:
    """
    Calculate the brevity penalty of the machine translation
    :param pred: List of words in the machine translation
    :param refs: List of reference sentences, each sentence is a list of words
    :return: brevity penalty
    """
    pred_len = len(pred)
    # With multiple references, this sketch uses the shortest one; the original
    # BLEU paper instead uses the reference length closest to the prediction length
    ref_len = min(len(ref) for ref in refs)
    if pred_len > ref_len:
        return 1.0
    return exp(1 - ref_len / pred_len)
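
A quick check (hypothetical usage) confirms the behavior: the penalty is 1 for a full-length translation and shrinks rapidly as the prediction gets shorter:

ref = "the cat is on the mat".split()
print(brevity_penalty("the cat is on the mat".split(), [ref]))  # 1.0 (equal length, no penalty)
print(brevity_penalty("the cat".split(), [ref]))                # exp(1 - 6/2) ≈ 0.135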

8. BLEU Score

Now that we have all the necessary components, we can calculate the BLEU score. The BLEU score is defined as: $$ BLEU = BP \times P$$ where $BP$ represents the brevity penalty and $P$ signifies the weighted precision computed in log space. Typically, the BLEU score is computed over 1-grams to 4-grams.
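
Putting the pieces together, here is a minimal sketch of the full computation, reusing the `weighted_precision` and `brevity_penalty` functions defined above (a simplified sentence-level illustration, not the full corpus-level BLEU from the paper):

def bleu(pred: List[str], refs: List[List[str]], n: int = 4) -> float:
    """
    Calculate the BLEU score of the machine translation
    :param pred: List of words in the machine translation
    :param refs: List of reference sentences, each sentence is a list of words
    :param n: maximum n-gram size (typically 4)
    :return: BLEU score
    """
    return brevity_penalty(pred, refs) * weighted_precision(pred, refs, n)

ref = "the cat is on the mat".split()
print(bleu("the cat is on the mat".split(), [ref]))  # 1.0
print(bleu("the cat".split(), [ref], n=2))           # ≈ 0.135: perfect precision, heavy brevity penalty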

9. Conclusion

In this article, we delved into the intuition behind the BLEU score. Beginning with precision, we explored its limitations and then introduced various enhancements such as modified precision, n-gram modified precision, weighted precision, and brevity penalty. While the BLEU score serves as a valuable metric for evaluating machine translation, it's important to acknowledge its imperfections. Hence, it's advisable to complement BLEU with other metrics when assessing machine translation quality.