Mean Average Precision (mAP) and Recall (mAR)

What is mAP?

Mean Average Precision (mAP) is the de facto standard accuracy metric in object detection, providing a comprehensive measure of a model's ability to correctly detect and localize objects within an image.
This description and methodology follow the approach detailed in R. Padilla et al. [1].

How to calculate mAP?

The calculation of mAP involves the Predictions generated by the object detector and the corresponding Ground-truth annotations.

Predictions

Each prediction of an object detector consists of the following components:

  • Class \(\hat{c} \in \mathbb{N}_{\leq C}^+\), where \(C \in \mathbb{N}^+\) is the total number of classes;
  • Bounding Box \(\hat{\textbf{B}} \in \mathbb{R}^4\), specifying the location of the detected object;
  • Confidence Score \(\hat{s} \in [0, 1]\), indicating the model's confidence in the prediction.

Each detection can be represented as:

\[ \hat{\textbf{y}} = ( \hat{c}, \hat{\textbf{B}}, \hat{s} ) \]

Ground-truth

Ground-truth annotations provide the reference data against which predictions are evaluated. Each ground-truth annotation consists of the following components:

  • Class \(c \in \mathbb{N}_{\leq C}^+\), the true class label of the object;
  • Bounding Box \(\textbf{B} \in \mathbb{R}^4\), the true location of the object.

Each ground-truth object can be represented as:

\[\textbf{y} = ( c, \textbf{B} ). \]

To compute mAP, the following sets are used (a minimal code representation is sketched after the list):

  • a set of \(G\) ground-truth objects:
\[Y = \{\textbf{y}_i=(c_i, \textbf{B}_i) \}_{i=1,\ldots,G} \]
  • a set of \(D\) predictions:
\[\hat{Y} = \{\hat{\textbf{y}}_j = (\hat{c}_j, \hat{\textbf{B}}_j, \hat{s}_j) \}_{j=1,\ldots,D} \]
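
For illustration, a minimal Python representation of these two sets is sketched below. The `GroundTruth` and `Prediction` tuples, and the \((x_{\min}, y_{\min}, x_{\max}, y_{\max})\) box format, are hypothetical conventions chosen for this example; they are not the OD-Metrics API.

```python
from typing import List, NamedTuple


class GroundTruth(NamedTuple):
    """A ground-truth object y = (c, B)."""
    label: int      # class c
    box: tuple      # bounding box B = (x_min, y_min, x_max, y_max)


class Prediction(NamedTuple):
    """A prediction y_hat = (c_hat, B_hat, s_hat)."""
    label: int      # predicted class c_hat
    box: tuple      # predicted bounding box B_hat
    score: float    # confidence score s_hat in [0, 1]


# A toy set Y of G = 2 ground-truth objects and a set Y_hat of D = 2 predictions.
Y: List[GroundTruth] = [
    GroundTruth(label=1, box=(10, 10, 50, 50)),
    GroundTruth(label=2, box=(60, 60, 100, 100)),
]
Y_hat: List[Prediction] = [
    Prediction(label=1, box=(12, 11, 49, 52), score=0.90),
    Prediction(label=2, box=(58, 63, 99, 101), score=0.70),
]
```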

Precision and Recall

To compute Precision (\(\text{P}_\bar{c}\)) and Recall (\(\text{R}_\bar{c}\)) for a fixed class \(\bar{c} \in \mathbb{N}_{\leq C}^+\), we first define the True Positives (\(\text{TP}_\bar{c}\)): predictions \(\hat{\textbf{y}} = ( \hat{c}, \hat{\textbf{B}}, \hat{s} )\) that meet all of the following criteria:

  • the predicted class matches the ground-truth class (\(\hat{c} = \bar{c}\));
  • the IoU between the predicted and ground-truth bounding boxes is greater than or equal to an IoU threshold \(\tau_{\text{IoU}}\) (\(\text{IoU}( \textbf{B}, \hat{\textbf{B}} ) \geq \tau_{\text{IoU}}\));
  • the confidence score \(\hat{s}\) is greater than or equal to a confidence threshold \(\tau_{s}\) (\(\hat{s} \geq \tau_{s}\)).

Formally, \(\text{TP}_{\bar{c}}\), which depends on both \(\tau_{\text{IoU}}\) and \(\tau_{s}\), is defined as:

\[\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}}) = \{ ( \hat{c}, \hat{\textbf{B}}, \hat{s} ) \in \hat{Y} \mid \exists \; ( \bar{c}, \textbf{B} ) \in Y: \hat{c} = \bar{c} \wedge \text{IoU}( \textbf{B}, \hat{\textbf{B}} ) \geq \tau_{\text{IoU}} \wedge \hat{s} \geq \tau_{s} \}. \]

Thus, we can define Precision (\(\text{P}_\bar{c}\)) as:

\[ \text{P}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}}) = \frac{|\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})|}{|\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})| + |\text{FP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})|} = \frac{|\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})|}{|\hat{Y}_{\bar{c}, \tau_{s}}|}\]

where \(\hat{Y}_{\bar{c}, \tau_{s}} = \{\hat{\textbf{y}} = ( \hat{c}, \hat{\textbf{B}}, \hat{s} ) \in \hat{Y} \mid \hat{c} = \bar{c} \wedge \hat{s} \geq \tau_{s}\}\) and \(\text{FP}_{\bar{c}}\) (False Positives) is the set of predictions in \(\hat{Y}_{\bar{c}, \tau_{s}}\) that are not true positives.

Similarly, Recall (\(\text{R}_\bar{c}\)) is defined as:

\[ \text{R}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}}) = \frac{|\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})|}{|\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})| + |\text{FN}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})|} = \frac{|\text{TP}_{\bar{c}}(\tau_{s}; \tau_{\text{IoU}})|}{|Y_{\bar{c}}|}\]

where \(Y_{\bar{c}} = \{ \textbf{y} = (c, \textbf{B}) \in Y \mid c = \bar{c} \}\) and \(\text{FN}_{\bar{c}}\) (False Negatives) is the set of ground-truth objects in \(Y_{\bar{c}}\) that are not matched by any true positive.
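
A minimal sketch of these definitions in Python, reusing the `GroundTruth`/`Prediction` tuples introduced above. Like the set-based definition of \(\text{TP}_{\bar{c}}\), this sketch does not enforce a one-to-one matching between predictions and ground truths.

```python
def iou(box_a: tuple, box_b: tuple) -> float:
    """IoU of two boxes in (x_min, y_min, x_max, y_max) format."""
    ix_min, iy_min = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix_max, iy_max = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def precision_recall(y_true, y_pred, class_id, tau_s, tau_iou):
    """P_c and R_c for class `class_id` at thresholds (tau_s, tau_iou)."""
    # Y_{c_bar}: ground truths of the class under consideration.
    gts = [g for g in y_true if g.label == class_id]
    # Y_hat_{c_bar, tau_s}: predictions of the class kept at confidence >= tau_s.
    preds = [p for p in y_pred if p.label == class_id and p.score >= tau_s]
    # TP_{c_bar}: kept predictions matching at least one ground truth with IoU >= tau_iou.
    tp = sum(1 for p in preds if any(iou(g.box, p.box) >= tau_iou for g in gts))
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(gts) if gts else 0.0
    return precision, recall
```

On the toy sets above, `precision_recall(Y, Y_hat, class_id=1, tau_s=0.5, tau_iou=0.5)` returns \((1.0, 1.0)\), since the single class-1 prediction overlaps its ground truth with IoU above 0.5.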

Average Precision

For a specific class \(\bar{c}\) and a fixed IoU threshold \(\bar{\tau}_{\text{IoU}}\), the Average Precision \(\text{AP}_{\bar{c}}@[\bar{\tau}_{\text{IoU}}]\) is based on the area under the \(\text{P}_{\bar{c}}(\tau_{s}; \bar{\tau}_{\text{IoU}})\times \text{R}_{\bar{c}}(\tau_{s}; \bar{\tau}_{\text{IoU}})\) (Precision-Recall) curve:

\[\text{AP}_{\bar{c}}@[\bar{\tau}_{\text{IoU}}] = \int_0^1 \text{P}_{\bar{c}}(\text{R}_{\bar{c}}; \bar{\tau}_{\text{IoU}}) d\text{R}_{\bar{c}} \]

In practice, this integral is replaced by a finite sum over a set of reference recall values, using one of several interpolation methods. One starts by ordering the \(K\) distinct confidence scores output by the detector for the specific class \(\bar{c}\):

\[\{ \tau_{s_k}, k \in \mathbb{N}_{\leq K}^+ \mid \tau_{s_i} > \tau_{s_j} \; \forall i > j\}\]

Since the recall values \(\text{R}_{\bar{c}}\) are in a one-to-one, monotonic correspondence with the thresholds \(\tau_{s_k}\), which in turn are in a one-to-one, monotonic correspondence with the index \(k\), the Precision-Recall curve is not continuous but is sampled at the discrete points \(\text{R}_{\bar{c}}(\tau_{s_k};\bar{\tau}_{\text{IoU}})\), yielding the set of pairs \((\text{P}_{\bar{c}}(\tau_{s_k};\bar{\tau}_{\text{IoU}}),\text{R}_{\bar{c}}(\tau_{s_k};\bar{\tau}_{\text{IoU}}))\) indexed by \(k\). One then defines an ordered set of reference recall values \(\text{R}_r\):

\[\{ \text{R}_{r_n}, n \in \mathbb{N}_{\leq N}^+ \mid \text{R}_{r_m} < \text{R}_{r_n} \; \forall m > n \}\]

The Average Precision \(\text{AP}_{\bar{c}}\) is computed using the two ordered sets \(\{ \tau_{s_k} \}_{k \in \mathbb{N}_{\leq K}^+}\) and \(\{ \text{R}_{r_n} \}_{n \in \mathbb{N}_{\leq N}^+}\). Before computing \(\text{AP}_{\bar{c}}\), however, the Precision-Recall pairs must be interpolated so that the resulting Precision-Recall curve is monotonically non-increasing. The interpolated curve is defined by a function \(\tilde{\text{P}}_{\bar{c}}(x; \bar{\tau}_{\text{IoU}})\), where \(x\) is a real value in the interval \([0, 1]\), given by:

\[\tilde{\text{P}}_{\bar{c}}(x; \bar{\tau}_{\text{IoU}}) = \max_{k \in \mathbb{N}_{\leq K}^+ \mid \text{R}_{\bar{c}}(\tau_{s_k}, \bar{\tau}_{\text{IoU}}) \geq x} \text{P}_{\bar{c}}(\tau_{s_k}; \bar{\tau}_{\text{IoU}})\]

The precision value interpolated at recall \(x\) corresponds to the maximum precision \(\text{P}_{\bar{c}}(\tau_{s_k}; \bar{\tau}_{\text{IoU}})\) whose corresponding recall value is greater than or equal to \(x\).
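
This interpolation step can be sketched directly from the definition. Here `pr_pairs` is assumed to be the list of \((\text{P}_{\bar{c}}(\tau_{s_k}), \text{R}_{\bar{c}}(\tau_{s_k}))\) pairs computed at a fixed IoU threshold, and the value 0 is returned when no sampled recall reaches \(x\) (a common convention, assumed here):

```python
def interpolated_precision(x: float, pr_pairs) -> float:
    """P_tilde(x): maximum precision among the pairs whose recall is >= x.

    `pr_pairs` is a list of (precision, recall) tuples, one per confidence
    threshold tau_{s_k}, all computed at the same fixed IoU threshold.
    """
    candidates = [p for p, r in pr_pairs if r >= x]
    # Convention assumed here: if no sampled recall reaches x, interpolate to 0.
    return max(candidates) if candidates else 0.0
```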

N-Point Interpolation

In the \(N\)-point interpolation, the reference recall values \(\{ \text{R}_{r_n} \}_{n \in \mathbb{N}_{\leq N}^+}\) are equally spaced in the interval \([0, 1]\), that is:

\[\text{R}_{r_n} = \frac{N - n}{N -1}, \; \; n \in \mathbb{N}^+_{\leq N}\]

and:

\[ \text{AP}_{\bar{c}}@[\bar{\tau}_{\text{IoU}}] = \frac{1}{N} \sum_{n=1}^N \tilde{\text{P}}_{\bar{c}}(\frac{N - n}{N -1}; \bar{\tau}_{\text{IoU}}) \]

Popular choices include \(N=101\), as in the MS-COCO [2] detection competition, and \(N=11\), initially adopted by the PASCAL-VOC challenge [3], which later transitioned to the all-point interpolation method.
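
Using `interpolated_precision` from the sketch above, the \(N\)-point rule can be written as follows (a sketch, with \(N=101\) as the default to mirror the MS-COCO choice):

```python
def average_precision_n_point(pr_pairs, n_points: int = 101) -> float:
    """AP via N-point interpolation over equally spaced reference recalls."""
    # Reference recalls R_{r_n} = (N - n) / (N - 1) for n = 1..N, i.e. 1.0 down to 0.0.
    recalls = [(n_points - n) / (n_points - 1) for n in range(1, n_points + 1)]
    return sum(interpolated_precision(r, pr_pairs) for r in recalls) / n_points
```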

OD-Metrics

Since OD-Metrics adopts the MS-COCO [2] standard, it uses the \(N\)-point interpolation with \(N=101\).

All-Point Interpolation

In the so-called all-point interpolation, the set of values \(\{ \text{R}_{r_n} \}_{n \in \mathbb{N}_{\leq N}^+}\) corresponds exactly to the set of recall values computed at all \(K\) confidence levels \(\{ \tau_{s_k} \}_{k \in \mathbb{N}_{\leq K}^+}\):

\[ \text{AP}_{\bar{c}}@[\bar{\tau}_{\text{IoU}}] = \sum_{k=0}^{K} (\text{R}_\bar{c}(\tau_{s_k}; \bar{\tau}_{\text{IoU}}) - \text{R}_\bar{c}(\tau_{s_{k+1}}; \bar{\tau}_{\text{IoU}})) \tilde{\text{P}}_{\bar{c}}(\text{R}_\bar{c}(\tau_{s_k}; \bar{\tau}_{\text{IoU}}); \bar{\tau}_{\text{IoU}}) \]

with \(\tau_{s_0} = 0\), \(\text{R}_\bar{c}(\tau_{s_0}; \bar{\tau}_{\text{IoU}}) = 1\), \(\tau_{s_{K+1}} = 1\), \(\text{R}_\bar{c}(\tau_{s_{K+1}}; \bar{\tau}_{\text{IoU}}) = 0\).
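
A matching sketch for the all-point rule, again reusing `interpolated_precision`; the recall values are those observed at the \(K\) confidence levels, extended with the two boundary points defined above:

```python
def average_precision_all_point(pr_pairs) -> float:
    """AP via all-point interpolation over the observed recall values."""
    # Observed recalls plus the boundary points R(tau_{s_0}) = 1 and
    # R(tau_{s_{K+1}}) = 0, sorted in decreasing order.
    recalls = sorted({r for _, r in pr_pairs} | {0.0, 1.0}, reverse=True)
    ap = 0.0
    for r_k, r_next in zip(recalls[:-1], recalls[1:]):
        ap += (r_k - r_next) * interpolated_precision(r_k, pr_pairs)
    return ap
```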

Mean Average Precision

Regardless of the interpolation method, the Average Precision \(\text{AP}_{\bar{c}}@[\bar{\tau}_{\text{IoU}}]\) is obtained individually for each class \(\bar{c}\). In large datasets it is useful to have a single value that summarizes the detection accuracy across all \(C\) classes. For this purpose, the Mean Average Precision \(\text{mAP}@[\bar{\tau}_{\text{IoU}}]\) is computed, which is simply:

\[\text{mAP}@[\bar{\tau}_{\text{IoU}}] = \frac{1}{C}\sum_{c=1}^C \text{AP}_{c}@[\bar{\tau}_{\text{IoU}}] \]

In certain competitions, the final metric \(\text{mAP}@[T]\) is computed as the average over a predefined set \(T\) of IoU thresholds \(\tau_{\text{IoU}}\). For instance, in MS-COCO [2], \(T\) is defined as \(\{0.5, 0.55, \ldots, 0.95\}\) (increments of 0.05) and the resulting metric is commonly denoted as \(\text{mAP}@[0.5:0.95]\).
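
Putting the previous sketches together, \(\text{mAP}@[T]\) is a double average over classes and IoU thresholds. The helper below is an illustrative, brute-force sketch that recomputes the Precision-Recall pairs from scratch for every class and IoU threshold; it is not how OD-Metrics is implemented, and `class_ids` / `iou_thresholds` are parameters of this sketch only.

```python
def mean_average_precision(y_true, y_pred, class_ids, iou_thresholds, n_points=101):
    """mAP@[T]: AP averaged over all classes and all IoU thresholds in T."""
    ap_values = []
    for tau_iou in iou_thresholds:
        for c in class_ids:
            # Confidence levels tau_{s_k}: the distinct scores output for class c.
            taus = sorted({p.score for p in y_pred if p.label == c})
            pr_pairs = [precision_recall(y_true, y_pred, c, tau_s, tau_iou)
                        for tau_s in taus]
            ap_values.append(average_precision_n_point(pr_pairs, n_points))
    return sum(ap_values) / len(ap_values) if ap_values else 0.0


# mAP@[0.5:0.95] on the toy sets: IoU thresholds 0.5, 0.55, ..., 0.95.
T = [round(0.5 + 0.05 * i, 2) for i in range(10)]
map_50_95 = mean_average_precision(Y, Y_hat, class_ids=[1, 2], iou_thresholds=T)
```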

Average Recall

Following the definition used in MS-COCO [2], given a set \(T\) of IoU thresholds \(\tau_{\text{IoU}}\) and a specific class \(\bar{c}\), the Average Recall \(\text{AR}_{\bar{c}}@[T]\) is defined as:

\[ \text{AR}_{\bar{c}}@[T] = \frac{1}{|T|} \sum_{\tau_{\text{IoU}} \in T} \max_{k \in \mathbb{N}_{\leq K}^+} \text{R}_\bar{c}(\tau_{s_k}; \tau_{\text{IoU}})\]

The Mean Average Recall \(\text{mAR}@[T]\) is then calculated as the average of \(\text{AR}_{\bar{c}}@[T]\) across all \(C\) classes:

\[ \text{mAR}@[T] = \frac{1}{C}\sum_{c=1}^C \text{AR}_{c}@[T] \]

In MS-COCO [2], \(T = \{0.5, 0.55, \ldots, 0.95\}\) and the corresponding Mean Average Recall is denoted as \(\text{mAR}@[0.5:0.95]\).
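
A final sketch for \(\text{AR}_{\bar{c}}@[T]\) and \(\text{mAR}@[T]\), reusing `precision_recall` from above. The maximum recall over the confidence levels is in practice the recall at the lowest confidence threshold, but the maximum is taken explicitly here to mirror the formula.

```python
def average_recall(y_true, y_pred, class_id, iou_thresholds):
    """AR_c@[T]: best achievable recall, averaged over the IoU thresholds in T."""
    ar = 0.0
    for tau_iou in iou_thresholds:
        taus = sorted({p.score for p in y_pred if p.label == class_id})
        recalls = [precision_recall(y_true, y_pred, class_id, t, tau_iou)[1]
                   for t in taus]
        ar += max(recalls) if recalls else 0.0
    return ar / len(iou_thresholds)


def mean_average_recall(y_true, y_pred, class_ids, iou_thresholds):
    """mAR@[T]: AR_c@[T] averaged over all C classes."""
    return sum(average_recall(y_true, y_pred, c, iou_thresholds)
               for c in class_ids) / len(class_ids)


# mAR@[0.5:0.95] on the toy sets, with T as defined for mAP above.
mar_50_95 = mean_average_recall(Y, Y_hat, class_ids=[1, 2], iou_thresholds=T)
```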

References


  1. Rafael Padilla, Wesley L. Passos, Thadeu L. B. Dias, Sergio L. Netto, and Eduardo A. B. da Silva. A comparative analysis of object detection metrics with a companion open-source toolkit. Electronics, 10(3):279, 2021. URL: https://www.mdpi.com/2079-9292/10/3/279, doi:10.3390/electronics10030279.

  2. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V, 740–755. Springer, 2014.

  3. Mark Everingham, S. M. Ali Eslami, Luc Van Gool, Christopher K. I. Williams, John Winn, and Andrew Zisserman. The PASCAL Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision, 111:98–136, 2015.