An Ordinal Classification Metric: Closeness Evaluation Measure

Introducing an evaluation metric for the ordinal classification task.

Streamlit: Demo App | Code: Github | CEM Paper: ACL

Introduction

Ordinal classification is a type of classification task whose predicted classes (or categories) have a specific ordering. This implies that misclassifying a class pair should be penalized differently and proportionally, with respect to the classes' semantic order and their distribution in the dataset. Existing evaluation metrics have been shown to fail to capture certain aspects of this task, which calls for a new metric, namely the Closeness Evaluation Measure ($CEM$).

For example,

  • precision/recall ignore the ordering among classes.
  • mean absolute error/mean squared error assume a predefined interval between classes, subject to their numeric encoding.
  • ranking measures capture the relative ranking in terms of relevancy, but place no emphasis on identifying the correct class.

$CEM$ builds on the idea of informational closeness: the less likely it is to find a data point between a class pair, the closer the two classes are. As a result, a model that misclassifies informationally close classes should be penalized less than one that confuses informationally distant classes. This measure has been mathematically proven to attain the desired properties that address the foregoing shortcomings, which will be discussed shortly.

This post is organized as follows (for a more interactive experience, check out the Streamlit demo):

Informational Closeness

Let's say we have two journals, F and S (mock-up data in Fig. 1), with two very distinct natures of reviewing research paper submissions. Reviewers at journal F are very cautious, and it is rare to see them rate a paper as reject or accept. Journal S, on the other hand, tends to take a clear stance on a paper's acceptance/rejection status.

Given that, weak reject vs. weak accept are informationally close in the context of S because they are close assessments, while F treats them as the two far ends of the grading scale, since they constitute a strong disagreement between reviewers in the context of F. Put differently, two classes $a$ and $b$ are informationally close if the probability of finding data points between the two is low.

More formally, given a set of ordinal classes $C=\{c_1, ..., c_m\}$ and the number of data points assigned to each class $\{n_1, ..., n_m\}$, with $N=\sum_{i=1}^{m} n_i$, the informational closeness of any class pair, $IC(c_i, c_j)$, is the negative logarithm $-\log(\cdot)$ of the probability of a data point being assigned to one of the classes in between, $(\frac{n_i}{2} + \sum_{k=i+1}^{j} n_k)/N$.

Putting it all together, we have

$$IC(c_i, c_j) = -\log\left(\frac{\frac{n_i}{2} + \sum_{k=i+1}^{j} n_k}{N}\right)$$

To be clear,

$$IC(c_1, c_3) = -\log\left(\frac{n_1/2 + n_2 + n_3}{N}\right)$$
$$IC(c_3, c_1) = -\log\left(\frac{n_3/2 + n_2 + n_1}{N}\right)$$
$$IC(c_1, c_1) = -\log\left(\frac{n_1/2}{N}\right)$$

Fig. 1. Two journals F and S with two very distinct natures of reviewing research paper submissions.

In the above charts, for each journal, we computed $IC$ for all class combinations and show the top closest and furthest pairs. As expected, both journals agree that reject vs. accept is the furthest pair. Journal S treats the classes on the border (weak reject, undecided, weak accept) as the closest pairs, given the small number of data points between them.

Note that $IC$ is not symmetric, i.e. $IC(c_i, c_j) \neq IC(c_j, c_i)$ in general. Also, for ease of readability, we scale the computed values by a constant factor of $1/\log(2)$.
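To make this concrete, below is a minimal Python sketch of the computation, assuming classes are indexed by integers in their ordinal order. The helper name `informational_closeness` and the per-class counts are made up for illustration; the counts mimic a journal like S, where most reviews sit at the extremes.

```python
import math

def informational_closeness(counts, i, j, base=2):
    """IC(c_i, c_j): negative log of the probability mass between class i and class j.

    `counts[k]` is the number of groundtruth data points in class k, with classes
    indexed 0..m-1 in ordinal order. Half of the source class i is counted, plus
    every class after i up to and including the target class j (in either direction).
    """
    n = sum(counts)
    mass = counts[i] / 2.0
    step = 1 if j >= i else -1
    for k in range(i + step, j + step, step):
        mass += counts[k]
    return -math.log(mass / n, base)

# Hypothetical counts for a journal like S (clear stances, sparse middle).
# Classes: 0=reject, 1=weak reject, 2=undecided, 3=weak accept, 4=accept.
counts = [10, 2, 1, 2, 10]

print(informational_closeness(counts, 1, 3))  # weak reject -> weak accept: high IC (close)
print(informational_closeness(counts, 0, 4))  # reject -> accept: low IC (distant)
```

Using base-2 logarithms here plays the same role as the $1/\log(2)$ scaling mentioned above.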

Closeness Evaluation Measure

Next question: how do we turn the idea of informational closeness into a usable evaluation metric for the ordinal classification task?

It's simple. Assume you have an ordinal classification dataset in which each data point is labelled with an ordinal class. First, compute $IC$ for the classes in the dataset; this quantity translates into the reward for making an ordinal step from one class to another that is distributionally close. Second, assume you have a candidate machine learning model that has been trained on this dataset and is ready to be evaluated. For each data point, we favour a prediction whose class is close to the groundtruth; an exact class match earns the full reward, while a more distant prediction earns less. Aggregating over all predictions and normalizing, we obtain the overall performance measure, i.e. $CEM$. The highest achievable score for $CEM$ is $1$, and the higher, the better.

More formally, given a dataset $D$, a groundtruth label mapping $g: D \rightarrow C$, and an ML model $m: D \rightarrow C$, the closeness evaluation measure of such a model, $CEM(m, g)$, is the sum over all data points ($\sum_{d \in D}$) of the $IC$ between the predicted class and the groundtruth, $IC[m(d), g(d)]$, normalized by the $IC$ of the groundtruth with itself, $IC[g(d), g(d)]$.

Putting it together, we have

$$CEM(m, g) = \frac{\sum_{d \in D} IC[m(d), g(d)]}{\sum_{d \in D} IC[g(d), g(d)]}$$
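Here is a minimal sketch of this formula in Python, building on the `informational_closeness` helper from the previous sketch; labels are assumed to be integer class indices in ordinal order.

```python
def cem(predictions, groundtruth, num_classes, base=2):
    """Closeness Evaluation Measure.

    `predictions` and `groundtruth` are sequences of integer class indices
    (0..num_classes-1, in ordinal order). IC is derived from the groundtruth
    class distribution, as in the formula above.
    """
    counts = [0] * num_classes
    for label in groundtruth:
        counts[label] += 1
    numerator = sum(informational_closeness(counts, p, g, base)
                    for p, g in zip(predictions, groundtruth))
    denominator = sum(informational_closeness(counts, g, g, base)
                      for g in groundtruth)
    return numerator / denominator

# Sanity check: exact predictions achieve the maximum score of 1.
truth = [0, 0, 1, 1, 2, 2]
assert abs(cem(truth, truth, num_classes=3) - 1.0) < 1e-9
```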

Metric Properties

$CEM$ has been proven to satisfy three mathematical properties that address the shortcomings of the other metrics: it takes ordering information into account, does not assume a predefined interval between classes, and rewards the highest score for identifying the correct class.

The subsections below elaborate on these properties. Each property is accompanied by a pair of example datasets, whose groundtruth and predictions are depicted by the x-axis and the coloring, respectively. This illustrates the intuition behind each property; to read the charts properly, pay attention to the Ordinal Invariance section first.

Ordinal Invariance

First Dataset is a journal paper review dataset with 9 data points. The x-axis ticks depict their true classes: 3 rejects, 3 weak rejects, 3 undecideds, and 0 for the other two classes. The coloring of the circles (pink, red, yellow, dark green, light green) depicts the data points' predictions. In this dataset, the 9 predictions are exact matches to their true classes.

Second Dataset is similar, with a 100% match for another set of classes: 3 rejects, 3 undecideds, 3 weak accepts. Its only difference is that the groundtruth has been shifted to strictly higher-order classes.

Therefore, Ordinal Invariance states that the metric value should remain the same if both the model output and the groundtruth of a dataset are shifted to strictly higher-order classes.

Fig. 2. Ordinal Invariance.
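As a quick numeric check of this property (with made-up data, not the figure's), shifting both the groundtruth and the predictions to strictly higher-order classes leaves $CEM$ unchanged even when the predictions are not exact matches: the classes skipped over by the shift are empty and contribute no probability mass to $IC$. This reuses the `cem` helper sketched above.

```python
# Original dataset: classes 0, 1, 2; one class-1 item is predicted as class 0.
truth_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
preds_a = [0, 0, 0, 0, 1, 1, 2, 2, 2]

# Shifted dataset: class 1 -> 2 and class 2 -> 3 (strictly higher-order classes).
truth_b = [0, 0, 0, 2, 2, 2, 3, 3, 3]
preds_b = [0, 0, 0, 0, 2, 2, 3, 3, 3]

assert abs(cem(preds_a, truth_a, num_classes=3)
           - cem(preds_b, truth_b, num_classes=4)) < 1e-9
```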

Monotonicity

Monotonicity states that moving the model output closer to the groundtruth should result in an increase of the metric.

Indeed, First Dataset has one weak accept point misclassified as undecided. Second Dataset should perform worse because the very same data point is misclassified into a more distant class, i.e. weak reject.

Fig. 3. Monotonicity.
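Again as a quick numeric check (made-up counts, not the figure's data), reusing the `cem` helper sketched earlier: misclassifying a weak accept point as undecided scores higher than misclassifying the very same point as weak reject.

```python
# Classes: 0=reject, 1=weak reject, 2=undecided, 3=weak accept, 4=accept.
truth = [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4]

preds_close = list(truth)
preds_close[9] = 2   # one weak accept predicted as undecided

preds_far = list(truth)
preds_far[9] = 1     # the same point predicted as weak reject

assert cem(preds_close, truth, num_classes=5) > cem(preds_far, truth, num_classes=5)
```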

Imbalance

Imbalance states that a classification error in a small class should have less impact, or receive a higher reward, than the same mistake in a frequent class.

First Dataset, which misclassifies a data point from a frequent class (weak reject), should have a lower metric value than Second Dataset, which misclassifies a data point from a small class (weak accept).

There is an inconsistency between my interpretation and the original paper. I have contacted the authors for advice regarding this discrepancy. Raised Issue

Fig. 4. Imbalance.

An Example Use Case

Let's put $CEM$ to the test!

Given a dataset, we have the confusion matrices for two models, A and B, as follows. From the accuracy standpoint, both models are on par. $CEM$ says otherwise and highlights that model B should be the better performer.

It is indeed a fair assessment, given that:

  • Model A makes more mistakes between the distant classes positive and negative ($7+4 > 4+2$).
  • Model A makes more mistakes between positive and neutral, whose populations represent $90\%$ of the dataset, and is hence penalized more heavily, or more precisely, earns less reward (see the sketch below for computing $CEM$ directly from a confusion matrix).

Fig. 5. Confusion matrices of models A and B.
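For completeness, here is a minimal sketch of computing $CEM$ straight from a confusion matrix, reusing the `informational_closeness` helper from earlier. The 3-class matrix below is made up for illustration and is not the data shown in Fig. 5.

```python
def cem_from_confusion_matrix(matrix, base=2):
    """CEM from a confusion matrix where matrix[g][p] counts items whose true
    class is g and predicted class is p, rows and columns in ordinal order."""
    counts = [sum(row) for row in matrix]  # groundtruth class distribution
    numerator, denominator = 0.0, 0.0
    for g, row in enumerate(matrix):
        if counts[g]:
            denominator += counts[g] * informational_closeness(counts, g, g, base)
        for p, cell in enumerate(row):
            if cell:
                numerator += cell * informational_closeness(counts, p, g, base)
    return numerator / denominator

# Hypothetical 3-class matrix (negative, neutral, positive), rows = groundtruth.
matrix = [
    [40,  5,  2],
    [ 6, 30,  4],
    [ 1,  3, 50],
]
print(cem_from_confusion_matrix(matrix))
```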


