In a previous post, we described mathematical metrics to choose the best semantic model and to optimize its parameters. However, these metrics need a ground truth: they require baseline rankings (of words, sentences, or texts) that can be regarded as correct and can be compared with the output of the model.
Some data that may be used to evaluate general-purpose English-language models can be found online. However, Inda’s semantic search relies on semantic models specific to the recruiting domain: these models are specialized in the vocabulary, idioms, and semantic meanings typical of the recruiting context. When dealing with such domain-specific models – especially if we want to use them in languages other than English – we need to collect new data, which means collecting many human annotations.
We therefore developed a protocol to collect these annotations in a way that optimizes the evaluation accuracy of the dataset for a given annotation effort. In more detail, the protocol aims to create a dataset where pairs of tokens (though the method can be generalized to pairs of sentences, pairs of texts, etc.) are ranked by their semantic relatedness, with particular focus on the top ranks.
i) The first step in our protocol is identifying the semantic areas relevant to our downstream task (i.e., the final task of the models we want to evaluate). It should be noted that, by focusing on a single area at a time, we reduce semantic ambiguity issues.
ii) The second step is the choice of the tokens within each semantic area. In particular, in order to build an evaluation dataset that can detect the hubness problem (see also this post), we need to include rare tokens.
iii) Once we have defined the tokens, we can pair them in all combinations within each semantic area. We will refer to a pair of tokens as an item.
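As a minimal sketch, the pairing step can be carried out with Python's standard library (the tokens below are purely illustrative, not taken from our actual semantic areas):

```python
from itertools import combinations

# Hypothetical tokens for a single semantic area (illustrative only).
area_tokens = ["developer", "programmer", "engineer", "analyst"]

# Pair the tokens in all combinations within the area: each pair is an "item".
items = list(combinations(area_tokens, 2))

# With n tokens per area we obtain n * (n - 1) / 2 items.
```

With 4 tokens this yields 6 items; the number of items grows quadratically with the number of tokens per area, which is one reason to keep each semantic area reasonably small.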
The graph above displays the cosine similarity distributions for pairs of tokens randomly selected from the whole vocabulary (brown crosses) and within a single semantic area (blue circles). It shows that, by splitting the tokens by area, we shift the similarity distribution towards higher values: this means that we are more likely to include top-rank items, as desired.
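For reference, the cosine similarity between two embedding vectors can be computed as below; the toy vectors stand in for actual token embeddings, which depend on the model being evaluated:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors standing in for token embeddings (illustrative only).
u = np.array([1.0, 0.0, 1.0])
v = np.array([1.0, 1.0, 0.0])
sim = cosine_similarity(u, v)  # 0.5
```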
iv) The last and most demanding step consists of ranking the previously described groups of items according to their semantic relatedness. This requires collecting many human annotations, and a convenient approach is crowdsourcing, i.e., relying on the opinions of a large group of non-expert voters. Since the voters are non-experts, a good practice is to ask simple questions; therefore, instead of asking them to rank all items, we ask the voters to compare two items at a time. Once a substantial number of comparisons has been collected, we combine the data to obtain the full ranking.
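A simple way to combine pairwise comparisons is to rank items by their winning rate, i.e., the fraction of comparisons each item won. The vote log below is invented for illustration:

```python
from collections import Counter

# Hypothetical vote log: each entry is (winner_item, loser_item).
votes = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]

wins = Counter()
appearances = Counter()
for winner, loser in votes:
    wins[winner] += 1
    appearances[winner] += 1
    appearances[loser] += 1

# Winning rate: fraction of comparisons an item won.
winning_rate = {item: wins[item] / appearances[item] for item in appearances}

# Full ranking, from most to least semantically related item.
ranking = sorted(winning_rate, key=winning_rate.get, reverse=True)
# → ["A", "C", "B"]
```

In this toy log, A wins all 3 of its comparisons (rate 1.0), C wins 1 of 3 (0.33), and B wins 1 of 4 (0.25), so the ranking is A, C, B.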
A critical question at this stage is: how many times should we present a given item to the voters? This is a fundamental issue because the number of times an item is compared with the other items determines the accuracy of its position in the final ranking. The most straightforward approach is to present each item the same number of times. However, as we know that top ranks are particularly important, we prefer to use an adaptive approach to increase their precision.
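To make the contrast concrete, here is a back-of-the-envelope comparison under assumed numbers (64 items, 4 ballots, the pool halved after each ballot, the budget split evenly across ballots – none of these figures come from our actual protocol):

```python
# Compare the annotation budget received by a top item: uniform vs. adaptive.
n_items = 64
budget = n_items * 30            # total pairwise comparisons available (assumed)

# Uniform scheme: every item appears in the same number of comparisons.
uniform_per_item = budget // n_items  # 30

# Adaptive scheme (assumed): 4 ballots, pool halved each time,
# with a quarter of the total budget spent per ballot.
per_ballot_budget = budget // 4
pool = n_items
adaptive_top_item = 0            # comparisons received by an item surviving every ballot
for _ in range(4):
    # Each comparison involves 2 items, so per-item appearances double the ratio.
    adaptive_top_item += per_ballot_budget * 2 // pool
    pool //= 2
```

Under these assumptions a surviving top item accumulates 15 + 30 + 60 + 120 = 225 comparisons instead of 30, so its position in the final ranking is estimated far more precisely, at the cost of coarser estimates for low-rank items.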
The key idea is to subdivide the voting into different ballots. The first ballot involves all items, and we present each item to the voters the same number of times. At the end of the ballot, we calculate the winning rate of each item, sort the items according to their winning scores, and select a fraction of high-score items, which we use in the next ballot. The subsequent ballots are analogous, except that the number of competing items gradually decreases. We must be aware that, as we narrow the pool of competing items around the top ranks, it becomes harder and harder to win a comparison; however, a score rescaling can compensate for this phenomenon.
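The ballot loop can be sketched as follows. The simulated voter, the noise level, the keep fraction, and the latent relatedness scores are all assumptions for illustration (real ballots use human votes), and the score rescaling mentioned above is omitted for brevity:

```python
import random

random.seed(0)  # for reproducibility of this toy simulation

def run_ballot(items, comparisons_per_item, true_score):
    """Simulate one ballot: each comparison is won by the item whose
    (hypothetical) latent relatedness score is higher, up to some noise."""
    wins = {it: 0 for it in items}
    counts = {it: 0 for it in items}
    for _ in range(comparisons_per_item * len(items) // 2):
        a, b = random.sample(items, 2)
        noisy_a = true_score[a] + random.gauss(0, 0.1)
        noisy_b = true_score[b] + random.gauss(0, 0.1)
        winner = a if noisy_a > noisy_b else b
        wins[winner] += 1
        counts[a] += 1
        counts[b] += 1
    # Winning rate of each item in this ballot.
    return {it: wins[it] / counts[it] if counts[it] else 0.0 for it in items}

# Hypothetical items with latent relatedness scores (illustrative only).
true_score = {f"item{i}": i / 20 for i in range(20)}
items = list(true_score)

keep_fraction = 0.5
while len(items) > 2:
    rates = run_ballot(items, comparisons_per_item=10, true_score=true_score)
    # Keep only the top fraction of items for the next ballot.
    items = sorted(items, key=rates.get, reverse=True)[: max(2, int(len(items) * keep_fraction))]

# `items` now holds the surviving top-rank candidates, ordered by winning rate.
```

Each iteration narrows the pool, so the later ballots concentrate the voters' effort on the items near the top of the ranking.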
After the last ballot, we obtain the final ranking, which, as desired, is particularly accurate for top ranks.
We tested this adaptive approach via computer simulations based on a stochastic model of the pairwise comparisons, which will be described in a subsequent post. To validate the simulation results against data collected from human voters, we developed Lavaember. This tool proposes pairs of items (each in turn composed of a pair of terms, subdivided by semantic area) and asks the user to choose the item whose terms are most semantically related. Note that Lavaember currently focuses on the Italian language; other languages will be added at a later stage.
For all technical details, please refer to our scientific paper Top-Rank-Focused Adaptive Vote Collection for the Evaluation of Domain-Specific Semantic Models, which you can download via the form below.