Watching the live-on-tape performances last weekend, most of us probably asked questions like “how similar is a live-on-tape performance to the actual live performance from Rotterdam?” or “which countries would have been less affected if their live-on-tape performance had been shown instead of the live one?”. To sort this out, we asked an artificial intelligence system with computer vision abilities (let’s call it Adam) to watch pairs of performances and score their similarity – and here are the results:
Australia, as we know, used its live-on-tape performance as the live performance, so its similarity rate is 100. At the top we find countries that used their national final setting to film the live-on-tape performances, such as Estonia, Finland, and France, which made them more or less similar to the Eurovision ones. Roughly speaking, all of the top 12 countries had fairly similar performances. At the other end we find countries like Russia, Israel, and Slovenia, whose performances were significantly different.
How does it work?
For each country, Adam watched the live performance and used the SIFT algorithm to mark points that can be considered important and valuable – for example, borders between very dark areas and very light areas. Each point is represented as a signature of 128 numbers. Adam then used hierarchical clustering to divide all the points into groups based on how close their signatures were to each other. Each group was considered a “visual word”. Having a vocabulary of 256 visual words, Adam described the live performance using this vocabulary.
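Adam’s actual code isn’t published, but the pipeline described above – descriptors, hierarchical clustering into visual words, and a description of the performance over that vocabulary – can be sketched roughly like this. The random 128-number signatures stand in for real SIFT descriptors (which would come from a library such as OpenCV), and the toy vocabulary uses 8 words instead of 256 to keep the example small:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Stand-in for SIFT descriptors: in the real pipeline each row would be the
# 128-number signature of one keypoint detected in a sampled frame.
rng = np.random.default_rng(0)
descriptors = rng.random((200, 128))

# Hierarchical clustering groups points whose signatures are close to each
# other; each resulting cluster is one "visual word". The article uses a
# 256-word vocabulary; 8 keeps this toy example readable.
n_words = 8
tree = linkage(descriptors, method="ward")
labels = fcluster(tree, t=n_words, criterion="maxclust")  # word id per point

# Describe the performance over the vocabulary: how often each visual word
# appears among the performance's keypoints.
histogram = np.bincount(labels - 1, minlength=n_words)
print(histogram.sum())  # 200 – one entry per descriptor
```

The `ward` linkage and the 200-point sample are illustrative choices, not details taken from the article.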
Adam watched the live-on-tape performance and tried to describe it using the same vocabulary it learned while watching the actual live performance. The similarity rate shows how well Adam managed to do that.
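The article doesn’t say exactly how the similarity rate is computed from the two descriptions. One common choice in bag-of-visual-words pipelines is to compare the two word histograms with cosine similarity, which is the illustrative (assumed, not confirmed) stand-in used here:

```python
import numpy as np

def cosine_similarity(h1, h2):
    """Cosine similarity between two visual-word histograms.
    For non-negative counts the result lies in [0, 1];
    1 means identical word distributions."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    return float(h1 @ h2 / (np.linalg.norm(h1) * np.linalg.norm(h2)))

# Toy histograms over an 8-word vocabulary (made-up numbers):
live = [40, 10, 25, 5, 30, 12, 8, 20]      # live performance
on_tape = [38, 12, 22, 7, 28, 15, 9, 19]   # live-on-tape performance

score = 100 * cosine_similarity(live, on_tape)  # close to 100: very similar
```

A performance compared against itself would score exactly 100, which matches Australia’s result in the table.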
Ukraine’s performance was much more similar to its live one than Malta’s – how come it got a lower rate? Cyprus was pretty similar too, so why is it 21st? And what is Estonia doing in 2nd place?
Adam doesn’t describe the performances with words we humans can easily understand (such as backdrop, choreography, dancers or instruments), so it is difficult to pinpoint the exact reasons. For the sake of simplicity and to save computational time, Adam did not watch the whole performances but a collection of approximately 100 frames from each one, sampled every 45 frames (one frame every 1.8 seconds). Furthermore, Adam is color blind and cannot see the difference between blue and red lighting as long as their intensities are the same. Showing Adam more frames per performance, adjusting the number of points extracted from every frame, taking color into account, or using a larger vocabulary can likely, but not necessarily, contribute to higher accuracy.
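The sampling numbers above are consistent with a standard 25-frames-per-second broadcast and a roughly three-minute performance, as this small arithmetic check shows (the frame rate and duration are assumptions, not stated in the article):

```python
# Sampling schedule as described: keep one frame out of every 45.
FPS = 25          # assumed broadcast frame rate (PAL)
STEP = 45         # keep every 45th frame
DURATION_S = 180  # a Eurovision performance is about three minutes

total_frames = DURATION_S * FPS            # 4500 frames in total
sampled = list(range(0, total_frames, STEP))

print(len(sampled))    # 100 sampled frames per performance
print(STEP / FPS)      # 1.8 seconds between sampled frames
```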
How come Sweden is almost last? The performance was almost the same!
Although some performances that look similar to the human eye scored low in Adam’s test, Sweden is a significant outlier. It is well known that AI systems often have a hard time identifying and processing dark skin tones correctly, and this is one of the challenges AI developers face. In our case, the overall darkness of the performance – the staging, the clothing and the skin tone of the performers – could have made Adam focus on other aspects that were easier to see, such as the pattern of the backdrop, which was somewhat different between the two performances.
So what’s the point of using AI for this?
Watching all pairs of performances and ranking them by similarity would be a near-impossible task for a human, and that’s where AI can help us. However, it is important for us as humans (and Eurovision fans, in this case) to evaluate the results and identify the places where it might have performed poorly.
What do you think of the results? Which countries managed to have live-to-tape performances that were close to the live performances in your opinion?