Download

**EDIT 2021: PLEASE NOTE THAT THE COMPETITION IS NOW CLOSED AND THE DATASETS ARE NO LONGER AVAILABLE.**

Diverse datasets are available for the different subtasks. All datasets are released as UTF-8 encoded comma separated “.csv” files, whose download is password-protected. Please write us [dankmemesevalita (AT) gmail.com] to get access to them.

 

General Guidelines

 

Development Datasets

 

Test Datasets

 

Annotated Test Datasets

 


Datasets

The whole dataset is composed of 2,361 images extracted from Instagram through a Python script aimed at the hashtag related to the Italian government crisis (“#crisidigoverno”). Posts have been extracted along with their related metadata and anonymized. For each meme, we will provide as input the image, the text transcription, the image vector representation and the following features:

  • Date: when the image has first been posted on the platform (Instagram)
  • Macro status: refers to meme layouts and their relation to diffused, conventionalised formats called macros. Binary: macro (1), not macro (0)
  • Picture manipulation: Non-manipulated or low impact changes are labeled 0 (e.g. the addition of a text or a logo). Heavily manipulated, impactful changes (e.g. images edited to include political actors) are labeled 1.
  • Visual actors: the political actors (i.e. politicians, parties’ logos) portrayed visually, as edited into the picture or portrayed in the original image.
  • Engagement: the sum of comments and likes on the artifact in its original setting (Instagram)
  • Meme: binary feature, where 0 represents non meme images and 1 meme images. This is the target label for the first subtask.
  • Hate Speech: binary feature only for memes. It differentiates memes with offensive language (1) from non offensive memes (0). This is the target label for the second subtask.
  • Event: feature only for meme images, categorizing them according to 4 events (described in \ref{sub 3}), plus a residual category labeled as 0. This is the target label for the third subtask.

Text Extraction. Texts have been extracted through optical character recognition (OCR) using Google’s Tesseract-OCR Engine, and further manually corrected by linguists.

Image Vector Representations. The dataset includes image embeddings. The vector representations are computed employing ResNet (He et al., 2016), a state-of-the-art model for image recognition based on Deep Residual Learning. Providing such image representations allows the participants to approach these multimodal tasks focusing primarily on its NLP aspects (Kiela and Bottou, 2014).

External Resources. Participants are allowed to use external resources, lexicons or independently annotated data. Given that, although we provide ResNet image embeddings, participants can use any other image representations.


 

Task 1: Meme Detection

The dataset counts 2000 images, half memes and half not. We split the dataset into training and test sets, in a proportion of 80-20% of items. The following table represents the format of the training dataset. The test dataset will be provided without gold labels (i.e. without the “Meme” attribute) for testing purposes.

Filename Engagement Date Manipulated Visual actors Text Meme
1.jpg 234 27/08/2019 1 Mattarella io che aspetto arrivino le 15 per assistere alla fine del governo 1
2.jpg 432 15/08/2019 0 Salvini cosa ha fatto a mosca la notte del 17 ottobre? 0

 


 

Task 2: Hate Speech Identification

The dataset counts 1,000 memes. We split the dataset into training and test sets, in a proportion of 80-20% of items. The following table represents the format of the training dataset. The test dataset will be provided without gold labels (i.e. without the “Hate Speech” attribute) for testing purposes.

Filename Engagement Manipulated Visual actors Text Hate Speech
1.jpg 114 0 Salvini, Conte, Di Maio aspetta manca ancora la parte in cui parlo del lavoro di tua sorella 1
2.jpg 412 1 Salvini merdman 1

 

Task 3: Event Clustering

The dataset counts 1000 memes. We split the dataset into training and test sets, in a proportion of 80-20% of items. A set of 800 labeled memes will be provided for development. In order to test and evaluate the model, the remaining 200 memes will be provided without labels.

Filename Date Engagement Macro Manipulation Visual actors Text Embedding ID Event
1.jpg 11/08/2019 324 0 1 Salvini, Berlusconi la crisi di governo è stata innescata mio signore hai lavorato bene lord salvini. tutto procede come avevo previsto 1 1
2.jpg 10/08/2019 123 0 0 Salvini, Di Maio, Conte, Mattarella il governo entra in crisi 5 secondi di penalità per vettel 1 1
3.jpg 20/08/2019 111 0 0 Salvini, Conte conte usa boato 2 2
4.jpg 20/08/2019 45 0 0 Salvini, Conte quando ti interroghi su chi abbia fatto affogare tutti i tuoi pacchetti udp io 2 2
5.jpg 29/08/2019 23 1 0 0 affogare i bambini con la lega fare l’elettroshock ai bambini con il pd 3 3
6.jpg 27/08/2019 987 1 0 0 di maio salvini ricordi come arrivare a palazzo chigi credo di sì, ma perché dovrei saperlo? 3 3
7.jpg 03/09/2019 65 0 0 0 quando provi a votare su rousseau ma il sito non carica 4 4
8.jpg 02/09/2019 112 0 0 Salvini quando perdi il posto al viminale ma sei ancora in tempo per andare a fare la chiusura della stagione ad Ibiza 4 4