Download
**EDIT 2021: PLEASE NOTE THAT THE COMPETITION IS NOW CLOSED AND THE DATASETS ARE NO LONGER AVAILABLE.**
Separate datasets are available for each subtask. All datasets are released as UTF-8 encoded, comma-separated “.csv” files, and their download is password-protected. Please write to us at [dankmemesevalita (AT) gmail.com] to get access to them.
General Guidelines
Guidelines_DANKMEMES2020
Development Datasets
Task 1 - Meme Identification - Training set
Task 2 - Hate Speech Detection - Training set
Task 3 - Event Clustering - Training set
Test Datasets
Task 1 - Meme Identification - Test set
Task 2 - Hate Speech Detection - Test Set
Task 3 - Event Clustering - Test Set
Annotated Test Datasets
Annotated Test Sets
Datasets
The whole dataset is composed of 2,361 images collected from Instagram with a Python script targeting the hashtag related to the Italian government crisis (“#crisidigoverno”). Posts were extracted together with their metadata and then anonymized. For each meme, we will provide as input the image, the text transcription, the image vector representation and the following features:
- Date: when the image was first posted on the platform (Instagram)
- Macro status: refers to meme layouts and their relation to widespread, conventionalised formats called macros. Binary: macro (1), not macro (0)
- Picture manipulation: binary feature. Non-manipulated images or images with low-impact changes (e.g. the addition of a text or a logo) are labeled 0. Heavily manipulated images with impactful changes (e.g. images edited to include political actors) are labeled 1.
- Visual actors: the political actors (i.e. politicians, parties’ logos) portrayed visually, either edited into the picture or present in the original image.
- Engagement: the sum of comments and likes on the artifact in its original setting (Instagram)
- Meme: binary feature, where 0 represents non-meme images and 1 represents meme images. This is the target label for the first subtask.
- Hate Speech: binary feature, available only for memes. It differentiates memes with offensive language (1) from non-offensive memes (0). This is the target label for the second subtask.
- Event: feature available only for meme images, categorizing them according to 4 events (described under the third subtask), plus a residual category labeled as 0. This is the target label for the third subtask.
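The feature list above can be read as a record with a few value constraints. The following is a minimal sketch of such a record in Python; the class name `MemeRecord` and all field names are hypothetical and are not taken from the released files. It only encodes what the list states: binary flags take the values 0 or 1, and the event label ranges over the 4 events plus the residual category 0.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class MemeRecord:
    """One image with the features listed above (all names are hypothetical)."""

    filename: str
    posted: date                        # Date: first posted on Instagram
    macro: int                          # Macro status: 1 = macro, 0 = not macro
    manipulated: int                    # Picture manipulation: 1 = heavy, 0 = none or low impact
    visual_actors: List[str]            # political actors portrayed in the image
    engagement: int                     # sum of comments and likes
    text: str                           # corrected OCR transcription
    meme: Optional[int] = None          # Task 1 target (absent in unlabeled test data)
    hate_speech: Optional[int] = None   # Task 2 target, memes only
    event: Optional[int] = None         # Task 3 target, memes only: 0 (residual) to 4

    def __post_init__(self) -> None:
        # Required binary features must be exactly 0 or 1.
        for name in ("macro", "manipulated"):
            if getattr(self, name) not in (0, 1):
                raise ValueError(f"{name} must be 0 or 1")
        # Optional binary targets may be missing, otherwise 0 or 1.
        for name in ("meme", "hate_speech"):
            value = getattr(self, name)
            if value is not None and value not in (0, 1):
                raise ValueError(f"{name} must be 0, 1 or None")
        # Four events plus the residual category 0.
        if self.event is not None and self.event not in range(5):
            raise ValueError("event must be between 0 and 4, or None")


example = MemeRecord(
    filename="1.jpg",
    posted=date(2019, 8, 27),
    macro=0,
    manipulated=1,
    visual_actors=["Mattarella"],
    engagement=234,
    text="io che aspetto arrivino le 15 per assistere alla fine del governo",
    meme=1,
)
```

Keeping the three target labels optional lets the same record type describe both labeled training items and unlabeled test items.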
Text Extraction. Texts have been extracted through optical character recognition (OCR) using Google’s Tesseract-OCR Engine, and further manually corrected by linguists.
Image Vector Representations. The dataset includes image embeddings. The vector representations are computed with ResNet (He et al., 2016), a state-of-the-art model for image recognition based on Deep Residual Learning. Providing such image representations allows the participants to approach these multimodal tasks focusing primarily on their NLP aspects (Kiela and Bottou, 2014).
External Resources. Participants are allowed to use external resources, lexicons or independently annotated data. Accordingly, although we provide ResNet image embeddings, participants may use any other image representation.
Task 1: Meme Detection
The dataset contains 2,000 images, half of them memes and half not. We split the dataset into training and test sets with an 80/20 proportion of items. The following table shows the format of the training dataset. The test dataset will be provided without gold labels (i.e. without the “Meme” attribute) for testing purposes.
Filename | Engagement | Date | Manipulated | Visual actors | Text | Meme |
---|---|---|---|---|---|---|
1.jpg | 234 | 27/08/2019 | 1 | Mattarella | io che aspetto arrivino le 15 per assistere alla fine del governo | 1 |
2.jpg | 432 | 15/08/2019 | 0 | Salvini | cosa ha fatto a mosca la notte del 17 ottobre? | 0 |
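As an illustration of the format above, the following sketch reads rows shaped like the Task 1 table with Python's standard `csv` module. It rests on several assumptions that should be checked against the actual files: that the CSV header uses exactly the column names shown in the table, that dates are written as day/month/year, and that multiple visual actors are separated by commas inside one cell, with an empty cell or `0` meaning that no actor is portrayed. The function names `parse_actors` and `read_task1` are hypothetical.

```python
import csv
import io
from datetime import datetime


def parse_actors(cell):
    """Split a 'Visual actors' cell into names (assumed comma-separated)."""
    cell = cell.strip()
    if cell in ("", "0"):  # assumption: empty or 0 means no actor portrayed
        return []
    return [name.strip() for name in cell.split(",")]


def read_task1(handle):
    """Parse Task 1 rows from an open text file or file-like object."""
    records = []
    for row in csv.DictReader(handle):
        label = row.get("Meme")  # missing in the unlabeled test set
        records.append({
            "filename": row["Filename"],
            "engagement": int(row["Engagement"]),
            "date": datetime.strptime(row["Date"], "%d/%m/%Y").date(),
            "manipulated": int(row["Manipulated"]),
            "visual_actors": parse_actors(row["Visual actors"]),
            "text": row["Text"],
            "meme": int(label) if label not in (None, "") else None,
        })
    return records


# The two sample rows from the table above, written as CSV text.
SAMPLE = (
    "Filename,Engagement,Date,Manipulated,Visual actors,Text,Meme\n"
    "1.jpg,234,27/08/2019,1,Mattarella,"
    "io che aspetto arrivino le 15 per assistere alla fine del governo,1\n"
    "2.jpg,432,15/08/2019,0,Salvini,"
    "cosa ha fatto a mosca la notte del 17 ottobre?,0\n"
)

records = read_task1(io.StringIO(SAMPLE))
```

With a real file, the same function could be called on `open(path, encoding="utf-8", newline="")`, which matches the UTF-8 encoding stated for the release.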
Task 2: Hate Speech Identification
The dataset contains 1,000 memes. We split the dataset into training and test sets with an 80/20 proportion of items. The following table shows the format of the training dataset. The test dataset will be provided without gold labels (i.e. without the “Hate Speech” attribute) for testing purposes.
Filename | Engagement | Manipulated | Visual actors | Text | Hate Speech |
---|---|---|---|---|---|
1.jpg | 114 | 0 | Salvini, Conte, Di Maio | aspetta manca ancora la parte in cui parlo del lavoro di tua sorella | 1 |
2.jpg | 412 | 1 | Salvini | merdman | 1 |
Task 3: Event Clustering
The dataset contains 1,000 memes. We split the dataset into training and test sets with an 80/20 proportion of items. A set of 800 labeled memes will be provided for development. In order to test and evaluate the model, the remaining 200 memes will be provided without labels.
Filename | Date | Engagement | Macro | Manipulation | Visual actors | Text | Embedding ID | Event |
---|---|---|---|---|---|---|---|---|
1.jpg | 11/08/2019 | 324 | 0 | 1 | Salvini, Berlusconi | la crisi di governo è stata innescata mio signore hai lavorato bene lord salvini. tutto procede come avevo previsto | 1 | 1 |
2.jpg | 10/08/2019 | 123 | 0 | 0 | Salvini, Di Maio, Conte, Mattarella | il governo entra in crisi 5 secondi di penalità per vettel | 1 | 1 |
3.jpg | 20/08/2019 | 111 | 0 | 0 | Salvini, Conte | conte usa boato | 2 | 2 |
4.jpg | 20/08/2019 | 45 | 0 | 0 | Salvini, Conte | quando ti interroghi su chi abbia fatto affogare tutti i tuoi pacchetti udp io | 2 | 2 |
5.jpg | 29/08/2019 | 23 | 1 | 0 | 0 | affogare i bambini con la lega fare l’elettroshock ai bambini con il pd | 3 | 3 |
6.jpg | 27/08/2019 | 987 | 1 | 0 | 0 | di maio salvini ricordi come arrivare a palazzo chigi credo di sì, ma perché dovrei saperlo? | 3 | 3 |
7.jpg | 03/09/2019 | 65 | 0 | 0 | 0 | quando provi a votare su rousseau ma il sito non carica | 4 | 4 |
8.jpg | 02/09/2019 | 112 | 0 | 0 | Salvini | quando perdi il posto al viminale ma sei ancora in tempo per andare a fare la chiusura della stagione ad Ibiza | 4 | 4 |
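Since the 200 test memes come without labels, participants may want to hold out part of the 800 development memes for validation. The sketch below shows one possible way to do so, a seeded 80/20 split that preserves the proportion of each event label. It is not a description of how the official training and test sets were produced; the function name `stratified_split` and the toy data are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(items, label_of, train_fraction=0.8, seed=0):
    """Split items into two lists, keeping label proportions roughly equal."""
    by_label = defaultdict(list)
    for item in items:
        by_label[label_of(item)].append(item)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, held_out = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        held_out.extend(group[cut:])
    return train, held_out


# Toy stand-in for the development set: (filename, event) pairs,
# ten items for each event label from 0 (residual) to 4.
toy = [(f"{i}.jpg", i % 5) for i in range(50)]
train, dev = stratified_split(toy, label_of=lambda pair: pair[1])
```

Splitting per label rather than over the whole list keeps small event categories represented in both parts, which matters when the categories are unevenly sized.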