Download

**EDIT 2021: PLEASE NOTE THAT THE COMPETITION IS NOW CLOSED AND THE DATASETS ARE NO LONGER AVAILABLE.**

Diverse datasets are available for the different subtasks. All datasets are released as UTF-8 encoded comma separated “.csv” files, whose download is password-protected. Please write us [dankmemesevalita (AT) gmail.com] to get access to them.

General Guidelines

Guidelines_DANKMEMES2020

1 file(s) 161.70 KB

Download

Development Datasets

Task 1 - Meme Identification - Training set

1 file(s) 158.90 MB

Download

Task 2 - Hate Speech Detection - Training set

1 file(s) 83.03 MB

Download

Task 3 - Event Clustering - Training set

1 file(s) 79.09 MB

Download

Test Datasets

Task 1 - Meme Identification - Test set

1 file(s) 41.10 MB

Download

Task 2 - Hate Speech Detection - Test Set

1 file(s) 17.80 MB

Download

Task 3 - Event Clustering - Test Set

1 file(s) 18.97 MB

Download

Annotated Test Datasets

Annotated Test Sets

1 file(s) 40.72 KB

Download

Datasets

The whole dataset is composed of 2,361 images extracted from Instagram through a Python script aimed at the hashtag related to the Italian government crisis (“#crisidigoverno”). Posts have been extracted along with their related metadata and anonymized. For each meme, we will provide as input the image, the text transcription, the image vector representation and the following features:

Date: when the image has first been posted on the platform (Instagram)
Macro status: refers to meme layouts and their relation to diffused, conventionalised formats called macros. Binary: macro (1), not macro (0)
Picture manipulation: Non-manipulated or low impact changes are labeled 0 (e.g. the addition of a text or a logo). Heavily manipulated, impactful changes (e.g. images edited to include political actors) are labeled 1.
Visual actors: the political actors (i.e. politicians, parties’ logos) portrayed visually, as edited into the picture or portrayed in the original image.
Engagement: the sum of comments and likes on the artifact in its original setting (Instagram)
Meme: binary feature, where 0 represents non meme images and 1 meme images. This is the target label for the first subtask.
Hate Speech: binary feature only for memes. It differentiates memes with offensive language (1) from non offensive memes (0). This is the target label for the second subtask.
Event: feature only for meme images, categorizing them according to 4 events (described in \ref{sub 3}), plus a residual category labeled as 0. This is the target label for the third subtask.

Text Extraction. Texts have been extracted through optical character recognition (OCR) using Google’s Tesseract-OCR Engine, and further manually corrected by linguists.

Image Vector Representations. The dataset includes image embeddings. The vector representations are computed employing ResNet (He et al., 2016), a state-of-the-art model for image recognition based on Deep Residual Learning. Providing such image representations allows the participants to approach these multimodal tasks focusing primarily on its NLP aspects (Kiela and Bottou, 2014).

External Resources. Participants are allowed to use external resources, lexicons or independently annotated data. Given that, although we provide ResNet image embeddings, participants can use any other image representations.

Task 1: Meme Detection

The dataset counts 2000 images, half memes and half not. We split the dataset into training and test sets, in a proportion of 80-20% of items. The following table represents the format of the training dataset. The test dataset will be provided without gold labels (i.e. without the “Meme” attribute) for testing purposes.

Filename	Engagement	Date	Manipulated	Visual actors	Text	Meme
1.jpg	234	27/08/2019	1	Mattarella	io che aspetto arrivino le 15 per assistere alla fine del governo	1
2.jpg	432	15/08/2019	0	Salvini	cosa ha fatto a mosca la notte del 17 ottobre?	0

Task 2: Hate Speech Identification

The dataset counts 1,000 memes. We split the dataset into training and test sets, in a proportion of 80-20% of items. The following table represents the format of the training dataset. The test dataset will be provided without gold labels (i.e. without the “Hate Speech” attribute) for testing purposes.

Filename	Engagement	Manipulated	Visual actors	Text	Hate Speech
1.jpg	114	0	Salvini, Conte, Di Maio	aspetta manca ancora la parte in cui parlo del lavoro di tua sorella	1
2.jpg	412	1	Salvini	merdman	1

Task 3: Event Clustering

The dataset counts 1000 memes. We split the dataset into training and test sets, in a proportion of 80-20% of items. A set of 800 labeled memes will be provided for development. In order to test and evaluate the model, the remaining 200 memes will be provided without labels.

Filename	Date	Engagement	Macro	Manipulation	Visual actors	Text	Embedding ID	Event
1.jpg	11/08/2019	324	0	1	Salvini, Berlusconi	la crisi di governo è stata innescata mio signore hai lavorato bene lord salvini. tutto procede come avevo previsto	1	1
2.jpg	10/08/2019	123	0	0	Salvini, Di Maio, Conte, Mattarella	il governo entra in crisi 5 secondi di penalità per vettel	1	1
3.jpg	20/08/2019	111	0	0	Salvini, Conte	conte usa boato	2	2
4.jpg	20/08/2019	45	0	0	Salvini, Conte	quando ti interroghi su chi abbia fatto affogare tutti i tuoi pacchetti udp io	2	2
5.jpg	29/08/2019	23	1	0	0	affogare i bambini con la lega fare l’elettroshock ai bambini con il pd	3	3
6.jpg	27/08/2019	987	1	0	0	di maio salvini ricordi come arrivare a palazzo chigi credo di sì, ma perché dovrei saperlo?	3	3
7.jpg	03/09/2019	65	0	0	0	quando provi a votare su rousseau ma il sito non carica	4	4
8.jpg	02/09/2019	112	0	0	Salvini	quando perdi il posto al viminale ma sei ancora in tempo per andare a fare la chiusura della stagione ad Ibiza	4	4

Datasets and download

Download

Datasets

Task 1: Meme Detection

Task 2: Hate Speech Identification

Task 3: Event Clustering