Kawintiranon, Kornraphop; Singh, Lisa; and Budak, Ceren
Overview
This data set is being released to support the spam and context-specific spam detection tasks on Twitter data.
[Paper] [GitHub] [PDF]
Description
There are three sets of tweets: parenting-related, #MeToo-related (#MeToo is a social movement focused on tackling sexual harassment and sexual assault of women), and gun-violence-related.
Each set contains 5,000 tweets. All are original English-language tweets; retweets, quoted tweets, and non-English tweets are excluded.
The distribution of class labels is shown in the following figure.
Data Labeling
The tweets were labeled using Amazon Mechanical Turk. For each tweet, three annotators were asked to answer the following three questions:
(Q1) Is the tweet about domain D_i?
(Q2) Is the statement an advertisement?
(Q3) Is the statement spam?
The label for each question is the majority vote of the three annotators. A tweet is traditional spam if Q3 = yes. A tweet is context-specific spam if Q1 = no and Q2 = yes. The inter-annotator agreement scores under different metrics are shown in the following table.
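The labeling rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' released code: the function names `majority` and `label_tweet`, the "yes"/"no" answer strings, and the output keys are all hypothetical.

```python
from collections import Counter


def majority(votes):
    """Return the most common answer among an odd number of yes/no votes."""
    normalized = [v.strip().lower() for v in votes]
    return Counter(normalized).most_common(1)[0][0]


def label_tweet(q1_votes, q2_votes, q3_votes):
    """Derive both spam labels from three annotators' answers per question.

    q1_votes: answers to "Is the tweet about the domain?"
    q2_votes: answers to "Is the statement an advertisement?"
    q3_votes: answers to "Is the statement spam?"
    """
    q1 = majority(q1_votes)
    q2 = majority(q2_votes)
    q3 = majority(q3_votes)
    return {
        # Traditional spam depends only on Q3.
        "traditional_spam": q3 == "yes",
        # Context-specific spam: off-topic (Q1 = no) and an advertisement (Q2 = yes).
        "context_specific_spam": q1 == "no" and q2 == "yes",
    }


if __name__ == "__main__":
    print(label_tweet(["no", "no", "yes"], ["yes", "yes", "no"], ["no", "no", "no"]))
```

Because each question has two possible answers and three annotators, the majority is always well defined. Note that the two labels are derived independently, so under this rule a tweet can carry both.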
Dataset | Question | Alpha | Task-based | Worker-based
Parenting | is_parenting | 0.6809 | 0.9072 | 0.8701
Parenting | is_ad | 0.6720 | 0.8582 | 0.8191
Parenting | is_spam | 0.3451 | 0.9717 | 0.8571
MeToo | is_metoo | 0.5324 | 0.8841 | 0.8830
MeToo | is_ad | 0.4607 | 0.9545 | 0.9384
MeToo | is_spam | 0.4155 | 0.9649 | 0.9498
Gun-violence | is_gun_violence | 0.7124 | 0.9848 | 0.9903
Gun-violence | is_ad | 0.5400 | 0.9457 | 0.9290
Gun-violence | is_spam | 0.7024 | 0.8702 | 0.7884
Data Collection
The #MeToo data set was constructed by collecting tweets that include #MeToo through the Twitter API from October 2018 to October 2019, the first year of the larger online movement.
The gun-violence data set was collected in 2017 through the Twitter API using keywords and hashtags related to gun violence.
Keywords used include guns, suicide, gun deaths, etc. The parenting data set was constructed from two sources: tweets by 75 authorities that primarily post about and discuss parenting topics, and tweets matching parenting-related keywords and hashtags.
Example authorities include parenting magazines and medical sites.
Preprocessing
We preprocessed the data by replacing every Twitter username with a special token @USERN (where N is a running count of the handles appearing in the tweet) and every URL with URLM Removed (where M is a running count of the URLs appearing in the tweet).
Finally, we removed non-English tweets, duplicate tweets, retweets, and quoted tweets. We release only the preprocessed tweets to maintain privacy.
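A minimal sketch of this kind of masking is shown below. It is not the authors' preprocessing script, and several details are assumptions: the regular expressions, the helper names `_numbered` and `anonymize`, the choice to give a repeated handle or URL the same number each time it appears, and the bare `URL1`, `URL2`, ... token, whose exact form in the released files may differ.

```python
import re

# Assumed patterns: http(s) links, and @handles not preceded by a word
# character (so that e-mail addresses are left alone).
URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"(?<!\w)@\w+")


def _numbered(prefix, fold_case=False):
    """Build a replacement callback that numbers matches in order of first appearance."""
    seen = {}

    def repl(match):
        key = match.group(0).lower() if fold_case else match.group(0)
        if key not in seen:
            seen[key] = len(seen) + 1
        return f"{prefix}{seen[key]}"

    return repl


def anonymize(text):
    """Mask URLs and usernames in one tweet with numbered placeholder tokens."""
    # URLs go first because a URL may itself contain an "@".
    text = URL_RE.sub(_numbered("URL"), text)
    # Twitter handles are case-insensitive, so fold case when numbering them.
    return HANDLE_RE.sub(_numbered("@USER", fold_case=True), text)


if __name__ == "__main__":
    print(anonymize("@alice thanks @bob see https://t.co/abc and http://x.y/z"))
```

The counters are created per call, so numbering restarts with each tweet. The URL pattern is deliberately simple and will also swallow punctuation that directly follows a link.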
Citation
Kornraphop Kawintiranon, Lisa Singh, and Ceren Budak (2022). Traditional and Context-Specific Spam Detection in Low Resource Settings. Machine Learning. [Paper] [PDF]
By downloading this data, you agree that you are using it for research purposes only and that you will not redistribute it.