Call for Participation
You are invited to participate in the IWSPA-AP Shared Task at
IWSPA 2018. The shared task is on detecting and analyzing the nature of emails.
The International Workshop on Security and Privacy Analytics (IWSPA) - Anti Phishing Shared Task will feature an exercise in the field of applied machine learning and text analysis in cyber security.
The participants will be asked to build a classifier that can distinguish phishing emails from spam and legitimate ones in an "unbalanced" dataset.
To make the task resemble a real-world situation, the training and testing datasets will have realistic ratios of malicious and legitimate emails (not 50:50).
A sample training dataset will be provided, and the results will be evaluated on a testing dataset that will be posted a week before the results are due.
We ask participants to send us their trained model and the results they achieved on the testing dataset.
The participants are encouraged to use any data in their possession, in addition to the data provided, to train their model.
The participants are also free to use any kind of feature engineering and any type of classifier. However, keep in mind that the dataset is "unbalanced".
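One common way to account for an unbalanced dataset is to weight each class inversely to its frequency during training. The following is a minimal sketch of that idea, not a method prescribed by the organizers. The helper name `balanced_class_weights` and the 90:10 label split are hypothetical, and the label convention (1 for legitimate, 0 for phishing) follows the submission instructions in this call.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so errors on them cost more
    when the weights are passed to a classifier that supports them.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical training labels: 1 = legitimate, 0 = phishing.
labels = [1] * 90 + [0] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Many classifier libraries accept per-class weights of this kind; resampling the training data is another option with different trade-offs.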
The proceedings will be published online in the CEUR publication service. This year we will also invite the authors of selected system papers from the Shared Task
to submit extended versions to a special issue of a journal (details coming soon).
Tasks
The overall task description consists of the following:
Use the training dataset provided and/or any dataset available online.
Analyze email content (Header, Body, URLs). The emails will be in .txt format.
Preferably come up with new and interesting features and/or use existing ones in the literature.
Build and train a machine learning model or use an already existing one.
Finally, report the results using the evaluation metrics specified below.
Possible subtasks: we may post two types of training datasets:
Emails with headers: For this type of dataset, the participants are free to use all the content available in an email to extract information.
Emails with no headers: This task will only focus on the body of the emails. Participants may use any type of information extraction related to the body.
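As an illustration of the two dataset types, the sketch below splits one email into headers, body, and URLs using Python's standard `email` module. It assumes the .txt files with headers follow the usual "headers, blank line, body" layout, which this call does not confirm. The function name `parse_email`, the URL pattern, and the sample message are hypothetical.

```python
import re
from email import message_from_string

# Simple, assumed URL pattern; real emails may need a more careful one.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def parse_email(raw_text, has_headers=True):
    """Return the headers, body text, and body URLs of one email."""
    if not has_headers:
        # "No headers" subtask: the whole file is treated as the body.
        headers, body = {}, raw_text
    else:
        message = message_from_string(raw_text)
        # Note: dict() keeps only one value for repeated header names.
        headers = dict(message.items())
        if message.is_multipart():
            parts = [
                part.get_payload()
                for part in message.walk()
                if part.get_content_type() == "text/plain"
            ]
            body = "\n".join(parts)
        else:
            body = message.get_payload()
    return {"headers": headers, "body": body, "urls": URL_PATTERN.findall(body)}


# Hypothetical sample message.
sample = (
    "From: support@example.com\n"
    "Subject: Verify your account\n"
    "\n"
    "Please confirm your details at http://example.com/login today.\n"
)
parsed = parse_email(sample)
print(parsed["headers"]["Subject"], parsed["urls"])
```

Features such as sender domain, URL count, or body word statistics can then be derived from the returned dictionary.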
Evaluation Metrics: The evaluation metrics expected are: Confusion Matrix (FP, FN, TP, TN), Accuracy, F-Score, Precision, Recall, Weighted average of recall and precision.
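The listed metrics can be computed from the true and predicted labels as in the sketch below. Treating phishing (label 0) as the positive class is an assumption, since the call does not say which class counts as positive. The F-score shown is F1, the harmonic mean of precision and recall; the "weighted average of recall and precision" is not defined precisely here, so it is left out. The function name `evaluate` and the sample labels are hypothetical.

```python
def evaluate(true_labels, predicted_labels, positive=0):
    """Compute confusion-matrix counts and derived metrics.

    `positive` is the label treated as the positive class
    (assumed here to be 0, i.e. phishing).
    """
    tp = fp = fn = tn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "TP": tp, "FP": fp, "FN": fn, "TN": tn,
        "accuracy": accuracy, "precision": precision,
        "recall": recall, "f1": f1,
    }


# Hypothetical labels: 1 = legitimate, 0 = phishing.
truth = [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
predicted = [0, 1, 1, 1, 1, 1, 1, 1, 0, 1]
metrics = evaluate(truth, predicted)
print(metrics)
```

On an unbalanced dataset, accuracy alone can look high even when most phishing emails are missed, which is why precision and recall are reported alongside it.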
Registration
The registration link to EasyChair is HERE.
The deadline is January 28th, 2018 (originally January 23rd, 2018).
While registering, interested teams should enter the name of the team as the Title and a short description of their approach as the Abstract on EasyChair.
Organizations wishing to participate in the AP Shared Task track
at IWSPA 2018 are invited to register on EasyChair.
Participants are advised to register as soon as
possible in order to receive timely access to evaluation resources,
including development and testing data. Registration for the task
does not commit you to participation, but it helps us with
planning. All participants who submit system runs are welcome to
present their system at IWSPA 2018.
Important!
All the participants can present their submitted systems as a poster at IWSPA 2018 located in Tempe, AZ, on March 21st, 2018.
Teams that wish to participate in the poster session should inform us before March 3rd, 2018.
All interested participants must register for IWSPA through the CODASPY registration website.
Corpus
We have provided a few examples of the Legitimate and Phishing Emails:
Legitimate Email Samples
Phishing Email Samples
We will post the details for the training corpus on February 1, 2018 (11:59 U.S.-CST). Stay tuned!
Submission Instructions
For your system submissions:
We expect the predicted outputs on the test data as well as your best performing model.
For the predicted output:
The output needs to be a "[Group-ID]_submission_[task]_[number].txt" file that lists the name of each email file and your predicted label: 1 for legitimate email and 0 for phishing email.
Your Group-ID should be either the name of your group or the initials of the last names of all the group members.
There is no limit on the number of submissions for a particular task. Your submissions should be numbered sequentially.
For example:
If your team name is "TeamBlue" and you are submitting the predictions for the "No Headers" task, your submission file should be "TeamBlue_submission_noheaders_1.txt" and the file should have the following contents:
1.txt 1
2.txt 0
...
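A prediction file in this format could be produced as sketched below. The naming pattern (Group-ID, "submission", task, sequence number) and the single space between file name and label are inferred from the example above. The helper names `submission_filename` and `write_submission` are hypothetical.

```python
import tempfile
from pathlib import Path


def submission_filename(group_id, task, number):
    """Build a file name following the pattern in the example above."""
    return f"{group_id}_submission_{task}_{number}.txt"


def write_submission(predictions, group_id, task, number, out_dir="."):
    """Write one 'file-name label' line per email.

    `predictions` maps an email file name to 1 (legitimate) or 0 (phishing).
    """
    for name, label in predictions.items():
        if label not in (0, 1):
            raise ValueError(f"label for {name} must be 0 or 1, got {label!r}")
    path = Path(out_dir) / submission_filename(group_id, task, number)
    lines = [f"{name} {label}" for name, label in predictions.items()]
    path.write_text("\n".join(lines) + "\n")
    return path


# Write a two-line example into a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = write_submission(
        {"1.txt": 1, "2.txt": 0}, "TeamBlue", "noheaders", 1, out_dir=tmp
    )
    written_name = path.name
    written_text = path.read_text()

print(written_name)
print(written_text, end="")
```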
For the model submission:
We expect you to submit your top performing model on the training data with clear instructions on how to run it. This is required so that the results on the test data using that model can be reproduced. The output submission for the TOP model should be labeled with your Group-ID as well as "TOP" in the name.
For example,
For Team Blue, if "SVM" is the best performer for the "No Headers" subtask:
The model submission will read - "TeamBlue_TOP_SVM_noheaders" and the output submission for this model will be "TeamBlue_submission_TOP_noheaders_1.txt".
When we run the "TeamBlue_TOP_SVM_noheaders" model file on the test data, we should get exactly the same results that Team Blue reported in "TeamBlue_submission_TOP_noheaders_1.txt".
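Before submitting, a team may want to confirm that rerunning the model reproduces the reported predictions exactly. The sketch below compares two prediction files in the "file-name label" format. It is not the organizers' evaluation procedure, and the function names and sample contents are hypothetical.

```python
def read_predictions(text):
    """Parse 'file-name label' lines into a dict of name -> int label."""
    predictions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last whitespace so the label is always the final field.
        name, label = line.rsplit(maxsplit=1)
        predictions[name] = int(label)
    return predictions


def differing_files(reported_text, reproduced_text):
    """Return the sorted email names whose labels differ or are missing."""
    reported = read_predictions(reported_text)
    reproduced = read_predictions(reproduced_text)
    return sorted(
        name
        for name in reported.keys() | reproduced.keys()
        if reported.get(name) != reproduced.get(name)
    )


# Hypothetical contents of the reported and the reproduced output files.
reported_text = "1.txt 1\n2.txt 0\n3.txt 1\n"
reproduced_text = "1.txt 1\n2.txt 1\n3.txt 1\n"
mismatches = differing_files(reported_text, reproduced_text)
print(mismatches)
```

An empty list means the two runs agree on every email. Fixing random seeds in the training and prediction code is one way to make that outcome more likely.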
Important Dates
Please consult the IWSPA
2018 Workshop page for the official workshop dates.
The important deadlines for the Shared Task:
Event | Date |
Registration Deadline | January 28, 2018 |
Training Data Release | February 1, 2018 (11:59 U.S.-CST) |
Test Data Release | March 1, 2018 (11:59 P.M. CST), originally February 28, 2018 |
Model + Results Submission | March 6, 2018 (11:59 P.M. CST) (Hard Deadline), originally March 3, 2018 |
Start of Evaluation | March 5, 2018 |
End of Evaluation | March 20, 2018 |
IWSPA 2018
http://capex.cs.uh.edu/?q=content/4th-international-workshop-security-and-privacy-analytics-2018
This is the fourth workshop in the series of workshops on
Security and Privacy Analytics. Increasingly, sophisticated
techniques from machine learning, data mining, statistics and
natural language processing are being applied to challenges
in security and privacy fields. However, experts from these
areas have had no medium in the past where they can meet
and exchange ideas so that strong collaborations can
emerge, and cross-fertilization of these areas can occur.
Moreover, current courses and curricula in security do not
sufficiently emphasize background in these areas and
students in security and privacy are not emerging with deep
knowledge of these topics. Hence, we propose to continue
the workshop that we started in the year 2015 to address the
research and development efforts in which analytical
techniques from machine learning, data mining, natural
language processing and statistics are applied to solve
security and privacy challenges (“security and privacy
analytics”). Submissions of papers related to methodology,
design, techniques and new directions for security and
privacy that make significant use of machine learning, data
mining, statistics or natural language processing are
welcome. Furthermore, submissions on educational topics
and systems in the field of security analytics are also highly
encouraged.
Organising Committee
Dr. Rakesh Verma, Professor, University of Houston
Shahryar Baki, PhD candidate, University of Houston
Avisha Das, PhD candidate, University of Houston
Ayman Elassal, PhD candidate, University of Houston
Luis Felipe Teixeira De Moraes, PhD candidate, University of Houston
For any more information or issues, contact Ayman Elassal (elaassal.ayman@gmail.com) or Dr. Rakesh Verma (rmverma6@gmail.com)