Format: 3-page report + 5-minute presentation
Percentage of Grade: 7% (+3% for extra credit), based on:
  - solving the problems below;
  - quality of presenting solutions;
  - if presenting, a clear, engaging presentation.
Dataset: hosted on munmun's site
In this assignment you will build a classifier that can distinguish between legitimate (ham) and spam SMS messages. Refer to the table above to download the dataset. Your submission will include a three-page report that discusses how you built the classifier and then presents its performance. The classifier will be based on text features extracted from the ham and spam messages.
Note: you do not have to write a classifier from scratch. You are free to use one or more of the many open-source (or other) tools and packages that allow you to use a variety of different classifiers. You can pick any package or programming language you like or are most comfortable with.
Every team will hand in a report summarizing their findings. I expect something around 3 pages, composed as described below. When writing up your findings, look to the research papers we have read in class as models. Please include a short Method section and a longer Results section. The Method section describes what data you used and how you got it. The Results section describes your analysis and contains your graphics.
The dataset is the SMS Spam Collection v.1, released and maintained by Tiago A. Almeida and Jose Maria Gomez Hidalgo. It is a public set of labeled SMS messages collected for spam research. It contains 5,574 real, non-encoded English messages, tagged by humans as legitimate (ham) or spam. The dataset consists of just one text file (along with a ReadMe document). In the main file, each line has the correct class (ground truth) followed by the raw message. An example is given below:

ham	What you doing?how are you?
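As a sketch, the per-line layout described above can be parsed as follows. This assumes the released file separates the label from the raw message with a tab; verify that against the dataset's ReadMe before relying on it.

```python
def load_sms(lines):
    """Split each "<label>\t<message>" line into parallel label/text lists.

    Assumes the tab-separated layout of the released file; check the
    dataset's ReadMe if your copy differs.
    """
    labels, texts = [], []
    for line in lines:
        # partition() splits only on the first tab, so tabs inside the
        # message body (if any) are preserved.
        label, _, text = line.rstrip("\n").partition("\t")
        labels.append(label)
        texts.append(text)
    return labels, texts

# Typical use (filename as shipped in the public archive):
# labels, texts = load_sms(open("SMSSpamCollection", encoding="utf-8"))
```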
(1) Feature construction (3 points). As we covered in our lectures, constructing a classifier involves extracting relevant and meaningful features from the data under consideration. In your report, you will need to first present the various features you derived from the textual content of the ham and spam messages. Features can include (but are not limited to) unigrams, bigrams, TF-IDF, part-of-speech tags, word lengths, etc. You will also need to discuss why you chose that particular set of features. Feel free to refer to the papers from class.
(2) Description of the classifier (2 points). Discuss what is the particular classifier you chose (e.g., k Nearest Neighbor, Support Vector Machine, Naive Bayes or some other), and a justification or rationale behind its choice and applicability to the dataset in question. That is, if you picked classifier X, why is it a good fit for this problem? Why is it a better choice compared to another classifier Y?
(3) Evaluation technique (2 points). Present how you evaluated how well your chosen classifier performed in distinguishing between ham and spam messages. In particular, discuss the applicability of k-fold cross-validation here, which we covered in our Statistics/Data Mining review lectures. You will also need to present what metrics you used to evaluate the classifier's performance. Typical metrics include percentage accuracy, precision, and recall.
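To make the metric definitions concrete, here is a hand-computed sketch for the binary ham/spam case, treating spam as the positive class (which class counts as positive is a choice you should state in your report):

```python
def evaluate(y_true, y_pred, positive="spam"):
    """Accuracy, precision, and recall from scratch (illustrative only)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    # precision: of the messages flagged as spam, how many really were spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: of the actual spam messages, how many were caught
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```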
(4) Implementation (7 points).
a. Discuss how you preprocessed your data. If you used stopword removal, stemming, or tokenization over the content of the messages, you need to report it here. Point to the particular libraries or functions you used for this.
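A minimal preprocessing sketch: lowercasing, regex tokenization, and stopword removal with a tiny illustrative stopword list (a real pipeline would use a library list such as NLTK's, and possibly a stemmer; both the list and the token pattern here are assumptions):

```python
import re

# Tiny illustrative stopword list; swap in a full library list for real use.
STOPWORDS = {"a", "an", "the", "is", "are", "you", "to"}

def preprocess(message):
    """Lowercase, tokenize on letters/digits/apostrophes, drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", message.lower())
    return [t for t in tokens if t not in STOPWORDS]
```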
b. Discuss how you extracted the features you presented above from the SMS dataset. This needs to include what libraries and which particular functions you used for extracting each feature. If you did not use an existing library, describe the method you used to compute the features from the data. Report whether you did any filtering or feature selection to disregard uncommon features (e.g., if you ignored all unigrams that occurred fewer than five times). Also report whether you did any kind of normalization or standardization of each feature, and your justification for doing or not doing so.
c. Next, discuss how you implemented or used a library for your chosen classifier. Report the inputs and outputs of the particular library function you used, and whether/how you tuned the classifier's parameters (e.g., if you chose an SVM, report the particular kernel you used).
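A sketch of this item for one possible choice, an SVM via scikit-learn (the kernel and C values below are illustrative tuning choices, and the toy data is made up): the input is a document-term matrix plus a label vector, and the output is a fitted model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

texts = ["win a free prize now", "call me when you get home",
         "free entry win cash", "are we still meeting later"]
labels = ["spam", "ham", "spam", "ham"]

X = TfidfVectorizer().fit_transform(texts)  # input: documents -> feature matrix
clf = SVC(kernel="linear", C=1.0)           # kernel is the tunable to report
clf.fit(X, labels)                          # output: a fitted classifier
predictions = clf.predict(X)                # one predicted label per message
```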
d. Discuss how you partitioned the dataset for k-fold cross-validation, along with your chosen k. Also discuss, given your cross-validation setup, what the training and test sets were in each of the k iterations.
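A sketch of the partitioning with scikit-learn's KFold (k=5 here is just an example value; your report should justify your own choice of k):

```python
import numpy as np
from sklearn.model_selection import KFold

n_messages = 20  # small stand-in for the 5,574 real messages
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for train_idx, test_idx in kf.split(np.arange(n_messages)):
    # each iteration trains on 4/5 of the data and tests on the held-out 1/5
    fold_sizes.append((len(train_idx), len(test_idx)))
```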
e. Discuss how you calculated the performance-evaluation metrics, e.g., accuracy, precision, recall, etc. It is again okay to use an existing library that gives precision and recall values, in which case you need to state in your report which libraries/functions you used for the purpose, and what your inputs and outputs to those functions were.
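With scikit-learn (one option), the metric functions take the true and predicted label vectors as input and return a single float; pos_label names the positive class, an assumption to state in the report:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["spam", "ham", "spam", "ham"]   # ground truth (toy example)
y_pred = ["spam", "spam", "ham", "ham"]   # classifier output

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, pos_label="spam")
rec = recall_score(y_true, y_pred, pos_label="spam")
```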
(5) Analysis of results (6 points). Report the performance of your classifier based on the above discussion. You will need to use charts, graphs, or tables to report actual numbers, i.e., the values of the two or more performance metrics you chose above (accuracy, precision, recall, etc.). These numbers should be reported for each of the k iterations of the k-fold cross-validation setup. You should also report the average performance over all k folds for each evaluation metric.
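One way to obtain both the per-fold numbers and their averages in a single call is scikit-learn's cross_validate (the Naive Bayes classifier, the toy data, and the 0/1 label encoding below are all illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus: 5 spam-like and 5 ham-like messages.
texts = ["win free cash now", "free prize claim now", "urgent win a prize",
         "claim your free entry", "cash prize text now",
         "see you at lunch", "call me later tonight", "are you home yet",
         "meeting moved to noon", "thanks for the ride"]
y = np.array([1] * 5 + [0] * 5)   # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)
scores = cross_validate(MultinomialNB(), X, y, cv=5,
                        scoring=["accuracy", "precision", "recall"])
# scores["test_accuracy"], scores["test_precision"], scores["test_recall"]
# each hold one value per fold; report the per-fold numbers and their mean.
mean_acc = scores["test_accuracy"].mean()
```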
(6) Extra credit (5 points).
a. There will be extra credit for extracting novel text-based features from the ham/spam messages (don't be afraid to be creative here; clue: many words in the messages are
b. There will be additional extra credit for comparing (per the above evaluation technique) two or more classifiers in their ability to distinguish spam from ham when applied to the same data. For instance, you could compare the performance of your chosen classifier, say an SVM, against another classifier such as Naive Bayes, and in the analysis section discuss which performs better based on your chosen evaluation metrics (accuracy, precision, recall, etc.).
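A sketch of such a comparison, running two candidate classifiers over identical cross-validation folds so the scores are directly comparable (the classifier pair and toy data are illustrative choices):

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up toy corpus: 5 spam-like and 5 ham-like messages.
texts = ["win free cash now", "free prize claim now", "urgent win a prize",
         "claim your free entry", "cash prize text now",
         "see you at lunch", "call me later tonight", "are you home yet",
         "meeting moved to noon", "thanks for the ride"]
y = np.array([1] * 5 + [0] * 5)   # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)

# Fixing the fold object ensures both classifiers see the same partitions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, clf in [("NaiveBayes", MultinomialNB()), ("SVM", LinearSVC())]:
    results[name] = cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
```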