Data Mini-projects

CS 8803: Social Computing
Professor Eric Gilbert (assignment developed by Professor Munmun De Choudhury)

Due: 11/16
Format: 3-page report + 5-minute presentation
Percentage of Grade: 7% (+3% for extra credit)
Team size: tbd
Grading Criteria:
Solving the problems below.
Quality of presenting solutions.
If presenting, a clear and engaging presentation.
Dataset: hosted on Professor De Choudhury's site

Data mini-project

In this assignment you will build a classifier that can distinguish between legitimate (ham) and spam SMS messages. Refer to the dataset information above to download the data. Your submission will include a three-page report that discusses how you built the classifier and then presents its performance. The classifier will be based on text features extracted from the ham and spam messages.

Note: you do not have to write a classifier from scratch. You are free to use one or more of the many open-source (or other) tools and packages that allow you to use a variety of different classifiers. You can pick any package or programming language you like or are most comfortable with.

What to hand in

Every team will hand in a report summarizing their findings. I expect something around 3 pages, composed as described below. When writing up your findings, look to the research papers we have read in class as models. Your report should have a short Method section and a longer Results section. The Method section describes what data you used and how you got it. The Results section describes your analysis and contains your graphics.

Background and information about the dataset

The dataset is the SMS Spam Collection v.1, released and maintained by Tiago A. Almeida and Jose Maria Gomez Hidalgo. It is a public set of labeled SMS messages collected for spam research. It contains 5,574 real, non-encoded English messages, tagged by humans as legitimate (ham) or spam. The dataset contains just one text file (along with a ReadMe document). In the main file, each line has the correct class (ground truth) followed by the raw message. Some examples are given below, followed by a loading sketch:

ham What you doing?how are you?
ham MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*
ham Cos i was out shopping wif darren jus now n i called him 2 ask wat present he wan lor. Then he started guessing who i was wif n he finally guessed darren lor.
spam FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time to use from your phone now! ubscribe6GBP/ mnth inc 3hrs 16 stop?txtStop
spam Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B
spam URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03! This is our 2nd attempt to contact YOU! Call 0871-872-9758 BOX95QU
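
For illustration, a minimal loading sketch in Python (any language is fine; the file name SMSSpamCollection and the tab-separated label/text layout are assumptions, so check the ReadMe for the exact format):

    labels, messages = [], []
    with open("SMSSpamCollection", encoding="utf-8") as f:   # file name is an assumption
        for line in f:
            if "\t" not in line:
                continue                         # skip any blank or malformed lines
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(label)                 # ground truth: "ham" or "spam"
            messages.append(text)                # raw SMS content
    print(len(messages), "messages loaded")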

Contents of the report

(1) Feature construction (3 points). As we covered in lecture, constructing a classifier involves extracting relevant and meaningful features from the data under consideration. In your report, you will need to first present the various features you derived from the textual content of the ham and spam messages. Features can include (but are not limited to) unigrams, bigrams, TF-IDF weights, part-of-speech tags, word or message length, etc. You will also need to discuss why you chose that particular set of features. Feel free to refer to the papers from class.
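
For illustration only, one possible feature set sketched with scikit-learn (an assumption, not a requirement): TF-IDF-weighted unigrams and bigrams over two toy messages taken from the examples above.

    from sklearn.feature_extraction.text import TfidfVectorizer

    toy_messages = [
        "What you doing?how are you?",
        "Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B",
    ]
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))   # unigrams and bigrams
    X = vectorizer.fit_transform(toy_messages)         # sparse document-term matrix
    print(X.shape)                                     # (number of messages, number of distinct n-grams)
    print(vectorizer.get_feature_names_out()[:5])      # a few of the extracted n-grams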

(2) Description of the classifier (2 points). Discuss which classifier you chose (e.g., k-Nearest Neighbors, Support Vector Machine, Naive Bayes, or some other), and give a justification or rationale for its choice and its applicability to the dataset in question. That is, if you picked classifier X, why is it a good fit for this problem? Why is it a better choice than another classifier Y?

(3) Evaluation technique (2 points). Present how you evaluated how well your classifier of choice performed in distinguishing between ham and spam messages. In particular, discuss the applicability of k-fold cross-validation here, which we covered in our Statistics/Data Mining review lectures. You will also need to present what metrics you used to evaluate the performance of the classifier. Typical metrics include percentage accuracy, precision, and recall.
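
As a rough end-to-end illustration (scikit-learn, a Naive Bayes model, 10 folds, and the tab-separated file name are all assumptions rather than requirements), such an evaluation might look like:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_validate
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Load the tab-separated file (name and format are assumptions; see the ReadMe).
    with open("SMSSpamCollection", encoding="utf-8") as f:
        labels, messages = zip(*(line.rstrip("\n").split("\t", 1) for line in f if "\t" in line))

    # Vectorizing inside the pipeline keeps each fold's test messages out of the IDF statistics.
    clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
    scores = cross_validate(clf, list(messages), list(labels), cv=10,
                            scoring=["accuracy", "precision_macro", "recall_macro"])
    for name in ("test_accuracy", "test_precision_macro", "test_recall_macro"):
        print(name, scores[name].mean())         # average over the 10 folds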

(4) Implementation (7 points). a. Discuss how you preprocessed your data. If you used stopword removal, stemming, or tokenization on the content of the messages, you need to report it here. Point to the particular libraries or functions you used for this.
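
For example, a possible preprocessing sketch using NLTK (one option among many; the stopword list and Porter stemmer are illustrative choices, not requirements):

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    nltk.download("stopwords")                   # fetch the English stopword list once

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def preprocess(text):
        tokens = re.findall(r"[a-z']+", text.lower())    # crude lowercase tokenization
        return [stemmer.stem(t) for t in tokens if t not in stop_words]

    print(preprocess("FreeMsg: Txt: CALL to No: 86888 & claim your reward of 3 hours talk time"))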

b. Discuss how you extracted the features you presented above from the SMS dataset. This needs to include which libraries and which particular functions you used for extracting each feature. If you did not use an existing library, describe the method you used to compute the features from the data. Report whether you did any filtering or feature selection to disregard not-so-common features (e.g., if you ignored all unigrams that occurred fewer than five times). Also report whether you did any kind of normalization or standardization of each feature, and your justification for doing or not doing so.
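
As a sketch of what filtering and normalization options can look like in scikit-learn (the specific threshold is illustrative):

    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(ngram_range=(1, 1),   # unigrams only in this sketch
                                 min_df=5,             # drop terms appearing in fewer than 5 messages
                                 norm="l2")            # scale each message's vector to unit length
    # X = vectorizer.fit_transform(messages)           # messages as loaded in the earlier sketch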

c. Next, discuss how you implemented or used a library for your chosen classifier. Report what the inputs and outputs to the particular library function you used were, and if/how you tuned the parameters of the classifier (e.g., if you chose an SVM, report the particular kernel you used).
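
For illustration, a sketch of wiring features into a classifier with scikit-learn; the linear-kernel SVM and the tiny training set (drawn from the examples above) are purely illustrative.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import SVC

    # Tiny illustrative training set taken from the examples above.
    train_msgs = [
        "What you doing?how are you?",
        "MY NO. IN LUTON 0125698789 RING ME IF UR AROUND! H*",
        "Sunshine Quiz! Win a super Sony DVD recorder if you canname the capital of Australia? Text MQUIZ to 82277. B",
        "URGENT! Your Mobile No 07808726822 was awarded a L2,000 Bonus Caller Prize on 02/09/03!",
    ]
    train_labels = ["ham", "ham", "spam", "spam"]

    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train_msgs)       # input to the classifier: feature matrix
    clf = SVC(kernel="linear", C=1.0)                     # kernel and C are the tunable parameters here
    clf.fit(X_train, train_labels)                        # inputs: features and ground-truth labels
    print(clf.predict(vectorizer.transform(["Win a prize! Text MQUIZ to 82277"])))  # output: predicted label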

d. Discuss how you partitioned the dataset for k-fold cross-validation, along with your chosen k. Also discuss, based on your chosen cross-validation setup, what the training and test sets were in each of the k iterations.
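
A possible partitioning sketch with scikit-learn's stratified k-fold splitter; the toy labels stand in for the dataset's real labels, and you would use your chosen k (e.g., 10) on the full data.

    from sklearn.model_selection import StratifiedKFold

    toy_labels = ["ham"] * 8 + ["spam"] * 4      # placeholder for the 5,574 real labels
    skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)  # use n_splits=k (e.g., 10) on the real data
    for fold, (train_idx, test_idx) in enumerate(skf.split(toy_labels, toy_labels)):
        print(f"fold {fold}: {len(train_idx)} training messages, {len(test_idx)} test messages")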

e. Discuss how you calculated the performance metrics, e.g., accuracy, precision, recall, etc. It is again okay to use an existing library that gives precision and recall values, in which case you need to present in your report which libraries/functions you used for the purpose, and what your inputs and outputs to those functions were.
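
For example, a sketch of computing these metrics with scikit-learn; the y_true/y_pred lists are placeholders, not real results.

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    y_true = ["ham", "ham", "spam", "spam", "ham"]   # placeholder ground truth, not real results
    y_pred = ["ham", "spam", "spam", "spam", "ham"]  # placeholder classifier output

    print(accuracy_score(y_true, y_pred))                     # fraction of messages labeled correctly
    print(precision_score(y_true, y_pred, pos_label="spam"))  # of predicted spam, how much really was spam
    print(recall_score(y_true, y_pred, pos_label="spam"))     # of actual spam, how much was caught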

(5) Analysis of results (6 points). Report the performance of your classifier based on the above discussion. You will need to use charts, graphs, or tables to report actual numbers, i.e., the values of the performance metrics you chose above (accuracy, precision, recall, etc.). These numbers should be reported for each of the k iterations of the k-fold cross-validation setup. You should also report the average performance over all k cross-validation folds for each evaluation metric.
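
One possible way to lay out such a table, assuming scores is the dictionary returned by cross_validate in the sketch under section (3) and using pandas (an assumption):

    import pandas as pd

    # scores is assumed to come from the cross_validate sketch under section (3).
    table = pd.DataFrame({k: v for k, v in scores.items() if k.startswith("test_")})
    table.index.name = "fold"
    table.loc["mean"] = table.mean()             # average over all k folds, per metric
    print(table.to_string())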

(6) Extra credit (5 points). a. There will be extra credit for extracting novel text-based features from the ham/spam messages (don't be afraid to be creative here; clue: many words in the messages are non-standard English).
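
Purely as an illustration of this kind of feature (not a required or recommended one), two hypothetical measures of non-standard text:

    def shouting_ratio(text):
        """Share of tokens written entirely in capital letters (hypothetical feature)."""
        tokens = text.split()
        return sum(t.isupper() and len(t) > 1 for t in tokens) / max(len(tokens), 1)

    def digit_token_ratio(text):
        """Share of tokens containing at least one digit (hypothetical feature)."""
        tokens = text.split()
        return sum(any(c.isdigit() for c in t) for t in tokens) / max(len(tokens), 1)

    print(shouting_ratio("URGENT! Your Mobile No 07808726822 was awarded a Bonus Caller Prize"))
    print(digit_token_ratio("URGENT! Your Mobile No 07808726822 was awarded a Bonus Caller Prize"))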

b. There will be additional extra credit for comparing (per the evaluation technique above) two or more classifiers in their ability to distinguish the two sets, spam and ham, when applied to the same data. For instance, you can compare your chosen classifier's (say, an SVM's) performance against another classifier such as Naive Bayes, and in the analysis section discuss which classifier performs better based on your chosen evaluation metrics (accuracy, precision, recall, etc.).
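
A rough comparison sketch, assuming scikit-learn and the tab-separated dataset file; pairing Multinomial Naive Bayes with a linear SVM is just one possibility:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Load the tab-separated file (name and format are assumptions; see the ReadMe).
    with open("SMSSpamCollection", encoding="utf-8") as f:
        labels, messages = zip(*(line.rstrip("\n").split("\t", 1) for line in f if "\t" in line))

    for name, model in [("Naive Bayes", MultinomialNB()), ("Linear SVM", LinearSVC())]:
        pipe = make_pipeline(TfidfVectorizer(), model)
        acc = cross_val_score(pipe, list(messages), list(labels), cv=10, scoring="accuracy")
        print(f"{name}: mean accuracy {acc.mean():.3f} over {len(acc)} folds")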