Course Project

The course project gives students a chance to apply the methods taught in the course to real-world problems using real-world data. Course projects will be done in groups of up to 3 students.

Instructions for the final project are posted here. Note that we will be relying on the Kaggle platform for prediction submissions. If you are new to Kaggle, this introduction and this tutorial may be good places to start.

Data sets

Kaggle submissions


FAQ

Can I work with other students in the class?
Yes, you can work with up to two other students in submitting a joint report. However, the report will have to state the contributions from each person.
Can I seek advice from friends or colleagues?
Yes, you can seek advice from anyone, as long as you submit your own report and explicitly indicate with whom (if anyone) you discussed the project and what help they provided.
How will I be evaluated for the project?
If you understand and explain each step of the data mining process, understand and visualize the data effectively, and are able to build clustering, regression, and classification models from the data (i.e., complete all four objectives), you can expect to get full credit. Partial credit will be given as well.
Can we submit more than one model?
Because we are mimicking putting your model into production for future use, we will not be allowing multiple model submissions.
By which metric will the model be evaluated? How good does it have to be to get full credit?
The evaluation metrics are detailed in the write-up. Briefly, we will compute the mean squared error for each day and average it across days, after separating the days into two periods. We haven't set a quantitative bar on how good the model has to be to get full credit. A good report, with good visualizations and a good thought process, will get full credit even for a model that may not work very well.
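As a rough illustration of the metric described in this answer, the sketch below computes a per-day mean squared error and then averages those daily values within each of two periods. It is only a sketch under assumptions: the record layout (day, actual, predicted), the function names, and the `split_day` rule for separating the periods are all hypothetical, and the write-up remains the authoritative definition.

```python
from collections import defaultdict
from statistics import mean


def daily_mse(records):
    """Return {day: mean squared error} for (day, actual, predicted) records."""
    squared_errors = defaultdict(list)
    for day, actual, predicted in records:
        squared_errors[day].append((actual - predicted) ** 2)
    return {day: mean(errs) for day, errs in squared_errors.items()}


def period_scores(records, split_day):
    """Average the daily MSEs within each of two periods.

    Assumption: days before `split_day` form the first period and the
    remaining days form the second. The real split is defined in the write-up.
    """
    per_day = daily_mse(records)
    first = [mse for day, mse in per_day.items() if day < split_day]
    second = [mse for day, mse in per_day.items() if day >= split_day]
    return mean(first), mean(second)


# Toy data, made up for illustration: (day, actual, predicted)
records = [
    (1, 10.0, 12.0),  # squared error 4
    (1, 10.0, 10.0),  # squared error 0, so day 1 MSE = 2
    (2, 5.0, 4.0),    # day 2 MSE = 1
    (3, 8.0, 11.0),   # day 3 MSE = 9
    (4, 7.0, 6.0),    # day 4 MSE = 1
]
first_period, second_period = period_scores(records, split_day=3)
print(first_period, second_period)  # 1.5 5.0
```

Averaging per day first means that a day with many predictions does not outweigh a day with few, which may differ from a single MSE computed over all rows at once.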
Will you be teaching most of the tools required to do the class project, or will most of it be self-learning? For example, will we be learning how to build our own regression model and train it with data sets, or is that something for us to figure out on our own in R/Python?
We will be teaching several types of classifiers and regression approaches in the course, as well as a section on statistical programming; however, you will have to become familiar enough with R/Python to be able to write the programs for the various models. R and Python come with packages that can be used to construct support vector machines, decision trees, and much more, so hopefully you don't have to write much code from scratch.
Am I allowed to use external data?
Pairing external data with the anonymized training data may lead to de-anonymization of the data. Consequently, we are not allowing external data use.