General Information
Welcome to Data5000! The course will be lecture-based and will also offer some hands-on tutorials. The project component will be flexible and will involve data collection, manipulation, and analysis. For further details on the course content, please refer to its outline (pdf).
Seminars are held every Thursday from 11:35 AM to 2:25 PM. Please check Carleton Central for the room location. We look forward to seeing you on Thursday, January 11 at 11:35 AM.
Instructors | Majid Komeili | Elio Velazquez | Mahmud Hasan |
Email | majid.komeili@carleton.ca | Elio.Velazquez@carleton.ca | mahmudhasan@cunet.carleton.ca |
Office hours | by appointment | by appointment | by appointment |
Content Overview
The course covers topics relevant to data science: working with data, exploratory data analysis, data mining, and machine learning. The concepts are illustrated using the R language. Students also receive hands-on tutorials (e.g., Tableau, IBM Cognos Analytics, Microsoft Azure).
Tentative Schedule
This schedule is tentative and may change depending on how the class progresses.
- Thursday January 11 - Lecture 1: What is Data Science?
- Thursday January 18 - Lecture 2: Working with Data
- Thursday January 25 - Lecture 3: Visualization and Exploration
- Thursday February 1 - Lecture 4: Data Mining and Machine Learning I
- Thursday February 8 - Lecture 5: Machine Learning II
- Thursday February 15 - Paper presentations
- Thursday February 22 - NO CLASS (Winter Break)
- Thursday February 29 - Microsoft Azure Tutorial by Mohamed Sharaf
- Thursday March 7 - IBM Cognos Analytics Tutorial by Matthew Denham
- Thursday March 14 - Tableau Tutorial by Josh Gillmore
- Thursday March 21 - Guest Lecture: Abbas Akkasi and Tracey Lauriault
- Thursday March 28 - NO CLASS (in lieu of Data Day)
- Thursday April 4 - Project Presentations.
Course Information
Evaluation
- Project proposal: 10% (due January 25, 11:59 PM)
- Paper presentation: 10% (paper selection due February 1)
- Poster draft: 5% (due March 12, 11:59 PM)
- Poster submission: N/A (due March 19)
- Poster presentation: 10% (March 26, on Data Day 10.0)
- Project presentation: 15% (in class on April 4)
- Project report: 50% (due April 11, 11:59 PM)
Project Proposal
The project forms an integral part of this course. The project is to be completed in groups of two or three students. You have two options: you can choose to mine and analyze one of the provided datasets, or come up with an idea of your own that relates to the course material. In either case, the project topic will require the instructor's approval.
Before you undertake your project, you will need to submit a proposal for approval. The proposal should be short (a PDF of at most 2 pages). You may use the ACM format. The proposal should include a problem statement, the motivation for the project, a description of the data you will be working on, and a set of objectives you aim to accomplish. This will be due on January 25 by 11:59 PM via email.
Paper Presentations
Each group needs to choose a conference or journal paper related to Data Science and present it in class (15-minute talk). Paper selection is due February 1. The paper needs to be approved by the instructor. Papers will be presented on February 15.
Poster Draft
You need to submit your poster draft, including the structure and content of your poster, in PDF format. Instructors will review posters and offer feedback. This will be due on March 12 by 11:59 PM via email.
Poster Submission
You need to submit your poster to Data Day by March 19. Note that the Data Day committee will not consider late submissions. Further instructions will be communicated closer to Data Day.
Poster Presentation
Each group will have the opportunity to present their project's poster during the Data Day poster fair. Data Day is usually held in late March. The exact date will be announced.
Project Presentation
Each group will have the opportunity to present their project in class on April 4. This presentation should take the form of a 20-minute (hard maximum) conference-style talk and describe the motivation for your work, what you did, and what you found. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk.
The proposed structure of your presentation:
- Introduction (describe the problem and motivation)
- Research questions
- Methodology: data collection, data cleanup, data mining, data analysis (statistics, machine learning), etc.
- Results (achieved, preliminary, or anticipated)
- Implications (why does this study matter? how can your findings be used?)
- Conclusion (summary, main contributions)
Project Report
The required length of the written report varies from project to project, typically 8-10 pages in double-column format. All reports must be formatted according to the ACM or IEEE format and submitted as a PDF. This will be due on April 11 by 11:59 PM via email.
Datasets
- Open Data @ Government of Canada
- Open Data Ontario
- Data acquisition and preparation from a variety of open data sources
- Statistics Canada
- City of Ottawa
- MSR Mining Challenge datasets (various datasets for different years)
- Dataset Search by Google
- Kaggle Datasets: a large collection of ML competitions, code, and public datasets.
- Kaggle Competitions
- PhysioNet: a large collection of ML datasets and competitions, mainly in health.
- IAPR
- KDnuggets
Resources
The following books are suggested but not required:
- "Doing Data Science: Straight Talk From the Frontline" by Cathy O'Neil and Rachel Schutt, O'Reilly Media, 2013
- "Data Mining and Business Analytics with R" by Johannes Ledolter, Wiley, 2013
- "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking" by Foster Provost and Tom Fawcett, O'Reilly Media, 2013
- "An Introduction to Statistical Learning: with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013
- "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, 2011
- "Cookbook for R" by Winston Chang
- "The R Inferno" by Patrick Burns
- Quick-R
- "Software for Data Analysis: Programming with R" by John Chambers, Springer, 2008