General Information
Welcome to Data5000! The course will be lecture-based and will also offer some hands-on tutorials. The project component will be flexible and will involve data collection, manipulation, and analysis.
Classes are held every Thursday from 11:35 AM to 2:25 PM. Please check Carleton Central for the room location. There are three classrooms for the three sections. The first class and the tutorials/guest lectures will be in SC103. We look forward to seeing you on Thursday January 9 at 11:35 AM in SC103.
Instructors | Majid Komeili | Elio Velazquez | Mahmud Hasan |
Email | majid.komeili@carleton.ca | Elio.Velazquez@carleton.ca | mahmudhasan@cunet.carleton.ca |
Office hours | by appointment | by appointment | by appointment |
Teaching Assistant:
Adnan Khan, AdnanKhan5@cmail.carleton.ca
Office Hours: by appointment
Content Overview
The course covers topics relevant to data science: working with data, exploratory data analysis, data mining, and machine learning. The concepts are illustrated using the R language. Students will be evaluated on their course projects. There will be tutorials and guest speakers.
Tentative Schedule
Note that this schedule is evolving and will change based on how the class progresses.
- Thursday January 9 - Lecture 1: Introduction
- Thursday January 16 - Lecture 2: Working with Data
- Thursday January 23 - Lecture 3: Visualization and Exploration
- Thursday January 30 - Lecture 4: Machine Learning I
- Thursday February 6 - Lecture 5: Machine Learning II
- Thursday February 13 - Tableau Tutorial by Josh Gillmore
- Thursday February 20 - NO CLASS (Winter Break)
- Thursday February 27 - Paper presentations
- Thursday March 6 - IBM Cognos Analytics Tutorial by Matthew Denham
- Thursday March 13 - Guest speaker: Mohamed Sharaf
- Thursday March 20 - Guest speakers: Gerald Grant and Tracey Lauriault
- Thursday March 27 - Project Presentations.
- Thursday April 3 - Project Presentations.
Course Information
Evaluation
- Project proposal: 10% (due January 26)
- Paper selection: 0% (due January 30)
- Paper presentation: 10% (February 27)
- Progress report: 10% (due March 4)
- Class participation and discussions: 5%
- Project presentation: 15% (March 27 and April 3)
- Project report: 50% (due April 10)
Project Proposal
The project forms an integral part of this course. The project is to be completed in groups of two students. Each group should have one technical expert (a student from Computer Science, Systems and Computer Engineering, Information Technology, Physics, Chemistry) and one domain expert (e.g., from Communication, Geography, Biology, History, Psychology, Economics, Business, Health Sciences, Cognitive Science, Public Policy and Administration, International Affairs). Domain experts may contribute by finding the right problem, justifying why it is important to study, and extracting the value and implications of the work. Technical experts do the heavy lifting of building models. The main goal for students is to learn how to work on a multidisciplinary team: for domain experts, this means learning technical terminology; for technical experts, it means learning how to work fruitfully with domain experts.
Before you undertake your project you will need to submit a proposal for approval. The proposal should be short (max 2-page PDF). You may use the ACM format. The proposal should include a problem statement, the motivation for the project, a description of the data you will be working on, and a set of objectives you aim to accomplish. This will be due on January 26 by 11:59 PM.
Paper Presentations
Each group needs to choose a conference or journal paper related to Data Science and present it. Paper selection is due January 30. The paper needs to be approved by the instructor. Papers will be presented on February 27.
Progress Report
Each group should submit a progress report (3000 characters max) by March 4. A progress report typically includes the following sections:
- Introduction: A summary of the project and its goals.
- Work completed: A list of tasks that have been completed.
- Work in progress: A list of tasks that are currently in progress.
- Work to be started: A list of tasks that have not yet started.
- Conclusion: An appraisal of the project's progress and, if applicable, any issues or concerns about the project.
Project Presentation
Each group will have the opportunity to present their project in class on either March 27 or April 3. Slides must be submitted by March 26 regardless of the date of your presentation. This presentation should be in the form of a conference-style talk and describe the motivation for your work, what you have done, and what you have found so far. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk.
The proposed structure of your presentation:
- Introduction (describe the problem and motivation)
- Research questions
- Methodology: data collection, data cleanup, data mining, data analysis, etc.
- Results (achieved, preliminary, or anticipated)
- Implications (why does this study matter? how can your findings be used?)
- Conclusion (summary, main contributions)
Project Report
The required length of the written report varies from project to project (8-10 pages, double-column format); all reports must be formatted according to the ACM or IEEE format and submitted as a PDF. This will be due on April 10.
Datasets
- City of Ottawa
- Open Data Ontario
- Open Data @ Government of Canada
- Statistics Canada
- Dataset Search by Google
- Data acquisition and preparation from a variety of open data sources
- Kaggle Datasets: It has a large collection of ML competitions, code, and public datasets.
- Kaggle Competitions
- Physionet: It has a large collection of ML datasets/competitions mainly in health.
- KDnuggets
Resources
The following books are suggested but not required:
- "Doing Data Science: Straight Talk From the Frontline" by Cathy O'Neil and Rachel Schutt, O'Reilly Media, 2013
- "Data Mining and Business Analytics with R" by Johannes Ledolter, Wiley, 2013
- "Data Science for Business: what you need to know about data mining and data-analytic thinking" by Foster Provost and Tom Fawcett, O'Reilly Media, 2013.
- "An Introduction to Statistical Learning: with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013
- "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, 2011.
- "Cookbook for R" by Winston Chang
- "The R Inferno" by Patrick Burns
- Quick-R
- "Software for Data Analysis: Programming with R" by John Chambers, Springer, 2008.