The course is lecture-based and also offers some hands-on tutorials. The project component is flexible and involves data collection, manipulation, and analysis. For further details on the course content, please refer to the course outline (PDF). This course is offered by the School of Computer Science at Carleton University.
Seminars are held every Thursday from 11:35 AM to 2:25 PM over Zoom.
|Instructors|Majid Komeili|Elio Velazquez|Michael Genkin|
|Office hours|by appointment|by appointment|by appointment|
- Project teams must be formed no later than January 20. Instructions have been sent to your email.
- Lectures will be recorded (subject to any technical issue).
- See the Location above for the link to attend the class over Zoom. The passcode has been sent to your email address.
- Welcome to Data5000! We look forward to meeting you on Thursday January 14 at 11:35 am over Zoom.
The course covers topics relevant to data science: working with data, exploratory data analysis, data mining, and machine learning. The concepts are illustrated using the R language. Students also receive hands-on tutorials (e.g., Tableau, IBM Cognos Analytics). Students will be evaluated on their course projects.
This schedule is tentative and may change depending on how the class progresses.
- Thursday January 14, 2021 - Lecture 1: What is Data Science?
- Thursday January 21, 2021 - Lecture 2: Working with Data.
- Thursday January 28, 2021 - Lecture 3: Visualization and Exploration.
- Thursday February 4, 2021 - Lecture 4: Data Mining and Machine Learning I.
- Thursday February 11, 2021 - Lecture 5: Machine Learning II.
- Thursday February 18, 2021 - NO CLASS (Winter Break)
- Thursday February 25, 2021 - IBM Watson Studio Tutorial, by Dennis Buttera, IBM Canada.
- Thursday March 4, 2021 - IBM Cognos Analytics Tutorial, by Dennis Buttera, IBM Canada.
- Thursday March 11, 2021 - Tableau Tutorial.
- Thursday March 18, 2021 - Guest Lecture.
- Thursday March 25, 2021 - Guest Lecture.
- Thursday April 1, 2021 - Project Presentations.
- Thursday April 8, 2021 - Project Presentations.
- Paper presentation: 10% (paper selection due January 28)
- Project proposal: 10% (due January 28, 11:59 PM)
- Presentation outlines: 5% (due March 11, 11:59 PM)
- Poster presentation: 15% (submission due March 23, to be presented on Data Day)
- Project presentation: 10% (in class on April 1 and April 8)
- Project report: 50% (due April 15, 11:59 PM)
Method of Delivery
Blended delivery. Students are expected to participate during the synchronous meeting time, including lectures and other presentations. There will be additional activities, such as the project, to be completed outside of class time. Classes will be recorded, subject to technical issues. Presentations by guest speakers will be recorded subject to their consent. Students are expected to have high-speed internet access and a computer with a microphone.
Paper Presentations
Each group needs to choose a conference publication on the topic of data science to present in class (15-minute talk). The selected paper should be 8-12 pages long, from conference proceedings (e.g., IEEE International Conference on Data Science, SIGKDD/KDD Conference), and is subject to the instructor's approval. Presentations will be scheduled throughout the term during class time. Paper selection is due January 28, 2021. Late submissions will be penalized 10% per day.
Project Proposal
The project forms an integral part of this course. The project is to be completed in groups of two students.
You have two options: you can choose to mine and analyze one of the provided datasets or come up with an idea of your own that relates to the course material. In either case, the project topic will require the instructor's approval.
Before you undertake your project, you will need to submit a proposal for approval. The proposal should be short (a PDF of at most 2 pages in ACM format). It should include a problem statement, the motivation for the project, a description of the data you will be working with, and a set of objectives you aim to accomplish. The proposal is due on January 28, 2021 by 11:59 PM via email. Late submissions will be penalized 10% per day.
Presentation Outlines
You need to submit a project presentation outline describing the structure of your poster and its preliminary content (in PDF format). Think of it as a very early draft of your poster. The outline is due on March 11 by 11:59 PM via email. Late submissions will be penalized 10% per day for up to four days. No late submission will be considered after that.
Poster Presentation
You will present your project's poster during the poster presentation on Data Day. The Data Day date is TBA, but is expected to be in late March. An independent jury will evaluate the posters and select winners. Groups that place among the top three will receive five bonus marks. The poster should be submitted for approval to your instructor via email by March 23 at 11:59 PM. Late submissions will be penalized 10% per day for up to two days. No submission will be considered after that.
Project Presentation
Each group will have the opportunity to present their poster in class on April 1 or April 8. This presentation should take the form of a 15-minute (hard maximum) conference-style talk and describe the motivation for your work, what you did, and what you found. If a demo is the best way to describe what you did, feel free to include one in the middle of the talk.
The proposed structure of your presentation:
- Introduction (describe the problem and motivation)
- Research questions
- Methodology: data collection, data cleanup, data mining, data analysis (statistics, machine learning), etc.
- Results (achieved, preliminary, or anticipated)
- Implications (why does this study matter? how can your findings be used?)
- Conclusion (summary, main contributions)
Project Report
The required length of the written report varies from project to project (typically 8-10 pages in double-column format). All reports must be formatted according to the ACM format and submitted as a PDF. The report is due on April 15 by 11:59 PM via email. Late submissions will not be considered.
- Dataset Search by Google
- Kaggle Datasets: a large collection of ML competitions, code, and public datasets.
- Kaggle Competitions
- PhysioNet: a large collection of ML datasets and competitions, mainly in health.
- GitHub repository via GHTorrent
- MSR Mining Challenge datasets (various datasets for different years)
- UCI Machine Learning Repository: a large collection of standard ML datasets.
- Open Data @ Government of Canada
- MIMIC: a large dataset of health data associated with ~60,000 intensive care unit admissions.
Resources
The following books are suggested but not required:
- "Doing Data Science: Straight Talk From the Frontline" by Cathy O'Neil and Rachel Schutt, O'Reilly Media, 2013
- "Data Mining and Business Analytics with R" by Johannes Ledolter, Wiley, 2013
- "Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking" by Foster Provost and Tom Fawcett, O'Reilly Media, 2013
- "An Introduction to Statistical Learning: with Applications in R" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani, Springer, 2013
- "The Elements of Statistical Learning: Data Mining, Inference, and Prediction" by Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, 2011
- "Cookbook for R" by Winston Chang
- "The R Inferno" by Patrick Burns
- "Software for Data Analysis: Programming with R" by John Chambers, Springer, 2008