Algorithms for Modern Data Sets (COMP 3801)
Fall 2017
Instructor: Anil Maheshwari
Office: Herzberg Building 5125B
E-mail: anil@scs.carleton.ca
Lectures: Monday & Wednesday
08:35 - 09:55 CB 2202
Office hours: Tuesday 10:00 to 11:30 5125b
HP.
Teaching assistants: Taras Gritsenko
Regular (Weekly) Office Hours: Monday 11:45 AM to 12:45PM
Bi-Weekly Office Hours (Only for Project Meetings): Tuesday &
Friday 10:00-11:30 AM
TA will be located in Room HP 4125
Course objectives: Algorithm design
techniques for modern data sets arising in, for example, data
mining, web analytics, search engines, social networks, and
machine learning.
Caution: Note that you
need a minimum of B+ in COMP 2804 to register in this course. The
contents of this course are fairly broad, and will cover a spectrum
of techniques from the design and analysis of algorithms. It is
assumed that you have a very good grasp on the analysis of
algorithms (O-notation, recurrences, and complexity analysis),
elementary probability theory including expectation and indicator
random variables (contents of COMP 2804), the knowledge of basic
data structures (lists, trees, hashing), and the knowledge of
discrete mathematics (counting, permutations and combinations, proof
techniques: induction, contradiction, ..). Note that there
will not be time to review these material, and to appreciate the
contents of this course, you must have a very good grasp on these
topics.
Textbooks:
General References:
Useful References related to various topics:
- PageRank:
- Probability:
- Locality Sensitive Hashing
- Data Streaming
- Adwords
- Chapter on Advertisement on the Web from the Book.
- SVD/Dimensionality Reduction:
- Chapter in the book on Dimensionality Reduction
- Try a few SVDs using Wolfram Alpha.
Topics
(We will likely cover parts of MMDS Chapters 3, 4, 5, 7, 8, 9, 11,
and possibly 10 and 6.)
In broad terms, some combination of the following topics:
Link Analysis
Mining Data Streams - Frequency and Moment Estimates
Finding Similar Items - Locality Sensitive Hashing
Advertising on the web - Adwords & Online matching
Recommendation Systems - Collaborative Filtering
Dimensionality Reduction - Eigenvalues, PCA, SVD
Clustering
Mining Social Networks - Community Detection, Partitioning of
Graphs, Dynamic Graph Algorithms
Frequent Itemset
and
some probability+linear algebra as and when required.
Grading Scheme: (Tentative)
- 3 Assignments: 15% (Assignment 1: PDF-File
TeX-File) (Assignment
2: PDF-File TeX File) (Assignment 3: PDF-File
TeX File)
- Mid-Term: 10%
- Final Exam: 20%
- Quiz (during the class on several occasions): 9%
- Project: 46% (Breakdown: Bi-weekly
Progress Meetings with the TA (6%); Summary
Presentation (5%); Report (18%); Seminar(17%))
Information on Project: One of the major components
of this course is a project on one of the topics/themes related
to this course. You can either carry out a project by yourself,
or make a team of two. For getting some project ideas look into
http://www.mmds.org/ and specifically in CS 246 and CS 341
course web-pages. Please do look at the Bibliographic Notes at
the end of the various chapters in the textbook. They present
pointers to related work. A few projects from the last
offering of this course are also provided on the course web-page
to help you to plan. Please consult TA and the course
instructor regarding all aspects of your project. Also the three
books on Networks listed in Useful References are good source
for project ideas.
A typical project will involve - some theoretical work and some
prototype implementation. First, the students will submit a
project proposal (1-2 page) consisting of well-defined problem
statement, appropriate references, some idea on what kind of
theoretical research/analysis is required, what will likely be
implemented, some idea on how testing data will be collected,
what kind of results are expected, and a time line. It should
have the breakdown of the whole project in terms of bi-weekly
goals. Your proposal is due on October 4th. On October 11th, you
will give a 4 minute talk to the class on what is your project
about, what you have done so far, and what remains to be done.
For this you will provide at most three slides (pdf file) and
the presentation will be done in the class. This will be
followed by a Project report and a seminar in late
November/Early December. Project Report will provide
details on all aspect of the project (Problem Statement,
Background Work, Theoretical Analysis, Experimental Design and
Analysis, Conclusions, and References (about 4-6
pages)). The Seminar will be approximately for 15-18
minutes, where you will describe the class your project in some
detail. It is expected that you will make a formal presentation
about your project in this seminar. Also, you will be asked to
prepare a set of 4-6 quiz problems based on your seminar. Your
formal presentation is due one week before your seminar date
including the quiz problems. Your project report is due a couple
of days before your seminar date. All these deadlines are
fairly strict due to the nature of this course.
To ensure and to guide you with respect to all aspects of the
project, you will be setting mandatory bi-weekly meetings with
the TA. He will monitor your progress as well as assist you with
your conceptual/technical questions. Note that he is not there
for fixing bugs in your code. Also the instructor is there to
help you with the theoretical content, as sometimes they may be
daunting on the first sight!
Information on Quizzes: A number of Quizzes will take
place during the class. You may expect a quiz in every two
class, and this will actually depend on the coverage of topics.
This is to ensure that you attend classes and to see where are
the gaps in understanding.
What is done in Fall
2017 Term:
Sept 06: Bright & Early - the first class of Year
2017-18. Good Luck :)
Introductory remarks. What to expect. A quick summary of some of
the topics that we will cover.
A bit of Linear Algebra and Probability.
Sept 11: Introduction to Eigenvalues/Vectors;
Characteristic Polynomials; Markov Matrices - recurrent/transient,
reducible/irreducible, period; (Q1: Computing Eigenvalues/vectors
of an identity matrix.)
Sept 13: Eigenvalues
of Markov Matrices. Principal Eigenvalue/Eigenvector of Special
type of Markov Matrices. Structure of the web graph.
Sept 18: Markov Matrices and their
connection to Pagerank (Q2: Computing steady state of a 2-state
Markov Chain.)
Sept 20: More details on Pagerank, LSH
Sept 25: LSH Continued [Chapter 3 in Book and/or See my
course notes]
Sept 27: LSH + Metric; (Q3:
Show that the probability that signatures match equals the Jaccard
Similarity.)
Oct 02: A1 Due, Locality Sensitive Family
Oct 04: Due date of Project Proposal (pdf file) Nearest Neighbors +
Fingerprinting
Oct 09: Thanksgiving
Oct 11: 4 minute Presentation on
Project Summary
Oct 16: Data Streaming Algorithms -
Bloom Filters; Heavy Hitters; Estimating Moments; DGIM algorithm
for counting ones in a window.
Oct 18: Data Streaming Algorithms Continued
Oct 30: Review of problems from A1
Nov 01: Mid-Term
Nov 06: CMS, Counting #1s in the last N bit
of the stream (the DGIM algorithm).
Nov 08: A2 Due. Analysis of
CMS.
Nov 13: Online Matching +
Balance Algorithm - Lower Bounds
Nov 15: Balance Algorithm -
Lower/Upper Bounds; Collaborative Filtering
Nov 20: Collaborative
Filtering; Finding Missing Entries in the Utility Matrix; SVDs
Nov 22: Project
Presentations; More on SVDs
Nov 27: Project Presentations;
More on SVDs.
Nov 29: Project Presentations
Dec 04: Project Presentations
Dec 06: Project
Presentations
Dec 08: Due
date of Final/Edited Project Reports; A3 Due. Some problems on
K-Means + Some comments on SVD decomposition.
Dec 11: Final Exam. TB 206 at 2PM.
Project Schedule:
(Please submit all documents as pdf files. Do not write your
student number as all the submitted material is going to be
published on the course web-page.)
Some of the projects from the past offerings of
this course:
Music Collaboration Graph
Counting Distinct Elements in
Streams
Finding Regions of Interest in Images
Predicting controversy of a Wikipedia
Article
Some Sample reports from last
term: 1 2 3
What follows below are the
contents from the 1st offering of this course from Fall 2016.
What is done
in Fall 2016 Term:
Sep 07: What is a Matrix - Linear Transforms - Eigenvalues
- Eigenvectors - What are Markov Matrices and their Eigenvalues.
Other than the Wikipedia links on these topics, you may look at
these links: 1
2
Book on Linear Algebra and its Applications by Gilbert Strang from
MIT is a good source.
Sep 09: More on Eigenvalues, diagonalization and
factorization of square matrices. Page Rank Algorithm.
Sep 14: Eigenvalues of powers of A. PageRank
Algorithm. Some useful links:
Sep 16: Pagerank + Data Streaming
Consult the MMDS book for
Data Streaming.
Sep 21: Data Streaming - Why
simple sampling is not sufficient.
Sep 23: Data Streaming - Bloom Filters
Sep 28: Data Streaming - Flajolet - Martin algorithm
for estimating distinct elements. An algorithm for Estimating
Frequency Moments (Alon, Matias, and Szegedy
- JCSS 1999).
Sep 30: Handing of Project Proposal
Data Streaming - Analysis of AMS algorithm -
How to extend it for unknown value of n, how to estimate number of
1's in last N bits.
Oct 05: How to estimate number of 1's in last N bits
- algorithm of Datar, Gionis, Indyk, and Motwani from SIAM Jl.
Computing 2002, and some of its applications.
Oct 07: Introduction to Locality Sensitive Hashing
(Consult Book and/or my Course
Notes on Algorithm Design.)
Oct 12: LSH Continued - Analysis,
Applications in Document Matching, Fingerprint Matching.
Oct 14: More
Applications of LSH (Metric,
AND/OR Family of Functions, Applications
using Hamming Distance & finding close points.
Oct 19:
Due
date for Assignment 1.
Matching
Fingerprints
Advertising on the web - online algorithms for matching.
Oct 21: Advertising on the
web - online algorithms.
Nov 02: Project Previews:
Short 5-7 minute presentation on your projects.
Balance Algorithm
Nov 04: Finishing up Adwards -
including variations of Balance Algorithm.
Nov 09:
Clustering:
K-Means and K++-Means.
Algorithm
by Arthur and Vassilvitski (SODA 2007)
Nov 11: Finishing
up K++-Means and Introduction to Recommendation Systems.
Nov 16:
Recommendation
Systems: Writing utility matrix as a product of two matrices.
Introduction
to SVD.
Due Date
for Assignment 2
Nov 18: Dimensionality
Reduction: SVD, its variants, and applications.
CUR
Decomposition by Mahoney and Drineas (PNAS 2008)
Nov 23: Social
Network as a Graph G: Second Eigenvalue and its applications to
partitioning G.
Counting
triangles in G.
Nov 25: Project
Presentation (BR--> Presentation
Report) + Quiz
Nov 30: Project
Presentation (OB --> Presentation
Report, CE --> Presentation
Report)) + Quiz
Dec 02: Project
Presentation (CA --> Presentation
Report, TG - Presentation
Report)+
Quiz
Dec 07: Assignment
3 is due at Baker's Breakfast (4th floor University Center) at
8:30AM.
Undergraduate Academic Advisor:
The undergraduate advisor for the School of Computer Science is
available in Room 5302C HP, by telephone at 520-2600, ext. 4364 or
by email at undergraduate_advisor@scs.carleton.ca. The advisor can
assist with information about prerequisites and preclusions, course
substitutions/equivalencies, understanding your academic audit and
the remaining requirements for graduation. The undergraduate advisor
will also refer students to appropriate resources such as the
Science Student Success Centre, Learning Support Services and the
Writing Tutorial Services.
University Policies:
Student Academic Integrity Policy: Every student should be
familiar with the Carleton University student academic integrity
policy. A student found in violation of academic integrity standards
may be awarded penalties which range from a reprimand to receiving a
grade of F in the course or even being expelled from the program or
University. Some examples of offences are: plagiarism and
unauthorized co-operation or collaboration. Information on this
policy may be found in the Undergraduate Calendar.
Plagiarism: As defined by Senate, "plagiarism is presenting,
whether intentional or not, the ideas, expression of ideas or work
of others as one's own". Such reported offences will be reviewed by
the office of the Dean.
Unauthorized Co-operation or Collaboration:
- Students are encouraged to collaborate on assignments, but at
the level of discussion only. When writing down the solutions,
they must do so in their own words. Just a word of caution - in
theoretically oriented courses, it is important to come up with
your own ideas for the proof/an algorithm/a contradiction/etc.
Sometimes these are like logical puzzles - if somebody tells you
a solution then they are trivial and hard part is to come up
with a solution. What we want to learn is how to solve them
ourselves.
- Past experience has shown conclusively that those who do not
put adequate effort into the assignments do not learn the
material and have a probability near 1 of doing poorly on the
exams.
Equity Statements:
You may need special arrangements to meet your academic obligations
during the term. For an accommodation request the processes are as
follows:
Pregnancy obligation: write to me with any requests for
academic accommodation during the first two weeks of class, or as
soon as possible after the need for accommodation is known to exist.
For more details visit the Equity Services website:
http://www2.carleton.ca/equity/
Religious obligation: write to me with any requests for
academic accommodation during the first two weeks of class, or as
soon as possible after the need for accommodation is known to exist.
For more details visit the Equity Services website:
http://www2.carleton.ca/equity/
Academic Accommodations: The Paul Menton Centre for Students with Disabilities
(PMC) provides services to students with Learning Disabilities
(LD), psychiatric/mental health disabilities, Attention Deficit
Hyperactivity Disorder (ADHD), Autism Spectrum Disorders (ASD),
chronic medical conditions, and impairments in mobility, hearing,
and vision. If you have a disability requiring academic
accommodations in this course, please contact PMC at 613-520-6608
or pmc@carleton.ca
for a formal evaluation. If you are already registered with the
PMC, contact your PMC coordinator to send me your Letter of
Accommodation at the beginning of the term, and no later
than two weeks before the first in-class scheduled test or exam
requiring accommodation (if applicable). Requests made
within two weeks will be reviewed on a case-by-case
basis. After requesting accommodation from PMC, meet with me
to ensure accommodation arrangements are made. Please consult the
PMC website (www.carleton.ca/pmc)
for the deadline to request accommodations for the
formally-scheduled exam (if applicable).
Medical Certificate: The following is a link to the official
medical certificate accepted by Carleton University for the deferral
of final examinations or assignments in undergraduate courses. To
access the form, please go to
http://www1.carleton.ca/registrar/forms/
Announcements:
- Project meeting dates are:
1st
Meeting: Week of Sept 18-22
2nd
Meeting: Week of Sept 25-28
3rd
Meeting: Week of Oct 16-20
4th
Meeting: Week of Oct 30-Nov 3
5th
Meeting: Week of Nov 13-18
6th
Meeting: Week of Nov 20-25
- Assignment 1 is posted.
- Midterm will include all the material before we covered the
count min sketch. It is not a multiple choice exam.
Please do bring your calculators as you may have to evaluate
probabilities.
- I am away in the week of Oct 30 - Nov 3. Mondays class will be
taken by Taras where he will go over the solutions of Assignment
1. Mid-term on Wednesday will be conducted by Darryl Hill. All
project meetings with me will be postponed for the following
week.
- Assignment 2 is posted: PDF-File
TeX File
- Assignment 3 is posted
- If you want to collect your Assignment 3, do
so please by January 15, 2018. You are also welcome to
come and see your final exam. They will be shredded
after that date.
- Have a nice break and good luck.