Algorithms for Modern Data Sets (COMP 3801)
Fall 2017
Instructor: Anil Maheshwari
Office: Herzberg Building 5125B
Email: anil@scs.carleton.ca
Lectures: Monday & Wednesday
08:35  09:55 CB 2202
Office hours: Tuesday 10:00 to 11:30 5125b
HP.
Teaching assistants: Taras Gritsenko
Regular (Weekly) Office Hours: Monday 11:45 AM to 12:45PM
BiWeekly Office Hours (Only for Project Meetings): Tuesday &
Friday 10:0011:30 AM
TA will be located in Room HP 4125
Course objectives: Algorithm design
techniques for modern data sets arising in, for example, data
mining, web analytics, search engines, social networks, and
machine learning.
Caution: Note that you
need a minimum of B+ in COMP 2804 to register in this course. The
contents of this course are fairly broad, and will cover a spectrum
of techniques from the design and analysis of algorithms. It is
assumed that you have a very good grasp on the analysis of
algorithms (Onotation, recurrences, and complexity analysis),
elementary probability theory including expectation and indicator
random variables (contents of COMP 2804), the knowledge of basic
data structures (lists, trees, hashing), and the knowledge of
discrete mathematics (counting, permutations and combinations, proof
techniques: induction, contradiction, ..). Note that there
will not be time to review these material, and to appreciate the
contents of this course, you must have a very good grasp on these
topics.
Textbooks:
General References:
Useful References related to various topics:
 PageRank:
 Probability:
 Locality Sensitive Hashing
 Data Streaming
 Adwords
 Chapter on Advertisement on the Web from the Book.

Topics
(We will likely cover parts of MMDS Chapters 3, 4, 5, 7, 8, 9, 11,
and possibly 10 and 6.)
In broad terms, some combination of the following topics:
Link Analysis
Mining Data Streams  Frequency and Moment Estimates
Finding Similar Items  Locality Sensitive Hashing
Advertising on the web  Adwords & Online matching
Recommendation Systems  Collaborative Filtering
Dimensionality Reduction  Eigenvalues, PCA, SVD
Clustering
Mining Social Networks  Community Detection, Partitioning of
Graphs, Dynamic Graph Algorithms
Frequent Itemset
and
some probability+linear algebra as and when required.
Grading Scheme: (Tentative)
 3 Assignments: 15% (Assignment 1: PDFFile
TeXFile) (Assignment
2: PDFFile TeX File) (Assignment 3: PDFFile TeX
File)
 MidTerm: 10%
 Final Exam: 20%
 Quiz (during the class on several occasions): 9%
 Project: 46% (Breakdown: Biweekly
Progress Meetings with the TA (6%); Summary
Presentation (5%); Report (18%); Seminar(17%))
Information on Project: One of the major components
of this course is a project on one of the topics/themes related
to this course. You can either carry out a project by yourself,
or make a team of two. For getting some project ideas look into
http://www.mmds.org/ and specifically in CS 246 and CS 341
course webpages. Please do look at the Bibliographic Notes at
the end of the various chapters in the textbook. They present
pointers to related work. A few projects from the last
offering of this course are also provided on the course webpage
to help you to plan. Please consult TA and the course
instructor regarding all aspects of your project. Also the three
books on Networks listed in Useful References are good source
for project ideas.
A typical project will involve  some theoretical work and some
prototype implementation. First, the students will submit a
project proposal (12 page) consisting of welldefined problem
statement, appropriate references, some idea on what kind of
theoretical research/analysis is required, what will likely be
implemented, some idea on how testing data will be collected,
what kind of results are expected, and a time line. It should
have the breakdown of the whole project in terms of biweekly
goals. Your proposal is due on October 4th. On October 11th, you
will give a 4 minute talk to the class on what is your project
about, what you have done so far, and what remains to be done.
For this you will provide at most three slides (pdf file) and
the presentation will be done in the class. This will be
followed by a Project report and a seminar in late
November/Early December. Project Report will provide
details on all aspect of the project (Problem Statement,
Background Work, Theoretical Analysis, Experimental Design and
Analysis, Conclusions, and References (about 46
pages)). The Seminar will be approximately for 1518
minutes, where you will describe the class your project in some
detail. It is expected that you will make a formal presentation
about your project in this seminar. Also, you will be asked to
prepare a set of 46 quiz problems based on your seminar. Your
formal presentation is due one week before your seminar date
including the quiz problems. Your project report is due a couple
of days before your seminar date. All these deadlines are
fairly strict due to the nature of this course.
To ensure and to guide you with respect to all aspects of the
project, you will be setting mandatory biweekly meetings with
the TA. He will monitor your progress as well as assist you with
your conceptual/technical questions. Note that he is not there
for fixing bugs in your code. Also the instructor is there to
help you with the theoretical content, as sometimes they may be
daunting on the first sight!
Information on Quizzes: A number of Quizzes will take
place during the class. You may expect a quiz in every two
class, and this will actually depend on the coverage of topics.
This is to ensure that you attend classes and to see where are
the gaps in understanding.
What is done in Fall
2017 Term:
Sept 06: Bright & Early  the first class of Year
201718. Good Luck :)
Introductory remarks. What to expect. A quick summary of some of
the topics that we will cover.
A bit of Linear Algebra and Probability.
Sept 11: Introduction to Eigenvalues/Vectors;
Characteristic Polynomials; Markov Matrices  recurrent/transient,
reducible/irreducible, period; (Q1: Computing Eigenvalues/vectors
of an identity matrix.)
Sept 13: Eigenvalues
of Markov Matrices. Principal Eigenvalue/Eigenvector of Special
type of Markov Matrices. Structure of the web graph.
Sept 18: Markov Matrices and their
connection to Pagerank (Q2: Computing steady state of a 2state
Markov Chain.)
Sept 20: More details on Pagerank, LSH
Sept 25: LSH Continued [Chapter 3 in Book and/or See my
course notes]
Sept 27: LSH + Metric; (Q3:
Show that the probability that signatures match equals the Jaccard
Similarity.)
Oct 02: A1 Due, Locality Sensitive Family
Oct 04: Due date of Project Proposal (pdf file) Nearest Neighbors +
Fingerprinting
Oct 09: Thanksgiving
Oct 11: 4 minute Presentation on
Project Summary
Oct 16: Data Streaming Algorithms 
Bloom Filters; Heavy Hitters; Estimating Moments; DGIM algorithm
for counting ones in a window.
Oct 18: Data Streaming Algorithms Continued
Oct 30: Review of problems from A1
Nov 01: MidTerm
Nov 06: CMS, Counting #1s in the last N bit
of the stream (the DGIM algorithm).
Nov 08: A2 Due. Analysis of
CMS.
Nov 13: Online Matching +
Balance Algorithm  Lower Bounds
Nov 15:
Nov 20:
Nov 22: Project
Presentations
Nov 27: Project Presentations
Nov 29: Project Presentations
Dec 04: Project Presentations
Dec 06: Project
Presentations
Dec 08: Due
date of Final/Edited Project Reports; A3 Due
Final Exam: Date will be announced by Examination Office.
Project Schedule:
(Please submit all documents as pdf files. Do not write your
student number as all the submitted material is going to be
published on the course webpage.)
Some of the projects from the past offerings of
this course:
Music Collaboration Graph
Counting Distinct Elements in
Streams
Finding Regions of Interest in Images
Predicting controversy of a Wikipedia
Article
Some Sample reports from last
term: 1 2 3
What follows below are the
contents from the 1st offering of this course from Fall 2016.
What is done
in Fall 2016 Term:
Sep 07: What is a Matrix  Linear Transforms  Eigenvalues
 Eigenvectors  What are Markov Matrices and their Eigenvalues.
Other than the Wikipedia links on these topics, you may look at
these links: 1
2
Book on Linear Algebra and its Applications by Gilbert Strang from
MIT is a good source.
Sep 09: More on Eigenvalues, diagonalization and
factorization of square matrices. Page Rank Algorithm.
Sep 14: Eigenvalues of powers of A. PageRank
Algorithm. Some useful links:
Sep 16: Pagerank + Data Streaming
Consult the MMDS book for
Data Streaming.
Sep 21: Data Streaming  Why
simple sampling is not sufficient.
Sep 23: Data Streaming  Bloom Filters
Sep 28: Data Streaming  Flajolet  Martin algorithm
for estimating distinct elements. An algorithm for Estimating
Frequency Moments (Alon, Matias, and Szegedy
 JCSS 1999).
Sep 30: Handing of Project Proposal
Data Streaming  Analysis of AMS algorithm 
How to extend it for unknown value of n, how to estimate number of
1's in last N bits.
Oct 05: How to estimate number of 1's in last N bits
 algorithm of Datar, Gionis, Indyk, and Motwani from SIAM Jl.
Computing 2002, and some of its applications.
Oct 07: Introduction to Locality Sensitive Hashing
(Consult Book and/or my Course
Notes on Algorithm Design.)
Oct 12: LSH Continued  Analysis,
Applications in Document Matching, Fingerprint Matching.
Oct 14: More
Applications of LSH (Metric,
AND/OR Family of Functions, Applications
using Hamming Distance & finding close points.
Oct 19:
Due
date for Assignment 1.
Matching
Fingerprints
Advertising on the web  online algorithms for matching.
Oct 21: Advertising on the
web  online algorithms.
Nov 02: Project Previews:
Short 57 minute presentation on your projects.
Balance Algorithm
Nov 04: Finishing up Adwards 
including variations of Balance Algorithm.
Nov 09:
Clustering:
KMeans and K++Means.
Algorithm
by Arthur and Vassilvitski (SODA 2007)
Nov 11: Finishing
up K++Means and Introduction to Recommendation Systems.
Nov 16:
Recommendation
Systems: Writing utility matrix as a product of two matrices.
Introduction
to SVD.
Due Date
for Assignment 2
Nov 18: Dimensionality
Reduction: SVD, its variants, and applications.
CUR
Decomposition by Mahoney and Drineas (PNAS 2008)
Nov 23: Social
Network as a Graph G: Second Eigenvalue and its applications to
partitioning G.
Counting
triangles in G.
Nov 25: Project
Presentation (BR> Presentation
Report) + Quiz
Nov 30: Project
Presentation (OB > Presentation
Report, CE > Presentation
Report)) + Quiz
Dec 02: Project
Presentation (CA > Presentation
Report, TG  Presentation
Report)+
Quiz
Dec 07: Assignment
3 is due at Baker's Breakfast (4th floor University Center) at
8:30AM.
Undergraduate Academic Advisor:
The undergraduate advisor for the School of Computer Science is
available in Room 5302C HP, by telephone at 5202600, ext. 4364 or
by email at undergraduate_advisor@scs.carleton.ca. The advisor can
assist with information about prerequisites and preclusions, course
substitutions/equivalencies, understanding your academic audit and
the remaining requirements for graduation. The undergraduate advisor
will also refer students to appropriate resources such as the
Science Student Success Centre, Learning Support Services and the
Writing Tutorial Services.
University Policies:
Student Academic Integrity Policy: Every student should be
familiar with the Carleton University student academic integrity
policy. A student found in violation of academic integrity standards
may be awarded penalties which range from a reprimand to receiving a
grade of F in the course or even being expelled from the program or
University. Some examples of offences are: plagiarism and
unauthorized cooperation or collaboration. Information on this
policy may be found in the Undergraduate Calendar.
Plagiarism: As defined by Senate, "plagiarism is presenting,
whether intentional or not, the ideas, expression of ideas or work
of others as one's own". Such reported offences will be reviewed by
the office of the Dean.
Unauthorized Cooperation or Collaboration:
 Students are encouraged to collaborate on assignments, but at
the level of discussion only. When writing down the solutions,
they must do so in their own words. Just a word of caution  in
theoretically oriented courses, it is important to come up with
your own ideas for the proof/an algorithm/a contradiction/etc.
Sometimes these are like logical puzzles  if somebody tells you
a solution then they are trivial and hard part is to come up
with a solution. What we want to learn is how to solve them
ourselves.
 Past experience has shown conclusively that those who do not
put adequate effort into the assignments do not learn the
material and have a probability near 1 of doing poorly on the
exams.
Equity Statements:
You may need special arrangements to meet your academic obligations
during the term. For an accommodation request the processes are as
follows:
Pregnancy obligation: write to me with any requests for
academic accommodation during the first two weeks of class, or as
soon as possible after the need for accommodation is known to exist.
For more details visit the Equity Services website:
http://www2.carleton.ca/equity/
Religious obligation: write to me with any requests for
academic accommodation during the first two weeks of class, or as
soon as possible after the need for accommodation is known to exist.
For more details visit the Equity Services website:
http://www2.carleton.ca/equity/
Academic Accommodations: The Paul Menton Centre for Students with Disabilities
(PMC) provides services to students with Learning Disabilities
(LD), psychiatric/mental health disabilities, Attention Deficit
Hyperactivity Disorder (ADHD), Autism Spectrum Disorders (ASD),
chronic medical conditions, and impairments in mobility, hearing,
and vision. If you have a disability requiring academic
accommodations in this course, please contact PMC at 6135206608
or pmc@carleton.ca
for a formal evaluation. If you are already registered with the
PMC, contact your PMC coordinator to send me your Letter of
Accommodation at the beginning of the term, and no later
than two weeks before the first inclass scheduled test or exam
requiring accommodation (if applicable). Requests made
within two weeks will be reviewed on a casebycase
basis. After requesting accommodation from PMC, meet with me
to ensure accommodation arrangements are made. Please consult the
PMC website (www.carleton.ca/pmc)
for the deadline to request accommodations for the
formallyscheduled exam (if applicable).
Medical Certificate: The following is a link to the official
medical certificate accepted by Carleton University for the deferral
of final examinations or assignments in undergraduate courses. To
access the form, please go to
http://www1.carleton.ca/registrar/forms/
Announcements:
 Project meeting dates are:
1st
Meeting: Week of Sept 1822
2nd
Meeting: Week of Sept 2528
3rd
Meeting: Week of Oct 1620
4th
Meeting: Week of Oct 30Nov 3
5th
Meeting: Week of Nov 1318
6th
Meeting: Week of Nov 2025
 Assignment 1 is posted.
 Midterm will include all the material before we covered the
count min sketch. It is not a multiple choice exam.
Please do bring your calculators as you may have to evaluate
probabilities.
 I am away in the week of Oct 30  Nov 3. Mondays class will be
taken by Taras where he will go over the solutions of Assignment
1. Midterm on Wednesday will be conducted by Darryl Hill. All
project meetings with me will be postponed for the following
week.
 Assignment 2 is posted: PDFFile
TeX File