Algorithms for Modern Data Sets (COMP 3801)
Fall 2017


Instructor: Anil Maheshwari
Office: Herzberg Building 5125B
E-mail: anil@scs.carleton.ca

Lectures: Monday & Wednesday  08:35 - 09:55 CB 2202
Office hours: Tuesday 10:00 to 11:30 5125b HP. 


Teaching assistants:  Taras Gritsenko
                                    Regular (Weekly) Office Hours:  Monday 11:45 AM to 12:45PM
                                    Bi-Weekly Office Hours (Only for Project Meetings): Tuesday & Friday 10:00-11:30 AM
                                    TA will be located in Room HP 4125

Course objectives: Algorithm design techniques for modern data sets arising in, for example, data mining, web analytics,  search engines, social networks, and machine learning.

Caution: Note that you need a minimum of B+ in COMP 2804 to register in this course. The contents of this course are fairly broad, and will cover a spectrum of techniques from the design and analysis of algorithms. It is assumed that you have a very good grasp on the analysis of algorithms (O-notation, recurrences, and complexity analysis), elementary probability theory including expectation and indicator random variables (contents of COMP 2804), the knowledge of basic data structures (lists, trees, hashing), and the knowledge of discrete mathematics (counting, permutations and combinations, proof techniques:  induction, contradiction, ..). Note that there will not be time to review these material, and to appreciate the contents of this course, you must have a very good grasp on these topics. 

Textbooks:

General References:

Useful References related to various topics:



Topics

(We will likely cover parts of MMDS Chapters 3, 4, 5, 7, 8, 9, 11, and possibly 10 and 6.) 

In broad terms, some combination of the following topics:

Link Analysis
Mining Data Streams - Frequency and Moment Estimates
Finding Similar Items - Locality Sensitive Hashing
Advertising on the web - Adwords & Online matching
Recommendation Systems - Collaborative Filtering
Dimensionality Reduction - Eigenvalues, PCA, SVD
Clustering
Mining Social Networks - Community Detection, Partitioning of Graphs,  Dynamic Graph Algorithms
Frequent Itemset
and
some probability+linear algebra as and when required.


Grading Scheme: (Tentative)

- 3 Assignments: 15% (Assignment 1: PDF-File   TeX-File)  (Assignment 2: PDF-File  TeX File) (Assignment 3:
PDF-File  TeX File)
- Mid-Term: 10%
- Final Exam: 20%
- Quiz (during the class on several occasions): 9%
- Project: 46% (Breakdown: Bi-weekly Progress Meetings with the TA (6%); Summary Presentation (5%); Report (18%); Seminar(17%))

Information on Project: One of the major components of this course is a project on one of the topics/themes related to this course. You can either carry out a project by yourself, or make a team of two. For getting some project ideas look into http://www.mmds.org/ and specifically in  CS 246 and CS 341 course web-pages. Please do look at the Bibliographic Notes at the end of the various chapters in the textbook. They present pointers to related work.  A few projects from the last offering of this course are also provided on the course web-page to help you to plan. Please consult  TA and the course instructor regarding all aspects of your project. Also the three books on Networks listed in Useful References are good source for project ideas.

A typical project will involve - some theoretical work and some prototype implementation. First, the students will submit a project proposal (1-2 page) consisting of well-defined problem statement, appropriate references, some idea on what kind of theoretical research/analysis is required, what will likely be implemented, some idea on how testing data will be collected, what kind of results are expected, and a time line. It should have the breakdown of the whole project in terms of bi-weekly goals. Your proposal is due on October 4th. On October 11th, you will give a 4 minute talk to the class on what is your project about, what you have done so far, and what remains to be done. For this you will provide at most three slides (pdf file) and the presentation will be done in the class. This will be followed by a Project report and a seminar in late November/Early December.  Project Report will provide details on all aspect of the project (Problem Statement, Background Work, Theoretical Analysis, Experimental Design and Analysis, Conclusions, and References (about 4-6 pages)).   The Seminar will be approximately for 15-18 minutes, where you will describe the class your project in some detail. It is expected that you will make a formal presentation about your project in this seminar. Also, you will be asked to prepare a set of 4-6 quiz problems based on your seminar. Your formal presentation is due one week before your seminar date including the quiz problems. Your project report is due a couple of days before your seminar date.  All these deadlines are fairly strict due to the nature of this course.

To ensure and to guide you with respect to all aspects of the project, you will be setting mandatory bi-weekly meetings with the TA. He will monitor your progress as well as assist you with your conceptual/technical questions. Note that he is not there for fixing bugs in your code. Also the instructor is there to help you with the theoretical content, as sometimes they may be daunting on the first sight!

Information on Quizzes: A number of Quizzes will take place during the class. You may expect a quiz in every two class, and this will actually depend on the coverage of topics. This is to ensure that you attend classes and to see where are the gaps in understanding.


What is done in Fall 2017 Term:

Sept 06:
Bright & Early - the first class of Year 2017-18. Good Luck :)
               Introductory remarks. What to expect. A quick summary of some of the topics that we will cover.
               A bit of Linear Algebra and Probability.

Sept 11:
Introduction to Eigenvalues/Vectors; Characteristic Polynomials; Markov Matrices - recurrent/transient, reducible/irreducible, period; (Q1: Computing Eigenvalues/vectors of an identity matrix.)
Sept 13: Eigenvalues of Markov Matrices. Principal Eigenvalue/Eigenvector of Special type of Markov Matrices.  Structure of the web graph.

Sept 18: Markov Matrices and their connection to Pagerank (Q2: Computing steady state of a 2-state Markov Chain.)
Sept 20:
More details on Pagerank, LSH

Sept 25:
LSH Continued [Chapter 3 in Book and/or See my course notes]
Sept 27: LSH + Metric; (Q3: Show that the probability that signatures match equals the Jaccard Similarity.)

Oct  02: A1 Due,
Locality Sensitive Family
Oct  04: Due date of Project Proposal (pdf file) Nearest Neighbors + Fingerprinting

Oct  09: Thanksgiving

Oct  11:  4 minute Presentation on Project Summary

Oct  16: Data Streaming Algorithms - Bloom Filters; Heavy Hitters; Estimating Moments; DGIM algorithm for counting ones in a window.
Oct  18: Data Streaming Algorithms Continued

Oct 30:
Review of problems from A1
Nov 01: Mid-Term

Nov 06: CMS, Counting #1s in the last N bit of the stream (the DGIM algorithm).
Nov 08: A2 Due. Analysis of CMS.
 
Nov 13: Online Matching + Balance Algorithm - Lower Bounds
Nov 15:
 
Nov 20:
Nov 22: Project Presentations
 
Nov 27: Project Presentations
Nov 29: Project Presentations
 
Dec 04:
Project Presentations
Dec 06:
Project Presentations 
Dec 08:
Due date of Final/Edited Project Reports; A3 Due

Final Exam: Date will be announced by Examination Office.


Project Schedule: 

(Please submit all documents as pdf files. Do not write your student number as all the submitted material is going to be published on the course web-page.)

Proposal Talks
(October 11)

Report
Due Date  (PDF File Please).
Seminar Presentation
Date
COMP3801 Project Proposal Talk1
BM
Monday Nov 27 by 5PM
Nov 29
Generating Music with Recurrent NN Microsoft Word - Project Proposal.docx Talk1
ES
Friday Dec 01 by 10AM Dec 04
Deducing Tier Lists

MC
Tuesday Nov 21 by noon
Nov 22
Predicting Price of Crypto-currency

Fedor L
Friday Nov 24 by 10AM
Nov 27
Finding relevant scientific articles
Talk1
CB+PK
Friday Nov 24 by 10AM Nov 27
LSH for Plagiarism Detection in Python Code
Talk1
MM
Friday Dec 01 by 10AM Dec 04
ID3-Algorithm
Talk1
MS
Friday Dec 01 by 10AM Dec 04
Finding similar articles
Talk1
TD
Monday Nov 27 by 5PM Nov 29
Finding Similar Routes
Talk1
AM
Monday Dec 04 by 5PM Dec 06
Finding outliers using Bayesian Classification
Talk1
BL+VC
Friday Nov 24 by 10AM Nov 27
Perron-Frobenius Theorem + CUR Decomposition

FX
Tuesday Nov 21 by noon Nov 22
Clustering Articles
Talk1
KJ
Monday Dec 04 by 5PM Dec 06
Popularity of Twitter Articles

A D-A
Monday Nov 27 by 5PM Nov 29
Finding funny comments on Reddit 
MB
Monday Dec 04 by 5PM Dec 06







Some of the projects from the past offerings of this course:

Music Collaboration Graph 
Counting Distinct Elements in Streams
Finding Regions of Interest in Images 
Predicting controversy of a Wikipedia Article

Some Sample reports from last term: 1 2 3


What follows below are the contents from the 1st offering of this course from Fall 2016.

What is done in Fall 2016 Term:

Sep 07:
What is a Matrix - Linear Transforms - Eigenvalues - Eigenvectors - What are Markov Matrices and their Eigenvalues. Other than the Wikipedia links on these topics, you may look at these links: 1 2  Book on Linear Algebra and its Applications by Gilbert Strang from MIT is a good source.

Sep 09: More on Eigenvalues, diagonalization and factorization of square matrices.  Page Rank Algorithm. 
 
Sep 14:
Eigenvalues of powers of A. PageRank Algorithm. Some useful links:
Sep 16: Pagerank + Data Streaming
    
  Consult the MMDS book for Data Streaming.

Sep 21: Data Streaming - Why simple sampling is not sufficient.

Sep 23:
Data Streaming - Bloom Filters

Sep 28:
Data Streaming - Flajolet - Martin algorithm for estimating distinct elements. An algorithm for Estimating Frequency Moments  (Alon, Matias, and Szegedy - JCSS 1999).

Sep 30: Handing of Project Proposal
             Data Streaming -  Analysis of AMS algorithm -  How to extend it for unknown value of n, how to estimate number of 1's in last N bits.

Oct 05:
How to estimate number of 1's in last N bits - algorithm of Datar, Gionis, Indyk, and Motwani from SIAM Jl. Computing 2002, and some of its applications.

Oct 07:
Introduction to Locality Sensitive Hashing (Consult Book and/or my Course Notes on Algorithm Design.)

Oct 12:
LSH Continued - Analysis, Applications in Document Matching, Fingerprint Matching.

Oct 14:
More Applications of LSH (Metric,  AND/OR Family of Functions, Applications using Hamming Distance & finding close points.

Oct 19:
  Due date for Assignment 1.
             
Matching Fingerprints             
              Advertising on the web - online algorithms for matching.

Oct 21:
Advertising on the web - online algorithms.

Nov 02:
Project Previews: Short 5-7 minute presentation on your projects. Balance Algorithm

Nov 04:
Finishing up Adwards - including variations of Balance Algorithm.

Nov 09:
Clustering: K-Means and K++-Means.  Algorithm by Arthur and Vassilvitski (SODA 2007)


Nov 11: 
Finishing up K++-Means and Introduction to Recommendation Systems.

Nov 16:
Recommendation Systems: Writing utility matrix as a product of two matrices. Introduction to SVD.
              Due Date for Assignment 2

Nov 18:
Dimensionality Reduction: SVD,  its variants, and applicationsCUR Decomposition by Mahoney and Drineas (PNAS 2008)

Nov 23:
Social Network as a Graph G: Second Eigenvalue and its applications to partitioning G. Counting triangles in G.

Nov 25:
Project Presentation  (BR--> Presentation Report) + Quiz

Nov 30:
Project Presentation (OB --> Presentation Report,   CE --> Presentation   Report)) + Quiz

Dec 02:
Project Presentation (CA   --> Presentation  Report,  TG - Presentation  Report)+ Quiz        

Dec 07:
Assignment 3 is due at Baker's Breakfast (4th floor University Center) at 8:30AM.


Undergraduate Academic Advisor:

The undergraduate advisor for the School of Computer Science is available in Room 5302C HP, by telephone at 520-2600, ext. 4364 or by email at undergraduate_advisor@scs.carleton.ca. The advisor can assist with information about prerequisites and preclusions, course substitutions/equivalencies, understanding your academic audit and the remaining requirements for graduation. The undergraduate advisor will also refer students to appropriate resources such as the Science Student Success Centre, Learning Support Services and the Writing Tutorial Services.
University Policies:

Student Academic Integrity Policy: Every student should be familiar with the Carleton University student academic integrity policy. A student found in violation of academic integrity standards may be awarded penalties which range from a reprimand to receiving a grade of F in the course or even being expelled from the program or University. Some examples of offences are: plagiarism and unauthorized co-operation or collaboration. Information on this policy may be found in the Undergraduate Calendar.

Plagiarism: As defined by Senate, "plagiarism is presenting, whether intentional or not, the ideas, expression of ideas or work of others as one's own". Such reported offences will be reviewed by the office of the Dean.

Unauthorized Co-operation or Collaboration:
Equity Statements:

You may need special arrangements to meet your academic obligations during the term. For an accommodation request the processes are as follows:

Pregnancy obligation: write to me with any requests for academic accommodation during the first two weeks of class, or as soon as possible after the need for accommodation is known to exist. For more details visit the Equity Services website: http://www2.carleton.ca/equity/

Religious obligation: write to me with any requests for academic accommodation during the first two weeks of class, or as soon as possible after the need for accommodation is known to exist. For more details visit the Equity Services website: http://www2.carleton.ca/equity/

Academic Accommodations: The Paul Menton Centre for Students with Disabilities (PMC) provides services to students with Learning Disabilities (LD), psychiatric/mental health disabilities, Attention Deficit Hyperactivity Disorder (ADHD), Autism Spectrum Disorders (ASD), chronic medical conditions, and impairments in mobility, hearing, and vision. If you have a disability requiring academic accommodations in this course, please contact PMC at 613-520-6608 or pmc@carleton.ca for a formal evaluation. If you are already registered with the PMC, contact your PMC coordinator to send me your Letter of Accommodation at the beginning of the term, and no later than two weeks before the first in-class scheduled test or exam requiring accommodation (if applicable). Requests made within two weeks will be reviewed on a case-by-case basis. After requesting accommodation from PMC, meet with me to ensure accommodation arrangements are made. Please consult the PMC website (www.carleton.ca/pmc)  for the deadline to request accommodations for the formally-scheduled exam (if applicable).

Medical Certificate: The following is a link to the official medical certificate accepted by Carleton University for the deferral of final examinations or assignments in undergraduate courses. To access the form, please go to http://www1.carleton.ca/registrar/forms/


Announcements:

  1. Project meeting dates are:

    1st Meeting: Week of Sept 18-22

    2nd Meeting: Week of Sept 25-28

    3rd Meeting: Week of Oct 16-20

    4th Meeting: Week of Oct 30-Nov 3

    5th Meeting: Week of Nov 13-18

    6th Meeting: Week of Nov 20-25

  2. Assignment 1 is posted.
  3. Midterm will include all the material before we covered the count min sketch. It is not a multiple choice exam. Please do bring your calculators as you may have to evaluate probabilities.
  4. I am away in the week of Oct 30 - Nov 3. Mondays class will be taken by Taras where he will go over the solutions of Assignment 1. Mid-term on Wednesday will be conducted by Darryl Hill. All project meetings with me will be postponed for the following week.
  5. Assignment 2 is posted: PDF-File  TeX File