Useful References related to various topics:
Topics
We will likely cover parts of MMDS Chapters 3, 4, 5, 7, 8, 9,
11, and possibly 10 and 6. In addition to this there will
be more material on Data Streaming from some research articles.
In broad terms, some combination of the following topics:
Link Analysis
Mining Data Streams - Frequency and Moment Estimates
Finding Similar Items - Locality Sensitive Hashing
Advertising on the web - Adwords & Online matching
Recommendation Systems - Collaborative Filtering
Dimensionality Reduction - Eigenvalues, PCA, SVD
Clustering - K-Means
Mining Social Networks - Community Detection, Partitioning of
Graphs, Dynamic Graph Algorithms
Frequent Itemset
Online Learning
+ some probability
+ linear algebra as and when required.
- Written Project Proposal (10%): Due by Oct 1
- 3-Minute Project Proposal Presentation (5%): Friday, Oct 3rd Class
- End-of-Term presentation on projects (20%): Scheduled within the last 3-4 weeks of classes.
- Project Report (20%): A couple of days before your end-of-term project presentation.
Outline is as follows:
- Pick a topic. The idea is to summarize the article in your own words, and possibly implement some aspect.
You may look for papers/citations in recent proceedings of the conferences ACM-SIAM Symposium on Discrete Algorithms conference (SODA), SIGKDD, WWW, KDD, Data Minining and Knowledge Discovery, WSDM, ICLR, ICML.- Initial Proposal: Submit one page draft by October 1st. What is the topic you chose? Why? What problem(s) you will look at? What you plan to do? Outline of sections of your report? Main References. Prepare a 3 minute presentation which will be held during the class time slot on October 3rd.
- Final Project Presentation: Presentation is for approximately 10 minutes duration. It will be held during last 3-4 weeks of the course.
- Project Report: The report format will likely be a research article. Its best to use LaTeX (e.g. see Overleaf). The sections will include:
- Introduction (Motivation, Problem Statement, Related Work, Short Summary of what you did).
- Preliminaries (In case you need to discuss some notation, definitions, etc. as background)
- Main Section - How did you solve the problem; State Algorithm; State its Analysis; State its Correctness.
- Experiments (in case you performed any simulation etc.)
- Conclusions (Summary + What did you learn? + What do you think can be done in future?)
- References
- Report will be approximately 4 pages long and will be posted on the course web-page.
- You may use a double column format - for example the style file from Canadian Conference in Computational Geometry Style File from here: http://vga.usask.ca/cccg2020/CCCG2020-tex-template.zip
Sept 26: Frequency Moments F_0 and F_2
- Section 9.3 of Notes + Chapter 4 of MMDS
- AMS Article (Approximating frequency moments)
- Flajolet and Martin's Article on number of distinct elements in a stream
- Summary Notes: Frequency-Estimation
Oct 03: 3-Minute Project Proposal Presentation
- Rank, RREF, Null Space.
- Eigenvalues and Eigenvectors
- Video Lectures and the book of Gilbert Strang
- Section 4.1, 4.2, 4.3 of Notes
- Summary Notes
Oct 10: Clustering + More on Matrices
- k-means (Llyod's algorithm, MMDS Section 7.3)
- k-means++ paper ( Arthur and Vassilvitskii, 8th ACM-SIAM Symposium on Discrete algorithms, 2007)
- Section 15.1 and 15.2 of Notes
- Summary on Clustering
Oct 17: Social Networks - Cohesive Communities
- Brief discussion on how graphs represent social networks and why they are useful.
- Betweenness and the Girvan-Newman algorithm (from Mining of Massive Datasets).
- Using betweenness to detect communities (from mmds).
Oct 31: Markov Matrices and Page RankTriangle counting and an algorithm for finding triangles Relating triangle structure to k-truss's (Amirali's Thesis)
- Markovian Matrices (Definition, Recurrent/Transient States, reducible/irrreducible, aperiodic/periodic)
- Eignevalues and Eigenvectors of Recurrent, irreducible, aperiodic Markov Chain/Matrix.
- Perron-Frobenius Theorem
- Convergence, Steady State of Markov Chains and connection to principal eigenvectors
- Pagerank
- Eigenvalues (1 2)
- Video Lectures and the book of Gilbert Strang
- Page Rank
- Original paper by Brin and Page on PageRank: The anatomy of a large-scale hypertextual web search engine (1998).
- What can you do with a web in your pocket by Brin, Motwani, Page and Winograd.
- Link to the TED talk of Cedric Villani on "What's so sexy about Math" and the description of PageRank.
- Summary Notes
- Section 4.2, 4.3, 4.4, and 4.7 of Notes
| Reference |
Slides |
| A. Petukhova, J.P. Matos-Carvalho, and N.
Fachada, Text Clustering with Large Language Model Embeddings, International Journal of Cognitive Computing in Engineering, 2024. J. Sun, M. Du, and Y. Dong, Efficient Online Stream Clustering Based on Fast Peeling of Boundary Micro-Cluster. IEEE Trans Neural Netw Learn Syst., 2025. |
Clustering
Techniques for Euclidean and Non-Euclidean Spaces |
| S. Li, S. Dragicevic, F. Castro, M. Sester,
S. Winter, A. Coltekin, C. Pettit, B. Jiang, J. Haworth, A.
Stein, and T. Cheng, Geospatial big data handling theory and methods: A review and research challenges ISPRS J. Photogramm. Remote Sens., vol. 115, pp. 33-119, 2016. |
Clustering
Big Data Based on Distributed Fuzzy K-Medoids |
| D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. KDD 2003. |
Influence
Maximization in Social Networks |
| O. Ertl. UltraLogLog: A Practical and More Space-Efficient Alternative to HyperLogLog for Approximate Distinct Counting. Proceedings of the VLDB, 17(7):1655-1668, 2024 |
Recent developments in Count-Distinct Problem
- UltraLogLog |
| M. Zinkevich, M. Johanson, M. Bowling and C.
Piccione. Regret minimization in games with incomplete information. Advances in neural information processing systems, 20:1729-1736, 2007. |
Counterfactual
Regret Minimization |
| Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3-5), 75-174. https://www.sciencedirect.com/science/article/pii/S0370157309002841 |
Community
detection in graphs |
| A. Clementi, L. Gual`a, L. Pep`e Sciarria,
and A.Straziota. Maintaining k-MinHash Signatures over Fully-Dynamic Data Streams with Recovery. Proc. of the 18th ACM International Conference on Web Search and Data Mining (WSDM), 2025. |
Maintaining
k-MinHash Signatures over Fully-Dynamic Data Streams with Recovery |
| Y. Li and P. Li. C-MinHash: Improving Minwise Hashing with Circulant Permutation. International Conference on Machine Learning (ICML), PMLR, 2022. |
C-MinHash
and MinHash in an Approximate Nearest Neighbours Setting |
| Shlens, A Tutorial on Principal Component Analysis:1404.1100, Apr. 2014. [Online]. Available: https://arxiv.org/pdf/1404.1100 T. Roughgarden and G. Valiant, CS168: The Modern Algorithmic Toolbox Lecture #7: Understanding and Using Principal Component Analysis (PCA), 2024. [Online]. Available: https://web.stanford.edu/class/cs168/l/l7.pdf? |
PCA
and variants |
| Virginia Vassilevska Williams, Zoe Xi,
Yinzhan Xu, and Uri Zwick, All-Hops Shortest Paths, Proceedings of the 2025 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). 2025, 5191-5206 |
All-Hops Shortest Paths |
| Balcan, M.-F., Blum, A., Gupta, A. (2009).
Approximate Clustering without the Approximation. Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). |
Approximate
Clustering without the Approximation |
| S. Bhatia, M. Wadhwa, K. Kawaguchi, N. Shah,
P. S. Yu, and B. Hooi. Sketch-Based Anomaly Detection in Streaming Graphs. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2023). |
Sketch-Based
Anomaly Detection |
- See Exercises 2.35 and 2.36 in Notes.
- Summary Notes: Balls-and-Bins
Max
k-Coveragae
- Maximizing the Spread of Influence through a Social Network by Kempe, Kleinberg and Tardos
- Turning Down the Noise in the Blogosphere by El-Arini et al.
- https://en.wikipedia.org/wiki/Maximum_coverage_problem
- See Exercises 13.10, 13.11, 13.12, and 13.13 in Notes
- Summary: k-coverage
CUR Decomposition
- Chapter 9 in MMDS
- Summary on Collaborative Filtering
- Chapter 11.4 of MMDS
- CUR Decomposition
Important Dates:
For important academic dates and deadlines, refer to Carleton's
Academic
Calendar.