Overview

Welcome to the web page of COMP 5118 - Trends in Big Data Management. This is a grad-level course for students at Carleton University and the University of Ottawa. Each year we focus on some research topics in the general field of data management. These research topics change from one course offering to another depending on what's new and hot. This term, we focus on the following topics: Social Media Data, Text-to-SQL, Internet of Things, Entity Matching, Data Lakes, Sentiment Analysis, Anomaly Detection, Time Series, AI Applications in Geospatial Data, AI Applications in Football, AI Applications in Medical Data, Question Answering, Database Tuning, Data Discovery, Knowledge Graphs, and Document Search. Check the schedule below to see the list of papers that we will discuss this term. Most of the papers we will be covering during the term were published in top-tier conferences and are very recent. This should give us a rough idea of what the data management research community is currently working on. Psst, this will also (hopefully) give you ideas for the course project, which you should take very seriously.

The class is on Tuesday from 11:35 am to 2:25 pm. The class will take place in RB2308. If an in-person class is not possible for any reason, the class will be held via Zoom.

Contact Information

Herzberg Laboratories 5433
1125 Colonel By Dr
Ottawa, Ontario K1S 5B6

613-520-2600 ext. 4254
myFirstName.myLastNameWithoutHyphen@carleton.ca

There's also this anonymous feedback form, in which you can swear at me. But during the swearing spree, please give me some constructive feedback.

Grading

In this course, students will be reading and reviewing papers for each class. During the class, some students will present the papers for the week, and then they and the rest of the class (including myself) will discuss these papers. There is also a term-long project, which is worth the biggest chunk of your grade. The grade breakdown is as follows:

  • Project 45%
  • Presentations 20%
  • Paper Reviews 20%
  • Class Participation 15%

Project

The project could be any of the following:

  1. New research idea: A prototype implementation of a new research idea that addresses one of the drawbacks or limitations of an existing research work, or a completely new research idea that is inspired by any of your readings.
  2. Experimental Study: An experimental comparison and evaluation of existing work in a specific research topic. Students are not supposed to reimplement all of the existing solutions. Rather, they should be able to reuse an existing code base with minor changes to run the benchmark. The main contribution in the benchmark is to give insights that did not exist in the systems used in the evaluation.
  3. Survey: With the extensive research efforts in the topics covered in this course, a survey paper should summarize and categorize the major research contributions in a specific area. The survey should not be a mere summarization of existing papers; rather, the students should provide their own insight on the surveyed body of work. For example, they can provide a categorization or a taxonomy that highlights the major research directions in that area. Students can also identify the open research problems that were hardly addressed in the literature. Here is a good example of what the survey should look like.
  4. System Implementation and Reproducibility: I have a number of systems I would like implemented. For your project, you could choose any of them, implement it, and reproduce the reported results.

The project can be done individually or in groups. However, the assessment will take into consideration how many students are in the group.

The project deliverables will be:

  1. Project Proposal: This should be a proposal of at most two pages (including references) in ACM Proceedings Format (LaTeX is mandatory). To write a good proposal, I strongly suggest reading Jennifer Widom's tips for writing introductions. I also strongly suggest reading the whole document as it is helpful for writing research papers in general. This proposal is due on October 27 at 11:59 PM. The proposal can be submitted using this form. If you have a solid idea that you would like to submit before the deadline to get better feedback and give yourself more time to work on the project, early submissions of the proposal are STRONGLY encouraged.
  2. Project Paper: Again, in ACM Proceedings Format. This should be at least 7 pages including references. Depending on the size of the group and contributions, the paper could be longer, so there is no page limit. The due date for the project paper is December 8 (11:59 PM). Late submissions are allowed for one more week, with a hard deadline for submission on December 15 at 11:59 PM. The project can be submitted using this form.
  3. Source Code: Your source code is expected to be publicly available on GitHub. The GitHub link for your project should be in the project paper. Please write a good README that clearly describes how to run your code. The due date for the project source code is the same as for the project paper.

Presentations

There will be 19 presentations throughout the term. This workload may not be evenly distributed over the students taking this class. Therefore, any student who gives one more presentation than the average will get a bonus. Each presentation should be 30 to 45 minutes long, followed by 30 to 45 minutes of discussion of the paper. The presenter should not only present the details of the paper, but also suggest discussion points at the end of their presentation.

Paper Reviews

The paper reviews are due at 11:00 AM on the day of the class. The format for the review is fixed: a summary of the paper, three or more strong points, three or more weak points, and any additional comments you may have on the paper. The number of fields required is small, but you are expected to elaborate. As a rough guide, if your review were written in a Word document, it should be at least one page long in 12 pt font. Your two worst reviews will not count towards your grade.

Here are a few comments to consider when you write your reviews:

  1. Don't copy/paste sentences from the paper. Write down your own understanding of the paper.
  2. When listing the strong and weak points, please enumerate them and don't write a single big paragraph with all the points. The writing of the paper cannot be one of the main strong or weak points.
  3. Don't just say that one of the strong points is that this paper used a deep learning approach. This is not a strong point. A strong point would be to say why you think using deep learning in this case is a good approach. Not why deep learning is good in general, but why in this particular case.
  4. In general, when you choose something as a strong or weak point, elaborate on why you think it's strong or weak. I can't read minds to know what you had in mind when you chose "using templates to answer questions" as a strong point.
  5. Please get over the writing of the paper when it comes to enumerating strong or weak points. Focus on the real beef of the paper rather than presentation. I'd accept presentation as a fourth strong or weak point, not one of the main three.
  6. Don't count something that is natural for the authors to do as a strong point. For example, using a benchmark with complex questions in the evaluation. Well, this isn't a strong point. If they didn't do that, it would have been a sh*tty paper. Another example would be beating the state-of-the-art systems. If they didn't, they wouldn't publish a paper.
Some References:
  1. Reading a Computer Science Paper.
  2. Example of a bad review.
  3. How to Read a Paper.
  4. IMPORTANT: Here is a sample review for a rejected paper (real review). Please have a look at it. It will show you what reviews look like in real life. You'll see that strong points are not elaborated on. That's OK for this kind of review because the reviewers are reviewing the paper for acceptance or rejection. If it's accepted, they won't elaborate more on what's obvious. On the other hand, you'll find they tend to elaborate more on weak points. Sometimes the three (or more) weak points are brief, but that's because they elaborate on these points in the detailed comments section. Since we don't have that section, you are expected to write detailed strong and weak points. Please read this review carefully to understand the level of feedback expected in a review.

Paper Review Submission Link

Class Participation

This is a seminar-based class, meaning that your participation in the class is essential. You are encouraged to ask questions, answer other students' questions, give comments on the papers we discuss, etc.

Schedule

September 12
  Topics: Course Introduction
  Papers: N/A
  Speakers: Ahmed El-Roby

September 19
  Topics: System Design
  Papers:
    1. An investigation of the Therac-25 accidents.
  Speakers:
    1. Vaishnavi Dinesh

September 26
  Topics: Graph Processing; Data Analytics
  Papers:
    1. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing.
    2. DataChat: An Intuitive and Collaborative Data Analytics Platform.
  Speakers:
    1. Kishore Vanapalli
    2. Shuvankar Saha

October 3
  Topics: AI Applications in Football
  Papers:
    1. Reinforcement Learning for Football Player Decision Making Analysis.
    2. Making Offensive Play Predictable - Using a Graph Convolutional Network to Understand Defensive Performance in Soccer.
  Speakers:
    1. Abdel Qayyim
    2. Evan Pierce

October 10
  NO CLASS

October 17
  Topics: AI Applications in Football; Internet of Vehicles
  Papers:
    1. StratAlign: Uncovering Tactical Patterns through Large-Scale Event Sequence Matching.
    2. CarStream: An Industrial System of Big Data Processing for Internet-of-Vehicles.
    3. Real-time Data Infrastructure at Uber.
  Speakers:
    1. Jola Ajayi
    2. Ayomide Awonaya
    3. Stephen Akinpelu

October 24
  NO CLASS (Fall Break)

October 31
  Topics: Knowledge Graphs
  Papers:
    1. Explaining Link Prediction Systems based on Knowledge Graph Embeddings.
    2. CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs.
    3. Maestro: Automatic Generation of Comprehensive Benchmarks for Question Answering Over Knowledge Graphs.
  Speakers:
    1. Yansong Li
    2. Gurkirat Dhatt
    3. Kailash Balakrishnan

November 7
  Topics: AI Applications in Medical Data; AI Applications in Geospatial Data; Knowledge Graphs
  Papers:
    1. Medical Entity Disambiguation Using Graph Neural Networks.
    2. QARTA: An ML-based System for Accurate Map Services.
    3. Saga: A Platform for Continuous Construction and Serving of Knowledge At Scale.
  Speakers:
    1. Sam Serdah
    2. Moses Muwanga
    3. Zabih ur Rehman Bilal

November 14
  Topics: Entity Matching; Database Tuning
  Papers:
    1. Analyzing How BERT Performs Entity Matching.
    2. DB-BERT: A Database Tuning Tool that "Reads the Manual".
  Speakers:
    1. Xuanyu Su
    2. Mohamed Basyouni

November 21
  Topics: Data Discovery; Text-to-SQL; Document Search
  Papers:
    1. Discovering Related Data At Scale.
    2. An In-Depth Benchmarking of Text-to-SQL Systems.
    3. JEDI: These aren't the JSON documents you're looking for....
  Speakers:
    1. Kaniz Sinethyah
    2. Amy Wang
    3. Alex Leslie

November 28
  Topics: Anomaly Detection; Social Media Data
  Papers:
    1. Anomaly Detection in Time Series: A Comprehensive Evaluation.
    2. Popularity Prediction for Social Media over Arbitrary Time Horizons.
  Speakers:
    1. Owen Brouse
    2. Hashim Awan

December 5
  Topics: Data Lakes; Sentiment Analysis
  Papers:
    1. Photon: A Fast Query Engine for Lakehouse Systems.
    2. Quality of Sentiment Analysis Tools: The Reasons of Inconsistency.
  Speakers:
    1. Swapneeth Gorantla
    2. Kamal Chahrour

Papers Pool