Overview

Welcome to the web page of COMP 5118 - Trends in Big Data Management. This is a grad-level course for students at Carleton University and the University of Ottawa. Each year we focus on some research topics in the general field of data management. These topics change from one course offering to another depending on what's new and hot. This term, we focus on the following topics: Question Answering, Knowledge Graphs, Internet of Things, Social Media, Graph Processing, Data Lake Management, Time Series, Sentiment Analysis, Anomaly Detection, and applications of AI in sports, health, and geospatial data. Check the schedule below to see the list of papers that we will discuss this term. Most of the papers we will cover during the term are very recent and published in top-tier conferences. This should give us a rough idea of what the data management research community is currently working on. Psst, this will also (hopefully) give you ideas for the course project, which you should take very seriously.

The class is on Tuesday from 11:35 am to 2:25 pm in RB 2308. If an in-person class is not possible for any reason, the class will be held via Zoom.

Contact Information

Herzberg Laboratories 5433
1125 Colonel By Dr
Ottawa, Ontario K1S 5B6

613-520-2600 ext. 4254
myFirstName.myLastNameWithoutHyphen@carleton.ca

There's also this anonymous feedback form, in which you can swear at me. But during the swearing spree, please give me some constructive feedback.

Grading

In this course, students will be reading and reviewing papers for each class. During the class, some students will present the papers for the week, and then they and the rest of the class (including myself) will discuss these papers. There is also a term-long project, which is worth the biggest chunk of your grade. The grade breakdown is as follows:

  • Project 45%
  • Presentations 20%
  • Paper Reviews 20%
  • Class Participation 15%
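
To see how the four weights combine, here is a minimal sketch of the arithmetic. It assumes each component is scored on a 0-100 scale, which this page does not specify; the function name and the example scores are hypothetical and only illustrate the weighting.

```python
# Weights taken from the grade breakdown above (percent of the final grade).
WEIGHTS = {
    "project": 45,
    "presentations": 20,
    "paper_reviews": 20,
    "class_participation": 15,
}


def final_grade(scores):
    """Combine per-component scores into a final percentage.

    Assumption: every score is on a 0-100 scale.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the four graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "project": 80,
    "presentations": 90,
    "paper_reviews": 70,
    "class_participation": 100,
}
print(final_grade(example))  # 83.0
```

Because the project carries 45% of the total, ten points gained or lost on the project moves the final grade by 4.5 points, three times the effect of the same change in class participation.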

Project

The project could be any of the following:

  1. New research idea: A prototype implementation of a new research idea that addresses one of the drawbacks or limitations of an existing research work, or a completely new research idea that is inspired by any of your readings.
  2. Experimental Study: An experimental comparison and evaluation of existing work in a specific research topic. Students are not expected to reimplement all of the existing solutions. Rather, they should be able to reuse an existing code base with minor changes to run the benchmark. The main contribution of the benchmark is to give insights that did not exist for the systems used in the evaluation.
  3. Survey: With the extensive research efforts in the topics covered in this course, a survey paper should summarize and categorize the major research contributions in a specific area. The survey should not be a mere summarization of existing papers; rather, the students should provide their own insight on the surveyed body of work. For example, they can provide a categorization or a taxonomy that highlights the major research directions in that area. Students can also identify the open research problems that have hardly been addressed in the literature. Here is a good example of what the survey should look like.
  4. System Implementation and Reproducibility: I have a number of systems I would like implemented. For your project, you could choose any of them, implement it, and reproduce the reported results.

The project can be done individually or in groups. However, the assessment will take into consideration how many students are in the group. For example, if one student's project demonstrates contributions equal to those of a team of three students, students should expect a high variance in grades.

The project deliverables will be:

  1. Project Proposal: This should be a proposal of at most two pages (including references) in ACM Proceedings Format (LaTeX is mandatory). To write a good proposal, I strongly suggest reading Jennifer Widom's tips for writing introductions. I also strongly suggest reading the whole document, as it is helpful for writing research papers in general. This proposal is due on February 17 at 11:59 PM. The proposal can be submitted using this form. If you have a solid idea that you would like to submit before the deadline to get better feedback and give yourself more time to work on the project, early submissions of the proposal are STRONGLY encouraged.
  2. Project Paper: Again, in ACM Proceedings Format. This should be at least 7 pages including references. Depending on the size of the group and its contributions, the paper could be longer, so there is no page limit. The due date for the project paper is April 12 at 11:59 PM. Late submissions are allowed for two more weeks, with a hard deadline on April 27 at 11:59 PM. The project can be submitted using this form.
  3. Source Code: Your source code is expected to be publicly available on GitHub. The GitHub link for your project should be in the project paper. Please write a good README that clearly describes how to run your code. The due date for the project source code is the same as for the project paper.

Presentations

There will be 17 presentations throughout the term. This workload may not be evenly distributed over the students taking this class. Therefore, a student who gives one more presentation than the average will get a bonus. Each presentation should be 30 to 45 minutes long, followed by 30 to 45 minutes of discussion of the paper. The presenter should not only present the details of the paper, but also suggest discussion points at the end of the presentation.

Paper Reviews

The paper reviews are due at 11:00 AM on the day of the class. The format for the review is fixed: a summary of the paper, three or more strong points, three or more weak points, and any additional comments you may have on this paper. The number of fields required is small, but you are expected to elaborate. Theoretically, if your review is written in a Word document, it should be at least one page long in 12 pt font. Your two worst reviews will not count towards your grade.

Here are a few comments to consider when you write your reviews:

  1. Don't copy/paste sentences from the paper. Write down your own understanding of the paper.
  2. When listing the strong and weak points, please enumerate them and don't write a single big paragraph with all the points. The writing of the paper cannot be one of the main strong or weak points.
  3. Don't just say that one of the strong points is that this paper used a deep learning approach. This is not a strong point. A strong point would be to say why you think using deep learning in this case is a good approach. Not why deep learning is good in general, but why in this case.
  4. In general, when you choose one point as a strong or weak point, elaborate on why you think it's strong or weak. I can't read minds to know what you had in mind when you chose "using templates to answer questions" (btw, that's a superficial answer. Not shaming anyone).
  5. Please get over the writing of the paper when it comes to enumerating strong or weak points. Focus on the real beef of the paper rather than presentation. I'd accept presentation as a fourth strong or weak point, not one of the main three.
  6. Don't count something that is natural for the authors to do as a strong point. For example, using a benchmark with complex questions in the evaluation. Well, this isn't a strong point. If they didn't do that, it would have been a real shitty paper. Another example would be beating the state-of-the-art systems. If they didn't, they wouldn't publish a paper.
  7. Please don't use the paper's future work as your answer for the sequel to the paper. That's not the point. The point is that maybe this paper gave you an idea for a new project. What would this idea be? I changed the wording of the question in the review to reflect that.
Some References:
  1. Reading a Computer Science Paper.
  2. Example of a bad review.
  3. How to Read a Paper.
  4. IMPORTANT: Here is a sample review for a rejected paper (real review). Please have a look at it. It will show you what reviews look like in real life. You'll see that strong points are not elaborated on. That's OK for this kind of review because they're reviewing the paper for acceptance or rejection. If it's accepted, they won't elaborate more on what's obvious. On the other hand, you'll find they tend to elaborate more on weak points. Sometimes the three (or more) weak points are brief, but that's because they take their time and space in the detailed comments section. Since we don't have that, you're expected to elaborate where you mention the three strong or weak points. Please read this review carefully to understand the level of feedback expected in a review. This is what's expected from you in your reviews throughout this course.

Paper Review Submission Link

Class Participation

This is a seminar-based class, meaning that your participation in the class is essential. You are encouraged to ask questions, answer other students' questions, comment on the papers we discuss, etc.

Schedule

January 10
  Topics: Course Introduction
  Papers: N/A
  Speakers: Ahmed El-Roby

January 17
  Topics: Graph Processing; Internet of Things
  Papers:
    1. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing
    2. CarStream: An Industrial System of Big Data Processing for Internet-of-Vehicles
  Speakers:
    1. Nirav Chhaganbhai
    2. Mohammad Yousuf

January 24
  Topics: Question Answering; Database Tuning
  Papers:
    1. CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs
    2. DB-BERT: A Database Tuning Tool that "Reads the Manual"
  Speakers:
    1. Megha Agrawal
    2. Alireza Choubineh

January 31
  Topics: System Design
  Papers:
    1. An investigation of the Therac-25 accidents
  Speakers:
    1. Tariq El Bahrawy

February 7
  Topics: NO CLASS (Sickness of Presenter)
  Papers: N/A

February 14
  Topics: AI Applications in Medical Data
  Papers:
    1. Medical Entity Disambiguation Using Graph Neural Networks
  Speakers:
    1. Oz Kilic

February 21
  Topics: NO CLASS (Winter Break)

February 28
  Topics: N/A
  Papers:
    1. Students Project Proposal Presentations
  Speakers: Everyone

March 7
  Topics: AI Applications in Geospatial Data; AI Applications in Football
  Papers:
    1. QARTA: An ML-based System for Accurate Map Services
    2. Reinforcement Learning for Football Player Decision Making Analysis
  Speakers:
    1. Booshra Nazifa Mahmud
    2. Tariq El Bahrawy

March 14
  Topics: Anomaly Detection; Time Series
  Papers:
    1. Anomaly Detection in Time Series: A Comprehensive Evaluation
    2. Time2Feat: Learning Interpretable Representations for Multivariate Time Series Clustering
  Speakers:
    1. Oz Kilic

March 21
  Topics: Sentiment Analysis; Time Series
  Papers:
    1. Quality of Sentiment Analysis Tools: The Reasons of Inconsistency
    2. Hercules Against Data Series Similarity Search
  Speakers:
    1. Megha Agrawal
    2. Booshra Nazifa Mahmud

March 28
  Topics: Data Lakes
  Papers:
    1. Popularity Prediction for Social Media over Arbitrary Time Horizons
  Speakers:
    1. Alireza Choubineh

April 4
  Topics: Internet of Things; Entity Matching
  Papers:
    1. Real-time Data Infrastructure at Uber
    2. Analyzing How BERT Performs Entity Matching
  Speakers:
    1. Nirav Chhaganbhai

April 11
  Topics: Social Media Data; Text-to-SQL
  Papers:
    1. Photon: A Fast Query Engine for Lakehouse Systems
    2. An In-Depth Benchmarking of Text-to-SQL Systems
  Speakers:
    1. Mohammad Yousuf