Overview

Welcome to the web page of COMP 5118 - Trends in Big Data Management. This is a graduate-level course for students at Carleton University and the University of Ottawa. Each year we focus on a set of research topics in the general field of data management. These topics change from one course offering to the next depending on what's new and hot. This term, we focus on the following topics: Question Answering, Knowledge Graphs, Data Cleaning, Data Integration, Graph Processing, Data Lake Management, Crowdsourcing, Data Exploration, and Training via Weak Supervision. Check the schedule below for the list of papers that we will discuss this term. Most of the papers we will cover are very recent and were published in top-tier conferences. This should give us a rough idea of what the data management research community is currently working on. Psst, this will also (hopefully) give you ideas for the course project, which you should take very seriously.

The class is on Tuesday from 11:35 am to 2:25 pm. The class will take place via Zoom. Links for each class will be posted on this page in the schedule table below.

Contact Information

Herzberg Laboratories 5433
1125 Colonel By Dr
Ottawa, Ontario K1S 5B6

613-520-2600 ext. 4254
myFirstName.myLastNameWithoutHyphen@carleton.ca

There's also this anonymous feedback form, in which you can swear at me. But during the swearing spree, please give me some constructive feedback.

Grading

In this course, students will read and review papers for each class. During the class, some students will present the papers for the week, and then they and the rest of the class (including me) will discuss these papers and our take on them. There is also a term-long project, which is worth the biggest chunk of your grade. The grade breakdown is as follows:

  • Project 45%
  • Presentations 20%
  • Paper Reviews 20%
  • Class Participation 15%
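
As a rough illustration of how the breakdown above combines, here is a minimal Python sketch that computes a final grade as a weighted sum. It assumes every component is scored out of 100; the function name `final_grade`, the scoring scale, and the sample scores are hypothetical and are not part of the course policy.

```python
# Weights taken from the grade breakdown above (they sum to 100%).
WEIGHTS = {
    "project": 0.45,
    "presentations": 0.20,
    "paper_reviews": 0.20,
    "class_participation": 0.15,
}


def final_grade(scores):
    """Return the weighted final grade, assuming each score is out of 100."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores, for illustration only.
example = {
    "project": 80,
    "presentations": 90,
    "paper_reviews": 70,
    "class_participation": 100,
}
print(final_grade(example))
```

Because the project carries 45% of the weight, a 10-point change in the project score moves the final grade by 4.5 points under this assumption.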

Project

The research project could be any of the following:

  1. New research idea: A prototype implementation of a new research idea that addresses one of the drawbacks or limitations of an existing research work, or a completely new research idea that is inspired by any of your readings.
  2. Experimental Study: An experimental comparison and evaluation of existing work in a specific research topic. Students are not supposed to reimplement all of the existing solutions. Rather, they should be able to reuse an existing code base with minor changes to run the benchmark. The main contribution of the study is to provide insights that were not previously available about the systems used in the evaluation.
  3. Survey: With the extensive research efforts in the topics covered in this course, a survey paper should summarize and categorize the major research contributions in a specific area. The survey should not be a mere summarization of existing papers. Rather, the students should provide their own insight on the surveyed body of work. For example, they can provide a categorization or a taxonomy that highlights the major research directions in that area. Students can also identify the open research problems that were hardly addressed in the literature.
  4. System Implementation and Reproducibility: I have a number of systems I would like implemented. Your project could be to choose one of them, implement it, and reproduce the reported results.

The project can be done individually or in groups. However, the assessment will take into consideration how many students are in the group. E.g., if one student demonstrates contributions in her/his project that are equal to the contributions of a team of three students, students should expect a high variance in grades.

The project deliverables will be:

  1. Project Proposal: This should be a proposal of at most two pages (including references) in ACM Proceedings Format (LaTeX is mandatory). To write a good proposal, I strongly suggest reading Jennifer Widom's tips for writing introductions. I also strongly suggest reading the whole document as it is helpful for writing research papers in general. This proposal is due on February 25. If you have a solid idea that you would like to submit before the deadline to get better feedback and give yourself more time to work on the project, early submissions of the proposal are STRONGLY encouraged.
  2. Project Paper: Again, in ACM Proceedings Format. This should be at least 7 pages including references. Depending on the size of the group and contributions, the paper could be longer, so there is no upper page limit. The due date for the project paper is April 12 (11:59 PM). Late submissions are allowed for two more weeks, with a hard deadline for submission on April 26.
  3. Source Code: Your source code is expected to be publicly available on GitHub. The GitHub link for your project should be in the project paper. Please write a good README that clearly describes how to run your code. The due date for the project source code is the same as for the project paper.

Presentations

There will be 22 presentations throughout the term. This workload may not be evenly distributed over the students taking this class. Therefore, a student who gives one more presentation than the average will get a bonus. Each presentation should be 30 to 45 minutes long, followed by 30 to 45 minutes of discussion of the paper. The presenter should not only present the details of the paper, but also suggest discussion points at the end of his/her presentation.

Paper Reviews

The paper reviews are due at 11:00 AM on the day of the class. The format for the review is fixed: a summary of the paper, three or more strong points, three or more weak points, and any additional comments you may have on the paper. The number of required fields is small, but you are expected to elaborate. As a rule of thumb, if your review is written in a Word document, it should be at least one page long in 12 pt font. Your two worst reviews will not count towards your grade.
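
To make the "two worst reviews" rule concrete, here is a small Python sketch. It assumes that each review receives a numeric score and that the review grade is the average of the scores that remain after the two lowest are discarded; the averaging rule, the function name `review_average`, and the sample scores are assumptions for illustration, not the official grading procedure.

```python
def review_average(scores, dropped=2):
    """Average the review scores after discarding the `dropped` lowest ones."""
    if len(scores) <= dropped:
        raise ValueError("need more reviews than the number dropped")
    kept = sorted(scores)[dropped:]  # sort ascending, skip the lowest scores
    return sum(kept) / len(kept)


# Hypothetical review scores out of 10, for illustration only.
print(review_average([6, 9, 4, 8, 10, 7]))
```

Here the scores 4 and 6 are ignored, and only the remaining four reviews contribute to the average.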

Here are a few comments to consider when you write your reviews:

  1. Don't copy/paste sentences from the paper. Write down your own understanding of the paper.
  2. When listing the strong and weak points, please enumerate them and don't write a single big paragraph with all the points. The writing of the paper cannot be one of the main strong or weak points.
  3. Don't just say that one of the strong points is that this paper used a deep learning approach. This is not a strong point. A strong point would be to say why you think using deep learning in this case is a good approach. Not why deep learning is good in general, but why in this case.
  4. In general, when you choose a point as a strong or weak point, elaborate on why you think it's strong or weak. I can't read minds to know what you had in mind when you chose "using templates to answer questions" (btw, that's a superficial answer. Not shaming anyone).
  5. Please get over the writing of the paper when it comes to enumerating strong or weak points. Focus on the real beef of the paper rather than presentation. I'd accept presentation as a fourth strong or weak point, not one of the main three.
  6. Don't count something that is natural for the authors to do as a strong point. For example, using a benchmark with complex questions in the evaluation. Well, this isn't a strong point. If they didn't do that, it would have been a real shitty paper. Another example would be beating the state-of-the-art systems. If they didn't, they wouldn't publish a paper.
  7. Please don't use the future work section as your answer for the sequel to the paper. That's not the point. The point is that maybe this paper gave you an idea for a new project. What would this idea be? I changed the wording of the question in the review to reflect that.

Some References:
  1. Reading a Computer Science Paper.
  2. Example of a bad review.
  3. How to Read a Paper.
  4. IMPORTANT: Here is a sample review for a rejected paper (real review). Please have a look at it. It will show you what reviews look like in real life. You'll see that strong points are not elaborated on. That's OK for this kind of review because they're reviewing the paper for acceptance or rejection. If it's accepted, they won't elaborate more on what's obvious. On the other hand, you'll find they tend to elaborate more on weak points. Sometimes the three (or more) weak points are brief, but that's because they take their time and space in the detailed comments section. Since we don't have that, you're expected to elaborate where you mention the three strong or weak points. Please read this review carefully to understand the level of feedback expected in a review. This is what's expected from you in your reviews throughout this course.

Paper Review Submission Link

Class Participation

This is a seminar-based class, meaning that your participation in the class is essential. You are encouraged to ask questions, answer other students' questions, comment on the papers we discuss, etc.

Schedule (ZOOM Class)

January 11
  Topics: Course Introduction & Recent Game Changers in Data Management
  Papers: N/A
  Speakers: Ahmed El-Roby

January 18
  Topics: Graph Processing, Internet of Things
  Papers:
    1. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing
    2. CarStream: An Industrial System of Big Data Processing for Internet-of-Vehicles
  Speakers:
    1. Vivek Thaker
    2. Yanan Mao

January 25
  Topics: Question Answering, Data Integration
  Papers:
    1. Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition
    2. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks
  Speakers:
    1. Evelyn Yang
    2. Jiahe Geng

February 1
  Topics: Question Answering
  Papers:
    1. CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs
    2. Complex Factoid Question Answering with a Free-Text Knowledge Graph
  Speakers:
    1. Taoseef Ishtiak
    2. Samin Azhan

February 8
  Topics: Blockchains, Text-to-SQL
  Papers:
    1. Blockchains vs. Distributed Databases: Dichotomy and Fusion
    2. An In-Depth Benchmarking of Text-to-SQL Systems
  Speakers:
    1. Yaqing Zhu
    2. Elmira Adeeb

February 15
  Topics: Web Tables, Data Cleaning
  Papers:
    1. Generating Titles for Web Tables
    2. Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning
  Speakers:
    1. Mohammad Zarei
    2. Raha Rashid

February 22
  NO CLASS (Winter Break)

March 1
  Topics: Sentiment Analysis, Data Discovery
  Papers:
    1. Quality of Sentiment Analysis Tools: The Reasons of Inconsistency
    2. Discovering Related Data At Scale
  Speakers:
    1. Amirali Madani
    2. Elmira Adeeb

March 8
  Topics: Resource Allocation, Natural Language Processing
  Papers:
    1. Seagull: An Infrastructure for Load Prediction and Optimized Resource Allocation
    2. Adaptive Rule Discovery for Labeling Text Data
  Speakers:
    1. Masoumeh Haghighi
    2. Robin Redhu

March 15
  Topics: AI in Geospatial Applications
  Papers:
    1. Contact tracing: beyond the apps
    2. QARTA: an ML-based system for accurate map services
  Speakers:
    1. Zoya Shahcheraghi
    2. Aagyapal Kaur

March 22
  Topics: AI in Healthcare
  Papers:
    1. Medical Entity Disambiguation Using Graph Neural Networks
    2. PACE: Learning Effective Task Decomposition for Human-in-the-loop Healthcare Delivery
  Speakers:
    1. Evelyn Yang
    2. Yanan Mao

April 5
  Topics: AI in Football
  Papers:
    1. Combining Machine Learning and Human Experts to Predict Match Outcomes in Football: A Baseline Model
    2. Making Offensive Play Predictable - Using a Graph Convolutional Network to Understand Defensive Performance in Soccer
  Speakers:
    1. Satyadev Abhiram Pandravada
    2. Satyadev Abhiram Pandravada

April 12
  Topics: Data Discovery, NL to SQL
  Papers:
    1. NOAH: Interactive Spreadsheet Exploration with Dynamic Hierarchical Overviews
    2. Duoquest: A Dual-Specification System for Expressive SQL Queries
  Speakers:
    1. Vivek Thaker
    2. Taoseef Ishtiak