Overview

Welcome to the web page of COMP 5900 - Recent Trends in Big Data Management. This is a grad-level course for MSc and PhD students at Carleton University and the University of Ottawa. During this term, we are covering many areas of research in data management. These include, but are not limited to: Data Integration, Data Cleaning, Core Database Technologies, Large-scale Data Management, Internet of Things, Question Answering, Text Processing, and many more (check the schedule below to see the reading material). Most of the papers we will be covering during the term are very recent and were published in top-tier conferences. This should give us a chance to see what people are currently working on in the areas of research we discuss in this course. Psst, this will also (hopefully) give you ideas for the course project, which you should take very seriously.

Contact Information

Herzberg Laboratories 5433
1125 Colonel By Dr
Ottawa, Ontario K1S 5B6

613-520-2600 ext. 4254
myFirstName.myLastNameWithoutHyphen@carleton.ca

There's also this anonymous feedback form, in which you can swear at me. But during the swearing spree, please give me some constructive feedback.

Grading

In this course, students will be reading and reviewing papers for each class. During the class, some students will present the papers for the week, and then they and the rest of the class (including me) will discuss these papers and our take on them. There is also a term-long project, which is worth the biggest chunk of your grade. The marks breakdown is as follows:

  • Project 45%
  • Presentations 25%
  • Paper Reviews 15%
  • Class Participation 15%
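
As a rough illustration of how these weights combine into a final mark, here is a minimal sketch. The function name and the example scores are made up for illustration, and it assumes each component is scored out of 100; actual grading may involve adjustments not described here.

```python
# Weights taken from the marks breakdown above (percent of the final grade).
WEIGHTS = {
    "project": 45,
    "presentations": 25,
    "paper_reviews": 15,
    "class_participation": 15,
}


def final_grade(scores):
    """Combine per-component scores (each assumed to be 0-100) into a mark out of 100."""
    # Sanity check: the weights should cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example = {
    "project": 80,
    "presentations": 90,
    "paper_reviews": 70,
    "class_participation": 100,
}
print(final_grade(example))
```

Since the project carries 45% of the weight, a 10-point change in the project score moves the final mark by 4.5 points, three times the effect of the same change in paper reviews or class participation.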

Project

The research project could be any of the following:

  1. New research idea: A prototype implementation of a new research idea that addresses one of the drawbacks or limitations of an existing research work, or a completely new research idea that is inspired by any of your readings.
  2. Experimental Study: An experimental comparison and evaluation of existing work in a specific research topic. Students are not supposed to reimplement all of the existing solutions. Rather, they should be able to reuse an existing code base with minor changes to run the benchmark. The main contribution of the benchmark is to give insights that the original evaluations of these systems did not provide.
  3. Survey: With the extensive research efforts in the topics covered in this course, a survey paper should summarize and categorize the major research contributions in a specific area. The survey should not be a mere summarization of existing papers. Rather, the students should provide their own insight on the surveyed body of work. For example, they can provide a categorization or a taxonomy that highlights the major research directions in that area. Students can also identify the open research problems that were hardly addressed in the literature.
  4. System Implementation and Reproducibility (must be individual project): I have a number of systems I would like implemented. For your project, you could choose any of them, implement it, and reproduce the reported results.

The project can be done individually or in groups (except the system implementation). However, the assessment will take into consideration how many students are in the group. E.g., if one student demonstrates contributions in her/his project that are equal to the contributions of a team of three students, students should expect a high variance in grades.

The project deliverables will be:

  1. Project Proposal: This should be a two-page proposal (including references) in ACM Proceedings Format. To write a good proposal, I strongly suggest reading Jennifer Widom's tips for writing introductions. I also strongly suggest reading the whole thing as it's helpful for writing research papers in general. This proposal is due on February 22nd. If you have a solid idea that you would like to submit before the deadline to get better feedback and give yourself more time to work on the project, early submissions of the proposal are encouraged.
  2. Project Paper: Again, in ACM Proceedings Format. This should be at least 7 pages including references. Depending on the size of the group and contributions, the paper could be longer, so there is no page limit. The due date for the project paper is April 10th.
  3. Source Code: Your source code is expected to be publicly available on GitHub. The GitHub link for your project should be in the project paper. Please write a good README that clearly describes how to run your code. The due date for the project source code is April 10th.
  4. Project Papers Review: You will be assigned a couple of other students' project papers to review. You will use the same review format as for the papers we discuss in class. You can use the same submission link (below in the Paper Reviews Section). This is the last deliverable of the course. Its due date is April 12th.

Presentations

There will be 26 papers that need to be presented by you throughout the term. This workload may not be evenly distributed over the students taking this class. Therefore, a student who gives one more presentation than the average will get a bonus. Each presentation should be 25 to 30 minutes, followed by 20 to 25 minutes of discussion of the paper. The presenter should not only present the details of the paper, but also suggest discussion points at the end of his/her presentation.

Paper Reviews

The paper reviews are due at 8:35 AM on the day of the class. The format for the review is fixed: Summary of the paper, three or more strong points, three or more weak points, and any additional comments you may have on this paper. The number of fields required is small, but you are expected to elaborate. As a rule of thumb, if your review is written in a Word document, it should be at least one page long in 12 pt font. You'll be given a free pass for three reviews throughout the term.

Paper Review Submission Link

Class Participation

This is a seminar-based class, meaning that your participation in the class is essential. You are encouraged to ask questions, answer other students' questions, comment on the papers we discuss, etc.

Schedule

January 9th
Topics: Course Introduction & Recent Game Changers in Data Management
Papers: N/A
Speakers: Ahmed El-Roby

January 16th
Topics: Text Processing, Large Scale Data Management
Papers:
  1. Scalable Semantic Querying of Text
  2. Navigating the Data Lake with DATAMARAN: Automatically Extracting Structure from Log Datasets
Speakers:
  1. Jeffery Zhang
  2. Khadija Osman

January 30th
Topics: Large Scale Data Management, Internet of Things
Papers:
  1. RHEEM: Enabling Cross-Platform Data Processing
  2. CarStream: An Industrial System of Big Data Processing for Internet-of-Vehicles
  3. Frontier: Resilient Edge Processing for the Internet of Things
Speakers:
  1. Davoud Saljoughi Badlou
  2. Sungwon Hong
  3. Ziaullah Dawrankhil

February 6th
Topics: Question Answering, Data Transformation, Semantic Web
Papers:
  1. Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition
  2. Transform-data-by-example (TDE): an extensible search engine for data transformations
  3. A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data
Speakers:
  2. Norbert Ake
  3. Kalonji Kalala

February 27th
Topics: Recommendation Systems, Data Integration, Data Cleaning
Papers:
  1. Heterogeneous Recommendations: What You Might Like To Read After Watching Interstellar
  2. Stitching Web Tables for Improving Matching Quality
  3. Katara: A data cleaning system powered by knowledge bases and crowdsourcing
Speakers:
  1. Davoud Saljoughi Badlou
  2. Cynthia Amanyunose

March 6th
Topics: Data Integration
Papers:
  1. Auto-Join: Joining Tables by Leveraging Transformations
  2. Table Union Search on Open Data
Speakers:
  1. Jeffery Zhang
  2. Norbert Ake

March 13th
Topics: Data Cleaning
Papers:
  1. Auto-Detect: Data-Driven Error Detection in Tables
  2. HoloClean: Holistic Data Repairs with Probabilistic Inference
Speakers:
  1. Ziaullah Dawrankhil
  2. Khadija Osman

March 20th
Topics: Graph Processing
Papers:
  1. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing
  2. Experimental Analysis of Distributed Graph Systems
Speakers:
  1. Kalonji Kalala
  2. Cynthia Amanyunose

March 27th
Topics: Mining Structured Data, Semantic Web
Papers:
  1. Maverick: Discovering Exceptional Facts from Knowledge Graphs
  2. An Analytical Study of Large SPARQL Query Logs
Speakers:
  1. Cynthia Amanyunose
  2. Kalonji Kalala

April 3rd
Topics: Cloud Computing, Entity Matching
Papers:
  1. Goods: Organizing Google's Datasets
  2. Falcon: Scaling Up Hands-Off Crowdsourced Entity Matching to Build Cloud Services
Speakers:
  1. Davoud Saljoughi Badlou
  2. Jeffery Zhang

Extra Papers
Papers:
  1. LSH Ensemble: Internet-Scale Domain Search
  2. F1 query: declarative querying at scale
  3. Magellan: Toward Building Entity Matching Management Systems
  4. Azure Data Lake Store: A Hyperscale Distributed File Service for Big Data Analytics
  5. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases
  6. Building a Bw-Tree Takes More Than Just BuzzWords
  7. The Case for Learned Index Structures
  8. Understanding Workers, Developing Effective Tasks, and Enhancing Marketplace Dynamics
  9. Deep Learning for Entity Matching: A Design Space Exploration
  10. Massively Parallel Processing of Whole Genome Sequence Data: An In-Depth Performance Study
  11. How Good Are Modern Spatial Analytics Systems?
  12. Automatic Database Management System Tuning Through Large-scale Machine Learning
  13. Query-based Workload Forecasting for Self-Driving Database Management Systems
  14. Database Learning: Toward a Database that Becomes Smarter Every Time
  15. Knowledge Exploration Using Tables on the Web
  16. Inferray: Fast In-memory RDF Inference
  17. Lusail: A System for Querying Linked Data at Scale
Speakers: N/A