COMP 5118 - Trends in Big Data Management

COMP 5118

Fall 2025

Course Overview

Welcome to COMP 5118! This is a graduate-level course for students at Carleton University and the University of Ottawa. Each year, we focus on cutting-edge research topics in the general field of data management. These topics change from one offering to another depending on what's new and hot.

This term, we will explore a variety of exciting areas that represent the forefront of data management research. This will give you a strong understanding of what the research community is currently working on and hopefully inspire ideas for your course project, which you should take very seriously.

This Semester's Topics Include:

  • Knowledge Graphs
  • Geospatial Data
  • LLM-Applications
  • Time Series
  • Healthcare Analytics
  • Natural Language to SQL
  • Data Cleaning
  • Data Lakes
  • Efficiency
  • Entity Alignment

Grading Scheme

In this course, students read and review the assigned papers before each class. During class, some students present that week's papers, and we all discuss them. There is also a term-long project, which carries the largest share of your grade.

  • Project 45%
  • Presentations 20%
  • Paper Reviews 20%
  • Class Participation 15%
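
As a rough illustration of how these weights add up, here is a minimal sketch in Python. The weights come from the list above; the function name `final_grade`, the 0-100 scale for each component, and the plain weighted sum are assumptions made for the example, not the official grade calculation.

```python
from typing import Mapping

# Weights from the grading scheme; the keys are illustrative names.
WEIGHTS = {
    "project": 0.45,
    "presentations": 0.20,
    "paper_reviews": 0.20,
    "class_participation": 0.15,
}


def final_grade(scores: Mapping[str, float]) -> float:
    """Combine per-component scores (assumed 0-100) into a weighted total.

    Assumption: components are combined as a plain weighted sum.
    """
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing component scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {
        "project": 85.0,
        "presentations": 90.0,
        "paper_reviews": 80.0,
        "class_participation": 100.0,
    }
    print(round(final_grade(example), 2))
```

Because the project weight is more than double any other component, a change in the project score moves the total more than the same change anywhere else.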

Project (45%)

The project can be done individually or in groups (assessment will consider group size) and can be one of the following types:

  1. New Research Idea: A prototype implementation of a new research idea that addresses a limitation in existing work or is a completely new idea inspired by your readings.
  2. Experimental Study: A comprehensive experimental comparison and evaluation of existing work on a specific topic. The goal is to provide new insights not present in the original papers.
  3. Survey: A paper that summarizes, categorizes, and provides new insights on a major research area. See this good example.
  4. System Implementation & Reproducibility: Implement and reproduce the results of a system from a published paper. I have a number of systems in mind.

Project Deliverables:

  • Project Proposal: Max 2 pages in ACM Format (LaTeX mandatory). Due: October 27 at 11:59 PM. Early submissions are STRONGLY encouraged. Submit via this form.
  • Project Paper: Min 7 pages in ACM Format. Due: December 8 at 11:59 PM (hard deadline for late submissions: Dec 15). Submit via this form.
  • Source Code: Publicly available on GitHub with a good README. Link must be in the project paper.

Presentations (20%)

Each presentation should be 30-45 minutes, followed by a 30-45 minute discussion. The presenter is responsible for leading the discussion.

Paper Reviews (20%)

Reviews are due at 11:00 AM on the day of the class via the Paper Review Submission Link. Format: Summary, 3+ strong points, 3+ weak points, and additional comments. Your two worst reviews will be dropped.
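
To make the "two worst reviews dropped" rule concrete, here is a small sketch in Python. It assumes each review receives a numeric score and that the remaining scores are averaged; the page does not say how reviews are scored or combined, so the function name `review_component` and the averaging step are hypothetical.

```python
from typing import Sequence


def review_component(review_scores: Sequence[float], dropped: int = 2) -> float:
    """Average the review scores after discarding the `dropped` lowest ones.

    Assumption: reviews are scored numerically and the kept ones are averaged.
    """
    if len(review_scores) <= dropped:
        raise ValueError("need more reviews than the number dropped")
    kept = sorted(review_scores)[dropped:]  # ascending sort, so the lowest go first
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # Two weak reviews (2 and 4) are ignored; the rest are averaged.
    print(review_component([10, 9, 4, 8, 2]))
```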

How to Write a Good Review:
  1. Do not copy/paste from the paper. Use your own words.
  2. Enumerate strong and weak points; do not write one large paragraph.
  3. Justify your points. Why is a deep learning approach good *in this specific case*?
  4. Elaborate! "Using templates" is not a strong point on its own. Explain why it is effective.
  5. Focus on the technical substance, not just the writing style.
  6. Avoid stating the obvious as a strong point (e.g., "they beat the state-of-the-art").
References: Reading a CS Paper, How to Read a Paper, and a real-world sample review.

Class Participation (15%)

This is a seminar-based class, so your participation is essential. You are encouraged to ask questions, answer other students' questions, and comment on the papers we discuss.

Papers List

Topic Title Mandatory
Knowledge Graphs
  • CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs (2021).
  • Maestro: Automatic Generation of Comprehensive Benchmarks for Question Answering Over Knowledge Graphs (2023).
  • The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing (2017).
  • Real-time Data Infrastructure at Uber (2021).
  • Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs (2025).
Geospatial Data
  • QARTA: An ML-based System for Accurate Map Services (2021).
  • Kamel: A Scalable BERT-based System for Trajectory Imputation (2023).
  • SIMformer: Single-Layer Vanilla Transformer Can Learn Free-Space Trajectory Similarity (2024).
LLM-Applications
  • A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces (2025).
  • ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models (2023).
  • Agent-OM: Leveraging LLM Agents for Ontology Matching (2023).
  • Automating the Enterprise with Foundation Models (2024).
  • Generation of Training Examples for Tabular Natural Language Inference (2023).
  • 𝜆-Tune: Harnessing Large Language Models for Automated Database System Tuning (2025).
  • Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System (2025).
Time Series
  • An Experimental Evaluation of Anomaly Detection in Time Series (2023).
  • Anomaly Detection in Time Series: A Comprehensive Evaluation (2022).
  • Goku: A Schemaless Time Series Database for Large Scale Monitoring at Pinterest.
Healthcare Analytics
  • CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics (2024).
Natural Language to SQL
  • Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL (2024).
  • Cracking SQL Barriers: An LLM-based Dialect Translation System (2025).
  • Logical and Physical Optimizations for SQL Query Execution over Large Language Models (2025).
  • ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems (2023).
  • SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference (2025).
  • Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach (2023).
  • Sphinteract: Resolving Ambiguities in NL2SQL Through User Interaction (2024).
  • Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (2024).
  • The Dawn of Natural Language to SQL: Are We Fully Ready? (2024).
Data Cleaning
  • DataVinci: Learning Syntactic and Semantic String Repairs (2025).
  • Sparcle: Boosting the Accuracy of Data Cleaning Systems through Spatial Awareness (2024).
Data Lakes
  • LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes.
Efficiency
  • SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis (2024).
Entity Alignment
  • ZeroEA: A Zero-Training Entity Alignment Framework via Pre-Trained Language Model (2024).

Course Schedule

Sep 9
  Topics: Course Introduction
  Papers: N/A
  Speakers: Ahmed El-Roby
Sep 16
  Topics: Graph Processing; Natural Language to SQL
  Papers:
    1. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing (2017).
    2. The Dawn of Natural Language to SQL: Are We Fully Ready? (2024).
  Speakers:
    1. TBD.
    2. TBD.
Sep 23
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Sep 30
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Oct 7
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Oct 14
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Oct 21
  NO CLASS (Fall Break)
Oct 28
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Nov 4
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Nov 11
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Nov 18
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Nov 25
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.
Dec 2
  Topics: TBD
  Papers:
    1. TBD
    2. TBD
  Speakers:
    1. TBD.
    2. TBD.