COMP 5118

Fall 2025

Course Overview

Welcome to COMP 5118! This is a graduate-level course for students at Carleton University and the University of Ottawa. Each year, we focus on cutting-edge research topics in the general field of data management. These topics change from one offering to another depending on what's new and hot.

This term, we will explore a variety of exciting areas that represent the forefront of data management research. This will give you a strong understanding of what the research community is currently working on and hopefully inspire ideas for your course project, which you should take very seriously.

This Semester's Topics Include:

Knowledge Graphs
Geospatial Data
LLM-Applications
Time Series
Healthcare Analytics
Natural Language to SQL
Data Cleaning
Data Lakes
Efficiency
Entity Alignment

Grading Scheme

In this course, students will be reading and reviewing papers for each class. During the class, some students will present the papers for the week, and we will all discuss them. There is also a term-long project, which is worth the biggest chunk of your grade.

Project 45%
Presentations 20%
Paper Reviews 20%
Class Participation 15%
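
As a rough illustration of how these weights combine, here is a minimal sketch in Python. It rests on assumptions that are not part of the official scheme: every component is scored on a 0-100 scale, the paper-review component is the plain average of individual review scores after the two lowest are dropped (per the Paper Reviews section), and the names `WEIGHTS`, `review_average`, and `final_grade` are hypothetical helpers introduced only for this example.

```python
# Component weights from the grading scheme (percent of the final grade).
WEIGHTS = {
    "project": 45,
    "presentations": 20,
    "paper_reviews": 20,
    "participation": 15,
}


def review_average(scores, dropped=2):
    """Mean of the review scores after dropping the lowest `dropped` ones.

    Assumption: reviews are weighted equally and "worst" means lowest score.
    """
    kept = sorted(scores)[dropped:]
    if not kept:
        raise ValueError("need more review scores than the number dropped")
    return sum(kept) / len(kept)


def final_grade(component_scores):
    """Weighted total, assuming each component score is on a 0-100 scale."""
    return sum(WEIGHTS[name] * component_scores[name] for name in WEIGHTS) / 100
```

For example, `review_average([50, 60, 90, 100])` keeps only the two best scores, and `final_grade` then takes one score per component keyed by the names in `WEIGHTS`.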

Project (45%)

The project can be done individually or in groups (assessment will consider group size) and can be one of the following types:

New Research Idea: A prototype implementation of a new research idea that addresses a limitation in existing work or is a completely new idea inspired by your readings.
Experimental Study: A comprehensive experimental comparison and evaluation of existing work on a specific topic. The goal is to provide new insights not present in the original papers.
Survey: A paper that summarizes, categorizes, and provides new insights on a major research area. See this good example.
System Implementation & Reproducibility: Implement and reproduce the results of a system from a published paper. I have a number of systems in mind.

Project Deliverables:

Project Proposal: Max 2 pages in ACM Format (LaTeX mandatory). Due: October 27 at 11:59 PM. Early submissions are STRONGLY encouraged. Submit via this form.
Project Paper: Min 7 pages in ACM Format. Due: December 8 at 11:59 PM (hard deadline for late submissions: Dec 15). Submit via this form.
Source Code: Publicly available on GitHub with a good README. Link must be in the project paper.

Presentations (20%)

Each presentation should be 30-45 minutes, followed by a 30-45 minute discussion. The presenter is responsible for leading the discussion.

Paper Reviews (20%)

Reviews are due at 11:00 AM on the day of the class via the Paper Review Submission Link. Format: Summary, 2+ strong points, 2+ weak points, and additional comments. Your two worst reviews will be dropped.

How to Write a Good Review:

Do not copy/paste from the paper. Use your own words.

Enumerate strong and weak points; do not write one large paragraph.

Justify your points. Why is a deep learning approach good *in this specific case*?

Elaborate! "Using templates" is not a strong point on its own. Explain why it is effective.

Focus on the technical substance, not just the writing style.

Avoid stating the obvious as a strong point (e.g., "they beat the state-of-the-art").

References: Reading a CS Paper, How to Read a Paper, and a real-world sample review.

Class Participation (15%)

This is a seminar-based class, meaning that your participation is essential. You are encouraged to ask questions, answer other students' questions, and give comments on the papers we discuss.

Papers List

Topic	Title	Mandatory
Knowledge Graphs	CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs (2021).
	Maestro: Automatic Generation of Comprehensive Benchmarks for Question Answering Over Knowledge Graphs (2023).
	The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing (2017).	✅
	Real-time Data Infrastructure at Uber (2021).
	Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs (2025).
Geospatial Data	QARTA: An ML-based System for Accurate Map Services (2021).
	Kamel: A Scalable BERT-based System for Trajectory Imputation (2023).
	SIMformer: Single-Layer Vanilla Transformer Can Learn Free-Space Trajectory Similarity (2024).
LLM-Applications	A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces (2025).
	ArcheType: A Novel Framework for Open-Source Column Type Annotation using Large Language Models (2023).
	Agent-OM: Leveraging LLM Agents for Ontology Matching (2023).
	Automating the Enterprise with Foundation Models (2024).	✅
	Generation of Training Examples for Tabular Natural Language Inference (2023).
	λ-Tune: Harnessing Large Language Models for Automated Database System Tuning (2025).
	Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System (2025).
Time Series	An Experimental Evaluation of Anomaly Detection in Time Series (2023).
	Anomaly Detection in Time Series: A Comprehensive Evaluation (2022).
	Goku: A Schemaless Time Series Database for Large Scale Monitoring at Pinterest.
Healthcare Analytics	CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics (2024).
Natural Language to SQL	Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL (2024).
	Cracking SQL Barriers: An LLM-based Dialect Translation System (2025).
	Logical and Physical Optimizations for SQL Query Execution over Large Language Models (2025).
	ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems (2023).
	SNAILS: Schema Naming Assessments for Improved LLM-Based SQL Inference (2025).
	Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach (2023).
	Sphinteract: Resolving Ambiguities in NL2SQL Through User Interaction (2024).
	Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (2024).	✅
	The Dawn of Natural Language to SQL: Are We Fully Ready? (2024).	✅
Data Cleaning	DataVinci: Learning Syntactic and Semantic String Repairs (2025).
	Sparcle: Boosting the Accuracy of Data Cleaning Systems through Spatial Awareness (2024).
Data Lakes	LakeBench: A Benchmark for Discovering Joinable and Unionable Tables in Data Lakes.
Efficiency	SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis (2024).
Entity Alignment	ZeroEA: A Zero-Training Entity Alignment Framework via Pre-Trained Language Model (2024).

Course Schedule

Date	Topics	Papers	Speakers
Sep 9	Course Introduction	N/A	Ahmed El-Roby
Sep 16	Graph Processing; Natural Language to SQL	The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing (2017). The Dawn of Natural Language to SQL: Are We Fully Ready? (2024).	1. Inara Hussain. 2. Jai Rana and Caden Snelling.
Sep 23	Knowledge Graphs; LLM Applications	Real-time Data Infrastructure at Uber (2021). Automating the Enterprise with Foundation Models (2024).	1. Jeremy Fang. 2. Tina Tasavori.
Sep 30	Anomaly Detection; Time Series	Anomaly Detection in Time Series: A Comprehensive Evaluation (2022). Goku: A Schemaless Time Series Database for Large Scale Monitoring at Pinterest.	1. Jakob Nix. 2. Alexis Udechukwu and Julie Wechsler.
Oct 7	Data Cleaning; Healthcare Analytics	Agent-OM: Leveraging LLM Agents for Ontology Matching (2023). CohortNet: Empowering Cohort Discovery for Interpretable Healthcare Analytics (2024).	1. Jason Au. 2. Minjie Cai.
Oct 14	Anomaly Detection; LLM Applications	An Experimental Evaluation of Anomaly Detection in Time Series (2023). A Cost-Effective LLM-based Approach to Identify Wildlife Trafficking in Online Marketplaces (2025).	1. William Zhu and Daniel Maniuk. 2. Tom Fan and Rhishita Mondal.
Oct 21	NO CLASS (Fall Break)
Oct 28	Natural Language to SQL; LLM Applications	Combining Small Language Models and Large Language Models for Zero-Shot NL2SQL (2024). Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System (2025).	1. Avery Robertson*. 2. Paul Roode.
Nov 4	Geospatial Data; Natural Language to SQL	QARTA: An ML-based System for Accurate Map Services (2021). Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation (2024).	1. Mohammed Al Amri and Awwab Mahdi. 2. Khizar Malik and Abdulla Abdalla.
Nov 11	Natural Language to SQL; Efficiency	Logical and Physical Optimizations for SQL Query Execution over Large Language Models (2025). SplitDF: Splitting Dataframes for Memory-Efficient Data Analysis (2024).	1. Yi Long Huang*. 2. Zhuoran Liu and Hundey Kuma.
Nov 18	Knowledge Graphs	CBench: Towards Better Evaluation of Question Answering Over Knowledge Graphs (2021). Maestro: Automatic Generation of Comprehensive Benchmarks for Question Answering Over Knowledge Graphs (2023).	1. Moshope Lawal and Mosope Oluwashina. 2. Muhammed Arfath Khan.
Nov 25	Knowledge Graphs; Data Cleaning; Entity Alignment	Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs (2025). DataVinci: Learning Syntactic and Semantic String Repairs (2025). ZeroEA: A Zero-Training Entity Alignment Framework via Pre-Trained Language Model (2024).	1. Do Quang Minh Phan. 2. Abdelrahman Abdelkader. 3. Nnamdi Obasi.
Dec 2	Natural Language to SQL; LLM Applications	Cracking SQL Barriers: An LLM-based Dialect Translation System (2025). Generation of Training Examples for Tabular Natural Language Inference (2023). Pneuma: Leveraging LLMs for Tabular Data Representation and Retrieval in an End-to-End System (2025).	1. Michael Han. 2. Gamage Perera. 3. Paul Roode.