Discovering causal relationships from observational data is a central and challenging problem in artificial intelligence and data analysis, and is crucial for many scientific and practical applications. Yet, in practice, real data often violate the assumptions that standard causal discovery methods rely on. In this tutorial, we introduce techniques for discovering causal relationships even when faced with hidden variables, selection bias, data heterogeneity, and non-stationarity. We start from the basics of causality and move toward advanced methods designed for real-world, imperfect data.


Outline

The study of cause and effect has deep roots in philosophy and has long been central in fields such as medicine, economics, and the social sciences. In recent decades, the development of formal frameworks — in particular, the theory of causal graphical models pioneered by Judea Pearl and the framework of functional causal models — has enabled major progress in understanding and discovering causal relationships from observational data. This has led to a surge of interest in causal discovery within artificial intelligence and data analysis over the past twenty years.

However, in real-world settings, the classical assumptions underlying most causal discovery algorithms rarely hold: not all relevant variables are measured, datasets may suffer from selection bias, data sources may be heterogeneous, and the underlying data-generating processes may shift over time.

In this tutorial, we will give a gentle introduction to the foundational notions of causal discovery and illustrate the various ways in which these notions fail to connect with reality. We will discuss the challenges posed by hidden confounders, selection bias, and heterogeneous and non-stationary data, and for each of these cases we will explain theory and methods that alleviate, and sometimes fully circumvent, these problems, allowing us to extract high-quality causal networks from data collected in a non-ideal world. We will conclude with a presentation of open research problems.

CDRW will be a 2-hour tutorial at SIAM SDM 2025.

Part 1: The Basics of Causal Discovery


slides for part 1

First, we discuss why we should consider causality at all. We then give a gentle introduction to how we can discover causality from data. We will discuss causal graphs, Pearl's framework, and why it is unavoidable to make assumptions if we want to discover causality from data. We then introduce four common approaches to causal discovery: constraint-based (e.g., PC), restricted-model-class-based (e.g., LiNGAM), score-based (e.g., GES), and continuous-optimization-based (e.g., NOTEARS) strategies.
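To give a flavor of the constraint-based idea behind methods like PC, the following minimal sketch uses only the Python standard library. The toy data-generating process (a chain X → Y → Z) and all variable names are our own illustration, not material from the tutorial: in a chain, X and Z are marginally dependent but become independent once we condition on Y, which is exactly the kind of (conditional) independence pattern constraint-based methods exploit.

```python
import math
import random

random.seed(0)

def corr(a, b):
    """Pearson correlation of two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

# Simulate the causal chain X -> Y -> Z (toy example, linear Gaussian).
n = 5000
X = [random.gauss(0, 1) for _ in range(n)]
Y = [2.0 * x + random.gauss(0, 1) for x in X]
Z = [1.5 * y + random.gauss(0, 1) for y in Y]

r_xz, r_xy, r_yz = corr(X, Z), corr(X, Y), corr(Y, Z)

# Partial correlation of X and Z given Y: near zero, because Y
# screens X off from Z in the chain.
r_xz_given_y = (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy ** 2)
                                                * (1 - r_yz ** 2))
```

Constraint-based algorithms run many such (conditional) independence tests and keep only the graphs consistent with all of them; the tutorial covers how this scales to full networks.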

Part 2: Confounding, Selection Bias, and Measurement Errors


slides for part 2

We then move on to what goes wrong when the most common assumptions in causal discovery do not hold. We discuss how hidden confounders, selection bias, and measurement errors can distort our results towards causal gobbledygook, and why "controlling for" various biasing factors is much harder in practice than one might think. We also explore how we can detect and alleviate these problems in theory and practice.
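The confounding problem can be seen in a few lines of standard-library Python. The data-generating process below is our own toy illustration: a hidden variable H drives both X and Y, which are causally unrelated to each other. X and Y end up strongly correlated, and the spurious dependence only disappears if we can condition on H, which is precisely what we cannot do when H is unmeasured.

```python
import math
import random

random.seed(0)

def corr(a, b):
    """Pearson correlation of two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

# H is a hidden common cause of X and Y; there is NO edge between X and Y.
n = 5000
H = [random.gauss(0, 1) for _ in range(n)]
X = [h + random.gauss(0, 1) for h in H]
Y = [h + random.gauss(0, 1) for h in H]

r_xy = corr(X, Y)                      # strong, purely spurious dependence
r_xh, r_yh = corr(X, H), corr(Y, H)

# Conditioning on H removes the dependence -- but only if H is measured.
r_xy_given_h = (r_xy - r_xh * r_yh) / math.sqrt((1 - r_xh ** 2)
                                                * (1 - r_yh ** 2))
```

When H is hidden, a naive algorithm would happily draw a causal edge between X and Y; the tutorial discusses methods that instead detect and account for such latent confounding.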

Part 3: Heterogeneous Data


slides for part 3
Most methods for data mining, machine learning, and causal discovery assume that we have one homogenous dataset. But what if we have multiple datasets, e.g. from different hospitals? Naively combining datasets together can lead to non-sensical results (e.g., Simpson's paradox), but by employing methods that make use of the heterogeneity of these datasets we can leverage diverse data to obtain invariant causal mechanisms.

Part 4: Time Series


slides for part 4
The notion of time can help determine causality, especially in the presence of feedback mechanisms (cycles). However, methods that leverage time face a dimensionality problem due to the potential delays between causes and effects that must be considered, and they make other restrictive assumptions such as stationarity. We will see how nonstationarity can be understood through the prism of causality. We will then see strategies to capture nonstationary mechanisms and integrate them into the learned causal models.
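The delay problem can be illustrated with a small standard-library sketch (the data-generating process and the candidate-lag range are our own toy assumptions, not from the tutorial): Y is driven by X with a fixed delay of two time steps, and only by scanning over candidate lags do we recover the dependence, which is why the set of lags to consider blows up the search space.

```python
import math
import random

random.seed(1)

def corr(a, b):
    """Pearson correlation of two equal-length samples."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / math.sqrt(sum((x - ma) ** 2 for x in a)
                           * sum((y - mb) ** 2 for y in b))

# Y follows X with a delay of exactly 2 time steps (toy example).
T = 3000
X = [random.gauss(0, 1) for _ in range(T)]
Y = [0.9 * X[t - 2] + random.gauss(0, 0.5) if t >= 2
     else random.gauss(0, 0.5)
     for t in range(T)]

def lagged_corr(x, y, lag):
    """Correlation between x[t] and y[t + lag]."""
    return corr(x[:T - lag], y[lag:]) if lag > 0 else corr(x, y)

# Scanning over candidate delays recovers the true lag of 2.
best_lag = max(range(5), key=lambda lag: abs(lagged_corr(X, Y, lag)))
```

With many variables and many candidate lags, this scan becomes a combinatorial search, and it silently assumes the lag-2 mechanism never changes over time; the tutorial covers how nonstationarity breaks, and can also aid, such analyses.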

Part 5: Open Problems


slides for part 5
At this stage, we have addressed many key problems of standard causal discovery, but there remain many open problems. In this part we will identify a few of these, such as discrete and mixed-type data, non-identical variable sets, causal representation learning, and cyclic causal relationships.

Speakers

David Kaltenpoth
Post-Doc
Lénaïg Cornanguer
Post-Doc

Co-Organizers

Sarah Mameche
PhD Student
Jilles Vreeken
Professor

Affiliation


The 2025 tutorial on Causal Discovery in a Non-Ideal World is organized in conjunction with SDM.