The deadline is June 27th at 10:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that answers the assignment questions but above all, critically discusses the material beyond these. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.
For the topic of your assignment, choose one of the following:
For a long time, practical causal inference depended on Reichenbach's principle of common cause, which allows us to identify causality between \(X\), \(Y\), and \(Z\) up to Markov equivalence classes via conditional independence tests. Modern approaches focus on identifying the most likely causal direction between pairs of variables \(X\) and \(Y\). One approach, based on the notion of the Additive Noise Models (ANMs), is while rather restricted in theory, relatively easy to instantiate and quite succesful in practice. Another approach, based on the algorithmic Markov condition, is very powerful in theory, but seems difficult to instantiate directly.
Read both Peters et al. [1] and Janzing & Schölkopf [2] and besides paraphrasing the main ideas, such as why and how can we use ANMs resp. Kolmogorov Complexity to identify causal directions, most importantly, investigate how the two are related. Can the one be seen as an instance of the other? What are the implications? How can we instantiate a score based on the algorithmic Markov condition? Can we plug in any compressor, like ZIP, or should it meet some special requirements? How would you go about to instantiate it?
One of the main underpinnings of classical statistical learning theory is that we are trying to learn from data that is independently and identically distributed. That is, every datapoint was generated from the same source distribution, underwent the same function (or, model), and was affected by the same noise distribution. This not realistic: different environments (e.g. geographical areas) likely have different source distributions (e.g. populations). To obtain models that generalize beyond the distributions that they were inferred on, recent proposals seek models that perform well on all individual environments (datasets) rather than just on average over all data lumped-together.
Peters et al. [3] was one of the first to propose this idea towards obtaining robust causal networks. Read the paper, and summarize it in no more than a few lines. Arjovsky et al. [4] picked up on this idea, and boldly proposed a replacement for good-old empirical risk minimization. Summarize their idea also in a few lines only. More importantly, discuss the similarities and differences, or (in)variants between these papers, how these key ideas could be combined to obtain an even more powerful method, as well as what their key limitations are; discuss under which (realistic) circumstances they will fail. How could we fix this?
(Bonus) Not long after Arjovsky et al. [4] put their ideas online, Rosenfeld et al. [5] offered some very stern critique. Summarize this critique succinctly, and include the main points from this paper into your discussion above.
Since the beginning of time, research on causal inference required three main assumptions: that the data generating process is Faithful, follows the Causal Markov Condition, and that variables we consider are Causally Sufficient. That is to say, that the data was indeed generated from a causal model that can be represented as a DAG over precisely the given variables. While the first two can be seen as reasonable, assuming that we have measured everything relevant is rather unrealistic — and worse, whenever there does exist an unmeasured common cause (a hidden confounder), the results of methods that assume causal sufficiency will be spurious.
Recently, researchers in causality have started looking into this problem, proposing methods to do causal inference in the presence of latent confounders. Read both Wang & Blei [6] and Kaltenpoth & Vreeken [7] and discuss how these papers differ and relate. Example questions of interest include, whether the authors have the same goal? What are the basic assumptions (stated or not) made in both papers? What do they have in common and how do they differ from each other? Are these assumptions reasonable? Could you verify that they hold for real world data if you could measure anything you want? Can you come up with any examples that should work according to the papers' claims but you think in fact wouldn't? Do you find the included experiments convincing or are there setups you would like to see (in order to be convinced, rather than nice-to-haves) but which weren't included?
Under the standard assumptions for causal discovery it is possible to identify causal networks up to Markov equivalence class. That is, we end up with a partially-oriented DAG (pDAG). While nice, that's not quite what you would hope for. Wouldn't it be great if we could discover fully-oriented causal networks? Hold our beer, because Mian et al. [8] recently proposed the Globe algorithm that can do exactly that! Globe is based on the Algorithmic Markov Condition and seeks that SEM that best compresses the data, modelling edges non-parametrically using multivariate regression splines. When you think about it, this means that Globe is quite reminiscient of CAM [9], which raises the question how are the two similar and different, whether Globe is merely a 7-year delayed reinvention, or whether there are distinct advantages of the one over the other. To find good causal networks in practice, Globe greedily adds, removes, and dynamically reorients edges on the fly, by which it is also very reminiscient of GES [10], which raises the question how are the two similar and different, whether Globe is merely a 18-year delayed re-invention, or whether there are distinct advantages of the one over the other. In particular, investigate which guarantees what method provide, and in order to impress the lecturer most, try and see if and how (i.e. directly, or with additional assumptions or algorithmic changes) these can be transplanted from one to the other algorithm.
Return the assignment by email to vreeken (at) cispa.de by 27 June, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.
Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.
You will need a username and password to access the papers. Contact the lecturer if you don't know the username or password.
[1] | Identifying Cause and Effect on Discrete Data using Additive Noise Models. , pages 597-604, 2010. |
[2] | Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Technology, 56(10):5168-5194, 2010. |
[3] | Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016. |
[4] | Invariant Risk Minimization. arXiv, 2022. |
[5] | Risks of Invariant Risk Minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. |
[6] | The blessings of multiple causes. Journal of the American Statistical Association, 2019. |
[7] | We Are Not Your Real Parents: Telling Causal from Confounded by MDL. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), SIAM, 2019. |
[8] | Discovering Fully Directed Causal Networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), AAAI, 2021. |
[9] | CAM: Causal additive models, high-dimensional order search and penalized regression. The Annals of Statistics, 42(6):2526-2556, Institute of Mathematical Statistics, 2014. |
[10] | Optimal Structure Identification With Greedy Search. JMLR, 3:507-554, 2002. |