## Assignment 2: Influence

The deadline is May 26th at 10:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### Insufficient

Since the beginning of time, research on causal inference required three main assumptions: that the data generating process is Faithful, follows the Causal Markov Condition, and that variables we consider are Causally Sufficient. That is to say, that the data was indeed generated from a causal model that can be represented as a DAG over precisely the given variables. While the first two can be seen as reasonable, assuming that we have measured everything relevant is rather unrealistic — and worse, whenever there does exist an unmeasured common cause (a hidden confounder), the results of methods that assume causal sufficiency will be spurious.

Recently, researchers in causality have started looking into this problem, proposing methods to do causal inference in the presence of latent confounders. Read both Wang & Blei [1] and Kaltenpoth & Vreeken [2] and discuss how these papers differ and relate. Example questions of interest include, whether the authors have the same goal? What are the basic assumptions (stated or not) made in both papers? What do they have in common and how do they differ from each other? Are these assumptions reasonable? Could you verify that they hold for real world data if you could measure anything you want? Can you come up with any examples that should work according to the papers' claims but you think in fact wouldn't? Do you find the included experiments convincing or are there setups you would like to see (in order to be convinced, rather than nice-to-haves) but which weren't included?

2. #### Independent (Hard)

For a long time, practical causal inference depended on Reichenbach's principle of common cause, which allows us to identify causality between $$X$$, $$Y$$, and $$Z$$ up to Markov equivalence classes via conditional independence tests. Modern approaches focus on identifying the most likely causal direction between pairs of variables $$X$$ and $$Y$$. One approach, based on the notion of the Additive Noise Models (ANMs), is while rather restricted in theory, relatively easy to instantiate and quite succesful in practice. Another approach, based on the algorithmic Markov condition, is very powerful in theory, but seems difficult to instantiate directly.

Read both Peters et al. [3] and Janzing & Schölkopf [4] and besides paraphrasing the main ideas, such as why and how can we use ANMs resp. Kolmogorov Complexity to identify causal directions, most importantly, investigate how the two are related. Can the one be seen as an instance of the other? What are the implications? How can we instantiate a score based on the algorithmic Markov condition? Can we plug in any compressor, like ZIP, or should it meet some special requirements? How would you go about to instantiate it?

3. #### Indiscriminate

Whenever we apply data-driven methods we have to be very careful that we do not introduce any algorithmic bias; we have to prevent our algorithms from making unfair or discriminative decisions, regardless of whether this is because of how what assumptions the method makes, or whether these biases might have been present in the data we trained the method on. Statistical parity, which assumes that all is fair if we make sure that the distribution of the sensitive attribute is the same between the groups we decide over, is a popular notion in fair machine learning, but has been attacked by Dwork et al. [5]. Bonchi et al. [6] do not mention parity at all in their paper, as they take a causality based approach. Read this paper carefully and critically. Do you find their approach convincing? Does it live up to the critiques that Dwork gives against statistical parity? Do you see potential other problems in this approach? How could it possibly go wrong? What are the main strengths, in your opinion, and what are the main weaknesses?

(Bonus) In many applications it is not so much about making a (binary) decision, but rather about ranking elements. Say, a search engine that we can query to discover who are the smartest students on campus—and of course we want to be fair with regard to hair colour. How would you solve the fair-ranking problem, if your only tool is statistical parity? And, how would you solve the same problem if you would have the SBCN?

4. #### Invariant

One of the main underpinnings of classical statistical learning theory is that we are trying to learn from data that is independently and identically distributed. That is, every datapoint was generated from the same source distribution, underwent the same function (or, model), and was affected by the same noise distribution. First of all, let's agree that this not the most realistic assumption: different environments (e.g. geographical areas) likely have different source distributions (e.g. populations). The problem is two-fold. First of all, if we would train on data from one environment (which is nicely i.i.d.) the resulting model is unlikely to do well on testing data drawn from another environment (as we break i.i.d.-ness). Second, if we simply lump data from multiple environments together, we effectively created a dataset that is differently distribution than any or even all of the environments, and whatever model we now learn assuming i.i.d. will be good for this lumped-together data, but arbitrarily bad for any of the enviroments. Oops.

To obtain models that generalize beyond the distributions that they were inferred on, recent proposals seek for models that perform well on each environment dataset, rather than on the lumped-together-data. Peters et al. [7] was one of the first to propose this idea towards obtaining robust causal networks. Read the paper, and summarize it in no more than a few lines. Arjovsky et al. [8] picked up on this idea, and boldly proposed a replacement for good-old empirical risk minimization. Summarize their idea also in a few lines only. More importantly, discuss the similarities and differences, or (in)variants between these papers, how these key ideas could be combined to obtain an even more powerful method, as well as what their key limitations are; discuss under which (realistic) circumstances they will fail. How could we fix this?

(Bonus) Not long after Arjovsky et al. [8] put their ideas online, Rosenfeld et al. [9] offered some very stern critique. Summarize this critique succinctly, and include the main points from this paper into your discussion above.

Return the assignment by email to vreeken (at) cispa.de by 26 May, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.