## Assignment 4: Because

The deadline is July 16th at 10:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### Don't Confound Yourself

Since the beginning of time, research on causal inference required three main assumptions: that the data generating process is Faithful, follows the Causal Markov Condition, and that variables we consider are Causally Sufficient. That is to say, that the data was indeed generated from a causal model that can be represented as a DAG over precisely the given variables. While the first two can be seen as reasonable, assuming that we have measured everything relevant is rather unrealistic — and worse, whenever there does exist an unmeasured common cause (a hidden confounder), the results of methods that assume causal sufficiency will be spurious.

Recently, researchers in causality have started looking into this problem, proposing methods to do causal inference in the presence of latent confounders. Read both Wang & Blei [1] and Kaltenpoth & Vreeken [2] and discuss how these papers differ and relate. Example questions of interest include, whether the authors have the same goal? What are the basic assumptions (stated or not) made in both papers? What do they have in common and how do they differ from each other? Are these assumptions reasonable? Could you verify that they hold for real world data if you could measure anything you want? Can you come up with any examples that should work according to the papers' claims but you think in fact wouldn't? Do you find the included experiments convincing or are there setups you would like to see (in order to be convinced, rather than nice-to-haves) but which weren't included?

2. #### Independence (Hard)

For a long time, practical causal inference depended on Reichenbach's principle of common cause, which allows us to identify causality between $$X$$, $$Y$$, and $$Z$$ up to Markov equivalence classes via conditional independence tests. Modern approaches focus on identifying the most likely causal direction between pairs of variables $$X$$ and $$Y$$. One approach, based on the notion of the Additive Noise Models (ANMs), is while rather restricted in theory, relatively easy to instantiate and quite succesful in practice. Another approach, based on the algorithmic Markov condition, is very powerful in theory, but much more difficult to instantiate directly.

Read both [3] and [4] and besides paraphrasing the main ideas, such as why and how can we use ANMs resp. Kolmogorov Complexity to identify causal directions, most importantly, investigate how the two are related. Can the one be seen as an instance of the other? What are the implications? How can we instantiate a score based on the algorithmic Markov condition? Can we plug in any compressor, or does it have special requirements? How would you go about to instantiate it? Would Krimp [5] do, and if so how? and so on.

3. #### The Root of All Evil

Discovering sources of viral infections is a hot area of research. Recently, within three months three groups of authors published results on how to discover culprits from incomplete data. Read these three papers [6,7,8] and discuss how (dis)similar the explored ideas and proposed solutions are. Be critical, try to see through the hype, not everything that is said to be radically different has to be such.

(Bonus) Do (any of) the proposed methods have any weaknesses? Do you see more general, more realistic problem settings can be solved? (For example, how would you extend [6] to the SIR model?) Read also [9], who consider identifying sources on dynamic, interaction graphs, without having to assume any normal infection model. This sounds very tasty, yet there is no such thing as a free lunch. Discuss the advantages and the disadvantages. In particular, draw the connections to the methods above, and investigate whether the shortest path idea can be back-ported into those settings.

4. #### Connecting the Dots

Read the following three papers, [10], [11], and [12]. All three consider vastly different and quite fascinating topics. First, discuss to what extend they convince you. For example, do you find the goals clearly defined? Do the authors make principled, or ad-hoc, choices to define the solution? Are the results convincing? Are the algorithms sufficiently evaluated? What extra experiments would you have liked to see? Can you identify possible improvements for the algorithms, how would you approach the problem? Second, and most importantly, can you connect the dots between them?

Return the assignment by email to tada-staff@mpi-inf.mpg.de by 16 July, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.

## References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

 [1] Wang, Y. & Blei, D.M. The blessings of multiple causes. Journal of the American Statistical Association, 2019. [2] Kaltenpoth, D. & Vreeken, J. We Are Not Your Real Parents: Telling Causal from Confounded by MDL. In Proceedings of the 2019 SIAM International Conference on Data Mining (SDM), SIAM, 2019. [3] Peters, J., Janzing, D. & Schölkopf, B. Identifying Cause and Effect on Discrete Data using Additive Noise Models. , pages 597-604, 2010. [4] Janzing, D. & Schölkopf, B. Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Technology, 56(10):5168-5194, 2010. [5] Vreeken, J., van Leeuwen, M. & Siebes, A. Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery, 23(1):169-214, Springer, 2011. [6] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015. [7] Sefer, E. & Kingsford, C. Diffusion Archaeology for Diffusion Progression History Reconstruction. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM), 2014. [8] Farajtabar, M., Gomez-Rodriguez, M., Du, N., Zamani, M., Zha, H. & Song, L. Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015. [9] Rozenshtein, P., Gionis, A., Prakash, B.A. & Vreeken, J. Reconstructing an Epidemic over Time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pages 1835-1844, ACM, 2016. [10] Shahaf, D. & Guestrin, C. Connecting the dots between news articles. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 623-632, 2010. [11] Shahaf, D., Horvitz, E. & Mankoff, R. Inside Jokes: Identifying Humorous Cartoon Captions. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1065-1074, ACM, 2015. [12] Hope, T., Chan, J., Kittur, A. & Shahaf, D. Accelerating Innovation Through Analogy Mining. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 235-243, ACM, 2017.