## Assignment 3: Interaction

The deadline is June 24th at 10:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### Explain It Like I'm Five

Discovering which points in your data 'belong together', also known as clustering, is a very important task in data analysis. Modern techniques aim to group data together in a way that is trustworthy, robust, and interpretable— or, at the very least offer an easy to understand explanation for the discovered clusters.

First, read the papers by Budhathoki & Vreeken [1] and Kim et al. [2], and discuss in what sense (if at all) these methods 'cluster' the data, in what sense they provide explanations or interpretations for 'clusters', and in what sense they are similar or different in these regards. Are their results hierarchical, or not, and in what way might that be important with regard to explanations? Next, additionally read Dalleiger & Vreeken [3] without getting scared by all the math. How does it build trust in the data decompositions it provides? In what way are these decompositions explainable, or is this just a marketing ploy?

Now we can finally discuss all methods together. A few example questions you can consider include: How does the third paper relate to the first two? Compared to the other two, what are the key differences in how Dalleiger & Vreeken [3] models the data, and how much information it actively uses? How could we extend Kim et al. [2] to incorporate other measures of trust in the data decomposition? Which of these methods do you find the most trustworthy, which of these methods provides the best explanations? why? and, perhaps most importantly, can you identify any aspects of trustworthiness and explainability that these methods do not address, but you think are particularly important?

2. #### OK, Boomer

In modern Data Mining, researchers started looking into ways of minimizing the false positives, i.e., minimizing the number of patterns discovered that are not significant. In even more modern Data Mining, Mandros et al. [4] suggest to use mutual information as an interestingness measure. In theory, we have that $$I(X,Y)=0$$ whenever $$X$$ and $$Y$$ are independent, and so, if $$I(X,Y)>0$$ we know that $$X$$ and $$Y$$ are not independent. In practice, however, the mutual information estimates from empirical data are not perfect. How did Mandros et al. solve this? It seems as if their solution guarantees no false positives. Is that true?

Finding sets of features that are informative for a target comes really close to finding Markov blankets, and established algorithms for that task often involve the use of statistical hypothesis testing. Tsamardinos et al. [5], for example, use statistical tests and mutual information. How do the methods and approaches relate? What is the benefit of Mandros et al. over that of the established method? How do you expect these two approaches to differ in terms of precision and recall? Why?

3. #### Ulterior

Discovering non-trivial interactions or correlations requires a powerful score. As we have seen, entropy has many likeable properties, especially that it can detect any type of interaction. Using entropy to define a good score for multivariate continuous-valued random variables is a bit tricky though. Nguyen et al. [6] showed with cmi that we can succesfully use cumulative entropy to this end. Soon after, they improved over cmi with mac [7] – loosely speaking, a multivariate extension of MIC, that uses cumulative entropy for maximal correlation analysis. Two years later they proposed the (so-far) ultimate approach, UdS [8], which is even better because it is universal. Well, well, well... we now have three measures that each say improves over the previous one, but we should not buy into this so easily; your task is to carefully read these three papers, connect the dots on how they relate, identifying their respective strengths and weaknesses, identify which corners they cut, and reason about an even more ideal score that you could imagine building upon the main ideas of these papers.

4. #### Fairly Causal

Whenever we apply data-driven methods we have to be very careful that we do not introduce any algorithmic bias; we have to prevent our algorithms from making unfair or discriminative decisions, regardless of whether this is because of how what assumptions the method makes, or whether these biases might have been present in the data we trained the method on. Statistical parity, which assumes that all is fair if we make sure that the distribution of the sensitive attribute is the same between the groups we decide over, is a popular notion in fair machine learning, but has been attacked by Dwork et al. [9]. Bonchi et al. [10] do not mention parity at all in their paper, as they take a causality based approach. Read this paper carefully and critically. Do you find their approach convincing? Does it live up to the critiques that Dwork gives against statistical parity? Do you see potential other problems in this approach? How could it possibly go wrong? What are the main strengths, in your opinion, and what are the main weaknesses?

(Bonus) In many applications it is not so much about making a (binary) decision, but rather about ranking elements. Say, a search engine that we can query to discover who are the smartest students on campus—and of course we want to be fair with regard to hair colour. How would you solve the fair-ranking problem, if your only tool is statistical parity? And, how would you solve the same problem if you would have the SBCN?

Return the assignment by email to tada-staff@mpi-inf.mpg.de by 24 June, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.

## References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

 [1] Budhathoki, K. & Vreeken, J. The Difference and the Norm -- Characterising Similarities and Differences between Databases. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD'15), Springer, 2015. [2] Kim, B., Shah, J.A. & Doshi-Velez, F. Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction. In Proceedings of the 28th conference on Advances in Neural Information Processing Systems (NeurIPS'15), pages 2260-2268, Curran, 2015. [3] Dalleiger, S. & Vreeken, J. Explainable Data Decompositions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2020. [4] Mandros, P., Boley, M. & Vreeken, J. Discovering Reliable Approximate Functional Dependencies. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 355-363, ACM, 2017. [5] Tsamardinos, I., Aliferis, C., Statnikov, A. & Statnikov, E. Algorithms for Large Scale Markov Blanket Discovery. In In The 16th International FLAIRS Conference, pages 376-380, AAAI Press, 2003. [6] Nguyen, H.V., Müller, E., Vreeken, J., Keller, F. & Böhm, K. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. In Proceedings of the 13th SIAM International Conference on Data Mining (SDM), Austin, TX, pages 198-206, SIAM, 2013. [7] Nguyen, H.V., Müller, E., Vreeken, J. & Böhm, K. Multivariate Maximal Correlation Analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, pages 775-783, JMLR, 2014. [8] Nguyen, H.V., Mandros, P. & Vreeken, J. Universal Dependency Analysis. In Proceedings of the 16th SIAM International Conference on Data Mining (SDM), Miami, FL, pages 791-800, SIAM, 2016. [9] Dwork, C., Hardt, M., Pitassi, T., Reingold, O. & Zemel, R. Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), ACM, 2012. [10] Bonchi, F., Hajian, S., Mushra, B. & Ramazotti, D. Exposing the probabilistic causal structure of discrimination. Data Science and Analytics, Springer, 2017.