The deadline is June 25th at 10:00 Saarbrücken standard time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but, more importantly, should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, and any text you quote should be clearly identified as such. The expected length of a report is between 3 and 5 pages, but there is no hard limit.
For the topic of your assignment, choose one of the following:
Discovering which points in your data 'belong together', also known as clustering, is a very important task in data analysis. Although super useful as a goal, traditional clustering methods don't quite attain the ideal: instead of exploring the data, they essentially let you explore your own assumptions about the data – you have to tell them how many clusters you want to find, how to measure the difference between two groups of points, or what distribution a cluster follows. Moreover, by their very definition they cannot give a sensible explanation of why points are grouped together, beyond 'this is (my best guess of the) optimal assignment according to what you told me to do'. Modern techniques, on the other hand, do emphasize trustworthiness, and aim to group data together in a way that is interpretable, or at least to offer an easy-to-understand explanation of why they say these groups exist.
First, read the papers by Budhathoki & Vreeken and Kim et al., and discuss in what sense (if at all) these methods 'cluster' the data, in what sense they provide explanations or interpretations for 'clusters', and in what sense they are similar or different in these regards. Are their results hierarchical or not, and in what way might that be important with regard to explanations? Next, additionally read Dalleiger & Vreeken – without getting scared by all the math. How does it build trust in the data decompositions it provides? In what way are these decompositions explainable, or is this just a marketing ploy?
Now we can finally discuss all methods together. A few example questions you can consider: How does the third paper relate to the first two? Compared to the other two, what are the key differences in how Dalleiger & Vreeken models the data, and how much information it actively uses? How could we extend Kim et al. to incorporate other measures of trust in the data decomposition? Which of these methods do you find the most trustworthy, and which provides the best explanations? Why? And, perhaps most importantly, can you identify any aspects of trustworthiness and explainability that these methods do not address, but that you think are particularly important?
Statistical hypothesis testing is a well-established tool used in science for telling whether two random variables are associated: the two variables are tested against the hypothesis that they are independent. Traditionally, in Data Mining we search for interesting associations (e.g., itemsets, association rules) in high-dimensional data, where every subset of attributes is a candidate. Using interestingness measures such as frequency leaves open the question of whether the results are statistically significant. While one solution would be to post-process all the results and test them for significance, this can be very inefficient, and also misleading unless the tests are designed with care – e.g., corrected for multiple hypothesis testing.
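To see why naive post-processing is misleading, consider the family-wise error rate: even a modest number of independent tests at level α = 0.05 makes a false discovery all but certain, and a pattern search space is anything but modest. A back-of-the-envelope sketch (the 20-item search space is just an illustrative figure):

```python
def family_wise_error(alpha, m):
    """Probability of at least one false positive among m independent
    tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-test level that keeps the family-wise error below alpha."""
    return alpha / m

print(family_wise_error(0.05, 100))       # ~0.994: a false positive is near-certain
# Over just 20 items there are 2**20 - 1 candidate itemsets to test:
print(bonferroni_level(0.05, 2**20 - 1))  # per-test level of about 5e-8
```

The Bonferroni correction is the bluntest fix; the tutorial cited below surveys considerably sharper ones.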
In modern Data Mining, researchers have started looking into ways of minimizing false positives, i.e., minimizing the number of discovered patterns that are not significant. While this area is rich with different approaches, Hämäläinen and Webb provide a nice tutorial covering some of the techniques. One such method, Kingfisher, performs the search with hypothesis tests on the fly, while at the same time pruning the search space whenever it is certain that the supersets cannot be significant.
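Kingfisher measures significance with Fisher's exact test; as a minimal stdlib-only sketch (illustration only – the actual algorithm adds clever bounds so that it can prune supersets without ever testing them), the one-sided p-value for a 2×2 contingency table is:

```python
from math import comb

def fisher_p(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]:
    the probability, under independence with fixed margins, of a
    co-occurrence count at least as large as a (hypergeometric tail)."""
    n = a + b + c + d
    row, col = a + b, a + c          # margins of the table
    return sum(comb(col, k) * comb(n - col, row - k) / comb(n, row)
               for k in range(a, min(row, col) + 1))

# 8 of 10 transactions with X also contain Y, vs. 2 of 10 without X:
print(fisher_p(8, 2, 2, 8))  # ~0.0115: significant at alpha = 0.05
```

The exact test makes no distributional assumptions, which is precisely why it suits the small, skewed counts typical of pattern mining.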
In even more modern Data Mining, Mandros et al. suggested using mutual information as an interestingness measure. In theory, we have that \(I(X,Y)=0\) whenever \(X\) and \(Y\) are independent, and so, if \(I(X,Y)>0\), we know that \(X\) and \(Y\) are not independent. In practice, however, mutual information estimates from empirical data are not perfect. How did Mandros et al. solve this? It seems as if their solution guarantees no false positives. Is that true?

(Bonus) Learning Bayesian networks, e.g., finding Markov blankets, often involves the use of statistical hypothesis testing. Tsamardinos et al. have used statistical tests and mutual information to identify Markov blankets. Putting the algorithmic part aside, and assuming that the method of Mandros et al. also finds Markov blankets, discuss how you expect these two approaches to differ in terms of precision and recall.
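The estimation issue is easy to demonstrate: the plug-in estimate of mutual information is biased upward, so it comes out strictly positive even for independent variables. A small sketch (sample size and alphabet are arbitrary choices for illustration):

```python
import random
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    """Plug-in (maximum-likelihood) estimate of I(X,Y) in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

rng = random.Random(1)
xs = [rng.randint(0, 3) for _ in range(200)]
ys = [rng.randint(0, 3) for _ in range(200)]  # independent of xs by construction
print(empirical_mi(xs, ys))  # > 0, even though the true I(X,Y) = 0
```

Naively treating \(\hat{I} > 0\) as evidence of dependence would thus flag essentially every candidate – exactly the false-positive problem at issue here.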
Discovering non-trivial interactions or correlations requires a powerful score. As we have seen, entropy has many likeable properties, in particular that it can detect any type of interaction. Using entropy to define a good score for multivariate continuous-valued random variables is a bit tricky, though. Nguyen et al. showed with CMI that we can successfully use cumulative entropy to this end. Soon after, they improved over CMI with MAC – loosely speaking, a multivariate extension of MIC that uses cumulative entropy for maximal correlation analysis. Two years later they proposed the (so far) ultimate approach, UDS, which is even better because it is universal. Well, well. Each of these measures is said to improve over its predecessor, but one has to remain critical; your task is to carefully read these three papers, connect the dots on how they relate, identify their respective strengths and weaknesses, identify which corners they cut, and reason about an even more ideal score that you could imagine building upon their main ideas.
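For concreteness, cumulative entropy replaces the density in differential entropy with the CDF, \(h(X) = -\int P(X \le x)\,\log P(X \le x)\,dx\), which can be estimated directly from a sorted sample, with no density estimation at all. A bare-bones plug-in estimator (a simplification for illustration; the papers use refined estimators and conditional variants):

```python
from math import log

def cumulative_entropy(sample):
    """Plug-in estimate of cumulative entropy,
    h(X) = -integral F(x) log F(x) dx, via the empirical CDF F.
    No density estimate needed -- only a sort."""
    xs = sorted(sample)
    n = len(xs)
    # between xs[i] and xs[i+1], the empirical CDF is constant at (i+1)/n
    return -sum((xs[i + 1] - xs[i]) * ((i + 1) / n) * log((i + 1) / n)
                for i in range(n - 1))

uniform = [k / 100 for k in range(1, 101)]
print(cumulative_entropy(uniform))  # close to the true value 1/4 for U(0,1)
```

Note that, by construction, the estimate scales linearly with the data, \(h(aX) = a\,h(X)\) for \(a > 0\) – one of the ways cumulative entropy behaves quite differently from Shannon entropy, and worth keeping in mind when comparing the three scores.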
Whenever we apply data-driven methods we have to be very careful that we do not introduce any algorithmic bias; we have to prevent our algorithms from making unfair or discriminative decisions, regardless of whether this is because of the assumptions the method makes, or because these biases were already present in the data we trained the method on. Statistical parity, which assumes that all is fair if we make sure that the distribution of the sensitive attribute is the same between the groups we decide over, is a popular notion in fair machine learning, but has been attacked by Dwork et al. Bonchi et al. do not mention parity at all in their paper, as they take a causality-based approach. Read this paper carefully and critically. Do you find their approach convincing? Does it withstand the critiques that Dwork et al. give against statistical parity? Do you see other potential problems in this approach? How could it possibly go wrong? What are its main strengths, in your opinion, and what are its main weaknesses?
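As a reference point for the discussion: statistical parity is trivial to state and to check, which is part of its appeal. A minimal sketch, assuming binary decisions and a binary sensitive attribute:

```python
def statistical_parity_gap(decisions, groups):
    """Absolute difference in positive-decision rates between the two
    groups (0/1); statistical parity demands a gap of (near) zero."""
    def rate(g):
        members = [d for d, s in zip(decisions, groups) if s == g]
        return sum(members) / len(members)
    return abs(rate(0) - rate(1))

# Toy data: decision 1 = positive outcome, groups is the sensitive attribute.
decisions = [1, 1, 0, 0, 1, 0, 1, 0]
groups    = [0, 0, 0, 0, 1, 1, 1, 1]
print(statistical_parity_gap(decisions, groups))  # 0.0: parity holds
# Note the measure is blind to *which* individuals within each group get
# the positive decision -- the heart of the Dwork et al. critique.
```

A zero gap is thus a statement about groups, not about the fairness of any individual decision; keep that distinction in mind when weighing the causal approach against it.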
(Bonus) In many applications it is not so much about making a (binary) decision, but rather about ranking elements. Say, a search engine that we can query to discover who are the smartest students on campus – and of course we want to be fair with regard to hair colour. How would you solve the fair-ranking problem if your only tool is statistical parity? And how would you solve the same problem if you had the SBCN of Bonchi et al. at your disposal?
Return the assignment by email to firstname.lastname@example.org by 25 June, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.
Grading will take into account the hardness of the questions, as well as whether you answer the Bonus questions.
You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.
Budhathoki & Vreeken. The Difference and the Norm – Characterising Similarities and Differences between Databases. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD'15), Springer, 2015.
Kim et al. Mind the Gap: A Generative Approach to Interpretable Feature Selection and Extraction. In Proceedings of the 28th Conference on Advances in Neural Information Processing Systems (NeurIPS'15), pages 2260-2268, Curran, 2015.
Dalleiger & Vreeken. Explainable Data Decompositions. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI'20), AAAI Press, 2020.
Hämäläinen & Webb. A Tutorial on Statistically Sound Pattern Discovery. Data Mining and Knowledge Discovery, 33(2), 2019.
Hämäläinen. Kingfisher: An Efficient Algorithm for Searching for Both Positive and Negative Dependency Rules with Statistical Significance Measures. Knowledge and Information Systems, 32(2):383-414, 2012.
Mandros et al. Discovering Reliable Approximate Functional Dependencies. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'17), pages 355-363, ACM, 2017.
Tsamardinos et al. Algorithms for Large Scale Markov Blanket Discovery. In Proceedings of the 16th International FLAIRS Conference, pages 376-380, AAAI Press, 2003.
Nguyen et al. CMI: An Information-Theoretic Contrast Measure for Enhancing Subspace Cluster and Outlier Detection. In Proceedings of the 13th SIAM International Conference on Data Mining (SDM'13), pages 198-206, SIAM, 2013.
Nguyen et al. Multivariate Maximal Correlation Analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), pages 775-783, JMLR, 2014.
Nguyen et al. Universal Dependency Analysis. In Proceedings of the 16th SIAM International Conference on Data Mining (SDM'16), pages 791-800, SIAM, 2016.
Dwork et al. Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS'12), ACM, 2012.
Bonchi et al. Exposing the Probabilistic Causal Structure of Discrimination. International Journal of Data Science and Analytics, Springer, 2017.