Assignment 4: Because


The deadline is July 10th at 10:00 Saarbrücken standard-time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that answers the assignment questions but above all, critically discusses the material beyond these. Reports should summarise the key aspects, but more importantly, should include original and critical thought that show you have acquired a meta level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, any text you quote should be clearly identified as such. The expected length of a report is between 3 to 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

  1. Same-same, but different

    There exist many proposals for measuring how similar two graphs are, but very few for telling how they are similar. In this assignment, you'll be reading three proposals Coupette & Vreeken [1], Coupette et al. [2], and Kumpulainen et al. [3]. Each aim to answer the second question, but do so in a different way. To make everything very meta, your task will be to tell how these are similar and how they are different. The earliest paper, Momo describes differences using tangible graph structures, the second, Gragra using connected components, and the most recent one, sSBM the most loosely using stochastic block models. That almost sounds like devolution. Is it? Which method is most expressive? Which is most reliable? Which is most efficient? Which is most versatile? Which method is best? etc. etc. And, above all, why?

    (Bonus) A few years prior, Kalofolias et al. [4] proposed a variant of subgroup discovery where we are interested in discovering and naming dense subgraphs in a single given graphs. Can we meaningfully adapt this method to find what is similar between two graphs? How?

  2. No + No = No

    The EU GDPR codifies that you have the right to obtain an explanation of a fully automated decision process. That is, computer is no longer allowed to just say "No" but also has to say why it says no. To this end, one can either start using methods that are inherently explainable (e.g. based on patterns) or one can stick to using opaque models (e.g. SVMs, deep neural networks, etc) together with post-hoc explaination methods (and hope that a judge will go along with that). shap [5] is a popular game-theoretic approach for computing the importance of the feature values \(x_i\) for a prediction outcome \(\hat{f}(x)\).

    The basic idea is to consider \(x\) as a coalition or team, and to then determine the importance of a team member by randomizing it while keeping the rest of the team fixed. If we only consider individual features, we will miss important interactions between feature-values, but if we want to consider all possible interactions we face an exponential blow-up. How now, brown cow? To this end, Brordt & Von Luxburg [6] propose \(n\)-shap which computes the contributions of all sets of up to \(n\) features. Although this approach has the benefit of completeness, Xu et al. [7] argue that this is too much detail for a sensible explanation. They propose \(i\)-shap, which considers coalitions of (up to) \(i\) team members.

    Read all three papers, connect the dots between them, and critically discuss their strengths and weaknesses in general and in light of each other. As always, explain your reasoning.

  3. The Root of All Evil

    Discovering sources of viral infections is a hot area of research. A few years ago, within three months, three independent groups of authors published results on how to discover culprits from incomplete data. Read each of these three, i.e. Sundareisan et al. [8], Sefer & Kingsford [9], and Farajtabar et al. [10], discuss how (dis)similar the explored ideas and proposed solutions are. Be critical, try to see through the hype, not everything that is said to be radically different has to be such.

    (Bonus) Do (any of) the proposed methods have any weaknesses? Do you see more general, more realistic problem settings can be solved? (For example, how would you extend Sundareisan et al. [8] to the SIR model?) Read also Rozenshtein et al. [11], who consider identifying sources on dynamic, interaction graphs, without having to assume any normal infection model. This sounds very tasty, yet there is no such thing as a free lunch. Discuss the advantages and the disadvantages. In particular, draw the connections to the methods above, and investigate whether the shortest path idea can be back-ported into those settings.

  4. Hype, hyper, hyperbolae

    A lot of research attention in graph mining is focused on the discovery of communities, groups of nodes that strongly interact. Many papers assume highly simplistic models, such as that communities strongly interact within, but only loosely to the outside, or, that communities do not overlap. Recently, different groups of researchers started considering that, just like graphs themselves, communities within these graphs may also show powerlaw-like edge distributions.

    Read the papers by Yang & Leskovec [12], Araujo et al. [13], and Metzler et al. [14], and critically discuss these approaches. Example questions you can address include the following. What are their (hidden or not) assumptions, what are the strengths, and what are the weaknesses? Consider also the evaluation, what are these methods trying to reconstruct? Is that a meaningful goal, or is it a self-fulfilling prophecy? Does any of these papers convince you that real communities follow the assumptions the authors make? The first paper considers overlapping communities, whereas the other does not. That sounds like an important downside. Is it? Really? Why (not)?

Return the assignment by email to vreeken (at) cispa.de by 10 July, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both Hardness of questions, as well as whether you answer the Bonus questions.



References

You will need a username and password to access the papers. Contact the lecturer if you don't know the username or password.

[1] Coupette, C. & Vreeken, J. Graph Similarity Description. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2021.
[2] Coupette, C., Dalleiger, S. & Vreeken, J. Differentially Describing Groups of Graphs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), AAAI, 2022.
[3] Kumpulainen, I., Dalleiger, S., Vreeken, J. & Tatti, N. From Your Block to Our Block: How to Find Shared Structure between Stochastic Block Models over Multiple Graphs. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), ACM, 2021.
[4] Kalofolias, J., Boley, M. & Vreeken, J. Discovering Robustly Connected Subgraphs with Simple Descriptions. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2019.
[5] Lundberg, S.M. & Lee, S.I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), pages 4768-77, Curran, 2017.
[6] Brordt, S. & Von Luxburg, U. From Shapley Values to Generalized Additive Models and back. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2023.
[7] Xu, S., Cueppers, J. & Vreeken, J. Succinct Interaction-Aware Explanations. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), ACM, 2025.
[8] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.
[9] Sefer, E. & Kingsford, C. Diffusion Archaeology for Diffusion Progression History Reconstruction. In Proceedings of the 14th IEEE International Conference on Data Mining (ICDM), 2014.
[10] Farajtabar, M., Gomez-Rodriguez, M., Du, N., Zamani, M., Zha, H. & Song, L. Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
[11] Rozenshtein, P., Gionis, A., Prakash, B.A. & Vreeken, J. Reconstructing an Epidemic over Time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pages 1835-1844, ACM, 2016.
[12] Yang, J. & Leskovec, J. Overlapping Communities Explain Core-Periphery Organization of Networks. Proceedings of the IEEE, 102(12), IEEE, 2014.
[13] Araujo, M., Günnemann, S., Mateos, G. & Faloutsos, C. Beyond Blocks: Hyperbolic Community Detection. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data (ECMLPKDD'14), Springer, 2014.
[14] Metzler, S., Günnemann, S. & Miettinen, P. Hyperbolae are no Hyperbole: Modelling Communities that are not Cliques. In Proceedings of the IEEE International Conference on Data Mining (ICDM'16), IEEE, 2016.