## Assignment 4: Because

The deadline is July 21st at 10:00 Saarbrücken standard time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, and any text you quote should be clearly identified as such. The expected length of a report is between 3 and 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### The Root of All Evil

Discovering sources of viral infections is a hot area of research. Recently, within three months, three groups of authors published results on how to discover culprits from incomplete data. Read these three papers [1,2,3] and discuss how (dis)similar the explored ideas and proposed solutions are. Be critical and try to see through the hype; not everything that is said to be radically different necessarily is.

(Bonus) Do (any of) the proposed methods have any weaknesses? Do you see more general, more realistic problem settings that can be solved? (For example, how would you extend [1] to the SIR model?) Read also [4], who consider identifying sources on dynamic interaction graphs without having to assume any standard infection model. This sounds very tasty, yet there is no such thing as a free lunch. Discuss the advantages and the disadvantages. In particular, draw the connections to the methods above, and investigate whether the shortest-path idea can be back-ported to those settings.

2. #### Subgraphgroups

In traditional subgroup discovery we are interested in subgroups for which the target variable is exceptionally distributed; e.g. the size of a tumour after the administration of a combination of drugs. But, what if $$Y$$ is not a column of (scalar) values, but rather the friendship graph over the population in the dataset, and we are interested in those subgroups that induce dense subgraphs? Testing for every possible subset of the population whether it induces a dense subgraph is prohibitively expensive, so we need something smarter.

A first idea is that we simply compute the degree in $$Y$$ of each population member, and use these values as a continuous valued target attribute $$Y'$$ to discover subgroups with exceptionally high average degree using a standard subgroup discovery algorithm. Is this a good idea? Why (not)?
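To make the first idea concrete, here is a minimal sketch of the derived target construction, using `networkx` and `pandas` on a tiny hypothetical population (the attribute `age` and the toy graph are illustrative assumptions, not part of the assignment):

```python
# Sketch of the first idea: use each member's degree in Y as a
# continuous-valued target attribute Y'. Data below is purely illustrative.
import networkx as nx
import pandas as pd

# Toy population with one descriptive attribute (age) ...
X = pd.DataFrame({"member": [0, 1, 2, 3], "age": [23, 25, 61, 64]})
# ... and a friendship graph Y over the same members.
Y = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3)])

# Derived target Y': the degree of each member in Y.
X["degree"] = X["member"].map(dict(Y.degree()))
# A standard subgroup discovery algorithm would now search for subgroups
# of X whose mean 'degree' is exceptionally high.
```

Note that this reduction only looks at individual degrees; whether that suffices to capture *induced density* is exactly what the question asks you to reflect on.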

A second idea is that we first mine the densest subgraph in $$Y$$ (e.g. via Goldberg's 1984 algorithm), create a binary target attribute $$Y'$$ that for every population member labels whether they're part (1) of this subgraph or not (0), and then use a traditional subgroup discovery algorithm over $$X$$ with $$Y'$$ to discover subgroups with a particularly strong affinity to the positive labels. Is this a good idea? Why (not)?
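A sketch of the second idea follows. For brevity it substitutes Charikar's greedy peeling (a simple 2-approximation of average-degree density) for Goldberg's exact max-flow algorithm; the toy graph is an illustrative assumption:

```python
# Sketch of the second idea: mine a dense subgraph of Y, then derive a
# binary target Y'. Uses Charikar's greedy peeling as a stand-in for
# Goldberg's exact max-flow algorithm.
import networkx as nx

def densest_subgraph(G):
    """Repeatedly peel a minimum-degree node; return the densest
    intermediate node set, measured by |E| / |V|."""
    H = G.copy()
    best, best_density = set(H), 0.0
    while H.number_of_nodes() > 0:
        density = H.number_of_edges() / H.number_of_nodes()
        if density > best_density:
            best, best_density = set(H), density
        v = min(H.degree(), key=lambda kv: kv[1])[0]
        H.remove_node(v)
    return best

# Toy friendship graph: a triangle {0,1,2} plus a pendant and a stray edge.
Y = nx.Graph([(0, 1), (0, 2), (1, 2), (2, 3), (4, 5)])
S = densest_subgraph(Y)
# Binary target Y': 1 iff the member belongs to the dense subgraph.
labels = {v: int(v in S) for v in Y}
```

A subgroup discovery run over $$X$$ with these labels would then favour descriptions covering members of the single mined subgraph, which already hints at one limitation worth discussing.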

A third idea is to discover cohesive subgroups [5]. How does this task differ from 'normal' subgroup discovery? When would you use one and when the other?

A fourth idea is proposed by Kalofolias et al. [6]. Is this a good idea? Why (not)? Discuss both the objective and the optimisation approaches in comparison to the other ideas.

3. #### Oddities

A large part of data mining is focused on discovering regularities in data, while oftentimes the exceptions to these underlying regularities are as informative, or perhaps even more so, to the domain expert. Detecting anomalies, or outliers, in graphs is particularly challenging: unlike for a row in a table, where we only need to regard that row itself, to determine whether a node (or an edge) in a graph is odd we may have to consider the whole graph.

Read two out of the following three papers, [7], [8], [9]. (Bonus, for all three) Each of these takes a different look at graphs and therewith can identify different types of anomalies. Or, do they? For each method, carefully examine what makes an anomaly an anomaly, and critically discuss whether this definition makes sense, and what (hidden) assumptions these methods make. Do we always need the different methods, or can you find cases where method A finds more-or-less the same anomalies as method B? Last, but not least, does there exist an ultimate outlier detection algorithm? Why (not)?

4. #### Hype, hyper, hyperbolae

A lot of research attention in graph mining is focused on the discovery of communities: groups of nodes that interact strongly. Many papers assume highly simplistic models, such as that communities interact strongly within but only loosely with the outside, or that communities do not overlap. Recently, different groups of researchers started considering that, just like graphs themselves, communities within these graphs may also show power-law-like edge distributions.

Read [10], [11], and critically discuss these approaches. Example questions you can address include the following. What are their (hidden or not) assumptions, what are the strengths, and what are the weaknesses? Consider also the evaluation: what are these methods trying to reconstruct? Is that a meaningful goal, or is it a self-fulfilling prophecy? Do any of these papers convince you that real communities follow the assumptions the authors make? The first paper considers overlapping communities, whereas the other does not. That sounds like an important downside. Is it? Really? Why (not)? (Bonus, also consider [12])

Return the assignment by email to vreeken (at) cispa.de by 21 July, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both the hardness of the questions, as well as whether you answer the Bonus questions.