## Assignment 3: Interaction

The deadline is June 30th at 10:00 Saarbrücken standard time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects but, more importantly, should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, and any text you quote should be clearly identified as such. The expected length of a report is between 3 and 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### Grab it

To retrieve succinct sets of patterns, MDL-based approaches have proven quite successful. To extend beyond the simple conjunctive patterns that Krimp [1] and Slim [2] discover, Fischer and Vreeken recently proposed Grab [3] to discover interesting sets of rules. Your task is to read each of these papers critically, and investigate their conceptual differences and similarities.

All three employ an MDL-based score, yet the authors of Grab opted for binomial codes over canonical enumerations to encode the data, rather than the optimal prefix codes that Krimp and Slim use to define $$L(D|M)$$. What are the differences between these two approaches, and what are the implications? In particular, how does each score behave when there are overlapping patterns in the set, and what does that mean for the interpretation of the pattern sets? Are there any algorithmic advantages or disadvantages to either approach?
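As a starting point for the comparison, the two kinds of code lengths can be sketched in a few lines of Python. This is a simplified illustration and not the exact score of any of the three papers: `prefix_code_length` assumes Shannon-optimal codes derived from pattern usages, and `binomial_code_length` assumes we identify which `k` out of `n` cells are set by an index into a canonical enumeration. Both function names and their inputs are hypothetical.

```python
from math import comb, log2


def prefix_code_length(usages):
    """Bits needed to encode a cover with Shannon-optimal prefix codes.

    usages[i] is how often pattern i is used in the cover; a pattern
    used u times out of `total` gets a code of -log2(u / total) bits.
    """
    total = sum(usages)
    return sum(-u * log2(u / total) for u in usages if u > 0)


def binomial_code_length(n, k):
    """Bits needed to point out which k of n cells are 1.

    There are C(n, k) possibilities; an index into their canonical
    enumeration costs log2 C(n, k) bits.
    """
    return log2(comb(n, k))


# Two patterns, each used twice: every use costs 1 bit.
print(prefix_code_length([2, 2]))
# Which 2 of 4 cells are set: log2(6) bits.
print(binomial_code_length(4, 2))
```

Note that the prefix-code length depends only on how often each pattern is used in the cover, whereas the binomial length depends only on counts `n` and `k`; this difference is worth keeping in mind when thinking about overlapping patterns.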

Another difference from Slim is that Grab models noise explicitly using error matrices. For which type of patterns and what kind of data is such an encoding useful? Is it always preferable to use a model that explicitly models noise? If not, when would you prefer the one, and when would you prefer the other?

2. #### Squeeze it

Mining sets of patterns that together describe the data well has effectively solved the pattern explosion. The question remains: how to score, and how to mine such sets? These are difficult questions. The first determines what the ideal solution looks like, whereas the second determines what we can possibly find. Both involve choices with far-reaching consequences that are not always easy to foresee.

To summarise event sequences, Tatti and Vreeken [4], for example, proposed the SQS algorithm. They use MDL, the Minimum Description Length principle, to define their score, and propose algorithms both to score the data and to discover good pattern sets directly from data. A few years later, Fowkes and Sutton [5] proposed ISM, which takes a related but slightly different probabilistic approach that does not directly punish gaps, and does allow patterns to interleave. Recently, Bhattacharyya and Vreeken [6] presented Squish, which can discover interleaving and nested patterns, and considers a richer class of patterns than both SQS and ISM.

Your assignment, if you choose to accept it, is to read and analyse these papers critically, and connect the dots. Basic questions that can help you on your way include the following. Are there any hidden, or obvious, biases and assumptions in the scores, in the cover, or in the search algorithms that may influence the results? If so, how? What are the advantages of the probabilistic over the MDL-based score? How different are they really? What are the implications of not punishing gaps, in theory and in practice? What about the comparisons between the different methods: are these convincing and fair, or are they comparing apples and oranges? Does ISM fare well on discovering the types of patterns its authors are after? How about interleaving? Why are the results as they are? (And, are the experiments presented fairly?)

(Bonus) Squish discovers a model that is at least as good as that of SQS much faster, yet convergence takes a while. How could we speed this up, in a principled way? Also, the SQS-Search procedure requires many passes over the data; how could we reduce this and still (likely) obtain good models? Further, is it possible to include some of the ideas of ISM back into SQS or Squish? How?

3. #### Group it (Hard)

Loosely speaking, in subgroup discovery we are after subpopulations of our data that a) are selectable with an easily interpretable query, i.e. a pattern, and b) exhibit a different distribution over the target variable than we see globally. Of course, there are many ways to define what makes a good subgroup, and over time many such definitions have been proposed. Commonly used scores include weighted relative accuracy for discrete targets, and mean-shift for continuous-valued targets.
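To make the notion of 'standing out' concrete, here is a minimal sketch of weighted relative accuracy for a binary target, using its standard definition: the coverage of the subgroup multiplied by the difference between the positive rate inside the subgroup and the global positive rate. The function name and argument names are hypothetical, and returning 0 for an empty subgroup is an assumption made for convenience.

```python
def wracc(n_total, n_pos_total, n_sub, n_pos_sub):
    """Weighted relative accuracy of a subgroup for a binary target.

    n_total      number of rows in the data
    n_pos_total  number of rows with a positive target
    n_sub        number of rows selected by the subgroup
    n_pos_sub    number of selected rows with a positive target
    """
    if n_sub == 0:
        return 0.0  # assumption: an empty subgroup scores zero
    coverage = n_sub / n_total
    rate_sub = n_pos_sub / n_sub
    rate_global = n_pos_total / n_total
    return coverage * (rate_sub - rate_global)


# 20 of 100 rows selected; 90% positive inside versus 50% overall.
print(wracc(100, 50, 20, 18))
```

The coverage factor trades off subgroup size against the strength of the deviation: a tiny subgroup with an extreme positive rate can score lower than a large subgroup with a modest one.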

Only surprisingly recently, however, did we realize that 'standing out' by itself is not necessarily a virtue. If we want to use subgroups to better predict the value of the target attribute, or to explain the target attribute in simple terms, merely standing out does not suffice.

Read the following two papers, one by Song et al. [7], and one by Boley et al. [8]. Focus your discussion on what the goals of these two (technically very different) approaches have in common, and how they (try to) solve which problem. Critically discuss whether they achieve this goal, be it in general, or in the specific use case they consider. Last, go into detail on what is different between the two approaches, beyond the obvious. Whatever you do, try not to be distracted by the details of the search schemes, as that's all boilerplate.

(Bonus) Read the recent paper by Kalofolias et al. [9], who propose a variant of subgroup discovery in which we additionally consider a control variable. Ignoring the technical solution they propose, and focusing instead on the general problem where this control variable could be either continuous-valued or discrete-valued, could we use the ideas of Song et al. [7] and Boley et al. [8] to build a stronger method? Or, do simple notions of (not) standing out suffice here?

4. #### Bop it

In modern Data Mining, researchers started looking into ways of minimizing false positives, i.e., minimizing the number of discovered patterns that are not significant. In even more modern Data Mining, Mandros et al. [10] suggest using mutual information as an interestingness measure. In theory, we have that $$I(X,Y)=0$$ whenever $$X$$ and $$Y$$ are independent, and so, if $$I(X,Y)>0$$ we know that $$X$$ and $$Y$$ are not independent. In practice, however, mutual information estimates from empirical data are not perfect. How did Mandros et al. solve this? It seems as if their solution guarantees no false positives. Is that true?
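The gap between theory and practice is easy to observe yourself. The sketch below computes the plug-in (maximum likelihood) estimate of mutual information from empirical counts and applies it to two variables that are sampled independently. It is only an illustration of the estimation problem, not of the solution by Mandros et al.; the function name, sample size, domain size, and seed are all arbitrary assumptions.

```python
import random
from collections import Counter
from math import log2


def plugin_mi(xs, ys):
    """Plug-in estimate of I(X,Y) in bits from paired samples."""
    n = len(xs)
    cx, cy = Counter(xs), Counter(ys)
    cxy = Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(c / n * log2(c * n / (cx[x] * cy[y]))
               for (x, y), c in cxy.items())


# X and Y are drawn independently, so the true I(X,Y) is 0.
rng = random.Random(0)
xs = [rng.randrange(10) for _ in range(100)]
ys = [rng.randrange(10) for _ in range(100)]
mi = plugin_mi(xs, ys)
print(mi)
```

With few samples relative to the domain size, the printed estimate is clearly above zero even though the variables are independent by construction. Increasing the sample size, or shrinking the domains, shows how the estimate approaches zero.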

Finding sets of features that are informative for a target comes really close to finding Markov blankets, and established algorithms for that task often involve statistical hypothesis testing. Tsamardinos et al. [11], for example, use statistical tests and mutual information. How do the methods and approaches relate? What is the benefit of the approach of Mandros et al. over the established method? How do you expect these two approaches to differ in terms of precision and recall? Why?

Return the assignment by email to vreeken (at) cispa.de by 30 June, 10:00. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both the hardness of the questions and whether you answer the bonus questions.