Abstract. Pattern mining is one of the core topics of data mining. We consider the problem of mining a succinct set of patterns that together explain the data in terms of mutual exclusivity and co-occurence. That is, we extend the traditional pattern languages beyond conjunctions, enabling us to capture more complex relationships, such as replacable sub-components or antagonists in biological pathways.
We formally define this problem in terms of the Minimum Description Length principle, by which we identify the best set of patterns as the one that most succinctly describes the data. To avoid spurious result—in sparse data mutual exclusivity is likely just due to chance—we propose an efficient statistical test for \(K\)-ary mutual exclusivity. As the search space for the optimal model is enormous and unstructured, we propose Mexican, a heuristic algorithm to efficiently discover high quality sets of patterns of co-occurences and mutual exclusivity. Through extensive experiments we show that Mexican recovers the ground truth on synthetic data, and meaningful results on real-world data. Both in stark contrast to the state of the art, that result in millions of spurious patterns.
Discovering Succinct Pattern Sets Expressing Co-Occurrence and Mutual Exclusivity . In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2020. (16.8% acceptance rate)