Topics in Algorithmic Data Analysis SS'22


Course Information

Type Advanced Lecture (6 ECTS)
Lecturer Prof. Dr. Jilles Vreeken
Email vreeken (at) cispa.de
Lectures Thursdays, 10–12 o'clock (sharp) in 0.05 (CISPA, E9.1) and online via Zoom and YouTube
(and twice on Tuesday, 10–12 o'clock, also sharp, but online only)
Summary In this advanced course we'll be investigating hot topics in data mining and machine learning that the lecturer thinks are cool. This course is for those of you who are interested in Data Mining, Machine Learning, Data Science, Big Data Analytics – or, as the lecturer prefers to call it, Algorithmic Data Analysis. We'll be looking into how to determine causal relations from observational data, how to extract non-linear dependencies, how to discover significant and useful patterns from data, and how to gain insight into graph-structured data.

Preliminary Schedule

Month Day Topic Slides Assignment Req. Reading Opt. Reading
Apr 14 Introduction and Practicalities PDF 1st assignment out
21 Causality PDF [1] Ch 1, Ch 6 [8]
28 Causal Discovery PDF deadline 1st, 2nd out [1] Ch 7 [9]
May 5 Causal Inference PDF [1] Ch 2, Ch 4 [10,11,12,13,14]
10* (Tue) Dependence PDF [2] [15,16,17]
19 Subgroup Discovery PDF [3] [18,19,20]
24* (Tue) Useful Patterns PDF [4] [21,22,23]
26 yay holiday — no class deadline 2nd, 3rd out
Jun 2 Jilles travelling — no class
9 Insightful Patterns PDF [5] [24,25]
16 yay holiday — no class
23 Sequential Patterns PDF 4th out [22] [26,27]
30 Graph Summarization PDF deadline 3rd [6] [28,29,30,31]
Jul 7 Graph Epidemics PDF [7] [32,33,34,35]
14 Wrap-Up with Audience-Selected Topic PDF
21 Jilles travelling — no class deadline 4th
28 oral exams
Sep 29 oral re-exams

* Tuesdays 10:00 till 12:00 and online only

All report deadlines are on the indicated day at 10:00.

Materials

All required and optional reading will be made available here. You will need a username and password to access the papers outside the MMCI/MPI network, which will be given out in the first lecture.

In case you do not have a strong enough background in data mining, machine learning, or statistics, these books [1,36,37] may help to get you on your way. The university library kindly keeps hard copies of these books available in a so-called Semesterapparat.

Required Reading

[1] Peters, J., Janzing, D. & Schölkopf, B. Elements of Causal Inference. MIT Press, 2017.
[2] Nguyen, H.V., Müller, E., Vreeken, J. & Böhm, K. Multivariate Maximal Correlation Analysis. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, pages 775-783, JMLR, 2014.
[3] Atzmueller, M. Subgroup Discovery. WIREs Data Mining and Knowledge Discovery, 5:35-49, Wiley, 2015.
[4] van Leeuwen, M. & Vreeken, J. Mining and Using Sets of Patterns through Compression. In Frequent Pattern Mining, Aggarwal, C. & Han, J., pages 165-198, Springer, 2014.
[5] Fischer, J. & Vreeken, J. Sets of Robust Rules, and How to Find Them. In Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Data (ECMLPKDD), Springer, 2019.
[6] Koutra, D., Kang, U., Vreeken, J. & Faloutsos, C. VoG: Summarizing Graphs using Rich Vocabularies. In Proceedings of the 14th SIAM International Conference on Data Mining (SDM), Philadelphia, PA, pages 91-99, SIAM, 2014.
[7] Prakash, B.A., Vreeken, J. & Faloutsos, C. Spotting Culprits in Epidemics: How many and Which ones? In Proceedings of the 12th IEEE International Conference on Data Mining (ICDM), Brussels, Belgium, IEEE, 2012.

Optional Reading

[8] Pearl, J. The do-calculus revisited. In Proceedings of Uncertainty in AI, 2012.
[9] Mian, O., Marx, A. & Vreeken, J. Discovering Fully Directed Causal Networks. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), AAAI, 2021.
[10] Janzing, D. & Schölkopf, B. Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Theory, 56(10):5168-5194, 2010.
[11] Janzing, D., Mooij, J., Zhang, K., Lemeire, J., Zscheischler, J., Daniusis, P., Steudel, B. & Schölkopf, B. Information-geometric Approach to Inferring Causal Directions. Artificial Intelligence, 182-183:1-31, 2012.
[12] Budhathoki, K. & Vreeken, J. Accurate Causal Inference on Discrete Data. In Proceedings of the IEEE International Conference on Data Mining (ICDM'18), IEEE, 2018.
[13] Marx, A. & Vreeken, J. Identifiability of Cause and Effect using Regularized Regression. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2019.
[14] Tagasovska, N., Chavez-Demoulin, V. & Vatter, T. Distinguishing Cause from Effect Using Quantiles: Bivariate Quantile Causal Discovery. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2020.
[15] Mandros, P., Boley, M. & Vreeken, J. Discovering Dependencies with Reliable Mutual Information. Knowledge and Information Systems, 62:4223-4253, Springer, 2020.
[16] Mandros, P., Boley, M. & Vreeken, J. Discovering Reliable Correlations in Categorical Data. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 2019.
[17] Dutta, A., Vreeken, J., Ghiringhelli, L. & Bereau, T. Data-Driven Equation for Drug-Membrane Permeability across Drugs and Membranes. Journal of Chemical Physics, AIP Publishing, 2021.
[18] Kalofolias, J., Boley, M. & Vreeken, J. Efficiently Discovering Locally Exceptional yet Globally Representative Subgroups. In Proceedings of the 17th IEEE International Conference on Data Mining (ICDM), New Orleans, LA, IEEE, 2017.
[19] Sutton, C., Boley, M., Ghiringhelli, L., Rupp, M., Vreeken, J. & Scheffler, M. Identifying Domains of Applicability of Machine Learning Models for Materials Science. Nature Communications, 11:1-9, Nature Research, 2020.
[20] Budhathoki, K., Boley, M. & Vreeken, J. Rule Discovery for Exploratory Causal Reasoning. In Proceedings of the SIAM Conference on Data Mining (SDM), SIAM, 2021.
[21] Vreeken, J., van Leeuwen, M. & Siebes, A. Krimp: Mining Itemsets that Compress. Data Mining and Knowledge Discovery, 23(1):169-214, Springer, 2011.
[22] Bhattacharyya, A. & Vreeken, J. Efficiently Summarising Event Sequences with Rich Interleaving Patterns. In Proceedings of the SIAM International Conference on Data Mining (SDM'17), SIAM, 2017.
[23] Cueppers, J. & Vreeken, J. Just Wait For It... Mining Sequential Patterns with Reliable Prediction Delays. In Proceedings of the IEEE International Conference on Data Mining (ICDM), IEEE, 2020.
[24] Fischer, J., Oláh, A. & Vreeken, J. What's in the Box? Exploring the Inner Life of Neural Networks with Robust Rules. In Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2021.
[25] Fischer, J. & Vreeken, J. Differentiable Pattern Set Mining. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2021.
[26] Tatti, N. & Vreeken, J. The Long and the Short of It: Summarizing Event Sequences with Serial Episodes. In Proceedings of the 18th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Beijing, China, ACM, 2012.
[27] Cueppers, J., Kalofolias, J. & Vreeken, J. Omen: Discovering Sequential Patterns with Reliable Prediction Delays. Knowledge and Information Systems, Springer, 2022.
[28] Chakrabarti, D., Papadimitriou, S., Modha, D.S. & Faloutsos, C. Fully automatic cross-associations. In Proceedings of the 10th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Seattle, WA, pages 79-88, 2004.
[29] Kang, U. & Faloutsos, C. Beyond Caveman Communities: Hubs and Spokes for Graph Compression and Mining. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), Vancouver, Canada, pages 300-309, IEEE, 2011.
[30] Goebl, S., Tonch, A., Böhm, C. & Plant, C. MeGS: Partitioning Meaningful Subgraph Structures Using Minimum Description Length. In Proceedings of the IEEE International Conference on Data Mining (ICDM), pages 889-894, IEEE, 2016.
[31] Coupette, C. & Vreeken, J. Graph Similarity Description. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2021.
[32] Lappas, T., Terzi, E., Gunopulos, D. & Mannila, H. Finding effectors in social networks. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 1059-1068, ACM, 2010.
[33] Shah, D. & Zaman, T. Rumors in a Network: Who's the Culprit? IEEE Transactions on Information Theory, 57(8):5163-5181, 2011.
[34] Sundareisan, S., Vreeken, J. & Prakash, B.A. Hidden Hazards: Finding Missing Nodes in Large Graph Epidemics. In Proceedings of the SIAM International Conference on Data Mining (SDM'15), SIAM, 2015.
[35] Rozenshtein, P., Gionis, A., Prakash, B.A. & Vreeken, J. Reconstructing an Epidemic over Time. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'16), pages 1835-1844, ACM, 2016.
[36] Wasserman, L. All of Statistics. Springer, 2005.
[37] Aggarwal, C.C. Data Mining - The Textbook. Springer, 2015.

Prerequisites

Students should have basic working knowledge of machine learning, data mining, and/or statistics, e.g. by successfully having taken courses such as Machine Learning, Probabilistic Graphical Models, Probabilistic Machine Learning, Elements of Machine Learning, etc.

The skills you will benefit most from when taking TADA are reading comprehension and critical thought. We will practice these in both lectures and assignments.

Registration

There is no need to register for the course with the lecturer. The credentials for the Zoom meetings, the YouTube stream, and the necessary materials will be shared in the first (publicly available) lecture.

As is usual, you will have to register for the exam via LSF. You can do so up to one week before the exam.

Lectures

Each lecture will start at 10:00 sharp.

TADA will be a hybrid course. You can attend all lectures in person in the CISPA lecture hall (room 0.05 of E9.1), as well as online via Zoom. We will additionally stream every lecture to YouTube, and keep the recorded videos until the end of the semester. The Zoom meetings, YouTube streams, and edited videos will be linked from the schedule above.

Masks are mandatory in the CISPA building. Whether masks are required while sitting in the lecture hall depends on the number of students that attend in person, and will be decided by the lecturer.

The credentials to access the course materials will be shared during the first lecture.

Assignments

Students will individually do one assignment per topic – four in total. For every assignment, you will have to read one or more research papers and hand in a report that critically discusses this material and answers the assignment questions. Reports should summarise the key aspects, but more importantly, should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you've drawn from should be referenced. The expected length of a report is 3 pages, but there is no limit.

A sample assignment from 2015, together with a well-graded report, can be found here.

The deadlines for the reports are on the day indicated in the schedule at 10:00 Saarbrücken standard-time. You are free to hand in earlier.

Grading and Exam

The assignments will be graded on a scale of Fail, Pass, Very Good, and Excellent. Any assignment not handed in by the deadline is automatically considered failed, and cannot be re-done. You are allowed to re-do one Failed assignment: you have to hand in the improved assignment within two weeks. Two failures mean you are not eligible for the exam, and have hence failed the course.

You can earn up to three bonus points by obtaining Excellent or Very Good grades for the assignments. An Excellent grade gives you one bonus point, as does every pair of Very Good grades, up to a maximum of three bonus points. Each bonus point improves your final grade by 1/3, assuming you pass the final exam. For example, if you have two bonus points and you receive 2.0 from the final exam, your final grade will be 1.3. You fail the course if you fail the final exam, irrespective of your possible bonus points. Failed assignments do not reduce your final grade, provided you are eligible to sit the final exam.
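
To make the bonus arithmetic concrete, here is a minimal sketch in Python. It is an illustration only, not an official grade calculator. The function names, the list of grade steps (1.0, 1.3, 1.7, ..., 4.0), and the use of 5.0 for a failed exam are assumptions based on the usual German grading scale; the course page itself only states the bonus rules and the 2.0 to 1.3 example.

```python
# Passing grades on the usual German scale, best to worst.
# Assumption: one bonus point moves the grade one step along this list.
GRADE_STEPS = [1.0, 1.3, 1.7, 2.0, 2.3, 2.7, 3.0, 3.3, 3.7, 4.0]


def bonus_points(assignment_grades):
    """One point per Excellent, one per two Very Good, capped at three."""
    excellent = assignment_grades.count("Excellent")
    very_good = assignment_grades.count("Very Good")
    return min(3, excellent + very_good // 2)


def final_grade(exam_grade, assignment_grades):
    """Apply bonus points to a passed exam; a failed exam stays failed."""
    if exam_grade > 4.0:
        # Assumption: anything worse than 4.0 (e.g. 5.0) is a failed exam,
        # and bonus points are ignored in that case.
        return exam_grade
    # Raises ValueError if exam_grade is not one of the assumed steps.
    step = GRADE_STEPS.index(exam_grade)
    improved = max(0, step - bonus_points(assignment_grades))
    return GRADE_STEPS[improved]


if __name__ == "__main__":
    grades = ["Excellent", "Excellent", "Pass", "Pass"]
    print(bonus_points(grades))      # two Excellent grades give two points
    print(final_grade(2.0, grades))  # the example from the text above
```

With two Excellent grades and a 2.0 in the exam, the sketch moves two steps up the scale (2.0, 1.7, 1.3), which matches the example given above. The grade cannot improve beyond 1.0.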

The final exams will be oral, and will cover all the material discussed in the lectures and the topics on which you did your assignments. The main exam will be on July 28th. The re-exam will be on September 29th. The exact time slot per student will be announced by email. Inform the lecturer of any potential clashes as soon as you know of them.

Hybrid Lecture, Zoom, and Privacy

This semester, TADA will be organized as a hybrid lecture. That is, I will generally lecture in person in lecture hall 0.05 of the CISPA building (E9.1) on campus, but all lectures will also be streamed live to Zoom and YouTube. The links to these will appear in the schedule above.

For those of you who do join via Zoom, please ask your questions via the chat, or use the raise-hand function to draw my attention.

I decided to use Zoom as a videoconferencing service. Note that this provider (Zoom Video Communications, Inc., 55 Almaden Blvd, Suite 600, San Jose, CA 95113, USA) can access all data that you provide when registering for the video conference. If you do not provide personal data during the registration, there is still a possibility that Zoom identifies you using your IP address. Neither CISPA, nor the computer science department would have decided to use Zoom if we considered this as a significant risk. As an additional precaution, we have opted to use European computing centers. Should you still have privacy concerns (and are using an Internet Service Provider that can map IP addresses to your name), we suggest using an anonymization service such as Tor.

You can find Zoom's complete privacy policy here.