Information-Theoretic Machine Learning 2024


News


Course Information

Type: Seminar (7 ECTS)
Lecturer: Prof. Dr. Jilles Vreeken and Dr. David Kaltenpoth
Email: vreeken (at) cispa.de
Meetings: Thursdays, 10:00 till 12:00, room 0.01, E9.1 (CISPA)
Summary: In this seminar, we will discuss information-theoretic approaches to machine learning. We will investigate, among others, the following questions: What is interesting and meaningful structure? How can we identify it from data without overfitting? What is a good model when we don't have a decent prior or target, or don't even know what we're looking for? What is the ultimate model, and how can we approximate it in practice? We'll explore these questions in light of Algorithmic Information Theory and its practical variant, the Minimum Description Length (MDL) principle. We will consider the relevance and application of these to a wide range of problems, from description and prediction to generalization, from neural to symbolic, from associative to causal, and so on.
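
To give a concrete flavor of the 'practical variant' part: a recurring trick in this area is to replace the uncomputable Kolmogorov complexity with the length an off-the-shelf compressor achieves, as in the normalized compression distance of Li et al. [9] and Cilibrasi & Vitányi [10] from the reading list below. The following is only an illustrative sketch, not course material; it uses Python's zlib as a stand-in for any real compressor, and the helper names are ours.

    # Normalized compression distance (NCD), cf. [9,10]: an off-the-shelf
    # compressor approximates the uncomputable Kolmogorov complexity K.
    import os
    import zlib

    def clen(x: bytes) -> int:
        """Compressed length in bytes, a crude upper bound on K(x)."""
        return len(zlib.compress(x, 9))

    def ncd(x: bytes, y: bytes) -> float:
        """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
        cx, cy, cxy = clen(x), clen(y), clen(x + y)
        return (cxy - min(cx, cy)) / max(cx, cy)

    text_a = b"the quick brown fox jumps over the lazy dog " * 20
    text_b = b"the quick brown fox leaps over the lazy cat " * 20
    noise = os.urandom(len(text_a))

    print(ncd(text_a, text_b))  # well below 1: the texts share most of their structure
    print(ncd(text_a, noise))   # close to 1: essentially nothing is shared

The same idea, with gzip and a simple k-nearest-neighbour vote on top, is essentially the parameter-free text classifier of Jiang et al. [22].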

Schedule

Month | Day | Time | Type | Topic | Slides | Req. Reading | Opt. Reading
Oct | 24 | 10:00 | L | Introduction | PDF | [1] Ch 1.1–1.10 | [1,2,3]
    | 31 | 10:00 | L | Compression | PDF | [1] Ch 2.1–2.2, 2.8 | [4,5,6]
Nov | 7 | 16:00 | L | Information | PDF | [2] Ch 2.1–2.8, 3.1, 8.1, 8.3–8.6, 12.1–12.2 | [1] Ch 2.8 (1.11), [7,8]
    | 14 | 10:00 | D | Discussion | | [9,10] | [11]
    | 19 | 10:00 | L | Description | PDF | [3] Ch 1 |
    | 26 | 10:00 | L | Coding | PDF | [1] Ch 1.11, [3] Ch 3 | [12]
Dec | 5 | 10:00 | D | Discussion | | [13,14] |
    | 12 | | | yay – no lecture | | |
    | 19* | 10:00 | L | Estimation | PDF | [15] | [16,17,18]
    | 26 | | | yay, holiday – no lecture | | |
Jan | 2 | | | yay, holiday – no lecture | | |
    | 9 | 10:00 | L | Prediction | PDF | [19,20] | [21,22,23]
    | 16 | 10:00 | D | Discussion | | [24] | [25] Ch 6
    | 23 | 10:00 | L | Causation | PDF | [26,27] |
    | 30 | 10:00 | L | Reinforcing Counterfactual Confounding | PDF | [28] |
Feb | 6 | 10:00 | D | Discussion | | [29] |
    | 13 | 14:00* | B | How to Present | PDF | |
    | 20 | | S | Presentation | | |

* By exception, these sessions will be held in room 0.07.

Lecture type key:

Materials

All required reading will be made available for download. You will need a username and password. These will be given out in the first meeting. If you need a reminder, think of a cat in a box.

The following books provide relevant background to many of the topics of the course: Li & Vitányi [1], Cover & Thomas [2], Jaynes [30], Grünwald [3], Rissanen [31], Rissanen [32], Yamanishi [33], and Wallace [34]. Most are also available in a so-called Semesterapparat (course reserve collection) at the university library.

[1] Li, M. & Vitányi, P. An Introduction to Kolmogorov Complexity and its Applications. Springer, 1993.
[2] Cover, T.M. & Thomas, J.A. Elements of Information Theory. Wiley-Interscience New York, 2006.
[3] Grünwald, P. The Minimum Description Length Principle. MIT Press, 2007.
[4] Kolmogorov, A. On Tables of Random Numbers. Sankhyā: The Indian Journal of Statistics, Series A, 25(4):369-376, 1963.
[5] Kolmogorov, A. Three Approaches to the Quantitative Definition of Information. Problemy Peredachi Informatsii, 1(1):3-11, 1965.
[6] Schmidhuber, J. Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010). IEEE Transactions on Autonomous Mental Development, 2(3), 2010.
[7] Shannon, C.E. A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379-423, 623-656, 1948.
[8] Shannon, C.E. The bandwagon. IRE transactions on Information Theory, 2(1):3, 1956.
[9] Li, M., Chen, X., Li, X., Ma, B. & Vitányi, P. The similarity metric. IEEE Transactions on Information Theory, 50(12):3250-3264, 2004.
[10] Cilibrasi, R. & Vitányi, P. Clustering by Compression. IEEE Transactions on Information Theory, 51(4):1523-1545, 2005.
[11] Faloutsos, C. & Megalooikonomou, V. On data mining, compression and Kolmogorov Complexity. Data Mining and Knowledge Discovery, 15(1):3-20, Springer-Verlag, 2007.
[12] Rissanen, J. Modeling by shortest data description. Automatica, 14(5):465-471, 1978.
[13] Miettinen, P. & Vreeken, J. MDL4BMF: Minimum Description Length for Boolean Matrix Factorization. ACM Transactions on Knowledge Discovery from Data, 2014.
[14] Blier, L. & Ollivier, Y. The Description Length of Deep Learning Models. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2018.
[15] Mandros, P., Boley, M. & Vreeken, J. Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), 2018.
[16] Goldfeld, Z. & Greenewald, K. Sliced Mutual Information: A Scalable Measure of Statistical Dependence. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2021.
[17] Belghazi, M.I., Baratin, A., Rajeswar, S., Ozair, S., Bengio, Y., Courville, A. & Hjelm, R.D. MINE: Mutual Information Neural Estimation. In Proceedings of the International Conference on Machine Learning (ICML), 2018.
[18] Marx, A. & Vreeken, J. Testing Conditional Independence on Discrete Data using Stochastic Complexity. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
[19] Solomonoff, R.J. A formal theory of inductive inference. Part I. Information and Control, 7(1):1-22, 1964.
[20] Solomonoff, R.J. A formal theory of inductive inference. Part II. Information and Control, 7(2):224-254, 1964.
[21] van Erven, T., Grünwald, P. & De Rooij, S. Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the AIC--BIC dilemma. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):361-417, Blackwell Publishing Ltd, 2012.
[22] Jiang, Z., Yang, M., Tsirlin, M., Tang, R., Dai, Y. & Lin, J. Low-Resource Text Classification: A Parameter-Free Classification Method with Compressors. In Findings of the Association for Computational Linguistics (ACL), pages 6810-6828, 2023.
[23] Grünwald, P., Kontkanen, P., Myllymäki, P., Silander, T. & Tirri, H. Minimum Encoding Approaches for Predictive Modeling. In Proceedings of the 14th International Conference on Uncertainty in Artificial Intelligence (UAI), 1998.
[24] Grau-Moya, J., Genewein, T., Hutter, M., Orseau, L., Deletang, G., Catt, E., Ruoss, A., Wenliang, L.K., Mattern, C., Aitchison, M. & Veness, J. Learning Universal Predictors. In Proceedings of the International Conference on Machine Learning (ICML), 2024.
[25] Schmidhuber, J. Algorithmic Theories of Everything. Technical Report IDSIA-20-00, IDSIA, 2000.
[26] Janzing, D. & Schölkopf, B. Causal Inference Using the Algorithmic Markov Condition. IEEE Transactions on Information Theory, 56(10):5168-5194, 2010.
[27] Mameche, S., Kaltenpoth, D. & Vreeken, J. Learning Causal Mechanisms under Independent Changes. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2023.
[28] Mameche, S., Vreeken, J. & Kaltenpoth, D. Identifying Confounding from Causal Mechanism Shifts. In Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS), 2024.
[29] Jiang, Y., Liu, E.Z., Eysenbach, B., Kolter, J.Z. & Finn, C. Learning Options via Compression. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2022.
[30] Jaynes, E.T. Probability Theory: The Logic of Science. Cambridge University Press, 2003.
[31] Rissanen, J. Information and complexity in statistical modeling. Springer, 2007.
[32] Rissanen, J. Optimal Estimation of Parameters. Springer, 2012.
[33] Yamanishi, K. Learning with the Minimum Description Length Principle. Springer, 2023.
[34] Wallace, C.S. Statistical and inductive inference by minimum message length. Springer, 2005.

Structure and Content

In general terms, the course will consist of

  1. introductory lectures,
  2. reading, discussing and presenting scientific articles, and
  3. a 'practical' assignment (optional).

At a high level, the topics we will cover will probably include:
  1. Algorithmic Information Theory
    • Induction by Compression
    • Optimal descriptions: Kolmogorov Complexity
    • Optimal prediction: Solomonoff's Universal Probability Distribution
    • Optimal generalization: Kolmogorov's Structure Functions
  2. Practical Information Theoretic Scores
    • The Minimum Description Length Principle
    • Two-Part MDL and Refined MDL
    • Prefix codes, Universal Codes, Prequential coding, and NML
  3. Information-Theoretic Machine Learning in Practice
    • MDL for Deep Neural Networks
    • Approximating the Universal Probability Distribution
    • Information-Theoretic Causal Inference
Loosely speaking, students will learn about the information-theoretically optimal approach to learning from data, as well as about more practical approaches and how to apply them. We will study many different types of tasks and models.
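
To make item 2 above concrete, here is a minimal sketch of crude two-part MDL model selection, assuming the standard asymptotic code lengths discussed in Grünwald [3], Ch 1: we pick the polynomial degree that minimizes L(M) + L(D|M), the bits needed to describe the model plus the bits needed to describe the data given the model. This is only an illustration, not course material.

    # Crude two-part MDL for choosing a polynomial degree: total description
    # length = L(M) + L(D|M), in bits.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100
    x = np.linspace(-1.0, 1.0, n)
    y = 1.0 - 2.0 * x + 0.5 * x**3 + rng.normal(scale=0.1, size=n)  # cubic signal plus noise

    def description_length(degree: int) -> float:
        # L(D|M): code length of the residuals under a Gaussian with the
        # maximum-likelihood variance, (n/2) * log2(RSS / n) bits, dropping
        # additive terms that are identical for every candidate model.
        coeffs = np.polyfit(x, y, degree)
        rss = float(np.sum((np.polyval(coeffs, x) - y) ** 2))
        data_bits = 0.5 * n * np.log2(rss / n)
        # L(M): roughly (1/2) * log2(n) bits per real-valued parameter.
        model_bits = 0.5 * (degree + 1) * np.log2(n)
        return data_bits + model_bits

    scores = {d: description_length(d) for d in range(10)}
    best = min(scores, key=scores.get)
    print("degree with the shortest total description:", best)
    # Higher degrees keep shrinking the residuals, but at some point the extra
    # parameter cost outweighs the savings: MDL's built-in guard against overfitting.

With the (k/2) log n parameter cost this crude score is essentially BIC; the refined, NML-based versions covered in the course avoid such asymptotic shortcuts.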

Course format

The course has one meeting per week, typically Thursdays 10-12, with a few exceptions due to room availability. The first part of the course will feature regular lectures covering the basic topics of the course, as well as sessions in which we discuss the material covered in the lectures and scientific articles assigned by the lecturer. During the second part, students will write an essay based on scientific articles assigned to them by the lecturer and prepare a presentation. All student presentations will be scheduled in block-seminar style, that is, on one or two days near the end of the semester; the exact date is to be announced.

There will be no weekly tutorial group meetings.

Student Presentations

Date | Time | Student | Topic | Reading
March 4 | 10:00 | Moritz | ChatGZIP | [1]
        | 10:40 | Max | Mutual Information and High Dimensional Data | [2]
        | 11:30 | Ferdinand | Compression and Creativity | [3]
        | 13:30 | Giorgi | Causality, Compression, and Multiple Datasets | [4]
        | 14:10 | Yannic | Algorithmic Causality | [5]

Every student will have to give a presentation and write a report on an assigned topic. The presentation should be between 15 and 20 minutes long and will be followed by a 10-minute discussion. The style of the presentation should be like a lecture: your fellow students should be able to follow it with only the previous lectures of the course as background material, you should only discuss those details that are necessary, and you should make use of examples where possible. Topics will be assigned by the lecturer.

In addition to the presentation, students will have to hand in a report that discusses the assigned topic. Unlike the presentation, which has to stay high-level, the report is where you are allowed to show off what you've learned. Summarize the material you read as clearly as you can in your own words, identify the key contributions and the most interesting or important aspects, relate the topic to previous lectures and/or papers read for the course, and refer correctly to the sources you've drawn from. The expected length of the report is 5-10 pages, but you are free to use as many pages as you like.

Students will be graded on the slides, the presentation, their answers in the discussion, and the report. Students may ask for feedback on their slides once a week, up to at the latest one week before the presentation. The reading material is provided as a starting point: for some topics it will be more than enough, for others you may want to gather additional material. When in doubt, contact the lecturer.

The deadline for your slides and report is right before the meeting in which you have to give your presentation.

You may use any template for either the report or the slides. For your convenience, we provide the template (pptx) used in the lecture.

Prerequisites

Students should have basic working knowledge of machine learning and statistics, e.g. by successfully having taken courses related to data mining, machine learning, and/or statistics, such as Topics in Algorithmic Data Analysis, Machine Learning, Probabilistic Graphical Models, Elements of Machine Learning, etc.

Signing Up

Discussions will be an essential element of this seminar, and hence there will be a maximum of twelve (12) participants.

Students who want to participate should register via the seminar registration system on or before October 16th, and make sure to specify: a) their name, b) their level (bachelor, n-th year master, PhD student), c) relevant courses taken so far, and d) a short motivation for why they want to participate.