## Assignment 1: The Warm-Up

The deadline is April 29th at 10:00, Saarbrücken standard time. You are free to hand in earlier. You will have to choose one topic from the list below, read the articles, and hand in a report that answers the assignment questions but, above all, critically discusses the material beyond them. Reports should summarise the key aspects, but more importantly should include original and critical thought that shows you have acquired a meta-level understanding of the topic – plain summaries will not suffice. All sources you use should be appropriately referenced, and any text you quote should be clearly identified as such. The expected length of a report is between 3 and 5 pages, but there is no limit.

For the topic of your assignment, choose one of the following:

1. #### The Scientific Discourse

Read Reshef et al. [1], where the authors introduce a novel measure of correlation (or dependence) to detect associations in large data sets. The authors present their findings rather confidently, yet other researchers don't seem to entirely agree. Read Simon & Tibshirani [2] and Kinney & Atwal [3] to see some of the rebuttals. In turn, Reshef et al. [4] responded to Kinney & Atwal [3], with Kinney & Atwal [5] answering back.

Who is right here? Are Reshef et al. [1] presenting false claims and over-selling their results? Are Kinney & Atwal [3] purposefully misinterpreting Reshef et al. [1] and ignoring their claims? Are Simon & Tibshirani [2] presenting sensible criticism, or are they being pedantic and beside the point? What is your opinion: is the concept of equitability a useful one, and is MIC a useful measure? Can data mining benefit from them? Consider also the latest pre-print [6] by the authors of [1] in your answer.

What does this exchange of letters and notes tell us about the process of doing science? Has the general public been misled by the tone and impact of these publications? Was Science magazine wrong to publish [1] so soon? Should they retract it? Was it acceptable for PNAS to publish [3]? What is the value of pre-print servers such as arXiv (where [2] and [6] are published)?
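To make the debate concrete, it helps to see why linear correlation can miss exactly the associations Reshef et al. [1] target. The sketch below is a toy stand-in, not MIC itself: MIC optimises a normalised mutual information over many grid resolutions, whereas this uses a single fixed histogram. It contrasts Pearson correlation with a plug-in mutual information estimate on a noiseless quadratic relationship:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x ** 2  # deterministic but nonlinear: Pearson correlation is ~0

# Pearson correlation is sensitive only to linear association
pearson = np.corrcoef(x, y)[0, 1]

def mi_histogram(x, y, bins=16):
    """Crude plug-in mutual information estimate (bits) from a 2-D histogram.
    MIC searches over many grids and normalises; this fixed grid is a toy."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nz = pxy > 0                          # avoid log(0) on empty cells
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())

print(f"Pearson r = {pearson:+.3f}")          # near 0: dependence missed
print(f"MI (bits) = {mi_histogram(x, y):.3f}")  # clearly positive: dependence found
```

The quadratic here is the textbook failure case for Pearson's r; whether a measure should score such noiseless relationships *equally* regardless of their functional form is precisely the equitability question the letters argue over.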

2. #### Connecting the Dots

Read the following three papers: [7], [8], and [9]. All three consider vastly different and quite fascinating topics. First, discuss to what extent they convince you. For example, do you find the goals clearly defined? Do the authors make principled or ad-hoc choices when defining the solution? Are the results convincing? Are the algorithms sufficiently evaluated? What extra experiments would you have liked to see? Can you identify possible improvements to the algorithms, and how would you approach the problem? Second, and most importantly, can you connect the dots between them?

3. #### I Want That One

Data mining aims at finding interesting novel knowledge from data. Yet, what is interesting? What is interestingness? De Bie [10,11] argues that interestingness is inherently subjective, that we should try to model the state of mind of the analyst, and proposes to do so using information-theoretic principles. Discuss. Example questions you could address include, but are not limited to: What are the desired properties of the functions $$InformationContent$$ and $$DescriptionalComplexity$$? Is it not a play on words to hide subjective interestingness in two objective scores? Does it make sense to make the one subjective and the other not? Can we make the other score subjective too? If so, how? The overarching goal of this assignment is to discuss how on earth we can measure 'interestingness' in a meaningful way.
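To ground the discussion, a toy instantiation of the ratio of the two scores may help. Every concrete choice below is an illustrative assumption rather than the definitions from [10,11]: the background model is simplified to independent Bernoulli cells with the observed column frequencies (De Bie derives a maximum-entropy model), the pattern is a dense tile in a binary matrix, and the description cost is a simple per-row/per-column count.

```python
import numpy as np

# Toy data: a small binary matrix with an implanted dense tile (rows 0-3, cols 0-2)
rng = np.random.default_rng(1)
D = (rng.random((20, 8)) < 0.2).astype(int)
D[:4, :3] = 1

# Simplified background model: independent Bernoulli cells,
# parameterised by the observed column frequencies
p_col = D.mean(axis=0)

def information_content(rows, cols):
    """-log2 P(every cell of the tile is 1) under the background model."""
    return float(-np.log2(p_col[cols]).sum() * len(rows))

def description_complexity(rows, cols, alpha=1.0, beta=1.0):
    """Assumed cost of telling the analyst which rows and columns the tile covers."""
    return alpha * len(rows) + beta * len(cols)

def interestingness(rows, cols):
    # De Bie's idea: information gained per unit of communication effort
    return information_content(rows, cols) / description_complexity(rows, cols)

tile = (np.arange(4), np.arange(3))
print(f"interestingness of implanted tile: {interestingness(*tile):.2f}")
```

Even this toy version exposes the assignment's questions: the numerator is "subjective" only through the background model encoding what the analyst already believes, while the denominator's $$\alpha, \beta$$ costs are fixed by the designer. Whether those costs could, or should, also depend on the analyst is exactly the point to discuss.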

4. #### Deep Learning: The Best Thing Since Sliced Bread or Just Another Bottle of Snake Oil?

To do well on this assignment, you must have a good understanding of deep learning. Do not choose it just because omg so hype!

The most talked-about machine learning topic of recent years has definitely been deep learning. Many researchers have claimed impressive results with deep learning: it can classify images [12], write LaTeX articles that almost compile [13], beat humans at Go [14], and achieve scientific breakthroughs [15].

Explain what deep learning is, what is "deep" about deep learning, and how these applications really use this 'deepness'. Do they all fit under one unified "deep learning" algorithm? If not, explain the core differences between the approaches. Are they all really (only) about deep learning? For example, is AlphaGo primarily a deep learning method or a reinforcement learning method? Did it do all feature selection automatically, or were there hand-crafted rules? Would you credit the deep net for the discovery of the new antibiotic, or did it get a bit of help?
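One way to make "what is deep" concrete is the classic XOR example: a single linear layer provably cannot represent XOR, while stacking just one nonlinear hidden layer suffices. The NumPy sketch below is a minimal toy illustration of that layer composition, not a model of any of the cited systems:

```python
import numpy as np

# XOR is not linearly separable: no single linear layer can represent it.
# One hidden layer with a nonlinearity is the minimal amount of "depth" needed.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.], [1.], [1.], [0.]])

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # hidden layer parameters
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # output layer parameters

for _ in range(4000):                         # plain full-batch gradient descent
    h = np.tanh(X @ W1 + b1)                  # layer 1: learned nonlinear features
    out = 1 / (1 + np.exp(-(h @ W2 + b2)))    # layer 2: logistic readout
    g_out = out - y                           # gradient of cross-entropy at the logits
    g_h = (g_out @ W2.T) * (1 - h ** 2)       # backpropagate through tanh
    W2 -= 0.5 * h.T @ g_out; b2 -= 0.5 * g_out.sum(axis=0)
    W1 -= 0.5 * X.T @ g_h;   b1 -= 0.5 * g_h.sum(axis=0)

print(np.round(out.ravel(), 2))  # should approach [0, 1, 1, 0]
```

Of course, the gap between this two-layer toy and AlphaGo or the antibiotic model is exactly where the assignment questions live: what the extra layers buy, and how much of each system is the network versus everything built around it.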

And have all these applications really been resounding success stories? Read also [16], [17], and [18]. What are, in your view, the core problems in deep learning? Are interpretability and robustness of neural networks realistic goals, or a pipe dream?

The overarching question is, do deep learning and knowledge discovery go together, can they support each other, or are they mostly mutually exclusive?

Return the assignment by email to tada-staff@mpi-inf.mpg.de by 29 April, 1000 hours. The subject of the email must start with [TADA]. The assignment must be returned as a PDF and it must contain your name, matriculation number, and e-mail address together with the exact topic of the assignment.

Grading will take into account both the hardness of the questions you address and whether you answer the bonus questions.

## References

You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password.

[1] Reshef, D.N., Reshef, Y.A., Finucane, H.K., Grossman, S.R., McVean, G., Turnbaugh, P.J., Lander, E.S., Mitzenmacher, M. & Sabeti, P.C. Detecting Novel Associations in Large Data Sets. Science, 334(6062):1518-1524, 2011.

[2] Simon, N. & Tibshirani, R. Comment on "Detecting Novel Associations in Large Data Sets" by Reshef et al., Science Dec 16, 2011. arXiv:1401.7645, 2014.

[3] Kinney, J.B. & Atwal, G.S. Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA, 111(9):3354-3359, 2014.

[4] Reshef, D.N., Reshef, Y.A., Mitzenmacher, M. & Sabeti, P.C. Cleaning up the record on the maximal information coefficient and equitability. Proc. Natl. Acad. Sci. USA, 111(33):E3362-E3363, 2014.

[5] Kinney, J.B. & Atwal, G.S. Reply to Reshef et al.: Falsifiability or bust. Proc. Natl. Acad. Sci. USA, 111(33):E3364, 2014.

[6] Reshef, Y.A., Reshef, D.N., Sabeti, P.C. & Mitzenmacher, M.M. Equitability, interval estimation, and statistical power. arXiv, 2015.

[7] Shahaf, D. & Guestrin, C. Connecting the dots between news articles. In Proceedings of the 16th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), Washington, DC, pages 623-632, 2010.

[8] Shahaf, D., Horvitz, E. & Mankoff, R. Inside Jokes: Identifying Humorous Cartoon Captions. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 1065-1074, ACM, 2015.

[9] Hope, T., Chan, J., Kittur, A. & Shahaf, D. Accelerating Innovation Through Analogy Mining. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), pages 235-243, ACM, 2017.

[10] De Bie, T. An information theoretic framework for data mining. In Proceedings of the 17th ACM International Conference on Knowledge Discovery and Data Mining (SIGKDD), San Diego, CA, pages 564-572, ACM, 2011.

[11] De Bie, T. Subjective Interestingness in Exploratory Data Mining. Springer, 2013.

[12] Krizhevsky, A., Sutskever, I. & Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS '12, pages 1097-1105, 2012.

[13] Karpathy, A. The Unreasonable Effectiveness of Recurrent Neural Networks. Blog post, 2015.

[14] Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T. & Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

[15] Stokes, J.M., Yang, K., Swanson, K., Jin, W., Cubillos-Ruiz, A., Donghia, N.M., MacNair, C.R., French, S., Carfrae, L.A., Bloom-Ackermann, Z., Tran, V.M., Chiappino-Pepe, A., Badran, A.H., Andrews, I.W., Chory, E.J., Church, G.M., Brown, E.D., Jaakkola, T.S., Barzilay, R. & Collins, J.J. A Deep Learning Approach to Antibiotic Discovery. Cell, 180(4):688-702.e13, 2020.

[16] Khurshudov, A. Suddenly, a leopard print sofa appears. Blog post, 2015.

[17] Curtis, S. Google Photos labels black people as 'gorillas'. News article, 2015.

[18] Zügner, D., Akbarnejad, A. & Günnemann, S. Adversarial Attacks on Neural Networks for Graph Data. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), 2018.