|Type||Seminar (7 ECTS)|
|Lecturer||Dr. Mario Boley|
|mboley+sgd (at) mpi-inf.mpg.de|
Wednesday, 10:15–11:45 in Room E1.7 3.23.
|Max Capacity||Participants are by now fixed. Guests are always welcome.|
|Summary||In this seminar we'll be investigating the following questions: How can we identify whether there exist any sub-populations in your data that stand out and are easy to describe? How can we efficiently discover such subgroups from data? How can we do so with optimality guarantees? How can we define 'standing out' in a meaningful manner? and, what does easy to describe mean? We'll be exploring this from a local pattern mining, or better, subgroup discovery perspective.|
|May||24||Lecture||Introduction to seminar and subgroup discovery|
|May||31||Lecture||Topic intro 1: Branch-&-bound (A), Optimistic estimators (B), Redundancy (C)|
|June||7||Lecture||Topic intro 2: Heuristic solvers (D), Statistical considerations (E), Applications (F)|
|June||14||Lecture||Report and presentation instructions, Q&A|
|August||3||Talks||Student presentations (9:40-12:00 am)||A, C, F|
Please submit required documents on day of deadline via mail to lecturer (and mentor if different from lecturer).
All required reading will be made available here. You will need a username and password to access the papers outside the MPI network. Contact the lecturer if you don't know the username or password. Latex templates for the paper and the extended abstract are here.
|||Subgroup discovery. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 5(1):35-49, Wiley Online Library, 2015.|
|||Identifying Consistent Statements about Numerical Data with Dispersion-Corrected Subgroup Discovery. arXiv preprint arXiv:1701.07696, 2017.|
|||Direct local pattern sampling by efficient two-step random procedures. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 582-590, 2011.|
|||Non-redundant subgroup discovery using a closure system. Machine Learning and Knowledge Discovery in Databases, pages 179-194, Springer, 2009.|
|||Exploiting False Discoveries--Statistical Validation of Patterns and Quality Measures in Subgroup Discovery. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pages 151-160, 2011.|
|||Different slopes for different folks: mining for exceptional regression models with cook's distance. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 868-876, 2012.|
|||Uncovering structure-property relationships of materials by subgroup discovery. New Journal of Physics, 19(1):013031, IOP Publishing Ltd and Deutsche Physikalische Gesellschaft, 2017.|
|||Tight optimistic estimates for fast subgroup discovery. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 440-456, 2008.|
|||Subgroup discovery for election analysis: a case study in descriptive data mining. In International Conference on Discovery Science, pages 57-71, 2010.|
|||Fast and memory-efficient discovery of the top-k relevant subgroups in a reduced candidate space. Machine Learning and Knowledge Discovery in Databases, pages 533-548, Springer, 2011.|
|||Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5(Feb):153-188, 2004.|
|||Fast exhaustive subgroup discovery with numerical target concepts. Data Mining and Knowledge Discovery, 30(3):711-762, Springer, 2016.|
|||Discovering statistically non-redundant subgroups. Knowledge-Based Systems, 67:315-327, Elsevier, 2014.|
|||Instant exceptional model mining using weighted controlled pattern sampling. In International Symposium on Intelligent Data Analysis, pages 203-214, 2014.|
|||Interpreting PET scans by structured patient data: a data mining case study in dementia research. Knowledge and Information Systems, 24(1):149-170, Springer, 2010.|
|||Subgroup Discovery with Proper Scoring Rules. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 492-510, 2016.|
|||OPUS: An efficient admissible algorithm for unordered search. Journal of Artificial Intelligence Research, 3:431-465, 1995.|
|||Discovering associations with numeric variables. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining, pages 383-388, 2001.|
In general terms, the course will consist of
The seminar is split into two parts.
The first phase will take place in four consecutive weeks during the semester (May 22 to June 16---one meeting per week, on Wednesday between 10:00 and 12:00). In this phase there will be two to three introductory lectures on subgroup discovery basics and some initial reading assignments to all participants. The phase is concluded by one discussion meeting where we identify key scientific questions and problems.
In the second phase, students prepare one detailed report (8 pages) and one short lecture (40m) on one assigned topic as well as one short report (1 page) on a secondary topic. Finally, we will conclude the seminar with the student lectures, which will be held in block-form in the first week after the end of the course period (July 31 to August 4), during which there most likely will be two to four presentations per day, most likely between 9:00 and 13:00. The exact times, and whether we will use all days, are to be decided.
Students should have basic working knowledge of data analysis and statistics, e.g. by successfully having taken courses related to data mining, machine learning, and/or statistics, such as Topics in Algorithmic Data Analysis, Information Retrieval and Data Mining, Machine Learning, Probabilistic Graphical Models, Statistical Learning, etc.
Every student will have to give a presentation and write a paper on an assigned main topic as well as handing on one short extended abstract on a secondary topic.
The presentation should be approximately 30 minutes long, and will be followed by a 10 minute discussion. The style of the presentation should be like a lecture; your fellow students can follow it with only the previous lectures of the course as background material, you should only discuss those details that are necessary, and make use of examples where possible. Topic will be assigned by the lecturer.
In addition to the presentation, students will have to hand in a paper that discusses the assigned topic. Unlike the presentation, which has to be sufficiently high-level, the paper is where you are allowed to show what you've learned in more detail. Summarize the material you read as clearly as you can in your own words, identifying the key contributions/most interesting or important aspects, relating the topic to any/all previous lectures in the course and/or papers read for the course, all the while correctly referring to the sources you've drawn from. The expected length of the report is 8 pages.
Finally, the extended abstract about the a secondary topic is a short document of at most two pages. Please use to the provided Latex templates for both documents. The templates also contain further information on required content and style.
Students will be graded on the slides, the presentation, your answers in the discussion, and the two written doscuments. Students are allowed to ask for feedback on their slides once a week, up till latest 1 week before the presentation. The provided reading material are provided as starting points. For some topics the provided materials will be enough, for others you may want to gather more material. When in doubt, contact the lecturer.