The deadline for this example assignment was long in the past. Students were free to hand in earlier. They had to choose one topic from the (truncated) list below, read the articles, and hand in a report that critically discussed this material and answers the assignment questions. Reports should have summarise the key aspects, but more importantly, should have included original and critical thought that showed the student acquired a meta-level understanding of the topic – plain summaries did not suffice. All sources used should have be appropriately referenced, any text quoted should have been clearly identified as such. The expected length of a report was between 3 to 5 pages, but there was no limit.
There were four topics for the assignment, but for exposition we only give the ones for which we provide the sample report.
Whenever we apply data-driven methods we have to be very careful that we do not introduce any algorithmic bias; we have to prevent our algorithms from making unfair or discriminative decisions, regardless of whether this is because of how what assumptions the method makes, or whether these biases might have been present in the data we trained the method on. Statistical parity, which assumes that all is fair if we make sure that the distribution of the sensitive attribute is the same between the groups we decide over, is a popular notion in fair machine learning, but has been attacked by Dwork et al. [1]. Bonchi et al. [2] do not mention parity at all in their paper, as they take a causality based approach. Read this paper carefully and critically. Do you find their approach convincing? Does it live up to the critiques that Dwork gives against statistical parity? Do you see potential other problems in this approach? How could it possibly go wrong? What are the main strengths, in your opinion, and what are the main weaknesses?
(Bonus) In many applications it is not so much about making a (binary) decision, but rather about ranking elements. Say, a search engine that we can query to discover who are the smartest students on campus—and of course we want to be fair with regard to hair colour. How would you solve the fair-ranking problem, if your only tool is statistical parity? And, how would you solve the same problem if you would have the SBCN?
Example The following report is an actual report from 2021 that was concise, clearly written, showed creative and critical thought, and therewith graded as Excellent.
The standard approach to mine frequent itemsets is to
and iterate until done. Eclat [3], proposed in 1997, is an example of a simple yet very efficient algorithm for mining frequent itemsets that follows this principle in a depth-first search.
Han et al. [4] claim that their method can mine frequent itemsets without candidate generation. This raises the question: where did the candidates go? Discuss whether this claim is valid or not, and why.
(Bonus) TreeProjection [5] was proposed before FPGrowth [4]. Han et al. [4] almost aggressively discuss that FPGrowth is really different than TreeProjection. Is it really? Why (not)? Discuss, and if possible, give an example where they are (not) different. Hint: consider how they explore the search space.
Example The following report is an actual report from 2015 that was concise, clearly written, showed creative and critical thought, and therewith graded as Excellent.
Grading took into account both Hardness of questions, as well as whether the Bonus questions were answered.
You will need a username and password to access the papers. It will be given out during the first lecture.
[1] | Fairness through Awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference (ITCS), ACM, 2012. |
[2] | Exposing the probabilistic causal structure of discrimination. Data Science and Analytics, Springer, 2017. |
[3] | New algorithms for fast discovery of association rules. In Proceedings of the 3rd ACM International Conference on Knowledge Discovery and Data Mining (KDD), Newport Beach, CA, 1997. |
[4] | Mining frequent patterns without candidate generation. In Proceedings of the ACM International Conference on Management of Data (SIGMOD), Dallas, TX, pages 1-12, ACM, 2000. |
[5] | A tree projection algorithm for generation of frequent item sets. J. Parallel Distr. Com., 61(3):350-371, 2001. |