Abstract. Finding (functional) dependencies between attributes in databases is a well-known problem with applications in knowledge discovery, feature selection, and database management. While the recently introduced reliable fraction of information measure allows to soundly quantify dependence in a way that avoids overfitting when optimizing over high-dimensional spaces, the algorithmic implications of using this score have not yet been systematically explored. This includes the computational complexity of the resulting optimization problem.
To this end, this paper provides the following contributions: We show that the problem of maximizing the reliable fraction of information is NP-hard, which justifies the usage of worst-case exponential-time as well as heuristic search methods that do not guarantee optimal solutions. We then greatly improve the practical performance for both of these optimization styles by deriving a novel admissible bounding function, which has an unbounded potential for additional pruning over the previously proposed one. Finally, we empirically investigate for the first time the approximation ratio of the greedy algorithm and show that in fact it produces highly competitive results in a fraction of time needed for complete branch-and-bound style search. All findings are evaluated on a wide range of real-world datasets that are publicly available along with the implementation of the algorithmic contributions.
Our results suggest that in scenarios where no hard optimality guarantees are required, greedy optimization is a good alternative to branch-and-bound for dependency discovery. Also, the definition of the tighter bounding function is potentially more generally applicable than just to the reliable fraction of information and might be transferrable to other dependency measures.
Important Fedora has been generalized to mixed-type data with FDDNum
Discovering Dependencies with Reliable Mutual Information. Knowledge and Information Systems vol.62, pp 4223-4253, Springer, 2020. (IF 2.936) |
|
Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms. In: Proceedings of the IEEE International Conference on Data Mining (ICDM'18), IEEE, 2018. (full paper, 8.9% acceptance rate; overall 19.9%) |
|
Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms. Technical Report 1809.05467, arXiv, 2018. |
|
Discovering Reliable Approximate Functional Dependencies. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), pp 355-363, ACM, 2017. (oral presentation, 8.6% acceptance rate; overall 17.5%) |
|