FOLD vs Decision Tree
A comparison of the classification performance of these methods and of the ideas behind generating rules from data.
FOLDR++ and FOLDSE both generate a decision list as their output. A decision list, also called an ordered rule set, is a collection of individual classification rules that collectively form a classifier. Decision lists are considered more interpretable than decision trees because of their reduced complexity. Lakkaraju et al. introduce decision sets, which they claim are more interpretable than decision lists. They mention some limitations of decision lists: for example, later rules in the list can only cover a narrow slice of the feature space. Moreover, when interpreting the rules behind a given decision, it can be challenging to follow the decision path, because every preceding rule must be checked before the rule that fires is reached. They also discuss interpretability and explainability at great length in the paper.
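To make the distinction concrete, here is a minimal sketch of how the two are evaluated. The rules, features, and tie-break policy are hypothetical illustrations, not taken from FOLDSE or the Lakkaraju et al. paper: a decision list fires the first matching rule in order, while a decision set checks all rules independently and must somehow resolve overlaps.

```python
# Minimal sketch of decision-list vs decision-set evaluation.
# The rules, features, and tie-break policy are hypothetical illustrations.

def eval_decision_list(rules, default, x):
    """Ordered rule set: the FIRST rule whose condition matches decides."""
    for condition, label in rules:
        if condition(x):
            return label
    return default  # fall through to the default class

def eval_decision_set(rules, default, x):
    """Unordered rule set: ALL matching rules vote, so overlaps need a tie-break."""
    votes = [label for condition, label in rules if condition(x)]
    if not votes:
        return default
    # Majority vote; on a tie, the alphabetically first label wins here.
    return max(sorted(set(votes)), key=votes.count)

rules = [
    (lambda x: x["age"] < 30, "low_risk"),
    (lambda x: x["income"] > 50_000, "low_risk"),
    (lambda x: x["defaults"] > 0, "high_risk"),
]
x = {"age": 25, "income": 20_000, "defaults": 1}
print(eval_decision_list(rules, "high_risk", x))  # low_risk: first matching rule wins
print(eval_decision_set(rules, "high_risk", x))   # high_risk: overlapping rules disagree
```

The overlap in the second call is exactly the interpretability problem Lakkaraju et al. measure; the ordering in the first call is what makes later rules hard to read in isolation.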
Here are the questions I encountered while reading about decision lists, decision trees and decision sets:
Q) Is there some optimal way of learning a decision tree? Bertsimas et al. use Mixed Integer Optimization (MIO) to learn decision trees. They show that their method generates optimal trees and outperforms CART. From a first read of the paper, it is not obvious how they define “optimal” in this context.
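My reading is that “optimal” there means globally minimizing a single training objective over all trees up to a fixed depth, instead of growing the tree greedily one split at a time as CART does. Paraphrasing the shape of their objective from memory (the exact MIO formulation and constraints are in the paper):

$$
\min_{T \,:\, \operatorname{depth}(T) \le D} \;\; \frac{1}{\hat{L}} \sum_{i=1}^{n} \mathbb{1}\big[\, y_i \neq T(x_i) \,\big] \;+\; \alpha \cdot (\text{number of splits in } T)
$$

Here $\hat{L}$ normalizes the training error against a baseline predictor and $\alpha$ trades accuracy off against tree complexity; “optimal” is with respect to this objective on the training data, not the test set.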
Q) What are the ways in which optimality is defined? One notion is provably finding the hypothesis with the highest accuracy on the training data (possibly subject to a complexity penalty); another is minimizing the size of the program. Optimal Classification Trees uses the first definition of optimality; ILASP uses the second. Yu et al. define optimality as having the minimum number of literals.
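Stated loosely, the two notions look like:

$$
h^{*} = \operatorname*{arg\,max}_{h \in \mathcal{H}} \operatorname{acc}(h, D)
\qquad \text{vs.} \qquad
h^{*} = \operatorname*{arg\,min}_{\substack{h \in \mathcal{H} \\ h \text{ consistent with } D}} \operatorname{size}(h)
$$

where $\operatorname{size}(h)$ can be measured in rules, in program length, or, as in Yu et al., in the total number of literals.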
Q) Are there any theoretical foundations of a decision list? A lot of work exists in this area. Heuristic-based approaches include RIPPER and FOLDSE. Different articles address different definitions of optimality; Yu et al. find decision sets and decision lists that are optimal in terms of the minimum number of literals, using SAT solvers to perform the search.
Q) How does FOLDSE compare against the decision-list-generating algorithms? The only comparison available is the one with RIPPER, but no experimental details are provided, such as the hyperparameters used for either RIPPER or FOLDSE. Although the rule lists generated by FOLDSE are smaller, no explanation is given for why this is observed, especially since both RIPPER and FOLDSE use a similar sequential covering algorithm to generate the rule set. A possible explanation is the Magic GINI Impurity heuristic; it would be interesting to see its effect if it were plugged into RIPPER. A comparison with other decision-list-generating algorithms would also be interesting.
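For reference, the textbook Gini impurity computation for a candidate literal looks like the sketch below. This is the standard definition; exactly how FOLDSE's “Magic” variant differs from it is an assumption I have not verified against the paper.

```python
# Textbook Gini impurity of the split induced by a single candidate literal.
# Whether FOLDSE's "Magic Gini Impurity" matches this exactly is an assumption;
# the precise heuristic is in the FOLDSE paper.

def gini(pos, neg):
    """Gini impurity of a set with `pos` positive and `neg` negative examples."""
    total = pos + neg
    if total == 0:
        return 0.0
    p = pos / total
    return 2.0 * p * (1.0 - p)

def split_impurity(tp, fp, tn, fn):
    """Weighted impurity after splitting on a literal.

    tp, fp: positives and negatives covered by the literal;
    fn, tn: positives and negatives it does not cover.
    """
    covered, uncovered = tp + fp, fn + tn
    total = covered + uncovered
    return (covered / total) * gini(tp, fp) + (uncovered / total) * gini(fn, tn)

# The candidate literal with the lowest post-split impurity is preferred.
print(split_impurity(tp=40, fp=5, tn=45, fn=10))  # ~0.2525
```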
Q) Are there any theoretical foundations of a decision set? The authors of the paper Interpretable Decision Sets: A Joint Framework for Description and Prediction present an algorithm for generating interpretable decision sets. They claim that, as long as certain conditions are met, the expressive power of a decision set is equivalent to that of a decision tree. Based on a human evaluation, they report that “humans were three times more accurate given a decision set versus a decision list, and they used 74% fewer words and 71% less time to write their descriptions.” They also consider “overlap” as a metric for interpretability, defined as the number of data points that satisfy multiple rules. Note that FOLDSE's output has no overlap: a rule in a decision list only applies when all earlier rules fail, and during learning examples are removed as they are covered. They define an objective function based on interpretability and accuracy metrics which is non-negative, non-normal, non-monotone and submodular. Because of these properties, the objective can be approximately optimized with theoretical guarantees; they use the Smooth Local Search (SLS) algorithm to obtain a near-optimal decision set.
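The overlap metric is straightforward to state in code; the sketch below follows the definition quoted above (the (condition, label) rule representation is my own hypothetical choice):

```python
# Overlap metric from Lakkaraju et al.: the number of data points that
# satisfy more than one rule. The (condition, label) representation is
# my own hypothetical choice.

def overlap(rules, data):
    """Count the points in `data` covered by two or more rules."""
    return sum(
        1 for x in data
        if sum(1 for condition, _label in rules if condition(x)) >= 2
    )

rules = [
    (lambda x: x["age"] < 30, "low_risk"),
    (lambda x: x["income"] > 50_000, "low_risk"),
]
data = [{"age": 25, "income": 60_000}, {"age": 40, "income": 20_000}]
print(overlap(rules, data))  # 1: only the first point satisfies both rules
```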
Q) What optimality does ILASP talk about? The ILASP system searches for an optimal program, where the authors define an optimal program as one with the fewest literals. They use CLINGO for the search. Unlike a greedy approach, this is guaranteed to find an optimal program. Hence, FOLDSE could perhaps be compared against ILASP.
Q) What type of algorithm does FOLDSE use in terms of the available literature on learning decision lists? FOLDSE uses sequential covering to generate its rule lists, an algorithm that is widely employed to generate rule lists and is also used in the popular rule learning algorithm RIPPER (Repeated Incremental Pruning to Produce Error Reduction). Fürnkranz et al., in their book Foundations of Rule Learning, describe various rule learning algorithms and their properties in detail.
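A minimal sketch of the generic sequential covering (separate-and-conquer) skeleton is shown below. The `learn_one_rule` step is a placeholder for where the individual algorithms actually differ; this is the textbook loop, not FOLDSE's actual code:

```python
# Generic sequential covering (separate-and-conquer) skeleton.
# `learn_one_rule` is a placeholder: FOLDSE, RIPPER, etc. differ in how they
# search for the best single rule and in when they stop.

def sequential_covering(pos, neg, learn_one_rule, default_label):
    """Learn an ordered rule list for the positive class."""
    rules = []
    pos = list(pos)
    while pos:
        rule = learn_one_rule(pos, neg)   # find a rule covering some positives
        covered = [x for x in pos if rule.covers(x)]
        if not covered:                   # no progress: stop to avoid looping
            break
        rules.append(rule)
        pos = [x for x in pos if not rule.covers(x)]  # remove covered examples
    return rules, default_label           # ordered rules plus a default class
```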
Q) What are the strengths of FOLDSE? FOLDSE is fast largely because of its prefix sum technique for evaluating candidate literals. The way categorical and numerical values are compared could also contribute both to its speed and to the small size of the program it generates.
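My understanding of the prefix sum trick, sketched for the general case (this illustrates the technique, not FOLDSE's actual implementation): after sorting the examples by a numeric feature once, prefix sums of the positive and negative counts let every candidate threshold be scored in O(1), so evaluating all thresholds of one feature costs O(n log n) for the sort plus O(n) for the scan instead of O(n^2):

```python
# Sketch of the prefix-sum trick for scoring numeric thresholds.
# This illustrates the general technique; FOLDSE's actual implementation
# and its "Magic Gini" scoring may differ in detail.
from itertools import accumulate

def threshold_counts(values_and_labels):
    """Sort by feature value, then precompute running positive/negative counts."""
    data = sorted(values_and_labels)                            # sort once: O(n log n)
    pos_prefix = list(accumulate(1 if y else 0 for _, y in data))
    neg_prefix = list(accumulate(0 if y else 1 for _, y in data))
    # Any threshold between data[i] and data[i+1] can now be scored in O(1):
    # positives covered = pos_prefix[i], negatives covered = neg_prefix[i],
    # with the uncovered side obtained by subtracting from the totals.
    return data, pos_prefix, neg_prefix

data, pp, nn = threshold_counts([(2.5, True), (3.1, False), (4.0, True), (5.2, True)])
i = 1                   # threshold between 3.1 and 4.0
print(pp[i], nn[i])     # 1 positive and 1 negative at or below the threshold
```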
[//]: # (There can be a final rule added that serves as a “catch all” rule that the algorithm defaults to in case no other rule applies.)