Soft Operators Decision Trees. Uncertainty and stability related issues
Data are now collected in large quantities in a growing number of fields, which creates a demand for methods that extract relevant information from huge databases. Among the existing data mining models, decision trees are widely used because they offer a good trade-off between accuracy and interpretability. One of their main problems, however, is that they are very unstable: almost identical learning samples can produce quite different trees, which confuses users and complicates the knowledge discovery process. In the current work, binary tree classifiers are analyzed and partially improved. The analysis ranges from the topology of tree classifiers, seen from the point of view of graph theory, to the creation of a new tree classification model that combines decision trees with soft comparison operators (Mlynski, 2003). The purpose of this model is not only to overcome the well-known instability problem of decision trees but also to give them the ability to deal with uncertainty. To study and compare the structural stability of tree classifiers, we propose an instability coefficient based on the notion of Lipschitz continuity, and we offer a metric that measures the proximity between decision trees. The main part of this thesis presents our model, the "Soft Operators Decision Tree" (SODT). We describe its construction, its application, and the consistency of the mathematical formulation behind it. Finally, we show the results of an implementation of SODT and compare numerically the stability and accuracy of a SODT and a crisp DT. The numerical simulations support the stability hypothesis, and SODT shows a smaller tendency to overfit the training data than the crisp DT.
A further aspect of including soft operators is that we choose them so that the resulting goodness function used by the method is differentiable, which allows the best split points to be computed by gradient descent methods. The main drawback of SODT is the incorporation of the imprecision factor, which increases the complexity of the algorithm.
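The idea of a differentiable goodness function can be illustrated with a minimal sketch. This is not the SODT formulation itself. The logistic function stands in for the soft comparison operator, a membership-weighted Gini impurity stands in for the goodness function, and the derivative is approximated numerically for brevity. The names `soft_greater`, `soft_gini`, `fit_split`, and the parameters `width`, `lr`, `steps` are all hypothetical choices made for this illustration.

```python
import math


def soft_greater(x, t, width):
    """Soft version of the crisp test x > t.

    Returns a degree in (0, 1) instead of True/False. The logistic
    shape is an assumption of this sketch; `width` plays the role of
    an imprecision factor (larger width gives a softer comparison).
    """
    return 1.0 / (1.0 + math.exp(-(x - t) / width))


def soft_gini(xs, ys, t, width):
    """Membership-weighted Gini impurity of a soft split at t.

    ys are binary labels in {0, 1}. Each sample belongs to the right
    branch with degree soft_greater(x, t, width) and to the left
    branch with the complementary degree.
    """
    right = [soft_greater(x, t, width) for x in xs]
    n_right = sum(right)
    n_left = len(xs) - n_right
    pos_right = sum(m * y for m, y in zip(right, ys))
    pos_left = sum(ys) - pos_right

    def gini(pos, n):
        if n <= 1e-12:  # empty branch contributes no impurity
            return 0.0
        p = pos / n
        return 2.0 * p * (1.0 - p)

    total = n_left * gini(pos_left, n_left) + n_right * gini(pos_right, n_right)
    return total / len(xs)


def fit_split(xs, ys, t0, width=0.5, lr=0.5, steps=200, eps=1e-5):
    """Plain gradient descent on the split point t.

    The derivative is taken by central differences, which is only
    possible because soft_gini is smooth in t. A crisp split would
    give a piecewise-constant impurity with zero gradient almost
    everywhere.
    """
    t = t0
    for _ in range(steps):
        grad = (soft_gini(xs, ys, t + eps, width)
                - soft_gini(xs, ys, t - eps, width)) / (2.0 * eps)
        t -= lr * grad
    return t


if __name__ == "__main__":
    xs = [0, 1, 2, 3, 4, 5, 6, 7]
    ys = [0, 0, 0, 0, 1, 1, 1, 1]
    t = fit_split(xs, ys, t0=2.0)
    print("fitted split point:", t)
    print("soft impurity at fitted point:", soft_gini(xs, ys, t, 0.5))
```

On this toy sample the classes separate between 3 and 4, so the descent moves the split point from its starting value toward that gap. The extra cost relative to a crisp split is visible in the sketch: every sample contributes to both branches, and `width` must be chosen in addition to the split point.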