Why Are Decision Trees so Popular?
For many years, Decision Trees have been a popular modelling approach among Data Scientists and Analysts thanks to their visual nature, ease of use and interpretation, lack of distributional assumptions about the data, and simple deployment. If model explanation is key, walking senior managers through a visual, rules-based tree is far easier than explaining a complex neural network or odds ratios.
In a KDnuggets poll based on a sample of 844 voters, Decision Trees were the third most popular algorithm, used by 55% of respondents: http://www.kdnuggets.com/2016/09/great-algorithm-tutorial-roundup.html
Decision Tree Applications
Decision Trees can be used for a variety of tasks within a data mining project including:
- Data exploration - to understand relationships in the data and identify key drivers of behaviour
- Data preparation - to optimally bin variables in the most predictive way
- Variable reduction - to subset the most predictive variables from a set of predictors
- Building a baseline model - to quickly assess potential model lift in predicting a dependent variable from a set of predictors
- Building a final model
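The "optimal binning" task above can be illustrated with a minimal pure-Python sketch (the data and variable names are hypothetical, and this is not Angoss's algorithm): it finds the single most predictive cut point for a numeric variable by Gini impurity reduction, the same criterion behind Gini-based tree splits.

```python
def gini(labels):
    """Gini impurity of a set of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n              # proportion of positives
    return 2 * p * (1 - p)

def best_cut(values, labels):
    """Return (cut point, impurity reduction) for the most predictive split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best = (None, 0.0)
    for i in range(1, n):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        # Weighted impurity of the two child bins
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - child
        if gain > best[1]:
            cut = (pairs[i - 1][0] + pairs[i][0]) / 2  # midpoint between values
            best = (cut, gain)
    return best

# Toy example: does age predict purchase? (made-up data)
age = [22, 25, 31, 38, 44, 52, 58, 63]
bought = [0, 0, 0, 1, 1, 1, 1, 0]
cut, gain = best_cut(age, bought)   # the most predictive two-way binning of age
```

The chosen cut point defines the two most predictive bins of the variable, which can then be fed into a downstream model as a prepared feature.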
Rapid Model Build, Testing and Deployment
The increasing availability of data and businesses' growing reliance on predictive modelling, alongside the current shortage of trained Data Scientists, mean that productivity in model development is key. Rapid model build and deployment is particularly important when responding to constantly changing environments such as fraud detection.
Angoss Decision Trees allow users to find splits using the selected algorithm (CHAID, Entropy Variance, Gini Variance) or to force splits that encode business rules. Splits can then be edited by changing the selected variable, changing the binning structure or by copying a split from one node to another. This flexibility lets users quickly assess top predictors, explore relationships and build out business rules. Trees can also be grown automatically using the selected algorithm and then pruned back to simplify them or prevent overfitting. Overfitting can also be limited up front by setting minimum parent and child node sizes, the maximum number of branches at a split and the maximum tree depth.
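Angoss's implementation is proprietary, but the pre-pruning controls described above (minimum node sizes, maximum depth) can be sketched in plain Python using Gini impurity; the toy data and names here are illustrative only.

```python
def gini(labels):
    """Gini impurity of binary (0/1) labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def best_split(rows):
    """Best (feature index, cut point) by weighted Gini impurity.
    rows is a list of (feature tuple, label) pairs."""
    n = len(rows)
    best = (None, None, gini([y for _, y in rows]))
    for f in range(len(rows[0][0])):
        for cut in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= cut]
            right = [y for x, y in rows if x[f] > cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if right and score < best[2]:
                best = (f, cut, score)
    return best

def grow(rows, depth=0, max_depth=3, min_node=2):
    labels = [y for _, y in rows]
    f, cut, _ = best_split(rows)
    # Pre-pruning: stop when the node is pure, the parent is too small
    # to yield two valid children, or the tree is already at max depth.
    if f is None or depth >= max_depth or len(rows) < 2 * min_node:
        return round(sum(labels) / len(labels))  # leaf: majority class
    left = [r for r in rows if r[0][f] <= cut]
    right = [r for r in rows if r[0][f] > cut]
    if len(left) < min_node or len(right) < min_node:
        return round(sum(labels) / len(labels))  # child node size limit
    return {"feature": f, "cut": cut,
            "left": grow(left, depth + 1, max_depth, min_node),
            "right": grow(right, depth + 1, max_depth, min_node)}

# Toy data: label follows the first feature exactly
rows = [((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1)]
tree = grow(rows)
```

Loosening or tightening `max_depth` and `min_node` trades model detail against the risk of overfitting, which is the trade-off the product controls expose.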
Once the tree is built, it can be validated using the Model Analyzer to assess performance against training and hold-out samples and to compare multiple models. The final model can then be automatically converted into a variety of scoring-code formats, including SAS, SQL, PMML, XML and Java, to speed up deployment.
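The general idea behind scoring-code export can be sketched as follows: walk a fitted tree and emit an equivalent SQL CASE expression, one WHEN clause per leaf. The tree layout, column names and leaf scores below are made up for illustration; this is not Angoss's actual code generator.

```python
# A hypothetical fitted tree: internal nodes split on a column at a cut
# point, leaves hold a score (e.g. a predicted response rate).
tree = {"feature": "age", "cut": 34.5,
        "left": 0.12,
        "right": {"feature": "income", "cut": 50000,
                  "left": 0.31, "right": 0.57}}

def to_sql(node, conditions=()):
    """Collect one WHEN clause per leaf by accumulating path conditions."""
    if not isinstance(node, dict):               # leaf reached
        cond = " AND ".join(conditions) or "1 = 1"
        return [f"WHEN {cond} THEN {node}"]
    left = conditions + (f"{node['feature']} <= {node['cut']}",)
    right = conditions + (f"{node['feature']} > {node['cut']}",)
    return to_sql(node["left"], left) + to_sql(node["right"], right)

sql = "CASE " + " ".join(to_sql(tree)) + " END AS score"
```

The resulting expression can be dropped into a database query, which is why SQL export makes deployment fast: scoring happens where the data already lives, with no modelling software in the loop.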
This ability to quickly build, test and deploy accurate and robust Decision Tree models leads to significant productivity gains across the project life-cycle.