Nowadays, in the data mining world, having too much data has become a more prevalent problem than not having enough. Building a predictive model on all available variables can be a time consuming task, and the resulting model is often less robust and harder to interpret.
So, what can we do to reduce the variables to a manageable number without compromising the model’s predictive power? After profiling the data and learning some characteristics about each variable, here are a number of things I would do, listed from simple to complex:
- Remove the variables that have zero or very low variability.
- Remove the variables that have too many missing values. How many is too many? Well, it depends. If there is no special reason for the missing values, a rule of thumb is 50%. When a variable has over half of its values missing, it becomes hard to impute the missing values meaningfully or to make good use of the information it does carry.
- Remove variables that are highly correlated with each other, a situation also known as “multicollinearity”. Simply put, if two variables are highly correlated, they contain more or less the same information. Year of Birth and Age are a good example of such a pair. We can identify highly correlated pairs of continuous variables using a Pearson correlation matrix and select one variable from each pair to remove.
- Remove variables that are irrelevant to the dependent variable (target). We can simply look at some statistics from profiling, such as Information Value or Entropy Variance. Or, we can build a Decision Tree and keep only the variables that are used in the splits. Decision Trees are great for screening variables in this way, as they require little data cleaning or preparation beforehand.
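Taken together, the first three filter steps can be sketched in a few lines of pandas. The toy data frame, column names, and the 0.9 correlation threshold below are illustrative assumptions, not prescriptions:

```python
import numpy as np
import pandas as pd

# Toy data standing in for a real modeling dataset (names are made up):
df = pd.DataFrame({
    "constant": [1, 1, 1, 1, 1, 1],                      # zero variability
    "mostly_missing": [1.0, None, None, None, None, None],
    "age": [25, 32, 47, 51, 62, 38],
    "year_of_birth": [1999, 1992, 1977, 1973, 1962, 1986],  # 2024 - age
    "income": [40, 90, 55, 70, 60, 80],
})

# Step 1: drop variables with zero (or near-zero) variability.
low_var = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]

# Step 2: drop variables with more than 50% missing values (the rule of thumb).
too_missing = [c for c in df.columns if df[c].isna().mean() > 0.5]

df = df.drop(columns=list(set(low_var + too_missing)))

# Step 3: build the Pearson correlation matrix, look at its upper triangle,
# and drop one variable from each highly correlated pair (0.9 is a common cutoff).
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
collinear = [c for c in upper.columns if (upper[c] > 0.9).any()]
df = df.drop(columns=collinear)

print(sorted(df.columns))  # → ['age', 'income']
```

Here `year_of_birth` is removed rather than `age` only because of column order; in practice you would keep whichever variable of the pair is easier to interpret or more complete.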
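For the Decision Tree screening idea, here is a minimal sketch using scikit-learn's DecisionTreeClassifier as a stand-in (the article does not prescribe a particular implementation); the synthetic data and the depth limit are assumptions for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: only feature 0 actually drives the target;
# features 1 and 2 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Keep only the variables the tree actually used in a split
# (leaf nodes are marked with a negative feature index).
used = sorted({f for f in tree.tree_.feature if f >= 0})
print(used)
```

Here the tree splits only on feature 0, so the noise variables would be dropped. A depth limit keeps the screening coarse on purpose: we want the variables that matter most, not a fully grown tree.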
Of course, here I have only covered some of the more common methods used for variable reduction. We can apply one, some, or all of them over a few iterations, depending on the time and effort allowed. There are also more advanced techniques for variable reduction, like variable clustering or principal component analysis. And don’t forget to take advantage of the domain knowledge you have or can get.
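As a taste of those more advanced techniques, here is a minimal principal component analysis sketch with scikit-learn; the synthetic data and the 95% explained-variance cutoff are assumptions for illustration. Note that PCA replaces the original variables with derived components rather than selecting among them, which reduces dimensionality at some cost to interpretability:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: five columns that really carry only two
# independent sources of variation (rank-2 structure).
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Keep the smallest number of components explaining 95% of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)  # at most 2 components survive
```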
Angoss’s KnowledgeSEEKER and KnowledgeSTUDIO software offer excellent data profiling capabilities, and their patented Decision Tree is considered best in class.