Nowadays, in the data mining world, having too much data has become a more common problem than not having enough. Building a predictive model on all available variables is time consuming: the model takes longer to compute, and it tends to be less robust and harder to interpret.
So, what can we do to reduce the variables to a manageable number without compromising the model’s predictive power? Once I have profiled the data and learned some characteristics of each variable, here are a number of things I would do, listed from simple to complex:
Remove the variables that have 0 or very low variability.
Remove the variables that have too many ...
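As an illustration of the first step (removing variables with zero or very low variability), here is a minimal sketch in Python with pandas. The function name `drop_low_variability`, the `max_top_share` parameter and its 0.95 default are my own assumptions for the example, not a standard rule; the right cutoff depends on your data.

```python
import pandas as pd


def drop_low_variability(df: pd.DataFrame, max_top_share: float = 0.95) -> pd.DataFrame:
    """Drop columns that are constant or dominated by a single value.

    max_top_share is an illustrative threshold (an assumption): a column is
    dropped when its most frequent value covers at least this share of rows.
    """
    to_drop = []
    for col in df.columns:
        # value_counts sorts by frequency, most frequent value first;
        # dropna=False counts missing values as a value of their own
        counts = df[col].value_counts(dropna=False)
        if len(counts) < 2:
            # Zero variability: the column holds a single value (or is empty)
            to_drop.append(col)
        elif counts.iloc[0] / len(df) >= max_top_share:
            # Very low variability: one value covers almost every row
            to_drop.append(col)
    return df.drop(columns=to_drop)


# Small made-up data set to show the effect
df = pd.DataFrame({
    "constant": [1] * 100,
    "almost_constant": [0] * 99 + [1],
    "useful": list(range(100)),
})
reduced = drop_low_variability(df)
print(list(reduced.columns))
```

Looking at the share of the most frequent value, rather than the raw variance, lets the same check work for both numeric and categorical variables, and it does not depend on the scale of the variable.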