Dear Experts,
I’m trying to understand how correlated predictors impact the Relative
Importance measure in Stochastic Boosting Trees (J. Friedman). As Friedman
described, “…with single decision trees (referring to Breiman’s CART
algorithm), the relative importance measure is augmented by a strategy
involving surrogate splits intended to uncover the masking of influential
variables by others highly associated with them. This strategy is most
helpful with single decision trees where the opportunity for variables to
participate in splitting is limited by the size of the tree. In the context
of Boosting, however, the number of splitting opportunities is vastly
increased, and surrogate unmasking is less essential”.
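For reference, the measure in question, as I read Friedman's paper, can be written as below. This is only a summary of my understanding; the notation follows the paper as best I recall, and the comments are my own paraphrase.

```latex
% Relative influence of predictor x_j in a single tree T with J terminal nodes.
% The sum runs over the J-1 internal nodes t; v_t is the variable used to split
% node t, and \hat{i}_t^2 is the empirical improvement in squared error
% obtained from that split.
\hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbf{1}(v_t = j)

% For a boosted sequence of M trees, the measure is averaged over the trees:
\hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m)
```

Under this definition, each split credits only the variable actually chosen at that node, so two highly associated predictors compete for the same improvement at every split.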
Based on the results from the simulated example below (in R), if I have, say,
two highly correlated variables, then the relative importance measure derived
from Boosting tends to be high for one of the predictors and low for the
other. I’m trying to reconcile this observation with Friedman’s description
above, from which I would expect these two variables to have about the same
measure of importance. I'd appreciate your comments. Thanks in advance!
require(gbm)
require(MASS)
# Generate multivariate random data such that X1 is moderately correlated with
# X2, strongly correlated with X3, and not correlated with X4 or X5.
cov.m <- matrix(c(1,   0.5, 0.9, 0, 0,
                  0.5, 1,   0.2, 0, 0,
                  0.9, 0.2, 1,   0, 0,
                  0,   0,   0,   1, 0,
                  0,   0,   0,   0, 1), 5, 5, byrow=TRUE)
n <- 2000 # obs
X <- mvrnorm(n, rep(0, 5), cov.m)
Y <- apply(X, 1, sum)
SNR <- 10 # signal-to-noise ratio
sigma <- sqrt(var(Y)/SNR)
Y <- Y + rnorm(n,0,sigma)
mydata <- data.frame(X,Y)
#Fit Model (should take less than 20 seconds on an average modern computer)
gbm1 <- gbm(formula = Y ~ X1 + X2 + X3 + X4 + X5,
data=mydata,
distribution = "gaussian",
n.trees = 500,
interaction.depth = 2,
n.minobsinnode = 10,
shrinkage = 0.1,
bag.fraction = 0.5,
train.fraction = 1,
cv.folds=5,
keep.data = TRUE,
verbose = TRUE)
## Plot variable influence
best.iter <- gbm.perf(gbm1, plot.it = TRUE, method="cv")
print(best.iter)
# Relative influence based on the estimated best number of trees
summary(gbm1, n.trees=best.iter)
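
To compare the numbers directly rather than reading them off the plot, something like the following sketch could be used. It is a condensed, self-contained refit of the simulation above, not a replacement for it. The fixed seed, the names `fit`, `imp` and `pair.share`, and the use of all 500 trees instead of the cross-validated `best.iter` are my assumptions for illustration.

```r
library(gbm)
library(MASS)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Same covariance structure as above: X1~X2 = 0.5, X1~X3 = 0.9, X4/X5 independent
cov.m <- matrix(c(1,   0.5, 0.9, 0, 0,
                  0.5, 1,   0.2, 0, 0,
                  0.9, 0.2, 1,   0, 0,
                  0,   0,   0,   1, 0,
                  0,   0,   0,   0, 1), 5, 5, byrow = TRUE)
n <- 2000
X <- mvrnorm(n, rep(0, 5), cov.m)
Y <- rowSums(X)
Y <- Y + rnorm(n, 0, sqrt(var(Y) / 10))  # signal-to-noise ratio of 10
mydata <- data.frame(X, Y)

# Assumption: no cross-validation here, to keep the sketch short
fit <- gbm(Y ~ X1 + X2 + X3 + X4 + X5, data = mydata,
           distribution = "gaussian", n.trees = 500,
           interaction.depth = 2, shrinkage = 0.1,
           bag.fraction = 0.5, verbose = FALSE)

# Relative influence as a data frame (columns var and rel.inf), without the plot
imp <- summary(fit, n.trees = 500, plotit = FALSE)
print(imp)

# Sample correlation of the strongly correlated pair
print(round(cor(X)[1, 3], 2))

# Combined share of relative influence attributed to X1 and X3 together
pair.share <- sum(imp$rel.inf[imp$var %in% c("X1", "X3")])
print(pair.share)
```

Rerunning this with different seeds would show whether the same member of the X1/X3 pair always receives the larger share, or whether it varies from fit to fit.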