Email discussion lists for the UK Education and Research communities

## allstat@JISCMAIL.AC.UK


Subject: Boosting Trees

From:
Date: Sun, 28 Feb 2010 19:04:19 -0500

Dear Experts,

I'm trying to understand how correlated predictors affect the relative importance measure in Stochastic Gradient Boosting trees (J. Friedman). As Friedman describes it: "... with single decision trees (referring to Breiman's CART algorithm), the relative importance measure is augmented by a strategy involving surrogate splits intended to uncover the masking of influential variables by others highly associated with them. This strategy is most helpful with single decision trees, where the opportunity for variables to participate in splitting is limited by the size of the tree. In the context of boosting, however, the number of splitting opportunities is vastly increased, and surrogate unmasking is less essential."

Based on the results from the simulated example below (in R), if I have, say, two variables that are highly correlated, the relative importance measure derived from boosting tends to be high for one of the predictors and low for the other. I'm trying to reconcile this observation with Friedman's description above, according to which, as I understand it, these two variables should have about the same measure of importance.

I'd appreciate your comments. Thanks in advance!

```r
require(gbm)
require(MASS)

# Generate multivariate normal data such that X1 is moderately correlated
# with X2, strongly correlated with X3, and uncorrelated with X4 and X5.
cov.m <- matrix(c(1,   0.5, 0.9, 0, 0,
                  0.5, 1,   0.2, 0, 0,
                  0.9, 0.2, 1,   0, 0,
                  0,   0,   0,   1, 0,
                  0,   0,   0,   0, 1), 5, 5, byrow = TRUE)
n <- 2000  # observations
X <- mvrnorm(n, rep(0, 5), cov.m)
Y <- apply(X, 1, sum)
SNR <- 10  # signal-to-noise ratio
sigma <- sqrt(var(Y) / SNR)
Y <- Y + rnorm(n, 0, sigma)
mydata <- data.frame(X, Y)

# Fit the model (should take less than 20 seconds on an average modern computer)
gbm1 <- gbm(formula = Y ~ X1 + X2 + X3 + X4 + X5,
            data = mydata,
            distribution = "gaussian",
            n.trees = 500,
            interaction.depth = 2,
            n.minobsinnode = 10,
            shrinkage = 0.1,
            bag.fraction = 0.5,
            train.fraction = 1,
            cv.folds = 5,
            keep.data = TRUE,
            verbose = TRUE)

# Plot variable influence at the estimated best number of trees
best.iter <- gbm.perf(gbm1, plot.it = TRUE, method = "cv")
print(best.iter)
summary(gbm1, n.trees = best.iter)
```
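To see the behaviour described above more directly, one could refit the same model under a few different random seeds and compare how the relative influence is divided between the highly correlated pair X1/X3. The sketch below is not part of the original message; it assumes the `mydata` frame built above and uses `gbm::relative.influence`, the function underlying `summary.gbm`.

```r
require(gbm)

# Refit the boosted model under several seeds and collect each predictor's
# percentage relative influence per run (assumes `mydata` from above).
imp <- sapply(1:5, function(s) {
  set.seed(s)
  fit <- gbm(Y ~ X1 + X2 + X3 + X4 + X5, data = mydata,
             distribution = "gaussian", n.trees = 500,
             interaction.depth = 2, shrinkage = 0.1,
             bag.fraction = 0.5, verbose = FALSE)
  ri <- relative.influence(fit, n.trees = 500)
  100 * ri / sum(ri)
})
round(imp, 1)  # rows = predictors, columns = seeds
```

If the observation above holds, the X1 and X3 rows should trade credit from seed to seed while their sum stays comparatively stable, i.e. boosting shares importance between highly correlated predictors rather than assigning each the same score.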
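Friedman's remark contrasts boosting with single CART trees, where surrogate splits give masked variables credit. As a point of comparison (again not from the original post, and only a sketch), `rpart`'s `variable.importance` sums the goodness of a variable's primary and surrogate splits, so on the same data X1 and X3 would be expected to both score highly in a single tree:

```r
require(rpart)

# Fit one CART tree; rpart credits a variable for both its primary splits
# and its surrogate splits, which is the "surrogate unmasking" Friedman
# refers to (assumes `mydata` from above).
tree <- rpart(Y ~ X1 + X2 + X3 + X4 + X5, data = mydata)
round(100 * tree$variable.importance / sum(tree$variable.importance), 1)
```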