Hello everyone,
I was wondering if anyone has ever used the "pec" procedure in R? This procedure deals with prediction error curves in a survival analysis setting. Prediction error curves are documented in, for example, "Efron-type measures of prediction error for survival analysis" by Gerds and Schumacher, Biometrics 63, 1283-1287 (2007).
If anyone has experience in this area, I'd appreciate your views on (any of) the following queries. Some of the queries assume familiarity with the R documentation.
I have detailed some syntax that I have used at the bottom of this email. The associated data is available upon request.
In the 'pec' function, used in the syntax, I have
formula=Surv(OVS,dead)~1
(OVS represents 'time' and dead is the 'status' variable, where 0=alive and 1=dead)
...apparently "for right censored data, the right-hand side of the formula is used to specify conditional censoring models". When there are no covariates, as in the above, I understand that we are assuming that censoring occurs completely at random (for all models).
Say we have a data set containing potential predictor variables X1,X2....Xp. Say we had the "survival time" variable (time) measured in months and our status variable (status) had 0=alive and 1=dead.
Firstly, I would like to ask how to derive conditional censoring models. From past conversations it seems that, to establish the form of our censoring model(s), we would reverse the coding and use alive=1 and dead=0. Is this correct?
Now, say we are assuming that the model for censoring is of the Cox regression type: do we assess the censoring model using the usual variable selection procedures, where the candidate variables are X1,X2....Xp and we code *alive=1 and dead=0*?
Say we found that X1 and X5 should be included in our Cox regression model for censoring; then we would enter:
formula=Surv(time,status)~X1+X5 and cens.model="cox" in the pec function
(where, in the above, status: 0=alive and 1=dead).
Am I thinking about this correctly? I suspect the derivation of conditional censoring models may be a lot more difficult than this, so I'd appreciate some guidance.
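To make my question concrete, here is a small sketch of what I mean by the reversed-status censoring fit. The data are simulated stand-ins of my own invention (the real data set isn't attached), and X1 and X5 are just the hypothetical selected covariates from above:

```r
library(survival)

# Simulated stand-in data (my own -- not the real data set):
# time in months, status with 0 = alive, 1 = dead, two covariates.
set.seed(1)
n <- 200
dat <- data.frame(X1 = rnorm(n), X5 = rbinom(n, 1, 0.5))
event.time <- rexp(n, rate = exp(0.5 * dat$X1))
cens.time  <- rexp(n, rate = exp(0.3 * dat$X5))  # censoring depends on X5
dat$time   <- pmin(event.time, cens.time)
dat$status <- as.numeric(event.time <= cens.time)

# Censoring model: reverse the status so that 1 = censored (alive),
# i.e. censoring is treated as the 'event' for this fit.
cens.fit <- coxph(Surv(time, 1 - status) ~ X1 + X5, data = dat)
summary(cens.fit)
```

If X1 and X5 survived the usual selection procedures on a reversed-status fit like this, my reading of the documentation is that pec would then be called with formula=Surv(time,status)~X1+X5 and cens.model="cox", keeping the original 0=alive/1=dead coding in the formula, with the reversal handled internally when pec fits the censoring model. I'd welcome correction if that reading is wrong.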
"Replan" is the method for estimating prediction error curves. I understand that replan=none would be used when we don't cross validate i.e. we would create a model using a specific bootstrap sample and then evaluate its performance on the same bootstrap sample. This would be repeated B times.
If we are looking at the "bootstrap cross-validation" method...say we have 500 patients.....the Gerds and Schumacher paper says that bootstrap samples Q*_1....Q*_B, each of size n, are drawn with replacement from the original data (we have chosen B=100 bootstrap samples in our work).
So here I assume that n is equal to the total number of patients i.e. 500 in this example.
When we decide to sample with replacement, as here,...it is obvious that, for each of the bootstrap samples, there are 500 patients but some of these patients could occur in this bootstrap sample more than once.
The documentation says "M is the size of the bootstrap samples for sampling *without* replacement".
Hence I assume that for sampling *with* replacement, M equals n, i.e. M=n by default. However, if we choose M<n then each of our bootstrap samples is drawn *without* replacement. Since the models are assessed using observations that are not in the bootstrap sample, the validation set for sampling without replacement must comprise n-M patients.
[Each of the bootstrap samples acts as a 'training data set' to generate a model....the model is then validated using the patients which weren't in the bootstrap sample. ]
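To check my own understanding of M and the validation sets, here is a small base-R sketch (my own, nothing to do with pec's internals), using n=500 patients as in the example above:

```r
# Base-R sketch of the two sampling schemes (not pec code).
set.seed(1)
n   <- 500
ids <- seq_len(n)

# Bootstrap cross-validation: draw n ids WITH replacement; the
# out-of-bag patients form the validation set (~0.368*n on average).
boot <- sample(ids, size = n, replace = TRUE)
oob  <- setdiff(ids, boot)        # validation set, size varies run to run

# Subsampling (M < n): draw M ids WITHOUT replacement; the validation
# set then contains exactly n - M patients.
M   <- 315
sub <- sample(ids, size = M, replace = FALSE)
val <- setdiff(ids, sub)          # exactly n - M = 185 patients
```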
In a study I did using a 286-case data set (sampling with replacement), I noticed that my prediction error curves seemed to terminate at around 35 months, although the actual last survival time was 248 months. I checked the survival times and the corresponding alive/dead statuses for this data set and noticed that deaths become very sparse after month 35...but I'm still a bit puzzled as to why the curves end at around 35 months. Perhaps the small number of cases in the original data set leads to small validation data sets and hence to termination of the curves at low times? Has anybody else encountered this?
In our study, we wanted to develop a model on a specific sample and then test it on the observations not in that sample, i.e. the validation set, repeated B times. We chose the 0.632+ bootstrap estimator. There are many types of estimators for prediction error curves; how should we decide which is 'best'?
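For what it's worth, my reading of the 0.632+ combination (from Efron and Tibshirani's definition, which I believe pec applies pointwise in time internally) is the sketch below; the function name and the clipping of the overfitting rate to [0,1] are my own, not pec's code:

```r
# Sketch of the 0.632+ weighting, given an apparent error (app),
# a bootstrap cross-validation error (boot) and a no-information
# error (gamma). My reading of the definition, not pec internals.
err632plus <- function(app, boot, gamma) {
  boot <- min(boot, gamma)                 # cap at the no-information error
  R <- if (gamma > app && boot > app)      # relative overfitting rate in [0,1]
         (boot - app) / (gamma - app) else 0
  w <- 0.632 / (1 - 0.368 * R)             # weight grows from 0.632 to 1
  (1 - w) * app + w * boot
}

err632plus(0.2, 0.2, 0.5)   # no overfitting: returns the apparent error, 0.2
err632plus(0.2, 0.5, 0.5)   # maximal overfitting: returns the bootstrap error, 0.5
```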
Thanks for any advice on these questions.

Kindest regards,
Kim
Dr Kim Pearce CStat
Industrial Statistics Research Unit (ISRU)
School of Mathematics and Statistics
Herschel Building
University of Newcastle
Newcastle upon Tyne
United Kingdom
NE1 7RU
syntax is as follows:

library(Hmisc)
library(survival)
library(pec)

AL <- read.table("PredErr2.txt", header = TRUE)

Models <- list(
  "Kaplan.Meier"     = survfit(Surv(OVS, dead) ~ 1, data = AL),
  "EBMT+SNPgroup"    = coxph(Surv(OVS, dead) ~ GroupMod, method = "efron", data = AL),
  "EBMTgroup"        = coxph(Surv(OVS, dead) ~ GroupGrat, method = "efron", data = AL),
  "EBMToriginal"     = coxph(Surv(OVS, dead) ~ Gratwohl, method = "efron", data = AL),
  "EBMT+SNPoriginal" = coxph(Surv(OVS, dead) ~ Gratwohl + PIL1rAnyC + PIL4AnyT,
                             method = "efron", data = AL)
)

PredError.632plus <- pec(object = Models, formula = Surv(OVS, dead) ~ 1, data = AL,
                         exact = TRUE, cens.model = "marginal",
                         replan = "boot632plus", B = 100, verbose = TRUE)