
## allstat@JISCMAIL.AC.UK


Subject: How to Predict Cancelling?

Date: Thu, 27 May 1999 11:11:10 -0300

Dear list-members:

Thanks a lot to all those who replied to my question. Today I'm going to begin trying to solve it through Survival Analysis. Does anybody disagree, considering all the points stated?

Below are copies of all the suggestions I got, together with my original question:

- I asked:

Dear list members:

I have a question and I hope you can help me. I have a database with 22649 cases of clients who cancelled their accounts (I also have another database with the clients who are still active, that is, who have not cancelled yet). It is modeled like this:

IDClient  Month(n)  Month(n-1)  Month(n-2)  ...  Month(n-11)  Cancel Date

  - IDClient is an identifier of the client (integer);
  - Month(n) is how much the client spent in dollars in month n;
  - Month(n-1) is how much the client spent in dollars in month n - 1;
  - Month(n-2) is how much the client spent in dollars in month n - 2;
  - ...
  - Cancel Date is the date (mm/dd/yy) the client cancelled his account.

I need to design a model to predict, some months beforehand (maybe one or two), that a given client will cancel his account, so I can get in touch with him and try to change his mind. What technique should I use? Survival Analysis? Any suggestion using SPSS 8.0 Professional? Can anybody help me?

- Jay Warner wrote:

I haven't done something of this _exact_ sort, but I think I can help you help yourself.

1) Recognize that the whole world is a system. In this case the output is the probability of canceling in a given month, or the percent of a group that cancels in a month.

2) The inputs should be things you can monitor, or even better, adjust in some way to reduce p(cancel). They may or may not be things you currently have in your database.

3) Now we have to find a model equation, expressing the relationship of inputs to output. I went to Agresti (1996) to find a general linear model. He also suggests that when the response must be between 0 and 1, a logit link is advised.
    g(mu) = log [mu/(1-mu)]

I believe he says that a binomial distribution will transform by the logit into a normal distribution. Then he says we can express all this in the equation:

    g(mu) = alpha + beta(1)*x(1) + ... + beta(n)*x(n)

where the right side is a standard general linear model.

4) Next we ask what the betas might be - we want big ones (in absolute value). How do we get a mu different from the one we have?

5) Use step 5! I can group similar customers (similar by monthly sales, for example), and for each group calculate the % who cancel in a month. I might want to look at groups by percent of purchase decline over the last 6-month period, that is to say, the slope of sales by customer over time. Your suggestion of survival analysis suggests you think there is a declining trend prior to canceling.

6) Finally, I can do what I know how to do easily. I set up a designed experiment, working on the variables and variable levels used for grouping above, and run the analysis.

7) Take the results, and predict the % canceling for next month, by group. See if it is (mostly) true. If so, go for it! You can now identify early the customers likely to drop.

I have used this mathematical approach in manufacturing cases, where the product test/inspection reported % fail. In a nutshell, it worked. We cut the failure rate from 40% to under 15% in eight months. Worth $500K/percentage point, I was told.

I have two concerns in this scenario, applied to your issue. First, I am afraid that the data you have on each customer does not include the key things that make them cancel. This can be tested, but only you know the details of the product/service provided, customer alternatives and customer perceptions. If you follow step 3 of the A2Q Method (tm), you may well come up with a lot of good possible reasons for canceling. Many of them you will not have in your database. When this happens, many people focus only on the database information, because it is there. Well, yes.
But consider expanding your database, too.

Secondly, I don't feel comfortable with the way I got 0 or 1 (retain/cancel) into a percentage. I don't have Agresti in my hands at the moment, but I think he discusses other ways of getting it, too. One of them might work better.

Now, does all this help you understand how to resolve your problem? Do you feel comfortable with the next step, and the next?

- Gene Maguin wrote:

My opinion is that you should put the two datasets together and use survival analysis. I think the basic question you are asking is what variables predict how long a person will maintain an open account.

- I wrote to Gene Maguin:

I have already put the two datasets together to use survival analysis. Yes, I need to know what variables predict how long a person will maintain an open account; however, the most important one is the amount the client spends each month before he closes his account. I have this information for the last 12 months.

- Gene Maguin wrote:

Let me make sure I have a clear idea of what you want to know. Is this your analytical question?

Q: Does a person's spending increase in the months prior to the month in which they close their account?

Note that this question does not require data on persons who do not close their account.

- I wrote to Gene Maguin:

I'm not sure, but I think the spending tends to diminish in the months prior to their cancelling. So, based on a given client's spending, I would like to predict (with some level of confidence) whether or not he will cancel his account in the following one, two or three months. I have some other information about the persons as well, such as: age of the account, sex, age, profession, marital status, earnings, etc. I need to know when the person is going to cancel. So I think I need the information on those who did not cancel as well, don't I?

- Gene Maguin wrote:

Let's do some data exploration first.

1.
Select persons who canceled in a particular month and do descriptives and/or plots of the means for the months previous to their cancellation.

2. Repeat for several different months. This will give you an idea of what actually does happen in the months preceding cancellation.

3. To test your conclusions, run a one-within-factor MANOVA, either repeated measures or true MANOVA, for persons who cancelled in a particular month to see if the means for the months preceding cancellation are actually not significantly different.

4. Repeat this for several different months to see if the conclusions generalize over different cancellation months.

- Armin Roehrl wrote:

I read your posting to the allstat mailing list. If you want, I can try some neural networks, decision trees, etc. on that problem. Do you have further information than just how much the customers spent? Do you also know what they bought? How much would you be willing to spend?

Best regards,
Armin, Statoo.com team (www.statoo.com)

- I wrote to Armin:

No, I don't know what they bought. I have the age of the account, sex, age, profession, marital status, earnings, etc. Since it is the final project of my graduation course (Business Administration), I need some help to do the task by myself and am not willing to spend anything. If you have some tips, please let me know.

- Armin Roehrl wrote:

Hey, it's me again. If we write a joint paper on that, maybe we can arrange some good deal -- i.e. I run it through my neural networks, etc. and you do the corresponding business talk, etc.

- I wrote to Armin:

Yes, we can write a joint paper, but what do you mean by "do the corresponding business talk"? Since it is data that belongs to a company, I do not have permission to forward the data to anyone else. I can only use it for the graduation final project. Can I use your neural networks here?

- Armin Roehrl wrote:

Is your main analysis on the statistics side, or on the economics/business side?
Suppose I would help with the statistics, i.e. try out some of my algorithms/code, etc. Can't you ask the company for permission? If all you want is some neural network code to play with yourself, I recommend you download the SNNS code and then play with it. Ask again if you want more details. If I can't look at the data, I'm not really willing to play with that project.

- I wrote to Armin:

My purpose is on the statistics side. I asked for their permission, but they didn't allow it. I'll try the SNNS code. Thanks for your help.

- Armin Roehrl wrote:

Good luck,
Armin

Start with the standard SNNS and then later combine CARTS with SNNS (any architecture; CARTS will help you find the right topology).

- John Shade wrote:

Before looking to any special methods, I would suggest trying some very simple analyses. For example, standardise each row (the cases) by dividing it by the median value for that row. Split the data into 'cancel' and 'not cancel' and plot a few randomly selected samples from each to look for possible patterns. For example, perhaps appreciably more 'cancels' are preceded by a steep downward trend than 'not cancels'. It may also be that some samples look to have more autocorrelation than others, or are more erratic, and so on. Perhaps some simple model might look appropriate for many of the cases. At the very least, this work will increase your familiarity with the data, highlight suspect values, and build up a realistic expectation of how predictable the cancellation is, based on just these past account values and statistics derived from them. You could then move on to try more formal methods.

We sell a software product called CART which is very useful for going through large data sets looking for variables or combinations of variables with good predictive power for some final classification (such as your cancel/not cancel).
You would want to add some variables suggested by the initial examination suggested above, and if you also had variables such as age, sex, occupation code, postcode (it can handle postcodes or zipcodes as categorical variables) and so on, then so much the better - CART can work with them all, and cope admirably (and automatically if allowed to) with missing values and outliers. CART can provide a final model for predictive use, or you can use CART to identify variables to be used in other modelling methods, e.g. logistic regression.

- Jay Warner wrote:

Another thought for you. Why not start with your pile of data, and separate it into 2 groups - those who canceled and those who didn't. Add an item for percent change in sales for each, then do a 'slice and dice' scene - do scatter plots, AoVs and the like, examining the two groups for obvious differences. I would do this first. Then, if anything comes up, I would go back and look very carefully at each item, perhaps in a multiple regression type analysis. In the second examination, check for proper orthogonality of the data, as well.

- John Aitchison wrote:

Interesting problem. Presumably the 'pattern' that most predicts cancellation is that of a declining level of expenditure over several months, or of a constant low level of expenditure and coming up to renewal time (if there are annual fees). So, I would construct synthetic features "declining expenditure" and "consistent low level expenditure" and use those in some standard classification algorithm (e.g. CART, discriminant analysis). The "declining expenditure" feature I would construct by fitting a regression model to each individual's data, and using the estimated slope for the individual as the relevant feature. (You could also consider using the intercept and the variance.)

- Nikolai Kolev wrote:

I think that the following model is appropriate for modelling your data:

===================================================================

LATEX - version:

Let $Y_i$ be i.i.d.
r.v.'s having Bernoulli distribution with a parameter $p$. Then consider the following process
$$
X_n = \sum_{i=1}^{X_{n-1}} Y_i + Z_n, \quad n = 1,2,\ldots
$$
where $Z_i$ is another sequence of i.i.d. r.v.'s having some discrete distribution (for example Poisson) and being independent of the sequence $Y_i$.

The above model is known in the literature as "binomial thinning" and has a simple interpretation:

  - the random sum $\sum_{i=1}^{X_{n-1}} Y_i$ represents the number of survivors in the period $(n-1,n]$;

  - the sequence $Z_n$ gives the number of the new elements entering during the same period $(n-1,n]$.

===================================================================

Reference: Al-Osh, M.A. and Alzaid, A.A. (1987). First-order integer valued autoregressive (INAR(1)) process, Journal of Time Series Analysis, Vol. 8, No. 3, 261-275.

- Nikolai Kolev wrote:

> but I didn't understand your notes pretty much. Can you detail it a
> little bit more.

For this, a little more knowledge of probability and of the interpretation of simple discrete distributions is necessary. See below.

> I need to know if a given person will survive or not,

Very good. But I don't believe that there exists an expert who can answer that question, predicting even 50% of the cases. The experts in banks basically follow their intuition, and some crazy deterministic rules (created 30-40 years ago) which are satisfactory for the corresponding bank.

> and not just the number of people who survived. Ok ?

If you know (approximately, using some discrete probabilistic model) the number of clients surviving after the n-th month, you could get an idea (from your database) of which subset of clients to talk to.

> > I think that the following model as appropriate for modelling your data:
>

I am reconfirming my opinion.

> ===================================================================
>
> LATEX - version:
>
> Let $Y_i$ be i.i.d. r.v.'s having Bernoulli distribution with a
> parameter $p$.
> Then consider the following process
> $$
> X_n = \sum_{i=1}^{X_{n-1}} Y_i + Z_n, \quad n = 1,2,\ldots
> $$
> where $Z_i$ is another sequence of i.i.d. r.v.'s having some discrete
> distribution (for example Poisson) and being independent of the sequence
> $Y_i$.
>

This is a discrete time series model. From your data, you could estimate the parameters of the underlying Bernoulli and Poisson distributions.

> The above model is known in the literature as a "binomial thinning" and
> has a simple interpretation:
>
> - the random sum $\sum_{i=1}^{X_{n-1}} Y_i$ represents the
>   number of survivors in the period $(n-1,n]$;
>
> - the sequence $Z_n$ gives the number of the new elements entering
>   during the same period $(n-1,n]$.
>

If you read the paper cited, and if you have some knowledge of discrete distributions, you will be able to understand and get a feel for the suggested model.

- Dave Reilly wrote:

What you have is a time series problem where you wish to use historical data in order to assess an unusual value for purposes of early warning or intervention detection. One of our clients is using/investigating this in a banking application where the bank wishes to be pro-active and sense a change in status.

- James Watts wrote:

Cox regression, in the advanced models group. You will need a time-to-the-event variable, and an indicator variable for censored cases. I used it to forecast differences in the likelihood of closing banking accounts. Also look at the Mantel-Haenszel approach, which I think is in Bishop and Fienberg's Discrete Multivariate Analysis. In doing the modeling I ran some logistic regressions to look for influential cases and re-ran the Cox regressions with and without these cases.

I'm guessing that you might want to sample from the original files, at least for the exploratory phase, or you might have a bit of a wait. We had monthly data that covered a couple of years, so we could make charts of the quitters doing a nose dive to the baseline.
We made variables such as cumulative months of account balance decreases and the relative amount of the decrease. Allison's Event History Analysis from Sage was helpful, even though I think it predated the Cox regression methodology.

- I asked James Watts:

I already have the time-to-the-event variable (in months) and the indicator variable for the censored cases. Don't we have Cox regression in SPSS Professional 8.0? For which variables did you run the logistic regression? Just the person's spending over the months? Or did you use some more variables? I have other ones such as: sex, age, age of account, marital status, profession, earnings, etc.

What did you mean by:

  - cumulative months of account balance decreases? Is it the number of times the client spent less in a month than in the previous month?
  - the relative amount of the decrease? Is it the mean of all the percentages of spending decrease between a month and the previous month?

I'm having some difficulty finding books on this topic. I ordered Applied Survival Analysis by Chap T. Lee from Books.com. Can you point me to some other ones?

- James Watts answered:

Hi,

Cox regression is in the advanced models in v.9 and I think it was in that module in v.7. Don't know about v.8.

We ran all of the analyses on the event of a household closing a checking account. We had two years of monthly data on deposits, ATM transactions, monthly checking and savings balances, as well as the number of accounts in the household and the length of time the household had been a customer of any service. We generated lots of variables and just tested them in the Cox regression, logistic regression and crosstabs, using Mantel-Haenszel (sp?) testing for ordered pairs.

If you get a copy of the SPSS manual that has the Cox regression explanation in it, you will see the types of graphs I am talking about.
I think these graphs chart differences in the predicted risk function between the censored and event cases, split by categories of an independent variable, such as high relative checking balance decline vs. low, or whatever.

I think we used logistic regression to test runs of the event cases vs. the censored cases, the same as binary logistic regression. That wasn't successful without some sort of matching -- I think we matched on the length of time the household had been a customer, but this wasn't a true case-matching logistic regression, but rather an attempt at make-shift analyses to get the predicted probabilities that logistic regression produces. Cox regression predicts the hazard or risk rate.

You know that you can use OLS on the time duration variable if you select out the censored cases. I think we did some of that also, but in the end we ran a Cox regression on categories of variables such as the number of months of declines in balances (that is, the next month's balance minus the previous month's, and so on for several months), categories of account balances and categories of number of accounts. We used categories from continuous variables because we were looking for cut-off points that would be simple rules for the bank managers to implement.

I think you just need to play with your data in these various ways, in a trial and error manner, to see what works best. The best discussion of Cox regression I've seen is on the SAS web site. Search on PHREG or proportional hazard regression. I haven't seen the book you describe. Good luck.

- Zachary S. Feinstein wrote:

Look into Survival Analysis and Hazard Modeling. I knew about this stuff many years ago, but not much now. Let me know if this helps.

- Arthur Weiss wrote:

With what you have given me, I can't really help. You need to understand why clients cancel. And I would need to know more about your product and clients.
For example, you could have sales patterns like this:

Sales per month, by client number

Month         1       2       3       4
Month 0     100       -    1000     100
Month -1    200       -    1000     200
Month -2    300       -    1000     300
Month -3    400     100     900     200
Month -4    500       -     800     120
Month -5    500       -     600     230
Month -6    500       -     500     340
Month -7    500       -     400     240
Month -8    500     100     300     130
Month -9    500       -     200     250
Month -10   500       -     100     360
Month -11   500       -     100     260
Month -12   500     100     100     160

Client 1: May drop, as his usage has been decreasing steadily over a few months.

Client 2: Occasional usage. May cancel as he has no use for the service.

Client 3: Usage has increased but stopped increasing. Maybe is dual sourcing (i.e. buying from a competitor and you at the same time to compare products). Client may cancel if the competitor gives a better offer.

Client 4: Cyclical purchasing - but the cycles (and the moving average) are decreasing over time. So similar to client 1.

There are many more patterns. You need to be in touch with your clients to understand their purchasing behaviour. (There may be different behaviours for different groups of clients.) Only then can you start to model when clients will drop.

Look at patterns in the cancelled file - and try to find similar patterns in the active file. This should help you. And yes, you should be able to do this with SPSS - although I am not an SPSS expert and so can't advise on how to do this.

The important thing is to stop cancellation - as you know. The cost of finding a new customer is estimated at about 3 to 5 times that of keeping an existing customer. Also, you may find it better and cheaper to reactivate a cancelled customer than to find a new one. (But find out why they cancelled first. This should be part of your marketing process. Whenever a customer cancels, you should always follow up with a polite request as to why. (e.g. Dear Customer. We understand that you no longer wish to purchase from us. Although we are unhappy with your decision, we fully respect your feelings.
If this has been due to poor service or quality on our part, please let us know so that we can work on improving our products. Perhaps at a future date, you may decide that we meet your needs again. In the meantime, please fill in the attached questionnaire as to why you cancelled. OK - this letter is far from perfect, but it may give you ideas!)

- Angelika Schaffrath Rosario wrote:

I think you could do a logistic regression: cancelling yes/no as the dependent variable and the amounts spent in the previous months as independent variables. On the other hand, you might have problems with collinearity, since the independent variables are then probably highly correlated.

Another idea would be to take the differences between month(n) and month(n-1), month(n-1) and month(n-2), etc. as independent variables, or the regression coefficient of the regression line over the past months. However, I don't know how you account for the extra variability introduced if you use an estimated regression line as an independent variable. I am actually not even sure if you have to account for this. I think it has to do with two-stage models.

Anyhow, I think you must combine your two databases, the one of the clients that cancelled and the other of the clients that didn't cancel.

I would also like to put a question to you: I am a German statistician, but my husband is from Brazil. I have read somewhere that there is a Brazilian statistics list, but I don't know how to subscribe to it. Do you happen to know this? If you know, I would be very grateful if you could let me know about it.

- I answered Angelika:

I am trying to solve this problem through Survival Analysis, but I am just starting on it. I will try Logistic Regression as well. I have already combined my two databases, the clients who have already cancelled and the ones who haven't yet (censored). Let's see if I can find out something.

So your husband is Brazilian? I am also looking for a statistics list here in Brazil, but haven't found one yet.
If I find one I will mail it to you; please, if you find one first, do the same.

- Paul Allison wrote:

I would use survival analysis with time-varying covariates. With SPSS you could either do a Cox regression or a discrete-time logistic regression. See my book "Event History Analysis" for an explanation of how to do this.

- Dr. Allan White wrote:

I suggest that you use discriminant analysis.
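
Several of the replies above (Paul Allison's discrete-time logistic regression, James Watts's "cumulative months of decreases") depend on restructuring the data into one row per client per month at risk. What follows is only a minimal Python sketch of that restructuring, not anyone's actual method from the thread. The `Client` class, the field names and the three example clients are hypothetical and invented for illustration; months are numbered from 1 (oldest) rather than n-11 ... n, and `declines_so_far` is just one possible reading of Watts's variable.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Client:
    """Hypothetical record layout, assumed for illustration only."""
    client_id: int
    spend: List[float]           # monthly spending, oldest month first
    cancel_month: Optional[int]  # 1-based month of cancellation; None = still active (censored)


def person_period_rows(client: Client) -> List[dict]:
    """Expand one client into one row per month in which he was at risk of cancelling."""
    last = client.cancel_month if client.cancel_month is not None else len(client.spend)
    rows = []
    declines = 0  # number of months so far with lower spending than the month before
    for month in range(1, last + 1):
        amount = client.spend[month - 1]
        if month > 1 and amount < client.spend[month - 2]:
            declines += 1
        rows.append({
            "client_id": client.client_id,
            "month": month,
            "spend": amount,
            "declines_so_far": declines,
            # 1 only in the month of cancellation, 0 in every other at-risk month
            "cancelled": int(client.cancel_month == month),
        })
    return rows


def discrete_hazard(rows: List[dict]) -> Dict[int, float]:
    """Life-table style estimate: share of at-risk clients who cancel in each month."""
    at_risk: Dict[int, int] = {}
    events: Dict[int, int] = {}
    for r in rows:
        at_risk[r["month"]] = at_risk.get(r["month"], 0) + 1
        events[r["month"]] = events.get(r["month"], 0) + r["cancelled"]
    return {m: events[m] / at_risk[m] for m in sorted(at_risk)}


# Made-up example data: two clients who cancelled and one censored client.
clients = [
    Client(1, [500, 400, 300, 200], cancel_month=4),
    Client(2, [100, 100, 100, 100], cancel_month=None),
    Client(3, [300, 200, 100], cancel_month=3),
]

rows = [row for c in clients for row in person_period_rows(c)]
hazard = discrete_hazard(rows)

for month, h in hazard.items():
    print(month, round(h, 3))
```

The resulting rows could then be passed to any binary logistic regression routine with `cancelled` as the dependent variable and `month`, `spend` and `declines_so_far` as covariates. The per-month hazard computed here is only a sanity check on the restructured data, not a fitted model.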