Dear listmembers:
Thanks a lot to all those who replied to my question. Today I'm going to begin
trying to solve it through Survival Analysis. Does anybody disagree, considering
all the points stated? Below are copies of all the suggestions I got, together
with my original question:
 I asked:
Dear list members:
I have a question and I hope you can help me.
I have a database with 22649 cases of clients who cancelled their accounts (I
also have another database with the clients that are still active, i.e. the
clients that have not cancelled yet). It is modeled like this:
IDClient  Month(n)  Month(n-1)  Month(n-2)  ...  Month(n-11)  Cancel Date
- IDClient is an identifier of the client (integer);
- Month(n) is how much the client spent, in dollars, in month n;
- Month(n-1) is how much the client spent, in dollars, in month n - 1;
- Month(n-2) is how much the client spent, in dollars, in month n - 2;
- ...
- Cancel Date is the date (mm/dd/yy) the client cancelled his account.
I need to design a model to predict, some months beforehand (maybe one or two),
that a given client will cancel his account, so I can get in touch with him and
try to change his mind.
What technique should I use? Survival Analysis? Any suggestion using SPSS 8.0
Professional?
Can anybody help me?
 Jay Warner wrote:
I haven't done something of this _exact_ sort, but I think I can help
you help yourself.
1) Recognize that the whole world is a system. In this case the
output is probability of canceling in a given month, or percent of a
group that cancel in a month.
2) The inputs should be things you can monitor, or even better,
adjust in some way to reduce p(cancel). They may or may not be things
you currently have in your data base.
3) Now we have to find a model equation, expressing the relationship
of inputs to output. I went to Agresti (1996), to find a generalized linear
model. He also suggests that when the response must be between 0 and 1,
then a logit link is advised:
g(mu) = log[mu/(1-mu)]
I believe he says that a binomial distribution will transform by the
logit into a normal distribution. Then he says, we can express all this
in the equation:
g(mu) = alpha + beta(1)*x(1) + ... + beta(n)*x(n)
where the right side is a standard linear model.
4) Next we ask what the beta's might be - we want big ones (absolute
value). How to get a mu different than we have?
5) Use step 5! I can group similar customers (similar by monthly
sales, for ex.), and for each group calculate a % who cancel in a
month. I might want to look at groups by percent of purchase decline
over the last 6 month period. That is to say, a slope of sales by
customer over the time. Your suggestion of survival analysis suggests
you think there is a declining trend prior to canceling.
6) Finally, I can do what I know how to do easily. I set up a
designed experiment, working on the variables and variable levels used
for grouping above, and run the analysis.
7) Take the results, and predict % canceling for next month, by
group. See if it is (mostly) true. If so, go for it! You can now
identify early the customers likely to drop.
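A minimal sketch of the logit link from step 3 and the grouped cancellation rates from step 5, written in Python rather than SPSS. The three groups and their counts are hypothetical and only illustrate the arithmetic; they are not taken from the thread.

```python
import math


def logit(mu):
    """Logit link: g(mu) = log(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse link: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical groups: (label, customers in group, cancellations this month).
groups = [
    ("sales falling", 200, 30),
    ("sales flat", 500, 25),
    ("sales rising", 300, 6),
]

for label, n_customers, n_cancelled in groups:
    rate = n_cancelled / n_customers
    print(f"{label:14s} rate={rate:.3f} logit={logit(rate):+.3f}")
```

Working on the logit scale keeps any fitted value, once passed back through inv_logit, between 0 and 1.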
I have used this mathematical approach in manufacturing cases, where the
product test/inspection reported % fail. In a nutshell, it worked. We
cut the failure rate from 40% to under 15% in eight months. Worth
$500K/percentage point, I was told.
I have two concerns in this scenario, applied to your issue. First, I
am afraid that the data you have on each customer does not include the
key things that make them cancel. This can be tested, but only you know
the details of the product/service provided, customer alternatives and
customer perceptions. If you follow step 3 of the A2Q Method (tm), you
may well come up with a lot of good possible reasons for canceling.
Many of them you will not have in your database. When this happens,
many people focus only on the database information, because it is
there. Well, yes. But consider expanding your database, too.
Secondly, I don't feel comfortable with the way I got 0 or 1
(retain/cancel) into a percentage. I don't have Agresti in my hands at
the moment, but I think he discusses other ways of getting it, too. One
of them might work better.
Now, does all this help you understand how to resolve your problem? Do
you feel comfortable with the next step, and the next?
 Gene Maguin wrote:
My opinion is that you should put the two datasets together and use survival
analysis. I think the basic question you are asking is what variables
predict how long a person will maintain an open account.
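As a rough illustration of what "putting the two datasets together" means for survival analysis, the Python sketch below stacks cancelled and active clients into one list of durations plus an event flag (1 = cancelled, 0 = still active, i.e. censored) and computes a Kaplan-Meier survival curve. Kaplan-Meier is not named in this thread; it is used here only as a common first descriptive step. The client IDs and durations are invented.

```python
def kaplan_meier(durations, events):
    """Return [(t, S(t))] at each distinct time with at least one event.

    durations: months each client was observed.
    events: 1 if the client cancelled at that time, 0 if censored.
    """
    data = sorted(zip(durations, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        cancels = 0
        leaving = 0
        # Gather everyone whose observation ends at time t.
        while i < len(data) and data[i][0] == t:
            cancels += data[i][1]
            leaving += 1
            i += 1
        if cancels:
            survival *= 1.0 - cancels / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical data: (IDClient, months until cancellation / months observed).
cancelled = [(101, 5), (102, 8), (103, 8), (104, 12)]
active = [(201, 6), (202, 12), (203, 12)]

durations = [m for _, m in cancelled] + [m for _, m in active]
events = [1] * len(cancelled) + [0] * len(active)
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"month {t:2d}: estimated share still active = {s:.3f}")
```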
 I wrote to Gene Maguin:
I have already put the two datasets together to use survival analysis. Yes, I
need to know what variables predict how long a person will maintain an open
account; however, the most important one is the amount the client spends each
month before he closes his account. I have this information for the last 12
months.
 Gene Maguin wrote:
Let me make sure I have a clear idea of what you want to know. Is this your
analytical question?
Q: Does a person's spending increase in the months prior to the month in
which they close their account?
Note that this question does not require data on persons who do not close
their account.
 I wrote to Gene Maguin:
I'm not sure, but I think the spending tends to diminish in the months prior to
their cancelling. So, based on a given client's spending I would like to predict
(with some level of confidence) whether or not he will cancel his account in the
following one, two or three months. I have some other information about the
persons as well, such as: age of the account, sex, age, profession, marital
status, earnings, etc.
I need to know when the person is going to cancel.
So I think I need the information on those who did not cancel as well, don't
I?
 Gene Maguin wrote:
Let's do some data exploration first.
1. Select persons who canceled in a particular month and do descriptives
and/or plots of the means for the months previous to their cancelation.
2. Repeat for several different months. This will give you an idea of what
actually does happen in the months preceding cancellation.
3. To test your conclusions, run a one within factor manova, either repeated
measures or true manova, for persons who cancelled in a particular month to
see if the means for the months preceding cancellation are actually not
significantly different.
4. Repeat this for several different months to see if the conclusions
generalize over different cancellation months.
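Steps 1 and 2 above might look like the following in Python, as a sketch only: for clients who cancelled in one particular month, compute the mean spending at each month position before cancellation. The rows are invented, and the column order (oldest month first) is an assumption about how the data would be arranged.

```python
from statistics import mean

# Hypothetical clients who all cancelled in the same month. Each row is
# [Month(n-3), Month(n-2), Month(n-1), Month(n)], oldest month first.
cancelled_same_month = [
    [120.0, 100.0, 60.0, 20.0],
    [80.0, 90.0, 50.0, 10.0],
    [200.0, 150.0, 130.0, 90.0],
]


def mean_profile(rows):
    """Mean spending at each month position, across clients."""
    return [mean(column) for column in zip(*rows)]


profile = mean_profile(cancelled_same_month)
for lag, value in zip(range(len(profile) - 1, -1, -1), profile):
    print(f"Month(n-{lag}): mean spending = {value:.2f}")
```

Repeating this for several cancellation months and comparing the profiles is the descriptive counterpart of the repeated-measures test in step 3.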
 Armin Roehrl wrote:
I read your posting to the allstat mailinglist.
If you want, I can try some neural networks, decision trees,
etc. on that problem.
Do you have further information than just how much the customers
spent.. do you also know what they bought?
How much would you be willing to spend?
Best regards,
Armin, Statoo.com team
(www.statoo.com)
 I wrote to Armin:
No, I don't know what they bought. I have the age of his account, sex, age,
profession, marital status, earnings, etc.
Since it is the final project of my graduation course (Business Administration),
I need some help to do the task by myself and am not willing to spend anything.
If you have some tips, please let me know.
 Armin Roehrl wrote:
Hey,
it's me again.
If we write a joint paper on that, maybe we can
arrange some good deal, i.e. I run it
through my neural networks, etc.
and you do the corresponding business talk, etc.
 I wrote to Armin:
Yes, we can write a joint paper, but what do you mean by "do the corresponding
business talk"?
Since it is data that belongs to a company, I do not have permission to forward
the data to anyone else. I can only use it for the graduation final project. Can
I use your neural networks here?
 Armin Roehrl wrote:
Is your main analysis on the statistics side, or on the economics/business side?
Suppose I would help with the statistics, i.e. try out some of my
algorithms/code, etc.
Can't you ask the company for permission? If all you want is some neural
network code to play with yourself, I recommend you download
the SNNS code and then play with it.. ask again if you want more details.
If I can't look at the data.. I'm not really willing to play with that project.
 I wrote to Armin:
My purpose is on the Statistics side. I asked for their permission, but they
didn't allow it. I'll try the SNNS code. Thanks for your help.
 Armin Roehrl wrote:
Good luck, Armin
Start with the standard SNNS and then later combine
CART with SNNS (any architecture.. CART will
help you find the right topology)
 John Shade wrote:
Before looking to any special methods, I would suggest trying some very
simple analyses. For example, standardise each row (the cases) by dividing
it by the median value for that row. Split the data into 'cancel' and 'not
cancel' and plot a few randomly selected samples from each to look for
possible patterns. For example, perhaps appreciably more 'cancels' are
preceded by a steep downward trend than 'not cancels'. It may also be that
some samples look to have more autocorrelation than others, or are more
erratic and so on. Perhaps some simple model might look appropriate for
many of the cases. At the very least, this work will increase your
familiarity with the data, highlight suspect values, and build up a
realistic expectation of how predictable the cancellation is, based on
just these past account values and statistics derived from them. You could
then move to try more formal methods.
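The row standardisation suggested above can be sketched in a few lines of Python. The handling of a zero median (returning None) is an assumption added here so that the division is always defined; the thread does not say what to do in that case.

```python
from statistics import median


def standardise_row(spend):
    """Divide each monthly value by the row median.

    Returns None when the median is 0, since the ratio is undefined
    (an assumption of this sketch, not something the thread specifies).
    """
    m = median(spend)
    if m == 0:
        return None
    return [x / m for x in spend]


# Hypothetical clients with very different spending levels but the same shape.
small_spender = [10, 20, 40]
big_spender = [100, 200, 400]
print(standardise_row(small_spender))
print(standardise_row(big_spender))
```

After standardisation the two rows are identical, which is the point: plots then compare the shape of the spending history rather than its level.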
We sell a software product called CART which is very useful for going through
large data sets looking for variables or combinations of variables with
good predictive power for some final classification (such as your
cancel/not cancel). You would want to add some variables suggested by the
initial examination suggested above, and if you also had variables such as
age, sex, occupation code, postcode (it can handle postcodes or zipcodes as
categorical variables) and so on, then so much the better - CART can work
with them all, and cope admirably (and automatically if allowed to) with
missing values and outliers. CART can provide a final model for predictive
use, or you can use CART to identify variables to be used in other
modelling methods e.g. logistic regression.
 Jay Warner wrote:
Another thought for you. Why not start with your pile of data, and
separate into 2 groups - those who canceled and those who didn't. Add
an item for percent change in sales for each, then do a 'slice and dice'
scene - do scatter plots, AoV's and the like, examining the two groups
for obvious differences. I would do this first. Then, if anything
comes up, I would go back and look very carefully at each item, perhaps
in a multiple regression type analysis. In the second examination,
check for proper orthogonality of data, as well.
 John Aitchison wrote:
Interesting problem. Presumably the 'pattern' that most predicts
cancelation is that of a declining level of expenditure over several
months, or of a constant low level of expenditure and coming up to
renewal time (if there are annual fees). So, I would construct
synthetic features "declining expenditure" and "consistent low level
expenditure" and use those in some standard classification
algorithm (eg CART, discriminant analysis). The "declining
expenditure" feature I would construct by fitting a regression model
to each individual's data, and using the estimated slope for the
individual as the relevant feature. (You could also consider using the
intercept and the variance.)
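A sketch of the per-client regression feature in Python: an ordinary least-squares line fitted to one client's monthly spending against time, returning the intercept, slope and residual variance. The time index 0, 1, 2, ... and the example values are assumptions for illustration.

```python
def ols_line(spend):
    """Fit spend = intercept + slope * t by least squares, t = 0, 1, 2, ...

    spend: monthly values, oldest first (needs at least 3 months).
    Returns (intercept, slope, residual_variance).
    """
    n = len(spend)
    t_mean = (n - 1) / 2
    y_mean = sum(spend) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(spend))
    slope = sxy / sxx
    intercept = y_mean - slope * t_mean
    residuals = [y - (intercept + slope * t) for t, y in enumerate(spend)]
    residual_variance = sum(r * r for r in residuals) / (n - 2)
    return intercept, slope, residual_variance


# Hypothetical client whose spending drops by 100 every month.
intercept, slope, variance = ols_line([500, 400, 300, 200, 100])
print(intercept, slope, variance)
```

The slope (and optionally the intercept and variance) would then be passed, one row per client, to whatever classifier is chosen.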
 Nikolai Kolev wrote:
I think that the following model is appropriate for modelling your data:
===================================================================
LaTeX version:
Let $Y_i$ be i.i.d. r.v.'s having Bernoulli distribution with a
parameter $p$. Then consider the following process
$$
X_n = \sum_{i=1}^{X_{n-1}} Y_i + Z_n, \quad n = 1,2,\ldots
$$
where $Z_n$ is another sequence of i.i.d. r.v.'s having some discrete
distribution (for example Poisson) and being independent of the sequence
$Y_i$.
The above model is known in the literature as a "binomial thinning" and
has a simple interpretation:
- the random sum $\sum_{i=1}^{X_{n-1}} Y_i$ represents the
number of survivors in the period $(n-1,n]$;
- the sequence $Z_n$ gives the number of the new elements entering
during the same period $(n-1,n]$.
===================================================================
Reference:
Al-Osh, M.A. and Alzaid, A.A. (1987). First-order integer-valued
autoregressive (INAR(1)) process, Journal of Time Series Analysis, Vol.
8, No. 3, 261-275.
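To make the binomial thinning model more concrete, here is a small Python simulation of it, with Poisson arrivals, followed by simple moment-based estimates of $p$ (the lag-1 autocorrelation) and of the Poisson mean (sample mean times $1 - \hat{p}$). The parameter values, the seed and the choice of moment estimators are assumptions of this sketch; see the cited paper for the estimation methods actually studied there.

```python
import math
import random


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's method (fine for small lam)."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def simulate_inar1(p, lam, x0, steps, rng):
    """X_n = (survivors among X_{n-1}, each kept with prob. p) + Z_n."""
    xs = [x0]
    for _ in range(steps):
        survivors = sum(1 for _ in range(xs[-1]) if rng.random() < p)
        xs.append(survivors + poisson(lam, rng))
    return xs


def moment_estimates(xs):
    """Return (p_hat, lam_hat) from the lag-1 autocorrelation and the mean."""
    n = len(xs)
    m = sum(xs) / n
    c0 = sum((x - m) ** 2 for x in xs)
    c1 = sum((xs[i] - m) * (xs[i + 1] - m) for i in range(n - 1))
    p_hat = c1 / c0
    return p_hat, m * (1.0 - p_hat)


rng = random.Random(0)  # fixed seed so the run is repeatable
xs = simulate_inar1(p=0.9, lam=5.0, x0=50, steps=2000, rng=rng)
p_hat, lam_hat = moment_estimates(xs)
print(f"p_hat={p_hat:.3f} lam_hat={lam_hat:.3f}")
```

Note that this models the monthly count of active clients, not which individual client will cancel.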
 Nikolai Kolev wrote:
> but I didn't understand your notes pretty much. Can you detail it a
> little bit more.
For this, a little more knowledge of probability and of the
interpretation of the simple discrete distributions is necessary. See below.
> I need to know if a given person will survive or not,
Very good. But I don't believe that there exists an expert who can give
an answer to that question, predicting even 50% of the cases. The
experts in banks are basically following their intuition, and some crazy
deterministic rules (created 30-40 years ago), which are
satisfactory for the corresponding bank.
> and not just the number of people who survived. Ok ?
If you know (approximately, using some discrete probabilistic model) the
number of clients surviving after the n-th month, you could get an idea (from
your data base) of which subset of clients to talk to.
>
> I think that the following model is appropriate for modelling your data:
>
I am reconfirming my opinion.
> ===================================================================
>
> LaTeX version:
>
> Let $Y_i$ be i.i.d. r.v.'s having Bernoulli distribution with a
> parameter $p$. Then consider the following process
> $$
> X_n = \sum_{i=1}^{X_{n-1}} Y_i + Z_n, \quad n = 1,2,\ldots
> $$
> where $Z_n$ is another sequence of i.i.d. r.v.'s having some discrete
> distribution (for example Poisson) and being independent of the sequence
> $Y_i$.
>
This is a discrete time series model. From your data, you could estimate
the parameters of the underlying Bernoulli and Poisson distributions.
> The above model is known in the literature as a "binomial thinning" and
> has a simple interpretation:
>
> - the random sum $\sum_{i=1}^{X_{n-1}} Y_i$ represents the
> number of survivors in the period $(n-1,n]$;
>
> - the sequence $Z_n$ gives the number of the new elements entering
> during the same period $(n-1,n]$.
>
If you read the paper cited, and if you have some knowledge of discrete
distributions, you will be able to understand and get a feel for the model
suggested.
 Dave Reilly wrote:
What you have is a time series problem where you wish to use historical
data in order to assess an unusual value for purposes of early warning
or intervention detection.
One of our clients is using/investigating this in a banking application
where the bank wishes to be proactive and sense a change in status.
 James Watts wrote:
Cox regression in advanced models group. Will need time-to-the-event variable,
and indicator variable of censored cases.
I used it to forecast differences in likelihood of closing banking accounts.
Also look at the Mantel-Haenszel approach which I think is in Bishop and
Fienberg's Discrete Multivariate Analysis. In doing the modeling I ran some
logistic regressions to look for influential cases and reran the cox regressions
with and w/o these cases. I'm guessing that you might want to sample from the
original files at least for the exploratory phase, or you might have a bit of a
wait. We had monthly data that covered a couple of years, so could make charts
of the quitters doing a nose dive to the baseline. Made variables such as
cumulative months of account balance decreases and the relative amount of the
decrease. Allison's Event History Analysis from Sage was helpful even though I
think it predated the cox regression methodology.
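One plausible reading of "cumulative months of decreases" and "relative amount of the decrease" is sketched below in Python: count the months in which the value fell below the previous month's, track the longest unbroken run of such months, and average the proportional drops. These definitions are assumptions made for illustration and are not confirmed anywhere in this thread.

```python
def decline_features(spend):
    """Summarise month-over-month declines in one client's history.

    spend: monthly values, oldest first.
    Returns (months_down, longest_run, mean_relative_drop) where
    mean_relative_drop averages (previous - current) / previous over the
    declining months whose previous value is positive.
    """
    months_down = 0
    run = 0
    longest_run = 0
    drops = []
    for previous, current in zip(spend, spend[1:]):
        if current < previous:
            months_down += 1
            run += 1
            longest_run = max(longest_run, run)
            if previous > 0:
                drops.append((previous - current) / previous)
        else:
            run = 0
    mean_relative_drop = sum(drops) / len(drops) if drops else 0.0
    return months_down, longest_run, mean_relative_drop


# Hypothetical client: falls, holds steady, then falls twice in a row.
features = decline_features([100, 80, 80, 40, 20])
print(features)
```

Features like these can be cut into categories (for example 0, 1-2, 3+ months of decline) if simple decision rules are the goal.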
I asked James Watts:
I already have the time-to-the-event (in months) variable and the indicator
variable of the censored cases.
Don't we have the Cox Regression in SPSS Professional 8.0?
For what variables did you run the logistic regression? Just the person's
spending along the months? Or did you use some more variables? I have other ones
such as: sex, age, age of account, marital status, profession, earnings, etc.
What did you mean by:
- cumulative months of account balance decreases? Is it the number of times
the client spent less in a month than in the previous month?
- the relative amount of the decrease? Is it the mean of all the percentages of
spending decrease between a month and the previous month?
I'm having a little difficulty finding books on this topic. I ordered the
Applied Survival Analysis from Chap T. Lee from Books.com. Can you point me to
some other ones?
 James Watts answered:
Hi, Cox regression is in the advanced models in v. 9 and I think it was in that
module in v.7. Don't know about v.8.
We ran all of the analyses on the event of a household closing a checking
account. We had two years of monthly data on deposits, ATM transactions, monthly
checking and savings balances as well as number of accounts in household and
length of time the household had been a customer of any service. We generated
lots of variables and just tested them in the cox reg, log reg and crosstabs,
using mantel-haenszel (sp?) testing for ordered pairs. If you get a copy of the
SPSS manual that has the cox reg explanation in it, you will see the types of
graphs I am talking about. I think these graphs chart differences in the
predicted risk function between the censored and event cases split by categories
of an independent variable, such as high relative checking balance decline vs.
low, or whatever. I think we used logistic regression to test runs of the event
cases vs. the censored cases, the same as binary log reg. That wasn't successful
without some sort of matching - I think we matched on length of time the
household had been a customer, but this wasn't a true case-matching log reg, but
rather a makeshift attempt to get the predicted probabilities that log reg
produces. Cox reg predicts the hazard or risk rate. You know that you can use
OLS on the time duration variable if you select out the censored cases. I think
we did some of that also, but in the end we ran a cox reg on categories of
variables such as the number of months of declines in balances (that is, the
next month's balance minus the previous month's, and so on for several months),
categories of account balances and categories of number of accounts. We used
categories from continuous variables because we were looking for cut-off points
that would be simple rules for the bank managers to implement. I think you just
need to play with your data in these various ways in a trial and error manner to
see what works best. The best discussion of cox reg I've seen is on the SAS web
site. Search on phreg or proportional hazard regression. Haven't seen the book
you describe. Good luck
 Zachary S. Feinstein wrote:
Look into Survival Analysis and Hazard Modeling. I knew about this stuff many
years ago, but not much now. Let me know if this helps.
 Arthur Weiss wrote:
With what you have given me, I can't really help. You need to
understand why clients cancel. And I would need to know more about
your product and clients.
For example: You could have sales patterns like this:
Sales per month
Client Number      1      2      3      4
Month 0          100          1000    100
Month 1          200          1000    200
Month 2          300          1000    300
Month 3          400    100    900    200
Month 4          500           800    120
Month 5          500           600    230
Month 6          500           500    340
Month 7          500           400    240
Month 8          500    100    300    130
Month 9          500           200    250
Month 10         500           100    360
Month 11         500           100    260
Month 12         500    100    100    160
Client 1: May drop as his usage has been decreasing steadily over a
few months.
Client 2: Occasional usage. May cancel as has no use for service.
Client 3: Usage has increased but stopped increasing. Maybe is dual
sourcing (i.e. buying from a competitor and you at the same time to
compare products.) Client may cancel if competitor gives a better
offer.
Client 4: Cyclical purchasing - but the cycles (and moving average) are
decreasing over time. So similar to client 1.
There are many more patterns. You need to be in touch with your
clients to understand their purchasing behaviour. (There may be
different behaviours for different groups of client). Only then can
you start to model when clients will drop. Look at patterns in the
cancelled file - and try and find similar patterns in the active file.
This should help you.
And yes, you should be able to do this with SPSS - although I am not
an SPSS expert and so can't advise on how to do this.
The important thing is to stop cancellation - as you know. The cost of
finding a new customer is estimated at about 3 to 5 times that of
keeping an existing customer. Also, you may find it better and cheaper
to reactivate a cancelled customer than find a new one. (But find out
why they cancelled first.) This should be part of your marketing
process. Whenever a customer cancels, you should always follow up with
a polite request as to why, e.g.:
Dear Customer. We understand that
you no longer wish to purchase from us. Although we are unhappy with
your decision, we fully respect your feelings. If this has been due to
poor service or quality on our part, please let us know so that we can
work on improving our products. Perhaps at a future date, you may
decide that we meet your needs again. In the meantime, please fill in
the attached questionnaire as to why you cancelled.
OK - this letter is far from perfect, but it may give you ideas!
 Angelika Schaffrath Rosario wrote:
I think you could do a logistic regression: cancelling yes/no as the
dependent variable and the amounts spent in the previous months
as independent variables. On the other hand, you might have
problems with collinearity since the independent variables are then
probably highly correlated. Another idea would be to take the
differences between month(n) and month(n-1), month(n-1) and
month(n-2), etc. as independent variables, or the regression
coefficient of the regression line over the past months. However, I
don't know how you account for the extra variability introduced if
you use an estimated regression line as independent variable. I am
actually not even sure if you have to account for this. I think it has
to do with two-stage models.
Anyhow, I think you must combine your two databases, the one of
the clients that cancelled and the other of the clients that didn't
cancel.
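The collinearity concern and the month-to-month differences can both be checked quickly. The Python sketch below computes the correlation between two adjacent monthly columns and then the first differences for each client; the four clients and their values are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def first_differences(spend):
    """Differences between consecutive months, oldest first."""
    return [current - previous for previous, current in zip(spend, spend[1:])]


# Hypothetical clients, four months each, oldest month first.
clients = [
    [100.0, 90.0, 70.0, 40.0],
    [300.0, 310.0, 305.0, 300.0],
    [50.0, 55.0, 60.0, 20.0],
    [200.0, 180.0, 185.0, 190.0],
]

columns = list(zip(*clients))
r_adjacent = pearson(columns[0], columns[1])
diffs = [first_differences(c) for c in clients]
print(f"correlation between the first two months: {r_adjacent:.3f}")
print(diffs[0])
```

Raw monthly amounts mostly reflect how big a spender the client is, which is why adjacent months correlate so strongly; the differences remove that level.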
I would also like to put a question to you: I am a German
statistician, but my husband is from Brazil. I have read somewhere
that there is a Brazilian statistics list, but I don't know how to
subscribe to it. Do you happen to know this? If you do, I would be
very grateful if you could let me know.
 I answered Angelika:
I am trying to solve this problem through Survival Analysis, but I am just
starting on it. I will try Logistic Regression as well. I have already combined
my two databases, the one with those who have already cancelled and the one with
those who haven't yet (censored). Let's see if I can find out something.
So your husband is Brazilian? I am also looking for a statistics list here in
Brazil, but haven't found one yet. If I find one I will mail it to you; if you
find one first, please do the same.
 Paul Allison wrote:
I would use survival analysis with time varying covariates. With SPSS you
could either do a Cox regression or discrete-time logistic regression. See
my book "Event History Analysis" for an explanation of how to do this.
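Discrete-time logistic regression is usually fitted to "person-period" data: one record per client per observed month, with the event flag set to 1 only in the month of cancellation. The Python sketch below shows that expansion for the wide layout described in the original question; the function name, the record layout and the example values are assumptions for illustration.

```python
def person_period(client_id, spend, cancelled):
    """Expand one client into one record per observed month.

    spend: monthly spending, oldest first, up to the last observed month.
    cancelled: True if the account was cancelled after the last month,
               False if the client is still active (censored).
    Each record is (client_id, month_index, spend_this_month, event).
    """
    last = len(spend) - 1
    rows = []
    for t, amount in enumerate(spend):
        event = 1 if (cancelled and t == last) else 0
        rows.append((client_id, t + 1, amount, event))
    return rows


# Hypothetical clients: one who cancelled, one still active.
rows = person_period(7, [50.0, 30.0, 10.0], cancelled=True)
rows += person_period(8, [40.0, 45.0, 50.0], cancelled=False)
for row in rows:
    print(row)
```

An ordinary logistic regression of the event flag on the month index and the monthly spending (a time-varying covariate) can then be run on the stacked records.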
 Dr. Allan White wrote:
I suggest that you use discriminant analysis.
