ALLSTAT Archives

How to Predict Cancelling ?


Thu, 27 May 1999 11:11:10 -0300





Dear list-members:

Thanks a lot to all those who replied to my question. Today I'm going to begin
trying to solve it through Survival Analysis. Does anybody disagree, considering
all the points stated ? Below are the copies of all the suggestions I got with
my original question:

- I asked:

Dear list members:

I have a question and I hope you can help me.

I have a database with 22649 cases of clients who cancelled their accounts (I also
have another database with the clients that are still active, i.e. those that have
not cancelled yet). It is modeled like this:

IDClient Month(n) Month(n-1) Month(n-2) ... Month(n-11) Cancel Date

- IDClient is an identifier of the client (integer);
- Month(n) is how much the client spent in dollars in month n;
- Month(n-1) is how much the client spent in dollars in month n - 1;
- Month(n-2) is how much the client spent in dollars in month n - 2;
- ...
- Cancel Date is the date (mm/dd/yy) the client cancelled his account.

I need to design a model to predict, some months beforehand (maybe one or two),
that a given client will cancel his account, so I can try to get in touch with
him and try to change his mind.

What technique should I use ? Survival Analysis ? Any suggestion using SPSS 8.0
Professional ?

Can anybody help me ?

- Jay Warner wrote:

I haven't done something of this _exact_ sort, but I think I can help
you help yourself.

1) Recognize that the whole world is a system. In this case the
output is probability of canceling in a given month, or percent of a
group that cancel in a month.

2) The inputs should be things you can monitor, or even better,
adjust in some way to reduce p(cancel). They may or may not be things
you currently have in your data base.

3) Now we have to find a model equation, expressing the relationship
of inputs to output. I went to Agresti (1996), to find a general linear
model. He also suggests that when the response must be between 0 and 1,
then a logit link is advised.

g(mu) = log [mu/(1-mu)] I believe he says that a binomial
distribution will transform by the logit into a normal distribution.

Then he says, we can express all this in the equation:

g(mu) = alpha + beta(1)*x(1) + ...beta(n)*x(n)

where the right side is a standard general linear model.

4) Next we ask what the beta's might be - we want big ones (absolute
value). How to get a mu different than we have?

5) Use step 5! I can group similar customers (similar by monthly
sales, for ex.), and for each group calculate a % who cancel in a
month. I might want to look at groups by percent of purchase decline
over the last 6 month period. That is to say, a slope of sales by
customer over the time. Your suggestion of survival analysis suggests
you think there is a declining trend prior to canceling.

6) Finally, I can do what I know how to do easily. I set up a
designed experiment, working on the variables and variable levels used
for grouping above, and run the analysis.

7) Take the results, and predict % canceling for next month, by
group. See if it is (mostly) true. If so, go for it! You can now
identify early the customers likely to drop.
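
A minimal sketch of the grouping idea in steps 5 and 7, together with the logit link from step 3. The group labels and the toy records are hypothetical, and the function names are only illustrative; this is not taken from Agresti or from any particular package.

```python
import math


def logit(mu):
    """Logit link g(mu) = log(mu / (1 - mu)), defined for 0 < mu < 1."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse of the logit: maps a linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def cancel_rate_by_group(records):
    """records: iterable of (group_label, cancelled) pairs, cancelled in {0, 1}.

    Returns a dict mapping each group to the fraction of its clients who cancelled.
    """
    totals = {}
    for group, cancelled in records:
        n, k = totals.get(group, (0, 0))
        totals[group] = (n + 1, k + cancelled)
    return {group: k / n for group, (n, k) in totals.items()}


# Hypothetical toy data: clients grouped by the trend of their recent spending.
records = [
    ("declining", 1), ("declining", 1), ("declining", 0), ("declining", 1),
    ("steady", 0), ("steady", 0), ("steady", 1), ("steady", 0),
]
rates = cancel_rate_by_group(records)
for group, rate in rates.items():
    print(group, rate, round(logit(rate), 3))
```

The observed rate per group plays the role of mu; its logit is the quantity that the linear model in step 3 would try to explain. A group with a rate of exactly 0 or 1 has no finite logit, so very small groups would need to be merged first.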

I have used this mathematical approach in manufacturing cases, where the
product test/inspection reported % fail. In a nutshell, it worked. We
cut the failure rate from 40% to under 15% in eight months. Worth
$500K/percentage point, I was told.

I have two concerns in this scenario, applied to your issue. First, I
am afraid that the data you have on each customer does not include the
key things that make them cancel. This can be tested, but only you know
the details of the product/service provided, customer alternatives and
customer perceptions. If you follow step 3 of the A2Q Method (tm), you
may well come up with a lot of good possible reasons for canceling.
Many of them you will not have in your database. When this happens,
many people focus only on the database information, because it is
there. Well, yes. But consider expanding your database, too.

Secondly, I don't feel comfortable with the way I got 0 or 1
(retain/cancel) into a percentage. I don't have Agresti in my hands at
the moment, but I think he discusses other ways of getting it, too. One
of them might work better.

Now, does all this help you understand how to resolve your problem? Do
you feel comfortable with the next step, and the next?

- Gene Maguin wrote:

My opinion is that you should put the two datasets together and use survival
analysis. I think the basic question you are asking is what variables
predict how long a person will maintain an open account.

- I wrote to Gene Maguin:

I have already put the two datasets together to use survival analysis. Yes, I
need to know what variables predict how long a person will maintain an open
account. However, the most important one is the amount the client spends each month
before he closes his account. I have this information for the last 12 months.

- Gene Maguin wrote:

Let me make sure I have a clear idea of what you want to know. Is this your
analytical question?

Q: Does a person's spending increase in the months prior to the month in
which they close their account?

Note that this question does not require data on persons who do not close
their account.

- I wrote to Gene Maguin:

I'm not sure, but I think the spending tends to diminish in the months prior to
their cancelling. So, based on a given client's spending, I would like to predict
(with some level of confidence) whether or not he will cancel his account in the
following one, two or three months. I also have some other information about the
clients, such as: age of the account, sex, age, profession, marital status,
earnings, etc.
I need to know when the person is going to cancel.
So I think I need the information on those who did not cancel as well, don't
I ?

- Gene Maguin wrote:

Let's do some data exploration first.
1. Select persons who canceled in a particular month and do descriptives
and/or plots of the means for the months previous to their cancelation.
2. Repeat for several different months. This will give you an idea of what
actually does happen in the months preceding cancellation.
3. To test your conclusions, run a one within factor manova, either repeated
measures or true manova, for persons who cancelled in a particular month to
see if the means for the months preceding cancellation are actually not
significantly different.
4. Repeat this for several different months to see if the conclusions
generalize over different cancellation months.
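
A minimal sketch of steps 1 and 2, assuming each client record holds a cancellation month and the spending columns in the order of the original layout (Month(n) first, then Month(n-1), and so on). The field names `cancel_month` and `spend` and the toy values are hypothetical.

```python
from statistics import mean


def mean_profile(clients, cancel_month):
    """Mean spending per month-before-cancellation for one cancellation month.

    Position 0 of the result is Month(n), position 1 is Month(n-1), etc.
    Returns an empty list if nobody cancelled in that month.
    """
    rows = [c["spend"] for c in clients if c["cancel_month"] == cancel_month]
    if not rows:
        return []
    # zip(*rows) walks the table column by column.
    return [mean(column) for column in zip(*rows)]


# Hypothetical toy data with only three months of history per client.
clients = [
    {"cancel_month": "1999-03", "spend": [100, 150, 200]},
    {"cancel_month": "1999-03", "spend": [50, 100, 150]},
    {"cancel_month": "1999-04", "spend": [300, 300, 300]},
]
for month in ("1999-03", "1999-04"):
    print(month, mean_profile(clients, month))
```

Plotting these profiles for several cancellation months side by side gives the picture described in step 2.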

- Armin Roehrl wrote:

I read your posting to the allstat mailing-list.

If you want, I can try some neural networks, decision trees,
etc. on that problem.

Do you have further information than just how much the customers
spent? Do you also know what they bought?

How much would you be willing to spend?

Best regards,
    Armin, Statoo.com team

- I wrote to Armin:

No, I don't know what they bought. I have the age of the account, sex, age,
profession, marital status, earnings, etc.
Since it is the final project of my graduation course (Business Administration),
I need some help to do the task by myself, and I am not willing to spend anything.
If you have some tips, please let me know.

- Armin Roehrl wrote:

it's me again.
If we write a joint paper on that, maybe we can
arrange some good deal -- i.e. I run it
through my neural networks, etc.
and you do the corresponding business talk, etc.

- I wrote to Armin:

Yes, we can write a joint paper, but what do you mean by "do the corresponding
business talk" ?
Since the data belongs to a company, I do not have permission to forward
it to anyone else. I can only use it for the graduation final project. Can
I use your neural networks here ?

- Armin Roehrl wrote:

Is your main analysis on the statistics side, or on the economics/business side?
Suppose, I would help with the statistics, i.e. try out some of my
Can't you ask the company for permission? If all you want is some neural
network code to play with yourself, I recommend you download
the SNNS code and then play with it. Ask again if you want more details.
If I can't look at the data, I'm not really willing to play with that project.

- I wrote to Armin:

My purpose is on the Statistics side. I asked for their permission, but they didn't
allow it. I'll try the SNNS code. Thanks for your help.

- Armin Roehrl wrote:

Good luck, Armin
Start with the standard SNNS and then later combine
CARTS with SNNS (any architecture.. CARTS will
help you find the right topology)

- John Shade wrote:

Before looking to any special methods, I would suggest trying some very
simple analyses. For example, standardise each row (the cases) by dividing
it by the median value for that row. Split the data into 'cancel' and 'not
cancel' and plot a few randomly selected samples from each to look for
possible patterns. For example, perhaps appreciably more 'cancels' are
preceded by a steep downward trend than 'not cancels'. It may also be that
some samples look to have more autocorrelation than others, or are more
erratic and so on. Perhaps some simple model might look appropriate for
many of the cases. At the very least, this work will increase your
familiarity with the data, highlight suspect values, and build up a
realistic expectation of how predictable the cancellation is, based on
just these past account values and statistics derived from them. You could
then move to try more formal methods.
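
A minimal sketch of the row standardisation suggested above, assuming each client's spending is available as a plain list of monthly amounts. The function name and the handling of a zero median are assumptions for illustration only.

```python
from statistics import median


def standardise_row(spend):
    """Divide every monthly amount by the row's median.

    Returns None when the median is zero, since the row cannot be
    scaled this way (for example a client who rarely buys at all).
    """
    m = median(spend)
    if m == 0:
        return None
    return [x / m for x in spend]


# Hypothetical rows: a steadily growing client and a mostly inactive one.
print(standardise_row([100, 200, 400]))
print(standardise_row([0, 0, 5]))
```

After this step a big spender and a small spender with the same shape of decline look alike, which makes the plots of 'cancel' and 'not cancel' samples easier to compare.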

We sell a software product called CART which is very useful for going through
large data sets looking for variables or combinations of variables with
good predictive power for some final classification (such as your
cancel/not cancel). You would want to add some variables suggested by the
initial examination suggested above, and if you also had variables such as
age, sex, occupation code, postcode (it can handle postcodes or zipcodes as
categorical variables) and so on, then so much the better - CART can work
with them all, and cope admirably (and automatically if allowed to) with
missing values and outliers. CART can provide a final model for predictive
use, or you can use CART to identify variables to be used in other
modelling methods e.g. logistic regression.

- Jay Warner wrote:

Another thought for you. Why not start with your pile of data, and
separate into 2 groups - those who canceled and those who didn't. Add
an item for percent change in sales for each, then do a 'slice and dice'
scene - do scatter plots, AoV's and the like, examining the two groups
for obvious differences. I would do this first. Then, if anything
comes up, I would go back and look very carefully at each item, perhaps
in a multiple regression type analysis. In the second examination,
check for proper orthogonality of data, as well.

- John Aitchison wrote:

Interesting problem. Presumably the 'pattern' that most predicts
cancelation is that of a declining level of expenditure over several
months, or of a constant low level of expenditure and coming up to
renewal time (if there are annual fees). So, I would construct
synthetic features "declining expenditure" and "consistent low level
expenditure" and use those in some standard classification
algorithm (eg CART, discriminant analysis). The "declining
expenditure" feature I would construct by fitting a regression model
to each individual's data, and using the estimated slope for the
individual as the relevant feature. (You could also consider using the
intercept and the variance.)
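
A minimal sketch of the per-client regression feature, assuming the spending is given oldest month first and the months are coded 0, 1, 2, ... The function name is hypothetical; the fit is an ordinary least-squares line written out by hand so that nothing beyond plain Python is needed.

```python
def fit_line(y):
    """Least-squares line through (0, y[0]), (1, y[1]), ...

    Returns (slope, intercept, residual_variance). Needs at least
    three points so that the residual variance has n - 2 > 0 degrees
    of freedom.
    """
    n = len(y)
    if n < 3:
        raise ValueError("need at least three monthly values")
    x = range(n)
    x_bar = (n - 1) / 2
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    variance = sum(r * r for r in residuals) / (n - 2)
    return slope, intercept, variance


# Hypothetical client whose spending drops by 100 every month.
slope, intercept, variance = fit_line([500, 400, 300, 200, 100])
print(slope, intercept, variance)
```

A strongly negative slope would feed the "declining expenditure" feature, while a small intercept with a slope near zero would point to "consistent low level expenditure".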

- Nikolai Kolev wrote:

I think that the following model is appropriate for modelling your data:


LATEX - version:

Let $Y_i$ be i.i.d. r.v.'s having Bernoulli distribution with a
parameter $p$. Then consider the following process
$$
X_n = \sum_{i=1}^{X_{n-1}} Y_i + Z_n, \quad n = 1,2,\ldots
$$
where $Z_i$ is another sequence of i.i.d. r.v.'s having some discrete
distribution (for example Poisson) and being independent of the sequence
$Y_i$.
The above model is known in the literature as a "binomial thinning" and
has a simple interpretation:

- the random sum $ \sum_{i=1}^{X_{n-1}} Y_i $ represents the
number of survivors in the period $(n-1,n]$;

- the sequence $Z_n$ gives the number of the new elements entering
during the same period $(n-1,n]$.



Al-Osh, M.A. and Alzaid, A.A. (1987). First-order integer valued
autoregressive (INAR(1)) process, Journal of Time Series Analysis, Vol.
8, No. 3, 261-275.
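
A minimal simulation sketch of the binomial thinning process, assuming Poisson arrivals with mean `lam` (the Poisson choice is only the example mentioned in the message). The function names are hypothetical, and the Poisson draw uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random


def poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    prod = rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(p, lam, x0, n_steps, rng):
    """Simulate X_n = (sum of X_{n-1} Bernoulli(p) draws) + Z_n.

    p   : probability that an existing client survives one period
    lam : mean number of new clients per period (Poisson)
    x0  : number of clients at the start
    """
    xs = [x0]
    for _ in range(n_steps):
        survivors = sum(1 for _ in range(xs[-1]) if rng.random() < p)
        arrivals = poisson(lam, rng)
        xs.append(survivors + arrivals)
    return xs


rng = random.Random(0)  # fixed seed so that the run is reproducible
path = simulate_inar1(p=0.9, lam=5.0, x0=50, n_steps=24, rng=rng)
print(path)
```

With these illustrative values the long-run mean of the process is lam / (1 - p) = 50 clients, so the simulated path should wander around that level.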

- Nikolai Kolev wrote:

> but I didn't understand your notes pretty much. Can you detail it a
> little bit more.

For this, a little more knowledge of probability and of the
interpretation of the simple discrete distributions is necessary. See below.

> I need to know if a given person will survive or not,

Very good. But I don't believe that there exists an expert who can answer
that question, correctly predicting even 50% of the cases. The
experts in banks basically follow their intuition, plus some crazy
deterministic rules (created 30-40 years ago), which are
satisfactory for the corresponding bank.

> and not just the number of people who survived. Ok ?

If you know (approximately, using some discrete probabilistic model) the
number of clients surviving after the n-th month, you could get an idea (from
your database) of which subset of clients to talk to.

> I think that the following model is appropriate for modelling your data:

I am reconfirming my opinion.

> ===================================================================
> LATEX - version:
> Let $Y_i$ be i.i.d. r.v.'s having Bernoulli distribution with a
> parameter $p$. Then consider the following process
> $$
> X_n = \sum_{i=1}^{X_{n-1}} Y_i + Z_n, \quad n = 1,2,\ldots
> $$
> where $Z_i$ is another sequence of i.i.d. r.v.'s having some discrete
> distribution (for example Poisson) and being independent of the sequence
> $Y_i$.

This is a discrete time series model. From your data, you could estimate
the parameters of the underlying Bernoulli and Poisson distributions.
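
One simple way to do this, sketched under the assumption of Poisson arrivals, is the method of moments: for the INAR(1) model the lag-1 autocorrelation equals p and the mean equals lambda / (1 - p). The function name and the toy series are hypothetical, and other estimators (for example conditional least squares) may be preferable in practice.

```python
def inar1_moment_estimates(xs):
    """Method-of-moments estimates (p_hat, lam_hat) for an INAR(1) series.

    p_hat   : lag-1 sample autocorrelation
    lam_hat : sample mean times (1 - p_hat)
    For short or noisy series p_hat can fall outside [0, 1].
    """
    n = len(xs)
    x_bar = sum(xs) / n
    dev = [x - x_bar for x in xs]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series is constant; autocorrelation undefined")
    p_hat = sum(a * b for a, b in zip(dev, dev[1:])) / denom
    lam_hat = x_bar * (1.0 - p_hat)
    return p_hat, lam_hat


# Hypothetical monthly counts of active clients.
p_hat, lam_hat = inar1_moment_estimates([1, 2, 3, 4, 5])
print(p_hat, lam_hat)
```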

> The above model is known in the literature as a "binomial thinning" and
> has a simple interpretation:
> - the random sum $ \sum_{i=1}^{X_{n-1}} Y_i $ represents the
> number of survivors in the period $(n-1,n]$;
> - the sequence $Z_n$ gives the number of the new elements entering
> during the same period $(n-1,n]$.

If you read the paper cited, and if you have some knowledge of discrete
distributions, you will be able to understand and get a feel for the suggested model.

- Dave Reilly wrote:

What you have is a time series problem where you wish to use historical
data in order to assess an unusual value for purposes of early warning
or intervention detection.

One of our clients is using/investigating this in a banking application
where the bank wishes to be pro-active and sense a change in status.

- James Watts wrote:

Cox regression in advanced models group. Will need time-to-the-event variable,
indicator variable of censored cases.
I used it to forecast differences in likelihood of closing banking accounts.
Look at the Mantel-Haenszel approach which I think is in Bishop and Fienberg's
Discrete Multivariate Analysis. In doing the modeling I ran some logistic
regressions to look for influential cases and re-ran the Cox regressions with
w/o these cases. I'm guessing that you might want to sample from the original
at least for the exploratory phase, or you might have a bit of a wait. We had
monthly data that covered a couple of years, so could make charts of the
doing a nose dive to the baseline. Made variables such as cumulative months of
account balance decreases and the relative amount of the decrease. Allison's
Event History Analysis from Sage was helpful even though I think it predated the Cox
regression methodology.

- I asked James Watts:

I already have the time-to-the-event (in months) variable and the indicator
variable of the censored cases.

Don't we have Cox regression in SPSS Professional 8.0 ?
For what variables did you run the logistic regression ? Just the person's
spending over the months ? Or did you use some more variables ? I have other ones
such as: sex, age, age of account, marital status, profession, earnings, etc.
What did you mean by:
- cumulative months of account balance decreases ? Is it the number of times
the client spent less in a month than in the previous month ?

- the relative amount of the decrease ? Is it the mean of all the percentages of
spending decrease between a month and the previous month ?
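
A minimal sketch of the two variables as read in the questions above. This is only one possible interpretation, not necessarily the definition used in the banking study; the function name is hypothetical and the spending is assumed to be non-negative and listed oldest month first.

```python
def decrease_features(spend):
    """Return (n_decreases, mean_relative_decrease) for one client.

    n_decreases            : number of months with less spending than the
                             month before
    mean_relative_decrease : average of (previous - current) / previous
                             over those months, 0.0 if there were none
    """
    n_decreases = 0
    relative = []
    for prev, cur in zip(spend, spend[1:]):
        if cur < prev:  # with non-negative spending this implies prev > 0
            n_decreases += 1
            relative.append((prev - cur) / prev)
    mean_relative = sum(relative) / len(relative) if relative else 0.0
    return n_decreases, mean_relative


# Hypothetical client: three months of decline out of four transitions.
n_dec, mean_rel = decrease_features([500, 400, 300, 300, 150])
print(n_dec, round(mean_rel, 4))
```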

I'm having a little difficulty finding books on this topic. I ordered
Applied Survival Analysis by Chap T. Lee from Books.com. Can you point me
to some other ones ?

- James Watts answered:

Hi, Cox regression is in the advanced models in v. 9 and I think it was in that
module in v.7. Don't know about v.8.

We ran all of the analyses on the event of a household closing a checking
We had data for two years of monthly data on deposits, ATM transactions, monthly
checking and savings balances as well as number of accounts in household, length
time household had been customer of any service. We generated lots of variables
just tested them in the Cox reg, log reg and crosstabs, using Mantel-Haenszel
testing for ordered pairs. If you get a copy of the SPSS manual that has the
reg explanation in it, you will see the types of graphs I am talking about. I
these graphs chart differences in the predicted risk function between the
and event cases split by categories of an independent variable, such as high
checking balance decline vs. low, or whatever. I think we used logistic
to test runs of the event cases vs. the censored cases, the same as binary log
That wasn't successful without some sort of matching -- I think we matched on
of time household had been a customer, but this wasn't a true case matching log reg,
rather an attempt at make-shift analyses to attempt to get the predicted
probabilities that log reg produces. Cox reg predicts the hazard or risk rate.
know that you can use OLS on the time duration variable if you select out the
censored cases. I think we did some of that also, but in the end we ran a Cox
on categories of variables such as the number of months of declines in balances,
that the next month's balances subtracted by the previous month and so on for
several months, categories of account balances and categories of number of
accounts. We used categories from continuous variables because we were looking
cut-off points that would be simple rules for the bank managers to implement.
think you just need to play with your data in these various ways in a trial and
error manner to see what works best. The best discussion of Cox reg I've seen
is on the SAS web site. Search on phreg or proportional hazard regression. Haven't
the book you describe. Good luck.

- Zachary S. Feinstein wrote:

Look into Survival Analysis and Hazard Modeling. I knew about this stuff many
years ago, but not much now. Let me know if this helps.

- Arthur Weiss wrote:

With what you have given me, I can't really help. You need to
understand why clients cancel. And I would need to know more about
your product and clients.

For example: You could have sales patterns like this:

Sales per month (- = no entry)
Client Number      1      2      3      4
Month 0          100      -   1000    100
Month -1         200      -   1000    200
Month -2         300      -   1000    300
Month -3         400    100    900    200
Month -4         500      -    800    120
Month -5         500      -    600    230
Month -6         500      -    500    340
Month -7         500      -    400    240
Month -8         500    100    300    130
Month -9         500      -    200    250
Month -10        500      -    100    360
Month -11        500      -    100    260
Month -12        500    100    100    160

Client 1: May drop as his usage has been decreasing steadily over a
few months.
Client 2: Occasional usage. May cancel as has no use for service.
Client 3: Usage has increased but stopped increasing. Maybe is dual
sourcing (i.e. buying from a competitor and you at the same time to
compare products.) Client may cancel if competitor gives a better
Client 4: Cyclical purchasing - but the cycles (and the moving average) are
decreasing over time. So similar to client 1.
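
A minimal sketch of the moving average mentioned for Client 4, using that client's column from the table (the last value in each row), listed from Month -12 to Month 0. The window of four months is an assumption, chosen because the pattern appears to repeat roughly every four months.

```python
def moving_average(values, window):
    """Simple moving average; one value per full window."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Client 4, oldest month (Month -12) first.
client4 = [160, 260, 360, 250, 130, 240, 340, 230, 120, 200, 300, 200, 100]
ma = moving_average(client4, 4)
print(ma)
```

Averaging over a full cycle removes the up-and-down swings, so what remains is the steady downward drift that makes this client resemble Client 1.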

There are many more patterns. You need to be in touch with your
clients to understand their purchasing behaviour. (There may be
different behaviours for different groups of client). Only then can
you start to model when clients will drop. Look at patterns in the
cancelled file - and try and find similar patterns in the active file.
This should help you.

And yes, you should be able to do this with SPSS - although I am not
an SPSS expert and so can't advise on how to do this.

The important thing is to stop cancellation - as you know. The cost of
finding a new customer is estimated at about 3 to 5 times that of
keeping an existing customer. Also, you may find it better and cheaper
to reactivate a cancelled customer than find a new one. (But find out
why they cancelled first.) This should be part of your marketing
process. Whenever a customer cancels, you should always follow up with
a polite request as to why. (e.g. Dear Customer. We understand that
you no longer wish to purchase from us. Although we are unhappy with
your decision, we fully respect your feelings. If this has been due to
poor service or quality on our part, please let us know so that we can
work on improving our products. Perhaps at a future date, you may
decide that we meet your needs again. In the meantime, please fill in
the attached questionnaire as to why you cancelled.

OK - this letter is far from perfect, but it may give you ideas!)

- Angelika Schaffrath Rosario wrote:

I think you could do a logistic regression: cancelling yes/no as the
dependent variable and the amounts spent in the previous months
as independent variables. On the other hand, you might have
problems with collinearity since the independent variables are then
probably highly correlated. Another idea would be to take the
differences between month(n) and month(n-1) and month(n-1) and
month(n-2) etc. as independent variables, or the regression
coefficient of the regression line over the past months. However, I
don't know how you account for the extra variability introduced if
you use an estimated regression coefficient as an independent variable. I am
actually not even sure if you have to account for this. I think it has
to do with two-stage models.

Anyhow, I think you must combine your two databases, the one of
the clients that cancelled and the other of the clients that didn't.

I would also like to put a question to you: I am a German
statistician, but my husband is from Brazil. I have read somewhere
that there is a Brazilian statistics list, but I don't know how to
subscribe to it. Do you happen to know this? If you do,
I would be very grateful if you could let me know about it.

- I answered Angelika:

I am trying to solve this problem through Survival Analysis, but I am just
starting on it. I will try Logistic Regression as well. I already combined my
two databases with the ones who have already cancelled and the ones who haven't
yet (censored). Let's see if I can find out something.

So your husband is Brazilian ? I am also looking for a statistics list here in
Brazil, but haven't found one yet. If I find one I will mail it to you; if you find one
first, please do the same.

- Paul Allison wrote:

I would use survival analysis with time varying covariates. With SPSS you
could either do a Cox regression or discrete-time logistic regression. See
my book "Event History Analysis" for an explanation of how to do this.
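
A minimal sketch of the data layout that a discrete-time logistic regression needs: one row per client per month at risk, with the previous month's spending as the time-varying covariate. The function and field names are hypothetical, and this shows only the restructuring step, not the model fit or the procedure described in the book.

```python
def person_period_rows(client_id, spend, cancelled):
    """Expand one client into one row per month at risk.

    spend     : monthly spending, oldest first
    cancelled : True if the account was cancelled in the last observed
                month, False if the client is still active (censored)

    Each row carries the previous month's spending as covariate and an
    event flag that is 1 only in the month of cancellation.
    """
    rows = []
    last = len(spend) - 1
    for t in range(1, len(spend)):
        rows.append({
            "client": client_id,
            "month": t,
            "prev_spend": spend[t - 1],
            "event": int(cancelled and t == last),
        })
    return rows


# Hypothetical clients: one who cancelled, one still active.
table = (person_period_rows(1, [300, 200, 100, 50], True)
         + person_period_rows(2, [300, 310, 290], False))
for row in table:
    print(row)
```

The stacked rows can then be passed to any binary logistic regression routine with `event` as the dependent variable; censored clients simply contribute rows whose event flag is always 0.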

- Dr. Allan White wrote:

I suggest that you use discriminant analysis.

