JiscMail — Email discussion lists for the UK Education and Research communities

ALLSTAT Archives — allstat@JISCMAIL.AC.UK

Subject: Summary of responses: moving the origin (was "Where's time zero?")
From: "R. Allan Reese" <[log in to unmask]>
Reply-To: R. Allan Reese
Date: Mon, 22 Sep 2003 16:45:24 +0100
Content-Type: Text/Plain (366 lines)

Moving or choosing the origin - adjusting the intercept in models

On allstat, I wrote that a regression predictor x could be expressed as
x-z, thus moving the origin, and asked for advice or references as to
when and how z could be legitimately and usefully chosen as a part of
the analysis and interpretation.  The equations f(x) and f(x-z) would be
equivalent in goodness of fit, and if multiplied out would give the same
coefficients for terms in x.  However, a model expressed in x-z might
have a direct relevance to the domain.  The example that had
precipitated this was that of regressing UK government health spending
on time; it nicely fitted a straight line with the origin in 1947.
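The equivalence is easy to check numerically. A minimal sketch of mine (not part of the original analysis; the spending series and noise level are invented) shows that regressing on x and on x-z gives the same slope and the same fit, with only the intercept moving:

```python
import numpy as np

# Hypothetical spending series, roughly linear with origin at 1947
rng = np.random.default_rng(0)
year = np.arange(1960.0, 2003.0)
spend = 2.5 * (year - 1947.0) + rng.normal(0, 5, year.size)

b1, c1 = np.polyfit(year, spend, 1)           # regress on x
b2, c2 = np.polyfit(year - 1947.0, spend, 1)  # regress on x - z, z = 1947

print(b1, b2)  # identical slopes
print(c1, c2)  # intercepts differ by exactly b * z
```

Multiplying out f(x-z) = b(x-z) + c2 shows c1 = c2 - b*z, which is what the printed intercepts confirm.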

I received several helpful and stimulating replies from allstat, which
are appended.  However, I eventually realized that what I am referring
to is simply an equivalent *re-parameterization* of a model, and
related to (but different from) the OFFSET concept in Glim (where the
offset is added to the linear predictor).  Since the extra parameter z
does not change the *algebraic* fit, justifying or optimizing its value
must be based on another logic.

A web search on intercept correction located another interpretation in
econometrics. Of 146 hits, at least 141 mentioned the work of Michael
Clements & David Hendry, and it was their books that I had
serendipitously found in our library.  Forecasting Economic Time Series
(CUP 1998) p180 states that "published [economic] forecasts reflect in
varying degree the properties of the models and the skills of the
models' proprietors. ... adjustments (often extensive) are often made to
the model-based predictions in arriving at a final forecast, typically
to the constant terms or intercepts in the models' equations."  Rothman
(J Econ Lit), reviewing their follow-up book Forecasting Non-stationary
Economic Time Series (MIT 1999), comments, "[as a student] I initially
thought that this corrective procedure was an ad hoc device used to
cover up and mask poor model performance" but goes on to commend the
book.

My understanding is that Clements & Hendry are describing attempts to
incorporate into a mathematical model the qualitative information or
opinion that reflects the wider economic and social effects.  One
unfortunate example they give, however, is to correct a moving average
process by treating it as autoregressive: instead of forecasting
y(t+1) = mu the mean, forecast y(t+1) = y(t).  That is Deming's classic
case of management by interference.  It also worries me that C&H use the
term non-stationary to describe structural breaks in the generating
system; I think it is misleading to talk of a model with a structural
break when what may be implied is a totally new situation.  Forecasting
generally fails for just that reason, which is why a model based on
data pre-Black Wednesday / 9/11 / Hurricane Isabel may need more than a
tweak to its parameters to continue to be useful.
broken-stick models from models in which the parameters are functions
of time.

The processes described above are what I would term *heuristic*, and
justify the health-spending example at least to the extent that it is
common practice to make adjustments so that the model more faithfully
reflects external information.  The Glim OFFSET directive reflects a
stronger imperative to make a model conform to fixed data points, or
*prior knowledge*.  One case of this is to fit a model with / without an
intercept, the latter imposing the strong condition that y=0 when x=0.
Those two models do not, however, generally have the same goodness of
fit. Nelder's concept (see emails) of Functional Marginality must be a
consideration.  Arbitrarily insisting on a line through the origin when
x=0 is well outside the data range may not be sensible.
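How strong the through-the-origin restriction is can be seen in a short sketch of mine (made-up data whose true intercept is far from zero): the unrestricted fit is essentially exact, while forcing y=0 at x=0 leaves a large residual sum of squares.

```python
import numpy as np

# Hypothetical data whose true intercept (10) is far from zero
x = np.arange(1.0, 11.0)
y = 3.0 * x + 10.0

# Ordinary least squares with an intercept term
X1 = np.column_stack([np.ones_like(x), x])
coef, rss1, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Regression through the origin: y = b*x, closed-form least squares
b0 = (x @ y) / (x @ x)
rss0 = np.sum((y - b0 * x) ** 2)

print(float(rss1[0]))  # essentially zero: the unrestricted line fits exactly
print(rss0)            # a large residual sum of squares under the restriction
```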

Other reasons for moving the origin are computational.  Computational
imperatives should be incorporated into software, but the use may still
depend upon the understanding of the user. Humphrey, Newson and Swank
below all suggest centring near the middle of the predictor's range, or
near the point of inflexion, when fitting a quadratic. Their reason is
that otherwise x and x^2 will be highly collinear.  Hopkins
(www.sportsci.org/resource/stats/polynomial.html) adds the heuristic, "I
find it easier to interpret [the quadratic coefficient] if I transform
the X values so they range from -1 to +1," but this is exceptional.
Websites promoting software do not recognise the problem. One striking
example shows a regression based on salary, salary^2 and salary^3 (The
Data Mining Group, www.dmg.org/v1-1/polynomialregression.htm).  SPSS,
using both REGRESSION and CURVEFIT, made no adjustment, and hence the
quadratic fit was not found; there was one figure labelled
"collinearity" but no assistance in interpreting it. Genstat output
included a clear statement that year and year^2 were collinear, so
implying a poor fit.
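The numerical side of the problem shows up in the condition number of the design matrix. A sketch of mine (not from the SPSS or Genstat output discussed above), comparing raw powers of year with powers of the centred year:

```python
import numpy as np

year = np.arange(1960.0, 2003.0)

# Design matrices for the quadratic: raw powers vs centred powers
X_raw = np.column_stack([np.ones_like(year), year, year ** 2])
X_ctr = np.column_stack([np.ones_like(year),
                         year - 1981.0, (year - 1981.0) ** 2])

print(np.linalg.cond(X_raw))  # enormous: year and year^2 nearly collinear
print(np.linalg.cond(X_ctr))  # modest: centring restores conditioning
```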

I would expect software that fitted polynomial terms to do so using
orthogonal polynomials, but this is not promoted, and apparently SPSS
CURVEFIT does not. One of the oldest textbooks on my shelf, the
excellent "Statistics Manual" by Edwin Crow et al (originally a US
Naval Ordnance document but reprinted 1960 by Dover) states, "We note
that x, x**2 and x**3 are certainly not statistically independent [as
predictors], but independence is not necessary in multiple regression
analysis."
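For what it is worth, numpy's polynomial fitter does build in a rescaling of this kind: `Polynomial.fit` maps the predictor into the window [-1, 1] before fitting, much as Hopkins suggests. A sketch of mine (the series is invented):

```python
import numpy as np
from numpy.polynomial import Polynomial

year = np.arange(1960.0, 2003.0)
# Hypothetical quadratic series centred on 1980
y = 0.01 * (year - 1980.0) ** 2 + 2.0 * (year - 1980.0) + 5.0

# Polynomial.fit rescales `year` into [-1, 1] internally,
# so the fitting problem stays well conditioned.
p = Polynomial.fit(year, y, deg=2)
print(p.domain)          # the raw range [1960., 2002.]
print(p(1980.0))         # evaluated on the original scale
print(p.convert().coef)  # coefficients mapped back to the raw year scale
```

The user sees coefficients on whichever scale is asked for, while the ill-conditioned raw-power fit is never formed.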

In view of the prevalence of statistical models using polynomial fits, it
worries me that I have been oblivious to such problems for so long and
that the problem may generally not be addressed in courses.  Note that
there are both statistical and computational issues - most users assume
that the computer will deliver the correct answer.

Further comments or references from allstat members are invited,
especially from anyone who can offer a *short* defence of econometrics!

Allan

----- Initial query
Date: Mon, 8 Sep 2003 16:43:44 +0100 (BST)
From: R. Allan Reese <[log in to unmask]>
Subject: Question: where's time zero?

I am working on time series and removing obvious trends.  Something
which has me intrigued is that by choosing different times as the origin
I can fit equivalent but different models, but I do not recall seeing
this discussed in my training or in the literature.

For example, my data run from 1960 to 2002.  A linear trend will have
the same slope regardless of whether it is regressed on year or
(year-1900) or (year-1960), but will have different intercepts.  As
these are economic data, it was of interest to note that "solving" the
regression suggested an intercept of zero in the year before the
programme began, so there was a logic in choosing that as the origin
year.

Some of the series, however, require a curved fit, and a quadratic was
used as a first approximation.  Fitting year^2 may be equivalent to
fitting (year-z)^2, but the first caused a numerical failure in the
algorithm (tolerance exceeded).  I had a pragmatic reason for choosing a
value for z in the centre of the distribution.  Different z's give the
same overall fit (of course) but strongly influence the coefficients on
lower powers.

It seems to me therefore that the choice of z ought to be a
consideration in the analysis, maybe using a pragmatic or theory-based
value.  If z is considered another parameter, which criterion should be
"optimized" given that all models fit equally?  How would you define the
"simplest" model?  There may be a connection with fitting orthogonal
polynomials, so adding the kth order does not change the coefficients on
k-1 etc, but this seems to me an extra topic.

Comments or references to existing literature, sent to me, would be
welcomed.

R. Allan Reese                       Email:     [log in to unmask]
Associate Manager GRI                Direct voice:   +44 1482 466845
Graduate School                      Voice messages: +44 1482 466844
Hull University, Hull HU6 7RX, UK.   Fax:            +44 1482 466436

---------------------------------------------------------------------
Date: Tue, 9 Sep 2003 10:47:33 +0100
From: [log in to unmask]
To: R. Allan Reese <[log in to unmask]>
Subject: Re: Query: where's time zero?

I don't think choices of z should have influenced the lower order terms,
as long as the numerical accuracy of all the coefficients is adequate.
Note in (year-z)^2 you have a term -2*z*year which should be counted in
the linear term. The total contribution of the linear term should be the
same.

Jason
GlaxoSmithKline

---------------------------------------------------------------------
Date: Tue, 9 Sep 2003 12:23:20 +0100 (BST)
To: [log in to unmask]

The point you make is exactly what I mean by being equivalent models,
but it makes a big difference to the interpretation to say that

    spending = b1 * year + c
or
    spending = b2 * (year - z)

where z is now a meaningful figure and the "constant" is not
significant.

In the quadratic term, I can see why there are numerical instabilities
in trying to fit 1960^2 to 1990^2, rather than fixing the origin at 1979
to make the range -20 to +10.

I'll post a summary of replies.  The question overlaps with
"identifiability", except that that term is usually applied to a
constraint that makes a system mathematically soluble.

Allan

---------------------------------------------------------------------
Date: Tue, 09 Sep 2003 10:54:04 +0100
From: Roger W Humphry <[log in to unmask]>

Can't really help except to say I was recommended to use the midpoint
for z and, as you say, it made little difference to the estimates for
the turning point etc.

yours,
Roger

---------------------------------------------------------------------
Date: Tue, 9 Sep 2003 13:37:27 +0100 (BST)
To: Roger W Humphry <[log in to unmask]>

As usual, I'm relieved not to be immediately deluged with "Didn't you
know that!!!!"  No solid leads yet, but I will summarize to the list.

Allan

---------------------------------------------------------------------
Date: Tue, 09 Sep 2003 13:44:32 +0100
From: Roger Newson <[log in to unmask]>

Most people would centre the time axis, choosing a zero time and
substituting t-t_0 for t in the model. In this case, the intercept
parameter is the value of the quadratic at t_0, the linear parameter is
the rate of change at t_0, and the quadratic parameter is the constant
acceleration rate (the second derivative), which is the same at all
times.

The time t_0 is usually chosen to be central, or at least inside the
range of the data.
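Newson's reading of the centred parameters can be checked numerically. A sketch of mine (the quadratic series is invented; note that the second derivative is twice the quadratic coefficient):

```python
import numpy as np

t = np.arange(1990.0, 2003.0)
y = 0.5 * t ** 2 - 1980.0 * t + 7.0   # hypothetical quadratic trend
t0 = 1996.0

c, b, a = np.polyfit(t - t0, y, 2)    # y = c*(t-t0)^2 + b*(t-t0) + a

print(a)       # intercept = the value of the curve at t0
print(b)       # linear coefficient = the rate of change at t0
print(2 * c)   # the constant second derivative (twice the quadratic term)
```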

An alternative approach might be to treat the quadratic as a special
case of a quadratic spline, and to parameterise it by the values of the
spline at the beginning, middle and end of the range. The method for
doing this (in the Stata statistical package) is in my paper (Newson,
2003). I have used splines extensively in my work on time series of
asthma-related hospital admissions.

Newson R. B-splines and splines parameterized by their values at
reference points. Downloadable as from 10 June 2003 from my website at
http://www.kcl-phs.org.uk/rogernewson/

Roger Newson
Lecturer in Medical Statistics
King's College London, London SE1 3QD

---------------------------------------------------------------------
Date: Tue, 9 Sep 2003 08:47:57 -0500
From: Paul R Swank <[log in to unmask]>

With a quadratic model with positive values for the predictor X, X and
X^2 are typically highly correlated (for example, if x = 1 to 10, the
correlation of X and X^2 is about .97). Thus, by centring X (using z,
for instance), you reduce the collinearity in the model and allow a
solution.
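Swank's figure is easy to reproduce; a sketch of mine (not from his email) also shows that centring drives the correlation to essentially zero, because the centred values are symmetric about zero:

```python
import numpy as np

x = np.arange(1.0, 11.0)             # x = 1 to 10
r_raw = np.corrcoef(x, x ** 2)[0, 1]

xc = x - x.mean()                    # centred predictor
r_ctr = np.corrcoef(xc, xc ** 2)[0, 1]

print(round(r_raw, 3))   # about 0.975
print(round(r_ctr, 3))   # essentially zero
```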

Paul R. Swank, Ph.D.
Professor, Developmental Pediatrics
Medical School, UT Health Science Center at Houston

---------------------------------------------------------------------
Date: Tue, 9 Sep 2003 17:48:40 +0100 (BST)
To: Paul R Swank <[log in to unmask]>

Thanks for that advice, which will go in the summary.  Does that suggest
one might choose z so that the correlation of t and (t-z)^2 is
minimized?

Allan

---------------------------------------------------------------------
Date: Tue, 09 Sep 2003 22:26:04 -0500
From: Jay Warner <[log in to unmask]>

I believe that, mathematically speaking, your choice of z is not an
issue.  Yes, of course the coefficients will change, dramatically.  And
your software may not like to handle independent variables with values
near 2000 (the probable source of your error msg).

However, since the regression _assumes_ the x values were measured
exactly, it doesn't matter whether you use z = 0, z = 1900, or z = 1981.
It does matter if your software rounds those itty bitty digits near the
end of the string :)  It does matter if your software does not
internally adjust the x' = 0 to the centre of your independent variable.
Which it should, if it is self-respecting software.  Reason: only if the
x's are centred will the correlation between the x^2 coefficient and the
x coefficient be (near) zero.

Your equation may be able to explain your data, but unless you do the
regression properly, you may not be able to predict anything in the
future with it.

Cheers, and hope you find a real expert at this,
Jay

Jay Warner
Principal Scientist
Warner Consulting, Inc. Racine, WI 53404-1216, USA

---------------------------------------------------------------------
Date: Wed, 10 Sep 2003 16:35:35 +0100 (BST)
To: Jay Warner <[log in to unmask]>

Thanks, but the choice of z precisely *is* the issue.  The numerical
stability is a minor issue, but introducing z as an additional parameter
raises the problem of identification.  But I'm getting other good
thoughts from others.

Allan

----------------------------------------------------------------------
Date: Wed, 10 Sep 2003 14:41:42 +0100
From: "Nelder, John A" <[log in to unmask]>

The idea you need is that of functional marginality.  I have a paper in
J.Appl.Stats. but cannot give you exact reference because I am at home
prior to going to S.Africa.  Let me know then if you can't find it.

John Nelder.

---------------------------------------------------------------------
Date: Wed, 10 Sep 2003 16:55:45 +0100
To: "Nelder, John A" <[log in to unmask]>

Thanks.  I had no problem tracing the reference and we have an online
subscription so it has printed at my desk.

Nelder JA. Functional marginality and response-surface fitting. Journal
of Applied Statistics (2000) Vol 27 No 1 pp 109-112.

--- Functional marginality is important. Letter + discussion. Applied
Statistics (1997) Vol 46 No 2 pp 281-286.

Functional marginality is a consideration, but, as you state, y = a +
b(x-z) + c(x-z)^2 gives the same goodness of fit for all z.  It is a
question of identifiability to fix z.  FM applies if the criterion for
choosing z is that lower order terms can be dropped, given that z=0 has
no a priori logic. That is essentially what I did, without putting
the name FM to it.  The example was UK health service spending. It was
striking that spending extrapolated as zero in 1947, but I suspect it
was a fluke.

Have a good trip,
Allan
--------------------------------------end of summary


--------------------------------------------------------------
R Allan Reese                      Email: [log in to unmask]
Graduate School
University of Hull
Tel +44 1482 466845                       Fax: +44 1482 466436
