A while ago, I asked for references on regression without intercept /
regression through the origin.
Many thanks for the numerous responses I received!
Below, you will find some notes on the properties of a regression without
intercept, followed by a list of references.
---------------------------------------------------------------------------
Fixing the intercept at the origin can strongly influence the estimated
slope of the regression line. One should therefore evaluate carefully
whether the assumption of a zero intercept is really justified. If one
expects a zero intercept, but the empirical intercept is different from
zero, this might be caused by one of the following:
1. The true intercept is zero, but the relationship is non-linear and a
linear approximation leads to an empirical intercept different from zero.
2. The model is wrong: the true intercept is not zero (see the following
quote on the assumption that the error term has a zero population mean). An
example from epidemiology (which was the motivation for my query): given
some category-specific odds ratios, I want to fit a linear regression line
to the log odds ratios. Of course the odds ratio for the reference category
is one (and the log odds ratio zero), but should I force the trend line to
pass through the origin? In this case, an intercept of the regression line
that is different from zero might indicate that uncontrolled confounding is
present, i.e. the reference category differs from the exposed categories
not only in exposure, but also in some other factor.
3. Random variation, not to be forgotten.
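To separate a genuinely nonzero intercept from random variation (point 3),
one option is to fit the model WITH an intercept first and inspect the
intercept's t-statistic. A minimal pure-Python sketch, using made-up
illustrative data whose true intercept is zero:

```python
import math

# Illustrative (made-up) data: true model y = 1.5*x, i.e. zero intercept,
# plus small alternating errors.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.6, 2.9, 4.6, 5.9, 7.6, 8.9]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Ordinary least squares WITH an intercept.
sxx = sum((x - mx) ** 2 for x in xs)
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
b0 = my - b1 * mx

# Residual variance estimate and the standard error of the intercept.
sse = sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))
s2 = sse / (n - 2)
se_b0 = math.sqrt(s2 * (1.0 / n + mx * mx / sxx))
t_b0 = b0 / se_b0

print(b0, se_b0, t_b0)
# Compare |t_b0| with a t quantile on n-2 degrees of freedom (roughly 2
# as a quick check); a small |t_b0| is consistent with a zero intercept.
```

Here the estimated intercept is small relative to its standard error, so the
data give no reason to reject the zero-intercept assumption.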
In a regression without intercept (also called "the homogeneous case" as
opposed to the "inhomogeneous case", when the design matrix contains a
vector of 1's), the usual variance decomposition no longer holds. Therefore,
the usual R^2 loses its meaning (see Greene (1997) and the comment by
Robert Jung). A related point is that the residuals no longer sum to
zero.
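Both effects can be seen in a minimal pure-Python sketch with made-up data
whose true intercept is far from zero (illustrative values only):

```python
# Made-up data: the true relationship is y = 10 - x, so the true
# intercept is 10, far from zero.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [9.0, 8.0, 7.0, 6.0, 5.0]

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Fit with intercept.
sxx = sum((x - mx) ** 2 for x in xs)
b1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
b0 = my - b1 * mx
res_with = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Fit through the origin: b = sum(x*y) / sum(x^2).
b = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
res_origin = [y - b * x for x, y in zip(xs, ys)]

# With an intercept the residuals sum to zero; through the origin they
# generally do not.
print(sum(res_with))    # 0 (up to rounding)
print(sum(res_origin))  # about 9.09

# The usual R^2 = 1 - SSE/SST (SST taken about the mean) can even become
# negative when the line is forced through the origin.
sst = sum((y - my) ** 2 for y in ys)
r2_origin = 1.0 - sum(e * e for e in res_origin) / sst
print(r2_origin)        # about -8.09
```

With a badly misspecified zero intercept, the conventional R^2 is not just
hard to interpret but can fall outside [0, 1] entirely.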
The variance of the estimated slope is usually smaller in a regression
without intercept, so the analysis becomes anti-conservative when the
assumption of a zero intercept is not justified.
Often, the y-variable can only have positive values. In this case, the
assumption of a regression without intercept is inevitably linked to the
assumption that the variance of the y_i is not constant, but proportional to
x or x^2.
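Under those variance assumptions, the weighted least-squares slope through
the origin has a well-known closed form: with Var(y_i) proportional to x_i,
the estimator reduces to the ratio of means sum(y)/sum(x); with Var(y_i)
proportional to x_i^2, it reduces to the mean of the per-point ratios
y_i/x_i. A small pure-Python check on made-up data:

```python
# Made-up positive data for illustration.
xs = [1.0, 2.0, 4.0, 5.0, 8.0]
ys = [1.2, 1.9, 4.3, 4.8, 8.1]

def wls_origin_slope(xs, ys, weights):
    # Minimize sum(w_i * (y_i - b*x_i)^2): b = sum(w*x*y) / sum(w*x^2).
    num = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    den = sum(w * x * x for w, x in zip(weights, xs))
    return num / den

# Weights 1/x_i correspond to Var(y_i) proportional to x_i;
# weights 1/x_i^2 correspond to Var(y_i) proportional to x_i^2.
b_var_x = wls_origin_slope(xs, ys, [1.0 / x for x in xs])
b_var_x2 = wls_origin_slope(xs, ys, [1.0 / (x * x) for x in xs])

# The closed forms they reduce to:
ratio_of_means = sum(ys) / sum(xs)
mean_of_ratios = sum(y / x for x, y in zip(xs, ys)) / len(xs)

print(b_var_x, ratio_of_means)    # equal
print(b_var_x2, mean_of_ratios)   # equal
```

So the choice of variance assumption decides between two quite different
slope estimators, which is why it deserves explicit attention.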
---------------------------------------------------------------------------
The classical assumptions about OLS modeling include the following:
Assumption II: the error term has a zero population mean.
"This requires that the constant term absorb any nonzero mean that the
observations of the error might have in a given sample. Thus, suppressing
the constant term can lead to a violation of this classical assumption. The
only time that this assumption would not be violated by leaving out the
intercept term is when the mean effect of the unobserved error term (without
a constant term) is zero over all the observations. The consequence of
suppressing the constant term is that the slope coefficient estimates are
potentially biased and their t-scores are inflated."
Using Econometrics: A Practical Guide, Studenmund A. H., published by
Addison-Wesley, 1996, p. 214.
As an aside, if an intercept term is superfluous, the model should in
theory produce an estimated intercept very close to zero when one is
included (whether calculated by hand or by a software package).
The usual debate of underlying theory versus limitations of the data versus
"over-fitting" the data may be relevant here.
(from David Rutherford)
-------------------------------------------------------------------------
(Translated from German.)
If you run a homogeneous regression, you force the estimated regression
line through the origin. If your underlying model prescribes this, that is
perfectly fine. If not, some careful thought is required.
The estimated regression coefficients keep their interpretation regardless
of whether an intercept is estimated or not. No special interpretation is
therefore necessary - which may also explain the apparent lack of
literature on the subject.
You should note, however, that in homogeneous regressions the variance
decomposition is no longer valid, so the coefficient of determination R^2
need no longer lie between 0 and 1. That is, the interpretation of R^2 as
the share of the variation of the dependent variable explained by the
model is no longer valid (see the discussion in Greene (1997), Econometric
Analysis, p. 255 ff.).
(from Robert Jung)
---------------------------------------------------------------------------
As far as I understand it, you can take out the constant if this is
deemed conceptually correct. E.g., if your dependent variable is sales
and your independent variables are advertising and staff: if advertising
is 0, this does not mean that sales will be 0, because staff could be
another factor driving sales.
Furthermore, if your regression slope is not overly steep, the constant
may strongly influence the model. The best thing to do is to experiment
with both specifications and compare the outcomes.
(from Susanne Goller)
---------------------------------------------------------------------------
NOW A SUMMARY OF THE REFERENCES:
---------------------------------------------------------------------------
Bliss (1967) Statistics in Biology, Vol 1, pp 444-451. McGraw-Hill.
Snedecor & Cochran (1967) Statistical Methods, 6th edition, pp 166-170.
Iowa State Univ. Press.
Snedecor & Cochran (1980) Statistical Methods, 7th edition, pp 172-174.
Iowa State Univ. Press.
Gordon (1981) Errors in computer packages. Least squares regression
through the origin. Statistician 30, 23-29.
Later discussion by: Beale, Lane & Nelder; Bissell; Goldsmith (1980)
Statistician 30, 231-234.
Turner (1960). Straight line regression through the origin. Biometrics
16, 483-485.
(from Peter Lane)
---------------------------------------------------------------------------
For the properties, see Neter, Wasserman, and Kutner (1996, newest edition)
"Applied Linear Statistical Models". For a general discussion of
nonhierarchical models (excluding intercept, excluding main effects when
interaction term is included), see McCullagh and Nelder (1983) "Generalized
Linear Models" (Chapman & Hall).
(from Laura Thompson)
---------------------------------------------------------------------------
Check the Walpole and Myers introduction to statistics for engineers.
(from Isaac Dialsingh)
---------------------------------------------------------------------------
A. Sen, M. Srivastava
Regression Analysis - Theory, Methods and Applications
Springer Verlag Ed., 1990
(from Alessio Pollice)
---------------------------------------------------------------------------
It has something to do with marginality of terms, i.e. testing for the
significance of an estimated intercept is usually meaningless (see Nelder
in American Statistician 1990).
(from Freek Huele)
---------------------------------------------------------------------------
Greene (1997): Econometric Analysis, p. 255
(from Robert Jung)
---------------------------------------------------------------------------
Bissell A.F. Lines through the origin - is NO INT the answer? Journal of
Applied Statistics, 1992, Vol. 19, No. 2, 193-210
(from André Charlett)
---------------------------------------------------------------------------
Hair, Anderson, Tatham & Black "Multivariate Data
Analysis" 5th Edition, 1998 (or any later edition).
(from Susanne Goller)
******************************************************
Angelika Schaffrath Rosario
Diplom-Statistikerin
GSF - Forschungszentrum für Umwelt und Gesundheit GmbH
Institut für Epidemiologie
Postfach 1129
D-85758 Neuherberg
Telefon (089) 3187-4577
Telefax (089) 3187-3222
e-mail [log in to unmask]
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%