Hello everyone,
Thank you to those who replied to my query last week regarding collapsing
the levels of categorical variables in a modelling situation. I will
compile the replies and post them to the list after I pose a further
query, which revisits a similar example.
I will use a hypothetical example. Suppose our outcome is
'attractiveness', measured on a continuous scale for argument's sake
(a higher score being more attractive), so that we are conducting a
multiple regression.
Say our model has 2 predictors: 'colour' with categories blue, red and
green (i.e. k=3) and 'material' with categories leather, cotton, nylon
and silk (i.e. k=4). Each of these categorical variables has been
entered into the model as a 'set' of k-1 binary dummy variables, where
the reference category is coded as '0' on each of the k-1 variables.
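
To make the coding concrete, here is a minimal sketch in plain Python of the k-1 dummy scheme described above. The helper name `dummy_code` is hypothetical (it is not taken from any package), and the sketch assumes the usual reference-cell coding in which the reference level maps to all zeros.

```python
# Hypothetical helper illustrating reference-cell (treatment) dummy coding.
def dummy_code(value, levels, reference):
    """Return the k-1 indicator variables for `value`.

    The reference level gets 0 on every indicator; any other level gets
    1 on its own indicator and 0 on the rest.
    """
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return {lvl: int(value == lvl) for lvl in levels if lvl != reference}


colour_levels = ["blue", "red", "green"]                  # k = 3 -> 2 dummies
material_levels = ["leather", "cotton", "nylon", "silk"]  # k = 4 -> 3 dummies

print(dummy_code("blue", colour_levels, reference="blue"))
print(dummy_code("green", colour_levels, reference="blue"))
print(dummy_code("silk", material_levels, reference="leather"))
```

Every coefficient in the model is then a difference from the all-zeros row, which is why each one is read relative to the reference category.
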
Choosing 'leather' and 'blue' to be the reference categories, we may
get:
Predictor     Coef         P
Material:
  cotton      0.0984709    0.796
  nylon       0.0801628    0.858
  silk        0.356450     0.001
Colour:
  red         0.898338     0.149
  green       0.962118     0.000

Tests for terms with more than 1 degree of freedom
Term          Chi-Square   DF   P
material      3.8453       3    0.045
colour        15.0917      2    0.001
Each p value for 'red' and 'green' above tells us the significance of
that category relative to 'blue'; each p value for cotton, nylon and
silk tells us the significance of that category relative to 'leather'.
The tests at the bottom of the output provide an 'omnibus' test of the
null hypothesis that, for example (for the 'colour' variable), the
coefficients for red and green are both equal to zero.
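
Written out, the omnibus hypothesis for 'colour' looks like the following. The notation is introduced here for illustration only, and the second line assumes the chi-square is a Wald-type statistic, which the output itself does not state.

```latex
\[
H_0:\ \beta_{\text{red}} = \beta_{\text{green}} = 0
\qquad\text{vs.}\qquad
H_1:\ \beta_{\text{red}} \neq 0 \ \text{or}\ \beta_{\text{green}} \neq 0
\]
\[
W = \hat{\boldsymbol\beta}_c^{\top}\,
    \widehat{\operatorname{Var}}\!\left(\hat{\boldsymbol\beta}_c\right)^{-1}
    \hat{\boldsymbol\beta}_c
\ \overset{\text{approx.}}{\sim}\ \chi^2_{2}\ \text{under } H_0,
\qquad
\hat{\boldsymbol\beta}_c = \left(\hat\beta_{\text{red}},\ \hat\beta_{\text{green}}\right)^{\top}
\]
```

The degrees of freedom equal the number of dummies in the set (k-1), so the same form with three coefficients applies to 'material'.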
Now, when I comment on the contribution of the design elements to the
response 'attractiveness', we can see that cotton has a p value of
0.796, nylon 0.858 and 'red' 0.149. My question is: should I comment on
the contribution of cotton, nylon and the colour 'red' to
'attractiveness' even though their p values are more than 0.05, i.e.
retain all levels 'as a set' regardless of their individual
significance? In other words, would it be OK to use the coefficient
values to comment that, compared to leather (the reference category for
material), silk is indicated as being more attractive (as are cotton
and nylon, but to a lesser extent) with the level of colour held
constant, and that, compared to blue (the reference category for
colour), red and green are indicated as being more attractive with the
level of material held constant?
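
As a sketch of what reading the coefficients 'with the other variable held constant' amounts to, the arithmetic can be laid out as below. The coefficients are copied from the output above. The function names are hypothetical, the model is assumed to contain main effects only (no material-by-colour interaction), and because the intercept is not shown, only differences from the leather + blue reference cell are computed.

```python
# Coefficients from the output above; reference categories have coefficient 0.
material_coef = {"leather": 0.0, "cotton": 0.0984709,
                 "nylon": 0.0801628, "silk": 0.356450}
colour_coef = {"blue": 0.0, "red": 0.898338, "green": 0.962118}


def difference_from_reference(material, colour):
    """Fitted difference in attractiveness versus leather + blue.

    Assumes a main-effects-only model, so the two effects simply add.
    """
    return material_coef[material] + colour_coef[colour]


def material_contrast(a, b):
    """Fitted difference between two materials at any fixed colour."""
    return material_coef[a] - material_coef[b]


print(round(difference_from_reference("silk", "green"), 6))
print(round(material_contrast("silk", "cotton"), 7))
```

Note that a contrast between two non-reference levels, such as silk versus cotton, is not tested by any of the p values listed; each listed p value refers only to a comparison with the reference category.
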
I have seen one example where the authors do use the coefficients of
binary dummy variables such as these to form conclusions even though
some of them are non-significant, but I thought I'd ask for a second
opinion.
Many thanks,
All the Best,
Kim