No, the correct form is
f_(x) = p_1 * f_1(x) + p_2 * f_2(x) + (1-p_1-p_2) * f_3(x)
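
As a rough illustration only: if p_1 and p_2 here are read as the overall proportions of companies whose revenue comes from source 1 and source 2 (an assumption, not stated above), the three weights sum to one by construction. The Python sketch below converts the conditional probabilities of the original question into such weights, assuming every company missing from sources 1 and 2 receives the source 3 proxy. The names q1, q2, mixture_weights, mixture_pdf and lognormal_pdf, the numeric values, and the lognormal components are all hypothetical.

```python
import math


def mixture_weights(q1, q2):
    """Turn conditional non-missing probabilities into mixture weights.

    q1: P(revenue found in source 1)
    q2: P(revenue found in source 2 | missing from source 1)
    Assumption: every remaining company gets the source 3 proxy.
    """
    w1 = q1
    w2 = (1.0 - q1) * q2
    w3 = 1.0 - w1 - w2  # whatever is left goes to source 3
    return w1, w2, w3


def mixture_pdf(x, weights, pdfs):
    """Evaluate sum_k w_k * f_k(x) for the given component densities."""
    return sum(w * f(x) for w, f in zip(weights, pdfs))


def lognormal_pdf(mu, sigma):
    """Return a lognormal density; used only as an illustrative component."""
    def pdf(x):
        if x <= 0:
            return 0.0
        z = (math.log(x) - mu) / sigma
        return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2.0 * math.pi))
    return pdf


# Hypothetical values: 50% found in source 1, 60% of the rest in source 2.
weights = mixture_weights(0.5, 0.6)
components = [lognormal_pdf(0.0, 1.0), lognormal_pdf(0.5, 1.0), lognormal_pdf(1.0, 0.5)]
density_at_2 = mixture_pdf(2.0, weights, components)
print(weights, density_at_2)
```
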
Basilio
Edmond Ng wrote:
> Dear listers,
>
> I have a question that is based on a problem that I am facing.
>
> A research company has been commissioned to study the revenues of companies in an industry. They have conducted a census of all companies from the finite population of companies in the industry. The companies could be stratified by sub-sectors within the industry and they have some background information on all of the companies but not revenue. Not all companies responded or provided useful information.
>
> Using an official database (source 1), they are able to ascertain the declared revenues of a portion of the companies. For those companies whose revenues are not registered on the database, the researcher replaced the missing declared revenue by the reported figure from the census (source 2) as a proxy to the declared revenue.
>
> For those companies whose revenues cannot be ascertained from either the official database or the census, they used another proxy by multiplying the average revenue per employee for the sub-sector by the (supposedly known) number of employees of the company (source 3).
>
> What I would like to know is the form of the p.d.f. of the revenue values produced by the substitution strategy above. Suppose the data from the three sources follow three p.d.f.s, f_1(x), f_2(x) and f_3(x) (which might be quite different distributions). The p.d.f. of the collated revenue values should then be a mixture distribution, as shown below.
>
> f_(x) = p_1 * f_1(x) + (1-p_1)*p_2 * f_2(x) + (1-p_1)*(1-p_2)*p_3 * f_3(x) [Eqn 1]
>
> where
> p_1 = probability of revenue from the official database being non-missing
> p_2 = prob. of non-missing revenue from census, given that revenue is missing from the official database.
> p_3 = prob. of non-missing revenue using the third proxy, given that revenue is missing from both the official database and census.
>
> My questions are as follows.
> Q1: Does Eqn 1 look about right (at least approximately)?
>
> Q2: The researcher has been asked to construct confidence intervals for the mean and total revenue for each of the strata as well as for all companies. Given that the collated revenue values follow a complicated mixture distribution whose form cannot be ascertained easily, and that time is very tight, would it make sense to suggest that the researcher construct non-parametric bootstrap confidence intervals from the collated values (re-sampling with replacement) instead of the usual confidence intervals that rely on approximate normality of the data?
>
> Any comments, ideas or suggestions would be gladly received. I will summarize the replies, if any. Thank you in advance.
>
> Best wishes,
> Edmond
>
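
For reference, the re-sampling idea raised in Q2 could be sketched roughly as follows. This is only a minimal percentile bootstrap for one stratum, written in Python under several assumptions: the revenue figures are made up, the function name bootstrap_mean_ci is hypothetical, the interval for the total is simply the interval for the mean scaled by the number of collated values, and substituted proxy values are treated as if they were observed, which may understate the true uncertainty.

```python
import random
import statistics


def bootstrap_mean_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Re-sample n values with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Pick the empirical alpha/2 and 1 - alpha/2 quantiles, clamped to range.
    lo_idx = max(0, min(n_boot - 1, round((alpha / 2) * n_boot)))
    hi_idx = max(0, min(n_boot - 1, round((1 - alpha / 2) * n_boot) - 1))
    return means[lo_idx], means[hi_idx]


# Hypothetical collated revenues for one stratum (all three sources pooled).
revenues = [1.2, 0.8, 3.5, 2.1, 0.9, 15.0, 1.7, 2.4, 0.6, 4.3]

mean_lo, mean_hi = bootstrap_mean_ci(revenues)

# Interval for the stratum total: scale by the number of collated values.
n_companies = len(revenues)
total_lo, total_hi = n_companies * mean_lo, n_companies * mean_hi
print((mean_lo, mean_hi), (total_lo, total_hi))
```

For several strata, the same function could be applied to each stratum separately; an interval for all companies would need the re-sampling to be done within strata and the stratum results combined in each replicate.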
Basilio de Bragança Pereira
*Titular Professor of Biostatistics and of Applied Statistics
*FM-Faculty of Medicine and COPPE-Postgraduate School of Engineering and
HUCFF-University Hospital Clementino Fraga Filho.
*UFRJ-Federal University of Rio de Janeiro
*Tel: (55 21) 2562-2594 or /2558/7045
www.po.ufrj.br/basilio/
*MailAddress:
COPPE/UFRJ
Caixa Postal 68507
CEP 21941-972 Rio de Janeiro,RJ
Brasil