Dear listers,
I have a question that is based on a problem that I am facing.
A research company has been commissioned to study the revenues of companies in an industry. They have conducted a census of the finite population of companies in the industry. The companies can be stratified by sub-sector within the industry, and the researchers have some background information on all of the companies, but not their revenue. Not all companies responded or provided useful information.
Using an official database (source 1), they were able to ascertain the declared revenues of a portion of the companies. For those companies whose revenues are not registered in the database, the researcher replaced the missing declared revenue with the figure reported in the census (source 2), as a proxy for the declared revenue.
For those companies whose revenues could not be ascertained from either the official database or the census, they used another proxy: the average revenue per employee for the sub-sector multiplied by the (supposedly known) number of employees of the company (source 3).
What I would like to know is the form of the p.d.f. of the revenue values produced by the substitution strategy above. Suppose the data from the three sources follow three p.d.f.s, f_1(x), f_2(x) and f_3(x), which may be quite different distributions. The p.d.f. of the collated revenue values should then be a mixture distribution, as shown below.
f(x) = p_1 * f_1(x) + (1-p_1)*p_2 * f_2(x) + (1-p_1)*(1-p_2)*p_3 * f_3(x) [Eqn 1]
where
p_1 = probability of revenue from the official database being non-missing
p_2 = probability of revenue from the census being non-missing, given that revenue is missing from the official database
p_3 = probability of revenue from the third proxy being non-missing, given that revenue is missing from both the official database and the census
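To make the weights in Eqn 1 concrete, here is a minimal Python sketch. The function name and the example probabilities are made up for illustration and are not taken from the study. It computes the three mixture weights and their sum. Algebraically the sum equals 1 - (1-p_1)*(1-p_2)*(1-p_3), so the weights add up to 1 only when p_3 = 1, i.e. when every remaining company receives a source-3 value.

```python
def mixture_weights(p1, p2, p3):
    """Return the weights of f_1, f_2 and f_3 as written in Eqn 1."""
    w1 = p1                          # revenue found in the official database
    w2 = (1 - p1) * p2               # missing from database, found in census
    w3 = (1 - p1) * (1 - p2) * p3    # missing from both, third proxy available
    return w1, w2, w3


# Hypothetical probabilities, chosen only to illustrate the arithmetic.
weights_full = mixture_weights(0.5, 0.6, 1.0)     # p_3 = 1
weights_partial = mixture_weights(0.5, 0.6, 0.5)  # p_3 < 1

print(weights_full, sum(weights_full))
print(weights_partial, sum(weights_partial))
```

If p_3 < 1, the leftover mass (1-p_1)*(1-p_2)*(1-p_3) corresponds to companies with no collated value at all, so the weights would need to be divided by their sum for f(x) to be a proper density over the collated values.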
My questions are as follows.
Q1: Does Eqn 1 look about right (at least approximately)?
Q2: The researcher has been asked to construct confidence intervals for the mean and total revenue for each stratum as well as for all companies. The collated revenue values follow a complicated mixture distribution whose form cannot be ascertained easily, and time is very tight. Would it make sense to suggest that the researcher construct non-parametric bootstrap confidence intervals from the collated values (re-sampling with replacement), instead of the usual confidence intervals that rely on approximate normality of the data?
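For concreteness, the kind of percentile bootstrap I have in mind for a single stratum can be sketched in Python as below. This is only a rough sketch under my own assumptions: the function name, the revenue figures and the stratum size are hypothetical, the values are re-sampled as if they were a simple random sample, and it ignores both the finite-population correction and the extra uncertainty introduced by the proxies.

```python
import random
import statistics


def bootstrap_ci_mean(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Re-sample with replacement n_boot times and record each resample mean.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)
    )
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the means.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return means[lo_idx], means[hi_idx]


# Hypothetical collated revenues for one stratum (arbitrary units).
revenues = [1.2, 0.8, 3.5, 2.1, 0.4, 7.9, 1.6, 2.8, 0.9, 4.3]
stratum_size = 25  # hypothetical number of companies in the stratum

lo, hi = bootstrap_ci_mean(revenues)
# An interval for the stratum total follows by scaling the mean by the size.
total_lo, total_hi = stratum_size * lo, stratum_size * hi
print((lo, hi), (total_lo, total_hi))
```

The interval for all companies would presumably need the re-sampling to be done within each stratum separately and the stratum results combined, rather than pooling all values.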
Any comments, ideas or suggestions would be gladly received. I will summarize the replies, if any. Thank you in advance.
Best wishes,
Edmond