Greetings,
I need to draw a random sample from a dataset of about 18,000 records,
based primarily upon the value of a specific field in the dataset. Call
this field 'SIZE' for convenience. Empirically, about 25% of the records
have SIZE = 0; the rest have SIZE > 0. For the non-zero values, previous
work has shown that the distribution tends to be very long-tailed, and
I've found that a gamma distribution is often a good approximation.
I'm hoping to get some advice on how I might draw a random sample that
"suitably" reflects the underlying distribution of this data. Because
the gamma distribution is defined only for values > 0, it cannot be used
alone, as it would miss the zero entries. Is there another distribution
I might use, with domain >= 0? Alternatively, could I try a two-stage
sampling scheme, where I draw at random from the zero values and then
use a gamma to draw from the non-zero values?
The records that are selected will then be subject to a more intensive
review. Due to cost and time constraints, we estimate that no more than
about n = 250 records can be drawn.
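To make the two-stage idea concrete, here is a minimal sketch in Python
with NumPy. It is only one possible reading of the scheme, under
assumptions not stated above: the n = 250 draws are split between the
SIZE = 0 and SIZE > 0 groups in proportion to their share of the data,
and records within each group are drawn uniformly without replacement
(no gamma fit is used to select records). The helper name
two_stage_sample, the gamma parameters, and the synthetic data are all
hypothetical stand-ins for the real dataset.

```python
import numpy as np

def two_stage_sample(size, n, rng):
    """Hypothetical helper: return indices of n records, allocated
    proportionally between the SIZE == 0 and SIZE > 0 groups.
    Assumes SIZE has no negative or missing values."""
    size = np.asarray(size)
    zero_idx = np.flatnonzero(size == 0)
    pos_idx = np.flatnonzero(size > 0)
    # Proportional allocation: the share of zeros in the sample
    # matches the share of zeros in the full dataset.
    n_zero = int(round(n * len(zero_idx) / len(size)))
    n_pos = n - n_zero
    # Simple random sample without replacement inside each group.
    pick_zero = rng.choice(zero_idx, size=n_zero, replace=False)
    pick_pos = rng.choice(pos_idx, size=n_pos, replace=False)
    return np.concatenate([pick_zero, pick_pos])

# Synthetic stand-in for the real data: roughly 25% zeros, and
# long-tailed gamma values otherwise (parameters are made up).
rng = np.random.default_rng(0)
N = 18000
size = np.where(rng.random(N) < 0.25, 0.0,
                rng.gamma(shape=0.5, scale=10.0, size=N))

sample_idx = two_stage_sample(size, n=250, rng=rng)
print(len(sample_idx), int((size[sample_idx] == 0).sum()))
```

With proportional allocation and uniform draws inside each group, the
non-zero part of the sample follows the empirical distribution of SIZE
directly; whether the long tail should instead be oversampled is a
separate design choice that this sketch does not address.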
I'd appreciate any suggestions AllStat members might offer.
Best regards,
Mark