Hi All,
I have the following sampling question that came about
after a discussion with a colleague who is a computer
scientist specialising in databases.
I have 2 tables in a database and can do either of
the following:
1. Join them based on a single column and then take a
random sample from the joined table.
2. Take a random sample from each table and then
perform the same join on the 2 samples.
I am interested in the implications of both 1. and 2.:
which is the better sampling strategy, and are they
equivalent in some sense?
Any references to papers that may have addressed this
question would be appreciated.
In SQL parlance, I am interested in this in the
context of aggregate queries where the output is a
scalar, for example
select sum(c1) from A, B where A.c2 = B.c3
sum() may be replaced by count(), avg(), stdev(),
etc.
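
To make the two strategies concrete, here is a rough sketch in
Python of how I picture them for the sum() query above. It is only
an illustration under my own assumptions: the tables are lists of
dicts, c1 is taken to be a column of A, rows are kept independently
with a fixed probability (Bernoulli sampling), and the sampled sum
is scaled up by the inverse of the retention probability. All
function names are made up for this example.

```python
import random

def join(A, B):
    """Equi-join on A.c2 = B.c3; returns the c1 value of every matching pair."""
    index = {}
    for b in B:
        index.setdefault(b["c3"], []).append(b)
    # One output value per (a, b) pair that satisfies the join condition.
    return [a["c1"] for a in A for _ in index.get(a["c2"], [])]

def bernoulli(rows, rate, rng):
    """Keep each row independently with probability `rate`."""
    return [r for r in rows if rng.random() < rate]

def join_then_sample(A, B, rate, rng):
    """Strategy 1: join first, sample the joined rows, scale up by 1/rate."""
    sampled = bernoulli(join(A, B), rate, rng)
    return sum(sampled) / rate

def sample_then_join(A, B, rate_a, rate_b, rng):
    """Strategy 2: sample each table, join the samples.

    A joined pair survives only if both of its rows survive, which
    happens with probability rate_a * rate_b, hence the scale factor.
    """
    joined = join(bernoulli(A, rate_a, rng), bernoulli(B, rate_b, rng))
    return sum(joined) / (rate_a * rate_b)
```

With this scaling both functions should be unbiased for the true
sum, so the question is really about how their variances compare.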
So from a statistical standpoint is it better to
sample from A and B and then join, or join A and B and
then sample?
It seems to me that there will be less variance if
one takes a single sample from the joined table rather
than taking 2 samples and then joining them.
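
One way I could check this intuition is with a small simulation
such as the sketch below. Everything in it is an assumption chosen
for illustration: toy tables with 20 join keys and 10 rows per key
in each table, c1 = 1 in every row (so the query is effectively a
count), Bernoulli sampling, and rates picked so that a joined pair
is retained with probability 0.25 under both strategies. The
function names are hypothetical.

```python
import random
import statistics

def estimate_join_then_sample(A, B, rate, rng):
    # Full join on A.c2 = B.c3, then keep each joined row with probability `rate`.
    matches = {}
    for c3 in B:
        matches[c3] = matches.get(c3, 0) + 1
    total = 0.0
    for c1, c2 in A:
        for _ in range(matches.get(c2, 0)):
            if rng.random() < rate:
                total += c1
    return total / rate  # scale up by the inverse retention probability

def estimate_sample_then_join(A, B, rate, rng):
    # Keep each row of B, and each row of A, independently with
    # probability `rate`, then join the two samples.
    matches = {}
    for c3 in B:
        if rng.random() < rate:
            matches[c3] = matches.get(c3, 0) + 1
    total = 0.0
    for c1, c2 in A:
        if rng.random() < rate:
            total += c1 * matches.get(c2, 0)
    return total / (rate * rate)  # a pair survives with probability rate**2

def compare(trials=300, seed=0):
    rng = random.Random(seed)
    # Toy data (assumed): A rows are (c1, c2), B rows are just c3.
    A = [(1.0, k) for k in range(20) for _ in range(10)]
    B = [k for k in range(20) for _ in range(10)]
    # 0.25 on the join vs 0.5 per table: same pair retention probability.
    s1 = [estimate_join_then_sample(A, B, 0.25, rng) for _ in range(trials)]
    s2 = [estimate_sample_then_join(A, B, 0.5, rng) for _ in range(trials)]
    return ((statistics.mean(s1), statistics.variance(s1)),
            (statistics.mean(s2), statistics.variance(s2)))

if __name__ == "__main__":
    (m1, v1), (m2, v2) = compare()
    print("join then sample: mean", m1, "variance", v1)
    print("sample then join: mean", m2, "variance", v2)
```

My reasoning for expecting the second estimator to be noisier in a
setup like this is that joined pairs sharing a row of A or B are
kept or dropped together, so the retained pairs are correlated,
whereas sampling the joined table keeps each pair independently. I
do not know how far this carries over to other key distributions or
sampling schemes.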
From a database standpoint, however, the
sample-then-join strategy is highly desirable for
efficiency.
Any thoughts or pointers on this problem would be most
appreciated.
Best regards,
Mary