Hi Ali,
I think we are mostly disagreeing on language.
I was thinking about colliders when I chose the word fishing. Think of it this way: the data that are needed are the fish, and the rest of the data, which are not fish, are the ocean.
Yes, of course, in general the more complex (heterogeneous, interdependent, sparse events...) the situation is, the more difficult it is to build theory. One way out of this is to collect more data, either as a way to ensure that one gets relevant data (what I called fishing) or to try to get more confidence in the findings. The latter raises the interesting idea that, in the limit, proof comes because one has data about every possible state of a situation - but this is usually less than efficient.
I also agree with you that more data are needed to capture variation and variety. Sometimes, however, all that is needed is to extract a measure of their scale (one of the original reasons for pilot experiments). In many cases, as I wrote earlier, a small quantity of data can suffice.
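To make the pilot-experiment point concrete, here is a minimal sketch in Python (my illustration, not anything from the original exchange) of using a handful of pilot observations only to gauge the scale of variation, and then sizing the main study from that estimate. The pilot values, the 95% confidence level and the target margin of error are all assumed for the example.

# A small pilot is used only to estimate the spread; the main study is
# then sized from that estimate (normal approximation, assumed targets).
import math
import statistics

pilot = [4.1, 5.3, 4.8, 6.0, 5.1, 4.6, 5.7, 4.9]   # eight pilot observations
sigma_hat = statistics.stdev(pilot)                 # estimated scale of variation

target_margin = 0.5        # desired 95% margin of error on the mean (assumed)
z = 1.96                   # normal approximation for 95% confidence
n_needed = math.ceil((z * sigma_hat / target_margin) ** 2)

print("estimated SD:", round(sigma_hat, 2))
print("observations needed for the main study:", n_needed)

The point is that the pilot itself stays small; its job is only to tell you how big the later effort needs to be.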
For example, if I want to theorise about a system and I want to know whether its behaviour is homeostatic, rheostatic, social or none of these, then a dozen data points may be enough. At that point I have the theory boxed into a small group of families, and a few more data points can calibrate the curves well enough to generalise. A few more points at possible boundary situations will give some idea of the limits of generalisation.
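As a rough illustration of what I mean by boxing the behaviour into a family of curves - this is only a sketch, and the two candidate families and the dozen synthetic points are my assumptions, not a real system - one can fit competing families to a dozen observations and let an information criterion say which family the system most likely belongs to:

# Twelve points are enough to separate a drifting system from one that
# settles toward a setpoint; the data here are synthetic for illustration.
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)

t = np.linspace(0, 10, 12)
y = 5.0 * (1 - np.exp(-0.6 * t)) + rng.normal(0, 0.15, size=t.size)

def linear(t, a, b):
    # Candidate family 1: unregulated linear drift.
    return a * t + b

def saturating(t, k, r):
    # Candidate family 2: saturating approach to a setpoint (homeostatic-like).
    return k * (1 - np.exp(-r * t))

def aic(y, y_hat, n_params):
    # Gaussian AIC computed from the residual sum of squares.
    n = y.size
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * n_params

for name, f, p0 in [("linear", linear, (1.0, 0.0)),
                    ("saturating", saturating, (1.0, 1.0))]:
    params, _ = curve_fit(f, t, y, p0=p0)
    print(name, "AIC =", round(aic(y, f(t, *params), len(params)), 1),
          "params =", np.round(params, 2))

A handful of further points near suspected boundaries would then test how far the chosen family can be pushed.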
Many years ago, in the early 1980s, I worked as a programmer (not a very good one at that time!) for the Gwilym Jenkins Partnership, whose founder, with George Box, developed the Box-Jenkins approach to time series analysis. I was involved in programming the computers that predicted the day-by-day gas supply needs of France. What I remember is, on the one hand, the potential multifactorial complexity of the data as a whole that was represented and, on the other hand, the relatively small amount of data that was needed to create, test and keep the models and theories live and accurate on a moment-to-moment basis.
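For readers who want the flavour of this, here is a minimal sketch using today's statsmodels library rather than the software of the 1980s; the synthetic daily demand series and the ARIMA(1, 1, 1) order are illustrative assumptions, not the actual French gas models.

# A short window of recent observations is enough to fit and roll a daily
# Box-Jenkins (ARIMA) forecast; the "demand" series is synthetic.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)

days = np.arange(90)
demand = 100 + 0.3 * days + rng.normal(0, 2.0, size=days.size)

# Fit a small ARIMA(1, 1, 1) model on the window and forecast tomorrow.
fitted = ARIMA(demand, order=(1, 1, 1)).fit()
print("next-day forecast:", round(float(fitted.forecast(1)[0]), 1))

# When tomorrow's actual figure arrives, append it and re-forecast:
# the model stays live day by day without needing massive data.
updated = fitted.append(np.array([128.0]))
print("day-after forecast:", round(float(updated.forecast(1)[0]), 1))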
I can agree that some circumstances need high levels of data. I suggest, however, that in most cases this is not so, and that even where high levels of data are currently needed, this is not intrinsic; in many cases, smaller amounts of more salient data would be more helpful.
Cheers,
Terry
==
Dr Terence Love
MICA, PMACM, MAISA, FDRS, AMIMechE
Director
Design Out Crime & CPTED Centre
Perth, Western Australia
[log in to unmask]
www.designoutcrime.org
+61 (0)4 3497 5848
==
ORCID 0000-0002-2436-7566
-----Original Message-----
From: [log in to unmask] [mailto:[log in to unmask]] On Behalf Of Ali Ilhan
Sent: Tuesday, 16 January 2018 10:45 PM
To: PhD-Design - This list is for discussion of PhD studies and related research in Design <[log in to unmask]>
Subject: Re: Expanding the discussion about statistics and design
Dear Terry,
I have to respectfully disagree on a number of things.
You wrote:
“The main use massive data (in the sense that Ken comments) is to go fishing to try to identify causal explanations in areas about which very little is known. That is appropriate to some situations but not all.”
This is not true at all. The amount of data needed is typically context dependent.
For instance, if there is more heterogeneity in the process that you are investigating, you will definitely need more data. Or if you are looking at rare events, again you will need more data to capture the variation.
Sometimes, even in cases where there is a lot of knowledge, the process that generates the data is itself only viable when there is a lot of data (I know this is a weird sentence). Take some mundane physics experiments in colliders (I am not talking about cutting-edge physics). They are looking for theoretically very well-defined particles, but those particles only "exist" for the briefest moments, and particle collisions tend to create a lot of data. This is not fishing at all. But sometimes you do indeed go fishing; some forms of machine learning work like that, as you have pointed out. And so on.
I think we also have to distinguish theory creation from theory testing.
Again, in some areas like econometrics, all you need to create a theory (at least in principle) is a pencil, a lot of paper and a deep knowledge of calculus. But then to test the theory you may need a lot of data (for some cases a few hundred to a few thousand observations will suffice; for others you may need millions). And on the other hand, for some areas a thousand cases is massive, while for others billions are tiny.
In between grading, what I am trying to say is that data and their usage can differ immensely depending on the context (context being a mixture of disciplinary conventions, research questions, previous theory, model choice, population heterogeneity, endogeneity, and practical issues such as money, time, etc.), so we should not rush to hasty generalizations.
Yours,
ali
-----------------------------------------------------------------
PhD-Design mailing list <[log in to unmask]>
Discussion of PhD studies and related research in Design
Subscribe or Unsubscribe at https://www.jiscmail.ac.uk/phd-design
-----------------------------------------------------------------