> From [log in to unmask] Sun Oct 18 00:25:18 1998
> I have just been communicating with a person about the problem of
> reading a "huge" file of data when you can't predict in advance how big
> it will be. ....
...
> two options:
>
> 1. Allocate an array with space for N items;
> do forever:
> [inner loop:] fill the current array; exit if EOF occurs during
> input
> allocate a new array with space for 2N items
> copy the current array to part of the new one
> change current array name to new array name and change N to 2N
> ! (You can use the same two array names alternately.)
> end do.
>
>
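[The doubling loop of option 1 maps naturally onto C, where realloc does the grow-and-copy in one call; this is my illustrative sketch, not code from the original post, and read_all is my name for it:]

```c
#include <stdio.h>
#include <stdlib.h>

/* Read doubles from stream until EOF, growing the buffer by doubling.
 * Starts with room for n items; whenever the buffer fills, capacity
 * becomes 2N.  Returns the buffer and stores the final count.        */
double *read_all(FILE *stream, size_t n, size_t *count)
{
    double *buf = malloc(n * sizeof *buf);
    size_t used = 0;
    double x;

    if (!buf) return NULL;
    while (fscanf(stream, "%lf", &x) == 1) {
        if (used == n) {                    /* array full: double it  */
            double *tmp = realloc(buf, 2 * n * sizeof *buf);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
            n *= 2;                         /* N becomes 2N           */
        }
        buf[used++] = x;
    }
    *count = used;
    return buf;
}
```

[Because the capacity grows geometrically, the total copying work over all reallocations is proportional to the final size, not quadratic in it.]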
> 2. Make a pass through the data, counting it but not storing it; then
> rewind, allocate an array that will hold it all, and really read it.
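[Option 2 in C terms, with fscanf and rewind standing in for the Fortran READ/REWIND; again my sketch, and it assumes the stream really is seekable:]

```c
#include <stdio.h>
#include <stdlib.h>

/* Two-pass read: count the items, rewind, allocate exactly, reread.
 * Only works when the stream is a rewindable file, not a pipe.      */
double *read_two_pass(FILE *f, size_t *count)
{
    double x, *buf;
    size_t n = 0, i;

    while (fscanf(f, "%lf", &x) == 1)       /* pass 1: count only    */
        n++;
    rewind(f);

    buf = malloc(n * sizeof *buf);
    if (!buf) return NULL;
    for (i = 0; i < n; i++)                 /* pass 2: really read   */
        if (fscanf(f, "%lf", &buf[i]) != 1) { free(buf); return NULL; }
    *count = n;
    return buf;
}
```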
...
> 3. Any other suggestions? If you can tell in advance how much space could
> be made available internally, you could just allocate it all and be a
> space hog. I assume that this would be economically or politically or
> socially unfeasible.
Between the test and the allocation, some space might go away if
other processes are running; furthermore, if your machine
implements overcommitment of virtual memory, it might be hard
to get the information you really need. This is aside from
the issue of being a cad. :-)
This whole area is where one really misses REALLOCATE.
In converting our legacy app. to F90 dynamic allocation, we identified
three situations:
a. You have no idea how big the array has to be before the first
allocation.
b. You do know how big it has to be.
c. You know an upper limit, but not the final size. Then you allocate
a temp array to hold the upper limit; when it's filled, allocate
the "real" array to the right size; fill it; deallocate the
   temp array. This is only usable when the upper limit is small
   enough to allocate outright, which, in our program, is rarely
   the case even when we do know the upper limit.
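[The copy step of situation (c) looks like this in C; right_size is my name for the operation, and fill-in of the temp array is assumed to have happened already:]

```c
#include <stdlib.h>
#include <string.h>

/* Situation (c): a temp array was allocated at the known upper limit
 * and filled with `used` items; allocate the "real" array at exactly
 * that size and copy.  The caller then frees the temp array.         */
double *right_size(const double *temp, size_t used)
{
    double *real = malloc(used * sizeof *real);
    if (real)
        memcpy(real, temp, used * sizeof *real);
    return real;
}
```

[The surrounding pattern is: malloc the temp at the limit, fill it, call this, free the temp; peak memory is limit + final size, which is why a loose upper limit makes it infeasible.]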
You are in situation (a), but you could move to (b) if you wished
to go through the data twice. BTW, in order to go through the
data twice, you have to be damn sure that it's always going to
be in a static file; in other words, it's not coming in from
std. input or some kind of named pipe or socket. Are you really
sure the data will always be in a REWINDable file?
Anyway, the way I handle (a) is not by growing to N, then 2N, etc.;
we wrote a MODULE for dynamic arrays. For each array that is
initialized, we allow specification of a "grow increment function";
this is a function that's called each time the array has to be grown
and which returns the new (larger) size. In most cases, we
were able to guess a "reasonable" grow size for each array, and
the functions we use return constants. But the functions you
use could return anything you want; in particular, one could
pass a function that carries out the doubling you propose.
Since I already have this MODULE, I would not go through the
data twice; I would just use the MODULE.
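[A C analogue of the MODULE's central idea, with my own names (grow_fn, dyn_array, dyn_push): each array carries a caller-supplied function that is called whenever the array must grow and returns the new, larger size.]

```c
#include <stdlib.h>

typedef size_t (*grow_fn)(size_t current);  /* returns new, larger size */

typedef struct {
    double *data;
    size_t  used, cap;
    grow_fn grow;         /* "grow increment function" for this array  */
} dyn_array;

/* Two example policies: a guessed constant increment, as we usually
 * use, and the doubling scheme the original poster proposed.          */
size_t grow_by_100(size_t n) { return n + 100; }
size_t grow_double(size_t n) { return n ? 2 * n : 1; }

int dyn_push(dyn_array *a, double x)
{
    if (a->used == a->cap) {
        size_t ncap = a->grow(a->cap);      /* ask the policy function */
        double *tmp = realloc(a->data, ncap * sizeof *tmp);
        if (!tmp) return 0;
        a->data = tmp;
        a->cap  = ncap;
    }
    a->data[a->used++] = x;
    return 1;
}
```

[An array is initialized as, say, `dyn_array a = { NULL, 0, 0, grow_double };` and grown implicitly through dyn_push.]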
Still another way to go would be to use linked lists, rather
than arrays, adding a link with each incoming record. At
the end, if you need the data to be in arrays, allocate the
arrays to the known exact size, copy into them from the linked
list, then delete the linked list.
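[In C, that linked-list approach might look like this sketch (names mine); nodes are pushed at the head, so the final copy fills the array back-to-front and deletes the list as it goes:]

```c
#include <stdlib.h>

typedef struct node {
    double       value;
    struct node *next;
} node;

node *push(node *head, double x)            /* prepend one record */
{
    node *n = malloc(sizeof *n);
    if (!n) return NULL;
    n->value = x;
    n->next  = head;
    return n;
}

/* Allocate the array at the exact known size, copy from the list
 * (newest node holds the last record, so fill from the back), and
 * free each node along the way.                                   */
double *to_array(node *head, size_t count)
{
    double *a = malloc(count * sizeof *a);
    size_t i = count;
    if (!a) return NULL;
    while (head) {
        node *dead = head;
        a[--i] = head->value;
        head = head->next;
        free(dead);                         /* delete the list    */
    }
    return a;
}
```

[The trade-off versus growing arrays is one malloc per record and pointer overhead, in exchange for never copying more than once.]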
-P.
*** "Freedom's just another word for nothing left to lose." (B. Yeltsin)***
*Peter Shenkin; Chemistry, Columbia U.; [log in to unmask] (212)854-5143*
*MacroModel WWW page: http://www.columbia.edu/cu/chemistry/mmod/mmod.html *