It's always nice to see a practical programmer.  In fact, I use this
method to read a large survey file with multiple record types: there are
household-, family-, and person-level data.  I have a program that reads
the raw data, which has one fixed-size record per person and repeats the
relevant household and family data on each person record.  The file also
contains longitudinal data covering up to 48 months, and its size is over
1 GB.  The program that uses this data is a microsimulation model that
can create new variables on the file.  I keep track of the variables on
the file by maintaining an ASCII text file with the variable names and
sizes.

I wanted read statements that would read this data quickly.  So I ended
up breaking the data into three arrays: one for household data, which
occurs once per household; one for family records, which can occur
multiple times per household; and one for person records, which can also
occur multiple times per household.

In the end, I wrote my program so that the file was written out as a binary
file, with a few variables at the beginning that define the dimensions of
my three arrays.  The household array is one-dimensional; the family array
is three-dimensional, dimensioned by the number of families, by the number
of months, and by the number of variables in the family array.  The person
array is the same but dimensioned by persons instead of families.
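The write side of that layout can be sketched roughly as follows (the
file name, unit number, and all dimension values here are my own
assumptions, not taken from the original program):

```fortran
      PROGRAM WRTFIL
!     Sketch: write the dimensions first, then the three arrays,
!     as two records of one unformatted sequential file.
      INTEGER, PARAMETER :: NFAM = 3, NPER = 7, NMON = 48
      INTEGER, PARAMETER :: NHV = 10, NFV = 20, NPV = 30
      REAL :: HHLD(NHV)
      REAL :: FAMILY(NFAM, NMON, NFV)
      REAL :: PERSON(NPER, NMON, NPV)
      HHLD = 0.0
      FAMILY = 0.0
      PERSON = 0.0
      OPEN (8, FILE='survey.bin', FORM='UNFORMATTED')
!     Header record: everything a reader needs to size its arrays.
      WRITE (8) NFAM, NPER, NMON, NHV, NFV, NPV
!     Data record: all three arrays in one shot.
      WRITE (8) HHLD, FAMILY, PERSON
      CLOSE (8)
      END PROGRAM
```

Putting the dimensions in their own leading record is what lets the
reading program allocate before it touches the bulk of the data.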

So before I read the arrays, I have enough information to allocate arrays
of exactly the size I need.  All that's left is to read the allocated
arrays in a single read statement that looks like: READ(8,END=NNNN) HHLD,
FAMILY, PERSON.  It's an extremely fast read.
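A minimal sketch of that read-side pattern, using assumed names and an
allocatable-array declaration (the original may well use a different
mechanism):

```fortran
      PROGRAM RDFIL
!     Sketch: read the dimension header, ALLOCATE to exact size,
!     then pull all three arrays in with one unformatted READ.
      INTEGER :: NFAM, NPER, NMON, NHV, NFV, NPV
      REAL, ALLOCATABLE :: HHLD(:), FAMILY(:,:,:), PERSON(:,:,:)
      OPEN (8, FILE='survey.bin', FORM='UNFORMATTED', STATUS='OLD')
      READ (8) NFAM, NPER, NMON, NHV, NFV, NPV
      ALLOCATE (HHLD(NHV))
      ALLOCATE (FAMILY(NFAM, NMON, NFV))
      ALLOCATE (PERSON(NPER, NMON, NPV))
!     One record, one transfer: this is why the read is so fast.
      READ (8, END=100) HHLD, FAMILY, PERSON
  100 CLOSE (8)
      END PROGRAM
```

Because the whole data record is transferred in one unformatted READ,
there is no per-value formatting or per-record loop overhead.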

Of course, I use equivalences so I don't need to create six arrays.  Now,
the powers that be don't seem to like equivalences, but it's not hard to
keep track of the few real variables.
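The idea behind that equivalence trick might look like the sketch below:
one storage area viewed as both INTEGER and REAL, so integer and real
variables share a single array per record type.  Note that EQUIVALENCE
only applies to statically sized arrays, and the names and dimensions
here are assumptions for illustration:

```fortran
!     Sketch: FAMILY and IFAMLY name the same storage, so integer
!     fields can be accessed without a second physical array.
      REAL    FAMILY(3, 48, 20)
      INTEGER IFAMLY(3, 48, 20)
      EQUIVALENCE (FAMILY, IFAMLY)
```

Whether a given slot is read through FAMILY or IFAMLY then depends on
the declared type of that variable in the data dictionary.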

I read the ASCII data dictionary once at the beginning; this file tells
me each variable's location along the variable dimension.  That way I
don't need the input variables in fixed positions, since I can look up
each location in my data dictionary once.
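A dictionary lookup of that kind could be sketched like this, assuming a
two-column ASCII file of variable names and indices (file name, format,
and function name are all hypothetical):

```fortran
      INTEGER FUNCTION VARLOC (NAME)
!     Sketch: scan an ASCII dictionary of "name  index" pairs and
!     return the variable's slot in the variable dimension
!     (0 if the name is not found).
      CHARACTER(*) :: NAME
      CHARACTER(16) :: VNAME
      INTEGER :: IDX, IOS
      VARLOC = 0
      OPEN (9, FILE='dict.txt', STATUS='OLD')
   10 READ (9, *, IOSTAT=IOS) VNAME, IDX
      IF (IOS /= 0) GO TO 20
      IF (VNAME == NAME) THEN
         VARLOC = IDX
         GO TO 20
      END IF
      GO TO 10
   20 CLOSE (9)
      END FUNCTION
```

In practice one would read the whole dictionary into a table once at
startup, as described above, rather than rescanning the file per lookup.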

I hope this gives you some idea of the flexibility of writing out entire
arrays and associating a data dictionary with the data.



From: Jing Guo <[log in to unmask]>


At 05:00 PM 7/10/01 -0400, you wrote:
> >
> > It's maybe a simple question.
>
>It may be a simple question, but there is no simple answer.
>
>If I were your programmer given this assignment, I would first ask you
>for the specifications of the file format and the data to be extracted
>from the file.  I would say that the size of the file is not the same
>thing as the size of the data.  There may or may not be a simple
>relationship between them.
>
>Assume the data in the file were written out as a flat data stream.  In
>this case, one may be able to link the file size to the data size in a
>simple expression.  Even so, one won't be able to tell if the data is a
>simple 1-d array for a row or a column vector or a 2-d array for a
>matrix.  All that information, if not specified in the file, would have
>to be specified somewhere else (Even the software engineers of MATLAB
>would have to either store dimensional information in a data file or ask
>users to specify them).
>
>Now it may be getting more controversial.
>
>For your specific problem, I would first suggest a long-term solution:
>redefine your output file format so that all dimensional information is
>the _first_ piece of data one gets from a file (why would anyone want to
>read a file twice?).  Losing information is not a reversible adiabatic
>process: it is not always possible to reverse-engineer dimensional
>information from file sizes.
>
>If you have no control over the output data format, and the file size
>data can be easily mapped to data dimensional information, why don't you
>collect the file size data _before_ you use them?  For instance, one can
>create a list of files with their sizes listed next to their names, such
>as:
>
>13579 "file1.dat"
>2468  "file2.dat"
>999   "file3.dat"
>
>I won't suggest some "smart" solutions in Fortran.  In my programming
>life, I tried to program some "smart" solutions myself.  I now believe
>all those solutions were wrong.  If the data integrity is broken, the
>best solution is to patch the data, not to create some generic
>solutions that at best require some very case-specific information
>to work.
>
>
> > Suppose you have several data files, each
> > of them has a different length and will be read in by your program.  Usually
> > you have to specify the corresponding array length in your program,
> > otherwise you would run into trouble on the read.  I want to know if there
> > is a means in Fortran to read a data file without specifying the size,
> > something like that in MATLAB:
> >
> >   r1=myfile(:,1)
> >
> > or
> >
> >    fscanf(fid,'%g %g',[2 inf]);
> >
> > Thus, all the data will be read in and we can get the size of array by
> > simple command size(...).
> >
> > Many thanks.
> >
> > Yongcheng
> >
>
>
>--
>________________________________ _-__-_-_ _-___---
>Jing Guo, [log in to unmask], (301)614-6172(o), (301)614-6297(fx)
>Data Assimilation Office, Code 910.3, NASA/GSFC, Greenbelt, MD 20771


Bob Cohen
(703) 534-7618
[log in to unmask]