Hi Petre,

Another thought: how large, and how full, is the filesystem hosting /tmp? If it fills up, even transiently, when you run lots of simultaneous jobs, then that might cause the observed behaviour. One would hope that write errors would then have appeared in your logs, so maybe this isn't the cause, although under these circumstances the errors themselves might fail to be logged. Cluster nodes need large, and ideally dedicated, /tmp filesystems.
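
As a rough way to check this, here is a minimal Python sketch that reports how full the filesystem behind /tmp is, in both space and inodes. The function name filesystem_usage is only illustrative; it relies on the standard os.statvfs call. Because the problem may be transient, a single reading proves little, so it would need to be sampled repeatedly on a compute node while a large batch is running.

```python
import os
import tempfile


def filesystem_usage(path):
    """Return space and inode usage for the filesystem that holds `path`."""
    st = os.statvfs(path)
    # f_blocks / f_bavail are counted in units of f_frsize bytes;
    # f_bavail is what is available to unprivileged users.
    if st.f_blocks:
        blocks_used_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks
    else:
        blocks_used_pct = 0.0
    # Some filesystems report 0 total inodes, so guard the division.
    if st.f_files:
        inodes_used_pct = 100.0 * (st.f_files - st.f_favail) / st.f_files
    else:
        inodes_used_pct = 0.0
    return {
        "total_bytes": st.f_blocks * st.f_frsize,
        "free_bytes": st.f_bavail * st.f_frsize,
        "blocks_used_pct": blocks_used_pct,
        "inodes_used_pct": inodes_used_pct,
    }


if __name__ == "__main__":
    # tempfile.gettempdir() honours TMPDIR and otherwise falls back to /tmp.
    print(filesystem_usage(tempfile.gettempdir()))
```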

On 17 Dec 2010, at 01:49, Bogdan Petre wrote:

Indeed, these errors are always associated with the /tmp directory.  Although I would never have guessed the cause could be so simple, the /tmp folder often holds an overwhelming number of files (over 10k, though still well below the filesystem inode limit), so many that the 'rm' binary complains about 'too many parameters' when it is passed a wildcard ('rm -r *').  While we have other methods to delete the files, this does suggest that the sheer number may be overwhelming some binaries, whether FSL or system related.  A large number of these files appear related to feat (e.g. /tmp/feat_zoPbjS).  The majority are simply empty files; others are design files (*.fsf, *.con, *.mat, and associated png and ppm files).  Although our /tmp directories are cleared each time we power cycle any of our nodes (however infrequently that may be), these files reappear, which seems to indicate that something could be interfering with the garbage collection process.
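
For what it's worth, one way to clear such leftovers without hitting the wildcard limit is to walk the directory entry by entry instead of letting the shell expand '*'. The sketch below is only an illustration: the function names, the "feat_" prefix and the 24 hour age threshold are assumptions on my part, it only touches regular files (not directories), and the threshold would have to be longer than the longest running job so that files still in use are left alone. It defaults to a dry run.

```python
import os
import time


def find_stale_files(directory, prefix="feat_", max_age_hours=24.0):
    """Yield regular files in `directory` named `prefix`* that are older
    than `max_age_hours` (judged by modification time)."""
    cutoff = time.time() - max_age_hours * 3600.0
    # os.scandir streams entries one at a time, so no long argument
    # list is ever built (unlike a shell wildcard handed to rm).
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                    yield entry.path
            except FileNotFoundError:
                # Another job removed the file while we were scanning.
                continue


def remove_stale_files(directory, prefix="feat_", max_age_hours=24.0,
                       dry_run=True):
    """Remove stale files (or only list them when dry_run is True).
    Returns the list of affected paths."""
    affected = []
    # Collect the candidates first so we never delete while scanning.
    for path in list(find_stale_files(directory, prefix, max_age_hours)):
        if not dry_run:
            try:
                os.remove(path)
            except FileNotFoundError:
                continue
        affected.append(path)
    return affected
```

Calling remove_stale_files("/tmp") lists the candidates; passing dry_run=False actually deletes them.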

If FSL programs are supposed to garbage collect any tmp files after they're finished running, do you know of any typical situation when they might not, particularly any such situation involving feats?

Thanks,

Bogdan Petre
Departments of Integrated Science and Physiology
Northwestern University


On 12/14/10 17:33, Mark Jenkinson wrote:
Hi,

This is quite serious - as I'm sure you realise.
The warning message means that the file is corrupt and does not contain 
enough data.  In fact they only seem to contain about half.
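
If you can catch one of the affected files before it is cleaned up, a quick independent check is to decompress it outside FSL and see whether the gzip stream itself is complete. The following is a minimal Python sketch using only the standard gzip module; the function name gzip_payload_size is hypothetical. A truncated .nii.gz should show up as an incomplete stream, and the byte count returned on failure is only a lower bound on what was readable.

```python
import gzip
import zlib


def gzip_payload_size(path, chunk_size=1 << 20):
    """Decompress `path` completely.

    Returns (is_complete, bytes_counted). On failure, bytes_counted is
    the number of bytes counted before the error (a lower bound).
    """
    total = 0
    try:
        with gzip.open(path, "rb") as fh:
            while True:
                chunk = fh.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
    except (EOFError, OSError, zlib.error):
        # EOFError: the file ended before the gzip end-of-stream marker,
        # which is what a truncated file looks like.
        # OSError / zlib.error: bad header, bad checksum or corrupt data.
        return False, total
    return True, total
```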

Are all the files that this happens to in /tmp ?
If it is only happening under conditions of high load, then maybe your
/tmp is unable to deal with the traffic/space/number-of-files, or maybe the 
mechanism that fsl is using to select unique filenames inside of /tmp is 
sometimes giving clashes (although this function - mkstemp - is very 
fundamental and so it seems unlikely given that it doesn't happen to other
users).  So I would first check whether it is always associated with
/tmp or not.  If it is then that at least gives us something to go on.
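
To test the filename-clash idea directly on your own /tmp, something like the Python sketch below could be run on a compute node. It is not FSL's code: it uses Python's tempfile.mkstemp, which follows the same create-exclusively idea as the C mkstemp, and the function name and the "fsl_" prefix are just placeholders. It creates many temporary files concurrently and lets you confirm that every returned name is distinct. Whether the same holds on an NFS-mounted /tmp can only be seen by pointing `parent` at that directory.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def make_unique_files(parent, count, prefix="fsl_", workers=16):
    """Create `count` temporary files in `parent` concurrently and
    return their paths."""
    def create(_):
        # mkstemp picks a random suffix and opens the file with
        # O_CREAT | O_EXCL, retrying with a new name if it already exists.
        fd, path = tempfile.mkstemp(prefix=prefix, dir=parent)
        os.close(fd)
        return path

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(create, range(count)))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as parent:
        paths = make_unique_files(parent, 1000)
        print("clashes:", len(paths) - len(set(paths)))
```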

As for the missing /tmp directory - that is not surprising as our scripts
and executables clean up any temporary directories in /tmp after they
have run.

I hope this will help you track down the problem.
All the best,
	Mark


On 14 Dec 2010, at 19:24, Bogdan Petre wrote:

Hey Everyone,

We've encountered issues reading *.nii.gz files which are not specific to any particular FSL script or dataset and have been produced by everything from melodic to fslmaths.  An example error message is shown below:

/usr/local/fsl/bin/fsl_motion_outliers /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/prefiltered_func_data 0 /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/motion_outliers.txt
WARNING: nifti_read_buffer(/tmp/fsl_F4Bd5B_mc/fmri_mcf.nii.gz):
  data bytes needed = 990720
  data bytes input  = 439262
  number missing    = 551458 (set to 0)
WARNING: nifti_read_buffer(/tmp/fsl_F4Bd5B_mc/fmri_mcf.nii.gz):
  data bytes needed = 990720
  data bytes input  = 439262
  number missing    = 551458 (set to 0)

This issue has never occurred when running a single analysis, but occurs with varying frequency when running multiple analyses simultaneously.  Melodic fails the most frequently (sometimes >40% failure rate), programs like fdt and fast were found to fail least often (~1% failure rate), and some programs have not failed at all (e.g. BET & SUSAN).  The errors are not consistent, meaning that if a script is rerun after it has failed, it won't necessarily fail again.  However, running a sufficiently large batch of simultaneous jobs (e.g. submitting 100 to SGE) will invariably result in such errors.  Nothing suspicious has been noted regarding the image files themselves either; header information obtained using fslhd, together with fslstats -r -R output, is listed below for prefiltered_func_data.nii.gz (which is the input file used above to produce the error output).
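
To separate FSL from the underlying storage, a small stand-alone stress test may help. The Python sketch below is only an illustration and all names in it are made up: it writes many gzip files into a chosen directory concurrently, reads each one back, and reports any that come back short or corrupt. The default payload size matches the "data bytes needed" figure in the warning above. Threads inside one process will not reproduce the load of 100 SGE jobs spread over several nodes, so for a realistic test the script itself would have to be submitted as many simultaneous jobs with `directory` set to the nodes' /tmp.

```python
import gzip
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def roundtrip_once(directory, payload):
    """Write `payload` to a fresh .gz file in `directory` and read it back.
    Returns None on success, or a short description of the failure."""
    fd, path = tempfile.mkstemp(suffix=".gz", dir=directory)
    os.close(fd)
    try:
        with gzip.open(path, "wb") as fh:
            fh.write(payload)
        with gzip.open(path, "rb") as fh:
            data = fh.read()
        if data != payload:
            return (f"{path}: content mismatch "
                    f"(read {len(data)} bytes, expected {len(payload)})")
        return None
    except (EOFError, OSError, zlib.error) as exc:
        # Covers truncated streams, corrupt data and failed writes.
        return f"{path}: {type(exc).__name__}: {exc}"
    finally:
        try:
            os.remove(path)
        except FileNotFoundError:
            pass


def stress(directory, jobs=100, payload_bytes=990720, workers=20):
    """Run `jobs` write/read round trips concurrently; return the failures."""
    payload = os.urandom(payload_bytes)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda _: roundtrip_once(directory, payload),
                           range(jobs))
        return [r for r in results if r is not None]


if __name__ == "__main__":
    for failure in stress(tempfile.gettempdir()):
        print(failure)
```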

fslhd:

bogdan@bob:/home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat$ fslhd prefiltered_func_data.nii.gz 
filename       prefiltered_func_data.nii.gz

sizeof_hdr     348
data_type      FLOAT32
dim0           4
dim1           86
dim2           72
dim3           40
dim4           276
dim5           1
dim6           1
dim7           1
vox_units      mm
time_units     s
datatype       16
nbyper         4
bitpix         32
pixdim0        0.0000000000
pixdim1        2.9767441750
pixdim2        2.9767441750
pixdim3        3.0000000000
pixdim4        2.5499999523
pixdim5        0.0000000000
pixdim6        0.0000000000
pixdim7        0.0000000000
vox_offset     352
cal_max        0.0000
cal_min        0.0000
scl_slope      0.000000
scl_inter      0.000000
phase_dim      0
freq_dim       0
slice_dim      0
slice_name     Unknown
slice_code     0
slice_start    0
slice_end      0
slice_duration 0.000000
time_offset    0.000000
intent         Unknown
intent_code    0
intent_name    
intent_p1      0.000000
intent_p2      0.000000
intent_p3      0.000000
qform_name     Scanner Anat
qform_code     1
qto_xyz:1      -2.976732  -0.000006  -0.008716  126.350861
qto_xyz:2      -0.000843  2.962765  0.290396  -194.549240
qto_xyz:3      -0.008607  -0.288146  2.985899  13.005946
qto_xyz:4      0.000000  0.000000  0.000000  1.000000
qform_xorient  Right-to-Left
qform_yorient  Posterior-to-Anterior
qform_zorient  Inferior-to-Superior
sform_name     Scanner Anat
sform_code     1
sto_xyz:1      -2.976731  0.000000  -0.008833  126.350861
sto_xyz:2      -0.000849  2.962765  0.290396  -194.549240
sto_xyz:3      -0.008724  -0.288146  2.985899  13.005946
sto_xyz:4      0.000000  0.000000  0.000000  1.000000
sform_xorient  Right-to-Left
sform_yorient  Posterior-to-Anterior
sform_zorient  Inferior-to-Superior
file_type      NIFTI-1+
file_code      1
descrip        FSL4.0
aux_file       
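
A side note on the numbers: the "data bytes needed" figure in the warning matches exactly one 3D volume of this header, assuming the temporary fmri_mcf.nii.gz had the same dimensions and FLOAT32 datatype as prefiltered_func_data (that file no longer exists, so this cannot be confirmed). The short Python check below just redoes the arithmetic with values copied from the fslhd output and the warning.

```python
# Values copied from the fslhd output above.
dim1, dim2, dim3, dim4 = 86, 72, 40, 276
nbyper = 4  # bytes per voxel for FLOAT32

# Values copied from the nifti_read_buffer warning.
bytes_needed = 990720
bytes_input = 439262

volume_bytes = dim1 * dim2 * dim3 * nbyper   # one 3D volume
series_bytes = volume_bytes * dim4           # all 276 volumes, uncompressed

print(volume_bytes == bytes_needed)          # True
print(bytes_needed - bytes_input)            # 551458, the "number missing"
print(round(bytes_input / bytes_needed, 3))  # 0.443
```

So under that assumption the failing read was asking for a single volume and received roughly 44% of it.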


fslstats -r -R:

bogdan@bob:/home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat$ fslstats prefiltered_func_data.nii.gz -r -R
16.176001 1493.583984 0.000000 2696.000000 

After the script failed I looked for /tmp/fsl_F4Bd5B_mc/fmri_mcf.nii.gz but couldn't find it; in fact, the directory /tmp/fsl_F4Bd5B_mc/ did not exist.

We run FSL on an Ubuntu Linux cluster where all our nodes are diskless and mount their filesystems over NFS.  A memory stress test was conducted to ensure there were no hardware errors on the processing nodes.  In addition, two independent servers (hosting RAID 1 arrays) with identical configurations have been used as the master nodes for the cluster, and both produce the same errors with these processing nodes under similar conditions (i.e. running multiple simultaneous analyses), so the hardware is unlikely to be related to these errors.  fslerrorreport was run in the directory containing the fsl_motion_outliers input.  Below is the output:

bogdan@bob:/home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat$ fslerrorreport 
/usr/local/fsl/bin/fslerrorreport: 92: quota: not found
cat: /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/report.log: No such file or directory
cat: /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/design.mat: No such file or directory
cat: /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/design.con: No such file or directory
cat: /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/design.fts: No such file or directory
ls: cannot access /home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat/stats: No such file or directory

######################################################################
#### MACHINE INFORMATION
######################################################################

Uname:
Linux

df:
Filesystem           1K-blocks      Used Available Use% Mounted on
165.124.111.159:/home
                    6341360640 5295659264 723578880  88% /home

quota:

Memory and Swap info:
MemTotal:       16459800 kB
MemFree:         5555752 kB
Buffers:               0 kB
Cached:         10029180 kB
SwapCached:            0 kB
Active:          4490240 kB
Inactive:        5891500 kB
Active(anon):      14508 kB
Inactive(anon):   343900 kB
Active(file):    4475732 kB
Inactive(file):  5547600 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:                 4 kB
Writeback:            56 kB
AnonPages:        352536 kB
Mapped:            53980 kB
Slab:             144928 kB
SReclaimable:     111732 kB
SUnreclaim:        33196 kB
PageTables:        24448 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     8229900 kB
Committed_AS:    1026412 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      399000 kB
VmallocChunk:   34359338875 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:        7680 kB
DirectMap2M:    16760832 kB

######################################################################
#### ENVIRONMENT INFORMATION
######################################################################

FSLMACHTYPE=gnu_64-gcc4.4
FSLDIR=/usr/local/fsl
FSLTCLSH=/usr/local/fsl/bin/fsltclsh
FSLMULTIFILEQUIT=TRUE
FSLMACHINELIST=
FSLOUTPUTTYPE=NIFTI_GZ
FSLWISH=/usr/local/fsl/bin/fslwish
FSLREMOTECALL=
FSLCONFDIR=/usr/local/fsl/config
FSLLOCKDIR=

MATLABPATH=/usr/local/matlab_scripts:/usr/local/spm5:/usr/local/vbmtools
PATH=/home/sge/bin/lx24-amd64:/usr/local/freesurfer/bin:/usr/local/caret/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/home/sge/bin/lx24-amd64/:/usr/local/fsl/bin
MANPATH=/home/sge/man:/usr/share/man:/usr/local/share/man

######################################################################
#### DIRECTORY INFORMATION
######################################################################

PWD:
/home/lejian/Gamble_preprocessing/Results/funct_test/con002/Scan2/gamb1/gamb1.feat

total 384400
drwxr-xr-x 15 lejian ava     4096 Dec 14 12:31 .
drwxr-xr-x  6 lejian ava     4096 Nov 29 13:12 ..
-rw-r--r--  1 lejian ava    25348 Nov 30 10:08 confound_MWV.txt
-rw-r--r--  1 lejian ava    28074 Nov 30 10:08 confound_MWVG.txt
-rw-r--r--  1 lejian ava   357635 Dec 14 12:31 example_func.nii.gz
-rw-r--r--  1 lejian ava 46665793 Nov 30 10:25 filtered_MWV.nii.gz
-rw-r--r--  1 lejian ava 46664822 Nov 30 10:09 filtered_MWVG.nii.gz
-rw-r--r--  1 lejian ava 46663373 Nov 30 10:25 filtered_MWVG_ICA_outliner.nii.gz
-rw-r--r--  1 lejian ava 46661990 Nov 30 10:41 filtered_MWV_ICA_outliner.nii.gz
-rw-r--r--  1 lejian ava 60792601 Nov 30 10:03 filtered_func_data.nii.gz
-rw-r--r--  1 lejian ava 46340819 Nov 30 10:04 filtered_func_no_edge_data.nii.gz
-rw-r--r--  1 lejian ava     6639 Nov 30 10:04 filtered_func_no_edge_data_mask.nii.gz
-rw-r--r--  1 lejian ava     3997 Nov 30 09:54 mask.nii.gz
drwxr-xr-x  3 lejian ava     4096 Nov 30 09:47 mc
-rw-r--r--  1 lejian ava   220426 Nov 30 10:03 mean_func.nii.gz
-rw-r--r--  1 lejian ava 98591857 Dec 14 12:22 prefiltered_func_data.nii.gz
drwxr-x---  2 lejian ava    12288 Nov 23 21:21 prefiltered_func_data_mcf.mat
drwxr-x---  2 lejian ava    12288 Nov 24 09:13 prefiltered_func_data_mcf.mat+
drwxr-x---  2 lejian ava    12288 Nov 28 13:41 prefiltered_func_data_mcf.mat++
drwxr-x---  2 lejian ava    12288 Nov 28 15:08 prefiltered_func_data_mcf.mat+++
drwxr-x---  2 lejian ava    12288 Nov 28 17:00 prefiltered_func_data_mcf.mat++++
drwxr-x---  2 lejian ava    12288 Nov 28 17:49 prefiltered_func_data_mcf.mat+++++
drwxr-x---  2 lejian ava    12288 Nov 28 18:11 prefiltered_func_data_mcf.mat++++++
drwxr-x---  2 lejian ava    12288 Nov 29 14:15 prefiltered_func_data_mcf.mat+++++++
drwxr-x---  2 lejian ava    12288 Nov 30 09:46 prefiltered_func_data_mcf.mat++++++++
drwxr-xr-x  2 lejian ava     4096 Nov 30 09:44 reg
drwxr-xr-x  2 lejian ava     4096 Nov 30 10:08 roi
drwxr-xr-x  2 lejian ava     4096 Nov 30 10:08 seg

######################################################################
#### FEAT INFORMATION
######################################################################

Report log:

######################################################################

Design Matrix:

######################################################################

Contrast Matrix:

######################################################################

FTS Matrix:

######################################################################
#### Main directory:

total 384400
drwxr-xr-x 15 lejian ava     4096 Dec 14 12:31 .
drwxr-xr-x  6 lejian ava     4096 Nov 29 13:12 ..
-rw-r--r--  1 lejian ava    25348 Nov 30 10:08 confound_MWV.txt
-rw-r--r--  1 lejian ava    28074 Nov 30 10:08 confound_MWVG.txt
-rw-r--r--  1 lejian ava   357635 Dec 14 12:31 example_func.nii.gz
-rw-r--r--  1 lejian ava 46665793 Nov 30 10:25 filtered_MWV.nii.gz
-rw-r--r--  1 lejian ava 46664822 Nov 30 10:09 filtered_MWVG.nii.gz
-rw-r--r--  1 lejian ava 46663373 Nov 30 10:25 filtered_MWVG_ICA_outliner.nii.gz
-rw-r--r--  1 lejian ava 46661990 Nov 30 10:41 filtered_MWV_ICA_outliner.nii.gz
-rw-r--r--  1 lejian ava 60792601 Nov 30 10:03 filtered_func_data.nii.gz
-rw-r--r--  1 lejian ava 46340819 Nov 30 10:04 filtered_func_no_edge_data.nii.gz
-rw-r--r--  1 lejian ava     6639 Nov 30 10:04 filtered_func_no_edge_data_mask.nii.gz
-rw-r--r--  1 lejian ava     3997 Nov 30 09:54 mask.nii.gz
drwxr-xr-x  3 lejian ava     4096 Nov 30 09:47 mc
-rw-r--r--  1 lejian ava   220426 Nov 30 10:03 mean_func.nii.gz
-rw-r--r--  1 lejian ava 98591857 Dec 14 12:22 prefiltered_func_data.nii.gz
drwxr-x---  2 lejian ava    12288 Nov 23 21:21 prefiltered_func_data_mcf.mat
drwxr-x---  2 lejian ava    12288 Nov 24 09:13 prefiltered_func_data_mcf.mat+
drwxr-x---  2 lejian ava    12288 Nov 28 13:41 prefiltered_func_data_mcf.mat++
drwxr-x---  2 lejian ava    12288 Nov 28 15:08 prefiltered_func_data_mcf.mat+++
drwxr-x---  2 lejian ava    12288 Nov 28 17:00 prefiltered_func_data_mcf.mat++++
drwxr-x---  2 lejian ava    12288 Nov 28 17:49 prefiltered_func_data_mcf.mat+++++
drwxr-x---  2 lejian ava    12288 Nov 28 18:11 prefiltered_func_data_mcf.mat++++++
drwxr-x---  2 lejian ava    12288 Nov 29 14:15 prefiltered_func_data_mcf.mat+++++++
drwxr-x---  2 lejian ava    12288 Nov 30 09:46 prefiltered_func_data_mcf.mat++++++++
drwxr-xr-x  2 lejian ava     4096 Nov 30 09:44 reg
drwxr-xr-x  2 lejian ava     4096 Nov 30 10:08 roi
drwxr-xr-x  2 lejian ava     4096 Nov 30 10:08 seg

######################################################################
#### Stats directory:


######################################################################
#### Reg directory:

total 33424
drwxr-xr-x  2 lejian ava     4096 Nov 30 09:44 .
drwxr-xr-x 15 lejian ava     4096 Dec 14 12:31 ..
-rw-r--r--  1 lejian ava      138 Nov 30 09:40 example_func2highres.mat
-rw-r--r--  1 lejian ava 22971814 Nov 30 09:41 example_func2highres.nii.gz
-rw-r--r--  1 lejian ava  1519639 Nov 30 09:43 example_func2highres.png
-rw-r--r--  1 lejian ava   808985 Nov 30 09:41 example_func2highres1.png
-rw-r--r--  1 lejian ava   710605 Nov 30 09:43 example_func2highres2.png
-rw-r--r--  1 lejian ava      133 Nov 30 09:44 example_func2standard.mat
-rw-r--r--  1 lejian ava  2456896 Nov 30 09:44 example_func2standard.nii.gz
-rw-r--r--  1 lejian ava   477763 Nov 30 09:44 example_func2standard.png
-rw-r--r--  1 lejian ava   269418 Nov 30 09:44 example_func2standard1.png
-rw-r--r--  1 lejian ava   207975 Nov 30 09:44 example_func2standard2.png
-rw-r--r--  1 lejian ava  2382368 Nov 30 09:40 highres.nii.gz
-rw-r--r--  1 lejian ava      144 Nov 30 09:41 highres2example_func.mat
-rw-r--r--  1 lejian ava      142 Nov 30 09:44 highres2standard.mat
-rw-r--r--  1 lejian ava   964872 Nov 30 09:44 highres2standard.nii.gz
-rw-r--r--  1 lejian ava   448252 Nov 30 09:44 highres2standard.png
-rw-r--r--  1 lejian ava   226162 Nov 30 09:44 highres2standard1.png
-rw-r--r--  1 lejian ava   222143 Nov 30 09:44 highres2standard2.png
-rw-r--r--  1 lejian ava   414189 Nov 30 09:40 standard.nii.gz
-rw-r--r--  1 lejian ava      143 Nov 30 09:44 standard2example_func.mat
-rw-r--r--  1 lejian ava      143 Nov 30 09:44 standard2highres.mat

This error report is saved in the file: /tmp/fsl_m6UaYm.gz

I was wondering if anybody here had ever experienced anything of this nature or might have any ideas regarding the cause?

A copy of the entire directory containing the input files listed above can be found here:
http://apkarianlab.northwestern.edu/gamb1.feat.tar.gz

Thanks in advance for any help,

Bogdan Petre
Northwestern University

