On 03/02/11 23:00, Stephen Burke wrote:
> Testbed Support for GridPP member institutes
>> [mailto:[log in to unmask]] On Behalf Of Peter Grandi said:
>>> - Reason = Got a job held event,
>> reason: Globus error 21: the job manager failed to locate an
>> internal script argument file
>
> The closest I can see is this, it may be worth checking the causes listed here:
>
> http://goc.grid.sinica.edu.tw/gocwiki/Globus_error_22%3A_the_job_manager_failed_to_create_an_internal_script_argument_file
>
> Stephen
That's useful, thanks very much for that pointer. I could not find that page myself because the error number is different (22 on the page, 21 in our error message): I think it has a typo, or perhaps it is about a slightly different version of the software.
The suggestion to look at:
/var/log/cleanup-grid-accounts.log
has been useful: the cleanup scripts do get run, but there are occasional errors in the log, in particular for the user reporting the issue:
Cleaning up /home/pheno069
Cleaning up /home/pheno070
find: ./gram_job_mgr_23428.log: Input/output error
find: ./.lcgjm/pbsqueue.cache.proc.locked: Input/output error
d 0700 3 207070 207000 4096 Jul 22 2010 \
./.globus/job
./.globus/job: Directory not empty
Cleaning up /home/pheno071
This may be just a locked dir because there is an active job, but I feel that there are some filesystem problems here, and indeed there are corrupt directory entries in that user's home directory on CE02:
-rw-r--r-- 1 pheno070 pheno 16941 Jan 28 13:56 gram_job_mgr_16276.log
-rw-r--r-- 1 pheno070 pheno 16941 Jan 28 13:56 gram_job_mgr_16286.log
-rw-r--r-- 1 pheno070 pheno 15897 Jan 28 12:56 gram_job_mgr_15441.log
?--------- ? ? ? ? ? gram_job_mgr_23428.log
There is nothing like that on CE01 for that user; also, the CE01 version of the home dir is a lot smaller and "cleaner". This may not be the issue with "pheno" in general, but it looks like a good explanation for some anomalies on that specific CE.
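To compare home directories on the two CEs for entries like the "?---------" line above, something along these lines could be used. It is only a rough sketch: it assumes that such a line means the metadata lookup on that entry failed, and the function name unreadable_entries is made up for illustration.

```python
import os
import sys


def unreadable_entries(directory):
    """Return (name, errno) pairs for entries whose metadata cannot be read.

    Assumption: an entry that `ls -l` prints as "?---------" is one
    for which lstat() fails, for example with EIO (Input/output error).
    """
    bad = []
    for name in sorted(os.listdir(directory)):
        try:
            os.lstat(os.path.join(directory, name))
        except OSError as exc:
            bad.append((name, exc.errno))
    return bad


if __name__ == "__main__":
    # Directories to check are given on the command line,
    # e.g. a pool account home such as /home/pheno070.
    for directory in sys.argv[1:]:
        for name, err in unreadable_entries(directory):
            print(f"{directory}/{name}: {os.strerror(err)}")
```

Running it against the same pool account home on CE01 and CE02 would show whether the damaged entries are confined to one CE.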
BTW the most likely cause of the corruption is delays due to CE01 and CE02 being VMs with a virtual disk on an NFS server; occasionally the latency causes virtual IO errors. This is going to be fixed.
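To get an idea of how often these IO errors show up, the cleanup log could be summarised per home directory. The sketch below assumes the log only contains lines in the two forms seen in the excerpt above ("Cleaning up <home>" and "find: <path>: Input/output error"); the real log may have other forms, and the function name io_errors_by_home is made up for illustration.

```python
import re
from collections import defaultdict

# Line formats assumed from the excerpt of /var/log/cleanup-grid-accounts.log.
CLEANUP_RE = re.compile(r"^Cleaning up (\S+)$")
FIND_ERR_RE = re.compile(r"^find: (.+): Input/output error$")


def io_errors_by_home(lines):
    """Map each home directory to the paths find reported IO errors for."""
    errors = defaultdict(list)
    current_home = None
    for line in lines:
        line = line.strip()
        match = CLEANUP_RE.match(line)
        if match:
            # A new "Cleaning up" line starts the section for that home.
            current_home = match.group(1)
            continue
        match = FIND_ERR_RE.match(line)
        if match and current_home is not None:
            errors[current_home].append(match.group(1))
    return dict(errors)


# Sample taken from the log lines quoted earlier in this message.
SAMPLE = """\
Cleaning up /home/pheno069
Cleaning up /home/pheno070
find: ./gram_job_mgr_23428.log: Input/output error
find: ./.lcgjm/pbsqueue.cache.proc.locked: Input/output error
./.globus/job: Directory not empty
Cleaning up /home/pheno071
"""

if __name__ == "__main__":
    for home, paths in io_errors_by_home(SAMPLE.splitlines()).items():
        print(home, len(paths), "IO errors")
```

Other errors such as "Directory not empty" are deliberately ignored here, since they may only indicate an active job.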