Hi *
I have been trying to debug a VERY nasty problem here for the last
few days.
It goes like this: for a significant number of jobs coming to NIKHEF, we
get failures & resubmissions; these can be traced to important files or
directories suddenly "disappearing" (no, my cache cleaner script is
turned off ;-)
After days of tracing through logs, what appears to happen is this
(see the sketch after the list for how I correlate the events):
- job is submitted to the CE
- process audit logs show repeated 'qstat' calls
  for the appropriate PBS id
- a new job shows up for the same pool user
- gass-utils is called, first with the 'query' option and
  then with the 'cache-cleanup' option
- no more 'qstat' calls are seen for this job!
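In case it helps anyone reproduce this, here is roughly the
correlation I'm doing (a minimal sketch only; the audit-log layout
and the ids are placeholders for our local setup, so the parsing
will need adjusting):

  #!/usr/bin/env python
  # Sketch only: pull the qstat and gass activity for one job out of
  # a line-oriented audit log assumed to look like
  #   <epoch> <user> <command> <args...>
  # Our real audit format differs; adjust the parsing accordingly.
  import sys

  pbs_id, pool_user, logfile = sys.argv[1:4]

  for line in open(logfile):
      fields = line.split(None, 3)
      if len(fields) < 4:
          continue
      stamp, user, cmd, args = fields
      if cmd.endswith("qstat") and pbs_id in args:
          print(stamp, user, "qstat for", pbs_id)
      elif user == pool_user and "gass" in cmd:
          # any gass activity by the same pool user; the
          # 'cache-cleanup' call is the one that precedes the trouble
          print(stamp, user, cmd, args.strip())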
Further inspection hints that the qstat caching might be involved,
although there is no direct evidence. The mechanism would be
something like this: the grid-manager (or monitor? I can never
remember which) sees a trace somewhere off in gass-cache space
indicating that the job is running (timestamps changed, or output
present), while the qstat cache still shows that same job in a
'queued' state. The manager gets confused, either because it is
looking for the job specifically in the 'R' state or because the
states observed via the files and via qstat differ; it figures
something is wrong and aborts the job.
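To make the suspicion concrete, here is the kind of logic I imagine
(pure speculation about the manager internals; the function and the
decision rule are invented for illustration, NOT the actual
grid-manager code):

  # Made-up illustration of the suspected race, not real code.
  def manager_check(qstat_state, gass_says_running):
      """Cross-check the cached batch-system state against the
      evidence in gass-cache space (output present, timestamps)."""
      if gass_says_running and qstat_state != 'R':
          # The two views disagree: the files say "running", the
          # (stale) qstat cache still says "queued".  A manager
          # that expects the views to be consistent might decide
          # the job is broken and abort it -- which matches what
          # we observe.
          return 'abort'
      return 'ok'

  # fresh qstat info, states agree -> fine
  assert manager_check('R', True) == 'ok'
  # stale cache entry still says 'Q' -> the job gets killed
  assert manager_check('Q', True) == 'abort'

If the qstat cache lifetime is longer than the time it takes a job
to start writing output, that second case is exactly the window we
would keep hitting.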
Either this has been very well hidden, or it started around the time of
the LCG 2.6.0 upgrade. Is there something in the grid-manager behavior
that has changed recently?
Also, does anything in the job lifecycle depend on the R-GMA job
monitoring and the associated LB schema changes? Since we don't yet
have legal clearance to broadcast all the info into R-GMA, we have
not yet applied these updates; will that break anything?
Thanks for all help and suggestions,
J "would be pulling my hair out if I still had any" T