We have been suffering from a problem with our lcg-CE (which is not on
the same machine as our maui/torque server). Some of our jobs are run
multiple times, i.e. for a single gLite submission (a single EDG job
id) we have several identical jobs running in the batch system,
starting within seconds of each other (at least in the case we looked
at in more detail).
Has anybody seen this before, or does anyone have an idea of what may
be happening?
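For what it's worth, a quick way to spot such duplicates on the batch side
is to count how many batch jobs carry the same grid job name. This is only
a rough sketch with made-up sample records; on a real server you would feed
in `qstat -f` output or the Torque accounting logs instead:

```shell
# Mock input: batch job id followed by the grid job name (format invented
# for illustration only).
cat <<'EOF' > /tmp/sample_jobs.txt
19626 jobname=https://lb.example.org:9000/AbCdEf123
19627 jobname=https://lb.example.org:9000/AbCdEf123
19628 jobname=https://lb.example.org:9000/XyZ987
EOF
# Any grid job name appearing more than once is a suspect duplicate.
awk '{print $2}' /tmp/sample_jobs.txt | sort | uniq -c | awk '$1 > 1'
```

On the mock data above this prints the one grid job name that was
submitted twice.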
During the EGEE years this has been seen at a few sites with various
batch systems (also Condor and LSF) and AFAIK the trouble has always
been caused by the batch system, i.e. not by the middleware.
One thing we've seen is that for the 'ghost' jobs there is no gridmap-*
log file in /opt/edg/var/gatekeeper/. That is, there is one entry for
one job and the rest leave no trace. Looking at the dgas code that
creates this entry, we have seen that the code actually fails earlier,

What code fails earlier?

when trying to find a file for the job under
/opt/edg/var/gatekeeper/jobs/ (so no entry of the form
1278506637:lcgpbs:internal_3902709601:19626.1278506629 is found). Who
should create this file? I have seen some dgas code in the job manager
perl code but didn't find the relevant code... also, who defines
GATEKEEPER_DGAS_FD and writes it into
/var/log/globus-gatekeeper.log? Answers to these questions (and a pointer
to the source code :)) might allow us to dig further if nobody has a
better clue.
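The cross-check described above can be sketched roughly as follows. The
directory layout, file names and the gridmap log naming are mocked up for
illustration (the real files live under /opt/edg/var/gatekeeper/), so
adjust the patterns to whatever your gatekeeper actually writes:

```shell
# Mock up the gatekeeper directories in a temp dir.
base=$(mktemp -d)
mkdir -p "$base/jobs"
touch "$base/jobs/1278506637:lcgpbs:internal_3902709601:19626.1278506629"
touch "$base/jobs/1278506700:lcgpbs:internal_3902709999:19627.1278506690"
# Only the first job got a gridmap log in this mock-up.
touch "$base/gridmap-19626.log"
# For each per-job file, flag jobs with no matching gridmap-* log ('ghosts').
for f in "$base"/jobs/*; do
  pbsid=${f##*:}          # last colon-separated field, e.g. 19626.1278506629
  short=${pbsid%%.*}      # numeric batch id, e.g. 19626
  if ! ls "$base"/gridmap-"$short".* >/dev/null 2>&1; then
    echo "no gridmap log for batch job $short"
  fi
done
```

On the mock data this flags batch job 19627 only; in your case it is the
other direction (jobs in the batch system with no file under jobs/), but
the same kind of loop applies.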
The gatekeeper creates the intermediate file and the environment variable.
Code:
http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/fabric_mgt/gridification/lcg-gatekeeper/
The tag is VDT1_6_0x86_rhas_4_LCG-3 (sic) for the gatekeeper
and VDT1_6_0x86_rhas_4_LCG-1 for the "standard" job managers.
For completeness, the "lcg" job managers are here:
http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/lcg-extra-jobmanagers/
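If it helps while digging through that code, you can confirm where
GATEKEEPER_DGAS_FD ends up by grepping the gatekeeper log. The log line
format below is invented for illustration, so check your real
/var/log/globus-gatekeeper.log for the actual format:

```shell
# Mock log with an invented line format.
cat <<'EOF' > /tmp/mock-gatekeeper.log
Notice: 5: GATEKEEPER_DGAS_FD=7
Notice: 5: Got connection from 192.0.2.10
EOF
# Pull out just the variable assignment.
grep -o 'GATEKEEPER_DGAS_FD=[0-9]*' /tmp/mock-gatekeeper.log
```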