Hi Maarten,

Thanks for the info. I was off for a few days, so I couldn't reply earlier. We will follow this up and report back, but I'm not sure how long it will take: the problem does not look easy and we are short of personnel over the summer. It may be related to communication problems between the batch system and the WNs or the CE (of which we've had some hints in the past).

Regarding your question:
One thing we've seen is that for the 'ghost' jobs there is no gridmap-* 
log file in /opt/edg/var/gatekeeper/. That is, there is one entry for 
one job and the rest leave no trace. Looking at the dgas code that 
creates this entry, we have seen that actually the code fails earlier, 
    

What code fails earlier?
I meant that 'dgas-add-record' is not able to write the file under '/opt/edg/var/gatekeeper/', but the failure occurs not at write time but earlier, when it tries to read the corresponding file under '/opt/edg/var/gatekeeper/jobs/'. That is why I asked who creates that file. I'll now look at the pointers you gave me to see if I can find out more.
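
For anyone who wants to check the same thing on their own CE, here is a minimal sketch that lists what is present in the two locations mentioned above. The directory names and the 'gridmap-*' pattern come from this thread; the function name 'gatekeeper_inventory' is made up for illustration, and the sketch assumes nothing about the contents of the files or about how 'dgas-add-record' itself works.

```python
import glob
import os


def gatekeeper_inventory(base="/opt/edg/var/gatekeeper"):
    """Return (gridmap_logs, job_entries) as sorted lists of file names.

    gridmap_logs: regular files directly under `base` matching gridmap-*.
    job_entries:  everything found under `base`/jobs (empty if it is missing).
    """
    gridmap_logs = sorted(
        os.path.basename(path)
        for path in glob.glob(os.path.join(base, "gridmap-*"))
        if os.path.isfile(path)
    )
    jobs_dir = os.path.join(base, "jobs")
    # A missing jobs/ directory is reported as "no entries" rather than an error.
    job_entries = sorted(os.listdir(jobs_dir)) if os.path.isdir(jobs_dir) else []
    return gridmap_logs, job_entries
```

Calling 'gatekeeper_inventory()' on the CE and comparing the two lists with the jobs known to the batch system should show which jobs left no trace in either place.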

Cheers,
    Antonio


We have been suffering from a problem with our lcg-CE (which is not on 
the same machine as our maui/torque server). Some of our jobs are run 
multiple times, i.e. for a single glite submission (a single EDG job 
id), we have several identical jobs running in the batch system, 
starting within seconds of each other (at least for the case we looked 
at in more detail).

Has anybody seen this before, or does anybody have an idea of what may be happening?
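
To make the symptom concrete, the following is a rough sketch of how such duplicates could be spotted once the mapping between grid job ids and batch jobs is available. The input format, a list of (grid job id, batch job id, start time in epoch seconds) tuples, is an assumption for illustration only; how to extract those tuples from the torque or gatekeeper logs is not covered here, and the function name is hypothetical.

```python
from collections import defaultdict


def find_duplicate_submissions(records):
    """Group batch jobs by grid job id and report ids that ran more than once.

    records: iterable of (grid_job_id, batch_job_id, start_epoch_seconds).
    Returns {grid_job_id: {"batch_ids": [...], "spread_seconds": int}}
    for every grid job id that maps to more than one batch job.
    """
    by_grid_id = defaultdict(list)
    for grid_id, batch_id, start in records:
        by_grid_id[grid_id].append((start, batch_id))

    duplicates = {}
    for grid_id, runs in by_grid_id.items():
        if len(runs) > 1:
            runs.sort()  # order by start time, earliest first
            duplicates[grid_id] = {
                "batch_ids": [batch_id for _, batch_id in runs],
                # time between the first and the last start of the same grid job
                "spread_seconds": runs[-1][0] - runs[0][0],
            }
    return duplicates
```

A small spread (a few seconds) for the same grid job id would match the behaviour described above.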
    

During the EGEE years this has been seen at a few sites with various
batch systems (also Condor and LSF) and AFAIK the trouble has always
been caused by the batch system, i.e. not by the middleware.

  
One thing we've seen is that for the 'ghost' jobs there is no gridmap-* 
log file in /opt/edg/var/gatekeeper/. That is, there is one entry for 
one job and the rest leave no trace. Looking at the dgas code that 
creates this entry, we have seen that actually the code fails earlier, 
    

What code fails earlier?

  
when trying to find a file for the job under 
/opt/edg/var/gatekeeper/jobs/ (so no entry of the form 
1278506637:lcgpbs:internal_3902709601:19626.1278506629 is found). Who 
should create this file? I have seen some dgas code in the job manager 
perl code but didn't find the relevant code... also, who defines 
GATEKEEPER_DGAS_FD and writes it into
/var/log/globus-gatekeeper.log? Answers to these questions (and a pointer 
to the source code :)) might allow us to dig further if nobody has a 
better clue.
    

The gatekeeper creates the intermediate file and the environment variable.
Code:

http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/fabric_mgt/gridification/lcg-gatekeeper/

The tag is VDT1_6_0x86_rhas_4_LCG-3 (sic) for the gatekeeper
and VDT1_6_0x86_rhas_4_LCG-1 for the "standard" job managers.

For completeness, the "lcg" job managers are here:

http://jra1mw.cvs.cern.ch/cgi-bin/jra1mw.cgi/lcg-extra-jobmanagers/