Hi,
As Maarten says, we had a problem that looked similar to this one. To
check whether it is indeed the same, you could look in
/var/log/messages on the CE for 'dgas' error messages like:
Jul 22 15:17:51 lcg02 dgas-add-record[26141]: /opt/lcg/sbin/dgas-add-record: cannot open '/opt/edg/var/gatekeeper/jobs/1279804636:lcgpbs:internal_284714274:25092.1279804632': No such file or directory
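A minimal sketch of that check in Python, in case it is handy. It assumes
the messages look like the sample above and that each one sits on a single
line in the log file; the function names are only illustrative, and other
dgas messages may well be worded differently.

```python
import re

# Pattern modelled on the single sample message above (an assumption:
# other dgas-add-record errors may use different wording).
DGAS_MISSING = re.compile(
    r"dgas-add-record\[\d+\]:.*cannot open '([^']+)': No such file or directory"
)


def find_missing_job_files(lines):
    """Return the job-file paths that dgas-add-record could not open."""
    paths = []
    for line in lines:
        match = DGAS_MISSING.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def scan_log(log_path="/var/log/messages"):
    """Scan one syslog file and return the missing job-file paths."""
    # errors="replace" keeps the scan going if the log has odd bytes in it.
    with open(log_path, errors="replace") as log:
        return find_missing_job_files(log)
```

Calling scan_log() on the CE (reading /var/log/messages normally needs
root) returns one path per matching message; an empty list would suggest
the symptom is not the same as ours.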
We've also observed problems copying output from the WN to the CE, but
not only for duplicate jobs, and in any case this should in principle be
a consequence, not a cause.
The problem first appeared sometime in June and got worse as the load
on the CE increased. It was then solved completely by a reboot of the
CE (the whole machine, not just the daemons) on August 5th.
However, and this is new information for Maarten too, the problem
appeared again two months later (October 10th). We solved it again with
a reboot. I apologize that, after solving the issue, we forgot to report
this... I'll add it to the bug now.
Since we were not aware of any other change on the machine, we
originally thought the problem was somehow caused by a disk enlargement
(we use LVM), but since it has reappeared, we now think it must be due
to something (logs, memory, processes?) growing too much... but we have
no clue what that could be :(
We're not extremely worried, though, since it seems to be controllable
by rebooting the CE once every two months, and we are hoping CREAM will
not suffer from the same problem.
Cheers,
Antonio
> Hallo Ralph,
>
>> we have a mysterious problem.
>> We run an extra torque server and a lcg-CE.
>> After an update of the torque server (operating system update), suddenly
>> all jobs are submitted several times to the system.
>>
>> A job arrives at the lcg-CE and is passed to the torque server (other
>> machine). Torque accepts the job and runs it.
>> The logs both on the CE and torque look normal. But after about half a
>> minute the same job (same Grid ID) is submitted again to the lcg-CE. And
>> again and again. The same job is submitted 9 times.
>>
>> The jobs then fail when trying to copy their output from the WN to the CE:
>> from WN /var/log/messages:
>> Nov 3 17:36:03 r1i1n15 pbs_mom: sys_copy, command '/usr/bin/scp -rpB
>> /var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU
>> [log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout'
>> failed with status=1, giving up after 4 attempts
>> Nov 3 17:36:03 r1i1n15 pbs_mom: req_cpyfile, Unable to copy file
>> /var/spool/pbs/spool/5137506.service0.ice.zih.tu-dresden.de.OU to
>> [log in to unmask]:/home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/stdout
>>
>> The reason is that the directory
>> /home/ziops022/.globus/job/desdemona.zih.tu-dresden.de/32270.1288802028/
>> on the CE does not exist anymore.
> A few months ago the CIEMAT site reported a similar problem.
> After a lot of debugging effort the LCG-CE was rebooted and
> the problem has not returned since: did you try a reboot?
>
> We never understood the cause, but we did provide a version
> of the "lcg" job managers that has a protection against
> multiple submissions of the same job:
>
> https://savannah.cern.ch/patch/?4331
>
> But your CE is using the standard Globus job managers...