>> we have a strange problem here after the installation of a new kernel:
>> incoming jobs are submitted several times to the batch system.
>>
>> A job is accepted by the gatekeeper and submitted to our torque.
>> Immediately after the submission the job is again submitted and so on
>> until there are 11 torque jobs.
>>
>> Another strange thing is that even if these 11 jobs are still waiting to
>> be executed in torque (as the queues are full) WMS tells me that they
>> were aborted. So it seems to me they somehow fail at once and thus are
>> resubmitted by WMS.
>
> The evidence simply suggests that each job immediately fails,
> i.e. before the user payload gets started.
> By default the WMS will then do up to 10 shallow resubmissions.
Indeed, in /opt/edg/var/gatekeeper/grid-jobmap_20091221 the WMS job ID
appears 11 times, confirming the job was resubmitted 10 times.
On the CE, job submission does not work, e.g. for "ops":
bash-3.00$ echo date | qsub -q gridexpr_scli
2961328.service0-ib0.ice.zih.tu-dresden.de
bash-3.00$ qstat -f 2961328.service0-ib0.ice.zih.tu-dresden.de
qstat: Unknown Job Id 2961328.service0.ice.zih.tu-dresden.de
Note the slight difference in the job ID returned by qstat:
service0-ib0 --> service0
I suspect your network configuration got screwed by the kernel upgrade.
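
To make the mismatch easier to spot, here is a minimal sketch that compares the server part of the two job IDs from the transcript above. The helper names job_server and same_server are made up for illustration and are not part of Torque. It only inspects the ID strings and assumes the usual "<sequence>.<server>" form.

```shell
# Hypothetical helper: print the server part of a Torque job ID,
# i.e. everything after the first dot in "<sequence>.<server>".
job_server() {
    printf '%s\n' "${1#*.}"
}

# Hypothetical helper: succeed only if both job IDs name the same server.
same_server() {
    [ "$(job_server "$1")" = "$(job_server "$2")" ]
}

# Job ID as returned by qsub, and as echoed back in the qstat error.
submitted="2961328.service0-ib0.ice.zih.tu-dresden.de"
queried="2961328.service0.ice.zih.tu-dresden.de"

if same_server "$submitted" "$queried"; then
    echo "server names match"
else
    echo "server name mismatch: $(job_server "$submitted") vs $(job_server "$queried")"
fi
```

If the two names differ as they do here, the next thing to check is how the CE resolves its own host name on each interface.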