Hello,
in our case the PBS response times are sometimes very long. We then have
to drain the CREAM CEs for several hours, and since this affects all CREAM
CEs it could stop production.
Interestingly, the lcg-CE doesn't have this problem. Maybe there is some
difference in how the two handle caching and resubmission (or output
retrieval) of jobs when PBS responds badly?
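
To put numbers on "very long", here is a minimal sketch of a probe that times one batch command against the 200 second limit from the error message. It is only an illustration: the helper name time_command and the choice of "qstat -B" as the probe command are assumptions, and it does not reproduce what BLAH does internally.

```python
import subprocess
import sys
import time

# 200 s is the timeout quoted in the error message; the rest is hypothetical.
BLAH_TIMEOUT_S = 200.0


def time_command(cmd, limit_s=BLAH_TIMEOUT_S):
    """Run cmd and return (elapsed_seconds, finished_within_limit)."""
    start = time.monotonic()
    try:
        # Output is discarded; only the response time matters here.
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=limit_s,
            check=False,
        )
        finished = True
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising this exception.
        finished = False
    return time.monotonic() - start, finished


if __name__ == "__main__":
    # Assumed probe: ask the PBS server for its status. Any other
    # command can be passed on the command line instead.
    probe = sys.argv[1:] or ["qstat", "-B"]
    elapsed, ok = time_command(probe)
    state = "answered" if ok else "timed out"
    print(f"{' '.join(probe)}: {state} after {elapsed:.1f} s")
```

Run from cron on the CE, the printed times would show how often the batch server comes close to the limit before jobs start failing.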
Regards
Dimitri
On 09/30/2010 01:35 PM, Massimo Sgaravatto - INFN Padova wrote:
> Hi Dimitri
>
> I can't see how bug #55427 could be relevant for this problem.
> I am also not able to find in that bug any mention of a configuration
> parameter to select asynchronous registration.
>
> At any rate, as we already discussed some time ago with you and the
> other KIT guys, that error means that after the submission command
> (bsub/qsub/...) was issued, the lrms jobid was not returned within
> 200 seconds.
>
>
> If you want to increase that timeout, this is possible. The relevant
> attribute is:
>
> blah_child_poll_timeout
>
> in /opt/glite/etc/blah.config
>
>
> But 200 seconds (more than 3 minutes) seems a lot to me and I wouldn't
> suggest doing that.
>
>
> When there are such problems in the batch server, probably the right
> thing to do is to disable new job submissions using the
> glite-ce-disable-submission command (assuming that you are an admin of
> that CE).
> See: http://grid.pd.infn.it/cream/field.php?n=Main.HowToDrainACREAMCE
> for more details
>
> Please have a look also at this page:
>
> http://grid.pd.infn.it/cream/field.php?n=Main.Self-limitingCREAMBehavior
>
> which explains the details of the machinery that automatically disables
> new job submissions when certain conditions are met.
>
> Cheers, Massimo
>
>
>
> On Thu, 30 Sep 2010, Nilsen Dimitri wrote:
>
>> Hi,
>>
>> from time to time we observe this error message for jobs at our CREAM CEs:
>> (stderr: execute_cmd: 200 seconds timeout expired, killing child
>> process.)
>> I also found a bug, https://savannah.cern.ch/bugs/?55427, which says
>> there is a configuration parameter to select asynchronous
>> registration. Could you tell me where I can find this parameter and
>> what it is called?
>>
>> Also, some jobs fail due to: "blah error: send command timeout". Could
>> there be a relation between these two errors?
>>
>> Regards
>> Dimitri
>>
>> --
>> Dimitri Nilsen, Dipl.-Ing(FH)
>>
>> Karlsruhe Institute of Technology (KIT)
>> Steinbuch Centre for Computing
>> Postfach 3640
>> 76344 Eggenstein-Leopoldshafen, Germany
>>
>> Tel.: +49 7247 82-8607
>> Fax.: +49 7247 82-4972
>> Email: [log in to unmask]
>>
>
> \|||/
> -----------0oo----( o o )----oo0-------------------
> (_)
> INFN Sezione di Padova
> Via Marzolo, 8
> 35131 Padova - Italy E-mail: massimo.sgaravatto [at] pd.infn.it
> Tel: ++39 0498275908 Skype: massimo.sgaravatto
> Fax: ++39 0498275952
--
Dimitri Nilsen, Dipl.-Ing(FH)
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing
Postfach 3640
76344 Eggenstein-Leopoldshafen, Germany
Tel.: +49 7247 82-8607
Fax.: +49 7247 82-4972
Email: [log in to unmask]