Hi Maarten,
thanks for your fast reply.
On 07/23/2010 11:15 PM, [log in to unmask] wrote:
> Hello Christoph,
>
>
>> after some time of rather quiet WMS operation we have (again) one WMS
>> server that is stuck in changing job states. It might just by chance be
>> the one server that we upgraded to the most recent WMS release 3.1.29
>> (following the details in the release notes, of course). A few test
>> jobs went through, but after some production work the machine seems to
>> be stuck.
>>
>> There are some jobs on the WMS that are still in the Running or
>> Scheduled state although we know that they are done (even the output
>> sandbox is on the WMS!). New jobs stay in the infamous Ready/unavailable
>>
> Did you check this page:
>
> http://goc.grid.sinica.edu.tw/gocwiki/Jobs_sent_to_some_CE_stay_in_Ready_state_forever
>
>
I just checked it. Since the problem is independent of the CE the job is
submitted to, it seems to be a WMS issue, but the page gives no hints as
to what it actually could be. Firewall issues are ruled out as well,
since the effect is the same for the local CEs, which are not firewalled.
As said, the WMS worked for about a day after the upgrade and is stuck now.
>> for many hours. Trying to understand what is happening to those jobs, we
>> see that they make it to the actual Condor submit but remain in the
>> Condor state Idle for whatever reason.
>>
> Any clues in the Condor logfiles? You can raise the logging level and
> logfile sizes in the Condor configuration file, then restart Condor-G.
>
>
Nothing obvious to me, but I am not really used to reading that stuff.
There are quite a few "Cannot cancel job from queue" messages in
/var/glite/logmonitor/CondorG.log/*. Maybe too many cancel requests are
stuck and are now blocking the whole system.
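If that suspicion is right, the piling up should be visible by counting
those messages per log file. A quick sketch of what I am running (the
path is the one above; the helper name is my own):

```shell
count_cancel_msgs() {
    # Print, for every file in the given LogMonitor log directory, how
    # often the "Cannot cancel job from queue" message occurs in it.
    local dir=$1 f
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        printf '%s: %s\n' "$f" "$(grep -c 'Cannot cancel job from queue' "$f")"
    done
}

# On the WMS itself this would be:
count_cancel_msgs /var/glite/logmonitor/CondorG.log
```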
How do I change the logging of the Condor components? I discovered some
settings at the end of part 2 of /opt/condor-7.4.1/etc/condor_config that
seem to address Condor logging, but I am uncertain how to change the
parameters.
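From a quick look at the Condor manual, the relevant knobs appear to be
the per-daemon *_DEBUG and MAX_*_LOG macros, e.g. (values just
illustrative):

```
# Full debug output for the gridmanager (the Condor-G part) and a
# larger log file before rotation; analogous macros exist per daemon,
# e.g. SCHEDD_DEBUG / MAX_SCHEDD_LOG.
GRIDMANAGER_DEBUG = D_FULLDEBUG
MAX_GRIDMANAGER_LOG = 50000000
```

Presumably a condor_reconfig (or the Condor-G restart you suggested)
makes the daemons pick this up, but please correct me if that is wrong.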
Thanks anyway,
Christoph