Hi Yves
Nothing very interesting from the ps, I'm afraid.
If they are still alive, then you should find a file called
RunTransform.log in the job's working directory. There might also be
a file called GridWrapper.log. Send me these (off the list) and I can
have a look at what's gone wrong.
BTW, you should definitely kill off orphaned processes like 15908. (I
think that Alessandra put a howto on the sysadmin wiki.)
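For reference, here's a minimal sketch of how one might spot such orphans (this is an assumption about the usual approach, not the wiki howto itself): anything owned by atlasprd that has been reparented to init (PPID 1) has escaped pbs_mom and is a candidate — but always eyeball the list before killing anything. The PIDs below are just the ones from this thread, fed in as a captured snippet.

```shell
# Hedged sketch: pick out atlasprd processes whose parent is init (PPID 1),
# i.e. processes that have escaped pbs_mom. Demonstrated against a captured
# ps snippet rather than the live process table.
find_orphans() {
    # expects "user pid ppid ..." columns, as from: ps -eo user,pid,ppid,cmd
    awk '$1 == "atlasprd" && $3 == 1 { print $2 }'
}

printf '%s\n' \
    'atlasprd 15908     1 0 Nov09 ?  00:05:01 python' \
    'atlasprd 27603 23385 0 12:02 ?  00:00:00 sleep 9600' \
    | find_orphans        # prints 15908
# On a live node, check each reported PID by hand and only then:
#   kill <pid>    (resorting to kill -9 only if it ignores SIGTERM)
```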
Cheers
Graeme
On 15 Nov 2007, at 13:36, Yves Coppens wrote:
> Hi Graeme,
>
> I've attached the output of ps auxwww for node epcf25. The output for
> the other nodes is similar. There is a hanging process that has
> escaped pbs_mom:
>
> atlasprd 15908 0.0 0.0 29084 728 ? S Nov09 5:01 python
>
> I should have trapped it :(
>
> Thanks for info and help,
>
> Yves
>
> On Thu, 15 Nov 2007, Graeme Stewart wrote:
>
>> Hi Yves
>>
>> The cronus executor has been shut down. Production jobs you are
>> seeing will be coming from the standard EGEE grid LEXOR executor.
>>
>> Have these jobs consumed CPU yet, or are they trying to get started?
>>
>> I agree this is a terrible waste of sites' resources and that has
>> been a big motivating factor in the decision to move ATLAS production
>> to PanDA. Because PanDA stages input datasets on the site's SE and
>> puts outputs onto the site's SE as well (all other data movements
>> happen asynchronously via ATLAS DDM), we will not see the large data
>> management timeouts which currently cripple ATLAS production in EGEE.
>>
>> If you send me the output from ps auxwww I'll try and see what the
>> jobs are doing. It's possible you can kill them off - but please
>> don't do it yet.
>>
>> Thanks
>>
>> Graeme
>>
>> PS. Yes, we also see inefficient atlasprd jobs at Glasgow.
>>
>>
>> On 15 Nov 2007, at 12:47, Yves Coppens wrote:
>>
>>> Hello,
>>>
>>> While investigating why we were failing the Atlas test again, I found
>>> (once more) that many prd atlas jobs are sleeping.
>>>
>>> [root@epcf25 root]# ps -ef | grep sleep
>>> atlasprd 27603 23385 0 12:02 ? 00:00:00 sleep 9600
>>> atlasprd 27604 23386 0 12:02 ? 00:00:00 sleep 9600
>>> root 27712 8088 0 12:17 pts/0 00:00:00 grep sleep
>>>
>>> [root@epcf28 root]# ps -ef | grep sleep
>>> atlasprd 19537 6438 0 10:11 ? 00:00:00 sleep 9600
>>> atlasprd 19667 19352 0 11:42 ? 00:00:00 sleep 9600
>>> root 19873 19716 0 12:18 pts/0 00:00:00 grep sleep
>>> [root@epcf28 root]#
>>>
>>> and the same on three other worker nodes!
>>>
>>> I issued a GGUS ticket (25848) back in August about this, but no one
>>> has addressed it yet. Are they using CRONUS and is it really that bad!?
>>>
>>> I do not think this has anything to do with my failing Steve's test:
>>> the failure is caused by a missing file which is actually available
>>> in the Atlas software area on all my workers. I shall take this
>>> offline with Frederic.
>>>
>>> Are VOs really claiming that pilot jobs are necessary because they
>>> allow them to make more effective use of resources?
>>>
>>> We should definitely do wall time accounting rather than CPU time
>>> accounting.
>>>
>>> Have other sites seen this too?
>>>
>>> Yves
>>
>> --
>> Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
>> ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/
>> <epcf25.psauxwww>
--
Dr Graeme Stewart - http://wiki.gridpp.ac.uk/wiki/User:Graeme_stewart
ScotGrid - http://www.scotgrid.ac.uk/ http://scotgrid.blogspot.com/