Hi Andreas,
Thank you very much. I have also seen these errors in the WNs. I will
continue to monitor it.
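
For the monitoring, here is a minimal sketch of how the Torque accounting
records could be scanned for jobnames that belong to more than one job ID.
It assumes each "E" (job end) record sits on a single line in the form
quoted below (date;E;jobid;key=value ...), optionally with the file name
prefix that grep adds. The function names are made up for illustration and
are not part of Torque or CREAM.

```python
import re
from collections import defaultdict

# Matches "E" (job end) records in the form quoted in this thread.
# The optional "20140824:" style prefix is the file name prepended by grep.
RECORD = re.compile(
    r"^(?:\d{8}:)?(?P<when>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2});E;"
    r"(?P<jobid>[^;]+);(?P<attrs>.*)$"
)
JOBNAME = re.compile(r"(?:^|\s)jobname=(\S+)")


def jobs_by_jobname(lines):
    """Map each jobname to the list of Torque job IDs that ended with it."""
    seen = defaultdict(list)
    for line in lines:
        m = RECORD.match(line.strip())
        if not m:
            continue  # not an "E" record, or not in the assumed layout
        name = JOBNAME.search(m.group("attrs"))
        if name:
            seen[name.group(1)].append(m.group("jobid"))
    return seen


def duplicated_jobnames(lines):
    """Return only the jobnames that belong to more than one job ID."""
    return {
        name: ids
        for name, ids in jobs_by_jobname(lines).items()
        if len(set(ids)) > 1
    }
```

Fed with the lines of an accounting file, duplicated_jobnames() would
report cream_542031943 together with its three job IDs, and those could
then be compared with the lrmsID values recorded in blahp.log.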
Cheers,
Carles
On 09/05/2014 11:35 AM, Andreas Gellrich wrote:
> Hi Carles,
> at DESY-HH we have been seeing this as well. And if I remember
> correctly NIKHEF (Jeff?) confirmed a similar observation once.
>
> We spent quite some time trying to understand what's going on but we were
> not successful. There are indications that this is due to timing
> problems / race conditions. A connected (?) observation is that WNs
> fail to copy back sandboxes because the corresponding dir on the
> CREAM-CE is gone (torque error 15059).
>
> Cheers,
> Andreas
>
> On Fri, 5 Sep 2014, Carles Acosta wrote:
>
>> Dear all,
>>
>> I am a bit confused by a behaviour we have discovered in our
>> cream-CE+Torque system. It is easier to explain with an example:
>>
>> In our cream-CE, we see this entry (and just this entry corresponding
>> to this lrmsID and clientID) in the blahp.log-20140824 log file.
>>
>> "timestamp=2014-08-24 06:41:37" "userDN=/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo2/CN=531497/CN=Robot: ATLAS Pilot2"
>> "userFQAN=/atlas/Role=production/Capability=NULL"
>> "userFQAN=/atlas/Role=NULL/Capability=NULL"
>> "userFQAN=/atlas/lcg1/Role=NULL/Capability=NULL"
>> "userFQAN=/atlas/usatlas/Role=NULL/Capability=NULL"
>> "ceID=ce07.pic.es:8443/cream-pbs-mcore_sl6_atlas"
>> "jobID=CREAM542031943" "lrmsID=28935676.pbs04.pic.es" "localUser=42008"
>> "clientID=cream_542031943"
>>
>> However, in the Torque logs, I can see 3 different jobs with the same
>> jobname that corresponds to the above clientID:
>>
>> 20140824:08/24/2014 12:55:23;E;28935635.pbs04.pic.es;user=atprd008
>> group=atprd jobname=cream_542031943 queue=mcore_sl6_atlas
>> ctime=1408862221 qtime=1408862221 etime=1408862221 start=1408873325
>> [log in to unmask]
>> exec_host=td712.pic.es/15+td712.pic.es/14+td712.pic.es/13+td712.pic.es/12+td712.pic.es/11+td712.pic.es/10+td712.pic.es/9+td712.pic.es/8
>> Resource_List.neednodes=1:ppn=8 Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=8
>> Resource_List.walltime=107:00:00 session=17263 end=1408877723
>> Exit_status=0 resources_used.cput=05:08:52
>> resources_used.mem=9663200kb resources_used.vmem=14835684kb
>> resources_used.walltime=01:02:56
>>
>> 20140824:08/24/2014 18:24:39;E;28935586.pbs04.pic.es;user=atprd008
>> group=atprd jobname=cream_542031943 queue=mcore_sl6_atlas
>> ctime=1408861996 qtime=1408861996 etime=1408861996 start=1408872595
>> [log in to unmask]
>> exec_host=td601.pic.es/15+td601.pic.es/14+td601.pic.es/13+td601.pic.es/12+td601.pic.es/11+td601.pic.es/10+td601.pic.es/9+td601.pic.es/1
>> Resource_List.neednodes=1:ppn=8 Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=8
>> Resource_List.walltime=107:00:00 session=7058 end=1408897479
>> Exit_status=1 resources_used.cput=03:16:56
>> resources_used.mem=9564740kb resources_used.vmem=14638964kb
>> resources_used.walltime=06:08:39
>>
>> 20140824:08/24/2014 19:11:41;E;28935676.pbs04.pic.es;user=atprd008
>> group=atprd jobname=cream_542031943 queue=mcore_sl6_atlas
>> ctime=1408862446 qtime=1408862446 etime=1408862446 start=1408873812
>> [log in to unmask]
>> exec_host=td614.pic.es/15+td614.pic.es/14+td614.pic.es/13+td614.pic.es/12+td614.pic.es/11+td614.pic.es/10+td614.pic.es/9+td614.pic.es/8
>> Resource_List.neednodes=1:ppn=8 Resource_List.nodect=1
>> Resource_List.nodes=1:ppn=8
>> Resource_List.walltime=107:00:00 session=24806 end=1408900301
>> Exit_status=1 resources_used.cput=07:33:45
>> resources_used.mem=9816308kb resources_used.vmem=14863868kb
>> resources_used.walltime=06:32:26
>>
>> Maybe I am wrong, but I believed that the jobID/clientID in the
>> cream-CE was used to set the jobname in Torque. Thus, we have 3
>> jobs with the same jobname on the same day, with the same user and
>> from the same cream-CE, but the jobs have different queue times and
>> were executed in different ways on different WNs. From the cream-CE,
>> I can only track one job, the last one; the other two are not
>> registered in any logfile.
>>
>> I am still checking, but this issue seems to affect only a small
>> percentage of the total jobs received.
>>
>> What do you think? Is this expected behaviour and I am just wrong,
>> or is it some kind of misunderstanding between the cream-CEs and Torque?
>>
>> Thank you in advance.
>>
>> Best regards,
>>
>> Carles
>>
>> --
>> Carles Acosta i Silva
>> PIC (Port d'Informació Científica)
>> Campus UAB, Edifici D
>> E-08193 Bellaterra, Barcelona
>> Tel: +34 93 581 33 22
>> Fax: +34 93 581 41 10
>> http://www.pic.es
>> Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html
>>
>>
>
> # Andreas Gellrich
> # DESY IT / Grid Computing
> # 2b/317, Notkestr. 85, D-22607 Hamburg, +49 40 8998 2732
--
Carles Acosta i Silva
PIC (Port d'Informació Científica)
Campus UAB, Edifici D
E-08193 Bellaterra, Barcelona
Tel: +34 93 581 33 22
Fax: +34 93 581 41 10
http://www.pic.es
Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html