Hi Carles,
At DESY-HH we have been seeing this as well, and if I remember correctly
NIKHEF (Jeff?) once confirmed a similar observation.
We spent quite some time trying to understand what is going on, but were
not successful. There are indications that this is due to timing
problems / race conditions. A possibly related observation is that WNs
fail to copy back sandboxes because the corresponding directory on the
CREAM-CE is gone (Torque error 15059).
Cheers,
Andreas
On Fri, 5 Sep 2014, Carles Acosta wrote:
> Dear all,
>
> I am a bit confused by a behaviour we discovered in our CREAM-CE + Torque system. It is easiest to explain with an example:
>
> In our CREAM-CE, the blahp.log-20140824 log file contains this entry (and only this entry for this lrmsID and clientID):
>
> "timestamp=2014-08-24 06:41:37" "userDN=/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo2/CN=531497/CN=Robot: ATLAS Pilot2" "userFQAN=/atlas/Role=production/Capability=NULL" "userFQAN=/atlas/Role=NULL/Capability=NULL"
> "userFQAN=/atlas/lcg1/Role=NULL/Capability=NULL" "userFQAN=/atlas/usatlas/Role=NULL/Capability=NULL" "ceID=ce07.pic.es:8443/cream-pbs-mcore_sl6_atlas" "jobID=CREAM542031943" "lrmsID=28935676.pbs04.pic.es" "localUser=42008"
> "clientID=cream_542031943"
>
> However, in the Torque logs I can see 3 different jobs whose jobname corresponds to the above clientID:
>
> 20140824:08/24/2014 12:55:23;E;28935635.pbs04.pic.es;user=atprd008 group=atprd jobname=cream_542031943 queue=mcore_sl6_atlas ctime=1408862221 qtime=1408862221 etime=1408862221 start=1408873325 [log in to unmask]
> exec_host=td712.pic.es/15+td712.pic.es/14+td712.pic.es/13+td712.pic.es/12+td712.pic.es/11+td712.pic.es/10+td712.pic.es/9+td712.pic.es/8 Resource_List.neednodes=1:ppn=8 Resource_List.nodect=1 Resource_List.nodes=1:ppn=8
> Resource_List.walltime=107:00:00 session=17263 end=1408877723 Exit_status=0 resources_used.cput=05:08:52 resources_used.mem=9663200kb resources_used.vmem=14835684kb resources_used.walltime=01:02:56
>
> 20140824:08/24/2014 18:24:39;E;28935586.pbs04.pic.es;user=atprd008 group=atprd jobname=cream_542031943 queue=mcore_sl6_atlas ctime=1408861996 qtime=1408861996 etime=1408861996 start=1408872595 [log in to unmask]
> exec_host=td601.pic.es/15+td601.pic.es/14+td601.pic.es/13+td601.pic.es/12+td601.pic.es/11+td601.pic.es/10+td601.pic.es/9+td601.pic.es/1 Resource_List.neednodes=1:ppn=8 Resource_List.nodect=1 Resource_List.nodes=1:ppn=8
> Resource_List.walltime=107:00:00 session=7058 end=1408897479 Exit_status=1 resources_used.cput=03:16:56 resources_used.mem=9564740kb resources_used.vmem=14638964kb resources_used.walltime=06:08:39
>
> 20140824:08/24/2014 19:11:41;E;28935676.pbs04.pic.es;user=atprd008 group=atprd jobname=cream_542031943 queue=mcore_sl6_atlas ctime=1408862446 qtime=1408862446 etime=1408862446 start=1408873812 [log in to unmask]
> exec_host=td614.pic.es/15+td614.pic.es/14+td614.pic.es/13+td614.pic.es/12+td614.pic.es/11+td614.pic.es/10+td614.pic.es/9+td614.pic.es/8 Resource_List.neednodes=1:ppn=8 Resource_List.nodect=1 Resource_List.nodes=1:ppn=8
> Resource_List.walltime=107:00:00 session=24806 end=1408900301 Exit_status=1 resources_used.cput=07:33:45 resources_used.mem=9816308kb resources_used.vmem=14863868kb resources_used.walltime=06:32:26
>
> Maybe I am wrong, but I believed that the jobID/clientID in the CREAM-CE was used to set the jobname in Torque. So we have 3 jobs with the same jobname on the same day, from the same user and the same CREAM-CE,
> but the jobs have different queue times and were executed on different WNs. From the CREAM-CE I can only track one job, the last one; the other two are not registered in any log file.
>
> I am still checking, but this issue affects a small percentage of the total jobs received.
>
> What do you think? Is this expected behaviour and I am just mistaken, or is it some kind of miscommunication between the CREAM-CE and Torque?
>
> Thank you in advance.
>
> Best regards,
>
> Carles
>
> --
> Carles Acosta i Silva
> PIC (Port d'Informació Científica)
> Campus UAB, Edifici D
> E-08193 Bellaterra, Barcelona
> Tel: +34 93 581 33 22
> Fax: +34 93 581 41 10
> http://www.pic.es
> Avís - Aviso - Legal Notice: http://www.ifae.es/legal.html
>
>
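PS: To get a rough count of how widespread this is, here is a minimal sketch. It is not site-specific; it only assumes Torque accounting "E" records shaped like the ones quoted above (date;record_type;job_id;key=value ...), and the sample records are trimmed copies of those. It flags CREAM clientIDs whose jobname shows up under more than one distinct Torque job ID:

```python
# Sketch: spot Torque jobs that share one CREAM clientID (jobname).
# Assumes accounting "E" records like the ones quoted above; the
# record layout is an assumption, adjust for your site.
import re
from collections import defaultdict

def duplicate_jobnames(lines):
    """Return {jobname: set of Torque job IDs} for jobnames seen
    in more than one distinct 'E' (job exit) record."""
    seen = defaultdict(set)
    for line in lines:
        # Accounting record: date;record_type;job_id;key=value ...
        parts = line.split(";", 3)
        if len(parts) < 4 or parts[1] != "E":
            continue
        job_id = parts[2]
        m = re.search(r"jobname=(\S+)", parts[3])
        if m:
            seen[m.group(1)].add(job_id)
    # Keep only jobnames that ran as more than one distinct Torque job
    return {name: ids for name, ids in seen.items() if len(ids) > 1}

# Trimmed copies of the three records quoted above:
records = [
    "20140824:08/24/2014 12:55:23;E;28935635.pbs04.pic.es;user=atprd008 jobname=cream_542031943 queue=mcore_sl6_atlas",
    "20140824:08/24/2014 18:24:39;E;28935586.pbs04.pic.es;user=atprd008 jobname=cream_542031943 queue=mcore_sl6_atlas",
    "20140824:08/24/2014 19:11:41;E;28935676.pbs04.pic.es;user=atprd008 jobname=cream_542031943 queue=mcore_sl6_atlas",
]
for name, ids in duplicate_jobnames(records).items():
    print(name, sorted(ids))
```

On the three sample records it reports cream_542031943 with three distinct Torque job IDs; run over a full day's accounting file it should give roughly the percentage Carles mentions.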
# Andreas Gellrich
# DESY IT / Grid Computing
# 2b/317, Notkestr. 85, D-22607 Hamburg, +49 40 8998 2732