Hi Winnie,
This might be a bit of an obvious question, but are you sure the
'missing' jobs weren't simply submitted from the other CEs (if they
submit to the same batch system - I don't know your setup), assuming
lrmsID is jobid + ce_name?
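
One rough way to check this, as a sketch only: pull the lrmsID values out of
the grid-jobmap text and list the ranges of job numbers that never appear.
This assumes the "lrmsID=<number>.<host>" layout seen in the lines you quoted;
the function name find_gaps is just my own label, not part of any grid tool.

```python
import re
from collections import defaultdict

# Assumed format, taken from the quoted grid-jobmap lines:
#   "lrmsID=220472.lcgce04.phy.bris.ac.uk"
LRMS_RE = re.compile(r'lrmsID=(\d+)\.([A-Za-z0-9.-]+)')


def find_gaps(text):
    """Return {host: [(first_missing, last_missing), ...]}.

    'host' is whatever follows the job number in the lrmsID field.
    Only gaps between the lowest and highest logged number are reported.
    """
    ids_by_host = defaultdict(set)
    for number, host in LRMS_RE.findall(text):
        ids_by_host[host].add(int(number))

    gaps = {}
    for host, ids in ids_by_host.items():
        ordered = sorted(ids)
        # Any jump larger than 1 between neighbours is a run of unlogged jobs.
        gaps[host] = [(a + 1, b - 1)
                      for a, b in zip(ordered, ordered[1:])
                      if b - a > 1]
    return gaps
```

If you feed it the contents of today's grid-jobmap file from lcgce04 alone and
then the concatenated files from every CE that submits to the same batch
system, any gap that closes in the second run would be jobs that came in
through another CE. Gaps that remain are the ones that really were not logged.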
Daniela
On 3 December 2010 11:52, Winnie Lacesso <[log in to unmask]> wrote:
> Happy Friday!
>
> An lcg-CE had several hundred jobs flood in in a short period of time last
> night. Dozens are stuck so I'm cancelling & notifying the user. In finding
> out who, I noticed some job IDs don't show up when doing, e.g.,
>
> grep 220511 /opt/edg/var/gatekeeper/grid-jobmap_`date +%Y%m%d`
> <nothing>
> But tracejob knows about it & it's running (stuck).
>
> In looking into grid-jobmap_`date +%Y%m%d`, it looks like stretches of
> jobs are not being logged there - tens or even a stretch of 100 jobs - nor
> in /var/log/messages either. Does that matter? Bit of a surprise, I had
> thought every job would be logged.
> E.g. these are adjacent lines in the grid-jobmap for last night:
>
> "localUser=70006" "userDN=/C=RU/O=RDIG/OU=users/OU=sinp.msu.ru/CN=Andrey
> Belyaev" "userFQAN=/cms/Role=NULL/Capability=NULL"
> "jobID=https://lb006.cnaf.infn.it:9000/wa9i_xopgzQm8DtshospCg"
> "ceID=lcgce04.phy.bris.ac.uk:2119/jobmanager-lcgpbs-medium"
> "lrmsID=220472.lcgce04.phy.bris.ac.uk" "timestamp=2010-12-03 01:37:43"
>
> "localUser=70006" "userDN=/C=RU/O=RDIG/OU=users/OU=sinp.msu.ru/CN=Andrey
> Belyaev" "userFQAN=/cms/Role=NULL/Capability=NULL"
> "jobID=https://lb006.cnaf.infn.it:9000/ex6bjeAN_KP3JBtLrF4ndA"
> "ceID=lcgce04.phy.bris.ac.uk:2119/jobmanager-lcgpbs-long"
> "lrmsID=220474.lcgce04.phy.bris.ac.uk" "timestamp=2010-12-03 01:37:54"
>
> 220473 is completely missing; but tracejob knows about it
> Then a stretch of more missing to 483:
>
> "localUser=70006" "userDN=/C=RU/O=RDIG/OU=users/OU=sinp.msu.ru/CN=Andrey
> Belyaev" "userFQAN=/cms/Role=NULL/Capability=NULL"
> "jobID=https://lb006.cnaf.infn.it:9000/lu_2vRaIuLwpol-WIP3IcQ"
> "ceID=lcgce04.phy.bris.ac.uk:2119/jobmanager-lcgpbs-medium"
> "lrmsID=220483.lcgce04.phy.bris.ac.uk" "timestamp=2010-12-03 01:38:23"
>
> Then the very next line in grid-jobmap, timestamped 10 min later, is 608.
> So over 100 are not in grid-jobmap (many are still queued, some running -
> stuck):
>
> "localUser=65621" "userDN=/DC=ch/DC=cern/OU=Organic
> Units/OU=Users/CN=asciaba/CN=430796/CN=Andrea Sciaba"
> "userFQAN=/cms/Role=lcgadmin/Capability=NULL"
> "userFQAN=/cms/Role=NULL/Capability=NULL"
> "userFQAN=/cms/TEAM/Role=NULL/Capability=NULL"
> "userFQAN=/cms/dbs/Role=NULL/Capability=NULL"
> "jobID=https://wms206.cern.ch:9000/J7CGwYBPE23R4xqymiqCjA"
> "ceID=lcgce04.phy.bris.ac.uk:2119/jobmanager-lcgpbs-express"
> "lrmsID=220608.lcgce04.phy.bris.ac.uk" "timestamp=2010-12-03 01:48:52"
>
> So there was quite a flood of jobs in 10 min. Does it matter
> (grid-accounting-wise) that they're not being logged in grid-jobmap?
> I can find out who the user of these stuck jobs is from other (also stuck)
> jobs that are logged.
>
> On our new CREAM-CE the logs show about 40 jobs arriving every minute for
> stretches but in a quick glance no gaps show up in the grid-jobmap files.
>
> Grateful for Enlightenment!
>
--
-----------------------------------------------------------
[log in to unmask]
HEP Group/Physics Dep
Imperial College
Tel: +44-(0)20-75947810
http://www.hep.ph.ic.ac.uk/~dbauer/