Hi Daniela,
For ICE jobs you should be able to get some more information by using the 'queryDb' tool (part of glite-wms-ice-3.1.53-1.slc4 on the RAL WMSes) - see 'queryDb --help'.
For Condor jobs, 'condor_q' would do the job (see the '-long' and '-constraint' options), e.g.
condor_q -constraint 'x509userproxysubject == "<user_DN>"'
condor_q -constraint 'regexp("<user_DN>",x509userproxysubject)'
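If you want a per-state breakdown for that user, something along these lines should do it (a sketch only - JobStatus is Condor's standard integer job attribute, and <user_DN> is a placeholder for the actual DN):

# Count the user's jobs per Condor state
# (JobStatus: 1=Idle, 2=Running, 3=Removed, 4=Completed, 5=Held)
condor_q -constraint 'regexp("<user_DN>",x509userproxysubject)' \
         -format "%d\n" JobStatus | sort | uniq -c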
Also, have a look at /opt/glite/bin/glite-wms-stats.py on a glite-WMS and you'll find the numeric codes for the job states (I do believe they are the same in the LB):
...
self.JobStates = {'1':  'Submitted',
                  '2':  'Waiting',
                  '3':  'Ready',
                  '4':  'Scheduled',
                  '5':  'Running',
                  '6':  'Done',
                  '7':  'Cleared',
                  '8':  'Aborted',
                  '9':  'Cancelled',
                  '10': 'Unknown',
                  '11': 'Purged'}
...
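With those codes, a per-state count straight from the LB MySQL database should be possible along the lines of the sketch below. I'm going from memory on the stock LB schema here (database 'lbserver20', a 'states' table with an integer 'status' column, MySQL user 'lbserver') - do verify against server.sql for your LB version:

# Count LB jobs per numeric state; map the numbers using the
# JobStates table above. The database/table/user names are
# assumptions from the stock LB schema - check server.sql.
mysql -u lbserver -p lbserver20 \
      -e 'SELECT status, COUNT(*) AS njobs FROM states GROUP BY status;'

Restricting that to a single user's DN would need a join against the user/job tables, whose layout varies between LB versions, so check the schema first.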
Regards,
Catalin Condurache
RAL Tier1 Grid Services
> -----Original Message-----
> From: LHC Computer Grid - Rollout [mailto:[log in to unmask]]
> On Behalf Of Daniela Bauer
> Sent: 07 February 2011 17:01
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] libglite_wms_dli.so: Cannot allocate
> memory (but there's GBs of it lying around !)
>
> Hi All,
>
> So coming back to my WMS - the firewall might have made a difference
> (it seems to take longer to fail), but that's not it.
>
> I get (on wms02.grid.hep.ph.ic.ac.uk):
> 07 Feb, 10:47:22 -W: [Warning]
> resolve_filemapping_info(dli_utils.cpp:200): cannot load DLI_SI
> helper lib libglite_wms_dli.so
> 07 Feb, 10:47:22 -W: [Warning]
> resolve_filemapping_info(dli_utils.cpp:201): dlerror returns:
> libcrypto_gcc32dbg.so.0: failed to map segment from shared object:
> Cannot allocate memory
> 07 Feb, 10:47:22 -W: [Warning]
> resolve_filemapping_info(dli_utils.cpp:335): cannot perform
> lfn:/grid/t2k.org/nd280/raw/ND280/ND280/00005000_00005999/nd280_00005000_0000.daq.mid.gz's resolution
>
> Restarting gLite seems to help for a while (though the machine never
> claims to run out of memory, so it's not something I can monitor
> easily). I've tried to see how many jobs of this type the WMS is
> actually handling at the point when it falls over, but I cannot come
> up with a suitable MySQL query (are the LB job statuses documented
> anywhere -- all I get is a number?). All I know is that the LB has
> ~13000 jobs from this user, but I can't figure out how to sort them
> by status, as a lot of them are clearly done and forgotten.
>
> Looking at the WMS I get:
> [root@wms02 glite]# /opt/glite/bin/queryStats
> JOB_REGISTERED=6519
> JOB_IDLE=6519
> JOB_RUNNING=5911
> JOB_REALLY-RUNNING=5830
> JOB_CANCELLED=196
> JOB_DONE-OK=5356
> JOB_DONE-FAILED=101
> JOB_ABORTED=88
>
> Does anybody have a creative idea?
>
> I've got a cron job that restarts the WMS every 6 hours; I hope
> that'll keep it up.
>
> Cheers,
>
> Daniela
>
>
> On 26 January 2011 16:09, Daniela Bauer <[log in to unmask]> wrote:
>
>
> Hi Maarten et al,
>
> Both WMSes are real machines: wms01 runs Scientific Linux SL
> release 4.8 (Beryllium) and wms02 runs CentOS release 4.8 (Final).
>
> I'll see if the firewall makes a difference.
>
> Cheers,
>
> Daniela
>
>
>
> On 26 January 2011 15:33, Maarten Litmaath <[log in to unmask]> wrote:
>
>
> Hi Daniela,
>
> > My WMS (wms01.hep.ph.ic.ac.uk) currently seems to be unable to
> > handle jobs that look up a file location on an LFC. It works for a
> > couple of jobs (so in principle it can handle it), but fails soon
> > with the following error in the workload_manager_events.log:
> >
> > 24 Jan, 18:19:44 -W: [Warning]
> > resolve_filemapping_info(dli_utils.cpp:200): cannot load DLI_SI
> > helper lib libglite_wms_dli.so
> > 24 Jan, 18:19:44 -W: [Warning]
> > resolve_filemapping_info(dli_utils.cpp:201): dlerror returns:
> > libglite_wms_dli.so: failed to map segment from shared object:
> > Cannot allocate memory
> > 24 Jan, 18:19:44 -W: [Warning]
> > resolve_filemapping_info(dli_utils.cpp:335): cannot perform
> > lfn:/grid/t2k.org/nd280/raw/ND280/ND280/00005000_00005999/nd280_00005000_0000.daq.mid.gz's resolution
> >
> > The memory usage of the workload manager is about 18% at the time
> > (the machine has 16GB). Restarting it (I just restart everything:
> > /etc/init.d/gLite restart) helps briefly, but it's a matter of
> > minutes rather than hours before it fails again.
> >
> > This is the offending bit in the JDL file:
> > DataRequirements = {
> >   [
> >     DataCatalogType = "DLI";
> >     DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/";
> >     InputData = {"lfn:/grid/t2k.org/nd280/raw/ND280/ND280/00005000_00005999/nd280_00005000_0000.daq.mid.gz"};
> >   ]
> > };
> > DataAccessProtocol = {"gsiftp","gridftp","rfio"};
> >
> >
> > The WMS is up to date:
> > [root@wms01 glite]# rpm -qa | grep glite-WMS
> > glite-WMS-3.1.30-0.slc4
>
>
> What OS does the machine run? Is it a real machine or a VM?
>
> For me this JDL works e.g. via gswms01.cern.ch (which you can use too):
>
> -----------------------------------------------------------------------------
> JobType = "Normal";
> Executable = "/bin/hostname";
> StdOutput = "hello.out";
> StdError = "hello.err";
> InputSandbox = {"/etc/group"};
> OutputSandbox = {"hello.out","hello.err"};
> RetryCount = 0;
>
> DataRequirements = {
>   [
>     DataCatalogType = "DLI";
>     DataCatalog = "http://lfc.gridpp.rl.ac.uk:8085/";
>     InputData = {"lfn:/grid/t2k.org/nd280/raw/ND280/ND280/00005000_00005999/nd280_00005000_0000.daq.mid.gz"};
>   ]
> };
> DataAccessProtocol = {"gsiftp","gridftp","rfio"};
>
> -----------------------------------------------------------------------------
>
> I have matched it 100 times and always get a result that looks OK
> for the "ops" VO that I was using:
>
>
> ==========================================================================
>
> COMPUTING ELEMENT IDs LIST
> The following CE(s) matching your job requirements have been found:
>
> *CEId*
> - ce02.esc.qmul.ac.uk:2119/jobmanager-lcgsge-lcg_long
> - ce02.esc.qmul.ac.uk:2119/jobmanager-lcgsge-lcg_short
> - ce03.esc.qmul.ac.uk:2119/jobmanager-lcgsge-lcg_short
> - ce04.esc.qmul.ac.uk:8443/cream-sge-lcg_long
> - lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-grid500M
> - lcgce05.gridpp.rl.ac.uk:8443/cream-pbs-gridS
> - ce01.esc.qmul.ac.uk:2119/jobmanager-lcgsge-lcg_test
>
>
> ==========================================================================
>
>
>
>
>
> --
> -----------------------------------------------------------
> [log in to unmask]
> HEP Group/Physics Dep
> Imperial College
> Tel: +44-(0)20-75947810
>
> http://www.hep.ph.ic.ac.uk/~dbauer/
>
>
>
>
>
> --
> -----------------------------------------------------------
> [log in to unmask]
> HEP Group/Physics Dep
> Imperial College
> Tel: +44-(0)20-75947810
> http://www.hep.ph.ic.ac.uk/~dbauer/