On Thu, May 05, 2005 at 07:36:32PM +0200 or thereabouts, Maarten Litmaath wrote:
> Dear RLS Team,
> where are those requests coming from? Perhaps a single bad application
> is responsible for the whole mess!
I had a look around on some nodes running atlas jobs to see if they were
doing anything particularly interesting.
One alarming (though unrelated) thing is that we have loads of old lcg-*
processes which have been left behind by jobs that have long since vanished.
lcg0473.gridpp.rl.ac.uk is currently running no atlas jobs, but:
atlas002 11284 0.0 0.1 9476 3680 ? S Apr18 0:00 /stage/lcg-sl3-exp/LCG-2_4_0/lcg/bin/lcg-cp --vo atlas gsiftp://castor.grid.sinica.edu.tw/castor/grid.sinica.edu.tw/grid/atlas/datafiles/rome/digit/rome.004618.digit.ExBH5TeV/rome.004618.digit.ExBH5TeV._00171.pool.root.1 file:/pool/atlas002_585666.csflnx353.rl.ac.uk/condorg_l7YIUG/rundir/rome.004618.digit.ExBH5TeV._00171.pool.root
atlas002 18907 0.0 0.1 9436 3668 ? S Apr20 0:00 /stage/lcg-sl3-exp/LCG-2_4_0/lcg/bin/lcg-cp --vo atlas gsiftp://castorgrid.ific.uv.es/castor/ific.uv.es/grid/atlas/datafiles/rome/digit/rome.004813.digit.JF7_pythia_jet_filter/rome.004813.digit.JF7_pythia_jet_filter._02969.pool.root.1 file:/pool/atlas002_591161.csflnx353.rl.ac.uk/condorg_mfzTXG/rundir/rome.004813.digit.JF7_pythia_jet_filter._02969.pool.root
However, a netstat shows no active connections to the RLS, only
to the castor servers.
What is perhaps more significant is that there are quite a
few lcg-lg processes around.
They all appear to belong to
/C=IT/O=INFN/OU=Personal Certificate/L=Milano/CN=Silvia [log in to unmask]
I would say these jobs are a strong candidate. RAL has had the lcg- utilities
on a shared file system for some time now, but we have only been having
problems with the file system being overloaded recently, since this
user started running. Looking at the jobs, they always seem to be
doing some kind of lcg- operation and not much else, as far as I have
noticed.
Steve
>
> ________________________________
>
> From: Dirk Duellmann
> Sent: Thu 5/5/2005 5:04 PM
> To: users-rls (users of the CERN rls)
> Cc: James Casey; Jamie Shiers; Guido Negri; Miguel Anjo
> Subject: Re: RLS information
>
>
>
> Dear All,
>
> we have already restarted the ATLAS RLS several times, but until
> either the number of requests is decreased on the ATLAS side or the
> RLS application is improved, a stable service cannot be achieved.
> Please let us know if ATLAS could limit the number of RLS requests so
> that at least some useful work can be done and other users of the RLS
> database are not affected.
>
> Cheers, Dirk
>
> On 4 May 2005, at 20:06, Miguel Anjo wrote:
>
> > Dear users,
> >
> > After spending more than 15 hours on the problem (since 4am this
> > morning), the database team that supports RLS has made the application
> > as stable as possible on the database and application server side (for
> > which we are responsible), to cope with the unexpectedly increased
> > workload on the system.
> >
> > This application is performing several 'SELECT COUNT(*)' queries over
> > very large tables, and queries using the 'LIKE' keyword in the WHERE
> > clause that cause the database to read tables of about 600 MB every
> > time. The machine hosting the database has run out of memory and the
> > queries have slowed down.
> >
> > Another problem with the RLS application is its lack of support for
> > bulk inserts, which forces a commit, and hence a physical write to
> > disk, after every single insert, causing many database 'log file sync'
> > wait events that cannot be avoided.
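As an aside, the commit-per-insert pattern described above, versus the bulk
alternative the RLS lacks, can be sketched roughly as follows. This is a
minimal illustration only, using SQLite as a stand-in for the Oracle backend;
the `replica` table and its columns are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE replica (guid TEXT PRIMARY KEY, pfn TEXT)")

entries = [("guid-%04d" % i, "srm://se.example.org/file%04d" % i)
           for i in range(1000)]

# What the message describes RLS doing: one INSERT followed by one COMMIT
# per entry, so every registration pays a synchronous log write
# (Oracle's 'log file sync' wait):
#
#   for guid, pfn in entries:
#       conn.execute("INSERT INTO replica VALUES (?, ?)", (guid, pfn))
#       conn.commit()
#
# The bulk alternative: batch all the inserts and commit once, amortising
# the log flush over the whole batch.
conn.executemany("INSERT INTO replica VALUES (?, ?)", entries)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM replica").fetchone()[0]
print(count)  # 1000
```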
> >
> > As a consequence, the enormous number of calls to the RLS application
> > makes the application server unavailable for further connections (as
> > it is waiting for the database).
> >
> > The only way at the moment to resolve the problem would be to fix the
> > 'bugs' in the RLS application, which is not in our hands (and in any
> > case RLS is no longer developed).
> >
> > Possible workarounds to attenuate the load include, obviously,
> > decreasing the number of calls to RLS (as was requested this morning),
> > and escaping the '_' character in filenames as '\_', so that the query
> > uses 'WHERE filename =' instead of 'WHERE filename LIKE' and can be
> > performed using an index.
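To illustrate the escaping point: '_' is a single-character wildcard in SQL
LIKE, so an unescaped filename pattern can over-match as well as defeat the
index. A small sketch, again with SQLite standing in for the Oracle backend
(the `lfn` table and the second filename are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lfn (filename TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO lfn VALUES (?)", [
    ("rome.004618.digit._00171.pool.root",),
    ("rome.004618.digit.X00171.pool.root",),  # '_' as a wildcard also matches this
])

# Unescaped: '_' matches any single character, so both rows come back.
unescaped = conn.execute(
    "SELECT filename FROM lfn"
    " WHERE filename LIKE 'rome.004618.digit._00171.pool.root'"
).fetchall()
print(len(unescaped))  # 2

# Escaped: '\_' is a literal underscore, so only the exact file matches;
# as the message above notes, the query then behaves like
# 'WHERE filename =' and the database can use the index.
escaped = conn.execute(
    "SELECT filename FROM lfn"
    " WHERE filename LIKE 'rome.004618.digit.\\_00171.pool.root' ESCAPE '\\'"
).fetchall()
print(len(escaped))  # 1
```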
> >
> > Cheers,
> > Oracle support team
> >
>
>
--
Steve Traylen
[log in to unmask]
http://www.gridpp.ac.uk/