Hi,

Thanks, there's a fair amount there to go on, so a few follow-up questions.

What are suggested reasonable times for the automatic job purging? We've got between a couple of thousand and about 30,000 jobs for our newest (in production today) and oldest Creams.

Are you using a custom nagios check command for the file count in registry.npudir? A quick check of the default ones didn't seem to have that functionality, and I can always knock one up, but if someone else already has... (A rough sketch of what I had in mind is at the bottom of this mail.)

Any suggestions of mysql tunings for the creamdb would be very welcome - I'm by no means a mysql expert. (What I'm thinking of trying is also at the bottom, so shout if it looks daft.)

I know you can split the creamce and blah parts onto separate nodes and even have one blah parser support multiple CreamCEs, but isn't that just putting in a single point of (likely) failure?

We've got 3 Cream CEs with between 6 and 8GB of RAM, so I'm hoping that will be enough.

Thanks,
Chris.

> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-[log in to unmask]] On Behalf Of Stuart Purdie
> Sent: 06 September 2011 15:06
> To: [log in to unmask]
> Subject: Re: CreamCE Tuning
>
> On 6 Sep 2011, at 13:35, Chris Brew wrote:
>
> > Hi All,
> >
> > Having replaced all our CEs with CreamCEs we've been having problems keeping them up for extended periods of time.
> >
> > They seem to run fine for a few days to a week or so before falling over with "The endpoint is blacklisted" errors (generally), but the error isn't transient - it always seems to go down just after I leave work and stay that way until I get in the next morning.
> >
> > The "help" for that error is not in actual fact helpful - "Oh, that's the WMS seeing timeouts." I paraphrase. Timeouts on what? What can I do on the CE to fix them?
> >
> > So, after much googling, I've increased the MySQL buffer pool "innodb_buffer_pool_size=1024M" and reduced the blah purge time "purge_interval=1000000".
> >
> > So does anyone have any more CreamCE tuning tips I can try? We're running the UMD release but I've "backported" the trustmanager fix from EMI.
>
> Not directly tuning, but:
>
> Cream is actually two parts - the bit that talks to the outside world, X509 and such, which keeps its database in mysql, and Blah, which talks to the batch system (the BNotifier/BUpdaterXXX thing) and keeps its database in an ad hoc, informally-specified, bug-ridden, slow implementation of half of a proper database engine.
>
> On the cream/mysql side: it's generally seen that tightening up the purger times - so that once a job is finished it doesn't hang about for too long - is a handy step. Users can request purging in advance of this interval, so this is a maximum limit, not a minimum limit. http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB covers how to set up JOB_PURGE_POLICY. Note that this is totally different from the blah configuration purge_interval. I don't think we've done that up here - my memory says I indexed the mysql DB beyond the default (hence making a larger DB less of a problem), but I honestly can't find any notes I made on that (I went through all the grid services with mysql at one point). I think I'll have to revisit that at some point.
>
> Given the timeouts you're seeing, load on the mysql server might well cause them [0], hence the JOB_PURGE_POLICY is where I'd look first. (We run with innodb_buffer_pool_size=256M.) I think the default policy is 10 days for everything?
> Depending on how much traffic you get, that can make a difference.
>
> The Lease manager, which often seems more like a mis-feature, can slow things down a lot if there are a lot of 'leases' created - at one point Atlas pilot factories were creating many of these; although they've sorted that now, it's possible that on an old install you might have old ones slowing things down. I can't dig out the query used to count the number of them - anyone have it to hand?
>
> Although it's not one you've noted above, the biggest performance sink for us is when Blah breaks and has to start using the registry.npudir. This happens when one instance of the blah code can't access the "proper" registry and has to degrade to one file per job. Because this is not optimised, it ends up stat(2)-ing the directory for each operation, then walking through each file. So it's something like an O(n^3) algorithm or thereabouts, for n files in the dir. We have nagios alarms set up for when the number of files in there gets large - a threshold we've set to 5. After a clean restart, when the locks are sorted out, blah tidies up that dir at around 50 files a minute, so that can be responsible for a very long wait on restart that's sometimes observed.
>
> More directly, it's also responsible for timeouts of the sort:
>
> failureReason=BLAH error: no jobId in submission script's output (stdout:) (stderr: <blah> execute_cmd: 200 seconds timeout expired, killing child process.-)
>
> Keeping the purge_interval low can help with this situation, but I'm (now) of the opinion that it's best avoided by intervention when the npudir starts to fill up.
>
> Fundamentally, however, most of these issues are load dependent, so the best return for sysadmin effort is probably to set up another Cream CE. We run 3 (+ 1 experimental) at the moment; Imperial has 4. In between CPR on them, I'm poking at ways of alleviating the worst of the problems by digging through the source RPMs.
>
> Speaking of hardware - I'm of the opinion that 4GB of RAM is not enough, and 6GB is the minimum, with 8GB a sensible baseline - and that's allowing a 256M / 512M innodb buffer pool. With a 1024M buffer pool, I don't think everything will fit in 4GB of RAM.
>
> Hrm - sorry if that's turned a bit rambly, a fire alarm halfway through derailed my train of thought...
>
> [0] That is: I think that the timeouts are cream taking a long time to respond to the WMS's request for status updates - probably because the WMS uses a single Lease, and thus is trying to get a lot of data at once, or at least forcing cream to search a large subset.
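
P.S. In case nobody has one to hand, this is roughly what I was going to knock up for the registry.npudir file count - completely untested, just a sketch of the idea. The default path is only a guess at where the blah registry sits on our install, so it would need pointing at the right place, and the thresholds (5 from your alarms, 50 picked arbitrarily) are just placeholders. It counts the files and uses the usual nagios exit codes (0 OK, 1 warning, 2 critical, 3 unknown):

#!/usr/bin/env python
# check_npudir - nagios check for the number of files in blah's registry.npudir.
# NB: the default directory below is a guess; point -d at wherever the npudir
# actually lives on your CreamCE.
import os
import sys
from optparse import OptionParser

def main():
    parser = OptionParser(usage="%prog -d DIR -w WARN -c CRIT")
    parser.add_option("-d", "--dir", dest="dir",
                      default="/var/blah/user_blah_job_registry.bjr/registry.npudir")
    parser.add_option("-w", "--warning", dest="warn", type="int", default=5)
    parser.add_option("-c", "--critical", dest="crit", type="int", default=50)
    opts, args = parser.parse_args()

    try:
        count = len(os.listdir(opts.dir))
    except OSError, err:
        print "NPUDIR UNKNOWN: cannot read %s: %s" % (opts.dir, err)
        sys.exit(3)

    # include perfdata so the file count gets graphed as well as alarmed on
    msg = "%d files in %s|files=%d;%d;%d" % (count, opts.dir, count, opts.warn, opts.crit)
    if count >= opts.crit:
        print "NPUDIR CRITICAL: " + msg
        sys.exit(2)
    elif count >= opts.warn:
        print "NPUDIR WARNING: " + msg
        sys.exit(1)
    else:
        print "NPUDIR OK: " + msg
        sys.exit(0)

if __name__ == "__main__":
    main()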
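
P.P.S. On the purge times and mysql side, this is the sort of thing I'm currently thinking of trying, going from the syntax on the page you linked and your 256M/512M comment - the numbers are plucked more or less out of the air rather than tested, so do say if any of it looks daft:

# site-info.def (then re-run yaim):
JOB_PURGE_POLICY="ABORTED 4 days; CANCELLED 4 days; DONE-OK 2 days; DONE-FAILED 4 days; REGISTERED 2 days;"

# /etc/my.cnf, [mysqld] section:
innodb_buffer_pool_size=512M
# trade a second or so of durability for fewer fsyncs under load
innodb_flush_log_at_trx_commit=2
query_cache_size=64M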