Hi,

Thanks, there's a fair amount there to go on, so a few follow-up questions.

What are suggested reasonable times for the automatic job purging? We've got between a couple of thousand and about 30,000 jobs for our newest (in production today) and oldest Creams.

Are you using a custom nagios check command for the file count in registry.npudir? A quick check of the default ones didn't seem to have that functionality, and I can always knock one up, but if someone else already has... (A rough sketch of what I had in mind is at the bottom of this mail.)

Any suggestions of mysql tunings for the creamdb would be very welcome - I'm by no means a mysql expert. (What I'm thinking of trying is also at the bottom, so shout if it looks daft.)

I know you can split the creamce and blah parts onto separate nodes and even have one blah parser support multiple CreamCEs, but isn't that just putting in a single point of (likely) failure?

We've got 3 Cream CEs with between 6 and 8GB of RAM, so I'm hoping that will be enough.

Thanks,
Chris.

> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-[log in to unmask]] On Behalf Of Stuart Purdie
> Sent: 06 September 2011 15:06
> To: [log in to unmask]
> Subject: Re: CreamCE Tuning
>
> On 6 Sep 2011, at 13:35, Chris Brew wrote:
>
> > Hi All,
> >
> > Having replaced all our CEs with CreamCEs we've been having problems keeping them up for extended periods of time.
> >
> > They seem to run fine for a few days to a week or so before falling over with "The endpoint is blacklisted" errors (generally), but the error isn't transient - it always seems to go down just after I leave work and stay that way until I get in the next morning.
> >
> > The "help" for that error is not in actual fact helpful - "Oh, that's the WMS seeing timeouts." I paraphrase. Timeouts on what? What can I do on the CE to fix them?
> >
> > So, after much googling, I've increased the MySQL buffer pool "innodb_buffer_pool_size=1024M" and reduced the blah purge time "purge_interval=1000000".
> >
> > So does anyone have any more CreamCE tuning tips I can try? We're running the UMD release but I've "backported" the trustmanager fix from EMI.
>
> Not directly tuning, but:
>
> Cream is actually two parts - the bit that talks to the outside world, X509 and such, which keeps its database in mysql, and Blah, which talks to the batch system (the BNotifier/BUpdaterXXX thing) and keeps its database in an ad hoc, informally-specified, bug-ridden, slow implementation of half of a proper database engine.
>
> On the cream/mysql side: it's generally seen that tightening up the purger times - so that once a job is finished it doesn't hang about for too long - is a handy step. Users can request purging in advance of this interval, so this is a maximum limit, not a minimum limit. http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB covers how to set up JOB_PURGE_POLICY. Note that this is totally different from the blah configuration purge_interval. I don't think we've done that up here - my memory says I indexed the mysql DB beyond the default (hence making a larger DB less of a problem), but I honestly can't find any notes I made on that (I went through all the grid services with mysql at one point). I think I'll have to revisit that at some point.
>
> Given the timeouts you're seeing, load on the mysql server might well cause them [0], hence the JOB_PURGE_POLICY is where I'd look first. (We run with innodb_buffer_pool_size=256M.) I think the default policy is 10 days for everything?
> Depending on how much traffic you get, that can make a difference.
>
> The Lease manager, which often seems more like a mis-feature, can slow things down a lot if there are a lot of 'leases' created - at one point Atlas pilot factories were creating many of these; although they've sorted that now, it's possible that on an old install you might have old ones slowing things down. I can't dig out the query used to count the number of them - anyone have it to hand?
>
> Although it's not one you've noted above, the biggest performance sink for us is when Blah breaks and has to start using the registry.npudir. This happens when one instance of the blah code can't access the "proper" registry and has to degrade to one file per job. Because this is not optimised, it ends up stat(2)-ing the directory for each operation, then walking through each file. So it's something like an O(n^3) algorithm or thereabouts, for n files in the dir. We have nagios alarms set up for when the number of files in there gets large - a threshold we've set to 5. After a clean restart, when the locks are sorted out, blah tidies up that dir at around 50 files a minute, so that can be responsible for a very long wait on restart that's sometimes observed.
>
> More directly, it's also responsible for timeouts of the sort:
>
> failureReason=BLAH error: no jobId in submission script's output (stdout:) (stderr: <blah> execute_cmd: 200 seconds timeout expired, killing child process.-)
>
> Keeping the purge_interval low can help with this situation, but I'm (now) of the opinion that it's best avoided by intervention when the npudir starts to fill up.
>
> Fundamentally, however, most of these issues are load dependent, so the best return for sysadmin effort is probably to set up another Cream CE. We run 3 (+ 1 experimental) at the moment; Imperial has 4. In between CPR on them, I'm poking at ways of alleviating the worst of the problems by digging through the source RPMs.
>
> Speaking of hardware - I'm of the opinion that 4GB of RAM is not enough, and 6GB is the minimum, with 8GB a sensible baseline - and that's allowing a 256M / 512M innodb buffer pool. With a 1024M buffer pool, I don't think everything will fit in 4GB of RAM.
>
> Hrm - sorry if that's turned a bit rambly, a fire alarm halfway through derailed my train of thought...
>
> [0] That is: I think that the timeouts are cream taking a long time to respond to the WMS's request for status updates - probably because the WMS uses a single Lease, and thus is trying to get a lot of data at once, or at least forcing cream to search a large subset.
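
P.S. In case nobody has one to hand, this is roughly what I was going to knock up for the registry.npudir file count - completely untested, just a sketch of the idea. The default path is only a guess at where the blah registry sits on our install, so it would need pointing at the right place, and the thresholds (5 from your alarms, 50 picked arbitrarily) are just placeholders. It counts the files and uses the usual nagios exit codes (0 OK, 1 warning, 2 critical, 3 unknown):

#!/usr/bin/env python
# check_npudir - nagios check for the number of files in blah's registry.npudir.
# NB: the default directory below is a guess; point -d at wherever the npudir
# actually lives on your CreamCE.
import os
import sys
from optparse import OptionParser

def main():
    parser = OptionParser(usage="%prog -d DIR -w WARN -c CRIT")
    parser.add_option("-d", "--dir", dest="dir",
                      default="/var/blah/user_blah_job_registry.bjr/registry.npudir")
    parser.add_option("-w", "--warning", dest="warn", type="int", default=5)
    parser.add_option("-c", "--critical", dest="crit", type="int", default=50)
    opts, args = parser.parse_args()

    try:
        count = len(os.listdir(opts.dir))
    except OSError, err:
        print "NPUDIR UNKNOWN: cannot read %s: %s" % (opts.dir, err)
        sys.exit(3)

    # include perfdata so the file count gets graphed as well as alarmed on
    msg = "%d files in %s|files=%d;%d;%d" % (count, opts.dir, count, opts.warn, opts.crit)
    if count >= opts.crit:
        print "NPUDIR CRITICAL: " + msg
        sys.exit(2)
    elif count >= opts.warn:
        print "NPUDIR WARNING: " + msg
        sys.exit(1)
    else:
        print "NPUDIR OK: " + msg
        sys.exit(0)

if __name__ == "__main__":
    main()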
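
P.P.S. On the purge times and mysql side, this is the sort of thing I'm currently thinking of trying, going from the syntax on the page you linked and your 256M/512M comment - the numbers are plucked more or less out of the air rather than tested, so do say if any of it looks daft:

# site-info.def (then re-run yaim):
JOB_PURGE_POLICY="ABORTED 4 days; CANCELLED 4 days; DONE-OK 2 days; DONE-FAILED 4 days; REGISTERED 2 days;"

# /etc/my.cnf, [mysqld] section:
innodb_buffer_pool_size=512M
# trade a second or so of durability for fewer fsyncs under load
innodb_flush_log_at_trx_commit=2
query_cache_size=64M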