Hi Fotis,
Here at CERN, we run the top-level BDIIs behind load-balanced DNS. Trying
to stress test with the BDII on the same machine as the RB is just asking
for trouble: the RB load on the machine will be so high that it will
impede the BDII.
You should definitely move the BDII to a dedicated machine, and if that is
not enough, add multiple machines behind a round-robin DNS alias.
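For illustration, a round-robin alias is nothing more than several A
records behind one name; a minimal BIND zone fragment might look like
this (the name and addresses below are made up, not our real records):

    ; illustrative only
    bdii    IN  A   192.0.2.11
    bdii    IN  A   192.0.2.12

Clients then point their information-system setting (e.g. LCG_GFAL_INFOSYS)
at the alias, and the resolver rotates the answers across the machines.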
A new version of the BDII is almost ready. It has been slightly
re-engineered to improve performance under load. It is still going
through the certification process and should be ready for the LCG-2_4_0
release.
It would be very useful if you could install this version of the BDII
and try to use it in the stress tests. It can be downloaded from:
http://lfield.home.cern.ch/lfield/bdii/rpms/lcg-bdii-3.2.0-1.noarch.rpm
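Installing it should be a straightforward package upgrade, roughly as
follows (assuming a stock LCG-2 node; run as root on the BDII machine):

    rpm -Uvh lcg-bdii-3.2.0-1.noarch.rpm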
Thanks
Laurence
Fotis Georgatos wrote:
> Hello all,
>
> as you probably already know, I am in the process of stress testing
> a couple of Resource Brokers, one installed in Athens and one in Nicosia,
> which are aimed at generic use by other SEE-grid participants.
>
> I run the job submissions in parallel in order to find the envelope
> that the RBs allow in terms of throughput; my findings follow.
> I'm making all this fuss so that others know what to expect.
>
> The -lightweight- jobs go fine as long as I don't push past a certain
> limit, which appears to be around 32 concurrent submissions from the UI
> to the RB.
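>
> For anyone who wants to reproduce this, one batch boils down to roughly
> the following (a simplified sketch, not my actual script; test.jdl and
> the file names are placeholders):
>
>     #!/bin/sh
>     # fire N submissions at the RB in parallel
>     N=32
>     for i in `seq 1 $N`; do
>         # one job-id file per submission avoids interleaved writes
>         edg-job-submit -o jobids_${N}_${i}.txt test.jdl &
>     done
>     wait   # all N submissions have returned (or failed) by here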
>
> I can reproducibly break both the GR and CY RBs with these submissions,
> and those were indeed installed by two different and independent teams!
>
> rb.isabella.grnet.gr, run #1: (first column should be 227 = 226 jobs +
> header)
> 227 gr_jobs_1
> 227 gr_jobs_8
> 227 gr_jobs_16
> 227 gr_jobs_32
> 225 gr_jobs_32
> 219 gr_jobs_64
> 154 gr_jobs_128
>
> rb.isabella.grnet.gr, run #2: (first column should be 227 = 226 jobs +
> header)
> 226 gr_jobs_008
> 225 gr_jobs_032
> 225 gr_jobs_016
> 196 gr_jobs_064
> 151 gr_jobs_128
>
> rb101.grid.ucy.ac.cy, run #1: (first column should be 223 = 222 jobs +
> header)
> 223 cy_jobs_008
> 223 cy_jobs_016
> 217 cy_jobs_032
> 202 cy_jobs_064
> 147 cy_jobs_128
>
> rb101.grid.ucy.ac.cy, run #2: (first column should be 223 = 222 jobs +
> header)
> 223 cy_jobs_008
> 222 cy_jobs_016
> 223 cy_jobs_032
> 196 cy_jobs_064
> 144 cy_jobs_128
>
> The major problem is that no sane log message pointing to the cause is
> seen anywhere, although it is definitely something on the part of the
> RB/BDII.
>
> I initially thought it was some kind of open file descriptor problem
> or such, but I eventually came to accept that it is caused by memory
> exhaustion on the RB, since I noticed it happens just as soon as our RB
> starts swapping in/out.
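>
> This is easy to watch live on the RB itself, e.g.:
>
>     # non-zero si/so columns mean the box is swapping in/out
>     vmstat 5
>     # memory and swap usage in megabytes
>     free -m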
>
> The Cyprus side seems to break a little sooner; I presume they either
> have 512 MB of memory while we hold 1 GB, or they do have 1 GB of
> memory with somewhat less CPU horsepower.
> Could someone from that side please confirm?
>
>
> The problems manifest themselves from the UI side as:
>
> **** Error: API_NATIVE_ERROR ****
> Error while calling the "NSClient::multi" native api
> IOException: Unable to connect to remote (rb101.grid.ucy.ac.cy:7772)
>
> **** Error: UI_NO_NS_CONTACT ****
> Unable to contact any Network Server
>
>
> I am still in the process of debugging this thing,
> but I wanted to let you know what is going on.
>
> BTW,
> The Greek BDII is certainly feeling the heat, as it runs on the RB node:
> http://goc.grid.sinica.edu.tw/gstat/HG-01-GRNET/BDIINode_Perf_ent_.html
> Yeah, we should probably have these two separated or "fenced" in
> resources...
> ...and have this explicitly said somewhere in the RB documentation?
>
> cheers,
> Fotis
>
> --
> echo "sysadmin know better bash than english" | sed s/min/mins/ \
> | sed 's/better bash/bash better/' # Yelling in a CERN forum
>