Hello all,
as you probably already know, I am in the process of stress testing
a couple of Resource Brokers, one installed in Athens and one in Nicosia,
which are intended for general use by other SEE-grid participants.
I run the job submissions in parallel, at increasing levels of concurrency,
to find the throughput envelope that the RBs allow; my findings follow.
I'm making all this fuss so that others know what to expect.
The lightweight jobs go fine as long as I don't push past a certain limit,
which appears to be around 32 concurrent submissions from the UI to the RB.
I can reproducibly make the submissions break both the GR and CY RBs,
which were installed by two different and independent teams!
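For anyone who wants to repeat this kind of test, here is a minimal sketch of how a batch of submissions could be driven at a fixed concurrency. It is not the script behind the numbers in this post. The names `submit_once` and `run_batch` are hypothetical, the submission command is whatever the caller passes in, and treating a non-zero exit status as a failed submission is an assumption.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def submit_once(cmd):
    """Run one submission command; return True if it exited with status 0.

    Assumption: the submission tool signals failure via a non-zero exit code.
    """
    result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return result.returncode == 0


def run_batch(cmd, total_jobs, concurrency):
    """Issue total_jobs submissions, with at most `concurrency` in flight.

    Returns a (succeeded, failed) tuple.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        outcomes = list(pool.map(submit_once, [cmd] * total_jobs))
    succeeded = sum(outcomes)
    return succeeded, total_jobs - succeeded
```

A call would look like `run_batch(["your-submit-command", "job.jdl"], total_jobs=226, concurrency=32)`, where both list entries are placeholders for the real submission command and job description on your UI.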
rb.isabella.grnet.gr, run #1: (first column should be 227 = 226 jobs + header)
227 gr_jobs_1
227 gr_jobs_8
227 gr_jobs_16
227 gr_jobs_32
225 gr_jobs_32
219 gr_jobs_64
154 gr_jobs_128
rb.isabella.grnet.gr, run #2: (first column should be 227 = 226 jobs + header)
226 gr_jobs_008
225 gr_jobs_032
225 gr_jobs_016
196 gr_jobs_064
151 gr_jobs_128
rb101.grid.ucy.ac.cy, run #1: (first column should be 223 = 222 jobs + header)
223 cy_jobs_008
223 cy_jobs_016
217 cy_jobs_032
202 cy_jobs_064
147 cy_jobs_128
rb101.grid.ucy.ac.cy, run #2: (first column should be 223 = 222 jobs + header)
223 cy_jobs_008
222 cy_jobs_016
223 cy_jobs_032
196 cy_jobs_064
144 cy_jobs_128
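To make the shortfall easier to read, the arithmetic can be sketched as below, using the rb101.grid.ucy.ac.cy run #1 figures from above. Two assumptions: the numeric suffix of each file name is the number of concurrent submissions, and every line missing from a file is one lost job. The function name `shortfall` is made up for illustration.

```python
def shortfall(expected_lines, observed):
    """Return {concurrency: (missing_lines, percent_of_jobs_missing)}.

    expected_lines includes the header line, so the number of jobs is
    expected_lines - 1. Assumes each missing line is one lost job.
    """
    jobs = expected_lines - 1
    report = {}
    for concurrency, lines in observed.items():
        missing = expected_lines - lines
        report[concurrency] = (missing, round(100.0 * missing / jobs, 1))
    return report


# Line counts copied from rb101.grid.ucy.ac.cy, run #1.
CY_RUN_1 = {8: 223, 16: 223, 32: 217, 64: 202, 128: 147}

for concurrency, (missing, pct) in sorted(shortfall(223, CY_RUN_1).items()):
    print(f"{concurrency:>3} concurrent: {missing:>2} missing ({pct}%)")
```

Under those assumptions, nothing is lost at 8 and 16 concurrent submissions, while at 128 roughly a third of the 222 jobs go missing.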
The major problem is that no sensible log message pointing to the cause
shows up anywhere, although it is definitely something on the rb/bdii side.
I initially thought it was some kind of open file descriptor problem or similar,
but I eventually came to accept it's caused by memory exhaustion on the rb,
since I noticed it happens just as soon as our rb starts swapping in/out.
Cyprus' side seems to break a little sooner;
I presume they either have 512 MB of memory while we have 1 GB,
or they do have 1 GB of memory with somewhat less CPU horsepower.
Can someone from that side please confirm?
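One cheap way to check the swapping theory on an RB during a run is to sample swap usage from /proc/meminfo. The sketch below is only an illustration under assumptions: the function names are hypothetical, it relies on the `SwapTotal` and `SwapFree` fields that Linux reports in kB, and growing swap usage is only a proxy for actual swap-in/out activity.

```python
def swap_used_kb(meminfo_text):
    """Return swap in use (kB) computed from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        tokens = rest.split()
        if not sep or not tokens:
            continue
        try:
            fields[key.strip()] = int(tokens[0])
        except ValueError:
            # Older kernels print a non-numeric header line; skip it.
            continue
    return fields["SwapTotal"] - fields["SwapFree"]


def read_swap_used_kb(path="/proc/meminfo"):
    """Read the live value on a Linux host."""
    with open(path) as handle:
        return swap_used_kb(handle.read())
```

Calling `read_swap_used_kb()` every few seconds while the submissions ramp up, and logging the value with a timestamp, would show whether the failures line up with the moment swap usage starts to climb.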
The problems manifest themselves from the UI side as:
**** Error: API_NATIVE_ERROR ****
Error while calling the "NSClient::multi" native api
IOException: Unable to connect to remote (rb101.grid.ucy.ac.cy:7772)
**** Error: UI_NO_NS_CONTACT ****
Unable to contact any Network Server
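If the UI output of each submission is captured to a file, the error banners can be tallied per run. This is a small sketch that assumes the banners always have the `**** Error: NAME ****` shape shown above; `count_errors` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches banners such as "**** Error: UI_NO_NS_CONTACT ****".
ERROR_BANNER = re.compile(r"\*{4} Error: (\w+) \*{4}")


def count_errors(ui_output):
    """Count how often each error name appears in captured UI output."""
    return Counter(ERROR_BANNER.findall(ui_output))
```

Feeding it the concatenated output of one run gives a per-error tally, which makes it easier to compare how the two RBs fail at each concurrency level.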
I am still in the process of debugging this thing,
but I wanted to let you know what is going on.
BTW,
the Greek BDII is certainly feeling the heat, as it runs on the RB node:
http://goc.grid.sinica.edu.tw/gstat/HG-01-GRNET/BDIINode_Perf_ent_.html
Yeah, we should probably have these two separated or "fenced" in resources...
...and have this explicitly stated somewhere in the RB documentation?
cheers,
Fotis
--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
| sed 's/better bash/bash better/' # Yelling in a CERN forum