Hi Fotis,
Could you be more specific about your tests? As it stands, the report is
not very useful.
We ran several tests after we received your email.
We ran 99 concurrent (light-weight) jobs from 3 different UI machines
(33 jobs each): 1 in Germany and 2 in Cyprus.
*ALL* jobs were successfully submitted and a valid jobid was returned.
Also, these tests were performed DURING the load imposed by your tests
(as we could see in the logs)!
The tests were completed in about 2-3 minutes.
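
For anyone who wants to repeat this kind of test, the sketch below shows one
way to fire a batch of submissions in parallel and count how many return a
job id. It is only an illustration, not the script we used: `submit` is a
hypothetical callable (on a real UI it would wrap the job submission
command), and `fake_submit` is a stand-in so the sketch runs anywhere.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrent(submit, n_jobs):
    """Call submit(i) for i in range(n_jobs), all in parallel.

    `submit` is a hypothetical callable: it returns a job id on success
    and raises an exception on failure. Returns (job_ids, errors).
    """
    job_ids, errors = [], []
    # One worker per job, so all submissions are in flight at the same time.
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(submit, i) for i in range(n_jobs)]
        for fut in futures:
            try:
                job_ids.append(fut.result())
            except Exception as exc:
                errors.append(exc)
    return job_ids, errors


def fake_submit(i):
    # Stand-in for a real submission; always succeeds.
    return f"job-{i}"


if __name__ == "__main__":
    ids, errs = run_concurrent(fake_submit, 33)
    print(len(ids), "submitted,", len(errs), "failed")
```

Counting the returned job ids and the raised errors separately makes it easy
to see at which concurrency level submissions start to fail.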
The physical memory on the RB is 1GB.
Again, could you be more specific about your measurements? Is it 32
concurrent jobs, or 226, or 227, or what?
What is your measured discrepancy between the CY and GR RBs?
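
To make the question concrete, here is a small sketch that turns the line
counts quoted below (run #2 for each RB) into a shortfall per concurrency
level. It assumes that each file always contains its header line, so that
every missing line is one missing job; that is my reading of the figures,
not something the report states.

```python
# Line counts copied from the quoted report, run #2 of each RB:
# name -> (expected lines including header, {concurrency: lines seen})
RUNS = {
    "GR run #2": (227, {8: 226, 16: 225, 32: 225, 64: 196, 128: 151}),
    "CY run #2": (223, {8: 223, 16: 222, 32: 223, 64: 196, 128: 144}),
}


def shortfall(expected, seen):
    """Return (missing_lines, percent_of_jobs).

    Both counts include one header line, so the number of jobs is
    expected - 1. Assumes the header is always present.
    """
    jobs = expected - 1
    missing = expected - seen
    return missing, 100.0 * missing / jobs


if __name__ == "__main__":
    for name, (expected, counts) in RUNS.items():
        for level, seen in sorted(counts.items()):
            missing, pct = shortfall(expected, seen)
            print(f"{name} @ {level:3d}: {missing:3d} missing ({pct:.1f}%)")
```

Expressing the loss as a percentage of submitted jobs allows the two RBs to
be compared even though the runs used 226 and 222 jobs respectively.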
The CPU load on the RB machine does not go much over 70% ...
but there are some nasty messages in the log (which of course may not be
important) e.g.:
Mar 3 18:41:09 rb101 kernel: application bug: edg-wl-in.ftpd(11828) has
SIGCHLD set to SIG_IGN but calls wait().
Mar 3 18:41:09 rb101 kernel: (see the NOTES section of 'man 2 wait').
Workaround activated.
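
For context on that kernel message: under POSIX.1-2001 semantics, a process
that ignores SIGCHLD has its children reaped automatically, so a later
wait() has nothing to collect and fails with ECHILD. The sketch below
illustrates this on a current Linux system; it is a generic demonstration
and says nothing about what edg-wl-in.ftpd actually does internally. The
function name is my own.

```python
import os
import signal


def wait_fails_when_sigchld_ignored():
    """Fork a child while SIGCHLD is ignored and try to wait for it.

    Returns True if waitpid() raises ChildProcessError (ECHILD),
    i.e. the child was reaped automatically.
    """
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        try:
            os.waitpid(pid, 0)
        except ChildProcessError:
            return True  # nothing left to wait for
        return False
    finally:
        # Restore the earlier disposition (fall back to the default).
        signal.signal(
            signal.SIGCHLD,
            previous if previous is not None else signal.SIG_DFL,
        )


if __name__ == "__main__":
    print("wait() failed with ECHILD:", wait_fails_when_sigchld_ignored())
```

The "Workaround activated" line suggests that the kernel on rb101 papers
over this combination for the daemon, which would fit with the message being
harmless, but I have not verified that.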
Thanks
Wei
Fotis Georgatos wrote:
>
> Hello all,
>
> As you probably already know, I am in the process of stress testing
> a couple of Resource Brokers, one installed in Athens and one in Nicosia,
> which are intended for general use by other SEE-grid participants.
>
> I run the job submissions in parallel in order to find the throughput
> envelope that the RBs allow; my findings follow.
> I'm making all this fuss so that others know what to expect.
>
> The -lightweight- jobs go fine as long as I don't push past a certain
> limit, which appears to be around 32 concurrent submissions from the UI
> to the RB.
>
> I can reproducibly break both the GR and CY RBs with these submissions,
> even though they were installed by two different and independent teams!
>
> rb.isabella.grnet.gr, run #1: (first column should be 227 = 226 jobs +
> header)
> 227 gr_jobs_1
> 227 gr_jobs_8
> 227 gr_jobs_16
> 227 gr_jobs_32
> 225 gr_jobs_32
> 219 gr_jobs_64
> 154 gr_jobs_128
>
> rb.isabella.grnet.gr, run #2: (first column should be 227 = 226 jobs +
> header)
> 226 gr_jobs_008
> 225 gr_jobs_032
> 225 gr_jobs_016
> 196 gr_jobs_064
> 151 gr_jobs_128
>
> rb101.grid.ucy.ac.cy, run #1: (first column should be 223 = 222 jobs +
> header)
> 223 cy_jobs_008
> 223 cy_jobs_016
> 217 cy_jobs_032
> 202 cy_jobs_064
> 147 cy_jobs_128
>
> rb101.grid.ucy.ac.cy, run #2: (first column should be 223 = 222 jobs +
> header)
> 223 cy_jobs_008
> 222 cy_jobs_016
> 223 cy_jobs_032
> 196 cy_jobs_064
> 144 cy_jobs_128
>
> The major problem is that no sane log message pointing to the cause is
> seen anywhere, although it is definitely something on the rb/bdii side.
>
> I initially thought it was some kind of open file descriptors problem
> or similar, but I eventually came to accept that it is caused by memory
> exhaustion on the rb, since I noticed it happens just as soon as our rb
> starts swapping in/out.
>
> The Cyprus side seems to break a little sooner.
> I presume they either have 512 MB of memory while we have 1 GB, or
> they do have 1 GB of memory with somewhat less CPU horsepower.
> Can someone from that side please confirm?
>
>
> The problems manifest themselves from the UI side as:
>
> **** Error: API_NATIVE_ERROR ****
> Error while calling the "NSClient::multi" native api
> IOException: Unable to connect to remote (rb101.grid.ucy.ac.cy:7772)
>
> **** Error: UI_NO_NS_CONTACT ****
> Unable to contact any Network Server
>
>
> I am still in the process of debugging this thing,
> but I wanted to let you know what is going on.
>
> BTW,
> The Greek BDII is certainly feeling the heat, as it runs on the RB node:
> http://goc.grid.sinica.edu.tw/gstat/HG-01-GRNET/BDIINode_Perf_ent_.html
> Yeah, we should probably have these two separated or "fenced" in
> resources...
> ...and have this explicitly stated somewhere in the RB documentation?
>
> cheers,
> Fotis
>
--
============================================================
Wei Xing, M.Sc.
Research Associate Tel: 00357-22892663
Dept. of Computer Science Fax: 00357-22892701
University of Cyprus email: [log in to unmask]
PO Box 20537
CY1678, Nicosia, CYPRUS