Hi Stephen et al,
>>During the submissions, I managed to confirm that the RBs
>>have a breaking
>>point at around 32-64 concurrent submissions of jobs and a
>>throughput of 25 jobs/minute.
>
>
> Is the throughput the number of jobs you can submit, or the rate at
> which they run? You can often submit jobs faster than they can be
> processed, especially if there are complex input-file matches.
"Throughput" refers to the innate ability of the RB to accept jobs,
as seen from a nearby UI; in practice it is the load that can be
effectively set by a single user on a single UI against a single RB,
having confirmed that the submission bottleneck is the RB itself.
I haven't yet tried more complex experiments, such as some
balancing technique across multiple RBs or using multiple UIs.
The rate of 25 jobs/minute is not necessarily bad, since the LCG
is meant for cpu-intensive tasks where the submission time is minuscule
compared to the run time of the load imposed on the Grid worker nodes.
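To put numbers on that claim (the 12-hour run time below is a
hypothetical figure of mine, just for illustration):

```shell
# Back-of-the-envelope check: at the measured 25 jobs/minute the RB
# spends 2.4 seconds per job, negligible against a long cpu-bound job.
awk 'BEGIN {
  rate      = 25          # measured submission rate, jobs/minute
  per_job_s = 60 / rate   # seconds of RB time per job
  runtime_s = 12 * 3600   # assumed run time of a cpu-bound job
  printf "%.1f s to submit vs %d s run time (ratio %.5f)\n",
         per_job_s, runtime_s, per_job_s / runtime_s
}'
```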
FYI, the RB in question is a dual P4 2.8 GHz IBM xSeries 335 server.
I find the breaking point between 32 and 64 parallel jobs more important;
I believe the developers should give us a hint as to whether this behaviour is normal.
The "rate at which the jobs run" is a completely different matter,
dependent on the nature of the jobs, the nodes in question,
their I/O dependencies, etc. That is yet another experiment, not yet done.
>>All of these tasks failed with "Job RetryCount (3) hit" error.
>
> That isn't a real error, it means that the system tried to run the job
> three times, maybe at three different places, and they all failed. You
> can see the individual failure reasons if you do
> edg-job-get-logging-info -v 2. It may also be that some of the jobs you
> are counting as OK actually failed somewhere and retried. If you want to
> measure the underlying error rate it may be worth turning off the
> retries.
Well, the situation is that all the tests were done with the -r parameter.
This in effect forces the RB to use specific sites and specific queues
at these sites, in a "hard-wired" manner. In my view, it is perfectly fine
that at any given moment a number of the queues might not be
functional. What I do not find acceptable, though, is that there are
queues at sites that are experimental or disabled and still advertise
"Production" status. I believe this not only poisons the quality of
the information the BDII provides, it also imposes unnecessary
RB load, to the point of wasting users' time.
To sum up, queues that are not truly "Production" shouldn't advertise
themselves as such. I suggest we abide by the rule that any site
appearing in the output of edg-job-list-match should either execute a
job or be configured to show up with maintenance status. This is not yet the case.
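Incidentally, this is easy to audit. A sketch, assuming a BDII dump in
LDIF form saved beforehand with something along the lines of
"ldapsearch -x -H ldap://<bdii-host>:2170 -b o=grid > bdii_dump.ldif"
(host and base DN are site-specific; attribute names are from the Glue
schema):

```shell
# List the CE queues that advertise Production status in a BDII dump.
# Uses awk paragraph mode (RS=) so each blank-line-separated LDIF
# entry is processed as one record.
awk -v RS= '/GlueCEStateStatus: Production/ {
  for (i = 1; i <= NF; i++)
    if ($i == "GlueCEUniqueID:") print $(i + 1)
}' bdii_dump.ldif
```

Cross-checking that list against the queues that actually run jobs
would expose the falsely-"Production" entries.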
"Job RetryCount (3) hit" should only appear when the network connection
between the RB and the CE is unavailable for a "reasonable time period";
in any other case it should imply either a bug or bad site maintenance...
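Incidentally, turning off the retries as suggested above is a one-line
JDL change; a hypothetical minimal sleep.jdl (the Executable and
Arguments are my guess, only the RetryCount attribute matters here):

```
// hypothetical minimal sleep.jdl; RetryCount = 0 disables resubmission
Executable = "/bin/sleep";
Arguments  = "60";
RetryCount = 0;
```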
What do you think?
cheers,
Fotis
PS, FYI, the script that has been used in the tests follows:
[gef@ui01 8]$ cat rbload
#!/bin/sh
# rbload: stress-test the RB by submitting sleep.jdl against every
# queue listed in $1_matches, with $HOWMANY submissions in parallel.
HOWMANY=$2
HOWMANY=${HOWMANY:=32}              # default to 32 parallel submissions
time cat $1_matches | xargs -n1 -P$HOWMANY --replace \
    edg-job-submit --config-vo myui.$1 --nomsg -r {} sleep.jdl \
    | tee $1_jobs.$$_$HOWMANY.log
[gef@ui01 8]$
It takes two parameters, e.g. "./rbload gr 16", which drives it to
send 16 jobs in parallel, interacting with the files:
* gr_matches        # input list of target queues, one per line
* myui.gr           # input vo configuration file, supplying rb and/or proxy
* gr_jobs.$$_16.log # output logfile; its name contains the process id
PS2.
My other comment on the stability of the latest bdii rpm,
urging others to deploy it, should only be read as:
"We ran a great range of tests on that piece of software
and, no matter what we tried, it appeared to function correctly;
in contrast, our previous bdii setup had low stability karma."
--
echo "sysadmin know better bash than english" | sed s/min/mins/ \
| sed 's/better bash/bash better/' # Yelling in a CERN forum