Hello Luuk,
> I'm wondering what the expected performance of the wms matchmaking
> process is?
> Our wms seems to get stuck at 10 to 12 per minute for some fairly simple
> jdl files.
> This while there's still cpu cycles and memory left.
>
> Is this a reasonable rate?
Yes and no. The current implementation does not support matchmaking times
much shorter than what you report. The shortest times I have seen on the
WMS nodes at CERN are just above 2 s, but 5 or 6 s are more common.
There is an ongoing investigation into why those times gradually get worse
and then sometimes recover without a restart of the Workload Manager.
The trouble is that the matchmaking requires an exclusive lock on the ISM
(Information System Supermarket), the WMS internal image of the relevant
parts of the BDII. This makes the process to a large extent single-threaded.
The matchmaking code has been significantly redone for the gLite 3.2 WMS,
which will allow different threads to work in parallel, thereby improving
the job throughput quite a lot.
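
As a rough illustration of why the lock caps the rate: if every match holds
the ISM exclusively, throughput is simply the inverse of the time per match.
The sketch below is only that arithmetic; the function name is made up for
this example and is not part of the WMS.

```python
def jobs_per_minute(seconds_per_match: float) -> float:
    """Upper bound on throughput when matches run strictly one at a time."""
    if seconds_per_match <= 0:
        raise ValueError("seconds_per_match must be positive")
    return 60.0 / seconds_per_match


# Times quoted in this mail: about 2 s at best, 5 or 6 s more commonly.
for t in (2.0, 5.0, 6.0):
    print(f"{t:.0f} s per match -> {jobs_per_minute(t):.0f} jobs/min")
```

With 5 or 6 s per match this gives 12 or 10 jobs per minute, which is the
rate you report.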
The time per job scales with the number of entries in the ISM.
To get better performance one can:
1. reduce the number of entries by applying a VO filter, or
2. submit jobs with the same requirements in _collections_.
Option 1 becomes possible when patch #2562 goes to production (it is now in PPS):
https://savannah.cern.ch/patch/?2562
https://twiki.cnaf.infn.it/cgi-bin/twiki/view/EgeeJra1It/WMS_guide
Option 2 is already used today, e.g. by CMS, ATLAS and LHCb, to submit some
15k jobs per WMS per day without backlogs (I have seen higher numbers).
However, beware of this bug:
https://savannah.cern.ch/bugs/index.php?32345
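
To get a feeling for what the two options might buy, here is a small
back-of-the-envelope sketch. It rests on two assumptions that are mine and
not measured: that the time per match is strictly proportional to the number
of ISM entries, and that a collection of jobs with identical requirements
costs a single matchmaking pass. The entry counts and the collection size
are purely hypothetical, and the results are upper bounds for the
matchmaking step alone.

```python
SECONDS_PER_DAY = 86400


def match_seconds(base_seconds: float, ism_entries: int, filtered_entries: int) -> float:
    """Time per match after filtering, ASSUMING strictly linear scaling."""
    return base_seconds * filtered_entries / ism_entries


def jobs_per_day(seconds_per_match: float, jobs_per_collection: int = 1) -> float:
    """Serialized matchmaking capacity, ASSUMING one pass per collection."""
    return SECONDS_PER_DAY / seconds_per_match * jobs_per_collection


# Hypothetical numbers: 5 s per match with 4000 ISM entries,
# a VO filter that leaves 1000 entries, collections of 50 jobs.
print(match_seconds(5.0, 4000, 1000))        # option 1: seconds per match
print(jobs_per_day(5.0))                     # individual jobs, no filter
print(jobs_per_day(5.0, jobs_per_collection=50))  # option 2
```

The real gain depends on how well the assumptions hold on your WMS, so take
the output as an order-of-magnitude estimate only.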
Cheers,
Maarten
> Lately this is causing some trouble because job submission to the wms is
> a much faster process.
> When there are lots of job submissions the resulting queue can easily
> take hours to drain, thus making
> the wms unusable for people who expect a reasonable turnaround time for
> their jobs.
>
> Cheers,
>
> Luuk Uljee
>
> sample jdl (taken from the sandboxdir):
> [
> requirements = other.GlueCEStateStatus == "Production";
> RetryCount = 0;
> MyProxyServer = "px.grid.sara.nl";
> AllowZippedISB = true;
> JobType = "normal";
> SignificantAttributes = { "Requirements","Rank","FuzzyRank" };
> Executable = "job.sh";
> Stdoutput = "stdout.txt";
> OutputSandbox = { "stdout.txt","stderr.txt" };
> VirtualOrganisation = "lsgrid";
> rank = -other.GlueCEStateEstimatedResponseTime;
> Type = "job";
> ShallowRetryCount = 10;
> StdError = "stderr.txt";
> DefaultRank = -other.GlueCEStateEstimatedResponseTime;
> ZippedISB = { "ISBfiles_....tar.gz" };
> InputSandbox = { "file:///..../job.sh" }
> ]
>