Peter et al,
We (TCD) have implemented a basic solution using PBS routing queues.
This currently deals with one cluster of 240 CPUs and a smaller one of
32 CPUs. The main components that needed modification were the lcgpbs
job manager and the information providers, to allow them to follow the
routing queue to its final destination(s) and to publish the correct
information w.r.t. the number of FreeCPUs etc.
I believe that each batch queue could be declared as a subcluster, so
that the correct information is also published (including OS etc.).
[Stephen (Burke), is my assumption correct?] This may require a change
to the GIP configuration. I have not implemented a subcluster solution,
as the nodes are of a similar SPEC rating to those published by the site.
Following routing queues does have a downside: a routing queue may have
multiple destinations (with the correct one chosen by Torque to suit
the job's resource specification). This can give rise to inconsistent
published information (max wallclock/CPU times).
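To illustrate the problem, a routing queue with two destinations might
be configured along these lines (queue names and limits here are
hypothetical, not our actual setup):

```
# Hypothetical Torque/PBS configuration (qmgr input): jobs enter "lcgq"
# and Torque routes each to whichever execution queue fits its request.
create queue lcgq
set queue lcgq queue_type = Route
set queue lcgq route_destinations = short,long
set queue lcgq enabled = True
set queue lcgq started = True

create queue short
set queue short queue_type = Execution
set queue short resources_max.walltime = 02:00:00

create queue long
set queue long queue_type = Execution
set queue long resources_max.walltime = 72:00:00
```

If the information provider publishes only one destination's limits for
the routing queue, the advertised max wallclock time need not match the
execution queue a given job actually lands in.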
We have also seen problems where a job submission from an RB is
retried, but the original job is not removed from the queue.
a) does the cluster need to access the Grid, or is the Grid being used
as an access method to submit non-Grid jobs into the cluster? It might
be possible to publish that this subcluster has no outbound IP access
(again Stephen, can you verify this?). Also, I'm not sure how many
applications on EGEE actually bother to use IP connectivity requirements
in the JDL.
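If publishing the lack of connectivity is valid, I would expect it to
amount to something like the following in the GlueSubCluster entry
(this is my guess at the relevant attributes, pending Stephen's
confirmation):

```
# Fragment of a GlueSubCluster LDIF entry; only the network adapter
# connectivity attributes are shown.
GlueHostNetworkAdapterInboundIP: FALSE
GlueHostNetworkAdapterOutboundIP: FALSE
```

An application that does care would then need a JDL requirement along
the lines of:

```
Requirements = other.GlueHostNetworkAdapterOutboundIP == TRUE;
```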
At the moment, we are working on a cleaner solution (a new
PBS-compatible job manager) for remote submission, so that the
published name of the remote machine is embedded in the "queue" field.
This is then parsed (split) by the job manager to obtain the remote
host and queue name.
I hope to have a working prototype in the next two weeks.
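As a sketch of the parsing step only (the ":" delimiter, the function
name and the example hostname are placeholders, not the final design):

```python
def split_queue_field(queue_field):
    """Split a combined "host:queue" value into (host, queue).

    The delimiter and layout are placeholders; the real job manager
    may encode the remote host differently.
    """
    host, sep, queue = queue_field.partition(":")
    if not sep:
        # No host embedded: fall back to the local default PBS server.
        return None, queue_field
    return host, queue

# e.g. split_queue_field("remote.cs.tcd.ie:long")
#   -> ("remote.cs.tcd.ie", "long")
# and split_queue_field("short") -> (None, "short")
```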
There was some effort on a "remote" job manager for Globus 2.4 a few
years ago, but having looked at the code I am not overly convinced that
this is the correct solution for us. See:
ftp://ftp.globus.org/pub/gt2/2.4/2.4-latest/extra/src/globus_gram_job_manager_setup_remote-1.1.tar.gz
To comment on David McBride's followup (not quoted):
The Imperial solution for SGE is quite nice, but does not seem to
support multiple distinct SGE installations at the site (David, correct
me if I am wrong here). David has already commented on pool accounts,
so I won't discuss that further here.
On the different OS types, there is a good deal of effort going into
porting. A summary of who is doing what can be found at:
http://www.grid.ie/porting
cheers,
John
-------------------------------------------------------------------------------------
Grid Manager, Grid-Ireland,
Trinity College, Dublin 2