Hi Stephen
Thanks for sharing the idea more widely. We are already supposed to be sharing Nagios scripts, but there is a licence issue to resolve (the action to follow up is on me), which has delayed some of the sharing.
We'll be working on the monitoring workshop agenda next week.
The Torque/Maui question should be answered by those who actually run it so I'll leave that one.
Jeremy
-----Original Message-----
From: Testbed Support for GridPP member institutes on behalf of Stephen Childs
Sent: Fri 10/5/2007 9:15 AM
Subject: Learning from the Torque and Maui gurus
One recurring topic during EGEE'07 is how to make collaboration within the UKI
federation more meaningful. One thing I'd like is to benefit from the
considerable expertise that exists in the GridPP world. In this respect, the
upcoming HEPSYSMAN session on monitoring looks like a great idea and we'll try
and make sure we send someone. And hopefully the monitoring recipes will be
documented through the presentations.
Anyway, one area where I'd like to learn more is Torque/Maui configuration and
operation. After three years using PBS/Torque/Maui I still don't feel that I
really understand how it works, and I repeatedly bang up against things that I
think should be easy to do but are impossible. (*) I also find the logging and
debugging fairly baffling even after many hours poring over it.
Every now and again I consider migrating to SGE, but it would be a big
disruption, and so many people seem happy with Torque/Maui that I wonder if
it's just me ...
So can I suggest a future session for GridPP or HEPSYSMAN where the T/M gurus
can explain the basic concepts and work through some examples commonly used at
sites?
It would also be great to update the page
http://www.gridpp.ac.uk/wiki/Torque_and_Maui
(probably created by Steve T. before his exodus?) with common recipes that
various sites have found useful.
Stephen
(*) A few examples off the top of my head:
* Dealing with jobs on failed worker nodes -- ideally restarting or even
just killing them.
* Flexibly listing nodes in a certain state
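
As an illustration of the sort of recipe that could go on the wiki page, here is a rough Python sketch covering both examples above: listing nodes in a given state and finding the jobs sitting on them. It is not a recipe from any site. It assumes the usual `pbsnodes -a` text layout (an unindented node name followed by indented `key = value` lines, with a comma-separated `state` value) and that the `jobs` attribute holds comma-separated `slot/jobid` entries; both may differ between Torque versions. The function names are made up for this example.

```python
import subprocess


def read_pbsnodes():
    """Run `pbsnodes -a` and return its text output (needs a Torque client)."""
    result = subprocess.run(
        ["pbsnodes", "-a"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_pbsnodes(text):
    """Parse `pbsnodes -a` style output into {node_name: {attribute: value}}."""
    nodes = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None              # a blank line ends a node record
        elif not line[0].isspace():
            current = line.strip()      # an unindented line is a node name
            nodes[current] = {}
        elif current is not None and " = " in line:
            key, value = line.strip().split(" = ", 1)
            nodes[current][key] = value
    return nodes


def nodes_in_state(nodes, wanted):
    """Return sorted node names whose comma-separated state includes `wanted`."""
    return sorted(
        name for name, attrs in nodes.items()
        if wanted in attrs.get("state", "").split(",")
    )


def jobs_on_nodes(nodes, names):
    """Return sorted job ids on the given nodes.

    Assumption: `jobs` looks like "0/101.server, 1/102.server".
    """
    jobs = set()
    for name in names:
        for entry in nodes[name].get("jobs", "").split(","):
            entry = entry.strip()
            if entry:
                jobs.add(entry.split("/", 1)[-1])  # drop the slot prefix
    return sorted(jobs)
```

On a machine with the Torque client tools, `nodes = parse_pbsnodes(read_pbsnodes())` followed by `jobs_on_nodes(nodes, nodes_in_state(nodes, "down"))` would give the job ids to consider for requeueing or deletion. What to do with them afterwards is left open, since that is exactly the part that needs guru advice.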