Oxford has been orange on GridMap for a week now. Looking at the failure
file, it appears that the last test was run on March 14th.
Is there any reason more recent tests haven't been run? (Or does the file
only display the first failed test?)
http://www.hep.ph.ic.ac.uk/~dguser/logs/tbce01.physics.ox.ac.uk_2119_jobmanager-pbs-workq.txt.log
I am able to globus-job-run, ldapsearch, and qsub on the CE. Furthermore,
the CE has been continuously running jobs for the last several weeks. Right
now every CPU is running at 100% with jobs from
C=fr,o=cnrs,ou=cppm,cn=vincent garonne,[log in to unmask]
From looking at it more carefully (qstat -f), it appears that these jobs
were submitted on Monday and have been "hogging" the queue for 4 days. How
do other people deal with this sort of problem?
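For anyone who wants to run the same check on their own CE, here is a minimal sketch of how long-running jobs could be picked out of "qstat -f" output. It assumes the usual PBS attribute names (Job_Owner, job_state, resources_used.walltime). The function names, the sample text, and the example.org hostnames are all made up for illustration and are not taken from the Oxford CE.

```python
def parse_qstat_full(text):
    """Parse `qstat -f` style text into a list of dicts, one per job."""
    jobs = []
    current = None
    last_key = None
    for line in text.splitlines():
        if line.startswith("Job Id:"):
            # Start of a new job record.
            current = {"Job Id": line.split(":", 1)[1].strip()}
            jobs.append(current)
            last_key = None
        elif line.startswith("\t") and current is not None and last_key:
            # Tab-indented lines continue the previous attribute value.
            current[last_key] += line.strip()
        elif " = " in line and current is not None:
            key, value = line.strip().split(" = ", 1)
            current[key] = value
            last_key = key
    return jobs


def walltime_hours(hms):
    """Convert an HH:MM:SS walltime string to hours."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h + m / 60 + s / 3600


def long_running(jobs, min_hours):
    """Return (job id, owner, walltime) for running jobs at or over min_hours."""
    found = []
    for job in jobs:
        used = job.get("resources_used.walltime")
        if job.get("job_state") == "R" and used and walltime_hours(used) >= min_hours:
            found.append((job["Job Id"], job.get("Job_Owner"), used))
    return found


# Hypothetical sample in the shape of `qstat -f` output (not real data).
SAMPLE = """\
Job Id: 101.ce.example.org
    Job_Name = STDIN
    Job_Owner = userA@ce.example.org
    resources_used.walltime = 96:10:05
    job_state = R
    queue = workq

Job Id: 102.ce.example.org
    Job_Name = STDIN
    Job_Owner = userB@ce.example.org
    resources_used.walltime = 00:12:30
    job_state = R
    queue = workq
"""

hogs = long_running(parse_qstat_full(SAMPLE), min_hours=24)
```

On a real CE the text would come from running "qstat -f" and passing its output to parse_qstat_full; the 24 hour threshold is an arbitrary choice.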
Question 1) Why doesn't it look like GridMap tests are being sent to Oxford?
Question 2) Is it possible to have a "test" queue which will only accept
short jobs, but run them concurrently with larger jobs? Would this solve
the "problem" of people hogging all CPUs at a site?
Cheers,
Ian.
--
Ian Stokes-Rees
Particle Physics, Oxford http://www-pnp.physics.ox.ac.uk/~stokes/