> There have been some problems with Steve Lloyd's jobs at our site
> (UKI-SOUTHGRID-CAM-HEP) for the past few days. The jobs are being
> aborted for reasons like:
> cannot retrieve previous matches for
> https://lcgrb01.gridpp.rl.ac.uk:9000/XRUS5IjB-TtenaEIfb-3bw
>
> It looks like a problem with the RB but other sites are not affected.
> Does anyone know what's wrong?
>
Some other sites have been affected intermittently, to a greater or
lesser degree. At Liverpool, the most recent User Analysis job
aborted for a similar reason, and I see a few others seem to have
similarly aborted tests too. At times last week most of the tests
running here were showing as aborted for this reason.
The stateEnterTimes are typically all similar, e.g.:

- stateEnterTimes =
    Submitted : Mon Nov 5 10:32:18 2007
    Waiting   : Mon Nov 5 10:48:36 2007
    Ready     : Mon Nov 5 10:34:00 2007
    Scheduled : Mon Nov 5 10:34:38 2007
    Running   : Mon Nov 5 10:36:43 2007
    Done      : Mon Nov 5 10:48:20 2007
    Cleared   : ---
    Aborted   : Mon Nov 5 10:50:39 2007
    Cancelled : ---

That is, the job seems to run OK, but is aborted approximately 2
minutes after reaching the 'Done' state. That does look to me like it
could be a problem with the RB. I emailed Steve Lloyd about it
directly, and he suggested that the RB lost track of the job, which
seems to happen when the RB is heavily loaded.
I'm not sure why it seems to occur more at some sites, and at some
times, than others though.
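
For anyone wanting to check the same pattern against their own job
status output, here is a minimal Python sketch that parses the
stateEnterTimes listing above and prints the gaps between states. The
names STATE_ENTER_TIMES, parse_state_times and gap_seconds are
illustrative only, and the "State : timestamp" layout is assumed from
the listing quoted in this message, not from any documented format.

```python
from datetime import datetime

# Timestamps copied from the stateEnterTimes listing quoted above.
STATE_ENTER_TIMES = """\
Submitted : Mon Nov 5 10:32:18 2007
Waiting : Mon Nov 5 10:48:36 2007
Ready : Mon Nov 5 10:34:00 2007
Scheduled : Mon Nov 5 10:34:38 2007
Running : Mon Nov 5 10:36:43 2007
Done : Mon Nov 5 10:48:20 2007
Cleared : ---
Aborted : Mon Nov 5 10:50:39 2007
Cancelled : ---
"""


def parse_state_times(text):
    """Return {state: datetime} for every state that has a timestamp."""
    times = {}
    for line in text.splitlines():
        # Split on the first colon only; the timestamp contains colons too.
        state, _, stamp = line.partition(":")
        stamp = stamp.strip()
        if not stamp or stamp == "---":
            continue  # state was never entered
        times[state.strip()] = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
    return times


def gap_seconds(times, earlier, later):
    """Seconds elapsed between entering two states."""
    return int((times[later] - times[earlier]).total_seconds())


times = parse_state_times(STATE_ENTER_TIMES)
print("Running -> Done    :", gap_seconds(times, "Running", "Done"), "s")
print("Done    -> Waiting :", gap_seconds(times, "Done", "Waiting"), "s")
print("Done    -> Aborted :", gap_seconds(times, "Done", "Aborted"), "s")
```

For the listing above this gives 697 s from Running to Done, 16 s from
Done to Waiting, and 139 s from Done to Aborted. Note that the Waiting
timestamp is later than Done, i.e. the job re-entered Waiting after it
had finished, shortly before it was aborted.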
--
Robert Fay
System Administrator office: 210
High Energy Physics Division tel (int): 43396
Oliver Lodge Laboratory tel (ext): +44 (0)151 794 3396
University of Liverpool http://hep.ph.liv.ac.uk