On Fri, 2 Feb 2007, Stijn De Smet wrote:
> Maarten Litmaath wrote:
> > Stijn De Smet wrote:
> >
> >> Hello,
> >>
> >> Since I have upgraded my site, I sometimes get errors on jobs submitted
> >
> > When and what did you upgrade?
> >
> LCG 2.7.0 -> gLite 3.0.2
> But the problem is solved. The remote site still had the old hostnames
> for lbhost and nshost in its VO configuration (we noticed the remote CE
> contacting the old RB when debugging).
Those parameters are only relevant on the UI. They have nothing to do
with the problem you described, namely that a job may get resubmitted to
the same (or another) CE while it is still running on the original CE. That
can happen when the CE is misconfigured, causing the RB to think that the
first attempt failed. With "edg-job-get-logging-info -v 1" one can see
all the stages the job went through. If it failed, the entries from the
LogMonitor process normally point to the problem. Look for the exact
error string in the Job Submission section of the GOC Wiki pages:
http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
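
To find candidate jobs on the RB side before running the command above, a
rough sketch like the following could scan the CondorMonitor log for jobs that
were reported as executing and later aborted. It is only an illustration: the
line layout is inferred from the excerpt quoted in this thread, it assumes each
log entry sits on one unwrapped line, and the names job_events and
aborted_while_running are made up for this example. An abort after an
executing event does not prove that the job is still running on the CE; it
only marks the job as worth checking.

```python
import re

# Assumed layout of one unwrapped log line, inferred from the quoted excerpt:
#   "01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): message"
LINE_RE = re.compile(
    r"^(\d{2} \w{3}, \d{2}:\d{2}:\d{2}) -(\w)- ([\w:]+)\(\.\.\.\): (.*)$"
)
EVENT_RE = re.compile(r"Got job (\w+) event\.")
EDG_ID_RE = re.compile(r"EDG id = (\S+)")


def job_events(lines):
    """Return {edg_id: [(timestamp, event), ...]} in log order."""
    events = {}
    pending = None  # last "Got job ... event" not yet tied to an EDG id
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that does not look like a log entry
        stamp, _level, _source, message = match.groups()
        event = EVENT_RE.search(message)
        if event:
            pending = (stamp, event.group(1))
            continue
        edg_id = EDG_ID_RE.search(message)
        if edg_id and pending:
            # The "EDG id" entry that follows an event names the job it belongs to.
            events.setdefault(edg_id.group(1), []).append(pending)
            pending = None
    return events


def aborted_while_running(events):
    """EDG ids whose log shows an 'aborted' event after an 'executing' event."""
    suspects = []
    for edg_id, sequence in events.items():
        names = [name for _, name in sequence]
        if "executing" in names and "aborted" in names[names.index("executing"):]:
            suspects.append(edg_id)
    return suspects
```

Each job ID returned by aborted_while_running(job_events(open(path))) can
then be passed to "edg-job-get-logging-info -v 1" to see why the RB gave up
on the first attempt.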
> Regards,
> Stijn
> >> from my resource broker to other sites. In the logs I can find these
> >> errors:
> >> 01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): Got job
> >> executing event.
> >> 01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): For cluster
> >> 6536 at host ce01.cmi.ua.ac.be
> >> 01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): EDG id =
> >> https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
> >> 02 Feb, 08:06:57 -F- CondorMonitor::processEvent(...): Got unhandled
> >> event 19.
> >> 02 Feb, 08:06:57 -F- CondorMonitor::processEvent(...): Meaning:
> >> "ULOG_GLOBUS_RESOURCE_UP".
> >> 02 Feb, 08:06:57 -I- EventLogger::unhandled_event(...): Unhandled
> >> event, what to do ?
> >> 02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Reason =
> >> "Globus error 12: the connection to the server failed (check host and
> >> port)".
> >> 02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Code = 2,
> >> SubCode = 12
> >> 02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): EDG id =
> >> https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
> >> 02 Feb, 08:07:52 -C- CondorMonitor::processEvent(...): For cluster 6536.
> >> 02 Feb, 08:07:52 -C- CondorMonitor::processEvent(...): Maybe our
> >> brother want say something to us ??
> >> 02 Feb, 08:07:52 -I- CondorMonitor::processEvent(...): It's really a
> >> message from my beloved JobController!
> >> 02 Feb, 08:07:52 -I- CondorMonitor::processEvent(...): Message says:
> >> "Job cancelled from queue".
> >> 02 Feb, 08:07:52 -I- CondorMonitor::processGenericEvent(...):
> >> Attaching force remove timeout to cluster 6536
> >> 02 Feb, 08:07:52 -I- CondorMonitor::processGenericEvent(...): Timeout
> >> force removal will happen in 600 seconds.
> >> 02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): Got job
> >> aborted event.
> >> 02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): For cluster 6536
> >> 02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): EDG id =
> >> https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
> >>
> >> After this, the job is resubmitted, but the already running job (since
> >> there is nothing really wrong with it) is still running at the remote
> >> site, so two copies of the same job are running at the remote site. Is
> >> this just a connection problem, or is something wrong with the configuration?
> >
> > This points to a bug or misconfiguration somewhere. Can you provide the
> > output of this command for an example job that suffered this problem:
> >
> > edg-job-get-logging-info -v 1 job_ID
>