Hello,
Since upgrading my site, I sometimes get errors on jobs submitted
from my resource broker to other sites. The logs show the following:
01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): Got job executing
event.
01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): For cluster 6536
at host ce01.cmi.ua.ac.be
01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): EDG id =
https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
02 Feb, 08:06:57 -F- CondorMonitor::processEvent(...): Got unhandled
event 19.
02 Feb, 08:06:57 -F- CondorMonitor::processEvent(...): Meaning:
"ULOG_GLOBUS_RESOURCE_UP".
02 Feb, 08:06:57 -I- EventLogger::unhandled_event(...): Unhandled event,
what to do ?
02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Reason = "Globus
error 12: the connection to the server failed (check host and port)".
02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Code = 2, SubCode
= 12
02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): EDG id =
https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
02 Feb, 08:07:52 -C- CondorMonitor::processEvent(...): For cluster 6536.
02 Feb, 08:07:52 -C- CondorMonitor::processEvent(...): Maybe our brother
want say something to us ??
02 Feb, 08:07:52 -I- CondorMonitor::processEvent(...): It's really a
message from my beloved JobController!
02 Feb, 08:07:52 -I- CondorMonitor::processEvent(...): Message says:
"Job cancelled from queue".
02 Feb, 08:07:52 -I- CondorMonitor::processGenericEvent(...): Attaching
force remove timeout to cluster 6536
02 Feb, 08:07:52 -I- CondorMonitor::processGenericEvent(...): Timeout
force removal will happen in 600 seconds.
02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): Got job aborted
event.
02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): For cluster 6536
02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): EDG id =
https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
After this, the job is resubmitted. However, the original job is still
running at the remote site (there is nothing actually wrong with it), so
two copies of the same job end up running there. Is this just a
connection problem, or is something wrong with the configuration?
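
In case it helps anyone look through similar logs, here is a minimal Python sketch for pulling the Globus errors out of an excerpt like the one above. It assumes the line layout seen in this excerpt ("DD Mon, HH:MM:SS -X- Class::method(...): message") and that any line without a timestamp is a wrapped continuation of the previous entry. The function names `parse_entries` and `globus_errors` are made up for illustration and are not part of any grid middleware.

```python
import re

# Assumed entry layout, inferred only from the excerpt in this mail:
#   "DD Mon, HH:MM:SS -X- Class::method(...): message"
ENTRY_START = re.compile(
    r"^(?P<stamp>\d{2} \w{3}, \d{2}:\d{2}:\d{2}) -(?P<level>\w)- "
    r"(?P<source>[\w:]+)\(\.\.\.\): (?P<message>.*)$"
)


def parse_entries(lines):
    """Join wrapped continuation lines and return one dict per log entry."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = ENTRY_START.match(line)
        if match:
            entries.append(match.groupdict())
        elif entries:
            # No timestamp: treat the line as a continuation of the previous entry.
            entries[-1]["message"] += " " + line
    return entries


def globus_errors(entries):
    """Return the entries whose message mentions a Globus error."""
    return [e for e in entries if "Globus error" in e["message"]]


# Hypothetical usage on two wrapped entries copied from the excerpt.
sample = [
    '02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Reason = "Globus',
    'error 12: the connection to the server failed (check host and port)".',
    "02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Code = 2, SubCode",
    "= 12",
]
for entry in globus_errors(parse_entries(sample)):
    print(entry["stamp"], entry["message"])
```

Counting how often these errors occur, and against which hosts, might show whether the problem is limited to particular sites.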
Regards,
Stijn