Maarten Litmaath wrote:
> Stijn De Smet wrote:
>
>> Hello,
>>
>> Since I have upgraded my site, I sometimes get errors on jobs submitted
>
> When and what did you upgrade?
>
LCG 2.7.0 -> gLite 3.0.2
But the problem is solved. The remote site still had the old hostnames
for lbhost and nshost in its VO configuration (we noticed the remote CE
contacting the old RB when debugging).
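
In case someone else needs to check for the same kind of leftover, here is
a minimal sketch that lists every line in a directory tree still mentioning
an old hostname. It is only a plain text search: it assumes the
configuration lives in readable text files, and the function name,
directory and hostname in the usage comment are placeholders of mine, not
the real values from either site.

```python
import os


def find_stale_hostname(root, old_host):
    """Return (path, line_number, line) for each line under root containing old_host."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="replace" keeps the scan going on non-UTF-8 files
                with open(path, "r", errors="replace") as handle:
                    for number, line in enumerate(handle, start=1):
                        if old_host in line:
                            hits.append((path, number, line.rstrip("\n")))
            except OSError:
                # Unreadable file (permissions, broken link): skip it
                continue
    return hits


# Hypothetical usage; both arguments are placeholders:
#   for path, number, line in find_stale_hostname("/path/to/config", "old-rb.example.org"):
#       print(f"{path}:{number}: {line}")
```

Any hit for the old RB hostname points at a file that still needs updating.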
Regards,
Stijn
>> from my resource broker to other sites. In the logs I can find these
>> errors:
>> 01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): Got job
>> executing event.
>> 01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): For cluster
>> 6536 at host ce01.cmi.ua.ac.be
>> 01 Feb, 10:20:49 -C- CondorMonitor::processEvent(...): EDG id =
>> https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
>> 02 Feb, 08:06:57 -F- CondorMonitor::processEvent(...): Got unhandled
>> event 19.
>> 02 Feb, 08:06:57 -F- CondorMonitor::processEvent(...): Meaning:
>> "ULOG_GLOBUS_RESOURCE_UP".
>> 02 Feb, 08:06:57 -I- EventLogger::unhandled_event(...): Unhandled
>> event, what to do ?
>> 02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Reason =
>> "Globus error 12: the connection to the server failed (check host and
>> port)".
>> 02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): Code = 2,
>> SubCode = 12
>> 02 Feb, 08:06:57 -C- CondorMonitor::processEvent(...): EDG id =
>> https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
>> 02 Feb, 08:07:52 -C- CondorMonitor::processEvent(...): For cluster 6536.
>> 02 Feb, 08:07:52 -C- CondorMonitor::processEvent(...): Maybe our
>> brother want say something to us ??
>> 02 Feb, 08:07:52 -I- CondorMonitor::processEvent(...): It's really a
>> message from my beloved JobController!
>> 02 Feb, 08:07:52 -I- CondorMonitor::processEvent(...): Message says:
>> "Job cancelled from queue".
>> 02 Feb, 08:07:52 -I- CondorMonitor::processGenericEvent(...):
>> Attaching force remove timeout to cluster 6536
>> 02 Feb, 08:07:52 -I- CondorMonitor::processGenericEvent(...): Timeout
>> force removal will happen in 600 seconds.
>> 02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): Got job
>> aborted event.
>> 02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): For cluster 6536
>> 02 Feb, 08:07:56 -C- CondorMonitor::processEvent(...): EDG id =
>> https://gridrb.atlantis.ugent.be:9000/B9x50f_nJ41TRI8uBbP9tA
>>
>> After this, the job is resubmitted, but the already running job (there
>> is nothing really wrong with it) keeps running at the remote site, so
>> two copies of the same job are running there. Is this just a
>> connection problem, or is something wrong with the configuration?
>
> This points to a bug or misconfiguration somewhere. Can you provide the
> output of this command for an example job that suffered this problem:
>
> edg-job-get-logging-info -v 1 job_ID
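
To find example job IDs to pass to that command, a small script can pull
them out of the log. The following is a rough sketch, assuming the log file
has one event per line (the lines above are wrapped by the mail client) and
that the "EDG id =" line follows the "Reason =" line as in the excerpt. The
function name is my own, not part of any grid tool.

```python
import re

# Patterns taken from the log excerpt quoted above
REASON_RE = re.compile(r'Reason = "Globus error (\d+):')
EDG_ID_RE = re.compile(r"EDG id = (\S+)")


def jobs_with_globus_errors(lines):
    """Return (globus_error_code, edg_id) pairs in order of appearance."""
    results = []
    pending_code = None  # error code seen, waiting for its EDG id line
    for line in lines:
        match = REASON_RE.search(line)
        if match:
            pending_code = int(match.group(1))
            continue
        match = EDG_ID_RE.search(line)
        if match and pending_code is not None:
            results.append((pending_code, match.group(1)))
            pending_code = None
    return results


# Hypothetical usage; the log path is a placeholder:
#   with open("/path/to/logmonitor.log") as log:
#       for code, edg_id in jobs_with_globus_errors(log):
#           print(code, edg_id)
```

Each ID printed this way can then be used as the job_ID argument above.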