Hi Marten,
regarding glite-wms-job-logging-info -v 2:
After an hour I get the following:
Event: Done
- arrived = Wed Feb 4 10:37:00 2009 CET
- exit_code = 1
- host = grid-wms1.desy.de
- reason = Got a job held event, reason: Unspecified gridmanager error
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Wed Feb 4 10:37:00 2009 CET
- user = /O=GermanGrid/OU=TUD/CN=Ralph Mueller-Pfefferkorn/CN=proxy/CN=proxy
---
Event: Done
- arrived = Wed Feb 4 10:37:12 2009 CET
- exit_code = 1
- host = grid-wms1.desy.de
- reason = Job got an error while in the CondorG queue.
- source = LogMonitor
- src_instance = unique
- status_code = FAILED
- timestamp = Wed Feb 4 10:37:12 2009 CET
- user = /O=GermanGrid/OU=TUD/CN=Ralph Mueller-Pfefferkorn/CN=proxy/CN=proxy
---
I am checking the suggestions listed at
http://goc.grid.sinica.edu.tw/gocwiki/Unspecified_gridmanager_error
Greetings.
Ralph
Ralph Müller-Pfefferkorn wrote on 04.02.2009 10:23:
> Hello Marten,
>
> [log in to unmask] wrote on 31.01.2009 16:55:
>> What does "glite-wms-job-logging-info -v 2" report?
>> Maybe an error that has a Wiki entry here:
>>
>> http://goc.grid.sinica.edu.tw/gocwiki/SiteProblemsFollowUpFaq
> No. No error message at all.
> The last entry is
> Event: Accepted
> - arrived = Wed Feb 4 09:36:01 2009 CET
> - from = JobController
> - from_host = localhost
> - from_instance = unavailable
> - host = grid-wms1.desy.de
> - local_jobid = 517905
> - source = LogMonitor
> - src_instance = unique
> - timestamp = Wed Feb 4 09:36:01 2009 CET
> - user = /O=GermanGrid/OU=TUD/CN=Ralph Mueller-Pfefferkorn/CN=proxy/CN=proxy
>
>
>> The WMS (i.e. Condor-G) uses a feature called "two-phase commit" that is
>> not used by globus-job-run. It is more sensitive to firewall settings.
>> An example of the traffic back and forth between WMS and CE is given here:
>> http://goc.grid.sinica.edu.tw/gocwiki/Dialog_between_RB_and_CE
>> The WMS has the same behavior as the RB, because both use Condor-G.
> We checked the firewall and we don't see any drops during the submission.
>
>> I tried to have a look at your CE, but it seems to sit on a local network
>> or no longer exists:
>> $ uberftp service1.ice.zih.tu-dresden.de pwd
>> globus_xio: Unable to connect to service1.ice.zih.tu-dresden.de:2811
>> globus_xio: globus_libc_getaddrinfo failed.
>> globus_common: Name or service not known
>> Failed to connect to service1.ice.zih.tu-dresden.de port 2811.
> service1.ice.zih.tu-dresden.de is the CE's name on the internal network,
> over which it contacts the Torque server.
> For the outside network/world it is desdemona.zih.tu-dresden.de.
> $ uberftp desdemona.zih.tu-dresden.de pwd
> 220 service1.ice.zih.tu-dresden.de GridFTP Server 2.3 (gcc32dbg, 1144436882-63) ready.
> 230 User zihp0040 logged in.
> 257 "/home/zihp0040" is current directory.
>
>
> We are still investigating the Maui issue. Do you know whether the CE
> really uses Maui to get usage information?
> As I said in the original mail, there are two different Maui versions:
> the gLite version on the CE and the one that runs on the Torque/Maui
> server node (which runs SLES10). As a test, we copied the Maui binaries
> from the SLES node to the CE, and with these binaries Maui works (e.g.
> showres). It seems that Maui's compiled-in authorization keys really
> are an issue.
> What we would like to try is to recompile the gLite-maui version with
> the right key. Do you know where to get the source code (source rpm) for
> it (the version installed is 3.2.6p20-snap.1182974819.8)?
> We don't know if this is the problem, but ... ;)
>
> Greetings.
> Ralph
>
>