Hi,
just to let you know that the problem was caused by the following
environment variables being missing:

GLOBUS_LOCATION=/opt/globus
GLOBUS_GMA=true
GLOBUS_TCP_PORT_RANGE=20000,25000
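
A quick way to spot this kind of problem is to check whether the three variables are visible in the environment in question. The sketch below is only an illustration: check_globus_env is a hypothetical helper name (not part of Globus or gLite), and this message does not say on which host or in which profile file the variables have to be defined, so that part is left open.

```shell
# Sketch: report which of the three variables listed above are unset or empty.
# check_globus_env is a hypothetical helper, not a Globus or gLite command.
check_globus_env() {
    missing=""
    for var in GLOBUS_LOCATION GLOBUS_GMA GLOBUS_TCP_PORT_RANGE; do
        # Indirect lookup of the variable whose name is stored in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "all set"
}
```

Calling check_globus_env in the shell (or init script) that starts the affected service prints the names of any missing variables and returns a non-zero status, so it can also be used in a startup check.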

cheers,
Alessandro

Alessandro Paolini wrote:
> Alessandro Paolini wrote:
>> Hi,
>> we are testing the site gridce.pg.infn.it, and jobs submitted to it
>> remain in the Ready state for a long time before aborting.
>>
>> A simple command, on the other hand, works fine:
>>
>> $ globus-job-run gridce.pg.infn.it/jobmanager-lcgpbs -queue cert /bin/pwd
>> /home/infngrid007/globus-tmp.node235.8182.0
>>
>> After the job submission, the perl process correctly appears on the CE:
>>
>> 2457      8230  0.1  0.3  4548 2816 ?        S    09:54   0:00 perl /home/infngrid007/.globus/.gass_cache/local/md5/87/bc7978431a697555a6502b0d97e0f6/md5/4d/528aa9c06455c127e34d86161aa2b4/data --dest-url=https://gridit-cert-rb.cnaf.infn.it:20001/tmp/condor_g_scratch.0x8c36900.14730/grid-monitor.gridce.pg.infn.it:2119.1/grid-monitor-job-status
>> 2457      8233  2.4  0.9 10244 8372 ?        S    09:54   0:00  \_ perl /tmp/grid_manager_monitor_agent.infngrid007.8230.1000 --delete-self --maxtime=3600s
>>
>> but nothing else happens.
>>
>> On the WMS, the Condor-G logs contain the following lines:
>>
>> ...
>> 000 (053.000.000) 09/16 09:53:44 Job submitted from host: <131.154.99.40:20335>
>>    (https://lb009.cnaf.infn.it:9000/hWcl6dGVN7Z3JgXHbUKTkg) (UI=000000:NS=0000000004:WM=000004:BH=0000000000:JSS=000003:LM=000000:LRMS=000000:APP=000000:LBS=000000) (0)
>> ...
>> 020 (053.000.000) 09/16 10:09:53 Detected Down Globus Resource
>>    RM-Contact: gridce.pg.infn.it:2119/jobmanager-lcgpbs
>> ...
>> 026 (053.000.000) 09/16 10:09:53 Detected Down Grid Resource
>>    GridResource: gt2 gridce.pg.infn.it:2119/jobmanager-lcgpbs
>>
>> The site admins are sure that no firewall is blocking the necessary ports,
>> so I have no idea what may be causing this behaviour (clocks?)
> After a while, the following lines also appeared in the Condor log:
>
> ...
> 019 (053.000.000) 09/16 10:31:14 Globus Resource Back Up
>    RM-Contact: gridce.pg.infn.it:2119/jobmanager-lcgpbs
> ...
> 025 (053.000.000) 09/16 10:31:15 Grid Resource Back Up
>    GridResource: gt2 gridce.pg.infn.it:2119/jobmanager-lcgpbs
> ...
>
> and eventually the job is aborted:
>
> [paolini@ui ~]$ glite-wms-job-status https://lb009.cnaf.infn.it:9000/hWcl6dGVN7Z3JgXHbUKTkg
>
>
> *************************************************************
> BOOKKEEPING INFORMATION:
>
> Status info for the Job : 
> https://lb009.cnaf.infn.it:9000/hWcl6dGVN7Z3JgXHbUKTkg
> Current Status:     Waiting
> Status Reason:      BrokerHelper: no compatible resources
> Destination:        gridce.pg.infn.it:2119/jobmanager-lcgpbs-cert
> Submitted:          Wed Sep 16 09:53:36 2009 CEST
> *************************************************************
>
> In the logging-info -v2 output:
>
> ---
> Event: Transfer
> - Arrived                    =    Wed Sep 16 10:47:08 2009 CEST
> - Dest host                  =    unavailable
> - Dest instance              =    
> /var/glite/logmonitor/CondorG.log/CondorG.1229530674.log
> - Dest jobid                 =    unavailable
> - Destination                =    LRMS
> - Host                       =    gridit-cert-rb.cnaf.infn.it
> - Reason                     =    8 the user cancelled the job
> - Result                     =    FAIL
> - Source                     =    LogMonitor
> - Src instance               =    unique
> - Timestamp                  =    Wed Sep 16 10:47:08 2009 CEST
> - User                       =    /C=IT/O=INFN/OU=Personal 
> Certificate/L=CNAF/CN=Alessandro Paolini/CN=proxy/CN=proxy
>        ---
> Event: Done
> - Arrived                    =    Wed Sep 16 10:47:21 2009 CEST
> - Exit code                  =    1
> - Host                       =    gridit-cert-rb.cnaf.infn.it
> - Reason                     =    Job got an error while in the 
> CondorG queue.
> - Source                     =    LogMonitor
> - Src instance               =    unique
> - Status code                =    FAILED
> - Timestamp                  =    Wed Sep 16 10:47:21 2009 CEST
> - User                       =    /C=IT/O=INFN/OU=Personal 
> Certificate/L=CNAF/CN=Alessandro Paolini/CN=proxy/CN=proxy
>        ---
>
> cheers,
> Alessandro
>>
>> Have you ever seen a similar problem?
>>
>> Cheers,
>> Alessandro
>>
>> P.S. If needed, you can submit jobs to this site by using
>> glite-rb-01.cnaf.infn.it
>>
>
>


-- 
Dr. Alessandro Paolini
INFN - CNAF
Viale Berti Pichat 6/2
40127 Bologna
Italy
tel: +39 051 6092723
fax: +39 051 6092916
ICQ: 192172027
skype: alex.paolini
**********************
"I believe in the power of laughter and tears"
   "as an antidote to hatred and terror"
        "a day without a smile"
             "is a day wasted" >>> Charlie Chaplin