Hi
Thanks ... what do the maui logs say? And does torque log anything
about a failed connection attempt from maui? What does
qmgr -c 'print server' on the torque server show for administrator
rights: is the maui user allowed to schedule jobs (via operators /
managers / acl_hosts in torque)? Also, does maui know the correct
torque server host (see SERVERHOST, ADMINHOST, RMHOST and RMSERVER
in maui.cfg)?
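
A rough shell sketch of the torque-rights and maui.cfg checks, not tested against your setup. It assumes that qmgr prints lines like "set server managers = root@host" (with "+=" for further entries) and that the function names torque_admin_ok and maui_hosts, which are made up for this example, do not clash with anything on your CE. Wildcard entries such as root@* are compared literally, not expanded.

```shell
#!/bin/sh
# Hypothetical helpers for the checks above; names are illustrative only.

# torque_admin_ok USER@HOST
# Reads `qmgr -c 'print server'` output on stdin and succeeds if USER@HOST
# is listed verbatim among the server managers or operators.
# Assumption: qmgr prints "set server managers = a@h" and "... += b@h".
# Wildcard entries (e.g. root@*) are compared literally, not expanded.
torque_admin_ok() {
    want="$1"
    grep -E '^set server (managers|operators) \+?= ' |
        sed -E 's/^set server (managers|operators) \+?= //' |
        tr ',' '\n' |
        grep -qxF "$want"
}

# maui_hosts FILE
# Prints the uncommented SERVERHOST/ADMINHOST/RMHOST/RMSERVER lines
# (optionally indexed, e.g. RMHOST[0]) from a maui.cfg file.
maui_hosts() {
    grep -E '^[[:space:]]*(SERVERHOST|ADMINHOST|RMHOST|RMSERVER)(\[[0-9]+])?[[:space:]]' "$1"
}
```

If maui runs as root on the CE, one might then try
qmgr -c 'print server' | torque_admin_ok root@$(hostname -f) && echo allowed
and maui_hosts /var/spool/maui/maui.cfg (that path is an assumption; use wherever your maui.cfg lives), and compare the printed host with the server name that qstat -q reports.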
JT
Y.Lyublev wrote:
> Hi.
>
> PBS works correctly.
> [root@ceitep root]# !qs
> qstat -q
>
> server: ceitep.itep.ru
>
> Queue            Memory CPU Time Walltime Node Run Que Lm State
> ---------------- ------ -------- -------- ---- --- --- -- -----
> atlas              --   120:00:0 140:00:0  --    6   0 -- E R
> alice              --   120:00:0 140:00:0  --    0   0 -- E R
> lhcb               --   120:00:0 140:00:0  --    7   0 -- E R
> cms                --   120:00:0 140:00:0  --    0   0 -- E R
> dteam              --   48:00:00 72:00:00  --    0   0 -- E R
> photon             --   48:00:00 72:00:00  --    0   0 -- E R
> ops                --   48:00:00 72:00:00  --    0   0 -- E R
>                                              ----- -----
>                                                 13     0
>
> Jobs are running and completing normally.
> [root@ceitep root]# last -10
> alice008 ftpd19737 wn62.itep.ru Mon Mar 12 14:32 - 14:32 (00:00)
> alice008 ftpd17847 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> alice008 ftpd17840 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> alice008 ftpd17819 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> alice010 ftpd17135 wn63.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> alice008 ftpd16304 wn62.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> alice010 ftpd15566 wn63.itep.ru Mon Mar 12 14:29 - 14:29 (00:00)
> root pts/4 vitep2.itep.ru Mon Mar 12 14:24 still logged in
> ops001 ftpd8216 wn50.itep.ru Mon Mar 12 14:23 - 14:23 (00:00)
> cmssgm ftpd5590 wn63.itep.ru Mon Mar 12 14:21 - 14:21 (00:00)
>
> MAUI parameters in use for the queues:
> NODEALLOCATIONPOLICY CPULOAD
> GROUPCFG[alice] MAXPROC=20
> GROUPCFG[atlas] MAXPROC=20
> GROUPCFG[cms] MAXPROC=20
> GROUPCFG[lhcb] MAXPROC=20
>
> But the MAUI commands themselves fail:
> [root@ceitep root]# showq
> ERROR: lost connection to server
> ERROR: cannot request service (status)
>
> Regards, Yevgeniy.
>
>
>> Hi,
>>
>> after this:
>>
>> Y.Lyublev wrote:
>>
>>> [root@testbed01 root]# /etc/init.d/pbs_server restart
>>> Shutting down TORQUE Server: [ OK ]
>>> Starting TORQUE Server: [ OK ]
>> now try e.g.
>>
>> ps uaxw | grep pbs_server
>>
>> and
>>
>> qstat -q
>>
>> and
>>
>> qstat -f
>>
>> and look in /var/spool/pbs/server_logs. The fact that the startup
>> succeeded doesn't mean that the server keeps running for more than
>> a few milliseconds after it "successfully" starts up. "lost connection
>> to server" sounds like either the maui user is not authenticated to
>> torque, or the server has died immediately after startup (or has
>> hung) ... sometimes there is a "bad job" in
>>
>> /var/spool/pbs/server_priv/jobs
>>
>> that is causing the whole thing to hang ...
>>
>> JT
>>
>>> [root@testbed01 root]# /etc/init.d/maui restart
>>> Shutting down MAUI Scheduler: ERROR: lost connection to server
>>> ERROR: cannot request service (status)
>>> [FAILED]
>>> Starting MAUI Scheduler: [ OK ]
>>> [root@testbed01 root]# /etc/init.d/maui restart
>>> Shutting down MAUI Scheduler: ERROR: lost connection to server
>>> ERROR: cannot request service (status)
>>> [FAILED]
>>> Starting MAUI Scheduler: [ OK ]
>>>
>>>
>>>> Steve
>>>>> Yes.
>>>>> For the gLite CE:
>>>>> $ configure_node site-info.def gliteCE TORQUE_server
>>>>>
>>>>> For the LCG CE:
>>>>> $ configure_node site-info.def CE_torque
>>>>>
>>>>>> Steve
>>>>>>
>>>>>>> 2. The LFC server is not working correctly.
>>>>>>> The LFC log on the LFC server shows:
>>>>>>> [root@glwms ORIG]# grep error /var/log/lfc*/log
>>>>>>> /var/log/lfc/log:03/12 04:43:43 2948,0 sendrep: NS002 - send error : Broken pipe
>>>>>>> /var/log/lfc/log:03/12 05:44:06 2948,0 sendrep: NS002 - send error : Broken pipe
>>>>>>> /var/log/lfc/log:03/12 06:45:40 2948,0 sendrep: NS002 - send error : Broken pipe
>>>>>>> /var/log/lfc/log:03/12 07:43:56 2948,0 sendrep: NS002 - send error : Broken pipe
>>>>>>> /var/log/lfc/log:03/12 08:49:12 2948,0 sendrep: NS002 - send error : Broken pipe
>>>>>>> /var/log/lfc/log:03/12 09:03:50 2948,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:09:35 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:09:44 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:09:47 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:09:50 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:09:58 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:10:01 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>> /var/log/lfc/log:03/12 09:14:13 24056,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
>>>>>>>
>>>>>>> And on the UI, the user gets errors when working with the SE through the LFC:
>>>>>>> [lublev@uiitep TEST]$ lcg-cr -v -d se2.itep.ru -l /grid/alice/my_dir/fileSE22.dat --vo alice file:/home/users/lab240/lublev/JOBS/SC3/file.dat
>>>>>>> Using grid catalog type: lfc
>>>>>>> Using grid catalog : glwms.itep.ru
>>>>>>> Source URL: file:/home/users/lab240/lublev/JOBS/SC3/file.dat
>>>>>>> File size: 1073741824
>>>>>>> VO name: alice
>>>>>>> Destination specified: se2.itep.ru
>>>>>>> Destination URL for copy: gsiftp://se2.itep.ru/se2.itep.ru:/storage/alice/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832.523505.0
>>>>>>> # streams: 1
>>>>>>> # set timeout to 0 seconds
>>>>>>> Alias registered in Catalog: lfn:/grid/alice/my_dir/fileSE22.dat
>>>>>>> 1059061760 bytes 24352.58 KB/sec avg 22341.82 KB/sec inst
>>>>>>> Transfer took 43420 ms
>>>>>>> Internal error
>>>>>>> Could not register in Catalog the URL
>>>>>>> srm://se2.itep.ru/dpm/itep.ru/home/alice/generated/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832
>>>>>>> lcg_cr: Communication error on send
>>>>>>>
>>>>>>>
>>>>>>> [lublev@uiitep TEST]$ lcg-del -s se2.itep.ru --vo alice lfn:/grid/alice/my_dir/fileSE22.dat
>>>>>>> Internal error
>>>>>>> lcg_del: Communication error on send
>>>>>>>
>>>>>>>
>>>>>>> [lublev@uiitep TEST]$ lfc-rm -f alice lfn:/grid/alice/my_dir/fileSE22.dat
>>>>>>> alice: invalid path
>>>>>>> send2nsd: NS009 - fatal configuration error: Host unknown: lfn
>>>>>>> lfn:/grid/alice/my_dir/fileSE22.dat: Host not known
>>>>>>>
>>>>>>>
>>>>>>> Any suggestions on how to proceed?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Yevgeniy.
>>>>>> --
>>>>>> Steve Traylen
>>>>>> CERN, IT-GD-OPS.
>>>> --
>>>> Steve Traylen
>>>> CERN, IT-GD-OPS.