Hi, Ronald.
You are right.
Your advice helped to solve this problem.
Thank you very much.
Cheers, Yevgeniy.
> Hi Yevgeniy,
>
> Did you already try to stop the maui server with "kill"? I remember a
> similar problem, where the service could not be restarted after an
> upgrade of the rpm. Killing the service and then performing a normal
> "service maui start" did the trick.
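Ronald's kill-then-restart workaround can be sketched as follows. A dummy `sleep` process stands in for the maui daemon so the steps are safe to run anywhere; the service name comes from the thread, the rest is an assumption:

```shell
# On the real server the equivalent would be roughly:
#   pkill maui          # or: kill $(pidof maui)
#   service maui start
sleep 300 &                      # stand-in for a maui daemon that won't stop
pid=$!
kill "$pid"                      # bypass the init script and kill it directly
wait "$pid" 2>/dev/null || true  # reap the process; non-zero status is expected
if ! kill -0 "$pid" 2>/dev/null; then
    echo "stale process gone; now run: service maui start"
fi
```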
>
> Cheers,
> Ronald
>
>
>
> Y.Lyublev wrote:
> > Hi.
> > Everything worked fine until the morning of 9.03.2007.
> > After these messages appeared:
> > Following packages have been upgraded on your system:
> >
> > maui (3.2.6p11-2_SL30X => 3.2.6p17-1_sl3)
> > maui-client (3.2.6p11-2_SL30X => 3.2.6p17-1_sl3)
> > maui-server (3.2.6p11-2_SL30X => 3.2.6p17-1_sl3)
> > Shutting down MAUI Scheduler: ERROR: lost connection to server
> > ERROR: cannot request service (status)
> > [FAILED]
> > Starting MAUI Scheduler: [ OK ]
> >
> >
> > --
> > message sent by apt-autoupdate system from ceitep.itep.ru
> > see: /etc/sysconfig/apt-autoupdate for options
> >
> > MAUI and pbs_server have stopped talking to each other.
> > Regards, Yevgeniy.
> >
> >> Hi
> >>
> >> Thanks ... what do the maui logs say? And does torque say anything
> >> about a failed attempt to connect from maui? What does the torque
> >>
> >> qmgr -c 'print server'
> >>
> >> output say about administrator rights? Is the maui user allowed to schedule
> >> jobs (via operator / manager / acl_hosts in torque)?
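What to look for in the qmgr output, sketched below. The real check is `qmgr -c 'print server'` run as root on the CE; since the actual values are site-specific, a hypothetical sample output stands in here:

```shell
# Hypothetical sample of what 'qmgr -c "print server"' might show for this
# site (the user@host values are assumptions, not taken from the thread):
sample_qmgr_output() {
cat <<'EOF'
set server managers = root@ceitep.itep.ru
set server operators = maui@ceitep.itep.ru
set server acl_hosts = ceitep.itep.ru
EOF
}
# maui must appear as a manager or operator to be allowed to query/schedule:
if sample_qmgr_output | grep -E 'managers|operators' | grep -q 'maui@'; then
    echo "maui user is authorized"
else
    echo "maui user is NOT listed; add it with: qmgr -c 'set server operators += maui@<host>'"
fi
```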
> >>
> >> Also, does maui know the correct torque server host? (via maui
> >> SERVERHOST, ADMINHOST, RMHOST and RMSERVER in maui.cfg)
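For reference, a hypothetical maui.cfg fragment using the parameters JT names. The host value is a guess based on the server name in this thread, and the exact parameter spelling varies between maui versions, so check the existing maui.cfg rather than copying this verbatim:

```
SERVERHOST  ceitep.itep.ru
ADMINHOST   ceitep.itep.ru
RMHOST[0]   ceitep.itep.ru
RMSERVER[0] ceitep.itep.ru
```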
> >>
> >> JT
> >>
> >>
> >> Y.Lyublev wrote:
> >>> Hi.
> >>>
> >>> PBS works correctly.
> >>> [root@ceitep root]# !qs
> >>> qstat -q
> >>>
> >>> server: ceitep.itep.ru
> >>>
> >>> Queue Memory CPU Time Walltime Node Run Que Lm State
> >>> ---------------- ------ -------- -------- ---- --- --- -- -----
> >>> atlas -- 120:00:0 140:00:0 -- 6 0 -- E R
> >>> alice -- 120:00:0 140:00:0 -- 0 0 -- E R
> >>> lhcb -- 120:00:0 140:00:0 -- 7 0 -- E R
> >>> cms -- 120:00:0 140:00:0 -- 0 0 -- E R
> >>> dteam -- 48:00:00 72:00:00 -- 0 0 -- E R
> >>> photon -- 48:00:00 72:00:00 -- 0 0 -- E R
> >>> ops -- 48:00:00 72:00:00 -- 0 0 -- E R
> >>> ----- -----
> >>> 13 0
> >>>
> >>> Jobs are running and finishing normally.
> >>> [root@ceitep root]# last -10
> >>> alice008 ftpd19737 wn62.itep.ru Mon Mar 12 14:32 - 14:32 (00:00)
> >>> alice008 ftpd17847 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> >>> alice008 ftpd17840 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> >>> alice008 ftpd17819 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> >>> alice010 ftpd17135 wn63.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> >>> alice008 ftpd16304 wn62.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> >>> alice010 ftpd15566 wn63.itep.ru Mon Mar 12 14:29 - 14:29 (00:00)
> >>> root pts/4 vitep2.itep.ru Mon Mar 12 14:24 still logged in
> >>> ops001 ftpd8216 wn50.itep.ru Mon Mar 12 14:23 - 14:23 (00:00)
> >>> cmssgm ftpd5590 wn63.itep.ru Mon Mar 12 14:21 - 14:21 (00:00)
> >>>
> >>> The MAUI scheduling parameters for the queues:
> >>> NODEALLOCATIONPOLICY CPULOAD
> >>> GROUPCFG[alice] MAXPROC=20
> >>> GROUPCFG[atlas] MAXPROC=20
> >>> GROUPCFG[cms] MAXPROC=20
> >>> GROUPCFG[lhcb] MAXPROC=20
> >>>
> >>> But the MAUI commands themselves fail:
> >>> [root@ceitep root]# showq
> >>> ERROR: lost connection to server
> >>> ERROR: cannot request service (status)
> >>>
> >>> Regards, Yevgeniy.
> >>>
> >>>
> >>>> Hi,
> >>>>
> >>>> after this:
> >>>>
> >>>> Y.Lyublev wrote:
> >>>>
> >>>>> [root@testbed01 root]# /etc/init.d/pbs_server restart
> >>>>> Shutting down TORQUE Server: [ OK ]
> >>>>> Starting TORQUE Server: [ OK ]
> >>>> now try e.g.
> >>>>
> >>>> ps uaxw | grep pbs_server
> >>>>
> >>>> and
> >>>>
> >>>> qstat -q
> >>>>
> >>>> and
> >>>>
> >>>> qstat -f
> >>>>
> >>>> and look in /var/spool/pbs/server_logs. Just the fact that the startup
> >>>> was successful doesn't mean that the server keeps running for more than
> >>>> a few milliseconds after it "successfully" starts up. "lost connection
> >>>> to server" sounds like either the maui user is not authenticated to
> >>>> torque, OR the server has died immediately after startup (or has
> >>>> hung) ... sometimes there is a "bad job" in
> >>>>
> >>>> /var/spool/pbs/server_priv/jobs
> >>>>
> >>>> that is causing the whole thing to hang ...
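JT's "bad job" workaround can be sketched as follows, run against a throwaway directory instead of the live /var/spool/pbs spool. The `jobs.bad` backup directory name and the job file names are assumptions:

```shell
# Move suspect job files out of server_priv/jobs while pbs_server is down,
# then retry startup. Simulated here with mktemp instead of the real spool:
spool=$(mktemp -d)                                  # stand-in for /var/spool/pbs
mkdir -p "$spool/server_priv/jobs" "$spool/server_priv/jobs.bad"
touch "$spool/server_priv/jobs/42.ceitep.SC"        # hypothetical job files
touch "$spool/server_priv/jobs/42.ceitep.JB"
# On the real server: service pbs_server stop
mv "$spool/server_priv/jobs/"* "$spool/server_priv/jobs.bad/"
# On the real server: service pbs_server start, then re-check with qstat -q
ls "$spool/server_priv/jobs" | wc -l                # 0 = queue directory now empty
rm -rf "$spool"
```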
> >>>>
> >>>> JT
> >>>>
> >>>>> [root@testbed01 root]# /etc/init.d/maui restart
> >>>>> Shutting down MAUI Scheduler: ERROR: lost connection to server
> >>>>> ERROR: cannot request service (status)
> >>>>> [FAILED]
> >>>>> Starting MAUI Scheduler: [ OK ]
> >>>>> [root@testbed01 root]# /etc/init.d/maui restart
> >>>>> Shutting down MAUI Scheduler: ERROR: lost connection to server
> >>>>> ERROR: cannot request service (status)
> >>>>> [FAILED]
> >>>>> Starting MAUI Scheduler: [ OK ]
> >>>>>
> >>>>>
> >>>>>> Steve
> >>>>>>> Yes.
> >>>>>>> For gLite CE -
> >>>>>>> $ configure_node site-info.def gliteCE TORQUE_server
> >>>>>>>
> >>>>>>> For LCG CE -
> >>>>>>> $ configure_node site-info.def CE_torque
> >>>>>>>
> >>>>>>>> Steve
> >>>>>>>>
> >>>>>>>>> 2. The LFC server works incorrectly:
> >>>>>>>>> On LFC server LFC LOG has -
> >>>>>>>>> [root@glwms ORIG]# grep error /var/log/lfc*/log
> >>>>>>>>> /var/log/lfc/log:03/12 04:43:43 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>>>> /var/log/lfc/log:03/12 05:44:06 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>>>> /var/log/lfc/log:03/12 06:45:40 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>>>> /var/log/lfc/log:03/12 07:43:56 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>>>> /var/log/lfc/log:03/12 08:49:12 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>>>> /var/log/lfc/log:03/12 09:03:50 2948,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:09:35 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:09:44 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:09:47 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:09:50 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:09:58 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:10:01 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>>> /var/log/lfc/log:03/12 09:14:13 24056,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
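The repeated "Unknown column 'CTIME'" errors suggest the LFC's MySQL schema is older than what the upgraded server code expects, i.e. a schema-upgrade step may have been skipped; that is an interpretation, not something stated in the thread. A quick way to collapse the log to its distinct error types (on the server this would run against /var/log/lfc*/log; here a few of the quoted lines stand in for the file):

```shell
# Count distinct error types in the LFC log excerpt above:
grep -o "Unknown column 'CTIME' in 'field list'\|send error : Broken pipe" <<'EOF' | sort | uniq -c
/var/log/lfc/log:03/12 04:43:43 2948,0 sendrep: NS002 - send error : Broken pipe
/var/log/lfc/log:03/12 05:44:06 2948,0 sendrep: NS002 - send error : Broken pipe
/var/log/lfc/log:03/12 09:03:50 2948,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
/var/log/lfc/log:03/12 09:09:35 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
EOF
```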
> >>>>>>>>>
> >>>>>>>>> And on the UI the user gets errors when working with the SE through the LFC:
> >>>>>>>>> [lublev@uiitep TEST]$ lcg-cr -v -d se2.itep.ru -l
> >>>>>>>>> /grid/alice/my_dir/fileSE22.dat --vo alice
> >>>>>>>>> file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>>>> Using grid catalog type: lfc
> >>>>>>>>> Using grid catalog : glwms.itep.ru
> >>>>>>>>> Source URL: file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>>>> File size: 1073741824
> >>>>>>>>> VO name: alice
> >>>>>>>>> Destination specified: se2.itep.ru
> >>>>>>>>> Destination URL for copy:
> >>>>>>>>> gsiftp://se2.itep.ru/se2.itep.ru:/storage/alice/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832.523505.0
> >>>>>>>>> # streams: 1
> >>>>>>>>> # set timeout to 0 seconds
> >>>>>>>>> Alias registered in Catalog: lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>>>> 1059061760 bytes 24352.58 KB/sec avg 22341.82 KB/sec inst
> >>>>>>>>> Transfer took 43420 ms
> >>>>>>>>> Internal error
> >>>>>>>>> Could not register in Catalog the URL
> >>>>>>>>> srm://se2.itep.ru/dpm/itep.ru/home/alice/generated/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832
> >>>>>>>>> lcg_cr: Communication error on send
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [lublev@uiitep TEST]$ lcg-del -s se2.itep.ru --vo alice
> >>>>>>>>> lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>>>> Internal error
> >>>>>>>>> lcg_del: Communication error on send
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> [lublev@uiitep TEST]$ lfc-rm -f alice lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>>>> alice: invalid path
> >>>>>>>>> send2nsd: NS009 - fatal configuration error: Host unknown: lfn
> >>>>>>>>> lfn:/grid/alice/my_dir/:fileSE22.dat Host not known
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Any suggestion on how to proceed?
> >>>>>>>>>
> >>>>>>>>> Cheers,
> >>>>>>>>> Yevgeniy.
> >>>>>>>> --
> >>>>>>>> Steve Traylen
> >>>>>>>> [log in to unmask]
> >>>>>>>> CERN, IT-GD-OPS.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>> --
> >>>>>> Steve Traylen
> >>>>>> [log in to unmask]
> >>>>>> CERN, IT-GD-OPS.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>