Hi
>
> Thanks ... what do the maui logs say? And does torque say anything
> about a failed attempt to connect from maui? What does the following
> torque command show?
>
> qmgr -c 'print server'
[root@ceitep root]# qmgr -c 'print server'
#
# Create queues and set their attributes.
#
#
# Create and define queue atlas
#
create queue atlas
set queue atlas queue_type = Execution
set queue atlas resources_max.cput = 120:00:00
set queue atlas resources_max.walltime = 140:00:00
set queue atlas acl_group_enable = True
set queue atlas acl_groups = atlas
set queue atlas enabled = True
set queue atlas started = True
#
# Create and define queue alice
#
create queue alice
set queue alice queue_type = Execution
set queue alice resources_max.cput = 120:00:00
set queue alice resources_max.walltime = 140:00:00
...
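The server-level authorisation attributes asked about further down (managers / operators / acl_hosts) are not in the truncated paste above; they could be pulled out along these lines (a sketch; the hostname is the one used in this thread):

```shell
# Sketch: show only the torque server attributes relevant to maui's
# authorisation; the += form appends a manager without clobbering the list.
qmgr -c 'print server' | grep -E 'managers|operators|acl_hosts'
# If root@ceitep.itep.ru is missing, it could be granted with e.g.:
# qmgr -c 'set server managers += root@ceitep.itep.ru'
```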
>
> What about administrator rights: is the maui user allowed to schedule
> jobs? (via operator / manager / acl_hosts in torque)
>
> Also, does maui know the correct torque server host? (via maui
> SERVERHOST, ADMINHOST, RMHOST and RMSERVER in maui.cfg)
[root@ceitep root]# cat /root/MAUI/maui.cfg
# MAUI configuration example
SERVERHOST ceitep.itep.ru
ADMIN1 root
ADMIN3 edginfo rgma
ADMINHOST ceitep.itep.ru
RMCFG[base] TYPE=PBS
SERVERPORT 40559
SERVERMODE NORMAL
# Set the PBS server polling interval. If you have short queues and/or
# jobs, it is worth setting a short interval (here 10 seconds).
RMPOLLINTERVAL 00:00:10
# a max. 10 MByte log file in a logical location
LOGFILE /var/log/maui.log
LOGFILEMAXSIZE 10000000
LOGLEVEL 1
# Set the delay to 1 minute before Maui tries to run a job again,
# in case it failed to run the first time.
# The default value is 1 hour.
DEFERTIME 00:01:00
# Necessary for MPI grid jobs
ENABLEMULTIREQJOBS TRUE
NODEALLOCATIONPOLICY CPULOAD
GROUPCFG[alice] MAXPROC=20
GROUPCFG[atlas] MAXPROC=20
GROUPCFG[cms] MAXPROC=20
GROUPCFG[lhcb] MAXPROC=20
GROUPCFG[photon] MAXPROC=2
GROUPCFG[dteam] MAXPROC=4
GROUPCFG[ops] MAXPROC=4
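Given the "lost connection to server" errors reported below, one quick sanity check is whether the maui daemon is actually up and listening on the SERVERPORT configured above; a sketch (port 40559 taken from this maui.cfg):

```shell
# Sketch: verify the maui daemon is running and bound to SERVERPORT.
ps auxw | grep '[m]aui'
netstat -tlnp | grep 40559
# maui's own client-side diagnostics can show the resource-manager state:
diagnose -R
```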
>
> JT
>
>
> Y.Lyublev wrote:
> > Hi.
> >
> > PBS works correctly.
> > [root@ceitep root]# !qs
> > qstat -q
> >
> > server: ceitep.itep.ru
> >
> > Queue Memory CPU Time Walltime Node Run Que Lm State
> > ---------------- ------ -------- -------- ---- --- --- -- -----
> > atlas -- 120:00:0 140:00:0 -- 6 0 -- E R
> > alice -- 120:00:0 140:00:0 -- 0 0 -- E R
> > lhcb -- 120:00:0 140:00:0 -- 7 0 -- E R
> > cms -- 120:00:0 140:00:0 -- 0 0 -- E R
> > dteam -- 48:00:00 72:00:00 -- 0 0 -- E R
> > photon -- 48:00:00 72:00:00 -- 0 0 -- E R
> > ops -- 48:00:00 72:00:00 -- 0 0 -- E R
> > ----- -----
> > 13 0
> >
> > Jobs are running and ending normally.
> > [root@ceitep root]# last -10
> > alice008 ftpd19737 wn62.itep.ru Mon Mar 12 14:32 - 14:32 (00:00)
> > alice008 ftpd17847 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> > alice008 ftpd17840 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> > alice008 ftpd17819 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> > alice010 ftpd17135 wn63.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> > alice008 ftpd16304 wn62.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> > alice010 ftpd15566 wn63.itep.ru Mon Mar 12 14:29 - 14:29 (00:00)
> > root pts/4 vitep2.itep.ru Mon Mar 12 14:24 still logged in
> > ops001 ftpd8216 wn50.itep.ru Mon Mar 12 14:23 - 14:23 (00:00)
> > cmssgm ftpd5590 wn63.itep.ru Mon Mar 12 14:21 - 14:21 (00:00)
> >
> > Work parameters of MAUI for queues -
> > NODEALLOCATIONPOLICY CPULOAD
> > GROUPCFG[alice] MAXPROC=20
> > GROUPCFG[atlas] MAXPROC=20
> > GROUPCFG[cms] MAXPROC=20
> > GROUPCFG[lhcb] MAXPROC=20
> >
> > But the MAUI commands themselves do not work -
> > [root@ceitep root]# showq
> > ERROR: lost connection to server
> > ERROR: cannot request service (status)
> >
> > Regards, Yevgeniy.
> >
> >
> >> Hi,
> >>
> >> after this:
> >>
> >> Y.Lyublev wrote:
> >>
> >>> [root@testbed01 root]# /etc/init.d/pbs_server restart
> >>> Shutting down TORQUE Server: [ OK ]
> >>> Starting TORQUE Server: [ OK ]
> >> now try e.g.
> >>
> >> ps uaxw | grep pbs_server
> >>
> >> and
> >>
> >> qstat -q
> >>
> >> and
> >>
> >> qstat -f
> >>
> >> and look in /var/spool/pbs/server_logs. Just the fact that the startup
> >> was successful doesn't mean that the server keeps running for more than
> >> a few milliseconds after it "successfully" starts up. "lost connection
> >> to server" sounds like either the maui user is not authenticated to
> >> torque, OR that the server has died immediately after startup (or has
> >> hung) ... sometimes there is a "bad job" in
> >>
> >> /var/spool/pbs/server_priv/jobs
> >>
> >> that is causing the whole thing to hang ...
> >>
> >> JT
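Moving a suspect job aside, as suggested above, might look like this (a sketch: the jobs.bad directory name is invented here, and .JB/.SC are the usual torque job-file suffixes):

```shell
# Sketch: park suspect job files instead of deleting them, then
# restart the server. Paths follow the layout mentioned above.
/etc/init.d/pbs_server stop
mkdir -p /var/spool/pbs/server_priv/jobs.bad
mv /var/spool/pbs/server_priv/jobs/*.JB \
   /var/spool/pbs/server_priv/jobs/*.SC \
   /var/spool/pbs/server_priv/jobs.bad/
/etc/init.d/pbs_server start
```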
> >>
> >>> [root@testbed01 root]# /etc/init.d/maui restart
> >>> Shutting down MAUI Scheduler: ERROR: lost connection to server
> >>> ERROR: cannot request service (status)
> >>> [FAILED]
> >>> Starting MAUI Scheduler: [ OK ]
> >>> [root@testbed01 root]# /etc/init.d/maui restart
> >>> Shutting down MAUI Scheduler: ERROR: lost connection to server
> >>> ERROR: cannot request service (status)
> >>> [FAILED]
> >>> Starting MAUI Scheduler: [ OK ]
> >>>
> >>>
> >>>> Steve
> >>>>> Yes.
> >>>>> For gLite CE -
> >>>>> $ configure_node site-info.def gliteCE TORQUE_server
> >>>>>
> >>>>> For LCG CE -
> >>>>> $ configure_node site-info.def CE_torque
> >>>>>
> >>>>>> Steve
> >>>>>>
> >>>>>>> 2. The LFC server works incorrectly:
> >>>>>>> On the LFC server, the LFC log has -
> >>>>>>> [root@glwms ORIG]# grep error /var/log/lfc*/log
> >>>>>>> /var/log/lfc/log:03/12 04:43:43 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 05:44:06 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 06:45:40 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 07:43:56 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 08:49:12 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 09:03:50 2948,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:35 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:44 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:47 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:50 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:58 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:01 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:14:13 24056,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>
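The repeated "Unknown column 'CTIME'" errors above usually mean the LFC database schema is older than the running daemon expects. A possible check, assuming the stock cns_db database and Cns_file_replica table names (credentials omitted):

```shell
# Assumption: standard LFC name-server schema (cns_db / Cns_file_replica).
mysql -u lfc -p cns_db -e 'DESCRIBE Cns_file_replica;'
# If CTIME is missing, the schema-upgrade script shipped with the
# installed LFC release would need to be applied.
```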
> >>>>>>> And on the UI the user gets errors when working with the SE through LFC -
> >>>>>>> [lublev@uiitep TEST]$ lcg-cr -v -d se2.itep.ru -l
> >>>>>>> /grid/alice/my_dir/fileSE22.dat --vo alice
> >>>>>>> file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>> Using grid catalog type: lfc
> >>>>>>> Using grid catalog : glwms.itep.ru
> >>>>>>> Source URL: file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>> File size: 1073741824
> >>>>>>> VO name: alice
> >>>>>>> Destination specified: se2.itep.ru
> >>>>>>> Destination URL for copy:
> >>>>>>> gsiftp://se2.itep.ru/se2.itep.ru:/storage/alice/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832.523505.0
> >>>>>>> # streams: 1
> >>>>>>> # set timeout to 0 seconds
> >>>>>>> Alias registered in Catalog: lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> 1059061760 bytes 24352.58 KB/sec avg 22341.82 KB/sec inst
> >>>>>>> Transfer took 43420 ms
> >>>>>>> Internal error
> >>>>>>> Could not register in Catalog the URL
> >>>>>>> srm://se2.itep.ru/dpm/itep.ru/home/alice/generated/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832
> >>>>>>> lcg_cr: Communication error on send
> >>>>>>>
> >>>>>>>
> >>>>>>> [lublev@uiitep TEST]$ lcg-del -s se2.itep.ru --vo alice
> >>>>>>> lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> Internal error
> >>>>>>> lcg_del: Communication error on send
> >>>>>>>
> >>>>>>>
> >>>>>>> [lublev@uiitep TEST]$ lfc-rm -f alice lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> alice: invalid path
> >>>>>>> send2nsd: NS009 - fatal configuration error: Host unknown: lfn
> >>>>>>> lfn:/grid/alice/my_dir/:fileSE22.dat Host not known
> >>>>>>>
> >>>>>>>
> >>>>>>> Any suggestion on how to proceed?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Yevgeniy.
> >>>>>> --
> >>>>>> Steve Traylen
> >>>>>> [log in to unmask]
> >>>>>> CERN, IT-GD-OPS.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>> --
> >>>> Steve Traylen
> >>>> [log in to unmask]
> >>>> CERN, IT-GD-OPS.
> >>>>
> >>>>
> >>>>
> >>>>