Hi

>
> Thanks ... what do the maui logs say?  And does torque say anything
> about a failed attempt to connect from maui?  And what does the torque
>
> qmgr -c 'print server'
>
> command show?

[root@ceitep root]# qmgr -c 'print server'
#
# Create queues and set their attributes.
#
#
# Create and define queue atlas
#
create queue atlas
set queue atlas queue_type = Execution
set queue atlas resources_max.cput = 120:00:00
set queue atlas resources_max.walltime = 140:00:00
set queue atlas acl_group_enable = True
set queue atlas acl_groups = atlas
set queue atlas enabled = True
set queue atlas started = True
#
# Create and define queue alice
#
create queue alice
set queue alice queue_type = Execution
set queue alice resources_max.cput = 120:00:00
set queue alice resources_max.walltime = 140:00:00
...
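
The printout above is truncated and does not show the server-level
attributes, so to answer the question below: something like this should
show whether the maui user is authorized (the grep pattern is only
illustrative):

qmgr -c 'print server' | grep -E 'managers|operators|acl_hosts'

and if nothing is set there, granting access would presumably look like

qmgr -c 'set server operators += root@ceitep.itep.ru'

(assuming maui runs as root on ceitep.itep.ru, which is what ADMIN1 in
maui.cfg below suggests).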



>
> What about administrator rights, is the maui user allowed to schedule
> jobs? (via operator / manager / acl_hosts in torque)
>
> Also, does maui know the correct torque server host?   (via maui
> SERVERHOST, ADMINHOST, RMHOST and RMSERVER in maui.cfg)

[root@ceitep root]# cat /root/MAUI/maui.cfg
# MAUI configuration example

SERVERHOST              ceitep.itep.ru
ADMIN1                  root
ADMIN3                  edginfo rgma
ADMINHOST               ceitep.itep.ru
RMCFG[base]             TYPE=PBS
SERVERPORT              40559
SERVERMODE              NORMAL

# Set PBS server polling interval. If you have short queues and/or jobs
# it is worth setting a short interval. (10 seconds)

RMPOLLINTERVAL        00:00:10

# a max. 10 MByte log file in a logical location

LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              1

# Set the delay to 1 minute before Maui tries to run a job again,
# in case it failed to run the first time.
# The default value is 1 hour.

DEFERTIME       00:01:00

# Necessary for MPI grid jobs
ENABLEMULTIREQJOBS TRUE

NODEALLOCATIONPOLICY CPULOAD
GROUPCFG[alice] MAXPROC=20
GROUPCFG[atlas] MAXPROC=20
GROUPCFG[cms]   MAXPROC=20
GROUPCFG[lhcb]  MAXPROC=20
GROUPCFG[photon]        MAXPROC=2
GROUPCFG[dteam] MAXPROC=4
GROUPCFG[ops]   MAXPROC=4
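
If I read the RMCFG syntax right, the entry above has no HOST attribute,
so maui should contact the pbs_server on the local host; if torque ever
ran on another machine, the interface could be pointed at it explicitly
with something like

RMCFG[base]             TYPE=PBS HOST=ceitep.itep.ru

(HOST and PORT are optional RMCFG attributes). Also, as far as I
understand, the "lost connection to server" that showq prints refers to
the maui daemon itself on SERVERPORT 40559, not to torque, so I can
check that maui stays up and listens there with e.g.

ps uaxw | grep maui
netstat -tlnp | grep 40559

and raise LOGLEVEL temporarily so that /var/log/maui.log shows any
authentication failure against pbs_server.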

>
> JT
>
>
> Y.Lyublev wrote:
> > Hi.
> >
> > PBS works correctly.
> > [root@ceitep root]# !qs
> > qstat -q
> >
> > server: ceitep.itep.ru
> >
> > Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> > ---------------- ------ -------- -------- ----  --- --- --  -----
> > atlas              --   120:00:0 140:00:0   --    6   0 --   E R
> > alice              --   120:00:0 140:00:0   --    0   0 --   E R
> > lhcb               --   120:00:0 140:00:0   --    7   0 --   E R
> > cms                --   120:00:0 140:00:0   --    0   0 --   E R
> > dteam              --   48:00:00 72:00:00   --    0   0 --   E R
> > photon             --   48:00:00 72:00:00   --    0   0 --   E R
> > ops                --   48:00:00 72:00:00   --    0   0 --   E R
> >                                                ----- -----
> >                                                   13     0
> >
> > Jobs are running and ending in an orderly way.
> > [root@ceitep root]# last -10
> > alice008 ftpd19737    wn62.itep.ru     Mon Mar 12 14:32 - 14:32  (00:00)
> > alice008 ftpd17847    wn62.itep.ru     Mon Mar 12 14:31 - 14:31  (00:00)
> > alice008 ftpd17840    wn62.itep.ru     Mon Mar 12 14:31 - 14:31  (00:00)
> > alice008 ftpd17819    wn62.itep.ru     Mon Mar 12 14:31 - 14:31  (00:00)
> > alice010 ftpd17135    wn63.itep.ru     Mon Mar 12 14:30 - 14:30  (00:00)
> > alice008 ftpd16304    wn62.itep.ru     Mon Mar 12 14:30 - 14:30  (00:00)
> > alice010 ftpd15566    wn63.itep.ru     Mon Mar 12 14:29 - 14:29  (00:00)
> > root     pts/4        vitep2.itep.ru   Mon Mar 12 14:24   still logged in
> > ops001   ftpd8216     wn50.itep.ru     Mon Mar 12 14:23 - 14:23  (00:00)
> > cmssgm   ftpd5590     wn63.itep.ru     Mon Mar 12 14:21 - 14:21  (00:00)
> >
> > The working parameters of MAUI for the queues -
> > NODEALLOCATIONPOLICY CPULOAD
> > GROUPCFG[alice] MAXPROC=20
> > GROUPCFG[atlas] MAXPROC=20
> > GROUPCFG[cms]   MAXPROC=20
> > GROUPCFG[lhcb]  MAXPROC=20
> >
> > But the MAUI commands themselves do not work -
> > [root@ceitep root]# showq
> > ERROR:    lost connection to server
> > ERROR:    cannot request service (status)
> >
> > Regards, Yevgeniy.
> >
> >
> >> Hi,
> >>
> >> after this:
> >>
> >> Y.Lyublev wrote:
> >>
> >>> [root@testbed01 root]# /etc/init.d/pbs_server restart
> >>> Shutting down TORQUE Server:                               [  OK  ]
> >>> Starting TORQUE Server:                                    [  OK  ]
> >> now try e.g.
> >>
> >> ps uaxw | grep pbs_server
> >>
> >> and
> >>
> >> qstat -q
> >>
> >> and
> >>
> >> qstat -f
> >>
> >> and look in /var/spool/pbs/server_logs.  Just the fact that the startup
> >> was successful doesn't mean that the server keeps running for more than
> >> a few milliseconds after it "successfully" starts up.  "lost connection
> >> to server" sounds like either the maui user is not authenticated to
> >> torque, OR that the server has died immediately after startup (or has
> >> hung) ... sometimes there is a "bad job" in
> >>
> >> /var/spool/pbs/server_priv/jobs
> >>
> >> that is causing the whole thing to hang ...
> >>
> >> JT
> >>
> >>> [root@testbed01 root]# /etc/init.d/maui restart
> >>> Shutting down MAUI Scheduler: ERROR:    lost connection to server
> >>> ERROR:    cannot request service (status)
> >>>                                                            [FAILED]
> >>> Starting MAUI Scheduler:                                   [  OK  ]
> >>> [root@testbed01 root]# /etc/init.d/maui restart
> >>> Shutting down MAUI Scheduler: ERROR:    lost connection to server
> >>> ERROR:    cannot request service (status)
> >>>                                                            [FAILED]
> >>> Starting MAUI Scheduler:                                   [  OK  ]
> >>>
> >>>
> >>>>    Steve
> >>>>> Yes.
> >>>>> For gLite CE -
> >>>>> $ configure_node site-info.def gliteCE TORQUE_server
> >>>>>
> >>>>> For LCG CE -
> >>>>> $ configure_node site-info.def CE_torque
> >>>>>
> >>>>>>   Steve
> >>>>>>
> >>>>>>> 2. The LFC server works incorrectly:
> >>>>>>>  On the LFC server the LFC log has -
> >>>>>>>  [root@glwms ORIG]# grep error /var/log/lfc*/log
> >>>>>>> /var/log/lfc/log:03/12 04:43:43  2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 05:44:06  2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 06:45:40  2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 07:43:56  2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 08:49:12  2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 09:03:50  2948,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:35 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:44 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:47 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:50 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:58 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:01 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:14:13 24056,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>
> >>>>>>> And on the UI the user gets errors when working with the SE through LFC -
> >>>>>>> [lublev@uiitep TEST]$ lcg-cr -v -d se2.itep.ru -l
> >>>>>>> /grid/alice/my_dir/fileSE22.dat --vo alice
> >>>>>>> file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>> Using grid catalog type: lfc
> >>>>>>> Using grid catalog : glwms.itep.ru
> >>>>>>> Source URL: file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>> File size: 1073741824
> >>>>>>> VO name: alice
> >>>>>>> Destination specified: se2.itep.ru
> >>>>>>> Destination URL for copy:
> >>>>>>> gsiftp://se2.itep.ru/se2.itep.ru:/storage/alice/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832.523505.0
> >>>>>>> # streams: 1
> >>>>>>> # set timeout to 0 seconds
> >>>>>>> Alias registered in Catalog: lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>>    1059061760 bytes  24352.58 KB/sec avg  22341.82 KB/sec inst
> >>>>>>> Transfer took 43420 ms
> >>>>>>> Internal error
> >>>>>>> Could not register in Catalog the URL
> >>>>>>> srm://se2.itep.ru/dpm/itep.ru/home/alice/generated/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832
> >>>>>>> lcg_cr: Communication error on send
> >>>>>>>
> >>>>>>>
> >>>>>>> [lublev@uiitep TEST]$ lcg-del -s se2.itep.ru --vo alice
> >>>>>>> lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> Internal error
> >>>>>>> lcg_del: Communication error on send
> >>>>>>>
> >>>>>>>
> >>>>>>> [lublev@uiitep TEST]$ lfc-rm -f alice lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> alice: invalid path
> >>>>>>> send2nsd: NS009 - fatal configuration error: Host unknown: lfn
> >>>>>>> lfn:/grid/alice/my_dir/:fileSE22.dat  Host not known
> >>>>>>>
> >>>>>>>
> >>>>>>> Any suggestion on how to proceed?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Yevgeniy.
> >>>>>> -- 
> >>>>>> Steve Traylen
> >>>>>> [log in to unmask]
> >>>>>> CERN, IT-GD-OPS.
> >>>> -- 
> >>>> Steve Traylen
> >>>> [log in to unmask]
> >>>> CERN, IT-GD-OPS.