Hi

> > Thanks ... what do the maui logs say? And does torque say anything
> > about a failed attempt to connect from maui? What does the torque
> > qmgr -c 'print server'

[root@ceitep root]# qmgr -c 'print server'
#
# Create queues and set their attributes.
#
#
# Create and define queue atlas
#
create queue atlas
set queue atlas queue_type = Execution
set queue atlas resources_max.cput = 120:00:00
set queue atlas resources_max.walltime = 140:00:00
set queue atlas acl_group_enable = True
set queue atlas acl_groups = atlas
set queue atlas enabled = True
set queue atlas started = True
#
# Create and define queue alice
#
create queue alice
set queue alice queue_type = Execution
set queue alice resources_max.cput = 120:00:00
set queue alice resources_max.walltime = 140:00:00
...

> > print about administrator rights, is the maui user allowed to schedule
> > jobs? (via operator / manager / acl_hosts in torque)
> >
> > Also, does maui know the correct torque server host? (via maui
> > SERVERHOST, ADMINHOST, RMHOST and RMSERVER in maui.cfg)

[root@ceitep root]# cat /root/MAUI/maui.cfg
# MAUI configuration example

SERVERHOST            ceitep.itep.ru

ADMIN1                root
ADMIN3                edginfo rgma

ADMINHOST             ceitep.itep.ru
RMCFG[base]           TYPE=PBS

SERVERPORT            40559
SERVERMODE            NORMAL

# Set the PBS server polling interval. If you have short
# queues and/or jobs it is worth setting a short interval (10 seconds).
RMPOLLINTERVAL        00:00:10

# a max. 10 MByte log file in a logical location
LOGFILE               /var/log/maui.log
LOGFILEMAXSIZE        10000000
LOGLEVEL              1

# Set the delay to 1 minute before Maui tries to run a job again,
# in case it failed to run the first time.
# The default value is 1 hour.
DEFERTIME             00:01:00

# Necessary for MPI grid jobs
ENABLEMULTIREQJOBS    TRUE

NODEALLOCATIONPOLICY  CPULOAD

GROUPCFG[alice]       MAXPROC=20
GROUPCFG[atlas]       MAXPROC=20
GROUPCFG[cms]         MAXPROC=20
GROUPCFG[lhcb]        MAXPROC=20
GROUPCFG[photon]      MAXPROC=2
GROUPCFG[dteam]       MAXPROC=4
GROUPCFG[ops]         MAXPROC=4

> > JT

> > Y.Lyublev wrote:
> > Hi.
> >
> > PBS works correctly.
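As a quick check of the administrator-rights question above, this is a minimal sketch: it assumes qmgr is run on the Torque server host and that Maui runs as root on ceitep.itep.ru (host and account taken from the maui.cfg shown above; substitute the actual scheduler account if it differs).

```shell
# Show which accounts Torque currently accepts as managers/operators
# (the 'print server' output above listed only queue definitions).
qmgr -c 'print server' | grep -Ei 'managers|operators|acl_hosts'

# If the scheduler's account is missing, grant it rights
# (user@host below is an assumption based on the maui.cfg above):
qmgr -c 'set server managers += root@ceitep.itep.ru'
qmgr -c 'set server operators += root@ceitep.itep.ru'
```

Without such an entry, Maui's status requests can be rejected by pbs_server, which is one of the two causes JT names for "lost connection to server".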
> > [root@ceitep root]# !qs
> > qstat -q
> >
> > server: ceitep.itep.ru
> >
> > Queue            Memory CPU Time Walltime Node  Run Que Lm  State
> > ---------------- ------ -------- -------- ----  --- --- --  -----
> > atlas              --   120:00:0 140:00:0  --     6   0 --   E R
> > alice              --   120:00:0 140:00:0  --     0   0 --   E R
> > lhcb               --   120:00:0 140:00:0  --     7   0 --   E R
> > cms                --   120:00:0 140:00:0  --     0   0 --   E R
> > dteam              --   48:00:00 72:00:00  --     0   0 --   E R
> > photon             --   48:00:00 72:00:00  --     0   0 --   E R
> > ops                --   48:00:00 72:00:00  --     0   0 --   E R
> >                                                ----- -----
> >                                                   13     0
> >
> > Jobs run and finish in an orderly way.
> > [root@ceitep root]# last -10
> > alice008 ftpd19737 wn62.itep.ru Mon Mar 12 14:32 - 14:32 (00:00)
> > alice008 ftpd17847 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> > alice008 ftpd17840 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> > alice008 ftpd17819 wn62.itep.ru Mon Mar 12 14:31 - 14:31 (00:00)
> > alice010 ftpd17135 wn63.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> > alice008 ftpd16304 wn62.itep.ru Mon Mar 12 14:30 - 14:30 (00:00)
> > alice010 ftpd15566 wn63.itep.ru Mon Mar 12 14:29 - 14:29 (00:00)
> > root     pts/4     vitep2.itep.ru Mon Mar 12 14:24 still logged in
> > ops001   ftpd8216  wn50.itep.ru Mon Mar 12 14:23 - 14:23 (00:00)
> > cmssgm   ftpd5590  wn63.itep.ru Mon Mar 12 14:21 - 14:21 (00:00)
> >
> > The Maui settings for the queues are:
> > NODEALLOCATIONPOLICY CPULOAD
> > GROUPCFG[alice] MAXPROC=20
> > GROUPCFG[atlas] MAXPROC=20
> > GROUPCFG[cms]   MAXPROC=20
> > GROUPCFG[lhcb]  MAXPROC=20
> >
> > But Maui's own commands do not work:
> > [root@ceitep root]# showq
> > ERROR: lost connection to server
> > ERROR: cannot request service (status)
> >
> > Regards, Yevgeniy.
> >
> >> Hi,
> >>
> >> after this:
> >>
> >> Y.Lyublev wrote:
> >>
> >>> [root@testbed01 root]# /etc/init.d/pbs_server restart
> >>> Shutting down TORQUE Server: [ OK ]
> >>> Starting TORQUE Server: [ OK ]
> >>
> >> now try e.g.
> >>
> >> ps uaxw | grep pbs_server
> >>
> >> and
> >>
> >> qstat -q
> >>
> >> and
> >>
> >> qstat -f
> >>
> >> and look in /var/spool/pbs/server_logs. Just the fact that the startup
> >> was successful doesn't mean that the server keeps running for more than
> >> a few milliseconds after it "successfully" starts up. "lost connection
> >> to server" sounds like either the maui user is not authenticated to
> >> torque, OR that the server has died immediately after startup (or has
> >> hung) ... sometimes there is a "bad job" in
> >>
> >> /var/spool/pbs/server_priv/jobs
> >>
> >> that is causing the whole thing to hang ...
> >>
> >> JT
> >>
> >>> [root@testbed01 root]# /etc/init.d/maui restart
> >>> Shutting down MAUI Scheduler: ERROR: lost connection to server
> >>> ERROR: cannot request service (status)
> >>> [FAILED]
> >>> Starting MAUI Scheduler: [ OK ]
> >>> [root@testbed01 root]# /etc/init.d/maui restart
> >>> Shutting down MAUI Scheduler: ERROR: lost connection to server
> >>> ERROR: cannot request service (status)
> >>> [FAILED]
> >>> Starting MAUI Scheduler: [ OK ]
> >>>
> >>>
> >>>> Steve
> >>>>> Yes.
> >>>>> For the gLite CE:
> >>>>> $ configure_node site-info.def gliteCE TORQUE_server
> >>>>>
> >>>>> For the LCG CE:
> >>>>> $ configure_node site-info.def CE_torque
> >>>>>
> >>>>>> Steve
> >>>>>>
> >>>>>>> 2.
> >>>>>>> The LFC server works incorrectly:
> >>>>>>> on the LFC server, the LFC log shows
> >>>>>>> [root@glwms ORIG]# grep error /var/log/lfc*/log
> >>>>>>> /var/log/lfc/log:03/12 04:43:43 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 05:44:06 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 06:45:40 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 07:43:56 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 08:49:12 2948,0 sendrep: NS002 - send error : Broken pipe
> >>>>>>> /var/log/lfc/log:03/12 09:03:50 2948,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:35 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:44 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:47 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:50 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:09:58 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:01 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:10:20 24056,0 Cns_list_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>> /var/log/lfc/log:03/12 09:14:13 24056,0 Cns_insert_rep_entry: mysql_query error: Unknown column 'CTIME' in 'field list'
> >>>>>>>
> >>>>>>> And on the UI the user gets errors when working with the SE through the LFC:
> >>>>>>> [lublev@uiitep TEST]$ lcg-cr -v -d se2.itep.ru -l lfn:/grid/alice/my_dir/fileSE22.dat --vo alice file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>> Using grid catalog type: lfc
> >>>>>>> Using grid catalog : glwms.itep.ru
> >>>>>>> Source URL: file:/home/users/lab240/lublev/JOBS/SC3/file.dat
> >>>>>>> File size: 1073741824
> >>>>>>> VO name: alice
> >>>>>>> Destination specified: se2.itep.ru
> >>>>>>> Destination URL for copy: gsiftp://se2.itep.ru/se2.itep.ru:/storage/alice/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832.523505.0
> >>>>>>> # streams: 1
> >>>>>>> # set timeout to 0 seconds
> >>>>>>> Alias registered in Catalog: lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> 1059061760 bytes 24352.58 KB/sec avg 22341.82 KB/sec inst
> >>>>>>> Transfer took 43420 ms
> >>>>>>> Internal error
> >>>>>>> Could not register in Catalog the URL srm://se2.itep.ru/dpm/itep.ru/home/alice/generated/2007-03-12/file91e66140-81f2-4ca5-ae44-6b86c31d1832
> >>>>>>> lcg_cr: Communication error on send
> >>>>>>>
> >>>>>>> [lublev@uiitep TEST]$ lcg-del -s se2.itep.ru --vo alice lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> Internal error
> >>>>>>> lcg_del: Communication error on send
> >>>>>>>
> >>>>>>> [lublev@uiitep TEST]$ lfc-rm -f alice lfn:/grid/alice/my_dir/fileSE22.dat
> >>>>>>> alice: invalid path
> >>>>>>> send2nsd: NS009 - fatal configuration error: Host unknown: lfn
> >>>>>>> lfn:/grid/alice/my_dir/fileSE22.dat: Host not known
> >>>>>>>
> >>>>>>> Any suggestion on how to proceed?
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Yevgeniy.
> >>>>>> --
> >>>>>> Steve Traylen
> >>>>>> [log in to unmask]
> >>>>>> CERN, IT-GD-OPS.
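The repeated "Unknown column 'CTIME' in 'field list'" errors suggest the running LFC daemon expects a newer cns_db schema than the one in the database, i.e. a schema upgrade that adds the CTIME column to the replica table was not applied. A hedged check (database and table names as in a stock MySQL-backed LFC install; verify them against your deployment):

```shell
# Does the replica table already carry a CTIME column?
# An empty result means the schema predates the daemon.
mysql -u root -p cns_db \
  -e "SHOW COLUMNS FROM Cns_file_replica LIKE 'ctime';"
```

If the column is missing, run the schema migration script shipped with the installed LFC release (and back up the database first) before restarting the daemon; patching the table by hand risks diverging from the supported schema.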
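Independently of the server-side schema problem, the failing lfc-rm invocation above looks like a usage issue: the lfc-* client tools take a plain catalogue path (no lfn: prefix and no VO name as a separate argument) and locate the catalogue via the LFC_HOST environment variable. A sketch using the host and path from the session above:

```shell
# Point the client at the catalogue host shown in the lcg-cr output
export LFC_HOST=glwms.itep.ru

# Confirm the entry exists, then remove it from the namespace
lfc-ls -l /grid/alice/my_dir/fileSE22.dat
lfc-rm /grid/alice/my_dir/fileSE22.dat
```

Note that lfc-rm only removes the catalogue entry; the physical replica stays on the SE, so lcg-del remains the preferred cleanup path once the catalogue errors are fixed.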