Hi Matt, Patrick,
If it would help, I can be available on Thursday afternoon (is
13:00-14:00 or so a good time to aim for?).
I'll try to tidy up a copy of the Edinburgh Tier2 configs (SGE<->ARC6)
for comparison in advance.
You likely have a much simpler SGE config than ours at Edinburgh, but I
have a handful of site-specific config changes around ARC which might
help fix some of the issues you may be seeing here.
Unfortunately, I have no idea what a JSV script is or how it fits into
the ARC/SGE setup. I'm assuming it rewrites or edits job requirements
on the fly so that ATLAS jobs are handled better by the SGE broker.
With that in mind, at Edinburgh I've had to apply a few additional
settings to get ATLAS to play nicely with our scheduler. We have some
extra environment parameters defined to tell Java in a Singularity
container (and Java in a container in a container) to behave with
regard to things like memory management; otherwise jobs get killed
intermittently at seemingly random stages of execution.
(This presented as lost heartbeat(s) at Edinburgh for a _long_ time
before it was tracked down!)
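The gist is something like the following sketch (illustrative values,
not our exact Edinburgh config). Singularity injects SINGULARITYENV_*
variables into the container environment, and any JVM started inside
picks up JAVA_TOOL_OPTIONS:

# Cap the JVM against the job's memory limit instead of letting it
# size itself from the host's total RAM (value illustrative):
export SINGULARITYENV_JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75"
# For the container-in-a-container case, re-export at each nesting
# level so the inner Singularity passes it on again:
export SINGULARITYENV_SINGULARITYENV_JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75"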
If you're using Univa (and without having reviewed your full config),
I would also caution that 2.5G of h_vmem seems potentially too low for
ATLAS workloads. I have this pinned at 8GB vmem per slot at Edinburgh,
which seems to strike a good balance between job efficiency and getting
access to resources on our shared cluster.
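One way to pin that, as a sketch (the 8G figure is what we use; the
queue name below is taken from your config):

# Set a per-slot h_vmem limit on the grid queue:
qconf -mattr queue h_vmem 8G gridpp.q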
Best Regards,
Rob
On 2021-05-07 13:18, Doidge, Matt wrote:
> (sent on behalf of Patrick, who is having mail alias problems with
> jiscmail)
>
> Dear All,
>
> Here at the Sussex site (UKI-SOUTHGRID-SUSX) we are still experiencing
> high failure rates for production ATLAS jobs and I can't get to the
> bottom of why. Previously, we had 100% success running pilot test
> jobs.
>
> I would like to ask if any of you with expertise in ARC CE6 and/or SGE
> could kindly spare some time on the afternoon of Thursday 13th May for
> a meeting to look at our configuration and see if we can iron out
> these issues?
>
> Also, the accounting on our ARC CE 6 is not working as expected:
> despite registering the ARC CE as a gLite-APEL service on GOCDB, there
> are still no entries for 'UKI-SOUTHGRID-SUSX' on:
>
> https://accounting-next.egi.eu/wlcg/country/United%20Kingdom/normelap_processors/SITE/DATE/2020/9/2021/4/lhc/localinfrajobs/
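> 
> (If a re-publish turns out to be needed, I assume it would go via
> arcctl's accounting republish subcommand; the options vary between
> ARC releases so I would check the built-in help first:
> 
> [grid-arc-01 ~]# arcctl accounting republish --help
> 
> though I'm not sure whether publishing is failing before that point.)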
>
> My arc.conf is below with some other configuration information.
>
> The main job errors seem to relate to a lack of disk space (although
> the scratch dir on the WNs should be adequate; a quick check is
> sketched below) and to lost heartbeats.
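> 
> For reference, free scratch space on a WN can be spot-checked with
> something like (paths taken from arc.conf and the queue config below):
> 
> $ qrsh -q gridpp.q df -h /local/grid/scratch /scratch/tmp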
>
> Thank you,
> Kind regards,
> Patrick Smith | Linux Systems Analyst (HPC GridPP Manager)
> School of Mathematical and Physical Sciences | University of Sussex
> ---------------------------------
>
> My /var/log/arc/jura.log is completely empty. However, an older log
> has:
>
> [grid-arc-01 ~]# tail /var/log/arc/jura.log-20210429
> [2021-04-28 10:56:57,138] [ARC.AccountingDB] [INFO] [2186]
> [Established connection to accounting publishing state database]
> [2021-04-28 10:56:57,139] [ARC.Accounting.Publisher] [INFO] [2186]
> [Publishing latest accounting data to [arex/jura/apel:EGI] target
> (jobs finished since 2020-01-08 00:00:00).]
>
>
> I am running version:
> Name : nordugrid-arc-arex
> Arch : x86_64
> Version : 6.5.0
> OS : CentOS 7
> -------------------------------------------------------------------
> Services running are:
> [grid-arc-01 ~]# arcctl service list
> arc-acix-index (Not installed, Disabled, Stopped)
> arc-acix-scanner (Not installed, Disabled, Stopped)
> arc-arex (Installed, Enabled, Running)
> arc-datadelivery-service (Not installed, Disabled, Stopped)
> arc-gridftpd (Installed, Enabled, Running)
> arc-infosys-ldap (Installed, Enabled, Running)
>
> -------------------------------------------------------------------
> [grid-arc-01 ~]# arcctl accounting apel-brokers
> http://mq.cro-ngi.hr:6163/
> http://broker-prod1.argo.grnet.gr:6163/
>
> [grid-arc-01 ~]# arcctl accounting stats
> A-REX Accounting Statistics:
> Number of Jobs: 619167
> Execution timeframe: 2020-01-08 12:44:12 - 2021-04-30 09:41:26
> Total WallTime: 8905 days, 20:02:41
> Total CPUTime: 9917 days, 9:37:09 (including 237 days, 20:31:14 of
> kernel time)
> Data staged in: 136.0M
> Data staged out: 226.7K
> -------------------------------------------------------------------
> [grid-arc-01 ~]# arcctl accounting job info 1172265
> [2021-04-30 10:55:28,628] [ARCCTL.Accounting] [ERROR] [11695] [There
> are no job accounting information found for job 1172265]
> -------------------------------------------------------------------
> arc.conf
> # ARC-CE v6 Compute Element @ SouthGrid Susx
> # Main configuration file: '/etc/arc.conf'
>
> [common]
> hostname = grid-arc-01.hpc.susx.ac.uk
> x509_host_key = /etc/grid-security/hostkey.pem
> x509_host_cert = /etc/grid-security/hostcert.pem
>
> # voms = vo_name group role capabilities
> [authgroup:dteam]
> voms = dteam * * *
>
> [authgroup:ops]
> voms = ops * * *
>
> [authgroup:atlas]
> voms = atlas * * *
>
> [authgroup:all-vos]
> authgroup = dteam ops atlas
>
> [mapping]
> map_to_pool = atlas /etc/grid-security/pool/atlas
> map_to_pool = ops /etc/grid-security/pool/ops
> map_to_pool = dteam /etc/grid-security/pool/dteam
> # map_with_plugin = authgroup_name timeout plugin [arg1 [arg2 [...]]]
> #map_with_plugin = all-vos 30 /usr/libexec/arc/arc-lcmaps %D %P
> liblcmaps.so /usr/lib64 /etc/lcmaps/lcmaps.db arc
> #map_with_file = all-vos /etc/grid-security/grid-mapfile
>
> [lrms]
> #lrms = fork
> lrms = sge
> sge_root = /cm/shared/apps/sge/current
> sge_bin_path = /cm/shared/apps/sge/current/bin/lx-amd64
> #sge_qmaster_port=536
> #sge_execd_port=537
>
> [arex]
> #sessiondir=/var/spool/arc/sessiondir
> sessiondir=/cm/shared/gridpp/arc/sessiondir
> norootpower=yes
> shared_filesystem = yes
> #scratchdir=/var/spool/arc/scratchdir
> scratchdir=/scratch/tmp
> loglevel = 5
>
> [arex/jura]
> loglevel = INFO
>
> [arex/jura/archiving]
>
> [arex/jura/apel: EGI]
> targeturl = https://mq.cro-ngi.hr:6162
> topic = /queue/global.accounting.cpu.central
> gocdb_name = UKI-SOUTHGRID-SUSX
> legacy_fallback = no
> #benchmark_type = HEPSPEC
> #benchmark_value = 8.74
> #use_ssl = yes
>
> [arex/ws]
>
> [arex/ws/jobs]
> allowaccess = all-vos
>
> [gridftpd]
> allowaccess = all-vos
> loglevel = DEBUG
>
> [gridftpd/jobs]
> allowaccess = all-vos
>
> [infosys]
> loglevel = INFO
>
> [infosys/ldap]
> bdii_debug_level = INFO
>
> [infosys/nordugrid]
>
> [infosys/glue2]
> admindomain_name = UKI-SOUTHGRID-SUSX
>
> [infosys/glue2/ldap]
> #user=slapd
> #slapd=/usr/lib/systemd/system/slapd
> #infosys_ldap_run_dir=/var/run/arc/infosys
> #ldap_schema_dir=/etc/ladap/schema/
>
> [infosys/cluster]
> advertisedvo = ops
> advertisedvo = dteam
> advertisedvo = atlas
> alias = SouthGrid Susx
> hostname = grid-arc-01.hpc.susx.ac.uk
> cluster_location = UK-BN19RH
> cluster_owner = University_of_Sussex
> clustersupport = [log in to unmask]
> #nodememory = 6000
> #defaultmemory = 2048
> nodeaccess = outbound
>
> [queue:gridpp.q]
> comment = Queue for GridPP jobs
> homogeneity=false
> advertisedvo=atlas
> advertisedvo=dteam
> advertisedvo=ops
> allowaccess = all-vos
>
> -------------------------------------------
>
> Due to hard & soft limits being set, jobs were failing, so I had to
> implement the JSV script below:
>
> # Jobs routed to gridpp.q: strip all hard limits so grid jobs are
> # not killed by queue-level memory/CPU/runtime caps.
> if (exists $params{q_hard}{"gridpp.q"}) {
>     jsv_sub_del_param('l_hard', 'h_vmem');
>     jsv_sub_del_param('l_hard', 'm_mem_free');
>     jsv_sub_del_param('l_hard', 's_vmem');
>     jsv_sub_del_param('l_hard', 'h_rt');
>     jsv_sub_del_param('l_hard', 'h_cpu');
>     jsv_sub_del_param('l_hard', 's_cpu');
>     jsv_sub_del_param('l_hard', 's_rt');
> } else {
>     # All other jobs: apply default memory limits if none requested.
>     unless ($params{l_hard}{m_mem_free} || $params{l_hard}{h_vmem}) {
>         jsv_sub_add_param('l_hard', 'm_mem_free', '2G');
>         jsv_sub_add_param('l_hard', 'h_vmem', '2.5G');
>     }
> }
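> 
> (For context: a server-side JSV is enabled via the jsv_url parameter
> in the global SGE configuration, which can be confirmed with the
> command below; the script path shown is illustrative only.)
> 
> $ qconf -sconf | grep jsv_url
> jsv_url      script:/cm/shared/apps/sge/jsv/gridpp_jsv.pl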
> -------------------------------------------
> gridpp.q configuration on SGE:
> $ qconf -sq gridpp.q
> qname gridpp.q
> hostlist @compute_gridpp
> seq_no 1,[@compute_intel_r440_grid=20], \
>        [@compute_intel_c6220_grid=32], \
>        [@compute_intel_r430_grid=40], \
>        [@compute_amd_c6145_grid=64], \
>        [@compute_amd_r6515_grid=128]
> load_thresholds np_load_avg=1.75
> suspend_thresholds NONE
> nsuspend 1
> suspend_interval 00:05:00
> priority 0
> min_cpu_interval 00:05:00
> qtype BATCH
> ckpt_list NONE
> pe_list openmp
> jc_list NO_JC,ANY_JC
> tmpdir /local/grid/scratch
> rerun FALSE
> rerun_limit 0
> rerun_limit_action NONE
> slots 1,[@compute_amd_c6145_grid=64], \
> [@compute_intel_c6220_grid=32], \
> [@compute_intel_r430_grid=40], \
> [@compute_intel_r440_grid=20], \
> [@compute_amd_r6515_grid=128]
> shell /bin/bash
> prolog /usr/bin/sge_filestaging --stagein
> epilog /usr/bin/sge_filestaging --stageout
> shell_start_mode posix_compliant
> starter_method NONE
> suspend_method NONE
> resume_method NONE
> terminate_method NONE
> notify 00:00:60
> owner_list NONE
> user_lists gridpp_users
> xuser_lists NONE
> subordinate_list NONE
> complex_values NONE
> projects NONE
> xprojects NONE
> calendar NONE
> initial_state default
> s_rt INFINITY
> h_rt 72:00:00
> d_rt INFINITY
> s_cpu INFINITY
> h_cpu INFINITY
> s_fsize INFINITY
> h_fsize INFINITY
> s_data INFINITY
> h_data INFINITY
> s_stack INFINITY
> h_stack INFINITY
> s_core INFINITY
> h_core INFINITY
> s_rss INFINITY
> h_rss INFINITY
> s_vmem INFINITY
> h_vmem INFINITY
> -------------------------------------------