Hi Patrick
That gridengine log points to my job, which runs, unlike the real atlas jobs.
I can think of two reasons why the real atlas jobs fail.
1) The real atlas jobs arrive with either a pilot or production role, while my jobs arrive without a defined role. There might be an issue with the mappings. Is it possible for someone to submit a few hello world jobs with different atlas roles?
2) gridengine rejects the job for some reason. Is there any sign in the gridengine logs of the other atlas jobs? E.g. the job might be requesting resources that don't exist.
dan
* Dr Daniel Traynor, Grid cluster system manager
* Tel +44(0)20 7882 6560, Particle Physics,QMUL
________________________________________
From: Testbed Support for GridPP member institutes <[log in to unmask]> on behalf of Patrick Smith <[log in to unmask]>
Sent: 21 January 2020 14:00
To: [log in to unmask]
Subject: Re: ARC CE6 / UGE not working with Panda Queue
Here is my UGE configuration and a log file from a failed job:
# qconf -ss
grid-arc-01.hpc.susx.ac.uk
# qconf -sql
gridpp.q
# qconf -sq gridpp.q
qname gridpp.q
hostlist @compute_gridpp
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
qtype BATCH
ckpt_list NONE
pe_list openmp
jc_list NO_JC,ANY_JC
tmpdir /tmp
rerun FALSE
rerun_limit 0
rerun_limit_action NONE
slots 1,[@compute_gridpp=40]
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists gridpp_users
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt 72:00:00
h_rt INFINITY
d_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
# qconf -shgrp @compute_gridpp
group_name @compute_gridpp
hostlist node205.cm.cluster node206.cm.cluster
# qhost
HOSTNAME ARCH NCPU NSOC NCOR NTHR NLOAD MEMTOT MEMUSE SWAPTO SWAPUS
----------------------------------------------------------------------------------------------
node205 lx-amd64 20 2 20 20 0.00 157.0G 4.2G 24.0G 0.0
node206 lx-amd64 20 2 20 20 0.00 157.0G 5.4G 32.0G 0.0
UGE log files:
# cat /cm/shared/apps/uge/current/default/faulty_jobs/41526/node205/active_jobs_dir/41526.1/config
add_grp_id=20213
fs_stdin_host=""
fs_stdin_path=
fs_stdin_tmp_path=/tmp/41526.1.gridpp.q/
fs_stdin_file_staging=0
fs_stdout_host=""
fs_stdout_path=
fs_stdout_tmp_path=/tmp/41526.1.gridpp.q/
fs_stdout_file_staging=0
fs_stderr_host=""
fs_stderr_path=
fs_stderr_tmp_path=/tmp/41526.1.gridpp.q/
fs_stderr_file_staging=0
stdout_path=/cm/shared/gridpp/arc/sessiondir/eHXNDmpkmDwn6eHrap0pOjmp5Gqd1mABFKDmydIKDmABFKDmas3Arn/.comment
stderr_path=/local/grid/atlas003
stdin_path=/dev/null
merge_stderr=1
tmpdir=/tmp/41526.1.gridpp.q
tmpdir_count=1
handle_as_binary=0
no_shell=0
ckpt_job=0
max_ijs_client_wait_time=60
h_vmem=2684354560
cgroups_limit_h_vmem=2684354560
s_vmem=INFINITY
h_cpu=INFINITY
s_cpu=INFINITY
h_rss=INFINITY
s_rss=INFINITY
h_stack=INFINITY
s_stack=INFINITY
h_data=INFINITY
s_data=INFINITY
h_core=INFINITY
s_core=INFINITY
h_fsize=INFINITY
s_fsize=INFINITY
s_descriptors=UNDEFINED
h_descriptors=UNDEFINED
s_maxproc=UNDEFINED
h_maxproc=UNDEFINED
s_memorylocked=32M
h_memorylocked=32M
s_locks=UNDEFINED
h_locks=UNDEFINED
priority=0
shell_path=/bin/sh
script_file=/cm/shared/apps/uge/current/default/spool/node205/job_scripts/41526
job_owner=atlas003
min_gid=0
min_uid=0
cwd=/local/grid/atlas003
prolog=none
epilog=none
starter_method=NONE
suspend_method=NONE
resume_method=NONE
terminate_method=NONE
script_timeout=120
pe=none
pe_slots=1
host_slots=1
shell_start_mode=posix_compliant
use_login_shell=1
[log in to unmask]
mail_options=0
forbid_reschedule=0
forbid_apperror=0
queue=gridpp.q
host=node205.cm.cluster
processors=UNDEFINED
binding=explicit:0,0:ScCCCCCCCCCSCCCCCCCCCC
simulate_binding=false
cgroups_path=/sys/fs/cgroup/
cgroups_subdir_name=UGE
cgroups_auto_mount=false
cgroups_core_binding=true
cgroups_enable_forced_numa_nodes=false
cgroups_enable_m_mem_free_limit_as_hard=true
cgroups_enable_m_mem_free_limit_as_soft=false
cgroups_enable_h_vmem_limit=true
cgroups_enable_freezer_suspend_resume=false
cgroups_enable_additional_killing=true
cgroups_lower_m_mem_free_limit=2200M
cgroups_freeze_pe_tasks=false
m_mem_free=2.000G
suspend_pe_tasks=true
cgroups_devices=
mbind=0
job_name=xRSL_Hello_Worl
job_id=41526
ja_task_id=0
account=sge
submission_time=1579528723692
notify=0
acct_project=none
njob_args=0
queue_tmpdir=/tmp
use_afs=0
admin_user=none
notify_kill_type=1
notify_kill=default
notify_susp_type=1
notify_susp=default
qsub_gid=no
pty=0
umask=022
write_osjob_id=1
inherit_env=1
enable_windomacc=0
enable_addgrp_kill=0
cray_xc30_login_node=0
csp=0
ignore_fqdn=1
default_domain=none
communication_params=
sge_root=/cm/shared/apps/uge/current
sge_cell=default
sge_dir_service_timeout=1
port_range=none
start_container_as_root=0
automap_container_users=0
docker_response_timeout=60
xd_run_as_image_user=0
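The `config` file above is a flat list of `key=value` lines, so the limits actually applied to job 41526 can be pulled out mechanically. Below is a minimal Python sketch of that idea. The function names `parse_job_config` and `finite_limits` are made up for illustration and are not part of UGE, and the sample is only a few lines copied from the dump above.

```python
def parse_job_config(text):
    """Parse key=value lines into a dict, skipping lines without '='."""
    config = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:  # ignore lines without '=' (e.g. masked or blank lines)
            config[key] = value
    return config


def finite_limits(config):
    """Return the s_*/h_* limits that are set to an actual value."""
    unset = {"INFINITY", "UNDEFINED", ""}
    return {
        key: value
        for key, value in config.items()
        if key[:2] in ("s_", "h_") and value not in unset
    }


# A few lines copied from the config dump above.
sample = """\
h_vmem=2684354560
s_vmem=INFINITY
h_cpu=INFINITY
s_descriptors=UNDEFINED
s_memorylocked=32M
h_memorylocked=32M
[log in to unmask]
queue=gridpp.q
"""

limits = finite_limits(parse_job_config(sample))
print(limits)
print(int(limits["h_vmem"]) / 2**30, "GiB")
```

For the full file above this should leave only `h_vmem` (2684354560 bytes, i.e. 2.5 GiB) and the two `memorylocked` limits of 32M. Since gridpp.q itself has `h_vmem INFINITY`, the 2.5 GiB value presumably comes from the submission rather than from the queue definition.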
________________________________
From: Testbed Support for GridPP member institutes [[log in to unmask]] on behalf of sjones [[log in to unmask]]
Sent: 21 January 2020 12:40
To: [log in to unmask]
Subject: Re: ARC CE6 / UGE not working with Panda Queue
On 2020-01-21 12:30, Daniela Bauer wrote:
> no, I meant from the account your pilot gets mapped to (unless you map
> pilots to standard user accounts).
I can see in the logs that the mapping is to atlas002, not atlas001. But
yes, it does go to standard atlas grid user accounts, not special pilot
ones. Patrick is not using ARGUS... it's a local system.
On 2020-01-21 12:02, Alessandra Forti wrote:
> for ATLAS it tells you the job submission to the batch system failed.
> So the problem seems to be between the ARC-CE and the batch system.
>
>> 2020-01-21 05:37:46 Finished - job id:
>> ZwKNDmcP1Dwn6eHrap0pOjmp5Gqd1mABFKDmmcHKDmABFKDmM8Vu5n, unix user:
>> 72002:72000, name: "arc_pilot", owner: "/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms:
>> sge, queue: gridpp.q, failure: "Job submission to LRMS failed."
>
It sure looks like that. BTW: Some messages below seem to show the proxy
is OK, and the mapping is OK. Perhaps the answer might lie in batch
system logs… why did sge reject the job?
Ste
/var/log/arc/gridftp.log:
[2020-01-20 11:13:00] [Arc.JobPlugin] [INFO] [14245/23563104] Job
submission user: atlas002 (72002:72000)
[2020-01-20 11:13:00] [Arc.GridFTP_Commands] [VERBOSE] [14245/23563104]
response: 235 Authentication successful\\
[2020-01-20 11:13:00] [Arc.DirectFilePlugin] [VERBOSE]
[14189/139942919896736] open: changing owner for
/cm/shared/gridpp/arc/sessiondir/3HwNDmDAkDwn6eHrap0pOjmp5Gqd1mABFKDm1rIKDmABFKDmu07KKn/runpilot2-wrapper.sh,
72002, 72000
[2020-01-20 11:13:00] [Arc.GridFTP_Commands] [INFO] [14252/23563104]
User subject: /DC=ch/DC=cern/OU=Organic
Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1
[2020-01-20 11:13:00] [Arc.GridFTP_Commands] [INFO] [14252/23563104]
Encrypted: true
[2020-01-20 11:13:00] [Arc.DirectFilePlugin] [VERBOSE]
[14189/139942919896736] open: owner: 72002 72000
[2020-01-20 11:13:00] [Arc.GridFTP_Commands] [VERBOSE]
[14189/139942919896736] response: 150 Opening connection.\\
[2020-01-20 11:13:00] [Arc.Credential] [DEBUG] [14252/23563104]
Certificate format is PEM …
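
The gridftp.log records above share a fixed prefix of four bracketed fields (time, component, level and an id) followed by the message, which makes it easy to list which local account each submission was mapped to. Below is a rough Python sketch of that. It assumes one record per line, as in the file on disk (the excerpt here was wrapped by the mail client), and `mapped_users` is a hypothetical helper, not an ARC tool.

```python
import re

# [timestamp] [component] [level] [id] message
RECORD = re.compile(
    r"^\[(?P<time>[^\]]+)\] \[(?P<component>[^\]]+)\] "
    r"\[(?P<level>[^\]]+)\] \[(?P<ident>[^\]]+)\] (?P<message>.*)$"
)
# Message text as it appears in the excerpt above.
MAPPING = re.compile(
    r"Job submission user: (?P<user>\S+) \((?P<uid>\d+):(?P<gid>\d+)\)"
)


def mapped_users(lines):
    """Yield (time, user, uid, gid) for every job-submission mapping record."""
    for line in lines:
        record = RECORD.match(line.rstrip("\n"))
        if record is None:
            continue  # continuation line or unrelated text
        mapping = MAPPING.search(record["message"])
        if mapping:
            yield (
                record["time"],
                mapping["user"],
                int(mapping["uid"]),
                int(mapping["gid"]),
            )


# Two records from the excerpt above, unwrapped onto single lines.
sample = [
    "[2020-01-20 11:13:00] [Arc.JobPlugin] [INFO] [14245/23563104] "
    "Job submission user: atlas002 (72002:72000)",
    "[2020-01-20 11:13:00] [Arc.GridFTP_Commands] [INFO] [14252/23563104] "
    "Encrypted: true",
]

for entry in mapped_users(sample):
    print(entry)
```

Passing the open log file instead of `sample` would show whether the pilot submissions always land on atlas002 or are spread over several atlas accounts.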
>
> Cheers,
> Daniela
>
> On Tue, 21 Jan 2020 at 12:28, Patrick Smith <[log in to unmask]>
> wrote:
>
>> Hi Alessandra & Daniela,
>>
>> Yes I can su to an atlas001 local user and qsub manually.
>>
> -------------------------------------------------------------------------------------------------------
>>
>> [atlas001@grid-arc-01 ~]$ qsub -q gridpp.q sleep.sh
>> Your job 39943 ("sleep.sh") has been submitted
>> [atlas001@grid-arc-01 ~]$ qstat -j 39943
>>
> -------------------------------------------------------------------------------------------------------
>>
>> hostname node206.cm.cluster
>> group atlas
>> owner atlas001
>> project NONE
>> department gridpp_users
>> jobname sleep.sh
>> jobnumber 39943
>> taskid undefined
>> pe_taskid NONE
>> account sge
>> priority 0
>> cwd /local/grid/atlas001
>> submit_host grid-arc-01.hpc.susx.ac.uk [1]
>> submit_cmd qsub -q gridpp.q sleep.sh
>> qsub_time 01/08/2020 10:34:51.870
>> start_time 01/08/2020 10:34:51.989
>> end_time 01/08/2020 10:39:52.056
>>
> -------------------------------------------------------------------------------------------------------
>>
>> My arc.conf file contains:
>>
>> [lrms]
>> lrms = sge
>> sge_root = /cm/shared/apps/sge/current
>> sge_bin_path = /cm/shared/apps/sge/current/bin/lx-amd64
>>
>>
> http://www.nordugrid.org/arc/arc6/admins/details/lrms.html?highlight=sge#sge
>>
>>
>> Thanks
>> Patrick
>>
>> -------------------------
>>
>> From: Testbed Support for GridPP member institutes
>> [[log in to unmask]] on behalf of Daniela Bauer
>> [[log in to unmask]]
>> Sent: 21 January 2020 12:13
>> To: [log in to unmask]
>> Subject: Re: ARC CE6 / UGE not working with Panda Queue
>>
>> Hi Patrick,
>>
>> Can you su to the atlas pilot user account and do a qsub manually ?
>> (Apologies if you already tested that.)
>>
>> Daniela
>>
>> On Tue, 21 Jan 2020 at 12:02, Alessandra Forti
>> <[log in to unmask]> wrote:
>>
>> Hi Patrick,
>>
>> for ATLAS it tells you the job submission to the batch system
>> failed. So the problem seems to be between the ARC-CE and the batch
>> system. It might be due to something set wrong upstream but we need
>> to understand why the job doesn't go from ARC to the BS.
>>
>> thanks
>>
>> cheers
>> alessandra
>>
>> On 21/01/2020 11:44, Patrick Smith wrote:
>> 2020-01-21 05:37:46 Finished - job id:
>> ZwKNDmcP1Dwn6eHrap0pOjmp5Gqd1mABFKDmmcHKDmABFKDmM8Vu5n, unix user:
>> 72002:72000, name: "arc_pilot", owner: "/DC=ch/DC=cern/OU=Organic
>> Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1", lrms:
>> sge, queue: gridpp.q, failure: "Job submission to LRMS failed."
>>
>> --
>> Inference: a conclusion reached on the basis of evidence and
>> reasoning
>> Respect is a rational process. \\//
>> For Ur-Fascism, disagreement is treason. (U. Eco)
>>
>> -------------------------
>>
>> To unsubscribe from the TB-SUPPORT list, click the following link:
>> https://www.jiscmail.ac.uk/cgi-bin/webadmin?SUBED1=TB-SUPPORT&A=1
>
> --
>
> Sent from the pit of despair
>
> -----------------------------------------------------------
> [log in to unmask]
> HEP Group/Physics Dep
> Imperial College
> London, SW7 2BW
> Tel: +44-(0)20-75947810
> http://www.hep.ph.ic.ac.uk/~dbauer/
>
> Links:
> ------
> [1] http://grid-arc-01.hpc.susx.ac.uk