On 02/07/13 15:04, Torsten Harenberg wrote:
> Dear all,
>
> I'm fighting again against "killed by CE admin" (error=3) problems.
>
This looks like the known issue described at
https://wiki.italiangrid.it/twiki/bin/view/CREAM/KnownIssues#CREAM_jobs_are_cancelled_with_st
I suspect you applied that workaround and then updated CREAM, which
overwrote your modifications.
As I understand it, gridengine relies on environment variables being set
for qsub, qacct etc. to work correctly. Unfortunately, these are not set
when the service starts.
https://ggus.eu/ws/ticket_info.php?ticket=88284 refers to this. There is
a similar issue with the BDII, for which Maarten gave a fix in
https://ggus.eu/ws/ticket_info.php?ticket=94510
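
A quick way to check whether the missing environment is what is biting
you is to test which variables are visible in a given environment. The
sketch below is only an illustration: check_sge_env is a made-up helper
name, and the variables I pass to it (SGE_ROOT, SGE_CELL) are the usual
gridengine ones, which may not be the complete set your installation needs.

```shell
#!/bin/sh
# Hypothetical helper: report which of the named variables are unset or
# empty in the current environment. Prints "ok" and returns 0 if all are
# set, otherwise prints "missing: NAME ..." and returns 1.
check_sge_env() {
    missing=""
    for var in "$@"; do
        # Indirect lookup via eval keeps this portable to plain POSIX sh.
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    echo "ok"
    return 0
}

# Example call (the variable list is an assumption, adjust as needed):
#   check_sge_env SGE_ROOT SGE_CELL
```

From an interactive root shell this will most likely print "ok". The
interesting case is the environment the daemons actually run with, which
on Linux you can inspect with tr '\0' '\n' < /proc/PID/environ (PID being
that of the daemon in question). If the variables are missing there, I
would try sourcing $SGE_ROOT/$SGE_CELL/common/settings.sh in the start-up
script (probably /sge-root/default/common/settings.sh in your case,
assuming the cell is called "default").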
Chris
> Here is an example job:
>
> [root@cream-ce cream]#
> [root@cream-ce cream]# grep CREAM884667450 *.log
> glite-ce-cream.log:02 Jul 2013 15:32:16,673
> org.glite.ce.cream.jobmanagement.db.table.JobTable - Job inserted. JobId
> = CREAM884667450
> glite-ce-cream.log:02 Jul 2013 15:32:17,141
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor -
> JOB CREAM884667450 STATUS CHANGED: -- => REGISTERED
> [localUser=atlasprd005] [delegationId=1372767726.231562]
> glite-ce-cream.log:02 Jul 2013 15:34:09,754
> org.glite.ce.cream.cmdmanagement.CommandManager - new command
> [NAME="JOB_START"; PRIORITY_LEVEL=2; IS_ASYNCHRONOUS=true;
> STATUS=ACCEPTED; CATEGORY="JOB_MANAGEMENT";
> EXECUTOR_NAME="BLAHExecutor";
> USER_ID="CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL";
> CREATION_TIME="Tue Jul 02 15:34:09 CEST 2013"; JOB_ID_LIST={
> CREAM078958741; CREAM884667450; CREAM033583472; CREAM131437820;
> CREAM614985609; CREAM950043997; CREAM351981570; CREAM796070681;
> CREAM539064247 }; PRIORITY_LEVEL="1"; EXECUTION_MODE="S";
> IS_ADMIN="false"; REMOTE_REQUEST_ADDRESS="128.142.194.89";
> USER_DN="CN=Robot: ATLAS
> Pilot2,CN=596434,CN=atlpilo2,OU=Users,OU=Organic Units,DC=cern,DC=ch";
> USER_FQAN={ /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> /atlas/usatlas/Role=NULL/Capability=NULL }]
> glite-ce-cream.log:02 Jul 2013 15:34:09,825
> org.glite.ce.cream.cmdmanagement.CommandManager - new command [ID=42227;
> NAME="JOB_START"; PRIORITY_LEVEL=1; IS_ASYNCHRONOUS=true; STATUS=QUEUED;
> CATEGORY="JOB_MANAGEMENT"; EXECUTOR_NAME="BLAHExecutor";
> USER_ID="CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL";
> CREATION_TIME="Tue Jul 02 15:34:09 CEST 2013";
> START_PROCESSING_TIME="Tue Jul 02 15:34:09 CEST 2013";
> JOB_ID="CREAM884667450"; PRIORITY_LEVEL="1"; EXECUTION_MODE="S";
> IS_ADMIN="false"; REMOTE_REQUEST_ADDRESS="128.142.194.89";
> USER_DN="CN=Robot: ATLAS
> Pilot2,CN=596434,CN=atlpilo2,OU=Users,OU=Organic Units,DC=cern,DC=ch";
> USER_FQAN={ /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> /atlas/usatlas/Role=NULL/Capability=NULL }]
> glite-ce-cream.log:02 Jul 2013 15:34:10,858
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor -
> JOB CREAM884667450 STATUS CHANGED: REGISTERED => PENDING
> [localUser=atlasprd005] [delegationId=1372767726.231562]
> glite-ce-cream.log:02 Jul 2013 15:34:17,120
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor -
> JOB CREAM884667450 STATUS CHANGED: PENDING => IDLE
> [localUser=atlasprd005] [delegationId=1372767726.231562]
> glite-ce-cream.log:02 Jul 2013 15:34:17,120
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor -
> ID=42228; NAME="JOB_START"; PRIORITY_LEVEL=1; IS_ASYNCHRONOUS=true;
> STATUS=EXECUTING; CATEGORY="JOB_MANAGEMENT";
> USER_ID="CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL";
> CREATION_TIME="Tue Jul 02 15:34:09 CEST 2013";
> START_PROCESSING_TIME="Tue Jul 02 15:34:10 CEST 2013";
> JOB_ID="CREAM884667450"; PRIORITY_LEVEL="1"; EXECUTION_MODE="S";
> IS_ADMIN="false"; REMOTE_REQUEST_ADDRESS="128.142.194.89";
> USER_DN="CN=Robot: ATLAS
> Pilot2,CN=596434,CN=atlpilo2,OU=Users,OU=Organic Units,DC=cern,DC=ch";
> USER_FQAN={ /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> /atlas/usatlas/Role=NULL/Capability=NULL }
> lrmsAbsJobId=sge/20130702153412/5448799;
> glite-ce-cream.log:02 Jul 2013 15:37:52,021
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor -
> JOB CREAM884667450 STATUS CHANGED: IDLE => CANCELLED
> [description=Cancelled by CE admin] [failureReason=reason=3]
> [localUser=atlasprd005] [delegationId=1372767726.231562]
> glite-ce-cream.log:02 Jul 2013 15:38:34,803
> org.glite.ce.cream.cmdmanagement.CommandManager - new command
> [NAME="JOB_PURGE"; PRIORITY_LEVEL=2; IS_ASYNCHRONOUS=true;
> STATUS=ACCEPTED; CATEGORY="JOB_MANAGEMENT";
> EXECUTOR_NAME="BLAHExecutor";
> USER_ID="CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL";
> CREATION_TIME="Tue Jul 02 15:38:34 CEST 2013"; JOB_ID_LIST={
> CREAM519151914; CREAM537032450; CREAM539064247; CREAM539474902;
> CREAM545266116; CREAM614985609; CREAM770300836; CREAM796070681;
> CREAM804035630; CREAM830533409; CREAM884667450; CREAM906882920;
> CREAM950043997 }; PRIORITY_LEVEL="0"; EXECUTION_MODE="S";
> IS_ADMIN="false"; REMOTE_REQUEST_ADDRESS="128.142.194.89";
> USER_DN="CN=Robot: ATLAS
> Pilot2,CN=596434,CN=atlpilo2,OU=Users,OU=Organic Units,DC=cern,DC=ch";
> USER_FQAN={ /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> /atlas/usatlas/Role=NULL/Capability=NULL }]
> glite-ce-cream.log:02 Jul 2013 15:38:36,413
> org.glite.ce.cream.cmdmanagement.CommandManager - new command [ID=42641;
> NAME="JOB_PURGE"; PRIORITY_LEVEL=0; IS_ASYNCHRONOUS=true; STATUS=QUEUED;
> CATEGORY="JOB_MANAGEMENT"; EXECUTOR_NAME="BLAHExecutor";
> USER_ID="CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL";
> CREATION_TIME="Tue Jul 02 15:38:34 CEST 2013";
> START_PROCESSING_TIME="Tue Jul 02 15:38:35 CEST 2013";
> JOB_ID="CREAM884667450"; PRIORITY_LEVEL="0"; EXECUTION_MODE="S";
> IS_ADMIN="false"; REMOTE_REQUEST_ADDRESS="128.142.194.89";
> USER_DN="CN=Robot: ATLAS
> Pilot2,CN=596434,CN=atlpilo2,OU=Users,OU=Organic Units,DC=cern,DC=ch";
> USER_FQAN={ /atlas/Role=production/Capability=NULL;
> /atlas/Role=NULL/Capability=NULL; /atlas/lcg1/Role=NULL/Capability=NULL;
> /atlas/usatlas/Role=NULL/Capability=NULL }]
> glite-ce-cream.log:02 Jul 2013 15:38:39,186
> org.glite.ce.cream.jobmanagement.db.table.JobTable - Job deleted. JobId
> = CREAM884667450
> glite-ce-cream.log:02 Jul 2013 15:38:39,188
> org.glite.ce.creamapi.jobmanagement.cmdexecutor.AbstractJobExecutor -
> purge: purged job CREAM884667450
>
> [root@cream-ce cream]# grep 5448799 glite-ce-bupdater.log
> 2013-07-02 15:36:08 +-+line
> 520,command_string:/sge-root/bin/lx24-amd64//qacct -j '5448799'
> 2013-07-02 15:36:09 +-+query_err:5448750 5448748 5448733 5448735 5448749
> 5448728 5448723 5448725 5448741 5448745 5448746 5448739 5448738 5448721
> 5448751 5448742 5448722 5448734 5448740 5448732 5448726 5448724 5448744
> 5448731 5448743 5448730 5448729 5448737 5448747 5448736 5448727 5448752
> 5448757 5448754 5448753 5448762 5448761 5448756 5448758 5448759 5448760
> 5448755 5448766 5448764 5448765 5448763 5448767 5448776 5448773 5448775
> 5448770 5448774 5448777 5448781 5448772 5448779 5448769 5448771 5448780
> 5448778 5448768 5448785 5448782 5448788 5448791 5448784 5448789 5448786
> 5448787 5448790 5448783 5448792 5448799 5448794 5448801 5448797 5448800
> 5448802 5448793 5448798 5448795 5448796 5448803
> 2013-07-02 15:37:32 +-+line 587 error,
> command_string:/sge-root/bin/lx24-amd64//qacct -j '5448799'
>
> I think I could boil it down to this SGE behaviour:
>
> [root@cream-ce cream]# /sge-root/bin/lx24-amd64//qstat -j 5448799 | head -30
> ==============================================================
> job_number: 5448799
> exec_file: job_scripts/5448799
> submission_time: Tue Jul 2 15:34:12 2013
> owner: atlasprd005
> uid: 18955
> group: atlasprd
> gid: 1407
> sge_o_home: /home/atlasprd005
> sge_o_log_name: atlasprd005
> sge_o_path:
> /sge-root/bin/lx24-amd64:/sbin:/bin:/usr/sbin:/usr/bin
> sge_o_shell: /sbin/nologin
> sge_o_workdir: /var/tmp
> sge_o_host: cream-ce
> account: sge
> mail_list: [log in to unmask]
> notify: FALSE
> job_name: cream_884667450
> jobshare: 0
> hard_queue_list: atlasprd.q
> shell_list: NONE:/bin/bash
> env_list:
> [log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/88/CREAM884667450/CREAM884667450_jobWrapper.sh@@@[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/proxy/1372767726_231562_14052430460278,[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/88/CREAM884667450/StandardOutput@@@[log in to unmask]:/var/cream_sandbox/atlasprd/CN_Robot__ATLAS_Pilot2_CN_596434_CN_atlpilo2_OU_Users_OU_Organic_Units_DC_cern_DC_ch_atlas_Role_production_Capability_NULL_atlasprd005/88/CREAM884667450/StandardError
> script_file: /tmp/cream_884667450
> version: 1
> project: atlasprd
> scheduling info: queue instance
> "[log in to unmask]" dropped because it is
> temporarily not available
> queue instance
> "[log in to unmask]" dropped because it is
> temporarily not available
> queue instance
> "[log in to unmask]" dropped because it is
> temporarily not available
> queue instance
> "[log in to unmask]" dropped because it is
> temporarily not available
> queue instance
> "[log in to unmask]" dropped because it is
> temporarily not available
>
>
> [root@cream-ce cream]# /sge-root/bin/lx24-amd64//qacct -j 5448799
> error: job id 5448799 not found
>
>
>
> I have
>
> reporting_params accounting=true reporting=false \
> flush_time=00:00:00 joblog=false \
> sharelog=00:00:00
> accounting_flush_time=00:00:00
>
> in my global cluster configuration already.
>
> Any SGE expert out there who could give me a hint what else is needed?
>
> Thanks a lot,
>
> Torsten
>
>