Hi Pablo et al,
I figured it out ... it was rather exotic, so I decided to share it with everyone.
I had changed the virtual memory limit here as a temporary favor to ATLAS (their heavy-ion jobs take more than 4 GB of vmem to run), raising their vmem allocation from "4096mb" (the literal value of resources_max.vmem in torque) to "5192". When jobs started to fail and I could not determine the cause, I reset the value to "4096b".
I guess bash can't run in 4kb.
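To see why that typo hurt, here is a small sketch of how torque-style size suffixes compare. This is an illustration only, not torque's actual parser; it assumes the usual b/kb/mb/gb meanings and that a bare number is taken as bytes.

```python
# Rough sketch of torque-style size-string comparison (illustration only,
# not torque's actual parser).
UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3}

def to_bytes(spec: str) -> int:
    spec = spec.strip().lower()
    # Check the longer suffixes first so "mb" is not mistaken for "b".
    for suffix in sorted(UNITS, key=len, reverse=True):
        if spec.endswith(suffix):
            return int(spec[: -len(suffix)]) * UNITS[suffix]
    return int(spec)  # bare number: assume bytes

print(to_bytes("4096mb"))  # 4294967296 bytes -- the intended 4 GiB
print(to_bytes("4096b"))   # 4096 bytes -- 4 KiB, the typo
```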
Funny thing is, it didn't give me a "killed because of exceeded vmem" message on the WN; instead I got something about PBSE_UNKJOBID and STAGE0 errors.
:-) "should've tried it on my HP-67"

JT
On Feb 17, 2011, at 16:29 , Pablo Fernandez wrote:
> Hi Jeff,
>
> PBS error codes were always a bit obscure to me, the best I could find was this
> page: http://www.eresearchsa.edu.au/pbs_exitcodes
>
> If the code is above 128, it is 128 (or 256, depending on which layer
> reports it) plus the number of the signal that killed the job. In your
> case, 265 - 256 = 9, so Signal 9, SIGKILL.
>
> In our case, when the scheduler kills the job (for running over walltime
> or memory) or when the user cancels it manually, we get Signal 15
> (Exit_status=271).
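Pablo's decoding rule can be sketched in Python (the function name is hypothetical; the 128-vs-256 base depends on which layer reports the status, as he notes):

```python
import signal

def describe_pbs_exit(status: int) -> str:
    """Decode a PBS/torque exit status: a value above 128 (or 256,
    depending on the reporting layer) is that base plus the number of
    the signal that killed the job."""
    for base in (256, 128):
        if status > base:
            sig = signal.Signals(status - base)
            return f"killed by signal {sig.value} ({sig.name})"
    return f"normal exit with code {status}"

print(describe_pbs_exit(265))  # killed by signal 9 (SIGKILL)
print(describe_pbs_exit(271))  # killed by signal 15 (SIGTERM)
```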
> In that case, we try to do a tracejob and see if the job was killed because it
> was requested by someone (a D line above the E status). If it's root, it's the
> scheduler, and if it's a user, it's a normal user job cancel request.
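The tracejob check described above might look like this (the job id is a placeholder, and the exact log wording can vary between torque versions):

```shell
# Look back a few days of server logs for job 12345 (placeholder id)
tracejob -n 3 12345
# In the output, a "D" (delete) record just before the "E" (exit) record
# shows who requested the kill; root usually means the scheduler did it,
# anything else is a user's own qdel.
```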
>
> BR/Pablo
>
>
> On Thursday 17 February 2011 16:04:19 you wrote:
>> Hi *,
>>
>> Some jobs are failing with code 265. This is the exit code torque prints in
>> the server accounting log, and what CREAM prints as 'failurereason'. Any
>> idea what this might be? As far as I can tell, this should be the exit
>> code of the wrapper script. These jobs are being sent by Condor-G from the
>> ATLAS pilot factory, so I have no idea what their wrapper script looks like.
>>
>> JT