Hi,
I've noticed that black hole nodes now usually fail jobs with exit
status -1, so I knocked this script up a few weeks ago:
#!/bin/bash
# Scan the last two days of PBS accounting logs for jobs that ended with
# Exit_status=-1, count them per node, raise a Nagios alert via send_nsca,
# and mark the offending node offline.
while read count node
do
    state=$(pbsnodes -a "$node" | awk '/state =/{print $3}')
    if ! echo "$state" | grep -q offline
    then
        snode=$(echo "$node" | cut -d . -f 1)
        echo -e "$snode\tBlackHole Check\t2\tBlackholed $count jobs" | \
            send_nsca -H heplnx200 -p 5667 -c /etc/nagios/send_nsca.cfg > /dev/null
        pbsnodes -o "$node"
    fi
done < <(grep Exit_status=-1 \
    /var/spool/pbs/server_priv/accounting/$(date +%Y%m%d) \
    /var/spool/pbs/server_priv/accounting/$(date --date=now-1day +%Y%m%d) | \
    tr ';' ' ' | \
    awk '/ E /{print $14}' | \
    sed -e 's|exec_host=\(heplnc....pp.rl.ac.uk\)/.|\1|' | \
    sort | uniq -c)   # sort first: uniq -c only merges adjacent lines
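One thing to watch in counting pipelines like this: uniq -c only merges
adjacent duplicate lines, so the host list should normally go through sort
first or repeats of the same node get counted separately (the hostnames
below are invented for illustration):

```shell
# Without sort, the two occurrences of wn01 are counted separately:
printf 'wn01\nwn02\nwn01\n' | uniq -c
# With sort, identical hosts are grouped before counting:
printf 'wn01\nwn02\nwn01\n' | sort | uniq -c
```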
It's very basic and completely site-specific; it hasn't detected any
black holes since I started running it, but then it hasn't taken the farm
completely offline either.
Feel free to take and modify.
On the subject of Torque configuration, I've got a problem with random
"Unspecified gridjob errors". I cannot tie it down to a single user,
group or node; however, I've just found lots of lines like:
10/12/2009 00:01:50;0001;PBS_Server;Svr;PBS_Server;socket_to_handle,
internal socket table full
in server_logs that appear to correlate with the approximate times of the
problems, which suggests we've got too many nodes for our configuration.
Googling around I found a message from Steve Traylen (All Hail the
Traylenator!) that suggests increasing job_stat_rate and setting
poll_jobs for large clusters.
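For reference, both are Torque server attributes set through qmgr; a
minimal sketch (the values here are illustrative, not recommendations):

```shell
# Illustrative values only -- tune for your own cluster size.
qmgr -c "set server job_stat_rate = 300"  # seconds between full job status sweeps
qmgr -c "set server poll_jobs = True"     # poll job status asynchronously
```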
Does anyone have any experience with setting these?
Yours,
Chris.
> -----Original Message-----
> From: Testbed Support for GridPP member institutes [mailto:TB-
> [log in to unmask]] On Behalf Of Peter Gronbech
> Sent: 12 October 2009 17:09
> To: [log in to unmask]
> Subject: Re: black hole node detection
>
> Alessandra wrote a script which is here:
>
>
> Her email to me:
>
> I put the script in the repository. It is really stupid! It greps for
> the node names and counts how many times a node appears in a day. If it
> is above 100, it prints a warning. The number can be changed; we decided
> to be conservative, but sometimes pilots confuse things.
>
> https://www.sysadmin.hep.ac.uk/svn/fabric-management/lcg_ce/black-holes-finder.sh
>
> Alessandra Forti
> NorthGrid Technical Coordinator
> University of Manchester
>
> Cheers Pete
>
>
> --
> ----------------------------------------------------------------------
> Peter Gronbech Senior Systems Manager and Tel No. : 01865 273389
> SouthGrid Technical Co-ordinator Fax No. : 01865 273418
>
> Department of Particle Physics,
> University of Oxford,
> Keble Road, Oxford OX1 3RH, UK E-mail : [log in to unmask]
> ----------------------------------------------------------------------
>
> -----Original Message-----
> From: Testbed Support for GridPP member institutes
> [mailto:[log in to unmask]] On Behalf Of Simon George
> Sent: 12 October 2009 16:35
> To: [log in to unmask]
> Subject: black hole node detection
>
> I vaguely recall hearing that someone had automated the detection of
> "black hole" nodes, i.e. worker nodes that have a problem such that jobs
> that start on them immediately fail and end. The node therefore sucks in
> all queued jobs pretty quickly. I haven't been able to find it with
> Google or the tbsupport archive. Anyone out there know what I am looking
> for?
>
> Thanks,
> Simon