Indeed, the PBS jobid does not map directly to the
file name, even though it looks like it should.
My bad. We'll see what happens. Let me know if
I should put them back. They appear to be reappearing
on their own.
JT
On Thu, 2004-09-02 at 22:40, Jeff Templon wrote:
> Wow, now THAT's interesting ... doing as requested.
> I was moving all these files; there are jobs with id's
> up to about 100 000 in the system. The stale files
> all have PBS id's less than:
>
> 32765
>
> here are the last ten stale files before that:
>
> job.tbn18.nikhef.nl.32743.1080473637
> job.tbn18.nikhef.nl.32749.1094069076
> job.tbn18.nikhef.nl.32754.1094082337
> job.tbn18.nikhef.nl.32757.1079433123
> job.tbn18.nikhef.nl.32757.1093527534
> job.tbn18.nikhef.nl.32762.1087941216
> job.tbn18.nikhef.nl.32763.1092579538
> job.tbn18.nikhef.nl.32764.1092680456
> job.tbn18.nikhef.nl.32765.1087668881
> job.tbn18.nikhef.nl.32765.1092709219
>
> hmmm, 32765 ... wonder what happened to all the stale files
> above 32765? And why does that number look so familiar??
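>
> A guess on my part, not confirmed: 32767 = 2^15 - 1 is the largest
> value a signed 16-bit integer can hold, so anything keeping the PBS
> sequence number in a 16-bit signed counter would wrap right about
> there. Quick arithmetic check:
>
> tbn18:~> echo $((2**15 - 1))
> 32767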
>
> And what is even more interesting, after having removed all
> these files ... the directory is EMPTY! Except for one stale
> file that sort of magically reappeared! Maybe the number after
> job.tbn18.nikhef.nl is not really the job ID?
>
> Hmmm.
>
> JT
>
> On Thu, 2004-09-02 at 22:10, Maarten Litmaath, CERN wrote:
> > On Thu, 2 Sep 2004, Jeff Templon wrote:
> >
> > > So,
> > >
> > > I have just spent a couple of enjoyable hours trying to figure
> > > out what is going on with this silly qstat business.
> > > Firstly, I am on the verge of banning the following user
> > >
> > > /C=UK/O=eScience/OU=QueenMaryLondon/L=Physics/CN=dave kant
> > >
> > > since he seems to be responsible for something like 25% of
> > > the load on our CE, repeatedly looping over many jobs.
> > >
> > > Then I saw something really strange: most of the job IDs being
> > > passed to qstat -f did NOT EVEN EXIST on the system.
> > >
> > > Furthermore, the output of qstat -f was being piped to /dev/null
> > > so whatever this silly program is doing, it's not learning
> > > from the mistake ... imagine someone who called you once
> > > every fifteen minutes and asked "can I speak to Rod, please".
> > > You answer "Rod no longer lives here". Fifteen minutes later,
> > > ...
> > >
> > > So then I tried to inspect the program: you guessed it,
> > > Larry Wall code, write once, read never. The program has
> > > a name like:
> > >
> > > perl /tmp/grid_manager_monitor_agent.atlas004.28318.1000 --delete-sel
> > >
> > > After even more inspection, I see that it's not only dteam
> > > doing this silly asking for jobs that no longer exist;
> > > most of ALL the qstat calls are doing this. From what little
> > > of the code (in this case a good name) I can understand, it
> > > seems to be looking for state files, and I think it means
> > > the files in
> > >
> > > /opt/globus/tmp/gram_job_state
> > >
> > > of which there are over 10,000 on tbn18. I get the feeling
> > > that this code, when run as dteam001, is looking
> > > at all the files in this directory, finding out which
> > > are owned by dteam001, extracting the pbs jobid, and doing
> > > a great big loop over all the jobids so gathered.
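> > >
> > > If that's right, the loop would amount to something like this
> > > (my reconstruction, not the actual perl; it assumes the file
> > > names really are job.<host>.<pbsid>.<timestamp> and that each
> > > mapped account owns its own state files):
> > >
> > > find /opt/globus/tmp/gram_job_state -maxdepth 1 \
> > >      -user dteam001 -name 'job.*' |
> > >   awk -F. '{print $(NF-1) ".tbn18.nikhef.nl"}' |
> > >   xargs -n 1 qstat -f > /dev/null 2>&1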
> > >
> > > Let's see, I currently have 672 active jobs (R or Q state)
> > > and 10,000 of these state files, so I expect about 7% of
> > > the qstat calls to refer to an actual real job on the
> > > system:
> > >
> > > tbn18:~> for n in $(seq 30)
> > > do
> > >   qstat $(ps ax | egrep 'sh -c .*qstat.*[0-9]+.tbn18.nikhef.nl' \
> > >       | gawk '{print $9}') >& stat.q.$n
> > >   sleep 2
> > > done
> > > tbn18:~> egrep '^[0-9]+' stat.q.* | wc
> > > 10 60 908
> > > tbn18:~> grep Unknown stat.q.* | wc
> > > 117 585 6295
> > >
> > > 10 out of 127 is 7.9%.
> > >
> > > So the question is, what do we do? Where do we submit the
> > > bug? Can I just do an rm -f on all these stale state
> > > files in the directory? It has the potential to drop
> > > the load quite a bit, getting rid of 90% of the qstat
> > > calls ...
> >
> > Surely something did not recover from some error condition somewhere;
> > we would like to investigate it, so please move all the stale (!) files
> > into some subdirectory for later inspection.
> > Check if the load goes down accordingly.
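> >
> > Something like this should do it (an untested sketch; it takes
> > "stale" to mean the PBS id in the file name no longer shows up
> > in qstat):
> >
> > cd /opt/globus/tmp/gram_job_state
> > mkdir stale
> > qstat | awk 'NR > 2 {print $1}' | cut -d. -f1 > /tmp/live.ids
> > for f in job.tbn18.nikhef.nl.*
> > do
> >   id=$(echo $f | awk -F. '{print $(NF-1)}')
> >   grep -qx "$id" /tmp/live.ids || mv "$f" stale/
> > done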