Hi Rod
Thanks for the tip. I have bumped up the MaxStartups to 30, so we will
see if this helps or not.
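For the record, the change (config path as Rod gave; the restart command may
differ per distro):
--------
# /etc/ssh/sshd_config on the CE
MaxStartups 30
--------
followed by e.g. "/etc/init.d/sshd restart".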
From the sshd man page, it looks like the default LoginGraceTime is 120
seconds, so it is possible something is tying up sshd's unauthenticated
connection slots for that long, but that seems like a red herring to me.
Leslie
On Mon, 13 Dec 2004, Rod Walker wrote:
> Leslie,
> sshd_config has
> MaxStartups
> which defaults to 10. From the sshd man page:
> --------
> Specifies the maximum number of concurrent unauthenticated
> connections to the sshd daemon. Additional connections will be
> dropped until authentication succeeds or the LoginGraceTime
> expires for a connection. The default is 10.
> ---------
> This looks like your limit of 10. Try increasing this number in
> /etc/ssh/sshd_config
> and restarting sshd. The fact that this hasn't been reported as a problem
> before may suggest some underlying issue which is making the
> authentication step very slow.
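>
> A quick check first, since the default applies when the line is absent or
> commented out:
> --------
> grep -i maxstartups /etc/ssh/sshd_config
> --------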
>
>
> Cheers,
> Rod.
>
> On Mon, 13 Dec 2004, Leslie Groer wrote:
>
> > Hi Maarten
> >
> > Thanks for all the replies. Yes, I did check that exact node and yes,
> > there are many nodes that are failing with the same pathology - I tried a
> > few others and ssh back to the CE works on all. It seems intermittent and
> > almost like a network problem, but there are no other indications of
> > network issues in the cluster.
> >
> > Has anyone seen sshd saturation on their CEs? If I use xcat tools to
> > look at many nodes at once (with e.g. ssh wn001 "ssh bigmac-lcg-ce
> > /bin/ls") I do get occasional connection failures, but I have to have more
> > than 10 ssh sessions going in parallel back to the CE to provoke this.
> > We have only about 100 jobs at a time running, so I doubt that so many
> > jobs are all finishing at the same time!
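> >
> > Something like the following, run from a single WN, is enough to
> > reproduce it (15 is just an arbitrary count above the limit of 10):
> > --------
> > # spawn 15 parallel ssh sessions back to the CE; once more than 10
> > # are unauthenticated at the same time, some of them fail
> > for i in `seq 1 15`; do ssh bigmac-lcg-ce /bin/true & done
> > wait
> > --------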
> >
> > I am also seeing a slow accumulation of .lcgjm job-description files,
> > especially for dteam, and of globus-job-manager processes, so the strain
> > on the CE is steadily increasing: memory is exhausted (2GB real + 3GB
> > swap), though the load average is still only ~1.5.
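> >
> > E.g. the counts I am tracking (paths as in the error above):
> > --------
> > # leftover gass-cache export dirs and job-manager processes
> > ls -d /home/*/.lcgjm/globus-cache-export.* | wc -l
> > ps ax | grep globus-job-manager | grep -v grep | wc -l
> > --------
> > Both keep growing.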
> >
> > The CE is a dual 2.4 GHz Xeon, so I can't believe it is not powerful
> > enough to support 98 dual-CPU nodes.
> >
> > Leslie
> >
> >
> > On Mon, 13 Dec 2004, Maarten Litmaath, CERN wrote:
> >
> > > On Mon, 13 Dec 2004, Leslie Groer wrote:
> > >
> > > > I am seeing many job failures on our site of the form:
> > > >
> > > > --------------------------
> > > > PBS Job Id: 69221.bigmac-lcg-ce.physics.utoronto.ca
> > > > Job Name: STDIN
> > > > Post job file processing error; job 69221.bigmac-lcg-ce.physics.utoronto.ca on
> > > > host wn089/0
> > > >
> > > > Unable to copy file 69221.bigma.OU to
> > > > bigmac-lcg-ce.physics.utoronto.ca:/home/lhcb001/.lcgjm/globus-cache-export.2kq3Oe/batch.out
> > > > >>> error from copy
> > > > bigmac-lcg-ce.physics.utoronto.ca: Connection refused
> > > > atch.out: No such file or directory
> > > ^^^^^^^^
> > >
> > > Curious that the leading 'b' is missing; this can happen when two separate
> > > streams are connected to the logfile, one overwriting the other;
> > > that would be a PBS bug, but I suppose it has nothing to do with the problem.
> > >
> > > > >>> end error output
> > > > Output retained on that host in: /var/spool/pbs/undelivered/69221.bigma.OU
> > > >
> > > > Unable to copy file 69221.bigma.ER to
> > > > bigmac-lcg-ce.physics.utoronto.ca:/home/lhcb001/.lcgjm/globus-cache-export.2kq3Oe/batch.err
> > > > >>> error from copy
> > > > bigmac-lcg-ce.physics.utoronto.ca: Connection refused
> > > > atch.err: No such file or directory
> > > ^^^^^^^^
> > >
> > > It happened here as well.
> > >
> > > > >>> end error output
> > > > Output retained on that host in: /var/spool/pbs/undelivered/69221.bigma.ER
> > > > --------------------------
> > > >
> > > > If I log into the node as the virtual user (lhcb001 in the case above), I
> > >
> > > Are you sure you logged into the exact WN (node208) that produced the error?
> > > A single misconfigured WN might be responsible for all those errors;
> > > you may have to check each and every one of them.
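> > > E.g. a loop from the CE (assuming your WNs are named wn001..wn098):
> > > --------
> > > for n in `seq -f '%03g' 1 98`; do
> > >     ssh wn$n "ssh bigmac-lcg-ce /bin/true" || echo "wn$n FAILED"
> > > done
> > > --------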
> > >
> > > > can ssh back into the CE with no problem, so the Wiki suggestion
> > > > http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory#head-e7b53f54ef7b9a2d31a356803946aa572bf0566f
> > > > (that the WN entry in ssh_known_hosts is wrong) does not apply. What else
> > > > could be causing this failure?
> > > >
> > > > The undelivered files left in /var/spool/pbs contain
> > > >
> > > > ::::::::::::::
> > > > 69221.bigma.ER
> > > > ::::::::::::::
> > > > submit-helper script running on host node208 gave error: could not export
> > > > the local gass cache import.txt.tar file back to the gatekeeper
> > > > ::::::::::::::
> > >
> >
>
> --
> Rod Walker +1 6042913051
>
,-~~-.___. __________________________________________
/ | ' \ [log in to unmask] Department of Physics
( ) 0 Tel: (416) 978-2959 University of Toronto
\_/-, ,----' Fax: (416) 978-8221 60 St. George St.
==== // Toronto, ON M5S 1A7
/ \-'~; /~~~(O) Canada
/ __/~| / | Office: McLennan Physics Lab Rm 911
=( _____| (_________| http://home.fnal.gov/~groer
Leslie S. Groer