On Sat, Aug 12, 2006 at 07:03:43AM +0100 or thereabouts, Ian Stokes-Rees wrote:
> Hi,
>
> David McBride wrote:
> >On Fri, 2006-08-11 at 16:05 +0100, Ian Stokes-Rees wrote:
> >>I'm curious as to how 4000 jobs on a CE can kill it. Surely the CE for
> >>a large cluster would be expected to handle 10,000 or more jobs.
> >
> >4000 (perl?) processes continuously polling the state of the local batch
> >system -- eg by invoking `qstat` every N seconds -- could easily raise
> >the load average of a single machine to debilitating levels.
> >
> >And the site BDII facility, typically installed on the CE head node, is
> >rather sensitive to the local system load..
>
> I understood David Groep (sp?) from NIKHEF wrote a caching replacement
> for this last year some time.
>
> In any case, I feel a bit like my original question was unanswered:
> isn't it *necessary* for a CE to be able to support 4000 simultaneous
> jobs (whether running or queued)? How do the large computing centres
> such as CERN, Bologna, FZK, and RAL handle this?
The 4000 processes are basically caused by fast submission. Each job
queued spawns a new perl process. These then die after 340 seconds,
I think it is, to leave just one per RB per user.
So it is a known property that when a farm goes from idle to full
quickly, the CE suffers.
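
The caching idea mentioned above can be sketched roughly as below. This
is only an illustration of the general technique, not the actual NIKHEF
replacement: a wrapper that many job-manager processes call instead of
qstat directly, so the batch server is polled at most once per TTL and
everyone else reads a shared snapshot. The function name, cache path,
and the QSTAT_CMD override are all hypothetical.

```shell
# cached_qstat CACHE_FILE TTL_SECONDS
# Serves a cached copy of the batch-system status, refreshing it only
# when the cache file is older than TTL seconds. QSTAT_CMD defaults to
# qstat but can be overridden (hypothetical knob, for testing).
cached_qstat() {
    cache=${1:-/tmp/qstat.cache}
    ttl=${2:-60}
    cmd=${QSTAT_CMD:-qstat}

    now=$(date +%s)
    if [ -f "$cache" ]; then
        # Age of the cached snapshot (GNU stat; %Y = mtime in seconds).
        age=$(( now - $(stat -c %Y "$cache") ))
    else
        age=$(( ttl + 1 ))      # no cache yet: force a refresh
    fi

    if [ "$age" -gt "$ttl" ]; then
        # Refresh atomically; on failure keep serving the old snapshot.
        $cmd > "$cache.tmp" 2>/dev/null && mv "$cache.tmp" "$cache"
    fi
    cat "$cache" 2>/dev/null
}
```

With 4000 pollers this turns 4000 qstat invocations per interval into
one, at the cost of results being up to TTL seconds stale.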
Steve
>
> Cheers,
>
> Ian
>
> --
> Ian Stokes-Rees [log in to unmask]
> Particle Physics, Oxford http://grid.physics.ox.ac.uk/~stokes
--
Steve Traylen
work email: [log in to unmask]
personal email: [log in to unmask]
jabber: xmpp:[log in to unmask]