RHUL had a problem with the SW area server on 16/19, and jobs were
temporarily suspended.


On Thu, Jun 21, 2012 at 1:34 PM, Alessandra Forti <[log in to unmask]> wrote:

>  Hi,
>
> I'm looking at the "lost heartbeat" error in ATLAS, which is cumulatively
> the most dominant error in ATLAS production. Lost heartbeat happens when a
> job loses its connection with the PanDA server. The job contacts the server
> every 30 minutes while it runs, and every 2 minutes, up to 10 times, at the
> end of the job. If there is no contact for 6 hours the job is said to have
> "lost heartbeat".
>
> ATLAS claims that it is mostly a problem with the sites. It is indeed true
> that it is partly due to site failures; however, the reasons ATLAS gives
> for this are not correct and, to me, smell of superficial debugging, since
> they blame it all on the batch system. From the ADCoS wiki, these are the
> main reasons given:
>
> 1) Most common reason: the local batch system has killed the job because
> it used more than the accepted resources (CpuTime, WallTime, memory). The
> ATLAS requirements are published in the VOid Card
> <http://operations-portal.egi.eu/vo/view/voname/atlas>. By comparing
> similar jobs on different sites, try to identify which is the problematic
> variable. In this case, the number of failing jobs should be spread over
> time.
>
> This is not correct. PBS sends a SIGTERM (signal 15) before killing the
> job and the pilot catches it; SGE, as far as I know, also sends a warning
> signal, and the grace period for the job is configurable. I expect more
> sophisticated batch systems like LSF can do the same (see the sketch
> after this list of reasons).
>
> 2) Site or batch system is broken: the failing jobs should be spread
> over a period of a few minutes.
>
> 3) CE has lost track of the job.
>
> 2) and 3) can be merged, since they are both vague site failures. Also, if
> the batch system daemon (pbs_mom) dies, the jobs do not die with it, so it
> is difficult to blame this on the batch system, but I might be wrong.
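>
> As an illustration of the SIGTERM point above, this is roughly what a
> pilot-like process could do when the batch system gives a grace period
> (illustrative only, not the real pilot code):
>
>     import signal
>     import sys
>
>     def report_status(state):
>         # placeholder for the call that tells the PanDA server the final state
>         print("reporting state:", state)
>
>     def on_sigterm(signum, frame):
>         # PBS sends SIGTERM first (and SGE/LSF can be configured with a
>         # grace period), so there is a window to flag the job as killed
>         # instead of letting it time out as "lost heartbeat".
>         report_status("killed by batch system")
>         sys.exit(143)  # conventional exit code for SIGTERM
>
>     signal.signal(signal.SIGTERM, on_sigterm)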
>
> As I see it, the main site failures that can generate this error are:
>
> a) WN crashing
> b) network failure
> c) power cut
>
> Feel free to add.
>
> This is the UK production for last week: "lost heartbeat" is dominant.
> Other weeks, months, and years show more or less the same dominance.
>
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UK&sitesSort=8&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Daily&generic=1&sortby=0&series=All&activities[]=production
>
> In particular, you can look at the sites bar graph:
>
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=UK&activities=production&sitesSort=8&start=2012-06-14&end=2012-06-21&timeRange=daily&sortBy=0&granularity=Daily&generic=1&series=All&type=pfe
>
> In decreasing order of the number of errors, these are the sites that
> suffered most, each with a time-stacked bar graph in hourly bins.
>
> RHUL
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=UKI-LT2-RHUL&activities=production&sitesSort=2&start=2012-06-14&end=2012-06-21&timeRange=daily&sortBy=0&granularity=Hourly&generic=1&series=All&type=abcb
>
> RALPP
>
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-SOUTHGRID-RALPP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> SHEF: power cut
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-NORTHGRID-SHEF-HEP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> ECDF
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-SCOTGRID-ECDF&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> LANCS
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-NORTHGRID-LANCS-HEP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> RAL
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=RAL-LCG2&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> MAN: WNs crashing, though I am not sure all of the failures can be
> explained by this.
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-NORTHGRID-MAN-HEP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> Others
>
> Can the listed sites explain their spikes? I know Sheffield had a power
> cut that can explain their spike, and Manchester had a few problems with
> nodes crashing, which fits the single small spikes, although I will pay
> more attention from now on to which jobs are on crashed WNs. RHUL has two
> big spikes on two different days that look like either a power cut or a
> network failure; RALPP has a more spread-out pattern, compatible with
> neither network glitches nor power cuts; and so on.
>
> Any help is appreciated.
>
> cheers
> alessandra
>
> --
> Facts aren't facts if they come from the wrong people. (Paul Krugman)
>
>
>