RHUL had a problem with the SW area server on 16/19 and jobs were suspended temporarily.

On Thu, Jun 21, 2012 at 1:34 PM, Alessandra Forti <[log in to unmask]> wrote:
> Hi,
>
> I'm looking at the lost heartbeat error in ATLAS, which is cumulatively the
> most dominant error in ATLAS production. Lost heartbeat happens when a job
> loses its connection with the PanDA server. The job contacts the server
> every 30 minutes while running, and every 2 minutes, up to 10 times, at the
> end of the job. If there is no contact for 6 hours the job is said to have
> "lost heartbeat".
>
> ATLAS claims that it is mostly a problem with the sites. It is indeed true
> that it is partly due to site failures, but the reasons ATLAS gives for
> this are not correct and to me smell of superficial debugging, as they
> blame it all on the batch system. From the ADCoS wiki these are the main
> reasons:
>
> 1) Most common reason: the local batch system has killed the job because it
> used more than the accepted resources (CpuTime, WallTime, memory). The
> ATLAS requirements are published in the VOid Card
> <http://operations-portal.egi.eu/vo/view/voname/atlas>. By comparing
> similar jobs on different sites, try to identify which is the problematic
> variable. In this case, the number of failing jobs should be spread over
> time.
>
> This is not correct. PBS sends a SIGTERM (signal 15) before killing the
> job and the pilot catches it; SGE, as far as I know, also sends an alert
> signal, and a grace period for the job is configurable. I expect more
> sophisticated batch systems like LSF can do the same.
>
> 2) Site or batch system is broken: the failing jobs should be spread over
> a period of a few minutes.
>
> 3) CE has lost track of the job.
>
> 2) and 3) can be merged because they are both vague site failures. Also,
> if the batch system daemon (pbs_mom) dies the jobs do not die with it, so
> it is difficult to blame it on the batch system, but I might be wrong.
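To illustrate the point about batch-system kills: a pilot-style wrapper can catch the SIGTERM that PBS sends before the follow-up SIGKILL and report a final status, so a resource-limit kill should end as a reported failure rather than a lost heartbeat. This is a minimal hypothetical sketch, not the actual PanDA pilot code; `report_to_server` is a placeholder.

```python
import signal
import sys

def report_to_server(status):
    # Placeholder: a real pilot would send a final status update to the
    # PanDA server here, so the job is recorded as "killed by the batch
    # system" instead of timing out 6 hours later with a lost heartbeat.
    print("reporting final status: %s" % status)

def handle_sigterm(signum, frame):
    # PBS sends SIGTERM (signal 15) and allows a grace period before
    # following up with SIGKILL, which cannot be caught.
    report_to_server("killed: received SIGTERM from batch system")
    sys.exit(143)  # conventional exit code for SIGTERM (128 + 15)

# Install the handler; from here on the wrapper gets a chance to
# report before the batch system's hard kill arrives.
signal.signal(signal.SIGTERM, handle_sigterm)
```

Only SIGKILL (sent after the grace period expires) is uncatchable, which is why the grace period configurable in SGE and others matters.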
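For reference, the heartbeat bookkeeping described at the top can be sketched as follows. This is a hypothetical simplification of the server-side check, not actual PanDA code: the server records the time of each contact and declares a job "lost heartbeat" once nothing has arrived for 6 hours.

```python
from datetime import datetime, timedelta

# No contact for 6 hours => job declared "lost heartbeat".
HEARTBEAT_TIMEOUT = timedelta(hours=6)

class JobTracker:
    """Hypothetical sketch of server-side heartbeat bookkeeping."""

    def __init__(self):
        self.last_heartbeat = {}  # job id -> time of last contact

    def heartbeat(self, job_id, now):
        # Called when the job contacts the server: every 30 minutes
        # while running, every 2 minutes (up to 10 times) at the end.
        self.last_heartbeat[job_id] = now

    def lost_heartbeat_jobs(self, now):
        # Jobs silent for longer than the timeout are declared lost.
        return [job for job, t in self.last_heartbeat.items()
                if now - t > HEARTBEAT_TIMEOUT]
```

So anything that silences the job without a final report (WN crash, network failure, power cut, or an uncaught kill) eventually surfaces as this error.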
> As I see it, the main site failures that can generate this error are:
>
> a) WN crashing
> b) network failure
> c) power cut
>
> Feel free to add.
>
> This is the UK production for last week; "lost heartbeat" is dominant.
> Other weeks, months and years show more or less the same dominance.
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UK&sitesSort=8&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Daily&generic=1&sortby=0&series=All&activities[]=production
>
> In particular you can look at the sites bar graph:
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=UK&activities=production&sitesSort=8&start=2012-06-14&end=2012-06-21&timeRange=daily&sortBy=0&granularity=Daily&generic=1&series=All&type=pfe
>
> In decreasing order of number of errors, these are the sites that suffered
> most, with the time-stacked bar graphs in hourly bins.
>
> RHUL
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/terminatedjobsstatus_individual?sites=UKI-LT2-RHUL&activities=production&sitesSort=2&start=2012-06-14&end=2012-06-21&timeRange=daily&sortBy=0&granularity=Hourly&generic=1&series=All&type=abcb
>
> RALPP
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-SOUTHGRID-RALPP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> SHEF (power cut)
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-NORTHGRID-SHEF-HEP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> ECDF
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites
> []=UKI-SCOTGRID-ECDF&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> LANCS
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-NORTHGRID-LANCS-HEP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> RAL
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=RAL-LCG2&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> MAN: WN crashing, though I'm not sure all of them can be explained by this.
>
> http://dashb-atlas-job.cern.ch/dashboard/request.py/dailysummary#button=successfailures&sites[]=UKI-NORTHGRID-MAN-HEP&sitesSort=2&start=2012-06-14&end=2012-06-21&timerange=daily&granularity=Hourly&generic=1&sortby=0&series=All&activities[]=production
>
> Others
>
> Can the listed sites in particular explain their spikes? Sheffield, I
> know, had a power cut that can explain their spike, and Manchester had a
> few problems with nodes crashing, which fits the single small spikes,
> although I will pay more attention from now on to which jobs are on
> crashed WNs. RHUL has two big spikes on two different days that look like
> either a power cut or a network failure; RALPP has a more spread-out
> pattern, compatible neither with network glitches nor with power cuts;
> and so on.
>
> Any help is appreciated.
>
> cheers
> alessandra
>
> --
> Facts aren't facts if they come from the wrong people. (Paul Krugman)