Hi Shkelzen,
> >> Here I must correct my first thought. We have made some more tests and
> >> it seems indeed now that these jobs are staying long in ready status
> >> (for almost 4 hours) before they get submitted ONLY when they have to
> >> run outside our site.
> >
> > Maybe a network HW or configuration problem close to or at your site,
> > affecting WMS traffic?
>
> We are thinking about a possible issue with a firewall (handling
> connection requests too slowly). So could we somehow "relax" some
> timeout variables on the WMS+LB side? If so, what are the correct
> variables to tweak? We cannot change anything on the firewall itself
> yet.
The relevant parameters would be in /var/local/condor/condor_config.local,
but they already have quite relaxed values (for various reasons),
so I do not think you can gain much there. Documentation is in section
3.3.20 of the Condor manual:
http://www.cs.wisc.edu/condor/manual/v7.0/condor-V7_0_5-Manual.pdf
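To see which values are currently set, something like the following could help. It is only a minimal sketch: it assumes the file uses plain "NAME = value" lines and that the settings of interest carry the GRIDMANAGER_ prefix; the function name is made up for illustration, and continuation lines and macro expansion are not handled.

```python
import re

# Matches simple "NAME = value" assignments. Continuation lines and
# macro expansion are deliberately ignored in this sketch.
ASSIGNMENT = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_.]*)\s*=\s*(.*?)\s*$")


def gridmanager_settings(config_text):
    """Return a dict of the GRIDMANAGER_* assignments found in config_text.

    Hypothetical helper, not part of Condor. If a name is assigned
    more than once, the last assignment wins.
    """
    settings = {}
    for line in config_text.splitlines():
        # Skip comment lines.
        if line.lstrip().startswith("#"):
            continue
        match = ASSIGNMENT.match(line)
        if match and match.group(1).upper().startswith("GRIDMANAGER_"):
            settings[match.group(1)] = match.group(2)
    return settings
```

It could be used by reading /var/local/condor/condor_config.local into a string, passing it to gridmanager_settings() and printing the sorted items. The output of condor_config_val remains the authoritative view, since it also takes the other configuration files into account.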
> > If the state is Ready, the job has been assigned to a CE;
> > the state becomes Scheduled when it has been delivered to that CE.
> >
> > The /var/local/condor/log/GridmanagerLog.glite* logs might provide clues
> > about the observed delays.
>
> It does not seem easy to me to interpret such logs. I think we need your
> help... again ;-)
Many jobs appear to have failed with Globus errors 10 and 22:
http://pages.cs.wisc.edu/~adesmet/status.html
The first one may be due to the network, the CE or the user:
http://goc.grid.sinica.edu.tw/gocwiki/10_data_transfer_to_the_server_failed
The second ("job manager failed to create an internal script argument file")
suggests a problem on the CE.
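To get a quick overview of which error codes dominate, the log lines could be tallied roughly as follows. This is a sketch under a loud assumption: that the codes appear in the logs as text of the form "Globus error <number>". Please check a few real lines first and adjust the pattern; the function name is hypothetical.

```python
import re
from collections import Counter

# ASSUMPTION: error codes show up as "Globus error <number>".
# Adapt this pattern to whatever the actual log lines look like.
GLOBUS_ERROR = re.compile(r"Globus error (\d+)", re.IGNORECASE)


def count_globus_errors(lines, pattern=GLOBUS_ERROR):
    """Count how often each Globus error code occurs in an iterable of lines."""
    counts = Counter()
    for line in lines:
        for code in pattern.findall(line):
            counts[int(code)] += 1
    return counts
```

Feeding it the lines of the GridmanagerLog.glite* files and looking at counts.most_common() would show whether errors 10 and 22 really account for most failures, and whether they cluster on particular CEs if the lines are filtered by CE name beforehand.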
> [...]
> > Indeed, the condor_schedd just died once with a segfault.
>
> So nothing special here?
It looks "normal".