Asif Osman wrote:
> Dear All,
>
> We are also facing the "SAM tests fail randomly - Unspecified gridmanager error" problem, but our observations are slightly different.
>
> Once a SAM test job arrives at our site, it is allocated to the reserved node properly. Its progress can be watched by observing the gram_job_mgr_*.log files in the home dir of "sgmops01". In the beginning, the contents of the log file show that the job is running properly and finishes w/o problem. But before the job has finished, another log file is opened, and it keeps logging more data. It contains messages like:
>
> 1/7 13:26:06 JMI: poll_fast: ******** Failed to find https://ce.pakgrid.org.pk/20235/1199694163/
That other file is for the grid_monitor process running on the CE itself:
it is started by the RB or WMS to monitor the real jobs.
The error message you quoted is normal.
> Therefore, we get more than one status for the same job:
> 1) both successful
> 2) one successful, the other failed
> 3) both failed
>
> This can be observed on the SAM site:
> https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=CE&vo=ops&nodename=CE.pakgrid.org.pk
>
> At the end, only one log file is left in the home dir: the one that was created later in the sequence.
>
> From the log file contents, it seems that a job with an ID such as https://ce.pakgrid.org.pk/20235/1199694163/ is still being watched even after it has finished.
>
> Help is required to fix this problem.
>
> Note: The problem started after reconfiguring the node with:
> /opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/examples/siteinfo/site-info.def -n lcg-CE -n TORQUE_server
I successfully ran jobs on your CE both as "ops008" and "sgmops05",
so it appears there is some problem with the "sgmops01" account.
I will look further into the matter.