Dear All,
We are also facing the "SAM tests fail randomly - Unspecified gridmanager error" problem, but our observations are slightly different.
Once a SAM test job arrives at our site, it is allocated to the reserved node properly. Its progress can be followed by watching the gram_job_mgr_*.log files in the home directory of "sgmops01". At first, the log file shows the job running properly and finishing without problems. However, before the job has finished, another log file is opened and it keeps logging more data. It contains messages like:
1/7 13:26:06 JMI: poll_fast: ******** Failed to find https://ce.pakgrid.org.pk/20235/1199694163/
Therefore, we get more than one status for the same job:
1) both successful
2) one successful, the other failed
3) both failed
This can be seen on the SAM site:
https://lcg-sam.cern.ch:8443/sam/sam.py?funct=ShowHistory&sensors=CE&vo=ops&nodename=CE.pakgrid.org.pk
In the end, only one log file is left in the home directory: the one that was created later in the sequence.
From the log file contents, it seems that a job with a job ID such as https://ce.pakgrid.org.pk/20235/1199694163/ is still being watched even after it has finished.
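To see how often this happens, something like the following could count the "Failed to find" messages per job ID across the log files. It is only a rough sketch: the function names are made up for illustration, and it assumes that every such message has the same form as the sample line quoted above.

```python
import glob
import os
import re
from collections import Counter

# Assumption: failed polls are logged as "... Failed to find <job ID URL>",
# matching the sample line quoted in this message.
FAILED_LOOKUP = re.compile(r"Failed to find (\S+)")


def count_failed_lookups(lines):
    """Count 'Failed to find' messages per job ID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOOKUP.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def scan_log_dir(log_dir):
    """Sum the counts over every gram_job_mgr_*.log file found in log_dir."""
    totals = Counter()
    pattern = os.path.join(log_dir, "gram_job_mgr_*.log")
    for path in sorted(glob.glob(pattern)):
        # errors="replace" keeps the scan going if a log has odd bytes in it.
        with open(path, errors="replace") as handle:
            totals.update(count_failed_lookups(handle))
    return totals
```

Calling scan_log_dir on the home directory of "sgmops01" (the exact path depends on the site) returns a Counter keyed by job ID, so jobs that are still being polled after they finished show up with a non-zero count.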
Any help in fixing this problem would be appreciated.
Note: The problem started after reconfiguring the node with:
/opt/glite/yaim/bin/yaim -c -s /opt/glite/yaim/examples/siteinfo/site-info.def -n lcg-CE -n TORQUE_server
Cheers,
Asif Osman