Hi Mario,
I've had problems too with my SL4 LCG CE: the bdii keeps dying every hour
after the gatekeeper receives a kill signal (15) and restarts itself.
That will be the subject of another message once I've collected straces etc.
Independently of this, you may wish to look at the read permissions on
several files and directories in /opt/globus. I did a fresh install of a
CE yesterday. You can find my tests and changes below.
Yves
The first problem was with:
[root@epgce2 ~]# ll -d /opt/globus/etc/grid-services
drwx------ 2 root root 4096 May 2 16:37 /opt/globus/etc/grid-services
which had to be world readable. But there was more:
$ globus-job-run epgce2.ph.bham.ac.uk /bin/hostname
GRAM Job submission failed because data transfer to the server failed
(error code 10)
$ globus-job-run epgce2.ph.bham.ac.uk /bin/hostname
GRAM Job submission failed because the job manager is misconfigured, a
scheduler script is missing (error code 105)
Both errors were caused by the wrong permissions on:
[root@epgce2 ~]# ll /opt/globus/lib/perl/Globus/GRAM/JobManager
total 44
-rw------- 1 root root 14671 May 2 19:07 fork.pm
-rw-r--r-- 1 root root 20100 May 2 19:07 lcgpbs.p
[root@epgce2 ~]# ll -d /opt/globus/lib/perl/Globus/GRAM/JobManager
drwxr----- 2 root root 4096 May 2 19:07 /opt/globus/lib/perl/Globus/GRAM/JobManager
which again had to be world readable.
Then, it looked better:
$ globus-job-run epgce2.ph.bham.ac.uk /bin/hostname
epgce2.ph.bham.ac.uk
but the following test failed :(
$ globus-job-run epgce2.ph.bham.ac.uk:2119/jobmanager-lcgpbs /bin/hostname
GRAM Job submission failed because data transfer to the server failed
(error code 10)
The read permission on lcgpbs.rvf was again too restrictive:
[root@epgce2 ~]# ls -l /opt/globus/share/globus_gram_job_manager
total 28
-rw-r--r-- 1 root root 12938 Dec 8 2006 globus-gram-job-manager.rvf
-rw------- 1 root root 989 May 2 19:07 lcgpbs.rvf
Finally, it worked.
$ globus-job-run epgce2.ph.bham.ac.uk:2119/jobmanager-lcgpbs /bin/hostname
s25.esc.bham.ac.uk
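In case it saves someone the trial and error, the permission checks above could be done in one pass with something like the sketch below. The function name list_not_world_readable is my own invention and not part of any LCG or Globus tooling; it only relies on standard find options. It prints every file or directory that "other" cannot read, and every directory that "other" can read but not enter.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with LCG or Globus: report entries
# under the given paths that are not accessible to "other".
list_not_world_readable() {
    for root in "$@"; do
        # Files and directories missing the world-read bit.
        find "$root" ! -perm -o=r -print
        # Directories that are world readable but not world searchable.
        find "$root" -type d -perm -o=r ! -perm -o=x -print
    done
}
```

I would call it on the three locations that bit me:

list_not_world_readable /opt/globus/etc/grid-services /opt/globus/lib/perl/Globus/GRAM/JobManager /opt/globus/share/globus_gram_job_manager

Anything it prints is a candidate for "chmod o+r" (files) or "chmod o+rx" (directories). Whether every reported entry really should be opened up is an assumption to check on your own CE; I only verified the files and directories listed above.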
On Sat, 3 May 2008, Maarten Litmaath wrote:
> Hi Mario,
>
>>>> I've got a question about recent optimizations in lcg-ce (addition of
>>>> globus-gass-cache-marshal and globus-job-manager-marshal) - it has been
>>>> said that configuration files can be found in globus/etc location, but I
>>>> didn't manage to find any documentation about what that configuration
>>>> actually mean or does. Any hints? I suspect that our current cluster
>>>> problem might be related to the configuration of this new piece of
>>>> software.
>>>
>>> What problems? The SAM tests appear to be working fine.
>>
>> Latest example:
>>
>> [link to SAM job submission error page]
>
> The error was this:
>
> Globus error 94: the jobmanager does not accept any new requests
> (shutting down)
>
> Did you check its Wiki page:
>
> http://goc.grid.sinica.edu.tw/gocwiki/Globus_error_94%3A_the_jobmanager_does_not_accept_any_new_requests_%28shutting_down%29
>
> In particular note that the batch system can be in bad shape for _some_
> users, e.g. if the user ran out of disk quota.
>
>> However for 3 weeks we have been struggling with that particular
>> error. It comes and goes as it pleases and sometimes takes the cluster
>> offline for almost days. The CE has been reconfigured now and it seems
>> to help to some extent, but not 100%. Until today evening we were also
>> in the FCR because of it. I guess the late success of SAM has removed
>> us from FCR for now.
>