Hi Dave,
When the jobs are finished you should check that the outputs are copied
back to the CE (or that there is nothing in the pbs undelivered dir on
the workers). Otherwise, have a look at:
http://grid-deployment.web.cern.ch/grid-deployment/eis/docs/Maradona
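
A minimal sketch of that check, to be run on each worker node. It assumes
the undelivered directory is /var/spool/pbs/undelivered, which depends on
how PBS/Torque is installed at your site, and count_undelivered is only a
hypothetical helper name, not part of any grid tool:

```shell
# Assumed default location; override UNDELIVERED_DIR if your PBS/Torque
# installation keeps undelivered job output somewhere else.
UNDELIVERED_DIR="${UNDELIVERED_DIR:-/var/spool/pbs/undelivered}"

# Print the number of regular files left in the given directory.
# A missing directory is reported as 0 files.
count_undelivered() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        echo 0
        return 0
    fi
    # tr strips the padding that some wc implementations add
    find "$dir" -type f | wc -l | tr -d ' '
}

# Example usage:
#   n=$(count_undelivered "$UNDELIVERED_DIR")
#   [ "$n" -eq 0 ] || echo "$n undelivered output file(s) on $(hostname)"
```

A non-zero count on a worker would suggest that job output could not be
copied back to the CE from that node.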
I also saw the two following errors:
- Got a job held event, reason: Globus error 158: the job manager could
not lock the state lock file
- Got a job held event, reason: Unspecified gridmanager error
Job got an error while in the CondorG queue.
Unfortunately, I am unable to pinpoint what is going wrong.
Yves
On Fri, 8 Jun 2007, David Robson wrote:
> I have just logged on to our CE as dteam001 and submitted a number of
> jobs via qsub. We now have
> a job running on every processor, so I assume the worker nodes are OK.
> Or is there more I
> can test?
>
> Dave
>
> Yves Coppens wrote:
> > Hi Dave,
> >
> > I've looked at the sam tests for your site. I can see a pattern of
> > successes and failures with the infamous "Cannot read JobWrapper output,
> > both from Condor and from Maradona" error. It could then be that there is
> > a problem with one of your workers rather than a problem associated with the
> > user mappings. Have you checked for this?
> >
> > Thanks,
> >
> > Yves
> >
> > On Fri, 8 Jun 2007, David Robson wrote:
> >
> >
> >> Hi Yves,
> >>
> >> Thanks for that. Strangely, our site started passing the SAM CE tests
> >> overnight. I modified
> >> /opt/edg/etc/edg-mkgridmap.conf as you suggested, and now we are back to
> >> failed tests!
> >>
> >> Is there anything else we need to do?
> >>
> >>
> >> Yves Coppens wrote:
> >>
> >>> Hi Dave,
> >>>
> >>>
> >>>
> >>>> Our version of yaim is glite-yaim-3.0.0-34
> >>>>
> >>>>
> >>> Earlier version than I thought, but I think the problem is still the one
> >>> below, except that the file is /opt/edg/etc/lcmaps/gridmapfile and not
> >>> /opt/edg/etc/edg-mkgridmap.conf, sorry for this. I had a look at
> >>> gridmapfile on your CE and it has entries such as .atlasprg and .atlasprd
> >>> which should be replaced by atlasprg and atlasprd, respectively, and for
> >>> all VOs. You can run something like:
> >>>
> >>> perl -p -i.bak -e 's/\.(\w+)sgm/$1sgm/g' gridmapfile
> >>>
> >>> and then for prd again, but I haven't tested the regex.
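
For reference, a single pass covering both the sgm and prd accounts could
look like the sketch below. It swaps the perl one-liner for sed, assumes
the local account name is the last field on each gridmapfile line, and
leaves ordinary pool entries such as .ops untouched. fix_special_accounts
is a hypothetical helper name.

```shell
# Strip the leading dot from account names ending in sgm or prd.
# Reads gridmapfile lines on stdin and writes the result to stdout.
# Note: the whitespace before the account name is normalised to one space.
fix_special_accounts() {
    sed -E 's/[[:space:]]\.([[:alnum:]_]+(sgm|prd))[[:space:]]*$/ \1/'
}

# Example usage (writes a new file instead of editing in place):
#   fix_special_accounts < gridmapfile > gridmapfile.new
#   diff gridmapfile gridmapfile.new
```

Comparing the two files with diff before replacing the original makes it
easy to confirm that only the intended entries changed.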
> >>>
> >>>
> >>>
> >>>> Is this the problem? Should we upgrade? Just yaim or all glite ?
> >>>>
> >>>>
> >>> I would not upgrade now, but try to get your site working first.
> >>>
> >>> Yves
> >>>
> >>>
> >>>
> >>>> Dave
> >>>>
> >>>> Yves Coppens wrote:
> >>>>
> >>>>
> >>>>> Hi David,
> >>>>>
> >>>>> If your version of yaim is glite-yaim-3.0.1-*, then the problem may be in:
> >>>>>
> >>>>> /opt/edg/etc/edg-mkgridmap.conf. If you do not have sgm and prd pool
> >>>>> accounts but single accounts, it should look like:
> >>>>>
> >>>>> "/VO=ops/GROUP=/ops/ROLE=lcgadmin/Capability=NULL" opssgm
> >>>>> "/VO=ops/GROUP=/ops/ROLE=lcgadmin" opssgm
> >>>>> "/VO=ops/GROUP=/ops/ROLE=production/Capability=NULL" opsprd
> >>>>> "/VO=ops/GROUP=/ops/ROLE=production" opsprd
> >>>>> "/VO=ops/GROUP=/ops/Role=NULL/Capability=NULL" .ops
> >>>>> "/VO=ops/GROUP=/ops" .ops
> >>>>>
> >>>>> for the ops VO. If you do have pool accounts then opssgm should be
> >>>>> replaced by .opssgm (which is what yaim does). The same holds for the prd
> >>>>> account and other sgm and prd VO accounts.
> >>>>>
> >>>>> Yves
> >>>>>
> >>>>>
> >>>>> On Thu, 7 Jun 2007, David Robson wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> Yesterday, we were frequently failing the SAM CE tests, although non-OPS
> >>>>>> VO jobs seemed to be OK.
> >>>>>>
> >>>>>> After following the discussions in the news groups, I reached the
> >>>>>> conclusion that the problem was
> >>>>>> due to the lack of sgm and prd accounts. Therefore I added lines of the
> >>>>>> form
> >>>>>>
> >>>>>>
> >>>>>> 50601:opssgm001:1932:ops:ops:sgm:
> >>>>>> 51301:opsprd001:1932:ops:ops:prd:
> >>>>>>
> >>>>>> to users.conf for each VO, and then ran configure_node on the CE and all
> >>>>>> WNs. Now we are failing ALL our SAM CE tests.
> >>>>>> I reversed the change by deleting the new accounts from users.conf and
> >>>>>> running configure_node on the CE and WNs again,
> >>>>>> but we are still failing ALL the tests. I don't see anything wrong in
> >>>>>> the globus-gatekeeper logs, I can su to ops001 and prove
> >>>>>> that ssh between the WN nodes is OK, and I can submit jobs internally
> >>>>>> with qsub.
> >>>>>>
> >>>>>> Any ideas anyone on how to debug this?
> >>>>>>
> >>>>>> Thanks in advance
> >>>>>>
> >>>>>> Dave
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>
> >>>
> >
> >
>