On Sat, 5 Feb 2005, Anar Manafov wrote:

> Dear colleagues,
>
> I have a problem and cannot find a solution for it; I am not even sure
> about its root cause. I have already spent a whole week on it.
>
> One of our (GSI.de) CEs has started to behave strangely. After we
> implemented a new pool account scheme (the standard one is not suitable for
> our infrastructure) and opened the full farm to the CE, I cannot get
> edg-job-submit to work correctly (everything worked perfectly in the test
> environment before).
>
> Some site info:
> GSI
> Problematic CE is “lcg06.gsi.de”
> The batch system is LSF (the CE is NOT the LSF master).
> User home directories are shared between the CE and the WNs.
>
> Generally, when a job is sent to lcg06 I get “Job RetryCount (3) hit”,
> but that is a generic error and could have many causes.
>
> I tried to track down the problem.
>
> The log files, system messages, and gram_job_mgr_XXX.log seem to be quite
> useless in such a situation; I can tell from previous experience of
> tracking job submission problems. If those logs are in fact useful, I would
> appreciate any hint on how to trace job submission with their help,
> especially in critical situations when the job fails at one of the stages.
>
> When I used “edg-job-get-logging-info” with the extended verbosity level
> “-v 2”, I found this message as the reason why the jobs fail:
> “Cannot read JobWrapper output, both from Condor and from Maradona”.
>
> I read some papers about this error and still could not find anything
> applicable to my situation.
>
> I tracked the job down by debugging the JM scripts (I could not find any
> GOOD logging!!!), and I suspect that something is wrong with the job output
> transfer. But what? I don't know!
>
> I checked:
> 1 - The job was successfully delivered to LSF.
> 2 - It executed.
> 3 - The JM is tracking the job status (edg-job-status shows the statuses
> correctly).
> 4 - LSF reports that the job finished successfully.
> 5 - I have checked “gridftp”, which seems to be OK.
>
> Also, I looked at “submit-helper.pl” and could not see the problem there
> either. (BTW, do you perhaps know how to get an error log from that
> script? :) )
>
> Anyway, in the end we get “Cannot read JobWrapper output, both from Condor
> and from Maradona”.
>
> Dear friends, could someone perhaps give me a hint (or a method) I could
> use to track down the problem?
>
> Also, I would be VERY grateful if you could point me to some useful log
> file(s) I could use to track down such problems, because I suspect this
> software has very poor logging. I think logging is a very important part of
> software: a log should tell you that the program got ‘A’ when ‘B’ was
> expected, or something similar. But here I have to manually DEBUG the
> scripts and add extra log statements just to be able to see what is
> happening… I am sure that users are not supposed to read and debug the
> source of the software! Please, any good log would be GREAT!
>
> Please, I would be very grateful for any HINT.

As you found out, many elements in the job submission chain are not easy
to debug, due to a lack of log files or clear error messages.
These things should be improved, and they have been, but it takes
significant effort to make things easy for the admin and/or user.
Furthermore, the current job submission chain is fundamentally broken,
so we do not really want to spend a lot of effort on it, when we expect
to get something better in the next few months (based on Condor instead
of Globus).
Anyway, have you looked at the Wiki entries on job submission problems,
in particular this one:

http://goc.grid.sinica.edu.tw/gocwiki/Cannot_read_JobWrapper_output%2e%2e%2e

The error "Cannot read JobWrapper output, both from Condor and from Maradona"
means that the job exit status failed to be delivered to the RB, even though
two independent methods should have been tried:

1. The job wrapper script writes the user job exit status to stdout,
   which is supposed to be sent back to the RB by Globus.

2. The user job exit status is written into an extra file that is copied
   to the RB with globus-url-copy.
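The two paths can be sketched, very roughly, as the tail of a job wrapper
script. This is a hypothetical illustration only: the file name and the
gsiftp destination are made up, not the actual LCG wrapper internals.

```shell
#!/bin/sh
# Hypothetical sketch of the two exit-status reporting paths at the end
# of a job wrapper.  File name and gsiftp URL are illustrative only.

EXIT_STATUS=0   # example value; really the exit status of the user job

# Path 1: print the status on stdout, which Globus streams back to the RB.
echo "job exit status = ${EXIT_STATUS}"

# Path 2 ("Maradona"): write the same status into a file and push it to
# the RB with globus-url-copy (left commented out, as it needs a real RB):
echo "${EXIT_STATUS}" > maradona.output
# globus-url-copy "file://$PWD/maradona.output" \
#     "gsiftp://<rb-host>/<sandbox-dir>/maradona.output"
```

The point of the redundancy is that either path alone suffices; only when
both fail does the RB give up after exhausting the retry count.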

When *both* methods fail, it usually means that the job did *not* run to
completion!

That means it either did not start at all (batch system submission problem,
WN disk full, time not synchronized between CE and WN, home directory not
writable, ...) or it got killed before it finished, e.g. because it ran
into the wall-clock time limit.
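Several of the "did not start" causes above can be checked directly on a WN
with standard tools. A minimal sketch, where the paths are examples and not
site policy:

```shell
#!/bin/sh
# Minimal pre-flight checks on a worker node; paths are examples only.

# 1. Is the (shared) home directory writable?
if touch "$HOME/.wn_write_test" 2>/dev/null; then
    rm -f "$HOME/.wn_write_test"
    echo "home writable: yes"
else
    echo "home writable: NO"
fi

# 2. How much space is left in the scratch area?
scratch="${TMPDIR:-/tmp}"
avail_kb=$(df -Pk "$scratch" | awk 'NR==2 {print $4}')
echo "free space in $scratch: ${avail_kb} kB"

# 3. Print the local clock (UTC) so it can be compared with the CE clock.
echo "local time: $(date -u '+%Y-%m-%d %H:%M:%S UTC')"
```

Running the same snippet on the CE and on a WN (e.g. via an interactive
batch job) makes clock skew and permission differences stand out quickly.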

Of course it is possible that the job did finish, but then it must mean
that:

1. the WN could not do a globus-url-copy to the RB, *and*

2. Globus could not send back the job wrapper stdout, e.g. because it
   was not copied back from the WN to the CE, or because globus-url-copy
   does not work from the CE to the RB.

This combined set of problems can still have a single cause: there can
be a firewall limiting outgoing connections (to ports 20000-25000),
some CRLs can be out of date both on CE and WN, some CA files could be
absent altogether, the time (zone) on CE and WN can be wrong, ...
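CRL freshness in particular is easy to verify with openssl. A hedged sketch,
assuming the conventional /etc/grid-security/certificates location (which
may differ per site):

```shell
#!/bin/sh
# Report the expiry (nextUpdate) of every CRL (*.r0) in a certificates
# directory; an expired CRL makes GSI authentication fail on that host.
# /etc/grid-security/certificates is the conventional default; adjust it.
check_crls() {
    dir="${1:-/etc/grid-security/certificates}"
    found=0
    for crl in "$dir"/*.r0; do
        [ -f "$crl" ] || continue
        found=1
        next=$(openssl crl -in "$crl" -noout -nextupdate 2>/dev/null)
        echo "$crl: ${next:-unreadable}"
    done
    [ "$found" -eq 1 ] || echo "no CRL (*.r0) files found in $dir"
}

check_crls "$@"
```

Run it on the CE and on a WN: any "nextUpdate" in the past, or missing CA
files, points straight at a GSI authentication failure.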

Note that the other Wiki entries may also provide clues.  For example:

http://goc.grid.sinica.edu.tw/gocwiki/submit-helper_script_%2e%2e%2e_gave_error%3a_cache_export_dir_%2e%2e%2e