Hi Jason,
>>At CERN in /opt/globus/lib/perl/Globus/GRAM/Helper.pm we changed:
>>
>>my $FINAL_DONE_LOAD_RANGE = 10;
>>to:
>>my $FINAL_DONE_LOAD_RANGE = 1000;
>>
>>Otherwise jobs will not be cleaned up when the load is high, which could
>>lead to an upward spiral. For example, the RB/WMS or the user may start
>>canceling jobs, which cannot be cleaned up immediately, and new jobs may
>>be sent instead, which also have to wait... Meanwhile the list of jobs
>>gets longer, so it takes more and more time for the jobmanager to loop
>>through them.
>>
>>Maybe this happened for that user?
>>
>>We intend to put this change into the next release of that code. You may
>>want to apply it already.
>
>
>
> Thanks a lot. I also noticed that cms001's lcgjm contains more than 6.2k
> globus-cache-export dirs, which does not match the number we found at the
> batch scheduler.
>
> I have raised the final done load range from the default to 1000 to help
> clean up the job cache. Thanks for the trick. The load has not dropped
> below 10, but that is what the parameter required before. Let's hope this
> cleans up the job cache and reduces the load generated by the job manager.
Note:
- it may take up to 1 hour before the change becomes effective, i.e. until
the current grid_monitor for cms001 is restarted;
- the load will probably be higher for quite a while, because the
cleanup of jobs adds to the load.
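
For reference, the edit quoted above could be applied with something like the
sketch below. It assumes the line in Helper.pm looks exactly as quoted; the
function name and the .bak backup copy are only illustrative, not part of any
Globus tooling.

```shell
# Hypothetical helper: set FINAL_DONE_LOAD_RANGE in a given copy of Helper.pm.
# Assumes the file contains a line of the form
#   my $FINAL_DONE_LOAD_RANGE = <number>;
set_final_done_load_range() {
    file="$1"
    value="$2"
    # Keep a backup of the original file next to it (illustrative choice).
    cp -p "$file" "$file.bak" || return 1
    # Rewrite only the numeric value; [$] matches a literal dollar sign.
    sed 's/^\(my [$]FINAL_DONE_LOAD_RANGE = \)[0-9][0-9]*;/\1'"$value"';/' \
        "$file.bak" > "$file"
}

# Example call, using the path from the quoted message (run with care):
#   set_final_done_load_range /opt/globus/lib/perl/Globus/GRAM/Helper.pm 1000
```

If the line is not found in that form, the file is left unchanged apart from
the backup copy.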