Hi Gonçalo,
> I'm still supporting an lcg-CE dedicated for Auger VO
How do they submit jobs to that CE? Not via the WMS, I presume?
In that case, are they using their own Condor-G installation?
If not, their workflow quite likely has a scalability problem.
> The machine presents a very bad performance, and as far as I've seen, it is
> because there are a lot of processes like:
>
> |-globus-job-mana,30974 -conf /opt/globus/etc/globus-job-manager.conf -type
> lcgsge -rdn jobmanager-lcgsge -machine-type unknown -publish-jobs
> | `-globus-job-mana,11832 -m lcgsge -f /tmp/gram_cache_cleanupHlsNAB -c
> cache_cleanup
>
> # ps xuawww | grep "globus-job-manager -conf
> /opt/globus/etc/globus-job-manager.conf -type lcgsge -rdn jobmanager-lcgsge
> -machine-type unknown -publish-jobs" | wc -l
> 1371
QED.
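To see which pool accounts own those job managers, a per-user count may help. The following is only a minimal Python sketch: the function name count_job_managers and the "user:32,args" column layout passed to ps are my assumptions (the width suffix needs a procps-style ps), and the match string simply mirrors the grep pattern quoted above.

```python
import subprocess
from collections import Counter


def count_job_managers(ps_lines, needle="globus-job-manager -conf"):
    """Count matching processes per user.

    Each line is expected to look like "<user> <full command line>".
    The needle mirrors the grep pattern from the report (an assumption).
    """
    counts = Counter()
    for line in ps_lines:
        parts = line.split(None, 1)  # split into user and command line
        if len(parts) == 2 and needle in parts[1]:
            counts[parts[0]] += 1
    return counts


def main():
    # "user:32" asks ps for a wide user column so that long pool account
    # names are not truncated (assumes a procps-style ps).
    out = subprocess.run(
        ["ps", "-eo", "user:32,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = out.splitlines()[1:]  # drop the header line
    for user, n in count_job_managers(lines).most_common():
        print(f"{n:6d} {user}")
```

Calling main() on the CE prints one line per account, busiest first. If nearly all job managers belong to a handful of augerprd accounts, that would point at the submission method rather than at the CE itself.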
> Those processes are reading from and/or writing to /tmp, and this is
> the cause of a huge I/O wait because there is a huge number of files there:
>
> # ls /tmp/ | wc -l
> 219334
>
> # time ls /tmp
> [...]
> real 1m16.023s
> user 0m4.510s
> sys 0m2.650s
>
>
> Most of the files there are proxy files like:
>
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:33 x509up_p29181.fileX0mXTi.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:34 x509up_p31464.fileOhpaer.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:35 x509up_p1119.filerILwwE.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:36 x509up_p3812.filekRSuZw.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:37 x509up_p5592.fileLNp4Ca.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:38 x509up_p8286.filenyK0BN.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:39 x509up_p10240.filelvkesl.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:42 x509up_p14560.fileDFg26K.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:43 x509up_p15658.filegSSzAz.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:44 x509up_p19384.file8honnB.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:45 x509up_p20809.file6uANvq.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:46 x509up_p23031.fileFAfCd9.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:47 x509up_p24797.filesZ0FdJ.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:48 x509up_p26241.file7kkaNq.1
> -rw------- 1 augerprd029 augerprd 9670 May 8 13:49 x509up_p27685.fileBWeeQf.1
> -rw------- 1 augerprd029 augerprd 9666 May 8 13:52 x509up_p31281.filetMYU9p.1
>
> # grep x509 lala | wc -l
> 191695
>
> The problem is that these are not old files. The oldest ones are from May 8th:
>
> # openssl x509 -text -noout -in /tmp/x509up_p29181.fileX0mXTi.1
> [...]
> Not Before: May 8 12:28:40 2012 GMT
> Not After : May 8 22:00:23 2012 GMT
>
>
> I do not understand why the middleware has not deleted them yet, or whether
> this is a problem in the Auger submission chain.
They probably need to use Condor-G one way or another; I do not know of
another way to avoid scalability problems with the LCG-CE (i.e. GRAM-2).
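In the meantime, it may help to find out how many of the proxy files in /tmp are actually stale. Below is a minimal Python sketch that only lists candidates and deletes nothing, since running job managers may still hold some of these files. The function name stale_proxies and the 24-hour threshold are my assumptions. The x509up_p prefix comes from the listing above, and age is judged by file modification time, not by the certificate's "Not After" date.

```python
import os
import time


def stale_proxies(directory="/tmp", prefix="x509up_p",
                  min_age_hours=24.0, now=None):
    """Return (name, mtime) pairs for proxy-like files older than the cutoff.

    Nothing is deleted here; the result is meant for inspection only.
    """
    now = time.time() if now is None else now
    cutoff = now - min_age_hours * 3600
    found = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if not entry.name.startswith(prefix):
                continue
            try:
                if not entry.is_file(follow_symlinks=False):
                    continue
                mtime = entry.stat(follow_symlinks=False).st_mtime
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
            if mtime < cutoff:
                found.append((entry.name, mtime))
    return sorted(found, key=lambda item: item[1])  # oldest first


def main():
    for name, mtime in stale_proxies():
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(mtime))
        print(stamp, name)
```

os.scandir reads the directory incrementally and does not sort it, so this should cope with a very large /tmp better than a plain ls does.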