Hi Vega,

> 
> I can confirm that those logs are still growing since this morning 
> (timeframe = 11:25-16:45; biggest logfile = 2.2 MB)... I hope they will stop 
> (under check ;-) )!

I don't think they will stop growing, since the logs are generated by the 
globus-job-manager processes responsible for those condor processes.
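If you want to keep an eye on them, here is a minimal shell sketch; the 
gram_job_mgr_*.log pattern and the location (the submitting user's home, 
e.g. /home/eo002) are assumptions based on this thread:

```shell
# Minimal sketch: report the byte size of each gram job-manager log in a
# directory. Pattern and location are assumptions from this thread; pass
# the directory to inspect as the first argument.
log_sizes() {
    dir=$1
    for f in "$dir"/gram_job_mgr_*.log; do
        [ -e "$f" ] || continue     # skip if the glob matched nothing
        printf '%s %s\n' "$(wc -c < "$f" | tr -d ' ')" "$f"
    done
}

# Example: log_sizes /home/eo002
```

Running it twice a few minutes apart will tell you whether the files are 
still growing.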

> At the moment the processes running as eo002 (the local user who ran the 
> job) are:
> 
> (ps -efl --forest)
> 0 S eo002    14086     1  0  76   0  -  1476 schedu 11:25 ?  00:00:04 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> 0 S eo002    14199     1  0  75   0  -  1552 schedu 11:25 ?  00:00:02 /opt/condor-c/sbin/condor_master -f -r 680
> 0 S eo002    14231 14199  0  75   0  -  1902 schedu 11:25 ?  00:00:02  \_ condor_schedd -f -n 5670f86976d594674aa5ef1c9bc2b3[log in to unmask]
> 0 S eo002    13403     1  0  75   0  -  1398 schedu 16:31 ?  00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
> 0 S eo002    13422     1  0  75   0  -  1073 schedu 16:31 ?  00:00:00 perl /home/eo002/.globus/.gass_cache/local/md5/00/0bedd51244c38d2825c37fec701443/md5/8d/618299719439f70dfc2258222583a4/data --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-e
> 0 S eo002    13424 13422  0  75   0  -  1929 schedu 16:31 ?  00:00:00  \_ perl /tmp/grid_manager_monitor_agent.eo002.13422.1000 --delete-self --maxtime=3600s
> 
> Summarizing:         2 condor
>                 2 jobmanager
>                 2 perl
> ==========================
>                 6 processes for doing nothing... I wonder what happens 
> on clusters heavily accessed by different people

There will definitely be problems. There are parameter settings in the 
condor configuration on the WMS that stop it from submitting jobs to CEs 
that have already reached the maximum number of globus-job-manager 
processes.
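For reference, a sketch of what such settings look like; the macro names 
below are the Condor-G gridmanager limits, and the values are purely 
illustrative, not the actual WMS defaults:

```
# Condor-G gridmanager limits (illustrative values, not the WMS defaults)
GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE = 10
GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE = 100
```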

> I still don't know if it's the normal procedure, a bug, or I simply 
> missed something, but in the first case (normal procedure): why waste 
> resources when they can be shut down and restarted when/if needed? What 
> benefits does this approach provide?
> 

This is a good question. You can submit a bug and choose the category 
WMS (gLite).

> 
> P.S. about times: just to be sure, a simple Hello_World.jdl takes 
> 4-6 minutes to be submitted to a WMS, run on the CE and 
> return output in a clean situation (no load on machines and/or 
> network)... is it normal? Is there any way to speed up the job?

In my case it is slow too, but not as slow as yours. However, it is also 
slow with the LCG RB and CE.
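To put numbers on each step, a crude wall-clock timer can be wrapped 
around the commands; glite-job-submit is the real CLI from your test, but 
the wrapper itself is just a generic sketch:

```shell
# Crude wall-clock timer: runs any command, discards its output, and
# prints the elapsed whole seconds.
elapsed() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Example (command from the thread):
#   elapsed glite-job-submit hello_no_target.jdl
```

Timing submit, status polling, and output retrieval separately would show 
where the 4-6 minutes actually go.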

Cheers,

Di

> Cheers
> Vega
> 
> 
> 
> Di Qing <[log in to unmask]>
> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
> 18/08/2006 16:14
> Please respond to: LHC Computer Grid - Rollout <[log in to unmask]>
> To: [log in to unmask]
> Subject: Re: [LCG-ROLLOUT] - gLiteCE - questions
> 
> Hi Vega,
> 
>  > ok, it's clear why condor processes start, but:
>  >  >> After the users' jobs finished, these process will not exit.
>  >
>  > ...can you explain this point better, please? Why should processes be
>  > kept running even after job completion? At the moment the only fact I
> 
> The CE and the resources behind it act somewhat like a pool of condor
> resources; your next jobs will be submitted to the same condor
> processes. I am not sure whether they will keep running forever if you
> don't touch them; JRA1 would need to confirm.
> 
>  > notice is that I have log files growing without control in the user's
>  > home... checking those log files I found that it is polling a nonexistent
>  > pbs job; well... it's the same entry as for the old LCG CE... in that case
>  > I'm sure it refers to pbs jobs, while here I'm not: it polls a pbs job
>  > that has never existed and simply continues...
> 
> These log files should come from the globus-job-manager which launches
> these condor processes.
> 
>  > Does it means that X different users will leave 2 jobmanager processes
>  > (with their condor "children") PER job PER user?
> 
> Currently there are 2 condor processes left per user.
> 
> Cheers,
> 
> Di
> 
> 
>  > Thanks and cheers
>  >
>  > Vega Forneris
>  >
>  > +-----------------------------------------------+
>  > ESA-ESRIN
>  > Unix Systems Administrator
>  > Via Galileo Galilei
>  > 00044 Frascati (Rm) - Italy
>  > Phone +39 06 94180581
>  > Mailto: [log in to unmask]
>  > +-----------------------------------------------+
>  > Vitrociset S.p.A.
>  > Unix System Administrator
>  > Via Tiburtina 1020
>  > 00100 Roma - Italy
>  > Phone +39 06 8820 4297    
>  > Mailto: [log in to unmask]
>  > +-----------------------------------------------+
>  >
>  > "I do not feel obliged to believe that the same God who has endowed us
>  > with sense, reason, and intellect has intended us to forgo their use."
>  > (Galileo Galilei)
>  >
>  >
>  >
>  > Di Qing <[log in to unmask]>
>  > Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>  > 18/08/2006 14:35
>  > Please respond to: LHC Computer Grid - Rollout <[log in to unmask]>
>  > To: [log in to unmask]
>  > Subject: Re: [LCG-ROLLOUT] - gLiteCE - questions
>  >
>  > Vega Forneris wrote:
>  >  >
>  >  > Hi *,
>  >  >
>  >  > I've just set up a little GRID for testing purposes (NOTE: this implies
>  >  > that all machines involved are not used by anyone else = low CPU load
>  >  > and IP traffic), but I still have a couple of questions about gLiteCE +
>  >  > gLiteWMS:
>  >  >
>  >  > TEST:
>  >  > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
>  >  > Executable = "/bin/echo";
>  >  > Arguments = "Hello World";
>  >  > StdOutput = "message.txt";
>  >  > StdError = "stderror";
>  >  > OutputSandbox = {"message.txt","stderror"};
>  >  >
>  >  > "glite-job-list-match" and "glite-job-submit" work perfectly and the
>  >  > job is successfully submitted, but this simple job takes 4-6 minutes
>  >  > for retrieving the output... is it normal (first question!!!)? It's a
>  >  > very long time considering the job complexity (=NULL!), the distances
>  >  > (the systems are really close to each other) and the servers' work-load...
>  >  >
>  >  > Here the output of the command " glite-job-status -v 3 "
>  >  > - stateEnterTimes =  
>  >  >       Submitted        : Fri Aug 18 11:24:27 2006 CEST
>  >  >       Waiting          : Fri Aug 18 11:24:15 2006 CEST
>  >  >       Ready            : Fri Aug 18 11:24:16 2006 CEST
>  >  >       Scheduled        :                ---
>  >
>  > There is no Scheduled event for the gLite CE to update, since the
>  > submission mechanism changed as explained below.
>  >
>  >  >       Running          : Fri Aug 18 11:27:08 2006 CEST
>  >  >       Done             : Fri Aug 18 11:30:00 2006 CEST
>  >  >       Cleared          :                ---
>  >  >       Aborted          :                ---
>  >  >       Cancelled        :                ---
>  >  >       Unknown          :                ---
>  >  >
>  >  >
>  >  > while the pbs logs show that the cluster processed this job in less
>  >  > than 7 seconds after it reached the pbs_server, and the authentication
>  >  > process took one second!
>  >  >
>  >  > ...the real problem is that even after retrieving the output, there are
>  >  > still jobs running on the CE owned by the local user (in this case:
>  >  > eo002)
>  >  >
>  >  > eo002    14086     1  0 11:25 ?  00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>  >  > eo002    14199     1  0 11:25 ?  00:00:00 /opt/condor-c/sbin/condor_master -f -r 680
>  >  > eo002    14231 14199  0 11:25 ?  00:00:00 condor_schedd -f -n [log in to unmask]
>  >  > eo002    14274     1  0 11:25 ?  00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork -machine-type unknown -publish-jobs
>  >  > eo002    14306     1  0 11:25 ?  00:00:00 perl /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119.
>  >  > eo002    14308 14306  0 11:25 ?  00:00:00 perl /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self --maxtime=3600s
>  >
>  > The gLite CE is quite different from the LCG CE. There are no job
>  > managers for the batch system, so when a new user's job comes through
>  > the WMS, two condor processes are launched through the fork job
>  > manager, as you saw above. Users' jobs are then actually submitted to
>  > the condor on the CE from the condor on the WMS. After the users' jobs
>  > finish, these processes do not exit. JRA1 is planning to move the
>  > condor processes from being user-based to VO-based.
>  >
>  > Di
>  >
>  >  >
>  >  > and 2 gram_job log files grow continuously in the user's home....
>  >  >
>  >  > Crosschecking the "daemon_unique_name"
>  >  > (=5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I found
>  >  > that the following text is continuously appended to the
>  >  > condorc-advertiser.813.2385.out log:
>  >  >
>  >  > Fri Aug 18 11:47:18 2006 Advertising
>  >  > "[log in to unmask]"
>  >  > Fri Aug 18 11:47:48 2006 #################
>  >  > Fri Aug 18 11:47:48 2006 Rewriting
>  >  > "[log in to unmask]"
>  >  > Fri Aug 18 11:47:48 2006     Setting requirements true
>  >  > MyType = "Machine"
>  >  > TargetType = "Job"
>  >  > Activity = "Idle"
>  >  > Arch = "CondorC"
>  >  > CONDORC_WANTJOB = TRUE
>  >  > CondorCAd = 1
>  >  > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
>  >  > CondorVersion = "$CondorVersion: 6.7.10 Aug  3 2005 $"
>  >  > DaemonStartTime = 1155893119
>  >  > Machine = "grid-eo-engine03.esrin.esa.int"
>  >  > MaxJobsRunning = 200
>  >  > MonitorSelfAge = 1200
>  >  > MonitorSelfCPUUsage = 0.012500
>  >  > MonitorSelfImageSize = 7612.000000
>  >  > MonitorSelfResidentSetSize = 4620
>  >  > MonitorSelfTime = 1155894319
>  >  > MyAddress = "<193.204.231.32:22864>"
>  >  > Name = 
> "[log in to unmask]"
>  >  > NumUsers = 0
>  >  > OpSys = "CondorC"
>  >  > Requirements = TRUE
>  >  > START = TRUE
>  >  > ServerTime = 1155894460
>  >  > SiteName = "grid-eo-engine03.esrin.esa.int"
>  >  > StartdIpAddr = "<193.204.231.32:22864>"
>  >  > State = "Unclaimed"
>  >  > TotalFlockedJobs = 0
>  >  > TotalHeldJobs = 0
>  >  > TotalIdleJobs = 0
>  >  > TotalJobAds = 0
>  >  > TotalRemovedJobs = 0
>  >  > TotalRunningJobs = 0
>  >  > UpdateSequenceNumber = 1155894468
>  >  > VirtualMemory = 0
>  >  > WantAdRevaluate = True
>  >  > WantResAd = TRUE
>  >  > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
>  >  >
>  >  > I suppose that after the output retrieval, jobs should end, shouldn't they?
>  >  >
>  >  > Thanks for the support and cheers
>  >  >
>  >  > Vega Forneris
>  >  >
>  >
>