I don't know if there is anything ready for this, or whether it will be
able to handle VOMS proxies (it should be able to; otherwise it's a bug).
When users switch VOs, different condor processes should be used.
Di
Jeff Templon wrote:
> Hi,
>
> I wonder what the rationale is for moving to a VO-based watchdog? Will
> it be able to handle the various VOMS groups / roles? What will happen
> to users who switch VOs?
>
> JT
>
> Vega Forneris wrote:
>>
>> Hi Di,
>>
>> I can confirm that such logs have kept growing since this morning
>> (timeframe = 11:25-16:45; largest logfile = 2.2 MB)... I hope they will
>> stop (under check ;-) )!
>>
>> At the moment the processes running as eo002 (the local user who runs
>> the job) are:
>>
>> (ps -efl --forest)
>> 0 S eo002 14086 1 0 76 0 - 1476 schedu 11:25 ?
>> 00:00:04 globus-job-manager -conf /opt/glite/etc/globus-job-man
>> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
>> -publish-jobs
>> 0 S eo002 14199 1 0 75 0 - 1552 schedu 11:25 ?
>> 00:00:02 /opt/condor-c/sbin/condor_master -f -r 680
>> 0 S eo002 14231 14199 0 75 0 - 1902 schedu 11:25 ?
>> 00:00:02 \_ condor_schedd -f -n 5670f86976d594674aa5ef1c9bc2b3
>> [log in to unmask]
>> 0 S eo002 13403 1 0 75 0 - 1398 schedu 16:31 ?
>> 00:00:00 globus-job-manager -conf /opt/glite/etc/globus-job-man
>> ager.conf -type fork -rdn jobmanager-fork -machine-type unknown
>> -publish-jobs
>> 0 S eo002 13422 1 0 75 0 - 1073 schedu 16:31 ?
>> 00:00:00 perl /home/eo002/.globus/.gass_cache/local/md5/00/0bed
>> d51244c38d2825c37fec701443/md5/8d/618299719439f70dfc2258222583a4/data
>> --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor
>> _g_scratch.0x8506178.1770/grid-monitor-job-status.grid-e
>> 0 S eo002 13424 13422 0 75 0 - 1929 schedu 16:31 ?
>> 00:00:00 \_ perl /tmp/grid_manager_monitor_agent.eo002.13422.1
>> 000 --delete-self --maxtime=3600s
>>
>> Summarizing: 2 condor
>>              2 jobmanager
>>              2 perl
>> ==========================
>> 6 processes doing nothing... I wonder what happens
>> on clusters heavily accessed by different people
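To see how this scales with the number of users, the leftover daemons can be tallied per pool account from the `ps` output. A rough Python sketch (the condensed sample lines and daemon names are taken from the listing above; the exact field layout is an assumption):

```python
import re
from collections import Counter

# Sample lines condensed from the `ps -efl --forest` output above
# (user, pid, ppid, command); real `ps` output has more columns.
PS_LINES = """\
eo002 14086     1 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf
eo002 14199     1 /opt/condor-c/sbin/condor_master -f -r 680
eo002 14231 14199 condor_schedd -f -n 5670f86976d594674aa5ef1c9bc2b3
eo002 13403     1 globus-job-manager -conf /opt/glite/etc/globus-job-manager.conf
eo002 13422     1 perl /home/eo002/.globus/.gass_cache/.../data
eo002 13424 13422 perl /tmp/grid_manager_monitor_agent.eo002.13422.1000
""".splitlines()

# Daemon names seen in the listing above.
DAEMONS = ("globus-job-manager", "condor_master", "condor_schedd", "perl")

def tally(lines):
    """Count leftover CE daemons per (user, daemon-name) pair."""
    counts = Counter()
    for line in lines:
        user, _pid, _ppid, cmd = line.split(None, 3)
        for name in DAEMONS:
            # Match the daemon name at the start of the command or
            # after a path separator, so config-file arguments don't count.
            if re.search(r"(^|/)" + re.escape(name) + r"\b", cmd):
                counts[(user, name)] += 1
                break
    return counts
```

With these sample lines, `tally` confirms the count above: six processes for one user, so with N pool accounts active you would expect roughly 6×N idle daemons.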
>>
>> I still don't know whether this is the normal procedure, a bug, or I
>> simply missed something, but in the first case (normal procedure): why
>> waste resources when the processes could be shut down and restarted
>> when/if needed? What benefits does this approach provide?
>>
>>
>> P.S. about times: just to be sure, a simple Hello_World.jdl takes
>> 4-6 minutes to be submitted to a WMS, run on the CE and
>> return its output in a clean situation (no load on the machines and/or
>> network)... is that normal? Is there any way to speed up the job?
>>
>> Cheers
>> Vega
>>
>>
>>
>> Di Qing <[log in to unmask]>
>> Sent by: LHC Computer Grid - Rollout <[log in to unmask]>
>>
>> 18/08/2006 16:14
>> Please respond to
>> LHC Computer Grid - Rollout <[log in to unmask]>
>>
>>
>>
>> To
>> [log in to unmask]
>> cc
>>
>> Subject
>> Re: [LCG-ROLLOUT] - gLiteCE - questions
>>
>>
>> Hi Vega,
>>
>> > ok, it's clear why condor processes start, but:
>> > >> After the users' jobs finished, these processes will not exit.
>> >
>> > ...can you explain this point better, please? Why should the processes
>> > be kept running even after job completion? At the moment the only fact I
>>
>> The CE and the resources behind it are somewhat like Condor
>> resources; your next jobs will be submitted to the same condor
>> processes. I am not sure whether they will keep running forever if you
>> don't touch them; JRA1 would need to confirm.
>>
>> > notice is that I have log files growing without control in the user's
>> > home... checking those log files I found that it is polling a
>> > nonexistent pbs job; well... it's the same entry as for the old LCG
>> > CE... in that case I'm sure it refers to pbs jobs, while here I'm not:
>> > it polls a pbs job which has never existed and simply continues...
>>
>> These log files should come from the globus-job-manager which launches
>> these condor processes.
>>
>> > Does it mean that X different users will leave 2 jobmanager processes
>> > (with their condor "children") PER job PER user?
>>
>> Currently there are 2 condor processes left per user.
>>
>> Cheers,
>>
>> Di
>>
>>
>> > Thanks and cheers
>> >
>> > Vega Forneris
>> >
>> > +-----------------------------------------------+
>> > ESA-ESRIN
>> > Unix Systems Administrator
>> > Via Galileo Galilei
>> > 00044 Frascati (Rm) - Italy
>> > Phone +39 06 94180581
>> > Mailto: [log in to unmask]
>> > +-----------------------------------------------+
>> > Vitrociset S.p.A.
>> > Unix System Administrator
>> > Via Tiburtina 1020
>> > 00100 Roma - Italy
>> > Phone +39 06 8820 4297
>> > Mailto: [log in to unmask]
>> > +-----------------------------------------------+
>> >
>> > "I do not feel obliged to believe that the same God who has endowed us
>> > with sense, reason, and intellect has intended us to forgo their use."
>> > (Galileo Galilei)
>> >
>> >
>> >
>> > Di Qing <[log in to unmask]>
>> > Sent by: LHC Computer Grid - Rollout
>> <[log in to unmask]>
>> >
>> > 18/08/2006 14:35
>> > Please respond to
>> > LHC Computer Grid - Rollout <[log in to unmask]>
>> >
>> >
>> > To
>> > [log in to unmask]
>> > cc
>> > Subject
>> > Re: [LCG-ROLLOUT] - gLiteCE - questions
>> >
>> >
>> > Vega Forneris wrote:
>> > >
>> > > Hi *,
>> > >
>> > > I've just set up a little GRID for testing purposes (NOTE: this
>> > > implies that all the machines involved are not used by anyone else =
>> > > low CPU load and IP traffic), but I still have a couple of questions
>> > > about gLiteCE + gLiteWMS:
>> > >
>> > > TEST:
>> > > [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
>> > > Executable = "/bin/echo";
>> > > Arguments = "Hello World";
>> > > StdOutput = "message.txt";
>> > > StdError = "stderror";
>> > > OutputSandbox = {"message.txt","stderror"};
>> > >
>> > > "glite-job-list-match" and "glite-job-submit" work perfectly and the
>> > > job is successfully submitted, but this simple job takes 4-6 minutes
>> > > to return its output... is that normal (first question!!!)? It's a
>> > > very long time considering the job complexity (= NULL!), the
>> > > distances (the systems are really close to each other) and the
>> > > servers' workload...
>> > >
>> > > Here is the output of the command "glite-job-status -v 3":
>> > > - stateEnterTimes =
>> > > Submitted : Fri Aug 18 11:24:27 2006 CEST
>> > > Waiting : Fri Aug 18 11:24:15 2006 CEST
>> > > Ready : Fri Aug 18 11:24:16 2006 CEST
>> > > Scheduled : ---
>> >
>> > There is no Scheduled event for the gLite CE to update, since the
>> > submission mechanism changed as explained below.
>> >
>> > > Running : Fri Aug 18 11:27:08 2006 CEST
>> > > Done : Fri Aug 18 11:30:00 2006 CEST
>> > > Cleared : ---
>> > > Aborted : ---
>> > > Cancelled : ---
>> > > Unknown : ---
>> > >
>> > >
>> > > while the pbs logs show that the cluster processed this job in less
>> > > than 7 seconds after it reached the pbs_server, and the
>> > > authentication took one second!
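The stateEnterTimes above can be turned into per-stage latencies to see where the 4-6 minutes actually go. A small sketch (timestamps copied from the status output above; the time zone suffix is dropped for simplicity):

```python
from datetime import datetime

# Timestamps copied from the glite-job-status output above (all CEST).
events = {
    "Submitted": "Fri Aug 18 11:24:27 2006",
    "Running":   "Fri Aug 18 11:27:08 2006",
    "Done":      "Fri Aug 18 11:30:00 2006",
}

def parse(ts):
    """Parse the ctime-style timestamp used in the status output."""
    return datetime.strptime(ts, "%a %b %d %H:%M:%S %Y")

# Time spent before the job started, and while it was nominally running.
queue_delay = (parse(events["Running"]) - parse(events["Submitted"])).total_seconds()
run_time    = (parse(events["Done"]) - parse(events["Running"])).total_seconds()
```

This gives 161 s from Submitted to Running and another 172 s from Running to Done, so nearly all of the turnaround is submission and status-propagation overhead, which matches the pbs logs showing the job itself finished in under 7 seconds.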
>> > >
>> > > ...the real problem is that even after retrieving the output, there
>> > > are still processes running on the CE owned by the local user (in
>> > > this case: eo002)
>> > >
>> > > eo002 14086 1 0 11:25 ? 00:00:00 globus-job-manager -conf
>> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
>> > > -machine-type unknown -publish-jobs
>> > > eo002 14199 1 0 11:25 ? 00:00:00
>> > > /opt/condor-c/sbin/condor_master -f -r 680
>> > > eo002 14231 14199 0 11:25 ? 00:00:00 condor_schedd -f -n
>> > > [log in to unmask]
>> > > eo002 14274 1 0 11:25 ? 00:00:00 globus-job-manager -conf
>> > > /opt/glite/etc/globus-job-manager.conf -type fork -rdn jobmanager-fork
>> > > -machine-type unknown -publish-jobs
>> > > eo002 14306 1 0 11:25 ? 00:00:00 perl
>> > > /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data
>> > > --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119.
>> > >
>> > > eo002 14308 14306 0 11:25 ? 00:00:00 perl
>> > > /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self
>> > > --maxtime=3600s
>> >
>> > The gLite CE is quite different from the LCG CE. There are no job
>> > managers for the batch system, so when a new user's job comes through
>> > the WMS, two condor processes are launched through the fork job
>> > manager, as you saw above. Users' jobs are then actually submitted to
>> > the condor on the CE from the condor on the WMS. After the users' jobs
>> > finish, these processes do not exit. JRA1 is planning to move the
>> > condor processes from user-based to VO-based.
>> >
>> > Di
>> >
>> > >
>> > > and 2 gram_job log files grow continuously in the user's home....
>> > >
>> > > Crosschecking the "daemon_unique_name"
>> > > (= 5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I
>> > > found that the following text is continuously appended to the
>> > > condorc-advertiser.813.2385.out log:
>> > >
>> > > Fri Aug 18 11:47:18 2006 Advertising
>> > > "[log in to unmask]"
>> > > Fri Aug 18 11:47:48 2006 #################
>> > > Fri Aug 18 11:47:48 2006 Rewriting
>> > > "[log in to unmask]"
>> > > Fri Aug 18 11:47:48 2006 Setting requirements true
>> > > MyType = "Machine"
>> > > TargetType = "Job"
>> > > Activity = "Idle"
>> > > Arch = "CondorC"
>> > > CONDORC_WANTJOB = TRUE
>> > > CondorCAd = 1
>> > > CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
>> > > CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
>> > > DaemonStartTime = 1155893119
>> > > Machine = "grid-eo-engine03.esrin.esa.int"
>> > > MaxJobsRunning = 200
>> > > MonitorSelfAge = 1200
>> > > MonitorSelfCPUUsage = 0.012500
>> > > MonitorSelfImageSize = 7612.000000
>> > > MonitorSelfResidentSetSize = 4620
>> > > MonitorSelfTime = 1155894319
>> > > MyAddress = "<193.204.231.32:22864>"
>> > > Name =
>> "[log in to unmask]"
>> > > NumUsers = 0
>> > > OpSys = "CondorC"
>> > > Requirements = TRUE
>> > > START = TRUE
>> > > ServerTime = 1155894460
>> > > SiteName = "grid-eo-engine03.esrin.esa.int"
>> > > StartdIpAddr = "<193.204.231.32:22864>"
>> > > State = "Unclaimed"
>> > > TotalFlockedJobs = 0
>> > > TotalHeldJobs = 0
>> > > TotalIdleJobs = 0
>> > > TotalJobAds = 0
>> > > TotalRemovedJobs = 0
>> > > TotalRunningJobs = 0
>> > > UpdateSequenceNumber = 1155894468
>> > > VirtualMemory = 0
>> > > WantAdRevaluate = True
>> > > WantResAd = TRUE
>> > > daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
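For spot-checking values such as TotalRunningJobs in dumps like this, the attribute lines can be read into a dict. A minimal sketch (the sample is a subset of the ClassAd above; the flat `Name = value` layout is an assumption based on this excerpt):

```python
# Subset of the advertiser-log ClassAd shown above.
SAMPLE = """\
MyType = "Machine"
TargetType = "Job"
MaxJobsRunning = 200
TotalRunningJobs = 0
State = "Unclaimed"
"""

def parse_classad(text):
    """Parse flat `Name = value` lines into a dict, typing simple values."""
    ad = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if value.startswith('"') and value.endswith('"'):
            ad[key] = value[1:-1]          # quoted string attribute
        else:
            try:
                ad[key] = int(value)       # integer attribute
            except ValueError:
                ad[key] = value            # leave as-is (e.g. TRUE, floats)
    return ad
```

Here the interesting check is that TotalRunningJobs, TotalIdleJobs, etc. are all 0 while the daemons keep advertising, i.e. the schedd is idle.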
>> > >
>> > > I suppose that after the output is retrieved, the jobs should end,
>> > > shouldn't they?
>> > >
>> > > Thanks for the support and cheers
>> > >
>> > > Vega Forneris
>> > >
>> >
>>
|