Hi,
On Aug 18, 2006, at 2:53 PM, Antun Balaz wrote:
>>> 2) This one prevents using glite-WMS for long jobs with voms-proxies
>>> that need to be renewed
>>> https://savannah.cern.ch/bugs/?func=detailitem&item_id=19045
>>
>> The proxy can be renewed; only the voms extension part is not renewed
>> in the old WMS (well, it is supposed to be fixed in the new WMS, but
>> we still need to verify). This will not block your jobs, since gridftp
>> on the WMS still uses DN mapping. There will be problems only when
>> your jobs need the voms extension to do authentication on the WNs. So
>> I still suspect the failure of your jobs was caused by other reasons.
>
> I am open to any suggestion you may have; however, the fact is that
> when I use long-lived voms-proxies, there are no problems...
>
The problem is that one should not use long-lived proxies. The
security model is based on the concept that proxies are rather
short-lived, and the process of proxy renewal is used to provide
credentials for long jobs. The idea is to limit the damage that can be
done with a stolen proxy.
What is not clear from the mail exchange is whether there is a problem
with voms-proxy renewal on the gLite-CE or not.
For the gLite-lcg-CE there is, and it is known.
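A minimal sketch of the intended renewal workflow (the MyProxy server
name and VO below are placeholders, and exact option names may differ
between middleware releases):

```shell
# Create a short-lived voms proxy (12 hours) instead of a long-lived one
voms-proxy-init --voms myvo --valid 12:00

# Register a renewable credential with a MyProxy server, so the WMS can
# fetch fresh proxies on the job's behalf while it runs
# (myproxy.example.org is a placeholder for your site's MyProxy server)
myproxy-init -s myproxy.example.org -d -n

# In the JDL, point the job at the same MyProxy server:
#   MyProxyServer = "myproxy.example.org";
```

With this in place the WMS periodically renews the short-lived proxy
from the MyProxy server, so a stolen proxy is only usable for hours
rather than days.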
markus
> And as for the missing Scheduled time bug - I detected it when
> submitting jobs through our glite-WMS to our lcg-CE:
>
> https://savannah.cern.ch/bugs/?func=detailitem&item_id=19074
>
> Regards, Antun
>
>
>
>
>>
>> Di
>>
>>> 3) This one is also detected by you (missing Scheduled time), but
>>> it is not critical
>>> https://savannah.cern.ch/bugs/?func=detailitem&item_id=19074
>>>
>>> Best regards, Antun
>>>
>>>
>>> -----
>>> Antun Balaz
>>> Research Assistant
>>> E-mail: [log in to unmask]
>>> Web: http://scl.phy.bg.ac.yu/
>>>
>>> Phone: +381 11 3160260, Ext. 152
>>> Fax: +381 11 3162190
>>>
>>> Scientific Computing Laboratory
>>> Institute of Physics, Belgrade, Serbia
>>> -----
>>>
>>> ---------- Original Message -----------
>>> From: Vega Forneris <[log in to unmask]>
>>> To: [log in to unmask]
>>> Sent: Fri, 18 Aug 2006 12:09:44 +0200
>>> Subject: [LCG-ROLLOUT] - gLiteCE - questions
>>>
>>>> Hi *,
>>>>
>>>> I've just set up a little GRID for testing purposes (NOTE: this
>>>> implies that all machines involved are not used by anyone else =
>>>> low cpu load and IP traffic), but I still have a couple of
>>>> questions about gLiteCE + gLiteWMS:
>>>>
>>>> TEST:
>>>> [vforneris@grid0008 GLITE]$ cat hello_no_target.jdl
>>>> Executable = "/bin/echo";
>>>> Arguments = "Hello World";
>>>> StdOutput = "message.txt";
>>>> StdError = "stderror";
>>>> OutputSandbox = {"message.txt","stderror"};
>>>>
>>>> "glite-job-list-match" and " glite-job-submit" work perfectly and
>>>> the job is succesfully submitted, but this simple job takes 4-6
>>>> minutes for retrieving the output...is it normal (first
>>>> question!!!)
>>>> ? It's a very long time considering the job complexity (=NULL!),
>>>> the
>>>> distances (the systems are really close each other) and the
>>>> server's
>>>> work-load...
>>>>
>>>> Here is the output of the command "glite-job-status -v 3":
>>>> - stateEnterTimes =
>>>> Submitted : Fri Aug 18 11:24:27 2006 CEST
>>>> Waiting : Fri Aug 18 11:24:15 2006 CEST
>>>> Ready : Fri Aug 18 11:24:16 2006 CEST
>>>> Scheduled : ---
>>>> Running : Fri Aug 18 11:27:08 2006 CEST
>>>> Done : Fri Aug 18 11:30:00 2006 CEST
>>>> Cleared : ---
>>>> Aborted : ---
>>>> Cancelled : ---
>>>> Unknown : ---
>>>>
>>>> while the pbs logs show that the cluster processed this job in
>>>> less than 7 seconds from the moment it reached the pbs_server, and
>>>> that the authentication process took one second!
>>>>
>>>> .......the real problem is that even after retrieving the output,
>>>> there are still jobs running on the CE owned by the local user (in
>>>> this case: eo002)
>>>>
>>>> eo002 14086 1 0 11:25 ? 00:00:00 globus-job-manager
>>>> -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn
>>>> jobmanager-fork -machine-type unknown -publish-jobs
>>>> eo002 14199 1 0 11:25 ? 00:00:00 /opt/condor-
>>>> c/sbin/condor_master -f -r 680
>>>> eo002 14231 14199 0 11:25 ? 00:00:00 condor_schedd -f -n
>>>> [log in to unmask]
>>>> eo002 14274 1 0 11:25 ? 00:00:00 globus-job-manager
>>>> -conf /opt/glite/etc/globus-job-manager.conf -type fork -rdn
>>>> jobmanager-fork -machine-type unknown -publish-jobs
>>>> eo002 14306 1 0 11:25 ? 00:00:00 perl
>>>> /home/eo002/.globus/.gass_cache/local/md5/b4/be02fce8b5474e16cb3f16794d52b6/md5/8d/618299719439f70dfc2258222583a4/data
>>>> --dest-url=https://grid-eo-rb01.esrin.esa.int:20001/tmp/condor_g_scratch.0x8506178.1770/grid-monitor-job-status.grid-eo-engine03.esrin.esa.int:2119.
>>>> eo002 14308 14306 0 11:25 ? 00:00:00 perl
>>>> /tmp/grid_manager_monitor_agent.eo002.14306.1000 --delete-self --maxtime=3600s
>>>> and 2 gram_job log files grow continuously in the user's home....
>>>>
>>>> Crosschecking the "daemon_unique_name"
>>>> (=5670f86976d594674aa5ef1c9bc2b3b2) in the WMS's tmp folder, I
>>>> found that the following text is continuously appended to the
>>>> condorc-advertiser.813.2385.out log:
>>>>
>>>> Fri Aug 18 11:47:18 2006 Advertising
>>>> "[log in to unmask]"
>>>> Fri Aug 18 11:47:48 2006 #################
>>>> Fri Aug 18 11:47:48 2006 Rewriting
>>>> "[log in to unmask]"
>>>> Fri Aug 18 11:47:48 2006 Setting requirements true
>>>> MyType = "Machine"
>>>> TargetType = "Job"
>>>> Activity = "Idle"
>>>> Arch = "CondorC"
>>>> CONDORC_WANTJOB = TRUE
>>>> CondorCAd = 1
>>>> CondorPlatform = "$CondorPlatform: I386-LINUX_RH9 $"
>>>> CondorVersion = "$CondorVersion: 6.7.10 Aug 3 2005 $"
>>>> DaemonStartTime = 1155893119
>>>> Machine = "grid-eo-engine03.esrin.esa.int"
>>>> MaxJobsRunning = 200
>>>> MonitorSelfAge = 1200
>>>> MonitorSelfCPUUsage = 0.012500
>>>> MonitorSelfImageSize = 7612.000000
>>>> MonitorSelfResidentSetSize = 4620
>>>> MonitorSelfTime = 1155894319
>>>> MyAddress = "<193.204.231.32:22864>"
>>>> Name = "5670f86976d594674aa5ef1c9bc2b3b2@grid-eo-
>>>> engine03.esrin.esa.int"
>>>> NumUsers = 0
>>>> OpSys = "CondorC"
>>>> Requirements = TRUE
>>>> START = TRUE
>>>> ServerTime = 1155894460
>>>> SiteName = "grid-eo-engine03.esrin.esa.int"
>>>> StartdIpAddr = "<193.204.231.32:22864>"
>>>> State = "Unclaimed"
>>>> TotalFlockedJobs = 0
>>>> TotalHeldJobs = 0
>>>> TotalIdleJobs = 0
>>>> TotalJobAds = 0
>>>> TotalRemovedJobs = 0
>>>> TotalRunningJobs = 0
>>>> UpdateSequenceNumber = 1155894468
>>>> VirtualMemory = 0
>>>> WantAdRevaluate = True
>>>> WantResAd = TRUE
>>>> daemon_unique_name = "5670f86976d594674aa5ef1c9bc2b3b2"
>>>>
>>>> I suppose that after the output is retrieved, the jobs should
>>>> end, shouldn't they?
>>>>
>>>> Thanks for the support and cheers
>>>>
>>>> Vega Forneris
>>>>
>>>> +-----------------------------------------------+
>>>> ESA-ESRIN
>>>> Unix Systems Administrator
>>>> Via Galileo Galilei
>>>> 00044 Frascati (Rm) - Italy
>>>> Phone +39 06 94180581
>>>> Mailto: [log in to unmask]
>>>> +-----------------------------------------------+
>>>> Vitrociset S.p.A.
>>>> Unix System Administrator
>>>> Via Tiburtina 1020
>>>> 00100 Roma - Italy
>>>> Phone +39 06 8820 4297
>>>> Mailto: [log in to unmask]
>>>> +-----------------------------------------------+
>>>>
>>>> "I do not feel obliged to believe that the same God who has endowed
>>>> us with sense, reason, and intellect has intended us to forgot
>>>> their
>>>> use."
>>>> (Galileo Galilei)
>>> ------- End of Original Message -------
> ------- End of Original Message -------