Hi Sam,
Yes, I agree it's very difficult to predict file hotness. The results
show that most cache hits on a file happen within 12 hours of it being
cached (I guess that's when the fail-and-retries happen most
frequently). So for XCache I would suggest a decision library to filter
out files that won't be hit at all, e.g. user outputs and logs, plus a
more aggressive purging policy to evict cold files from the cache.
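
Something like the sketch below is the shape of what I mean (a rough
illustration only; the path patterns and the 12-hour threshold are my
assumptions from the hit-time results, not a tested policy):

import re
import time

# Patterns for files we expect never to be hit again (assumed; these
# would need tuning against real access logs). User outputs and logs
# fall into this class.
SKIP_PATTERNS = [
    re.compile(r"/user/.*\.(out|err)$"),
    re.compile(r"\.log(\.\d+)?$"),
]

def should_cache(path):
    """Decision filter: return False for files expected to stay cold."""
    return not any(p.search(path) for p in SKIP_PATTERNS)

def files_to_purge(last_access, max_idle_hours=12):
    """Aggressive purge: select everything not hit within the hotness
    window, since most hits land within 12 hours of caching.
    last_access maps file path -> last-hit time in epoch seconds."""
    cutoff = time.time() - max_idle_hours * 3600
    return [path for path, t in last_access.items() if t < cutoff]

The real thing would of course hook into XCache's own purge cycle; this
is just to show the shape of the decision.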
Thanks,
Teng
Quoting Sam Skipsey <[log in to unmask]> on Thu, 5 Jul 2018
16:35:41 +0100:
> Hi Teng,
>
> So, when I looked at the file accesses at Glasgow (sans cache), a lot of
> our "very hot" AODs were failed-and-retried / resubmitted jobs. (We did
> also have some hot AODs from a single job set - which is why Wahid's
> hot-file-replicator was invented - but it was pragmatically very hard to
> work out in advance whether a file was going to be hot or not, and they
> were surrounded by accesses for lots of files hit once. The same applied
> to the old pcache stuff, which tried to do WN-local caching of file
> accesses for ATLAS in pathena. There's quite a long history of trying to
> do opportunistic caching or optimisation of hot files in ATLAS at Tier
> 2s!)
>
> I think this is a key thing to understand to see if caching data files is
> useful.
>
> Sam
>
> On Thu, Jul 5, 2018 at 4:22 PM Teng Li <[log in to unmask]> wrote:
>
>> Hi Sam and Chris,
>>
>> I'm calculating cache hits at the file level, as whole files are copied
>> to the WN before jobs begin to run.
>>
>> And yes, the cache hit rate is probably high because the WNs read data
>> from a single SE rather than a federation with larger capacity, and jobs
>> are dispatched to WNs whose attached SE already holds the needed files.
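>>
>> (For what it's worth, the brokering effect I mean is roughly the sketch
>> below; the helper is hypothetical, just to illustrate locality-aware
>> dispatch:)
>>
>> def pick_wn(worker_nodes, needed_files):
>>     """worker_nodes maps WN name -> set of files held on its attached
>>     SE. Prefer the WN with the most input files already local."""
>>     def locality(name):
>>         return sum(f in worker_nodes[name] for f in needed_files)
>>     return max(worker_nodes, key=locality)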
>>
>> By config I mean the disk size used by the cache. As XCache always
>> evicts the coldest files first, a cache disk that is either too small or
>> too large will hurt performance. The test results indeed show that about
>> 90% of the input AOD files are hit exactly once, but the remaining 10%
>> (some of which are very hot) account for roughly two thirds of all cache
>> hits; the remaining third comes from library files.
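>>
>> A toy replay like the one below is how I think about the sizing
>> trade-off (just a sketch; the access-log format is my assumption, not
>> XCache's actual log):
>>
>> from collections import OrderedDict
>>
>> def lru_hit_rate(accesses, disk_bytes):
>>     """Replay a list of (filename, size_bytes) accesses through an LRU
>>     cache of the given capacity and return the resulting hit rate."""
>>     cache = OrderedDict()  # filename -> size, coldest (LRU) first
>>     used = 0
>>     hits = 0
>>     for name, size in accesses:
>>         if name in cache:
>>             hits += 1
>>             cache.move_to_end(name)  # refresh recency on a hit
>>             continue
>>         # Miss: evict the coldest files until the new file fits.
>>         while cache and used + size > disk_bytes:
>>             _, freed = cache.popitem(last=False)
>>             used -= freed
>>         if size <= disk_bytes:
>>             cache[name] = size
>>             used += size
>>     return hits / len(accesses) if accesses else 0.0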
>>
>> I'm not sure why some of the AOD files are touched multiple times; it
>> could be because a) jobs are resubmitted, b) jobs fail and are retried,
>> or c) some AOD files are indeed used many times.
>>
>> Cheers,
>> Teng
>>
>>
>> Quoting Sam Skipsey <[log in to unmask]> on Thu, 5 Jul 2018
>> 15:39:53 +0100:
>>
>> > Config where? In the jobs or the cache? Much ATLAS analysis I've seen
>> > tends to parallelise over data files, which means that you get 1 hit
>> > on each file in the run, on average (which you also see in your cache
>> > too, AFAICT). How do you know if you're going to get a hot file in
>> > advance?
>> >
>> > Sam
>> >
>> > On Thu, Jul 5, 2018 at 3:23 PM Teng Li <[log in to unmask]> wrote:
>> >
>> >> Hi Sam,
>> >>
>> >> Thanks that's much appreciated.
>> >>
>> >> I'm trying to avoid us presenting opposite opinions in back-to-back
>> >> talks :) as I'm going to say that XCache is an efficient network
>> >> amplifier for ATLAS (the cache hit rate on data files can reach over
>> >> 40% with an optimised config).
>> >>
>> >> Cheers,
>> >> Teng
>> >>
>> >>
>> >> Quoting Sam Skipsey <[log in to unmask]> on Thu, 5 Jul 2018
>> >> 15:11:44 +0100:
>> >>
>> >> > Hi Teng,
>> >> >
>> >> > Sure, but I don't want to give your talk before you give it ;)
>> >> >
>> >> > The reason I'm mentioning that your cache is a local one, in network
>> >> > terms, is to contrast it with Chris' results (which are explicitly a
>> >> > cache of (CMS AAA) data from remote sites). Hence, your results show
>> >> > hit rates, not Chris' transmission rates etc. [because you're not
>> >> > concerned with latency comparisons yet, as these would be very
>> >> > different remote versus local, I assume].
>> >> >
>> >> > In any case, I agree with you that the main point for my slides is
>> >> > the cache hit rate, which substantially agrees with Chris' experience
>> >> > at RALPP for remote CMS data.
>> >> >
>> >> > I've added a sentence to the slide to note that this is intended as
>> >> > a general test of the feasibility of caching in front of SEs, if that
>> >> > helps?
>> >> >
>> >> > Sam
>> >> >
>> >> > On Thu, Jul 5, 2018 at 2:43 PM Teng Li <[log in to unmask]> wrote:
>> >> >
>> >> >> Hi Sam,
>> >> >>
>> >> >> Yes, the current implementation is local. But I will make clear in
>> >> >> my talk that it's a test simulating a transparent cache between WNs
>> >> >> and a remote SE. As the performance evaluation focuses on the cached
>> >> >> content, whether the backend SE is remote or not is not vital to the
>> >> >> study. And XCache should be more useful for caching remote data for
>> >> >> future diskless sites.
>> >> >>
>> >> >> Best Regards,
>> >> >> Teng
>> >> >>
>> >> >>
>> >> >> Quoting Sam Skipsey <[log in to unmask]> on Thu, 5 Jul 2018
>> >> >> 14:20:00 +0100:
>> >> >>
>> >> >> > Hi Teng,
>> >> >> >
>> >> >> > Yes, it's opportunistic because you're just caching "stuff which
>> >> >> > is being transferred right now, in case it is useful later".
>> >> >> > In the case of the testing you're presenting, the cache is also
>> >> >> > local - because it is between your local SE and your local compute.
>> >> >> > (I know the plan is to make it more remote, but IIRC, your actual
>> >> >> > implementation currently doesn't proxy outside ECDF?)
>> >> >> >
>> >> >> >
>> >> >> > Thanks for the better slide.
>> >> >> > Sam
>> >> >> >
>> >> >> > On Thu, Jul 5, 2018 at 2:16 PM Teng Li <[log in to unmask]> wrote:
>> >> >> >
>> >> >> >> Hi Sam,
>> >> >> >>
>> >> >> >> Thanks. Just one comment on slide 11.
>> >> >> >>
>> >> >> >> Maybe I've misunderstood the term "local opportunistic cache",
>> >> >> >> but I think the XRootD proxy cache is mostly useful between local
>> >> >> >> WNs and remote SEs, which is what we are simulating and planning
>> >> >> >> to address in my following talk. So I would rather say "XRootD
>> >> >> >> proxy cache between a remote SE and worker nodes".
>> >> >> >>
>> >> >> >> Also, the plot is a little blurry. I've attached a clearer one.
>> >> >> >>
>> >> >> >> Cheers,
>> >> >> >> Teng
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >> >> Quoting Sam Skipsey <[log in to unmask]> on Thu, 5 Jul 2018
>> >> >> >> 13:33:23 +0100:
>> >> >> >>
>> >> >> >> > As promised, here's a draft of the CHEP Tier 2 caching talk.
>> >> >> >> >
>> >> >> >> > All comments gratefully received (the frontispiece needs me to
>> >> >> >> > add credits anyway, so if you comment you get a credit as
>> >> >> >> > well)...
>> >> >> >> >
>> >> >> >> > Sam
>> >> >> >> >
>> >> >> >> >
--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.