Hi Steve,

Burke, S (Stephen) wrote:
> As Owen said, this is not a good solution because you won't be able to read
> the files, the normal replica management tools need to find the SE in the
> information system.

How's that? I need no infosys to query RLS, and RLS records carry pretty explicit SFNs, don't they? globus-url-copy needs no infosys either.

Anyway, I am not really insisting on removing SEs from the infosystem; but do you know of any LCG tool or method that actually makes use of the published free space? And you mentioned yourself that what is published is the overall space, not a per-VO quota. I'm suggesting what is (in my opinion) the least damaging solution, and I'm willing to discuss alternatives.

> Also, intrinsically a full SE is not a fatal error any more than a full
> disk on any system, it's just that users need some way of dealing with
> the condition.

A full disk on a system is not a fatal error - I have plenty of full ones sitting around. I just checked: NorduGrid has 17 out of 43 disk SEs completely full. You simply use the system read-only, which is perfectly fine for a Storage Element. A full *system* partition is fatal, but I am sure nobody keeps the storage area and the system area on the same partition.

> The free space is published in the information system so it should be
> possible to recognise the situation and deal with it in whatever way you
> like - maybe atlas would actually rather leave the SEs full and write new
> files somewhere else.

This is effectively the situation. If a job fails to write to an SE - whatever the reason - it will eventually store the file wherever it can be stored. It doesn't use the free space reported in the infosystem, just a "kamikaze" method :-)

BTW, the reported free space is useless for yet another reason: imagine there are 10 GB reported free, and 10 jobs read this information simultaneously (and they do, even more than 10), and duly start uploading a 2 GB file each. Guess what will happen. Right, all will fail.
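To spell out the arithmetic, here is a toy simulation of that race. All sizes and the round-robin chunked-write model are just illustrative assumptions for this sketch, not how GridFTP actually schedules I/O:

```python
GB = 1024**3
actual_free = 10 * GB          # real free space on the SE
cached_free = actual_free      # stale figure every job sees in the BDII
n_jobs, upload = 10, 2 * GB    # 10 jobs, 2 GB file each
chunk = 100 * 1024**2          # 100 MB per write, an arbitrary granularity

# Every job consults the same cached value and concludes its 2 GB fits:
assert all(upload <= cached_free for _ in range(n_jobs))

# But the jobs write concurrently; model their interleaved writes.
written = [0] * n_jobs
failed = [False] * n_jobs
while not all(f or w >= upload for w, f in zip(written, failed)):
    for i in range(n_jobs):
        if failed[i] or written[i] >= upload:
            continue
        if actual_free < chunk:
            failed[i] = True   # disk full mid-transfer
        else:
            actual_free -= chunk
            written[i] += chunk

print(sum(failed))  # -> 10: every transfer runs out of space before finishing
```

Ten jobs need 20 GB between them, so with only 10 GB actually available, the interleaved writes fill the disk while every single transfer is still incomplete - not even one file makes it.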
Meanwhile, the SE GRIS will time out because the system will get overloaded with 10 multithreaded transfers, and 10 more jobs will still see the 10 GB free, because that is what will be cached in the BDII. And so on. Ain't that cool.

> The only reason it can be a problem is on systems where all VOs share the
> space and there are no quotas, so one VO can block the others.

So, we can block LHCb and they can block us. We're even ;-)

> A separate point is the question of reliability. Tier-1s will typically
> commit to a high level of reliability so you can reasonably expect that
> files there are safe. Many sites, even ones with large amounts of space,
> may not have much reliability or backup, so if disks crash data may be
> lost. I'm not sure how that can be represented, how do you quantify the
> likelihood of losing data?

Nobody's perfect. A certain person here suggested having data-loss insurance :-) The smaller the site, the less compensation is to be paid. Profits from the insurance company should finance the purchase of more storage hardware. How's that? ;-)

Seriously, I would suggest changing the entire LCG SE model - and the information system schema. Only a reasonably reliable facility, committed to long-term storage, should qualify as an SE. SEs should not necessarily be linked to sites; they should be standalone services available via GridFTP, SRM, whatever, and register with GIISes independently of the rest of the site. Thus we would be able to have a set of sites for running jobs, and a [different] set of SEs for storing the results. The disk space local to a site and necessary for its proper functioning should be renamed and treated as "cache", and must not be used for long-term data storage. Of course, the "real" SE disk space may well be cross-mounted on the WNs; that is up to the sysadmin.

Oxana