The construction of a web archive begins with defining the archive’s purpose. Some web archives have very specific inclusion criteria and focus on narrow topics for which there is a known and limited universe of content to preserve. Other web archiving initiatives simply archive what they can through a vast web of sources and donors, without any overarching collection strategy or user community to guide them, in marked contrast to the approach the library and archival communities typically take to physical collections.

Some of the largest collections of archived web content are thus multi-petabyte datasets compiled over years or even decades through criteria, seed lists, crawler designs, and design decisions, both explicit and inadvertent, that have long since been lost to time or that are considered proprietary and cannot be shared.

http://bit.ly/2hWIoHK

Peterk
Dallas, Tx