Last week I explored what precisely makes up the 20 year archive of the web held in the Internet Archive’s Wayback Machine. Several of those findings have spawned considerable discussion over the past week within the library and web archival communities about what it means to archive the web, how much documentation and metadata is enough, the tradeoffs in completeness vs reach, and how to better engage with the myriad constituencies served by web archives.
Why is it so important to understand what’s in our web archives? Perhaps the most important reason is that as an infinite and ever-changing landscape, it is simply impossible to archive the “entire internet” and perfectly preserve every change to every page in existence. Web archives are by their very nature an imperfect record of the web and constructing them is an exercise in countless tradeoffs of how to preserve an infinite stream with finite resources.