Dear List Members
On 6th August I posted asking for views on issues to consider for a proposed large-scale scanning project. With apologies for the delay, here is a synopsis of the views and ideas expressed.
The post received around 20 replies. Six gave detailed advice or information, while others said they were interested in the issue for future projects in their organisations. Some described experience of large-scale scanning as a stand-alone project, broadly the requirement I was asked to investigate, while more contacts described scanning as a component of a wider business change, such as installing an EDRMS and/or introducing a digital mailroom. Issues are grouped by broad topic below.
ESTIMATING AND METRICS
Views were mixed on whether there are reliable industry-wide metrics. Some suggested figures, but others argued it was not feasible to forecast volume simply from the file type; different content can mean one PDF takes N kb whereas another takes 2×N kb or more.
The only source who suggested a standard measure said that for an A4 typed document scanned in monochrome at 300 dpi into a PDF file, we should allow 40 kb per side of text. This was based on an image-over-text format: the scanned image on top, with OCR-searchable text behind it. However, another argued that a single-page 200 dpi TIFF can vary from 10 kb to more than 100 kb depending on print grain and point size, and if colour is needed, variations in compression or colour tone can take a single image from under 10 kb to over 1 MB.
Some contacts suggested we should not be too concerned about file size when the cost of disk storage is so low. No one mentioned cloud storage costs.
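As a rough sanity check on the 40 kb-per-side figure above, the arithmetic can be sketched as follows. The page count is an illustrative assumption, not a figure from any reply:

```python
# Rough storage estimate using the suggested allowance of 40 kB per
# scanned side (image-over-text PDF, A4 monochrome, 300 dpi).
# The page count below is an illustrative assumption.

KB_PER_SIDE = 40          # suggested allowance per side of text
pages = 1_000_000         # hypothetical volume, double-sided originals
sides = pages * 2

total_kb = sides * KB_PER_SIDE
total_gb = total_kb / (1024 ** 2)

print(f"{sides:,} sides at {KB_PER_SIDE} kB/side ≈ {total_gb:.1f} GB")
```

Even a million double-sided pages come out under 100 GB at this allowance, which supports the view that disk cost is unlikely to be the constraint.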
In terms of scanning throughput, it was suggested a project should handle at most 100,000 pages per month. While the sheet-feeder scanning element could run at "eye-watering speeds", the need for human input in preparation and indexing would limit the amount of paper that can be scanned each month.
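The practical consequence of that ceiling is that project duration is driven by human throughput, not scanner speed. A minimal sketch, with the total volume as an illustrative assumption:

```python
# Duration estimate driven by the suggested practical ceiling of
# ~100,000 pages per month (preparation and indexing, not scanner speed).
# The total project volume below is an illustrative assumption.

PAGES_PER_MONTH = 100_000   # suggested human-throughput ceiling
total_pages = 1_500_000     # hypothetical project volume

months = total_pages / PAGES_PER_MONTH
print(f"{total_pages:,} pages at {PAGES_PER_MONTH:,}/month ≈ {months:.0f} months")
```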
At least one correspondent strongly rejected the concept of benchmark metrics, arguing we would have to gather our own sample to estimate from. The sample would need to feature 'average' documents, and the exercise would need to count the keystrokes required to index them and the number of indexing fields.
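The keystrokes-and-fields approach above can be turned into a simple scaling calculation. All the figures here are illustrative assumptions, standing in for values that would come from a measured sample:

```python
# Sketch of scaling up indexing effort from a measured sample, per the
# keystrokes-and-fields suggestion above. Every figure here is an
# illustrative assumption, not a benchmark.

docs = 250_000               # hypothetical number of files to index
fields_per_doc = 4           # indexing fields captured per file
keystrokes_per_field = 12    # average measured from the sample
keystrokes_per_hour = 9_000  # assumed sustained data-entry rate

total_keystrokes = docs * fields_per_doc * keystrokes_per_field
hours = total_keystrokes / keystrokes_per_hour
print(f"≈ {hours:,.0f} hours of indexing effort")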
A range of variables needs considering in estimates: how much work is involved in removing staples, clips and dust covers? How much space is needed to store these while the contents are being scanned? How frequently will files be needed for core business during the project? If files are taken off-site (or even off-shore), what are suitable batch sizes for the available transport? Is the indexing data "objective" (in a standard place or standard form) or "subjective" (needing expertise to find and interpret)? How accurate must the indexing actually be?
Complexity is another important variable. Scanning a single record series is easier if the records all use the same "metadata set", which can be auto-populated by some (unspecified) software products.
PILOTING
This was the most common theme: nearly everyone called for a pilot, and estimating must be based on a sample trial. Several recommended contacting a choice of bureaux and including a sample trial as part of the bidding process. A practical trial of people and technologies will reveal qualitative factors, such as scan readability from a range of originals, as well as quantitative hard costs.
PROJECT MANAGEMENT AND BUSINESS ENGAGEMENT
Contacts stressed the need for clear objectives and senior management sponsorship, plus a project executive or manager actually driving the project forward. There was a strong recommendation to appoint a dedicated project manager and to produce and agree a clear specification for the work.
Some contacts said they had resourced the work in-house. It appears that if firms had spare accommodation, then leasing machines and deploying staff in-house could be considered. Most sources, however, recommended outsourcing as far as possible, with in-house resources quality-assuring a sample of the scans.
The need to spread effort across business process, people and technology was highlighted; otherwise the project risked being viewed as an "ICT initiative", and users are more likely to try to avoid making the change.
COMMERCIAL
Where EDRMS services are being introduced, many vendors will recommend a partner or preferred-supplier scanning bureau, although this may limit competition compared with an open marketplace.
A number of contacts mentioned firms used in the past. While I don't feel it appropriate to name these on an open list, I did note that everyone seemed to have had a good experience, expressing positive views. No one related bad experiences or suggested a firm to avoid.
In our own case, we have been directed to use a specific scanning firm that already has a framework contract. This has offered benefits in avoiding the overhead of external tendering, but meant we were not able to compare the technical capabilities and capacity of vendors in the wider market.
THE SCANNING PROCESS
Thanks to everyone who described the four-step process:
1 Cleansing – removing staples etc. and placing sheets in neat order
2 Scanning and Indexing
3 Quality Assurance – checking scans are legible and accurate, making sure double-sided scans are made of double-sided originals, etc.
4 Managing the paper originals – either returning them to storage adequately labelled or destroying them securely.
The scanning step itself appears the simplest, being almost completely automated, whereas significant human input is needed at all the other steps.
One contact suggested that reviewing and indexing manually from file header sheets was simpler than trying to capture metadata from content by OCR, particularly for administrative services (HR, student records). OCR was used in industries where search and discovery are paramount (such as mining).
Quality control can be on a sample basis. Early in the project, sample heavily; as confidence in quality grows (on both sides), the rate could reduce to 5%.
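A declining sampling schedule of this kind can be sketched as a simple step function. The early and mid-project rates and batch thresholds below are illustrative assumptions; only the 5% steady-state figure comes from the replies:

```python
# Sketch of a declining QC sampling schedule: inspect heavily early on,
# stepping down towards 5% as confidence grows. The 50%/20% rates and
# the batch thresholds are illustrative assumptions; only the 5%
# steady-state figure comes from the list replies.

def qc_sample_rate(batch_number: int) -> float:
    """Fraction of a batch to inspect, given how many batches have been done."""
    if batch_number <= 5:      # early batches: inspect half of each batch
        return 0.50
    elif batch_number <= 20:   # mid-project: step down
        return 0.20
    else:                      # steady state once both parties are confident
        return 0.05

for n in (1, 10, 50):
    print(f"batch {n}: sample {qc_sample_rate(n):.0%}")
```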
Preparing files and re-packaging them after scanning (if the paper is being kept) are normally the largest aspects of a project, and contacts suggested many projects under-estimate this element.
LOCATION AND LOGISTICS
A number of contacts said they had successfully run the operation in-house, or at least on their own premises. This offered better control and, for highly sensitive material, could prove safer, but it needed very accurate planning estimates. For a large project, there was a suggestion to set up parallel teams to generate some healthy competition and motivation. On the smallest projects, the view was to segregate tasks: for example, two colleagues on preparation, one running the scanner, and two more indexing and repackaging.
Other contacts argued against in-house operation, saying bulk scanning is a specialist activity, particularly at volumes approaching 0.25 million files. Even where the volume and duration of work could not be accurately forecast, contacts recommended shipping the material to an external bureau instead. A bureau should be able to help with specification analysis, as this is their core business.
Be sure of the logistics of getting materials to the bureau. You could have an ultra-fast scanning team, but they cannot excel if the transport company creates a bottleneck in getting goods to them. There was a preference for shipping files to a single central bureau rather than to multiple small bureaux near the files.
Make sure you have a means of accessing files needed for core business during the scanning project, before the scanning starts. Likewise, agree a process for handling new paperwork that comes in after scanning has started.
DIGITAL STORAGE
TIFF format was only suggested where there is a need for a high level of compliance.
As storage will be needed for the scanned images, most contacts reported they had implemented an EDRMS product (no one mentioned SharePoint) at the time of the scanning project, including those where EDRMS had not been the original driver for the scanning work.
Kofax users mentioned a ‘connector’ facility that would route scanned documents directly into an EDRMS fileplan.
For less complex operations, there was a suggestion to store images on network drives, subject to warning the network managers about the sudden increase in volume!
Many thanks to everyone who contributed. The information received has been valuable in helping to plan this project. The project is still at an early stage and could be down-scoped, as early estimates and liaison with the government-framework supplier suggest the original volume could take 12 months to complete. This may mean only files up to 8 years old being scanned, as these are the most accessed, with older files remaining in paper form unless or until there is increased audit attention on that earlier period. The resource demands of manual indexing also led to a decision to avoid complex indexing (either manual or OCR), with files being scanned at whole-box level (putting multiple files into one PDF) and then indexed and the PDFs split later.
Kind regards
Colin Tyc
To view the list archives go to: https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=RECORDS-MANAGEMENT-UK
To unsubscribe from this list, send an email to [log in to unmask] with the words UNSUBSCRIBE RECORDS-MANAGEMENT-UK
For any technical queries re JISC please email [log in to unmask]
For any content based queries, please email [log in to unmask]