Hi guys,
I wonder if anyone has any advice on this one:
I am looking for advice or best practice on how to move text
from Word Doc files to CMS (TYPO3) fields.
Background: On the IMPACT project (http://www.impact-project.eu/) we have
a website driven by a CMS (TYPO3, run by IMPACT partners in Gottingen) from
which we want to deliver a range of materials originally produced in MS Word,
by authors across Europe and collated by partners at the British Library.
We do not intend to deliver a very large number of documents
– somewhere between 50 and 100. Most of these are only 5-10 pages
with a few of up to 20-30. However, they are structured and include various
styles, tables, images, etc., and we want to retain as much of this as
possible. But the point is that it is not an enormous pile by any means
and although it is large enough to make an elegant automated workflow useful,
it is also not too much to rely on a more manual and pragmatic approach.
The Workflow: Originally, the plan was to establish a
workflow based on DocBook-XML, which would provide a way of holding the
documents in a standard way that was fully open and structured and could then
deliver in a range of ways including: HTML (via the CMS), PDF and
hopefully EPUB, or even back to RTF/Word.
Using DocBook has been harder to establish than we had
originally hoped, mainly due to:
· a lack of tools to convert Word Doc to DocBook
(we are using upCast RT, which works well but is quite complex in
use)
· a lack of expertise in XML to undertake the QA and
manual additions to the DocBook files
· difficulties finding anyone who can easily
create the XSLT for the transformations from DocBook to PDF and EPUB (we
have used a standard transformation for the conversion to XHTML, which seems to
work fine although it could still do with some tweaks)
As far as we have been able to test it, it works well,
especially for making the clean-html.
However, our inability to really ‘tie it down’
has led us to consider some other alternatives:
Clean-HTML: We are now considering using a
workflow that consists of transferring the Word docs to clean-html and using
that as a standard master-file. However, we are not quite sure what the
best way of doing this is; so far we have looked at:
· Using the ‘clean-html’ functionality
of Dreamweaver (this is how I have done it in the past, cleaning up
further in DW before pulling out the code)
· Word Cleaner 4.7 http://converttohtml.com
· DocToHtml v2.0 http://www.opilsoft.com/
· YAWC (yet another word converter) http://www.yawconline.com/
· Originally, our partners in Gottingen
recommended another program: Docvert at http://holloway.co.nz/docvert/index.html
but this requires a working web-server, which we don’t have easy access
to.
· Brian Kelly has suggested using the TinyMCE
editor in Wordpress (under Chrome), which seems to handle a fair amount of the
vagaries of the Word Docs and provide pretty clean html.
Word-CMS: A simple pragmatic approach - just
copy’n’pasting directly from Word to the CMS for the HTML and
to a branded template to make PDFs. This is easy enough and
pragmatic, but also slow and open to mistakes, and it does not support much
functionality.
And the $64,000 question: So, after a
long preamble, what is the best way to reliably get from
Word Docs to the CMS?
It would be ‘good’ if the method was able to map
the Word styles to the appropriate ‘heading-styles’ within the html
so they could use the standard CSS used by the CMS.
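To make the style-mapping idea concrete, here is a minimal sketch in Python (standard library only). It rests on assumptions that are not part of our current setup: that the Word files have been saved as .docx, that authors used the built-in heading styles (whose style IDs are normally "Heading1", "Heading2", and so on in English-language Word), and that the mapping table STYLE_TO_TAG is only a hypothetical example. It ignores tables, images and inline formatting, so it is an illustration of the mapping step rather than a converter.

```python
import xml.etree.ElementTree as ET
import zipfile
from html import escape

# Namespace used by WordprocessingML inside a .docx package.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical mapping from Word paragraph style IDs to HTML tags.
# Adjust it to whatever styles the authors actually used.
STYLE_TO_TAG = {"Title": "h1", "Heading1": "h1", "Heading2": "h2", "Heading3": "h3"}


def paragraphs_to_html(document_xml: str) -> str:
    """Turn each non-empty Word paragraph into a heading or <p> element."""
    root = ET.fromstring(document_xml)
    out = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # The paragraph style, if any, lives in w:pPr/w:pStyle/@w:val.
        style = p.find(f"{{{W_NS}}}pPr/{{{W_NS}}}pStyle")
        style_id = style.get(f"{{{W_NS}}}val") if style is not None else None
        # Concatenate all text runs; inline formatting is discarded.
        text = "".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t"))
        if not text.strip():
            continue
        tag = STYLE_TO_TAG.get(style_id, "p")
        out.append(f"<{tag}>{escape(text)}</{tag}>")
    return "\n".join(out)


def docx_to_html(path: str) -> str:
    """Read word/document.xml from a .docx file and convert it."""
    with zipfile.ZipFile(path) as package:
        return paragraphs_to_html(package.read("word/document.xml").decode("utf-8"))
```

Calling docx_to_html("report.docx") (a made-up file name) would return an HTML fragment with plain h1-h3 and p elements, which could then pick up the CMS stylesheet.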
It would be even better if the method allowed for
‘chunking’ of the html or creation of a linked T.O.C.,
both of which we can do with DocBook.
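As an illustration of the linked T.O.C. idea outside DocBook, here is a small Python sketch (standard library only). It assumes the clean-html already uses bare h1-h3 tags with no attributes, which may not hold for the output of every converter listed above; the function names and the "toc" class names are made up for the example.

```python
import re
from html import unescape

# Matches bare <h1>..</h1> to <h3>..</h3> elements (no attributes assumed).
HEADING_RE = re.compile(r"<h([1-3])>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def slugify(text, used):
    """Build a unique anchor id from heading text."""
    slug = re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-") or "section"
    candidate, n = slug, 2
    while candidate in used:
        candidate = f"{slug}-{n}"
        n += 1
    used.add(candidate)
    return candidate


def add_toc(html):
    """Return (toc_html, body_html) with an id added to every heading."""
    used = set()
    entries = []

    def repl(match):
        level, inner = match.group(1), match.group(2)
        # Strip any inline tags to get plain text for the anchor id.
        plain = unescape(re.sub(r"<[^>]+>", "", inner)).strip()
        anchor = slugify(plain, used)
        entries.append((int(level), anchor, inner))
        return f'<h{level} id="{anchor}">{inner}</h{level}>'

    body = HEADING_RE.sub(repl, html)
    lines = ['<ul class="toc">']
    for level, anchor, inner in entries:
        lines.append(
            f'<li class="toc-level-{level}"><a href="#{anchor}">{inner}</a></li>'
        )
    lines.append("</ul>")
    return "\n".join(lines), body
```

The same heading pattern could in principle be used to ‘chunk’ a document, by splitting the body at each h1, but that step is not shown here.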
A million projects across the world must be doing this
every day, so I can’t believe there isn’t an easy way to do
this. In the days prior to the use of CMS-driven sites, I would have just
dropped it all into Dreamweaver and coded it up and that would be that, but
this needs a more subtle approach.
What is the best practice?
Best Wishes and a happy w/end
eib
*********************************
Ed I Bremner
Research Officer
IMPACT - Improving Access to Text
UKOLN - University of Bath
skype: ed.bremner
twitter: impactocr
*********************************
******************************
Ed I Bremner
Consultant and Trainer in Digital Media
BremWeb Imaging
07973 335509
******************************