Hi guys,

 

I wonder if anyone had any advice on this one:

 

I am looking for advice or best practice on how to move text from Word Doc files to CMS (TYPO3) fields.

 

 

Background: On the IMPACT project (http://www.impact-project.eu/) we have a website driven by a CMS (TYPO3, run by IMPACT partners in Göttingen). From it we want to deliver a range of materials originally produced in MS Word by authors across Europe and collated by partners at the British Library.

 

We do not intend to deliver a very large number of documents: somewhere between 50 and 100. Most of these are only 5-10 pages, with a few at up to 20-30. However, they are structured and include various styles, tables, images etc., and we want to retain as much of this as possible. The point is that it is not an enormous pile by any means: it is large enough to make an elegant automated workflow useful, but not so large that we could not rely on a more manual and pragmatic approach.

 

The Workflow: Originally, the plan was to establish a workflow based on DocBook XML. This would hold the documents in a standard format that is fully open and structured, from which we could then deliver in a range of formats including HTML (via the CMS), PDF and hopefully EPUB, or even back to RTF/Word.

 

Establishing the DocBook workflow has been harder than we had originally hoped, mainly due to:

 

·         a lack of tools to convert Word Doc to DocBook (we are using upCast RT, which works well but is quite complex to use)

·         a lack of expertise in XML to undertake the QA and the manual additions to the DocBook files

·         difficulties finding anyone who can easily create the XSLT for the transformations from DocBook to PDF and EPUB (we have used a standard transformation for the conversion to XHTML, which seems to work fine although it could still do with some tweaks)

 

As far as we have been able to test it, the workflow works well, especially for making the clean-html.

 

However, our inability to really ‘tie it down’ has led us to consider some other alternatives:

 

Clean-HTML: We are now considering a workflow that consists of converting the Word docs to clean-html and using that as a standard master-file. However, we are not quite sure what the best way of doing this is. So far we have looked at:

·         Using the ‘clean-html’ functionality of Dreamweaver (this is how I have done it in the past, then cleaning up in DW before pulling out the code)

·         Word Cleaner  4.7  http://converttohtml.com

·         DocToHtml v2.0  http://www.opilsoft.com/

·         YAWC (yet another word converter)   http://www.yawconline.com/

·         Originally, our partners in Göttingen recommended another program, Docvert, at http://holloway.co.nz/docvert/index.html but this requires a working web server, which we don’t have easy access to.

·         Brian Kelly has suggested using the TinyMCE editor in WordPress (under Chrome), which seems to handle a fair amount of the vagaries of the Word docs and provides pretty clean html
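
To make clear what I mean by ‘clean-html’, here is a minimal sketch in Python (standard library only) of the kind of clean-up these tools perform on Word's ‘Save as HTML’ output. It is not how any of the tools listed above actually work; the whitelist of tags and attributes and the names WordHtmlCleaner and clean_word_html are my own assumptions for illustration.

```python
from html import escape
from html.parser import HTMLParser

# Assumed whitelist: tags kept in the output. Any other tag (span, font,
# o:p, div ...) is dropped, but the text inside it is kept.
ALLOWED_TAGS = {"h1", "h2", "h3", "h4", "p", "ul", "ol", "li", "table",
                "tr", "th", "td", "a", "img", "em", "strong", "b", "i", "br"}
# Attributes kept per tag; Word's class/style/lang attributes are discarded.
ALLOWED_ATTRS = {"a": {"href"}, "img": {"src", "alt"},
                 "td": {"colspan", "rowspan"}, "th": {"colspan", "rowspan"}}
VOID_TAGS = {"br", "img"}               # tags that take no closing tag
SKIP_CONTENT = {"head", "style", "script"}  # dropped together with their content


class WordHtmlCleaner(HTMLParser):
    """Re-emits only whitelisted tags and attributes, plus all body text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return
        keep = ALLOWED_ATTRS.get(tag, ())
        attr_text = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in keep and value is not None
        )
        self.out.append(f"<{tag}{attr_text}>")

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_word_html(html_text):
    parser = WordHtmlCleaner()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.out)


if __name__ == "__main__":
    dirty = ('<p class="MsoNormal" style="margin:0cm">'
             '<span lang="EN-GB">Hello <b>world</b></span><o:p></o:p></p>')
    print(clean_word_html(dirty))
```

A whitelist seemed safer than trying to list every piece of Word markup to remove, but it does mean anything not on the list (footnote markup, for example) is silently flattened to text.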

 

Word-CMS: A simple pragmatic approach: just copy’n’pasting directly from Word to the CMS for the HTML, and to a branded template to make PDFs. This is easy enough and pragmatic, but also slow, open to mistakes, and does not support much functionality.

 

And the $64,000 question, after a long preamble: What is the best way to reliably get from Word docs to the CMS?

 

It would be ‘good’ if the method was able to map the Word styles to the appropriate ‘heading-styles’ within the html so they could use the standard CSS used by the CMS.
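
As a rough illustration of the style mapping I have in mind, the sketch below (Python, standard library only) turns paragraphs that carry a Word style class into heading tags. The class names in STYLE_TO_TAG are hypothetical examples, not the styles used in our documents, and the regular expression assumes simple, un-nested paragraphs. It would have to run before any step that strips class attributes.

```python
import re

# Hypothetical mapping from Word paragraph-style class names to HTML tags.
STYLE_TO_TAG = {
    "MsoTitle": "h1",
    "ImpactHeading2": "h2",
    "ImpactHeading3": "h3",
}

# Matches <p class=Name ...>...</p>, with or without quotes around the class.
STYLED_P = re.compile(
    r'<p\s+class="?([A-Za-z0-9_-]+)"?[^>]*>(.*?)</p>',
    re.IGNORECASE | re.DOTALL,
)


def map_styles_to_headings(html_text, mapping=STYLE_TO_TAG):
    """Replace styled paragraphs with the heading tag their style maps to."""

    def replace(match):
        tag = mapping.get(match.group(1))
        if tag is None:
            return match.group(0)  # unknown style: leave the paragraph alone
        return f"<{tag}>{match.group(2)}</{tag}>"

    return STYLED_P.sub(replace, html_text)


if __name__ == "__main__":
    sample = '<p class=MsoTitle style="x">Report</p><p class="MsoNormal">Body</p>'
    print(map_styles_to_headings(sample))
```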

It would be even better if the method allowed for ‘chunking’ of the html or creation of a linked T.O.C., both of which we can do with DocBook.
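
To show what I mean by a linked T.O.C. and ‘chunking’, here is a minimal sketch in Python (standard library only). It assumes the html is already clean, with plain h1 to h3 headings and no attributes on them. The function names and the ‘toc-level-N’ class are my own inventions, and splitting at each h1 is only one possible chunking rule.

```python
import re

# Assumes clean input: headings written as <h1>..</h1> with no attributes.
HEADING = re.compile(r"<h([1-3])>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def make_anchor(title, used):
    """Build a unique, URL-friendly id from the heading text."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "section"
    candidate, n = slug, 2
    while candidate in used:
        candidate = f"{slug}-{n}"
        n += 1
    used.add(candidate)
    return candidate


def add_anchors_and_toc(html_text):
    """Return (toc_html, body_html) with an id on every h1-h3 heading."""
    used, entries = set(), []

    def replace(match):
        level, inner = int(match.group(1)), match.group(2)
        title = re.sub(r"<[^>]+>", "", inner).strip()  # heading text only
        anchor = make_anchor(title, used)
        entries.append((level, anchor, title))
        return f'<h{level} id="{anchor}">{inner}</h{level}>'

    body = HEADING.sub(replace, html_text)
    items = "".join(
        f'<li class="toc-level-{level}"><a href="#{anchor}">{title}</a></li>'
        for level, anchor, title in entries
    )
    return f"<ul>{items}</ul>", body


def chunk_by_h1(html_text):
    """Split the document into one chunk per top-level heading."""
    parts = re.split(r"(?=<h1[ >])", html_text)
    return [part for part in parts if part.strip()]


if __name__ == "__main__":
    doc = "<h1>Intro</h1><p>a</p><h2>Scope</h2><p>b</p><h1>Method</h1><p>c</p>"
    toc, body = add_anchors_and_toc(doc)
    print(toc)
    for chunk in chunk_by_h1(body):
        print(chunk)
```

Each chunk could then be pasted into its own CMS page or content element, with the T.O.C. linking between them, although the links would need rewriting if the chunks end up on separate pages.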

 

A million projects across the world must be doing this every day, so I can’t believe there isn’t an easy way to do this. In the days before CMS-driven sites, I would have just dropped it all into Dreamweaver, coded it up, and that would be that, but this needs a more subtle approach.

 

What is the best practice?

 

Best Wishes and a happy w/end

 

eib

 

 

*********************************

Ed I Bremner

Research Officer

IMPACT - Improving Access to Text

UKOLN - University of Bath

e-mail:  [log in to unmask]

skype:   ed.bremner

twitter:  impactocr

*********************************

 

 

 

******************************

Ed I Bremner

Consultant and Trainer in Digital Media

BremWeb Imaging

www.bremweb.co.uk

[log in to unmask]

07973 335509

******************************