The XML Trenches posts are case studies of real-world publisher XML events. They illustrate that digital content strategies need the right tools and the right approach, and they serve as a guide to help publishers make the right decisions.
We just received a DocBook XML file from a publisher who asked, "Can this be used to make a Mobipocket file for Amazon?" The DocBook XML had already been used by the producer to make an ePub. We cracked it open to have a look.
It was a valid and well-formed DocBook file with a custom namespace, but there were significant tagging and structural errors which stop it being usable without spending time, and time = money. I have posted on this before in Sustainable XML Strategies.
This is the issue with every DocBook-produced book XML file I have ever seen: none of them can be used to produce more outputs without significant analysis, time and more work.
I am pushing the XHTML vs. XML wheelbarrow hard and fast here, as I have just been involved in a lengthy discussion on the value of XML vs. XHTML in the "eBooks/ePub Technologies" group on LinkedIn. If you are a publisher about to dip your feet into some form of digital content strategy, you need to know what you are getting.
This is a slightly technical discussion, but the purpose is to show some typical problems that arise with XML if it is not carried out correctly.
The Analysis
1. The input XML files are split. These need to be merged before importing, on the basis of an <xi:include> in the frontmatter section. This in itself is not an issue (it's valid XML), but why a small novel would be treated this way is an interesting exercise in making an easy job hard, and archiving more expensive.
2. Frontmatter/Backmatter sections are tagged as chapters without any "role" attribute. When imported, these obviously appear as body chapters. This is absolutely, utterly, completely, stupidly wrong DocBook tagging. If you are only making an ePub you can get away with it; if the publisher doesn't have the technical resources to check the XML, then any junk will do.
3. The Dedication section was tagged inside Part. In the DTD, dedication is a frontmatter section. If there really were a dedication inside a part, it would have to be tagged as <chapter role="dedication"> (I don't know why anyone would do this); otherwise, in frontmatter, it can just be <dedication>. This implies more such careless inanities in the corpus.
4. The input XML files are formatted for pretty printing. The paragraphs have huge numbers of hardcoded empty spaces in them which have to be explicitly stripped. This is very bad for an archive production file and needs more processing.
5. The XML files use a custom namespace "xxx". Any custom elements need to be handled separately. This means there is a custom DocBook DTD floating around somewhere for something, and it will probably turn up in some other book, but we can't tell the impact from the supplied sample.
6. The input package contained images in EPS format, and the images referenced in the XML are EPS files. While this is not wrong, it does imply a post-XML manual method of image processing for the eBook formats.
7. There were numerous other formatting errors that were probably fixed manually during ePub creation.
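To make points 2, 3 and 6 concrete, here is a minimal sketch of how that content could have been tagged (element names are from the standard DocBook 4.x DTD; the titles, filenames and role values are illustrative, not from the publisher's actual files):

```xml
<book>
  <!-- Frontmatter sections use their own elements, not bare <chapter> -->
  <dedication>
    <para>For the readers.</para>
  </dedication>
  <preface>
    <title>Preface</title>
    <para>Preface text goes here, not in an untagged body chapter.</para>
  </preface>
  <part>
    <title>Part One</title>
    <chapter>
      <title>Chapter One</title>
      <para>Body text belongs here.</para>
      <!-- One imageobject per target format, so no manual image step
           is needed when a new output format comes along -->
      <mediaobject>
        <imageobject role="print">
          <imagedata fileref="fig01.eps" format="EPS"/>
        </imageobject>
        <imageobject role="web">
          <imagedata fileref="fig01.png" format="PNG"/>
        </imageobject>
      </mediaobject>
    </chapter>
  </part>
  <!-- Backmatter likewise has its own elements: <appendix>, <colophon> -->
</book>
```

None of this is exotic; it is what the DTD already provides, which is exactly why tagging frontmatter as anonymous chapters is so inexcusable.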
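Point 4 at least can be fixed mechanically. A minimal sketch in standard XSLT 1.0 of an identity transform that strips the pretty-printing whitespace (assuming, as in this novel, there are no elements where layout whitespace is significant):

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Drop inter-element indentation whitespace -->
  <xsl:strip-space elements="*"/>
  <!-- Keep whitespace where it is meaningful -->
  <xsl:preserve-space elements="programlisting literallayout screen"/>

  <!-- Identity: copy everything else through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Collapse runs of spaces and newlines inside text nodes -->
  <xsl:template match="text()">
    <xsl:value-of select="normalize-space()"/>
  </xsl:template>
</xsl:stylesheet>
```

Note that normalize-space() also trims spaces adjacent to inline markup such as <emphasis>, so the text() template needs refinement before production use; the point is that this defect is scriptable. The other defects are not, which is where the cost comes from.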
While the number of defects is not massive, each one is a show-stopper for standard processing and requires human intervention and processor adjustment. Ka-Ching! The cash register rings. XML has to be usable: it must reduce future costs and enable new business revenue strategies.
Whoever created this was not particularly concerned about the XML value as they were just delivering an ePub. They probably have some manual or semi-automated method to get from the DocBook garbage to XHTML for ePub and once there the original XML structure doesn't matter anymore because it is just a stack of XHTML files with NCX links.
This DocBook tagging is from the XML slum, where the roof may keep the falling rain out, but the drainage doesn't work.
We XSL processed the files and imported them into IGP:FLIP, where they were transformed into FoundationXHTML. We were able to instantly create the Mobipocket output using Formats on Demand, and then manually cleaned up the rest of the nonsense for the final output.
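FoundationXHTML is IGP's internal profile, but the kind of transform involved is conventional DocBook-to-XHTML mapping. A hedged sketch in XSLT 1.0 (the class names here are illustrative assumptions, not IGP's actual vocabulary) of the sort of templates required:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml">
  <xsl:output method="xml" indent="no"/>

  <!-- A chapter becomes a classed division -->
  <xsl:template match="chapter">
    <div class="chapter">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <!-- A role attribute, when present, can drive the mapping;
       when it is missing, frontmatter lands in the body (point 2) -->
  <xsl:template match="chapter[@role='dedication']">
    <div class="dedication">
      <xsl:apply-templates/>
    </div>
  </xsl:template>

  <xsl:template match="chapter/title">
    <h2 class="chapter-title"><xsl:apply-templates/></h2>
  </xsl:template>

  <xsl:template match="para">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:stylesheet>
```

The point: every deviation from expected tagging in the list above is another template somebody has to discover, write and test, per book, per format. That is where the money goes.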
We are charging around $25 per book (the profit on 5-10 Kindle sales) for the clean-up because there is a lot of manual intervention required on each book. OK, that's not much, but it adds up if you have hundreds of books, and this has to be done every time a new format hits the market. If the DocBook had been done properly in the first instance it could have been $5 per book (1-2 Kindle sales). If it had been produced using IGP:FLIP, the additional formats would have cost nothing.
We will probably have to add a tissue allowance as we weep with anguish while processing these poor sad books.
Summary
DocBook can be used as the XML source. But it has to be created correctly, and there is no way to tell from format-generated outputs that the tagging is correct and valuable.
Well-formed and valid XML is absolutely no guarantee of the value of the XML. The tagging patterns have to be correct.
So-called XML-first strategies are opaque, usually incomplete, and always cost money. DocBook or TEI will only ever cost you more and more money as formats come, go and change.
Getting content tagged into XML for a format or presentation environment does not mean you have a digital content strategy.
You must make your digitization services contractor (if you are using one) show how the XML can be used NOW or in the FUTURE, for multiple format generation.