The XML Trenches posts are case studies of real-world publisher XML events. The idea is to illustrate that digital content strategies need the right tools and approach, and to help publishers make the right decisions.
We just received a DocBook XML file from a publisher who asked, "Can this be used to make a Mobipocket for Amazon?" The DocBook XML had already been used by the producer to make an ePub. We cracked it open to have a look.
It was a valid and well-formed DocBook file with a custom namespace, but there were significant tagging and structural errors that make it unusable without spending time, and time is money. I have posted on this before in Sustainable XML Strategies.
This is the issue with every DocBook-produced book XML file I have ever seen. They cannot be used to produce more outputs without significant analysis, time and more work.
I am pushing the XHTML vs. XML wheelbarrow hard and fast here as I have just been involved in a lengthy discussion on the value of XML vs. XHTML on the "eBooks/ePub Technologies" group on LinkedIn. If you are a publisher about to put your feet into some form of digital content strategy, you need to know what you are getting.
This is a slightly technical discussion, but the purpose is to show some typical problems that arise with XML if it is not carried out correctly.
The Analysis
1. The input XML files are split. These need to be merged before importing on the basis of an <xi:include> in the frontmatter section. This in itself is not an issue (it's valid XML), but why a small novel would be treated this way is an interesting exercise in making an easy job hard, and archiving more expensive.
2. Frontmatter/Backmatter sections are tagged as chapters without any "role". When imported, these obviously appear as Body Chapters. This is absolutely, utterly, completely, stupidly wrong DocBook tagging. If you are only making an ePub you can get away with it. If the publisher doesn't have the technical resource to check the XML then any junk will do.
3. The Dedication section was tagged inside Part. In the DTD dedication is a frontmatter section. If there was a dedication in a part it would have had to be tagged as a <chapter role="dedication"> (don't know why anyone would do this), otherwise in frontmatter it can just be <dedication>. This implies more such careless inanities in the corpus.
4. The input XML files are formatted for pretty printing. The paragraphs have huge numbers of hardcoded empty spaces in them which have to be explicitly stripped. This is very bad for an archive production file and needs more processing.
5. The XML files use a namespace "xxx". Any custom elements need to be handled separately. This means there is a custom DocBook DTD floating around somewhere for something, and it will probably turn up in some other book, but we can't tell the impact from the supplied sample.
6. The input package contained images in eps format and the referenced image in XML is eps. While this is not wrong, it does imply a post-xml manual method for image processing for the eBook formats.
7. Numerous other formatting errors that were probably manually fixed during ePub creation.
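Several of the defects listed above can be detected mechanically before any import is attempted. The following is a minimal Python sketch of such a pre-flight check, not the processing we actually ran. It assumes standard DocBook element names (chapter, part, dedication, para, imagedata), ignores namespaces by comparing local names only, and uses a made-up list of titles (FRONT_BACK_TITLES) to guess which role-less chapters are really frontmatter or backmatter. The sample document is invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical titles that suggest a "chapter" is really front or back matter.
FRONT_BACK_TITLES = {"dedication", "acknowledgements", "copyright", "about the author"}


def local(tag):
    """Strip any '{namespace}' prefix so checks work with or without one."""
    return tag.rsplit("}", 1)[-1]


def audit(root):
    """Return a list of findings for a parsed DocBook tree, in document order."""
    findings = []
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for el in root.iter():
        name = local(el.tag)
        if name == "include":
            # Item 1: split files that still need merging.
            findings.append(f"unmerged include: {el.get('href')}")
        elif name == "chapter" and el.get("role") is None:
            # Item 2: front/back matter tagged as a plain chapter.
            title = next((c.text or "" for c in el if local(c.tag) == "title"), "")
            if title.strip().lower() in FRONT_BACK_TITLES:
                findings.append(f"chapter without role looks like front/back matter: {title.strip()}")
        elif name == "dedication" and el in parent_of and local(parent_of[el].tag) == "part":
            # Item 3: dedication placed inside a part.
            findings.append("dedication tagged inside part")
        elif name == "para" and re.search(r"\s{2,}", "".join(el.itertext())):
            # Item 4: runs of whitespace left over from pretty printing.
            findings.append("para contains pretty-printing whitespace")
        elif name == "imagedata" and (el.get("fileref") or "").lower().endswith(".eps"):
            # Item 6: EPS references that need conversion for eBook formats.
            findings.append(f"EPS image reference: {el.get('fileref')}")
    return findings


# Invented sample that reproduces the defects described in the list.
SAMPLE = """<book xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="ch01.xml"/>
  <part>
    <dedication><para>For M.</para></dedication>
    <chapter><title>Acknowledgements</title>
      <para>Thanks   to
          everyone.</para>
    </chapter>
    <chapter><title>One</title>
      <para>It began.</para>
      <mediaobject><imageobject><imagedata fileref="images/map.eps"/></imageobject></mediaobject>
    </chapter>
  </part>
</book>"""

for finding in audit(ET.fromstring(SAMPLE)):
    print(finding)
```

On the sample this reports five findings, one each for items 1, 2, 3, 4 and 6. A report like this only says where human attention is needed; it does not fix the tagging, and it cannot see the custom-namespace problem in item 5 without knowing the custom DTD.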
While the number of defects is not massive, each one is a standard processing show stopper and requires human intervention and processor adjustment. Ka-Ching! The cash register rings. XML has to be usable and must reduce future costs and provide new business revenue strategies.
Whoever created this was not particularly concerned about the XML value as they were just delivering an ePub. They probably have some manual or semi-automated method to get from the DocBook garbage to XHTML for ePub and once there the original XML structure doesn't matter anymore because it is just a stack of XHTML files with NCX links.
This DocBook tagging is from the XML slum, where the roof may keep the falling rain out, but the drainage doesn't work.
We XSL processed the files and imported them into IGP:FLIP, where they were transformed into FoundationXHTML. We were able to instantly create the Mobipocket output using Formats on Demand. We then manually cleaned up the rest of the nonsense stuff for the final output.
We are charging around $25 per book (the profit on 5-10 Kindle sales) for the clean-up because there is a lot of manual intervention required on each book. OK, that's not much, but it adds up if you have hundreds of books and this has to be done every time a new format hits the market. If the DocBook had been done properly in the first instance it could have been $5 per book (1-2 Kindle sales). If it had been produced using IGP:FLIP, the additional formats would have cost nothing.
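To see how quickly that adds up, here is a small Python calculation. The per-book prices come from the paragraph above; the backlist size and the number of new formats are hypothetical figures chosen only for illustration.

```python
# Per-book clean-up prices in USD, as quoted above.
COST_BAD_DOCBOOK = 25
COST_GOOD_DOCBOOK = 5


def backlist_cost(books, new_formats, per_book):
    """Total clean-up cost when every book is reprocessed for every new format."""
    return books * new_formats * per_book


# Hypothetical publisher: 300 books, 3 new formats over the life of the archive.
books, new_formats = 300, 3
print(backlist_cost(books, new_formats, COST_BAD_DOCBOOK))   # 22500
print(backlist_cost(books, new_formats, COST_GOOD_DOCBOOK))  # 4500
```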
We will probably have to add a tissue allowance as we weep with anguish while processing these poor sad books.
Summary
DocBook can be used as the XML source. But it has to be created correctly, and there is no way to tell from format-generated outputs that the tagging is correct and valuable. Well-formed and valid XML is absolutely no warranty on the value of the XML. The tagging patterns have to be correct.
So-called XML-first strategies are opaque, usually incomplete, and always cost money. DocBook or TEI will only ever cost you more and more money as formats come and go and change.
Getting content tagged into XML for a format or presentation environment does not mean you have a digital content strategy.
You must make your digitization services contractor (if you are using one) show how the XML can be used NOW or in the FUTURE, for multiple format generation.
That's a good point which I didn't mention. The publisher in question had been given that as the go to market strategy. The result was way below their presentation expectations and requirements.
Kindlegen may work on the simplest of novels, but as soon as there is any significant styling it collapses. The publishers we work with are trying to create the best possible e-books that capture at least the heart of the print design to create a better end-user experience.
Kindlegen's main problem is that it cannot reasonably interpret every stylesheet thrown at it, so it will ignore anything that isn't a direct CSS selector applied to an element, and it takes over a lot of other presentation styles. We handle this with significant processing in our Formats on Demand application. For example, something like ".galley .extract { * }" would become ".galley-extract { * }", and the stylesheet is manipulated accordingly. We can do that because 1) we understand the problem statement and 2) we control the XHTML/HTML and CSS so it will deliver the goods. We use Kindlegen for the final packaging, but by providing it a package, not an ePub.
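The selector rewrite described above can be sketched in a few lines of Python. This is an illustration of the idea only, not the Formats on Demand implementation: the function names are made up, it handles only flat "selector { declarations }" rules (no @media blocks or comments), and it rewrites only chains made up entirely of class selectors. The matching class attributes in the XHTML would have to be rewritten in the same pass, which is not shown here.

```python
import re


def flatten_selector(selector):
    """Collapse a descendant chain of class selectors into a single class.

    '.galley .extract' becomes '.galley-extract'; anything else is unchanged.
    """
    parts = selector.split()
    if len(parts) > 1 and all(re.fullmatch(r"\.[A-Za-z_][\w-]*", p) for p in parts):
        return "." + "-".join(p[1:] for p in parts)
    return selector


def flatten_stylesheet(css):
    """Apply flatten_selector to every selector of every flat CSS rule."""
    def rewrite(match):
        # Keep whitespace before the rule so the layout of the file survives.
        lead = re.match(r"\s*", match.group(1)).group(0)
        selectors = [flatten_selector(s.strip()) for s in match.group(1).split(",")]
        return lead + ", ".join(selectors) + " {" + match.group(2) + "}"
    return re.sub(r"([^{}]+)\{([^{}]*)\}", rewrite, css)


print(flatten_stylesheet(".galley .extract { margin: 1em }"))
# .galley-extract { margin: 1em }
```

Selectors that mix element names and classes, such as "div .extract", are deliberately left alone in this sketch, because flattening them needs knowledge of the markup.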
In a production environment that is highly controlled, and that understands Mobipocket's CSS weak points, Kindlegen can be used to make outstanding ePubs. That hardly ever happens. For example, good results are more a matter of luck when the ePub is generated by a styles noise machine like InDesign.
Other issues are that Mobi does not handle styles applied to IDs, nor a lot of other styles that are standard for ePub. Inside it is basically a subset of HTML 3.2 and CSS 1. Well-presented table styling in Mobi is particularly tedious.
Kindlegen sort of works for the simplest novels tagged as very basic HTML with no significant styling or positioning, and/or when any e-book is better than no e-book. We work with dozens of publishers who have tried ePub2Kindlegen outputs, and only a very few work with it.
The problem statement of this post was not just the ability to get to a Mobi/Kindle, but that the XML value that was sold was not present. The XML did not create a digital content strategy and new business direction. E-formats are a pretty small revenue stream for publishers at present, so each currency unit spent should deliver value.
Next, what happens with tomorrow's inevitable new formats and delivery/fulfilment formats? The XML must be ready and not a new cost center.
Posted by: Richard Pipe | Sunday, July 25, 2010 at 06:21 PM