Faster and more effective indexing of Docbook XML books using XTUL

The problem with XML Docbook indexing

So, we’ve all thought the same thing: here you are scrolling through a Docbook XML book and looking for juicy tidbits to add to the index. Since there are no page breaks, you have to embed the index entries at each location in the source text. You’ve already figured out how to view the book in a more readable format, instead of the XML source code markup, in oXygen or XMLmind.

Wait, didn’t I already have a similar term used in another chapter? Scroll … scroll … scroll …

Now, where was that? Oh, forget it. I’ll just have to edit the final output created when the index is generated. Ugh. Why can’t I just use my regular indexing software that I’m used to for this format?

Well, you can, and you should. You know that it will provide a better index, less by-hand editing later, not to mention a whole bunch of peace of mind.

Enter XTUL.

XTUL Software

What does the application do?

Simply put, it allows you to index Docbook XML and HTMLBook the way you want to, using indexing software, like SKY Index Professional.  The application prepares your XML files with viewable markers that you can use as “page numbers”.  Of course, they have nothing to do with where pages will break, but allow you to specify where an entry should go, even if it is for a range of text.

Once you complete your index, editing as you desire, you proceed to a couple of simple steps using XTUL once again to embed all of the entries at once.  It only takes a couple of seconds.  All index entry tags are generated in the XML source code markup as valid XML.  There is no chance for you to make a mistake in adding index terms and no code to write manually!
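To make the idea concrete, here is a minimal Python sketch of what an embedding step like this might look like. It is not XTUL’s actual code or file format: the `[[n]]` marker syntax, the `embed_index_entries` function and the sample entries are all assumptions made up for illustration. Only the `indexterm`, `primary` and `secondary` elements come from Docbook itself.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical marker syntax: "[[12]]" stands in for a viewable "page number".
MARKER = re.compile(r"\[\[(\d+)\]\]")

def indexterm(primary, secondary=None):
    """Build one Docbook <indexterm> element as a string."""
    parts = ["<indexterm><primary>", escape(primary), "</primary>"]
    if secondary:
        parts += ["<secondary>", escape(secondary), "</secondary>"]
    parts.append("</indexterm>")
    return "".join(parts)

def embed_index_entries(xml_text, entries):
    """Replace each marker with the index terms assigned to its number.

    entries maps a marker number to a list of (primary, secondary) pairs.
    Markers that have no entries are simply removed.
    """
    def replace(match):
        number = int(match.group(1))
        return "".join(indexterm(p, s) for p, s in entries.get(number, []))
    return MARKER.sub(replace, xml_text)

source = "<para>[[1]]Labradors respond well to rewards.[[2]]</para>"
entries = {1: [("Labradors", "training"), ("rewards", None)]}
result = embed_index_entries(source, entries)
print(result)
```

Entries that cover a range of text would also need Docbook’s startofrange and endofrange markup, which this sketch leaves out.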

Does it automatically index XML Docbook?

No, it simply enables you to add and edit all of your index entries at once and outside of any XML Editor like oXygen or XMLmind.

Has it been tested in production?

BIM has been using XTUL in-house for over two years on all XML Docbook and HTMLBook indexes.  We even have a customized version that can enable us to index AsciiDoc in the same manner.

indexterms inserted automatically

BIM has found XTUL invaluable for every Docbook XML and HTMLBook index we create.  Even when chapters arrive little by little or in batches, XTUL lets us edit the entries for new chapters together with those from previous chapters, which makes for much cleaner output.

Any professional indexer will find this tool necessary if they intend to work on a Docbook index for more than about 15 minutes. :-D

XTUL was created by Solid Knot and is available for download online.

Why Should Humans Index Books?

Computers are so much faster than people at processing information. They can list instances of terms that appear in a book in a flash, and have the capability to perform full-text searches on electronic documents. So why on earth would anyone pay a human to index a book?

Language is very complex, so indexes prepared by computer or text searches are seriously lacking when compared to those written by a professional human indexer. The rest of this article points out some of the differences you’ll see between automatic indexes or searches and human-prepared ones.

Automatic indexes can’t distinguish between homographs (words that are spelled the same way but have different meanings). An example given by the Society of Indexers is searching on the Internet for the pop stars Madonna and Prince. You’d find millions of unwanted references to religious art and royalty.

Also, text searches will not handle synonyms (words that have the same meaning) properly. For instance, a recent index we worked on used the words “bidding” and “pricing” interchangeably. An automatic search would have listings for each with only the pages where that specific word was used. We humans were able to see that these words were used in discussions about determining pricing for clients, so we listed them all under “pricing” but also made an entry for “bidding” with a “See pricing” reference so that someone looking up “bidding” could find all of the relevant information.

Computers also, unlike humans, can’t pick up concepts contained in the writing where your specific search term isn’t used. So you might look up “dog grooming” but miss a discussion called “Clipping Your Pet’s Nails.” They aren’t able to read the content of graphics, either, so you’d miss pictures that demonstrate brushing a dog’s hair but don’t have relevant captions.

Finally, the main problem with a search box or computer-generated index: they can’t tell whether a hit for a certain term gives you relevant information or just a passing mention. So they really give you concordances, lists of pages where a word appears in the text. There’s a big difference between a concordance and an index. Most of the terms we search for are not rare; they’re used very frequently. So they’d appear lots and lots of times in a concordance. To illustrate this, you might be interested in learning more about using Adobe Photoshop to edit images. So let’s say you look up “Photoshop” in a book on photography. You might find 100 places where that word was used, but many would lead you to sentences where the author says something like, “I like using Adobe Photoshop.” That’s all well and good, but is it really useful? Another example is in this article – in the third paragraph, I mention everyone’s favorite 80’s pop star, Madonna. But that paragraph isn’t really relevant to someone who wants to know more information about her, say, her first Top 40 single or her favorite brand of corset.
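The difference is easy to see in a few lines of Python. This is only a sketch: the `build_concordance` function and the three sample “pages” are invented for illustration.

```python
import re
from collections import defaultdict

def build_concordance(pages):
    """Map each lowercase word to the sorted list of pages it appears on."""
    concordance = defaultdict(set)
    for page_number, text in pages.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            concordance[word].add(page_number)
    return {word: sorted(nums) for word, nums in concordance.items()}

# Invented sample pages: only page 41 actually teaches anything.
pages = {
    3: "I like using Adobe Photoshop.",
    41: "To crop an image in Photoshop, choose the Crop tool and drag.",
    58: "Photoshop was open on my laptop that morning.",
}
concordance = build_concordance(pages)
print(concordance["photoshop"])
```

The code lists every page where the word appears and has no way of knowing that two of the three are passing mentions. Making that call is the indexer’s job.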

The point of an index is to help readers find the word they’re looking for and be taken to a useful discussion of it. So while a computer or search box can find that specific word really well, a human indexer can think from the reader’s point of view – what concept will they be searching for and what words would they use to find it? Writers put so much thought into carefully wording their books because they want to communicate the knowledge and experience they’ve worked so hard to obtain. The indexes for those books should be just as thoughtfully prepared so readers can get to that important information.

-by Danielle Easler, Indexer at BIM

photo credit: mrsdkrebs

How We Created an Index for Over 10,000 Pages of Content

At BIM we are used to indexing lots of pages every month. In fact, we usually index over 100,000 pages of content each year. However, when we received a phone call asking us to create an index for a single project that was over 10,000 pages, we knew that we were in for quite a ride!

You might wonder how an indexing team goes about writing an index for so much information. The following is a brief summary of the basic process that we used.

Working off an Index Shell

Usually when we are asked to write an index, we read through the information and decide which concepts to include in the index and how to word the index entries. We also decide whether some index entries should contain subentries and if so what they should be.

However, for this project we were given an index “shell” which contained most of the terms and concepts that we were to find within the text. Although it might seem that this would make it easier to index the information, it actually made the project more challenging. Instead of reading the paragraphs of information and forming phrases that communicate the idea expressed by the information, we had to work in reverse; reading the existing index entries and figuring out which information they applied to.

Training the Team

Because of how different this project was from conventional indexing, our indexing team had to be trained on how to index in this unusual way. To do this, we gathered some of the indexers together at Mojo Coworking in downtown Asheville, utilizing their conference room and giant monitor to instruct the indexers. Besides providing guidance on the indexing, we did some of the work together so that we could pose and answer questions as they came up. These initial meetings proved invaluable as they helped the team to think as one.

Ongoing Guidance

Despite our initial meetings, our team of seven indexers needed constant guidance throughout the project because we kept running across new terms and phrases that we hadn’t encountered in previous pages of content. Also, indexers would often suggest additions to the terms in the index or propose indexing certain concepts under existing index entries. On top of that, the client would add new terms in the middle of the project!

To provide ongoing guidance and to ensure consistency in term usage, one of our staff, Christa, was assigned the job of Project Coordinator. Christa wrote up a total of eight pages worth of indexing guidelines, which she amended quite frequently.

Managing the Files

It goes without saying that managing over 10,000 pages of text is no easy task. What made file management even more challenging was the fact that the files given to us were of high resolution and there were only one to two pages per file! That meant somewhere around 8000 heavy downloads. After downloading them to a central location, we then had to make the files available to our indexers, most of whom worked out of their homes.

We didn’t want our indexers to download so many individual files nor have to work on one or two pages and then open more files so frequently in order to continue work. To avoid all of that, we decided to compile files into batches of 25 to 100 pages each.
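The batching itself is simple enough to sketch in Python. This is an illustration rather than the process we actually used: the `make_batches` function, the file names and the page counts are all hypothetical.

```python
def make_batches(files, max_pages=100):
    """Group (filename, page_count) pairs, in order, into batches.

    A batch is closed as soon as adding the next file would push it
    past max_pages.
    """
    batches, current, current_pages = [], [], 0
    for name, page_count in files:
        if current and current_pages + page_count > max_pages:
            batches.append(current)
            current, current_pages = [], 0
        current.append(name)
        current_pages += page_count
    if current:
        batches.append(current)
    return batches

# Hypothetical input: 200 scans of one or two pages each.
files = [(f"scan_{i:04d}.pdf", 1 + i % 2) for i in range(200)]
batches = make_batches(files, max_pages=100)
for number, batch in enumerate(batches, start=1):
    print(number, batch[0], "to", batch[-1])
```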

We then opened up an account with Box. Box account owners can install a folder, called My Box Files, on their desktops and choose to sync the folder with a corresponding folder on the Box site.

Our File Manager, Ruti, was in charge of downloading files and compiling them into batches. She created Box folders labeled with each of our indexers’ names and installed them onto her desktop. After creating the batches, she dragged and dropped them into the indexers’ Box folders, all the while tracking which files were assigned to each indexer.

Once the indexers synced their Box folders, they didn’t have to actually go to the site to download the files. The files simply appeared in the indexers’ folders on their individual computers. Sometimes the indexers would be busy working on some files, and by the time they were finished there were more files in their folders. Other times, if Ruti put the files in their folders at night, they would wake up in the morning and find the files on their computers, waiting for them. This process greatly streamlined the workflow.

Merging the Indexes

Once all of the indexers were finished with their pages, the individual indexes had to be merged. Using a special import tool, we converted our Word index documents into a format that could be imported into SKY indexing software. The indexing software took care of sorting the various reference locators in numerical order.
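For the curious, here is roughly what a merge like that involves, sketched in Python. It is not how the indexing software or our import tool works internally: the `merge_indexes` function and the sample entries are assumptions for illustration.

```python
from collections import defaultdict

def merge_indexes(*indexes):
    """Merge several {heading: [page, ...]} dicts into one.

    Pages are de-duplicated and sorted numerically, so 9 comes before 10
    (a plain text sort would put "10" before "9").
    """
    merged = defaultdict(set)
    for index in indexes:
        for heading, page_list in index.items():
            merged[heading].update(int(p) for p in page_list)
    # Headings are alphabetized without regard to capitalization.
    return {h: sorted(merged[h]) for h in sorted(merged, key=str.lower)}

# Invented entries from two indexers, with locators stored as text.
indexer_a = {"pricing": ["9", "120"], "Contracts": ["15"]}
indexer_b = {"pricing": ["10", "9"], "bidding": ["1002"]}
merged = merge_indexes(indexer_a, indexer_b)
for heading, locators in merged.items():
    print(heading, locators)
```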

After checking for errors and fixing them, the index was ready for delivery. The entire project was completed in just under four months, while we simultaneously handled indexing for our regular clients. Our hard-working indexers and well-planned strategy made it all possible.

Photo credit: Horia Varlan

Selling Non-Fiction eBooks: How to Compete

Now that it is easier than ever to create your own non-fiction ebook and sell it online, there is also more competition than ever. Even if you write a book about a niche topic like training Labradors, as opposed to a book about training dogs in general (idea found here), you still might find a good amount of competition out there. Do a search on Google and you’ll see what I mean.

So how can you compete?

One way is to tap markets that are not being pursued. Using the Labrador training book as an example, you could join forums like Just Labradors. I haven’t investigated their particular posting rules since this is just an example, but most forums and blogs allow links in participants’ signature lines. So after leaving a thoughtful post regarding how to train labradors, you would sign it with something like this:

John (or Jane) Doe, Author of “Training Labradors the Easy Way” (with a link to your Amazon page or another site where you are selling your book).

Of course the key is that your comment must be one that adds value to the site you are visiting. If you just try to advertise your book, that’s not going to work and it might even get you banned from using the site.

This method does not generate oodles of traffic, granted, but at least it is targeted traffic. Those who click on your link are practically guaranteed to be labrador-owners who are interested in training their dogs.

Print it!

Another way is to turn your ebook into a print book. Yes, I know that sounds like something you don’t want to get your hands into, but there are POD (print on demand) printers out there who only charge for each book that you print. Take a small box of those printed books and go to veterinary offices and ask if you can leave them a complimentary copy for their waiting room. Include sheets of paper with your website on it where readers can order their own copy for home.

Next, try attending a meetup like the Long Island Golden Retriever and Labrador Meetup, or a similar group near you. Don’t just go to sell your book, though. That’s tacky! Instead, bring your own Lab (I expect you have one, otherwise it would be strange that you wrote a book about how to train them), associate with other Labrador owners, and get to know the people and the dogs. When should you bring up your book? You’ll know. Some dog-owners with rather disobedient dogs will express their frustration, at which point you could say “You know…” :-)

Now all of that might seem like a lot of work, and it is. But it also gives you more writing material for your blog. Try a post like “My Experience at the Long Island Golden Retriever and Labrador Meetup”. (Please tell me you do have a blog. If not, start one!) And it helps you to learn about other concerns and needs labrador owners have. It also expands your network of Lab-owners which could lead to book sales further down the line as you continue networking.

It’s not just about the book

Your first tendency might be to concentrate just on how many dollars of profit you make off each book. But there are other ways to create income off of your book besides the sales of the book itself. For example, if you are so into Labs that you wrote a book about how to train them, I bet you could give some classes on the topic, too. Or maybe you have the space and yard layout to board labradors while people are away. Why would a Lab-owner choose a generic dog-sitter when you have expertise handling the specific type of dog that they own? The idea is to find numerous alternative ways to create income off of your non-fiction book.

Obviously, these aren’t things that the big publishing houses are going to do. And they’re not things that most authors will do, either. So get out there! Meet people on the Web and in person who share your interests! And have fun selling your book and related services!

Photo credit: David Sifry

ePub vs. MOBI vs. PDF: Which format should you use for your eBook?

If you’re an author, I’m sure you’ve considered self-publishing your book as an eBook. You’ve probably read some pretty inspiring success stories about authors who have sold a substantial number of books on the Web. And even those who haven’t sold very many at least were able to get them online and out in front of the public. That might not have happened if they were still trying to get their books accepted by acquisitions editors at traditional publishing houses.

But deciding to self-publish an eBook seems to lead to a lot of questions, many of which we hope to answer in future posts on this blog. One of the first things you’ll have to decide is which format you want to use to publish your eBook. The most-common formats are ePub, MOBI (used on Kindle) and PDF.

Pros and Cons of PDF

The format you’re probably most familiar with is PDF (Portable Document Format) since that’s the file format used with Adobe Reader, which is installed on most computers. So it might seem that it would be your best bet, since almost anyone with a computer can open a PDF file. But think again…

Try opening a PDF eBook on a phone and you’ll see the problem. PDF files contain static text, so if there are 400 words on a page when you create it, there will be 400 words on a small iPhone screen. Obviously the text will be too small to read, so you’ll have to enlarge the page. And then you’ll have to move it around with your finger in order to read everything on the page. If you own an iPhone, try downloading our Tips for Authors Creating an Index which is available as a PDF. Or borrow a friend’s iPhone and try it out. You’ll quickly see what I mean.

Benefits of ePub

On the other hand, text in ePub books is not static. Instead, it flows. ePub eBooks display as much text as will fit on the screen, depending on the text size the user has chosen. So all the user has to do is read and flip pages. Very nice. MOBI does the same for eBooks on the Kindle.


As for the differences between ePub and MOBI, they are very different, technically speaking, but not so different for the reader. ePub tends to format books in a way that looks closer to what the author initially sets up than MOBI does, especially with spacing. And MOBI files of the same books tend to be quite a bit larger, sometimes double in size. However, if you want to sell your book on Amazon, then it needs to be in MOBI format.

Deciding on a format

Still unsure about which format you should use? If you want to reach readers who will read your eBook on their computer, then your best bet is PDF. If you want to reach those who will read on their iPhone or Android phone then you’re better off going with ePub. And if you want to sell to Kindle users, then MOBI is the format to use.

But wouldn’t you rather reach them all? So why not make your eBook available in each of the three most popular formats? Considering how cheap it is to produce eBooks, it makes sense to publish your eBook in a way that makes it accessible to anyone who wishes to read it.

How do you go about publishing in these formats? We’ll answer that in an upcoming post…

Should all nouns in a book appear in a book’s index?

There are some publishers or authors that would immediately say “Yes!” to this question. They feel that any person, place, thing or idea that is mentioned in a book belongs in the index.

If you are an author, editor or publisher, here are a few examples of paragraphs from a few books that will help you see that including each and every noun in a book’s index is a mistake that will compromise the usability of the book’s index.

Example #1, from Steven Pressfield’s The War of Art, page 74:

“The writer is an infantryman. He knows that progress is measured in yards of dirt extracted from the enemy one day, one hour, one minute at a time and paid for in blood. The artist wears combat boots. He looks in the mirror and sees G.I. Joe.”

Here are the nouns in that paragraph: writer, infantryman, progress, yards, dirt, enemy, day, hour, minute, time, blood, artist, combat boots, mirror, G.I. Joe.

Do you see the problem with including every noun and noun phrase in the book’s index? Not only would it bloat the index (in fact you would actually have a concordance, which is not quite the same as an index), but if you were reading the book and looked up the term “yards” in the index and were led to this paragraph, would it contain what you expected? In fact in a book about being an artist, you would probably not look up that term at all. So why put it in the index? How about “dirt”, “day”, “hour”, “enemy”? Get the point? Unless the index directs the reader to truly relevant information about a subject, it will just disappoint the reader. A few entries like that and the reader will lose trust in the index and not bother using it at all. (By the way, The War of Art doesn’t contain an index, so we wrote one and posted it online here.)

But how about technical books? Wouldn’t every noun in such books be concrete, important items that need indexing? Well, take a look at the following example.

Example #2, from William Horton’s Designing Web-Based Training, page 311:

“…notice that the first and fourth questions can be answered by looking at the picture. The second compares the mineral to the one other mineral whose hardness most people are familiar with. Most people could answer the third question by recalling that June is the month when many people get married and diamonds are used for engagement rings.”

Since the book is about how to design online training, the words “mineral”, “June”, “month”, “diamonds” and “engagement rings” are not pertinent to the theme of the book. They are just here as examples.

So next time you get ready to index a book, or to evaluate an index prepared by a professional indexer, remember that indexing every noun is not the way to create a quality index. Instead, evaluate whether each noun and noun phrase relates to the theme of the book and whether there is pertinent information about the topic on the page that you are referencing.

To embed or not embed an index in Word, InDesign or Quark Xpress

You want to get the index to the book that you are working on done super-fast, as in a few days. You also want to be able to reuse the index if you later publish another edition of the book. And since you plan on producing both a print and an eBook version of the book, you want the index to work in both.

If you are trying to achieve any or all of these objectives, then by now you have probably already heard about embedded indexing. Here’s the skinny on what it is and a few words on the advantages and disadvantages that you’ll want to consider.

What is embedded indexing?

Embedded indexing is the process of inserting hidden tags, which contain index entries, into the text of a book. This can be done in most publishing software such as InDesign, Quark Xpress, FrameMaker and even in Microsoft Word. The tags can be viewed by clicking on an option within the application, but obviously would not be viewable in the final output files, such as PDFs.

Advantages to embedded indexing

There are several advantages of embedding index entries into the text.

One benefit is that it allows indexing to start before the book has even been finished. You can send the chapters to the indexer one at a time and she can embed the index entries and send the chapters back to you. Even if you edit the chapter, deleting some sentences and moving some paragraphs around, the index tags get deleted with the text that you deleted and move with the text that you moved. Also, since the index entries are embedded in the text, it doesn’t matter that you haven’t formatted the book yet, or inserted all of the images, etc. When the book has been copyedited and proofread and is ready to go, you generate the index, and the page number after each index entry will reference whichever page the corresponding hidden tag falls on. The indexer will need to do an edit of the index, but that should only take a few days in comparison to the few weeks she would need if she started indexing toward the end of the production cycle.

The other benefit is the point we mentioned at the beginning of this post: you can reuse the index if you wish to also produce the book as an eBook or in some other format. Since ePub books don’t have page numbers per se, the index would also not have page numbers, but it would link directly to the text, taking the reader to the exact paragraph referenced by the index entry.

Disadvantages of embedded indexing

Despite the benefits, there are downsides to embedding indexes in publishing software. One of them is that it is harder for the indexer to create a good-quality index. Since she is receiving the chapters one (or a few) at a time (and turning them in that way too), she’s sort of indexing with tunnel vision. She can view the index for each chapter that she is working on by generating it for that specific file, but she cannot see the index entries for chapters that she has already indexed to see how each entry should be adjusted to blend with the others. If she were working off of final PDFs, she could compare an index entry that she is writing to one that she created for a previous chapter, evaluate whether they relate to the same thing and then edit the old or the new entry so that they don’t conflict.

As a simple example, perhaps chapter 1 discusses “cooks” while chapter 7 uses the term “chefs”. Are they the same? Perhaps they are used to refer to the same profession in this book or maybe there is a distinction. If they are the same, the reader should not find two separate entries (one for “cooks” and the other for “chefs”) with completely different indented subentries under each of them. Instead, the most commonly used term (we’ll say it’s “chefs”) should have all of the corresponding subentries and the other term (“cooks”) should have a cross-reference reading “See chefs.” If they are different, then both entries should have their own, distinct subentries. However, each entry should have a “See also” reference pointing to the other entry, since they are so closely related.

You can see how hard it would be to determine how to treat related entries in an embedded index. Obviously, there are much more complicated relationships and many synonymous or seemingly-synonymous terms throughout the text of most books, especially in highly technical works.
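The mechanics of that cleanup, at least, are easy to show. Here is a small Python sketch of the “cooks” and “chefs” case where the two terms mean the same thing. The `consolidate` function, the data layout and the page numbers are all hypothetical; deciding whether two terms really are synonyms is still up to the indexer.

```python
def consolidate(index, preferred, synonym):
    """Fold a synonym's subentries into the preferred heading.

    index maps heading -> {subentry: [pages]}. The synonym keeps only a
    "See" cross-reference so readers who look it up are redirected.
    """
    merged = {sub: list(pages) for sub, pages in index.get(preferred, {}).items()}
    for sub, pages in index.get(synonym, {}).items():
        merged[sub] = sorted(set(merged.get(sub, []) + pages))
    result = {h: subs for h, subs in index.items() if h not in (preferred, synonym)}
    result[preferred] = merged
    result[synonym] = {"See " + preferred: []}
    return result

# Invented entries: chapter 1 said "cooks", chapter 7 said "chefs".
index = {
    "cooks": {"training": [4]},
    "chefs": {"training": [112], "salaries": [118]},
}
consolidated = consolidate(index, preferred="chefs", synonym="cooks")
print(consolidated)
```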

Editing the index, which needs to be done when the book is almost finished, is also much more difficult when the index has been embedded. Instead of being able to edit as she goes along, comparing one entry with another, the indexer must clean up what by now has surely gotten rather messy. The best quality indexes are usually produced when the indexer can make many small adjustments during the indexing process, not by doing major edits all at once.

That being said, there are methods of saving index entries from previous chapters and then comparing them with new index entries upon creating them. Such an indexing system can help the indexer to avoid scattering information within the index and thus create a neater, more-organized index that needs less editing. But not all indexers know how to do this. (More on that in another post).

Making the decision

So what did you decide? To embed or not to embed? If you are still undecided or have more questions, click on the button below and I’ll gladly answer your questions (at no charge) about embedded indexing.