Indexing and Keyword Tagging of Websites and Intranets

Let me start you off on the right foot, so that you don’t waste time reading this if you don’t have to:

If you do not have a lot of information on your intranet or website, or if the information that you have is not really that important (for example, not many people use it, and it doesn’t really matter if they spend a lot of time looking for information), then we suggest that you use a free search engine such as Google Custom Search, JRank or PicoSearch.

On the other hand, if you manage intranet content for a medical institution, financial services company, law firm or other company with information that is of great importance, then we suggest that you take a few moments and read the following articles, written by Kevin Broccoli (CEO at BIM):

  1.  Indexes: An Old Tool for a New Medium
  2.  Intranet Design Magazine also published the following article, reproduced below:

As company intranets grow in content, it becomes increasingly difficult to find the exact information that an intranet user may be looking for. Companies have traditionally used search engines to locate information on their intranets. However, many are finding that search engines (even the newer, so-called “intelligent” ones) are just not enough.

For example, perhaps you are looking up information on a particular subject. You type the word into the search engine interface and click “GO!” Within seconds you have a list of retrieved documents. But there are 87 of them! And there is little indication as to which document might be the one that you need. You can either click on a few of the entries in the hope that the needed information will be within the first few documents, or spend literally hours combing through each one of them.

Why is it so difficult to find what you need?

Intelligent – Not!

One reason is the manner in which search engines operate. Generally, a search engine looks for every occurrence of the word that was typed into the search interface and then lists each and every document containing that word. However, some of those documents may only mention the topic in passing, with no information of real value.

Also, you may be searching for more specific information regarding the topic, but are not sure how to narrow your search. Or perhaps the documents use certain words or phrases within the text, and although you are typing in synonymous terms, they are not the exact terms needed. Or it may be that a word is simply misspelled.

A search engine, like other computer automata, can’t allow for such errors.

For example, if you work in an insurance company, you may be looking for information regarding “theft.” Some of the documents use this precise word, so the search engine grabs those pages. But it does not retrieve any of the pages using the term “robbery” or “thievery.” You may not even understand why the search engine retrieved certain documents. In many instances, only the title of the document is listed, which doesn’t tell you much.
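The theft example can be reproduced with a minimal sketch of literal word matching. The page names and text below are invented for illustration, and real search engines do considerably more than this; the point is only to show why an exact-word lookup skips pages that use a synonym.

```python
import re

# Hypothetical intranet pages; names and text are invented for illustration.
DOCUMENTS = {
    "claims-guide": "How to file a theft claim after a break-in.",
    "policy-faq": "Coverage applies to robbery at the insured premises.",
    "newsletter": "Staff picnic moved to Friday.",
}


def literal_search(query, documents):
    """Return the names of documents whose text contains the exact query word."""
    word = query.lower()
    hits = []
    for name, text in documents.items():
        # Split the text into lowercase words and compare them one by one.
        tokens = re.findall(r"[a-z]+", text.lower())
        if word in tokens:
            hits.append(name)
    return sorted(hits)


# Only the page using the literal word is found; the "robbery" page is missed.
print(literal_search("theft", DOCUMENTS))
```

Searching for “theft” returns only the claims guide, even though the policy FAQ covers the same subject under a different word.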

One way of improving the relevance of search results is to look for keywords that can be inserted as “metadata” within the pages of the intranet. This is one of the promises of Extensible Markup Language (XML), which (among other things) lets authors tag pages precisely so users can more easily find them.
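As a rough illustration of keyword metadata, the sketch below reads a keywords tag from a page. It assumes the common HTML convention of a meta element named “keywords” with a comma-separated content attribute rather than an XML vocabulary; the sample page and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class KeywordMetaParser(HTMLParser):
    """Collects the terms listed in <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "keywords":
            content = attributes.get("content") or ""
            # Keywords are assumed to be comma-separated.
            for term in content.split(","):
                if term.strip():
                    self.keywords.append(term.strip().lower())


def extract_keywords(html_text):
    """Return the keyword metadata found in one page, in document order."""
    parser = KeywordMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.keywords


# Hypothetical page, invented for illustration.
PAGE = """<html><head>
<title>Claims handbook</title>
<meta name="keywords" content="theft, robbery, Claims">
</head><body><p>Handbook text.</p></body></html>"""

print(extract_keywords(PAGE))
```

A search tool could match queries against these extracted terms instead of the full page text, which is only as useful as the keywords that someone actually filled in.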

But metadata is no panacea. For one thing, the user may still be unsure how to narrow a search, resulting in an overabundance of irrelevant hits. Moreover, word processing tools like Microsoft Word have long given authors the ability to add metadata to documents. Yet how many times have you filled in those Summary Info fields? Any information retrieval scheme that relies on people to categorize their ideas will at best be limited, and at worst may interfere with the creation of intellectual capital.

Indexes and Outlines

What, then, is the solution to the above-mentioned problems? Simply put, an index with main headings and subcategories. Users are instantly aided in narrowing down their search by choosing from such available subcategories. The interface of an index is also quite familiar to everyone, having been employed in the back of reference and trade books practically since the beginning of print.

In addition to the above-mentioned articles, Brian O’Leary (in Book: A Futurist’s Manifesto) had this to say:

“Digital abundance is pushing publishers to create much more than title-level metadata. To manage abundance, publishers and their agents can (and do) use blunt instruments, like verticals, or somewhat more elegant tools, like search engines. But when it comes to discovery, access, and utility, nothing substitutes for authorial and editorial judgment, as evidenced in the structural and contextual tags applied to our content.

Context can’t be just a preference or an afterthought any more. Early and deep tagging is a search reality. In structural terms, our content fits search conventions, or it will not be referenced. And in contextual terms, our content needs to be deeply and consistently tagged, or it will face an increasingly tough time being found.”

Recently, BIM has created a system which combines a search engine interface with a human-created index. Stated simply, the search engine is set to search the index instead of every page of an intranet.

The following outlines the four basic steps that must be performed in order to have accurate information retrieval:

(1) The search engine must be set to search only keywords that are assigned to each Web page. This means that someone has to read through each page of the site, and then decide which words most accurately reflect the contents of that page. The text must be carefully examined for concepts that are implied, even if the word itself is not used within the text.

(2) A custom-designed thesaurus must be developed, based on the terminology used on the web site as well as within the specific industry or profession. The thesaurus lists not only terms that mean the same thing as the chosen term, but both broader and narrower terms for the word. An example of this (for a medical web site) would be the phrase “root canal.” A broader term for this phrase would be “dentistry.”

(3) The search engine is set to recognize synonymous terms. If a user types in a certain word, all pages are retrieved that have words meaning the same thing as the chosen term.

(4) Then the individual assigning the keywords uses the site thesaurus as a guide in assigning terms to the various pages. She or he assigns broader and narrower terms to the principal keywords as called for.
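The four steps above can be sketched roughly as follows. This is an illustrative sketch only, not BIM’s actual system: the thesaurus entries, page names, and function names are all invented, and a production thesaurus would also record narrower terms.

```python
# Step 2: a tiny hypothetical thesaurus. Each preferred term lists its
# synonyms and its broader terms (entries invented for illustration).
THESAURUS = {
    "root canal": {"synonyms": ["endodontic therapy"], "broader": ["dentistry"]},
    "dentistry": {"synonyms": ["dental care"], "broader": []},
}


def preferred_term(term):
    """Step 3: map a synonym to the preferred term used in the index."""
    term = term.strip().lower()
    for preferred, entry in THESAURUS.items():
        if term == preferred or term in entry["synonyms"]:
            return preferred
    return term


def assign_keywords(principal_terms):
    """Step 4: expand the indexer's chosen terms with their broader terms."""
    assigned = set()
    for term in principal_terms:
        preferred = preferred_term(term)
        assigned.add(preferred)
        assigned.update(THESAURUS.get(preferred, {}).get("broader", []))
    return assigned


# Step 1: the index holds only the keywords a person assigned to each page.
INDEX = {
    "aftercare.html": assign_keywords(["root canal"]),
    "checkups.html": assign_keywords(["dental care"]),
    "parking.html": assign_keywords(["parking"]),
}


def search(query):
    """Search the assigned keywords only, never the full page text."""
    wanted = preferred_term(query)
    return sorted(page for page, keywords in INDEX.items() if wanted in keywords)


print(search("endodontic therapy"))  # ['aftercare.html']
print(search("dentistry"))           # ['aftercare.html', 'checkups.html']
```

Because broader terms are assigned at indexing time, a search for “dentistry” also finds the root canal page without the search code needing to walk the thesaurus hierarchy.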

A fifth step can be used to create a super-precise information retrieval system. This step is the creation of sub-categories to go along with each keyword. Let’s illustrate this with our medical web site example:

Using the first four steps, we created a search system which allows visitors to find all of the information dealing with the human heart. However, if the web site is quite large, the search engine still might retrieve up to 20 or more pages that have to do with the heart. Although there is usually a title and paragraph summary showing what each page is about, they can often be very wordy and not easy to scan through. By creating sub-categories for the word “heart” (such as “heart disease,” “heart surgery,” “parts of the heart,” etc.) and displaying these sub-topics as search results, web site visitors can quickly and easily identify the topics that fit their needs. There is no need for Boolean operators or any special skills on the part of the user.
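A minimal sketch of this fifth step, assuming a hand-built index in which each keyword maps to sub-categories and each sub-category maps to pages. The sub-category names come from the example above; the page names and function names are invented.

```python
# Hypothetical index: keyword -> sub-category -> pages (page names invented).
SUBCATEGORY_INDEX = {
    "heart": {
        "heart disease": ["cardiomyopathy.html", "coronary-artery-disease.html"],
        "heart surgery": ["bypass.html"],
        "parts of the heart": ["valves.html", "ventricles.html"],
    },
}


def search_subcategories(query):
    """Return (sub-category, page count) pairs to display as search results."""
    entry = SUBCATEGORY_INDEX.get(query.strip().lower(), {})
    return [(name, len(pages)) for name, pages in sorted(entry.items())]


def pages_for(keyword, subcategory):
    """Return the pages listed under one sub-category of a keyword."""
    return SUBCATEGORY_INDEX.get(keyword.strip().lower(), {}).get(subcategory, [])


# The visitor sees a short list of sub-topics rather than every matching page.
for name, count in search_subcategories("Heart"):
    print(f"{name} ({count})")

print(pages_for("heart", "heart surgery"))
```

The visitor picks a sub-topic from the short list and only then sees the pages filed under it, so no query syntax is required.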

By employing the above methods, users can find all of the information that they are looking for without having to wade through hundreds (or even thousands) of irrelevant documents.

Who should assign the keywords?

There is always the temptation to use content writers to assign the keywords. The thought is that they are already writing the page, so why not have them stick some terms in the keyword tags of the HTML page so that it can be used with a search engine later on? This inevitably results in inconsistent keyword tagging, the omission of terms that users might search for, and (once again) user frustration.

BIM has experience tagging documents with keywords, having done so for medical and financial institutions.