Information Retrieval

When people think about information retrieval, they most likely think about the data stored in the database and the interface that acts as the intermediary to that data. It is often forgotten that an equally important component of any information retrieval system is the user with information needs. The spectrum of database design and search engines is broad and complicated, but the spectrum of human behavior, understanding, and information need is even more so.

As Cleveland & Cleveland state:

"Being a people susceptible to gadgetry, we [have] lived for a while with an innocent dream, believing that all we [need is] a readable microform device and larger, faster computers. Now we have very large and unbelievably fast computing machines, but we still have problems in information handling. In many instances we found that computers simply allowed us to make the same old mistakes at an incredible speed. Why? Because there are deep intellectual complexities involved that we are only beginning to understand and appreciate - problems concerning users and their individual needs, the realization that the relevance of a particular document is the judgement of a single individual, not a universal constant. It is no longer acceptable to build systems geared to a mythical 'average' user who has never existed and never will. There are problems concerning indexing, classification, and searching techniques. How do scientists look for information? How do non-scientists search? Slowly, we are realizing that an information system is concerned with more than just documents and their contents - very much involved is human behavior."

It is these elements that make information retrieval a phenomenally difficult puzzle. Because so many factors affect whether a user receives the information they need in an effective manner, all three components (the data, the interface, and the user) need to be considered.

The previous section on relational databases gives a brief introduction to database structure, and the sections on Oracle, Voyager, and Multi MIMSY provide links to information about the specific tools we will be using. The physical dimension of how to coordinate operations between Voyager and Multi MIMSY is still largely unknown because we do not have these pieces of software in hand at this time and are unable to access their particular configurations of Oracle.

It has generally been assumed that we will construct our own interface to provide access to integrated information. This interface is proposed to be grounded in web technology and functionality, and is also fundamentally an unknown until the underlying structure of the finished system is better defined.

We can, however, consider how best to meet the needs of our users by looking at various search strategies and philosophies.

Recall and Precision

Recall and precision are the two main measures traditionally used to test information retrieval effectiveness. Recall refers to how successful the system is in retrieving all the documents that relate to a search query: it is the proportion of all relevant documents that the search actually retrieves. Precision, on the other hand, refers to how well the system separates the irrelevant documents from the truly relevant ones: it is the proportion of retrieved documents that are relevant. In practice these measures tend to trade off against each other, so a search that gives a high level of recall usually does so at the cost of precision, and a search tuned for precision usually loses recall.
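
The two measures can be illustrated with a small calculation. The following is only a sketch: the record numbers and the two result sets are invented for illustration, and the function name precision_recall is our own.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant records that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical record numbers: four records are truly relevant to the query.
relevant = {1, 2, 3, 4}
broad_search = {1, 2, 3, 4, 5, 6, 7, 8}  # finds everything relevant, plus noise
narrow_search = {1, 2}                   # finds only relevant records, but misses some

broad = precision_recall(broad_search, relevant)
narrow = precision_recall(narrow_search, relevant)
print("broad  (precision, recall):", broad)
print("narrow (precision, recall):", narrow)
```

In this invented case the broad search retrieves every relevant record but half of what it returns is irrelevant, while the narrow search returns only relevant records but misses half of them.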

Exact Searching

A library OPAC (Online Public Access Catalog) is one place where you will see a number of options offered for constructing a search. The basic categories are usually author, title, subject and keyword. The first three search strategies deal with very specific kinds of information, and the only fields that are actually searched are the fields that contain that specific kind of information. These searches take advantage of the structure that is part of a MARC record and that is integral to library cataloging. The system will index these fields and match a search query against that index. This is why we have included index classes in our data dictionary. These classes indicate not only which fields we feel should be indexed independently, but also what kind of index each should have.

In the case of an author search, only those fields within a MARC record that are defined to contain author information will be searched. Because catalogers generally utilize authority files to which the author field is related, and because the form of the name is "controlled", there can be a high level of precision in this kind of search. One very large name authority file is the Library of Congress name authority file which contains 3.5 million authority records.

The same can be said of a subject search, except that the source of the controlled vocabulary in this case is a subject headings list, often the Library of Congress Subject Headings (LCSH). A subject heading is an example of precoordination: the subject determination is made in advance by the cataloger, and the user must enter the controlled subject heading to retrieve the record. Again, this type of search can offer a high level of precision.
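
A field-specific index of this kind might be sketched as follows. This is an assumption-laden illustration, not a description of how Voyager or any other OPAC is implemented: the records are plain dictionaries standing in for MARC records, the subject headings are made up for the example, and the helper names (normalize, build_field_index, exact_search) are hypothetical.

```python
from collections import defaultdict

# Hypothetical records; the keys stand in for MARC author, title and subject fields.
RECORDS = {
    1: {"author": "Melville, Herman", "title": "Moby Dick", "subject": "Whaling"},
    2: {"author": "Dana, Richard Henry", "title": "Two Years Before the Mast",
        "subject": "Seafaring life"},
    3: {"author": "Melville, Herman", "title": "Typee", "subject": "Seafaring life"},
}


def normalize(heading):
    """Lowercase a heading and collapse runs of whitespace."""
    return " ".join(heading.lower().split())


def build_field_index(records, field):
    """Index one field only: normalized heading -> set of record numbers."""
    index = defaultdict(set)
    for record_id, record in records.items():
        index[normalize(record[field])].add(record_id)
    return index


def exact_search(index, query):
    """Match the whole query against the controlled headings in one index."""
    return sorted(index.get(normalize(query), set()))


author_index = build_field_index(RECORDS, "author")
subject_index = build_field_index(RECORDS, "subject")

print(exact_search(author_index, "melville,  herman"))  # controlled form of the name
print(exact_search(author_index, "Herman Melville"))    # uncontrolled form
print(exact_search(subject_index, "Seafaring life"))
```

The uncontrolled form "Herman Melville" finds nothing in this sketch, which is the price of the precision: the user has to enter the heading in the form the authority file prescribes.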

Keyword Searching

A keyword search differs from the first three search strategies in three ways: it does not depend on controlled vocabulary, it searches across fields that contain different kinds of information, and it generally offers a high level of recall. Basic keyword searches look for a word, number or combination of words and numbers anywhere in a record, not just in the author, title, subject or call number fields. A keyword search on "vessel" will bring up records that use the word vessel anywhere, but will not differentiate between a vessel that is a watercraft and a vessel that is a cup, for instance. A subject search under vessel (watercraft), by contrast, would bring up only records that have already been determined to be relevant to vessels that are watercraft. Keyword searching is an example of postcoordination.

Keyword searches allow you to link terms from different parts of a record, such as an author's name with a word from a book title or a subject heading.
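
The "vessel" example can be sketched in a few lines. Everything here is invented for illustration: the three records, their field names, and the functions keyword_search and subject_search are assumptions, not the behavior of any particular catalog system.

```python
import re

# Hypothetical records; the word "vessel" appears in a different role in each.
RECORDS = {
    1: {"author": "Smith, A.", "title": "Sailing vessel rigging",
        "subject": "Vessel (watercraft)"},
    2: {"author": "Jones, B.", "title": "A silver drinking vessel",
        "subject": "Vessel (container)"},
    3: {"author": "Vessel, C.", "title": "Harbor pilots",
        "subject": "Pilots and pilotage"},
}


def tokens(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_search(records, query):
    """Return records in which every query word occurs in some field."""
    wanted = set(tokens(query))
    hits = []
    for record_id, record in records.items():
        found = set()
        for value in record.values():  # look in every field, not just one
            found.update(tokens(value))
        if wanted <= found:
            hits.append(record_id)
    return sorted(hits)


def subject_search(records, heading):
    """Return records whose subject field matches the heading exactly."""
    return sorted(
        record_id
        for record_id, record in records.items()
        if record["subject"].lower() == heading.lower()
    )


print(keyword_search(RECORDS, "vessel"))               # every record, whatever the sense
print(subject_search(RECORDS, "Vessel (watercraft)"))  # only the watercraft record
print(keyword_search(RECORDS, "smith rigging"))        # author word plus title word
```

The keyword search returns all three records, including one where "Vessel" is merely an author's surname, while the subject search returns only the record a cataloger has assigned to watercraft.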

Examples of when keyword searches are useful:

Boolean Operators

Keyword search terms may be combined with operators, enabling you to broaden or narrow your search. Operators enable you to link terms in a wide variety of ways. The more specific the operator used in combining your terms, the fewer matches you will get (less recall/greater precision). Boolean operators can only be used in keyword searching.

The system searches the terms in a search strategy from left to right and processes the most specific operators first. The default operator is usually AND: if you do a keyword search with more than one search term and do not include an operator, the system usually assumes AND. Other frequently used Boolean operators are OR and NOT.
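
One way to picture Boolean operators is as set operations on the lists of records that contain each term. The sketch below is a simplification under stated assumptions: the postings are invented, the function boolean_search is hypothetical, operators are applied strictly from left to right with no precedence rules, and a query that begins with NOT is not handled.

```python
# Hypothetical postings: keyword -> set of record numbers that contain it.
POSTINGS = {
    "whaling": {1, 2, 5},
    "ships": {2, 3, 5},
    "logbooks": {3, 4, 5},
}

OPERATORS = {"AND", "OR", "NOT"}


def boolean_search(query, postings):
    """Evaluate a keyword query left to right; AND is assumed between bare terms."""
    result = None
    operator = "AND"  # default operator when none is given
    for token in query.split():
        if token.upper() in OPERATORS:
            operator = token.upper()
            continue
        matches = postings.get(token.lower(), set())
        if result is None:
            result = set(matches)  # first term starts the result set
        elif operator == "AND":
            result &= matches      # narrow: keep records that have both
        elif operator == "OR":
            result |= matches      # broaden: keep records that have either
        else:
            result -= matches      # NOT: drop records that have the term
        operator = "AND"           # fall back to the default for the next term
    return sorted(result or set())


print(boolean_search("whaling ships", POSTINGS))
print(boolean_search("whaling OR ships", POSTINGS))
print(boolean_search("whaling NOT ships", POSTINGS))
```

AND and NOT shrink the answer set (greater precision) while OR enlarges it (greater recall), which matches the behavior described above.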

Relevancy Ranking

One of the techniques often used by search engines to help deal with high recall answer sets is to rank them by relevancy, showing the documents determined to be the most relevant first and working down to those with minimal relevance. Different search engines determine relevancy in different ways, and display the ranking differently, but all attempt to help the user be selective. The criteria that a system uses to determine relevancy typically include how often the search term appears and, in the case of a Boolean search, how close the terms are to one another.
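
The two criteria mentioned above, term frequency and proximity, can be combined into a toy scoring function. This is a sketch only: real search engines use their own formulas, and the documents, the weighting (frequency plus the reciprocal of the smallest distance between two terms), and the function names here are all assumptions made for illustration.

```python
import re

# Hypothetical short documents, keyed by record number.
DOCUMENTS = {
    1: "whaling ships and whaling crews of New Bedford",
    2: "ships built for trade, with one chapter on whaling",
    3: "whaling ships",
}


def tokens(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_distance(words, term_a, term_b):
    """Smallest number of word positions between the two terms, or None."""
    positions_a = [i for i, word in enumerate(words) if word == term_a]
    positions_b = [i for i, word in enumerate(words) if word == term_b]
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)


def score(text, terms):
    """Toy relevance score: total term frequency plus a proximity bonus."""
    words = tokens(text)
    frequency = sum(words.count(term) for term in terms)
    proximity = 0.0
    if len(terms) == 2:
        distance = min_distance(words, terms[0], terms[1])
        if distance:  # skip when a term is missing or the two terms are identical
            proximity = 1.0 / distance
    return frequency + proximity


def rank(documents, terms):
    """Return record numbers ordered from most to least relevant."""
    return sorted(documents, key=lambda doc_id: score(documents[doc_id], terms),
                  reverse=True)


ranked = rank(DOCUMENTS, ["whaling", "ships"])
print(ranked)
```

With these invented documents, record 1 ranks first because "whaling" appears twice and sits next to "ships", and record 2 ranks last because the two terms are far apart.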

Natural Language Processing

Relevancy ranking is primarily based on mathematical algorithms, and while it can be somewhat useful in dealing with high recall, it generally has only a rudimentary ability to deal with the levels of human language (for a chart of the various levels of language see the Synchronic Model of Language). Natural Language Processing (NLP) is the attempt to build systems that allow queries to be constructed based on natural patterns of language. In a lecture on information retrieval, Professor Barbara Kwasnik describes NLP as follows:

Most documents are written in natural language -- that is the author does not as a rule use a controlled vocabulary or a constrained sentence structure (syntax) or formal expressions. This does not mean that there is no regularity at all. While we express ourselves in a very wide variety of ways, we nevertheless use some rules. This is particularly true if we belong to what is known as a particular discourse community. For instance, journalists use similar forms and vocabulary, as do lawyers and doctors. And even ordinary folks follow certain patterns. Otherwise we wouldn't be able to understand each other if the variation of our linguistic expressions were infinitely and absolutely different. The aim of NLP is to allow, ideally, searchers to express their queries in natural language without having to translate their search into a formal statement, and second, for the search engine to search documents and retrieve them based on clues offered by the regularity or patterns of linguistic expression. Thus, NLP is not really a retrieval technique, but rather a whole different approach to information retrieval. There are many techniques and strategies that NLP uses to achieve this still unattainable goal of natural language searching and document representation.

A research study* comparing the results of exact searches with keyword searches found that when the same terms are used in a subject search and then in a keyword search, the keyword search does indeed give more results (high recall). When users were surveyed as to their satisfaction, however, the level of satisfaction with keyword searches was lower than with exact searches (high precision). This indicates that, given an easy choice, people will choose to use both keyword and controlled vocabulary searching, as they are different tools for different situations.

This suggests that we will want to offer high recall searches by taking advantage of the web interfaces offered by both Multi MIMSY and Voyager, as well as any interface that we may create ourselves. This keyword type of searching will offer high recall with the ability to rank by relevance.

By virtue of the preparation work that we have done, we are in an excellent position to define a new level of precision access to our information as well. Using the parameters that are being defined in our data dictionary process, by beginning to define maritime vocabulary (AAT), and by using our areas of expertise to add context to how we categorize our information (Subject headings coordination), we can consider the possibility of making concrete links between the data contained in Multi MIMSY and the data contained in Voyager. Although the exact methodology for doing this has not yet been determined (perhaps we will do direct mapping from one field to another, perhaps we will develop SQL "reports" or queries that will connect the data), we are in conversation with both organizations to make this happen.

*see Tillotson, J. (1995). Is Keyword Searching the Answer? College & Research Libraries, 56 (3).

last updated on 01/14/99