This text was written by Jon Berg <jon.berg|a|turtlemeat.com> in spring 2005 for the Computer Science course Medical Informatics at Tromsø University, Norway.
Current and future possibilities of Medical Informatics
11. Information retrieval in medical informatics systems
Information retrieval is the identification and efficient use of recorded media. Previously, the main purpose of information retrieval systems was to search biomedical literature. In recent years the need for information retrieval has grown as new media have been incorporated into information storage systems. These new media include gene and protein structures and the types of multimedia that have become common, such as digitized sound, video and pictures.
Theory behind Information Retrieval
The Information Retrieval process can be decomposed into four main tasks.
In indexing, the information is represented as compactly as possible to allow rapid and efficient retrieval. In many cases an index is a list of words a person is likely to search for. In its simplest form, the inverted index, it is a list of words, each with a pointer to where the information can be found. Indexing can also be done by humans, as it is in MEDLINE. In MEDLINE the article to be indexed is placed in one of 15 trees based on the subject headings, and the placement is assigned by the person doing the indexing. Full-text indexing is a widely used method of indexing content automatically. It involves first extracting all the words in the text, then removing stop words, which are common words like “I”, “was”, “go”, “do” and so on. The remaining words are then stemmed to remove common endings such as “s”, “es”, “ed” and “ing”. Finally, the words are weighted according to their importance and how well they discriminate the document they appear in from other documents. A simple weighting formula is TF*IDF, where TF is log(frequency of the term in the given document) and IDF is log(number of documents / number of documents containing the term).
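The full-text indexing steps described above (word extraction, stop-word removal, stemming, weighting) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the suffix-stripping rule and the function names are invented for the example, and the stemmer is far cruder than a real one. The code also uses 1 + log(frequency) for TF, a common variant, instead of the plain log(frequency) given in the text, because the plain form would give a weight of zero to every term that occurs only once in a document.

```python
import math
import re
from collections import Counter, defaultdict

# Illustrative stop-word list (an assumption); real systems use longer lists.
STOP_WORDS = {"i", "was", "go", "do", "the", "a", "of", "in", "and", "is"}

# Endings named in the text, ordered so "es" is tried before "s".
SUFFIXES = ("ing", "es", "ed", "s")


def stem(word):
    """Strip one common ending; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Extract words, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(documents):
    """Return an inverted index: term -> {doc_id: TF*IDF weight}."""
    term_counts = {
        doc_id: Counter(index_terms(text)) for doc_id, text in documents.items()
    }
    # Number of documents each term appears in.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_docs = len(documents)
    index = defaultdict(dict)
    for doc_id, counts in term_counts.items():
        for term, freq in counts.items():
            tf = 1 + math.log(freq)  # variant of the text's log(frequency)
            idf = math.log(n_docs / doc_freq[term])
            index[term][doc_id] = tf * idf
    return dict(index)


# Made-up example documents.
DOCS = {
    1: "The patient was treated with aspirin",
    2: "Patients treated with insulin",
    3: "Insulin dosing in children",
}
INDEX = build_index(DOCS)
```

With these three made-up documents, "patient" and "patients" end up under the same index term, and a term found in only one document ("aspirin") receives a higher weight than a term found in two ("patient"), which is the discrimination effect the weighting is meant to capture.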
Query formulation is the process of turning what a person is looking for into a well-formed query. A query often contains Boolean operators such as AND, OR and NOT. It may also contain wildcards that stand for one or several letters. Often the user fills out a user interface, and the query is constructed from what the user enters. Users may also be allowed to enter natural-language queries, that is, ordinary sentences without any particular syntax.
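As a rough sketch of how a query might be constructed from a filled-out form, the Python below combines three hypothetical form fields into a Boolean query string and expands a wildcard against a list of index terms. The field names, the query syntax and the wildcard characters ('?' for one letter, '*' for several) are assumptions made for the example; actual retrieval systems each define their own.

```python
from fnmatch import fnmatchcase


def build_query(all_of=(), any_of=(), none_of=()):
    """Turn form fields into a Boolean query string."""
    parts = list(all_of)
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {term}"
    return query.strip()


def expand_wildcard(pattern, vocabulary):
    """List the index terms that a wildcard pattern stands for.

    '?' matches exactly one character and '*' matches any run of
    characters, following the shell-style rules of fnmatch.
    """
    return sorted(term for term in vocabulary if fnmatchcase(term, pattern))


QUERY = build_query(
    all_of=["asthma", "children"],
    any_of=["inhaler", "steroid"],
    none_of=["adult"],
)
# QUERY == "asthma AND children AND (inhaler OR steroid) NOT adult"

VOCABULARY = {"cardiac", "cardiology", "carotid", "woman", "women", "womb"}
CARDI = expand_wildcard("cardi*", VOCABULARY)  # ['cardiac', 'cardiology']
WOMXN = expand_wildcard("wom?n", VOCABULARY)   # ['woman', 'women']
```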
Retrieval is the process that happens after the user has entered what he or she is searching for and a query has been constructed. The retrieval process involves matching, ranking and display. Matching selects the entries that satisfy the query. Ranking sorts the matched results in a particular order, which can be chronological, by relevance or alphabetical.
Display is the process of outputting the search result to the user.
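The three retrieval steps can be illustrated with a small Python sketch. Everything in it is hypothetical: the records, their relevance scores and the tiny inverted index are invented, and matching is shown only for required and excluded terms. It is meant to show how matching, ranking and display fit together, not how any particular system implements them.

```python
from datetime import date

# Hypothetical records; ids, titles, dates and scores are invented.
RECORDS = {
    1: {"title": "Insulin therapy in adults", "published": date(2003, 5, 1), "score": 0.4},
    2: {"title": "Childhood asthma and insulin", "published": date(2004, 9, 12), "score": 0.9},
    3: {"title": "Asthma inhaler technique", "published": date(2001, 1, 20), "score": 0.7},
}

# Inverted index: term -> set of record ids containing that term.
INDEX = {
    "insulin": {1, 2},
    "asthma": {2, 3},
    "adult": {1},
}


def match(all_of=(), none_of=()):
    """Matching: keep records with every required term and no excluded term."""
    hits = set(RECORDS)
    for term in all_of:
        hits &= INDEX.get(term, set())
    for term in none_of:
        hits -= INDEX.get(term, set())
    return hits


# One sort key per ordering named in the text.
SORT_KEYS = {
    "relevance": lambda rec_id: -RECORDS[rec_id]["score"],
    "chronological": lambda rec_id: RECORDS[rec_id]["published"],
    "alphabetical": lambda rec_id: RECORDS[rec_id]["title"].lower(),
}


def rank(hits, order="relevance"):
    """Ranking: sort the matched ids in the chosen order."""
    return sorted(hits, key=SORT_KEYS[order])


def display(ranked):
    """Display: format one numbered line per result."""
    return [
        f"{i}. {RECORDS[r]['title']} ({RECORDS[r]['published'].year})"
        for i, r in enumerate(ranked, 1)
    ]


RESULT = display(rank(match(all_of=["insulin"]), order="relevance"))
```

Here the same set of matched records can be shown in a different order simply by passing another value for `order`, which keeps ranking independent of matching.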
Current state of information retrieval
Information retrieval has benefited greatly from the introduction of the Internet, especially the web. The web provides a means of distributing many types of information, both text and multimedia. It is a great source of information, but much of that information is aimed at the health consumer, not the health professional. Finding credible information on the web is also hard for the health professional because the quality of the information is difficult to measure. One response has been that health organizations that already have a great deal of credibility in the health community have created their own websites to give health professionals quality information. Other initiatives are aggregation services that allow users to search information collected from many credible sources.
Future challenges in information retrieval
The web provides a promising way to distribute information. However, relying on information on the web can be a problem because it is more easily changed. Unlike static content published in journals, web content may be revised as the evidence changes. This raises the question of whether earlier versions should be archived. The web also makes it possible for authors to get their work out without being published in journals. This makes it easy to publish findings, but the work will lack the integrity and quality control that articles must go through to be published in traditional paper journals. As more and more multimedia information is used in health care, information retrieval will in the future also have to deal with the retrieval of multimedia information. Data such as movie clips, sounds and pictures cannot be searched directly by entering text. One way of solving this has been to incorporate meta-data and use the meta-data for matching queries. This meta-data has to be derived from the context in which the data is used, for example the text entries surrounding it where it appears in a health record.
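The idea of deriving meta-data from the surrounding text can be sketched as follows. The health record layout, the file names and the one-entry "window" around each media item are all assumptions made for this illustration; the text does not prescribe any particular scheme.

```python
import re

# Hypothetical health record: an ordered list of entries, each either free
# text or a reference to a media file. File names and notes are invented.
RECORD = [
    {"kind": "text", "body": "Patient reports chest pain on exertion."},
    {"kind": "media", "file": "echo_0412.avi"},
    {"kind": "text", "body": "Echocardiogram shows reduced ejection fraction."},
    {"kind": "text", "body": "Follow-up for knee injury."},
    {"kind": "media", "file": "knee_xray.png"},
]


def words(text):
    """Lower-case the text and return its set of words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def derive_metadata(record, window=1):
    """Give each media item the words of text entries within `window` positions."""
    metadata = {}
    for pos, entry in enumerate(record):
        if entry["kind"] != "media":
            continue
        keywords = set()
        for neighbour in record[max(0, pos - window): pos + window + 1]:
            if neighbour["kind"] == "text":
                keywords |= words(neighbour["body"])
        metadata[entry["file"]] = keywords
    return metadata


def search_media(query, metadata):
    """Return the files whose derived meta-data contains every query word."""
    wanted = words(query)
    return sorted(f for f, kw in metadata.items() if wanted <= kw)


METADATA = derive_metadata(RECORD)
CHEST = search_media("chest pain", METADATA)  # ['echo_0412.avi']
KNEE = search_media("knee", METADATA)         # ['knee_xray.png']
```

A text query can then reach a movie clip or picture that contains no text of its own, at the cost of depending entirely on how well the neighbouring entries describe it.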