Unified Medical Language System

The Unified Medical Language System (UMLS) is a compendium of many controlled vocabularies in the biomedical sciences. It provides a mapping structure among these vocabularies and thus allows one to translate among the various terminology systems; it may also be viewed as a comprehensive thesaurus and ontology of biomedical concepts. UMLS further provides facilities for natural language processing. It is intended to be used mainly by developers of systems in medical informatics.

UMLS consists of the following components:

  • Metathesaurus, the core database of the UMLS, a collection of concepts and terms from the various controlled vocabularies, and their relationships;
  • Semantic Network, a set of categories and relationships that are being used to classify and relate the entries in the Metathesaurus;
  • SPECIALIST Lexicon, a database of lexicographic information for use in natural language processing;
  • a number of supporting software tools.

The UMLS was designed and is maintained by the US National Library of Medicine (NLM). It is updated quarterly and may be used free of charge. The project was initiated in 1986 by Donald A. B. Lindberg, M.D., then Director of the NLM.

Purpose and applications

The number of biomedical resources available to researchers is enormous, and a search of the medical literature often retrieves an unmanageably large volume of documents. The purpose of the UMLS is to enhance access to this literature by facilitating the development of computer systems that understand biomedical language. It does so by addressing two significant barriers: "the variety of ways the same concepts are expressed in different machine-readable sources & by different people" and "the distribution of useful information among many disparate databases & systems".

UMLS can be used to design information retrieval or patient record systems, to facilitate the communication between different systems, or to develop systems that parse the biomedical literature. For many of these applications, the UMLS will have to be used in a customized form, for instance by excluding certain source vocabularies that are not relevant to the application. The Library of Medicine itself uses it for its PubMed and ClinicalTrials.gov systems.

Users of the system have to sign a "UMLS agreement" and file brief annual reports on their use. Academic users can employ the UMLS free of charge for research. Commercial or production use requires copyright licenses for some of the incorporated source vocabularies.

Metathesaurus

The Metathesaurus forms the base of the UMLS and comprises over 1 million biomedical concepts and 5 million concept names, all drawn from more than 100 incorporated controlled vocabularies and classification systems. Examples of the incorporated controlled vocabularies are ICD-9-CM, ICD-10, MeSH, SNOMED CT, LOINC, WHO Adverse Drug Reaction Terminology, UK Clinical Terms, RxNorm, Gene Ontology, and OMIM.

The Metathesaurus is organized by concept, and each concept has specific attributes defining its meaning and is linked to the corresponding concept names in the various source vocabularies. Numerous relationships between the concepts are represented, for instance hierarchical ones such as "isa" for subclasses and "is part of" for subunits, and associative ones such as "is caused by" or "in the literature often occurs close to" (the latter being derived from Medline).

The scope of the Metathesaurus is determined by the scope of the source vocabularies. If different vocabularies use different names for the same concept, or if they use the same name for different concepts, then this will be faithfully represented in the Metathesaurus. All hierarchical information from the source vocabularies is retained in the Metathesaurus. Metathesaurus concepts can also link to resources outside of the database, for instance gene sequence databases.

The Metathesaurus itself is produced by automated processing of machine-readable versions of the source vocabularies, followed by human editing and review. It is distributed as an SQL relational database and can also be accessed via a Java object-oriented API.
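The concept-centred organization described in this section can be sketched in a few lines of Python. This is only an illustration of the idea, not the actual Metathesaurus schema or API: the class names, the identifiers (X0001 and so on) and the source labels SRC_A and SRC_B are all made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One meaning, with the names each source vocabulary gives it."""
    concept_id: str                           # made-up identifier, not a real UMLS one
    names: dict = field(default_factory=dict)  # source label -> list of names


class MiniMetathesaurus:
    """Toy, in-memory stand-in for a concept-organized thesaurus."""

    def __init__(self):
        self.concepts = {}
        self._by_name = defaultdict(set)  # lower-cased name -> concept ids

    def add(self, concept_id, source, name):
        concept = self.concepts.setdefault(concept_id, Concept(concept_id))
        concept.names.setdefault(source, []).append(name)
        self._by_name[name.lower()].add(concept_id)

    def lookup(self, name):
        """Return every concept id that some source calls `name`."""
        return sorted(self._by_name.get(name.lower(), set()))

    def translate(self, name, target_source):
        """Map a name to the names used by `target_source`, per concept."""
        return {
            cid: self.concepts[cid].names.get(target_source, [])
            for cid in self.lookup(name)
        }


meta = MiniMetathesaurus()
# Different names for the same concept:
meta.add("X0001", "SRC_A", "Heart attack")
meta.add("X0001", "SRC_B", "Myocardial infarction")
# The same name for different concepts:
meta.add("X0002", "SRC_A", "Cold")         # the sensation of low temperature
meta.add("X0003", "SRC_B", "Cold")         # the common cold
meta.add("X0003", "SRC_A", "Common cold")

print(meta.translate("heart attack", "SRC_B"))
print(meta.lookup("cold"))
```

Keying everything on the concept rather than on the name is what lets both situations coexist: synonyms from different sources attach to one concept, while an ambiguous name simply resolves to more than one concept.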

Semantic network

Each concept in the Metathesaurus is assigned to at least one "Semantic type" (a category), and certain "Semantic relationships" may hold between members of the various Semantic types. The Semantic network is a catalog of these Semantic types and Semantic relationships. The classification is rather broad: there are 135 Semantic types and 54 relationships.

The major semantic types are organisms, anatomical structures, biologic function, chemicals, events, physical objects, and concepts or ideas. The links among semantic types provide the structure for the network and show important relationships between the groupings and concepts. The primary link between semantic types is the "isa" link, which establishes a hierarchy of types and makes it possible to locate the most specific semantic type to assign to a given Metathesaurus concept. The network also has 5 major categories of non-hierarchical (or "associational") relationships: "physically related to", "spatially related to", "temporally related to", "functionally related to" and "conceptually related to".

The information about a Semantic type includes an identifier, definition, examples, hierarchical information about the encompassing Semantic type(s), and its associational relationships. Associational relationships within the Semantic Network are very weak. They capture at most some-some relationships, i.e. they capture the fact that some instance of the first type may be connected by the salient relationship to some instance of the second type. Phrased differently, they capture the fact that a corresponding relational assertion is meaningful (though it need not be true in all cases).
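The interplay between the "isa" hierarchy and the weak, some-some reading of associational relationships can be illustrated with a small sketch. The type names, the single part_of relation and the rule that a relationship declared for a type also applies to its descendants are assumptions made for this example; they are not taken from the actual Semantic Network files. For simplicity each type has at most one parent here.

```python
# Hypothetical "isa" links: child type -> parent type.
ISA = {
    "Animal": "Organism",
    "Mammal": "Animal",
    "Organ": "Anatomical Structure",
}

# Hypothetical associational relationships declared between types.
RELATIONS = {
    ("Anatomical Structure", "part_of", "Organism"),
}


def ancestors(sem_type):
    """Return sem_type followed by every type reachable via isa links."""
    chain = [sem_type]
    while chain[-1] in ISA:
        chain.append(ISA[chain[-1]])
    return chain


def is_meaningful(subject, relation, obj):
    """True if the assertion is *permitted* by the network.

    This says nothing about whether the assertion is true of any
    particular pair of concepts; it only checks that some instance of
    `subject` may stand in `relation` to some instance of `obj`.
    """
    return any(
        (s, relation, o) in RELATIONS
        for s in ancestors(subject)
        for o in ancestors(obj)
    )


print(ancestors("Mammal"))
print(is_meaningful("Organ", "part_of", "Mammal"))
print(is_meaningful("Mammal", "part_of", "Organ"))
```

A positive answer from is_meaningful only licenses the assertion; deciding whether a specific organ is part of a specific mammal would need concept-level relationships such as those in the Metathesaurus.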

SPECIALIST Lexicon

The SPECIALIST Lexicon contains information about common English vocabulary, biomedical terms, and terms found in MEDLINE and in the UMLS Metathesaurus. Each entry contains syntactic (how words are put together to create meaning), morphological (form and structure) and orthographic (spelling) information. A set of Java programs uses the lexicon to work through the variations in biomedical texts by relating words by their parts of speech, which can be helpful in web searches or searches through an electronic medical record.

Entries may be one-word or multiple-word terms. Records contain four parts: the base form (e.g. "run" for "running"); the part of speech (of which SPECIALIST recognizes eleven); a unique identifier; and any available spelling variants. For example, a query for "anesthetic" would return the following:

{base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
cat=noun
variants=reg}
{base=anaesthetic
spelling_variant=anesthetic
entry=E0008770
cat=adj
variants=inv
position=attrib(3)}

(Browne et al., 2000)

The SPECIALIST Lexicon is available in two formats. The "unit record" format can be seen above, and comprises slots and fillers. A slot is the element (e.g. "base=" or "spelling_variant=") and the fillers are the values attributable to that slot for that entry. The "relational table" format is not yet normalized and contains a great deal of duplicated data.
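To make the slot and filler structure concrete, the following sketch parses the two records shown above into Python dictionaries. It assumes the layout used in that example (records delimited by braces, one slot=filler pair per line); the distributed lexicon files may be laid out differently, and the function names are invented for this illustration.

```python
SAMPLE = """\
{base=anaesthetic
spelling_variant=anesthetic
entry=E0008769
cat=noun
variants=reg}
{base=anaesthetic
spelling_variant=anesthetic
entry=E0008770
cat=adj
variants=inv
position=attrib(3)}
"""


def parse_unit_records(text):
    """Parse brace-delimited records into dicts of slot -> list of fillers."""
    records = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):      # a new record begins
            current = {}
            line = line[1:]
        closes = line.endswith("}")   # the record ends on this line
        if closes:
            line = line[:-1]
        if line and current is not None:
            slot, _, filler = line.partition("=")
            # A slot may occur more than once, so fillers are kept in a list.
            current.setdefault(slot, []).append(filler)
        if closes and current is not None:
            records.append(current)
            current = None
    return records


def entries_for(records, term):
    """Return entry identifiers whose base form or spelling variant is `term`."""
    found = []
    for record in records:
        spellings = record.get("base", []) + record.get("spelling_variant", [])
        if term in spellings:
            found.extend(record.get("entry", []))
    return found


records = parse_unit_records(SAMPLE)
print(entries_for(records, "anesthetic"))
```

Storing fillers as lists rather than single strings keeps the parser usable for entries that list several spelling variants under the same slot.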

Supporting software tools

MetamorphoSys is a program that can be used to customize the Metathesaurus for specific applications, for instance by excluding certain source vocabularies.
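The kind of customization MetamorphoSys performs can be pictured as a filter over rows of concept names. The sketch below is not how MetamorphoSys is implemented or invoked; the row layout, the identifiers and the source labels are hypothetical and serve only to show the effect of excluding a source vocabulary.

```python
def exclude_sources(rows, excluded):
    """Keep only rows whose source vocabulary is not in `excluded`.

    Each row is a (concept_id, source, name) tuple; this layout is an
    assumption made for the example.
    """
    excluded = set(excluded)
    return [row for row in rows if row[1] not in excluded]


rows = [
    ("X0001", "SRC_A", "Heart attack"),
    ("X0001", "SRC_B", "Myocardial infarction"),
    ("X0002", "SRC_C", "Cold"),
]

subset = exclude_sources(rows, {"SRC_C"})
# Concepts named only by excluded sources drop out of the subset entirely.
remaining_concepts = sorted({concept_id for concept_id, _, _ in subset})
print(remaining_concepts)
```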

lvg is a program that uses the SPECIALIST lexicon to generate lexical variants of a given term and to support the parsing of natural language text.

MetaMap is an online tool that, when given an arbitrary piece of text, finds and returns the relevant Metathesaurus concepts. MetaMap Transfer (MMTx) provides the same functionality as a Java program.

Knowledge Source Server is an online application that allows one to browse the Metathesaurus.
