Minimum description length

The minimum description length (MDL) principle is a formalization of Occam's razor in which the best hypothesis for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978; it is an important concept in information theory and learning theory.

Any set of data can be represented by a string of symbols from a finite (say, binary) alphabet. "The fundamental idea behind the MDL Principle is that any regularity in a given set of data can be used to compress the data, i.e. to describe it using fewer symbols than needed to describe the data literally." (Grünwald, 1998. See the link below.) Since we want to select the hypothesis that captures the most regularity in the data, we look for the hypothesis with which the best compression can be achieved.
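The link between regularity and compression can be illustrated with an off-the-shelf compressor. The sketch below uses Python's standard zlib module purely as a stand-in for "a code": it is not an MDL code, and the two byte strings, the seed, and the helper name compressed_size are illustrative assumptions. It compares a highly regular string with pseudo-random bytes of the same length.

```python
import random
import zlib

# Highly regular data: the same two bytes repeated 500 times (1,000 bytes).
regular = b"01" * 500

# Irregular data: 1,000 pseudo-random bytes (fixed seed for repeatability).
rng = random.Random(0)
irregular = bytes(rng.getrandbits(8) for _ in range(1000))


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed form of `data`."""
    return len(zlib.compress(data, 9))


# The regular string shrinks to a small fraction of its size;
# the pseudo-random one does not shrink at all in any useful sense.
print(len(regular), compressed_size(regular))
print(len(irregular), compressed_size(irregular))
```

The regular string can be described far more briefly than by writing it out literally, whereas the pseudo-random bytes offer the compressor no regularity to exploit.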

To do this, we must first fix a code to compress the data. The most general way to do this is to choose a (Turing-complete) computer language. We then write a program in that language that outputs the data. This program thus represents the data. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is the central idea of Ray Solomonoff's idealized theory of inductive inference.

However, this mathematical theory does not provide a practical way of doing inference. The most important reasons for this are:

Kolmogorov complexity is uncomputable: there exists no computer program that, given an arbitrary sequence of data as input, outputs the shortest program that produces the data. Even if we happen to find the shortest program that outputs the data, it is in general not possible to know for certain that it is the shortest.
The Kolmogorov complexity depends on which computer language is used to describe programs. This is an arbitrary choice, and switching languages changes the complexity by at most a constant additive term that does not depend on the data. For that reason, constant terms tend to be disregarded in Kolmogorov complexity theory. In practice, however, where often only a small amount of data is available, such constants may have a very large influence on the inference results: good results cannot be guaranteed when one is working with limited data.

MDL is an attempt to remedy these, by:

Restricting the set of allowed codes in such a way that it becomes possible (computable) to find the shortest codelength of the data, relative to the allowed codes, and
Choosing a code that is reasonably efficient whatever the data at hand. This point is somewhat elusive and much research is still going on in this area.

Rather than "programs", in MDL theory one usually speaks of candidate hypotheses, models or codes. The set of allowed codes is then called the model class. (To confuse matters, some authors refer to the model class as the model.) The code selected is the one that minimizes the sum of the length of the description of the code and the length of the description of the data with the help of the code.
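In symbols, write $L(H)$ for the length in bits of the description of a code $H$ from the model class $\mathcal{H}$, and $L(D \mid H)$ for the length of the data $D$ when described with the help of $H$. This notation is assumed here for illustration and is not fixed by the text above. The selection rule can then be sketched as:

$$H_{\mathrm{MDL}} = \operatorname*{arg\,min}_{H \in \mathcal{H}} \bigl[ L(H) + L(D \mid H) \bigr]$$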

One of the important properties of MDL methods is that they provide a natural safeguard against overfitting, because they implement a tradeoff between the complexity of the hypothesis (model class) and the complexity of the data given the hypothesis. To see why this is true, consider the following example. Suppose you flip a coin 1,000 times and you observe the numbers of heads and tails. We consider two model classes. The first consists of a single code that represents each outcome with a 0 for heads or a 1 for tails. This code represents the hypothesis that the coin is fair. The code length according to this code is always exactly 1,000 bits. The second model class consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair.

Say that we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1,000 bits, although by less than one bit. For this reason a naive statistical method might put forward this second hypothesis as a better explanation for the data. However, in an MDL approach we have to construct a single code based on the hypothesis; we cannot just use the best one in hindsight. A simple way to do this is to use a two-part code, in which we first specify which element of the model class has the best performance, and then specify the data using that code. Specifying which code to use costs more bits than the small saving it buys, so the total code length based on the second model class is larger than 1,000 bits. Thus, if you follow an MDL approach, the conclusion has to be that there is not enough evidence in support of the hypothesis that the coin is biased, even though the best element of the second model class provides a better fit to the data.
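The arithmetic of the coin example can be checked with a short sketch. It rests on one simplifying assumption that the text does not specify: the first part of the two-part code picks one of the 1,001 candidate biases k/1000 with a uniform code, costing log2(1001) bits (about 10). Other discretizations of the bias are possible and would give different totals. The function names are illustrative.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits per flip of a coin with heads probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def fair_code_length(n: int) -> float:
    """First model class: one bit per flip, whatever the outcome."""
    return float(n)


def two_part_code_length(heads: int, n: int) -> float:
    """Second model class, encoded with a simple two-part code.

    Part 1 (assumption): a uniform code over the n + 1 biases k/n.
    Part 2: the data encoded with the best-fitting bias heads/n.
    """
    index_bits = math.log2(n + 1)
    data_bits = n * binary_entropy(heads / n)
    return index_bits + data_bits


n = 1000
for heads in (510, 900):
    print(heads, fair_code_length(n), two_part_code_length(heads, n))
```

With 510 heads the second part alone is about 999.7 bits, so the saving is swallowed by the first part and the fair-coin code stays shorter. With 900 heads the two-part code is much shorter than 1,000 bits, and the biased-coin hypothesis is preferred.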

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions. (The lemma involved is the Kraft-McMillan inequality.) For any probability distribution $P$, it is possible to construct a code $C$ such that the length (in bits) of $C(x)$ is equal to $-\log_2 P(x)$; this code minimizes the expected code length. Vice versa, given a code $C$, one can construct a probability distribution $P$ such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code reduces to searching for a good probability distribution, and vice versa.
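The correspondence can be made concrete with a small sketch. It assigns each outcome the integer length obtained by rounding $-\log_2 P(x)$ up (the rounding that the text sets aside), checks the Kraft-McMillan inequality, and maps lengths back to probabilities of the form 2 to the power of minus the length. The function names and the two example distributions are illustrative assumptions.

```python
import math


def shannon_code_lengths(probs):
    """Integer code lengths ceil(-log2 p) for each outcome with p > 0."""
    return {x: math.ceil(-math.log2(p)) for x, p in probs.items() if p > 0}


def kraft_sum(lengths):
    """Sum of 2^-length; a prefix code with these lengths exists iff this is <= 1."""
    return sum(2.0 ** -length for length in lengths.values())


def implied_distribution(lengths):
    """(Sub)probabilities 2^-length recovered from code lengths."""
    return {x: 2.0 ** -length for x, length in lengths.items()}


# Dyadic probabilities: no rounding is needed and the round trip is exact.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
lengths = shannon_code_lengths(probs)
print(lengths, kraft_sum(lengths), implied_distribution(lengths))

# Non-dyadic probabilities: rounding up wastes some code space.
skewed = shannon_code_lengths({"x": 0.7, "y": 0.2, "z": 0.1})
print(skewed, kraft_sum(skewed))
```

In the dyadic case the Kraft sum is exactly 1 and the recovered distribution equals the original. In the non-dyadic case the sum falls below 1, which is the rounding loss ignored in the idealized statement.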

Related concepts

MDL is very strongly connected to probability theory and statistics through the correspondence between codes and probability distributions mentioned above. This has led some researchers to view MDL as being equivalent to Bayesian inference. The code length of the model and the combined code length of model and data in MDL correspond, respectively, to the prior probability and the marginal likelihood in the Bayesian framework. This point of view is expressed for example in David MacKay's Information Theory, Inference, and Learning Algorithms (see link below). However, while Bayesian machinery is often useful in constructing efficient MDL codes, the MDL framework also accommodates other codes that are not Bayesian. An example is the Shtarkov 'normalized maximum likelihood code', which plays a central role in current MDL theory, but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the true data-generating process: in practice, a model class is typically a simplification of reality and thus does not contain any code or probability distribution that is true in any objective sense. According to the MDL philosophy, we should thus dismiss Bayesian methods if they are based on 'unsafe' priors that would lead to poor results for some possible data-generating processes. The priors that are acceptable from an MDL point of view also tend to be favored in so-called objective Bayesian analysis; however, there the motivation is usually different.
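As an illustration of a non-Bayesian code, the sketch below computes the normalized maximum likelihood code length for the coin-flipping model class from the earlier example. It assumes the standard definition, in which the maximized likelihood of the observed sequence is divided by the sum of maximized likelihoods over all sequences of the same length. The function names are illustrative, and the computation is done in log space to avoid numerical underflow.

```python
import math


def log2_max_likelihood(k: int, n: int) -> float:
    """log2 of the maximized Bernoulli likelihood of one sequence with k heads."""
    total = 0.0
    if k > 0:
        total += k * math.log2(k / n)
    if k < n:
        total += (n - k) * math.log2((n - k) / n)
    return total


def log2_binomial(n: int, k: int) -> float:
    """log2 of the number of length-n sequences with exactly k heads."""
    log_count = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_count / math.log(2)


def nml_complexity_bits(n: int) -> float:
    """log2 of the normalizer: the sum over all sequences of their maximized likelihood."""
    terms = [log2_binomial(n, k) + log2_max_likelihood(k, n) for k in range(n + 1)]
    largest = max(terms)
    return largest + math.log2(sum(2.0 ** (t - largest) for t in terms))


def nml_code_length(k: int, n: int) -> float:
    """Code length in bits of any one sequence with k heads out of n flips."""
    return -log2_max_likelihood(k, n) + nml_complexity_bits(n)


for heads in (510, 900):
    print(heads, nml_code_length(heads, 1000))
```

For 1,000 flips the normalizer adds a little over 5 bits to the best-fit code length, so the sequence with 510 heads still costs more than the 1,000 bits of the fair-coin code, while the sequence with 900 heads costs far less.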

MDL was not the first information-theoretic approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called Minimum Message Length (MML). The difference between MDL and MML is a source of ongoing confusion among academics and encyclopaedia writers alike. Superficially, the methods appear mostly equivalent, but there are some significant differences, especially in interpretation:

MML is a fully subjective Bayesian approach: it starts from the idea that one represents one's beliefs about the data generating process in the form of a prior distribution. MDL ostensibly avoids any assumptions about the data generating process (but see above for the difficulty of choosing a "reasonable" code).
Both methods make use of two-part codes: the first part always represents the information that one is trying to learn, such as the index of a model class (model selection) or parameter values (parameter estimation). The second part is an encoding of the data given the information in the first part. The difference is that the MDL literature advocates moving parameters that we do not want to learn to the second part of the code, where they can be represented together with the data by using a so-called one-part code. This is often more efficient than a two-part code. In the original description of MML, all parameters are encoded in the first part, so all parameters are learned.