
Week 2

Entropy

This is similar to the concept of entropy we know from physics. It measures the "uncertainty" or "randomness" in a signal. Like with physical entropy, the higher the value, the more disordered or uncertain the signal is. If we have a completely predictable signal (like always rolling a 6 on a die), then the entropy is 0. Maximum entropy occurs when all outcomes are equally likely (like a fair die).

Entropy measures disorder in a system. More possible states mean higher entropy, and a uniform distribution of states is maximum entropy. On the other hand, highly ordered states are low entropy. In information theory, the formula for entropy is:

$$H(X) = H(p_X) = \sum_{x \in \mathcal{X}} p_X(x) \log\left(\frac{1}{p_X(x)}\right)$$

Let's use a few examples to explain this:

  • Let's say we have perfect certainty (like perfectly ordered crystals). Imagine a "signal" that's always the same value, like a broken thermometer that always shows 20ºC. Here, the probability of 20ºC = 1, and the probability of all other values is 0. Hence, the entropy is 0 (there is no uncertainty).
    $$H(X) = 1 \cdot \log(1) = 0$$
  • Let's consider the binary case (like a two-state system). We can think of a coin flip. For a fair coin, $p(\text{heads}) = p(\text{tails}) = \frac{1}{2}$. Hence, entropy is:

    $$H(X) = \tfrac{1}{2}\log(2) + \tfrac{1}{2}\log(2) = \tfrac{1}{2} + \tfrac{1}{2} = 1 \text{ bit}$$

    This is the maximum entropy for a two-state system. For a biased coin, the entropy would be lower as there is more certainty of the outcome.

  • Let's consider a system with multiple states (like energy levels). We can say that each of those states has a different probability, like $(A=0.5,\ B=0.25,\ C=0.125,\ D=0.125)$. This produces a lower entropy than if all the states were equally likely. The entropy for this system would be:

    $$H(X) = \tfrac{1}{2}\log(2) + \tfrac{1}{4}\log(4) + \tfrac{1}{8}\log(8) + \tfrac{1}{8}\log(8) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{3}{8} + \tfrac{3}{8} = \tfrac{7}{4} \text{ bits}$$

    If all the states were equally likely (0.25 each), the entropy would have been 2 bits, so as there is more "order" or predictability, the entropy is lower. (The short sketch after this list verifies these numbers.)
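Here's a quick Python sketch (my own, not from the original notes; base-2 logarithms assumed) that reproduces the numbers above:

```python
# A minimal sketch verifying the entropy examples above, using base-2 logs (bits).
from math import log2

def entropy(probs):
    """Shannon entropy: sum of p * log2(1/p), skipping zero-probability outcomes."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

print(entropy([1.0]))                      # broken thermometer: 0.0 bits
print(entropy([0.5, 0.5]))                 # fair coin: 1.0 bit
print(entropy([0.5, 0.25, 0.125, 0.125]))  # four-state system: 1.75 bits
print(entropy([0.25] * 4))                 # four equally likely states: 2.0 bits
```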

Some key properties:

  • Minimum entropy occurs when the outcome is certain (probability = 1), like a system frozen in its ground state.

    H(X)=0
  • Maximum entropy occurs when all outcomes are equally likely, like a system at infinite temperature.

    $$H(X) = \log(\text{number of possible states}) = \log(|\mathcal{X}|)$$
  • The entropy uses logarithms because they make it additive (like in physics): the entropies of independent events can be summed, $H(X,Y) = H(X) + H(Y)$. Since the logarithms are in base 2, the entropy is measured in "bits".

For example, this concept can be used in weather forecasting. Imagine if the probabilities for weather during a summer day were $(\text{Sunny}=0.7,\ \text{Cloudy}=0.2,\ \text{Rainy}=0.1)$, which has less entropy than winter, where the probabilities might be more equal. Therefore, by knowing that it's a summer day, we can reduce our uncertainty about the weather.

The big insight is that entropy measures the average uncertainty of our system. Just like in statistical mechanics where higher entropy means more possible microstates, in information theory higher entropy means that we need more "bits" on average to describe what's happening.

Joint entropy

This measures the total uncertainty when we have two related things, say X and Y, for example, temperature and pressure readings in a system. It considers how they behave together, not just individually.

The joint entropy H(X,Y) is like measuring the total uncertainty of a system with two variables – similar to how we think about the total energy of a system with two interacting parts.

$$H(X,Y) = H(p_{X,Y}) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x,y) \log\left(\frac{1}{p_{X,Y}(x,y)}\right)$$

For example, if we take a simple weather system, with X being the temperature (hot/cold), and Y being the sky conditions (cloudy/clear), we can create a joint probability table:

|           | Clear | Cloudy | Total |
|-----------|-------|--------|-------|
| Hot       | 0.4   | 0.1    | 0.5   |
| Cold      | 0.1   | 0.4    | 0.5   |
| Total     | 0.5   | 0.5    | 1.0   |
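To make this concrete, here's a small Python sketch (my own, not from the original notes) computing the joint entropy of this table along with the marginal entropies:

```python
# Joint entropy of the temperature/sky table above, in bits (base-2 logs).
from math import log2

joint = {
    ("hot", "clear"): 0.4, ("hot", "cloudy"): 0.1,
    ("cold", "clear"): 0.1, ("cold", "cloudy"): 0.4,
}

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Marginals p_X (temperature) and p_Y (sky)
p_x, p_y = {}, {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p
    p_y[y] = p_y.get(y, 0) + p

print(f"H(X,Y)        = {entropy(joint.values()):.3f} bits")        # ~1.722 bits
print(f"H(X) + H(Y)   = {entropy(p_x.values()) + entropy(p_y.values()):.3f} bits")
# 2.0 bits, so H(X,Y) <= H(X) + H(Y), as the bound below says
```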

Some key properties:

  • Chain rule (like in physics where we can break down total energy):

    H(X,Y)=H(X)+H(Y|X)

    This basically just means:

    Total entropy = Uncertainty of X + Remaining uncertainty of Y after knowing X

    just like Total Energy = Kinetic Energy + Potential Energy.

  • Upper and lower bounds:

    $$H(X,Y) \leq H(X) + H(Y)$$

    with equality only if X and Y are independent. This is similar to how the total energy of interacting particles can be less than the sum of the individual energies.

There are a few physical analogies we can use to explain different types of systems with joint entropy:

  • Independent systems: This is like two non-interacting particles. The total entropy is just the sum of the individual entropies:

    H(X,Y)=H(X)+H(Y)

    For example, the temperatures in New York and Tokyo can be considered independent systems.

  • Completely dependent systems: This is like rigidly connected particles.

    H(X,Y)=H(X)=H(Y)

    For example, the temperatures in ºC and in ºF form a completely dependent system, since knowing one tells you the other.

  • Partially dependent systems: This is like weakly interacting particles.

    $$H(X,Y) = H(X) + H(Y|X) \quad \text{with} \quad H(Y|X) < H(Y)$$

    For example, the temperature and pressure in a room can be a partially dependent system, since knowing the temperature can give you some idea about the pressure, but not the true value.

Let's look at a two-particle system, for example. We can say that X is the spin of particle 1 (up/down), and Y is the spin of particle 2 (up/down).

If these particles are completely independent, then:

$$p(\text{up},\text{up}) = 0.25,\quad p(\text{up},\text{down}) = 0.25,\quad p(\text{down},\text{up}) = 0.25,\quad p(\text{down},\text{down}) = 0.25$$

and the joint entropy is:

$$H(X,Y) = H(X) + H(Y) = \left[\tfrac{1}{2}\log(2) + \tfrac{1}{2}\log(2)\right] + \left[\tfrac{1}{2}\log(2) + \tfrac{1}{2}\log(2)\right] = 1 + 1 = 2 \text{ bits}$$

which is the maximum possible. If the particles are entangled:

$$p(\text{up},\text{up}) = 0,\quad p(\text{up},\text{down}) = 0.5,\quad p(\text{down},\text{up}) = 0.5,\quad p(\text{down},\text{down}) = 0$$

then the joint entropy is:

$$H(X,Y) = \sum_{(x,y)} p_{X,Y}(x,y) \log\left(\frac{1}{p_{X,Y}(x,y)}\right) = 0 + 0.5\log(2) + 0.5\log(2) + 0 = 1 \text{ bit}$$

The same calculation can also be applied to the independent system from earlier:

$$H(X,Y) = \sum_{(x,y)} p_{X,Y}(x,y) \log\left(\frac{1}{p_{X,Y}(x,y)}\right) = \tfrac{1}{4}\log(4) + \tfrac{1}{4}\log(4) + \tfrac{1}{4}\log(4) + \tfrac{1}{4}\log(4) = \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} + \tfrac{1}{2} = 2 \text{ bits}$$

To visualise how joint entropy works, we can imagine the x-axis being the first variable, the y-axis being the second variable, and the z-axis being their joint probability. The joint entropy is then the average surprise, $\log(1/p_{X,Y}(x,y))$, weighted by this joint probability surface (the volume under the probability surface itself is always 1).

The key insight is that joint entropy tells us how much total uncertainty exists in a system with two variables, accounting for any relationships between them. The joint entropy can also be extended to any number of variables over possibly different alphabets:

$$H(X_1, \ldots, X_n) = \sum_{(x_1,\ldots,x_n) \in \mathcal{X}_1 \times \cdots \times \mathcal{X}_n} p_{X_1,\ldots,X_n}(x_1,\ldots,x_n) \log\left(\frac{1}{p_{X_1,\ldots,X_n}(x_1,\ldots,x_n)}\right)$$

Conditional entropy

This asks, "If I know Y, how much uncertainty remains about X?" For example, if we know today's weather (Y), how uncertain are we about tomorrow's weather (X)?

It can be like measuring position after measuring momentum in quantum mechanics, or like knowing the remaining uncertainty in energy after measuring temperature.

$$H(X|Y) = \sum_{y \in \mathcal{Y}} p_Y(y)\, H(X|Y=y)$$

where:

$$H(X|Y=y) = \sum_{x \in \mathcal{X}} p_{X|Y=y}(x) \log\left(\frac{1}{p_{X|Y=y}(x)}\right)$$

For example, we can consider a temperature and weather system, where X is the temperature (hot/cold) and Y is the weather (sunny/cloudy). Here, given that it's sunny, we can say that:

$$p(\text{hot}|\text{sunny}) = 0.8,\quad p(\text{cold}|\text{sunny}) = 0.2$$

or if it's cloudy:

$$p(\text{hot}|\text{cloudy}) = 0.3,\quad p(\text{cold}|\text{cloudy}) = 0.7$$

where:

$$p(\text{sunny}) = 0.6,\quad p(\text{cloudy}) = 0.4$$

We can therefore calculate the entropy H(X|Y):

$$\begin{aligned}
H(X|Y) &= \sum_{y \in \mathcal{Y}} p_Y(y)\, H(X|Y=y) \\
&= p(\text{sunny})\, H(X|\text{sunny}) + p(\text{cloudy})\, H(X|\text{cloudy}) \\
&= p(\text{sunny})\left[p(\text{hot}|\text{sunny})\log\left(\tfrac{1}{p(\text{hot}|\text{sunny})}\right) + p(\text{cold}|\text{sunny})\log\left(\tfrac{1}{p(\text{cold}|\text{sunny})}\right)\right] \\
&\quad + p(\text{cloudy})\left[p(\text{hot}|\text{cloudy})\log\left(\tfrac{1}{p(\text{hot}|\text{cloudy})}\right) + p(\text{cold}|\text{cloudy})\log\left(\tfrac{1}{p(\text{cold}|\text{cloudy})}\right)\right] \\
&= 0.6\left[0.8\log\left(\tfrac{1}{0.8}\right) + 0.2\log\left(\tfrac{1}{0.2}\right)\right] + 0.4\left[0.3\log\left(\tfrac{1}{0.3}\right) + 0.7\log\left(\tfrac{1}{0.7}\right)\right] \\
&\approx 0.6 \times 0.72 + 0.4 \times 0.88 \approx 0.79 \text{ bits}
\end{aligned}$$
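A small sketch (my own, not from the original notes) reproducing this calculation, with base-2 logarithms:

```python
# H(X|Y) for the temperature/weather example above, in bits.
from math import log2

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

p_y = {"sunny": 0.6, "cloudy": 0.4}
p_x_given_y = {
    "sunny": [0.8, 0.2],   # p(hot|sunny), p(cold|sunny)
    "cloudy": [0.3, 0.7],  # p(hot|cloudy), p(cold|cloudy)
}

# H(X|Y) = sum over y of p(y) * H(X|Y=y)
H_x_given_y = sum(p_y[y] * entropy(p_x_given_y[y]) for y in p_y)
print(f"H(X|Y) ≈ {H_x_given_y:.2f} bits")  # ≈ 0.79 bits
```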

Some key properties:

  • Conditioning reduces entropy (like making a measurement reduces uncertainty):

    $$H(X|Y) \leq H(X)$$

    Knowing Y never increases the uncertainty about X. This is like how measuring position gives you some information about the momentum (though limited by uncertainty principle).

  • Perfect knowledge: H(X|Y)=0 if X is completely determined by Y. This is like how knowing velocity completely determines the momentum for a given mass, or how temperature in ºC completely determines the temperature in ºF.

  • Independence: H(X|Y)=H(X) if X and Y are independent. For example, measuring the spin in the x direction tells us nothing about the spin in the z direction, or how the temperature in New York tells us nothing about the temperature in Tokyo.

Let's take an example of predicting the weather. If we are trying to predict what the temperature will be like in the evening (hot/cold) given that we know the temperature in the morning was cold:

$$p(\text{evening}=\text{cold}\,|\,\text{morning}=\text{cold}) = 0.7,\quad p(\text{evening}=\text{hot}\,|\,\text{morning}=\text{cold}) = 0.3$$

or if the morning was hot:

$$p(\text{evening}=\text{cold}\,|\,\text{morning}=\text{hot}) = 0.2,\quad p(\text{evening}=\text{hot}\,|\,\text{morning}=\text{hot}) = 0.8$$

The conditional entropy here represents how uncertain we are about the evening temperature after knowing the morning temperature. This is lower than the entropy of the evening temperature alone, but not zero (since the weather isn't perfectly predictable).

We can think of conditional entropy as starting with total uncertainty H(X) and subtracting information gained from knowing Y, and what's left is H(X|Y).

This connects to Mutual information I(X;Y):

$$\underbrace{H(X)}_{\text{Total uncertainty}} = \underbrace{I(X;Y)}_{\text{Shared information}} + \underbrace{H(X|Y)}_{\text{Remaining uncertainty}}$$

Kullback-Leibler (KL) Divergence

KL divergence $D(q \| p)$ is a measure of how different two probability distributions $p$ and $q$ are:

$$D(q_X \| p_X) = \sum_{x \in \mathcal{X}} q_X(x) \log\left(\frac{q_X(x)}{p_X(x)}\right)$$

For example, we can take two temperature distributions: $p = (\text{cold}=0.3,\ \text{mild}=0.4,\ \text{hot}=0.3)$ which is the actual distribution, and $q = (\text{cold}=0.2,\ \text{mild}=0.5,\ \text{hot}=0.3)$ which is the measured/predicted distribution. From this we can find their KL divergence:

$$D(p \| q) = \sum_t p(t) \log\left(\frac{p(t)}{q(t)}\right) = 0.3\log\left(\frac{0.3}{0.2}\right) + 0.4\log\left(\frac{0.4}{0.5}\right) + 0.3\log\left(\frac{0.3}{0.3}\right) \approx 0.047 \text{ bits}$$

However, the KL divergence is asymmetric: $D(p \| q) \neq D(q \| p)$ in general. In the example above, $D(q \| p) \approx 0.044$, which is different from $D(p \| q) \approx 0.047$. This is loosely analogous to irreversible processes in thermodynamics, and it is the reason the KL divergence is not a true distance metric.
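A minimal sketch (my own, not from the original notes) reproducing both numbers, assuming base-2 logarithms:

```python
# The asymmetry of KL divergence on the temperature distributions above, in bits.
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum of p * log2(p / q); assumes q > 0 wherever p > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.3, 0.4, 0.3]  # actual distribution (cold, mild, hot)
q = [0.2, 0.5, 0.3]  # predicted distribution

print(f"D(p||q) ≈ {kl_divergence(p, q):.3f} bits")  # ≈ 0.047
print(f"D(q||p) ≈ {kl_divergence(q, p):.3f} bits")  # ≈ 0.044
```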

Another property of the KL divergence is non-negativity, or Gibbs' inequality:

$$D(p_X \| q_X) \geq 0$$

with equality if $p_X = q_X$.

Basically, we can think of KL divergence like the information lost when using q to approximate p, or the work needed to transform one distribution into another.

Mutual information

This measures how much knowing about one thing tells you about another. It's like correlation in physics, but more general. For example, we can ask, "how much does knowing the air pressure tell us about the likelihood of rain?"

Mutual information is like measuring how much two variables "know about each other".

$$I(X;Y) = \sum_{(x,y) \in \mathcal{X} \times \mathcal{Y}} p_{X,Y}(x,y) \log\left(\frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}\right)$$

This can also be written in a few different ways:

$$\begin{aligned}
I(X;Y) &= D(p_{X,Y} \,\|\, p_X\, p_Y) \\
&= H(X) - H(X|Y) \quad \text{(information about X gained by knowing Y)} \\
&= H(Y) - H(Y|X) \quad \text{(information about Y gained by knowing X)} \\
&= H(X) + H(Y) - H(X,Y) \quad \text{(total info minus joint info)}
\end{aligned}$$

We can use the example of quantum entanglement to explain this. For example, if we have perfect entanglement, if X is up then Y must be down:

$$p(\text{up},\text{down}) = 0.5,\quad p(\text{down},\text{up}) = 0.5,\quad p(\text{up}) = 0.5,\quad p(\text{down}) = 0.5$$

$$I(X;Y) = \sum_{(x,y)} p_{X,Y}(x,y) \log\left(\frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}\right) = 0.5\log\left(\frac{0.5}{0.5 \times 0.5}\right) + 0.5\log\left(\frac{0.5}{0.5 \times 0.5}\right) = 0.5 + 0.5 = 1 \text{ bit}$$

This has high mutual information. On the other hand, if there is no entanglement:

$$p(\text{up},\text{up}) = p(\text{up},\text{down}) = p(\text{down},\text{up}) = p(\text{down},\text{down}) = 0.25,\quad p(\text{up}) = p(\text{down}) = 0.5$$

$$I(X;Y) = \sum_{(x,y)} p_{X,Y}(x,y) \log\left(\frac{p_{X,Y}(x,y)}{p_X(x)\, p_Y(y)}\right) = 4 \times 0.25\log\left(\frac{0.25}{0.5 \times 0.5}\right) = 4 \times 0.25\log(1) = 0 \text{ bits}$$

So there is zero mutual information.
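A small sketch (my own, not from the original notes) reproducing both results:

```python
# I(X;Y) for the entangled and independent spin examples above, in bits.
from math import log2

def mutual_information(joint):
    """I(X;Y) from a dict {(x, y): p} via sum p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)

entangled = {("up", "down"): 0.5, ("down", "up"): 0.5}
independent = {(x, y): 0.25 for x in ("up", "down") for y in ("up", "down")}

print(mutual_information(entangled))    # 1.0 bit
print(mutual_information(independent))  # 0.0 bits
```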

In classical physics, we can consider the ideal gas equation PV=nRT. Here, assuming fixed volume, the mutual information between the pressure P and the temperature T is high, knowing one will tell you a lot about the other.

Mutual information has a property of non-negativity:

$$I(X;Y) \geq 0$$

with equality if X and Y are independent. This is similar to how correlation functions in physics are bounded. Independent systems are like uncoupled systems in physics.

Mutual information is also symmetric:

I(X;Y)=I(Y;X)

This is like how correlation functions in physics are symmetric.

We can think of this like a Venn diagram, where the circles are entropies, and the overlap between them is the mutual information.

For example, if we're trying to predict the weather for the next day X, and we have a barometer Y, we can use mutual information to make some predictions, since there is high mutual information between these:

  • Low pressure → likely rain
  • High pressure → likely clear

However, this isn't perfect, since the weather is unpredictable:

$$I(X;Y) < H(X)$$

There is a data processing inequality for mutual information: If we have $X \to Y \to Z$ forming a Markov chain, then:

$$I(X;Z) \leq \min[I(X;Y),\, I(Y;Z)]$$

This shows that information, like energy, can only degrade as it passes through a system.

Markov chain

We can think of a Markov chain as the path of a particle where its next position depends only on where it is now, not how it got there.

For example, with a weather Markov chain, we can say that tomorrow's weather only depends on today's weather, not yesterday's. If it's sunny today:

$$p(\text{sunny tomorrow}\,|\,\text{sunny today}) = 0.8,\quad p(\text{rainy tomorrow}\,|\,\text{sunny today}) = 0.2$$

but if it's rainy today:

$$p(\text{sunny tomorrow}\,|\,\text{rainy today}) = 0.4,\quad p(\text{rainy tomorrow}\,|\,\text{rainy today}) = 0.6$$
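As a quick illustration (my own sketch, not from the original notes), we can simulate this two-state chain and watch the long-run fraction of sunny days settle at its stationary value, which works out to about 2/3 for these transition probabilities:

```python
# Simulating the two-state weather Markov chain above.
import random

# Transition probabilities: p(next state | current state)
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def simulate(days, start="sunny", seed=0):
    random.seed(seed)
    state, sunny_days = start, 0
    for _ in range(days):
        sunny_days += state == "sunny"
        # The next state depends only on the current state (the Markov property)
        state = "sunny" if random.random() < transitions[state]["sunny"] else "rainy"
    return sunny_days / days

print(simulate(100_000))  # ≈ 0.667, the stationary probability of a sunny day
```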

A physical analogy could be a random walk, where the next step only depends on the current position of the particle, not how it got to that position. Another example could be quantum measurements, where each measurement only depends on the previous state, not on any further history.

$$\underbrace{X}_{\text{Initial state}} \to \underbrace{Y}_{\text{Measurement 1}} \to \underbrace{Z}_{\text{Measurement 2}}$$

For the data processing inequality (DPI) in Mutual information, we can think of it as: information can never be gained along the chain, only lost. For instance, let's say we're measuring temperature. We have a Markov chain:

$$\underbrace{X}_{\text{Actual temperature}} \to \underbrace{Y}_{\text{Thermometer reading}} \to \underbrace{Z}_{\text{Digital display}}$$

Each step can lose information: the thermometer has limited precision and may be slightly inaccurate, and the digital display rounds numbers. Hence, we can't know more about X from Z than from Y:

$$I(X;Z) \leq I(X;Y)$$
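A small numeric illustration (my own sketch, not from the original notes), assuming hypothetical noise levels: $X$ is a fair bit, $Y$ is $X$ flipped with probability 0.1, and $Z$ is $Y$ flipped again with probability 0.2. We expect $I(X;Z) \leq I(X;Y)$:

```python
# The data processing inequality on a two-step noisy binary chain X -> Y -> Z.
from math import log2

def mutual_information(joint):
    p_x, p_y = {}, {}
    for (x, y), p in joint.items():
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y])) for (x, y), p in joint.items() if p > 0)

def binary_chain(flip1, flip2):
    """Joint distributions p(X,Y) and p(X,Z) for X -> Y -> Z with the given flip probabilities."""
    p_xy, p_xz = {}, {}
    for x in (0, 1):
        for y in (0, 1):
            p_y_given_x = 1 - flip1 if y == x else flip1
            p_xy[(x, y)] = 0.5 * p_y_given_x
            for z in (0, 1):
                p_z_given_y = 1 - flip2 if z == y else flip2
                p_xz[(x, z)] = p_xz.get((x, z), 0) + 0.5 * p_y_given_x * p_z_given_y
    return p_xy, p_xz

p_xy, p_xz = binary_chain(flip1=0.1, flip2=0.2)
print(f"I(X;Y) ≈ {mutual_information(p_xy):.3f} bits")  # ≈ 0.531
print(f"I(X;Z) ≈ {mutual_information(p_xz):.3f} bits")  # ≈ 0.173, smaller as the DPI requires
```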

Markov chains are memoryless:

$$p(X_{\text{future}} \,|\, X_{\text{present}}, X_{\text{past}}) = p(X_{\text{future}} \,|\, X_{\text{present}})$$

Only the current state of the system matters.

Markov chains can also be reversible:

$$X \leftrightarrow Y \leftrightarrow Z$$

Reversible Markov chains are symmetrical:

$$p_{X,Y,Z}(x,y,z) = p_X(x)\, p_{Y|X=x}(y)\, p_{Z|Y=y}(z) = p_Z(z)\, p_{Y|Z=z}(y)\, p_{X|Y=y}(x)$$

This is like reversible processes in thermodynamics, or reversible chemical reactions. Reversibility means that the system has no preferred direction.

Conditional mutual information

This measures how much X and Y "know about each other" when we already know Z.

$$I(X;Y|Z) = H(X|Z) - H(X|Y,Z)$$

We can think of it like how much information X and Y share after already knowing Z. This is similar to partial correlation in physics.

For example, let's say we have three particles: X, Y and Z, where X and Y are potentially entangled particles, and Z is a particle that we've measured. Here, I(X;Y|Z) tells us how much X and Y are entangled, even after knowing Z's state.

In classical physics, this can be temperature T, pressure P and volume V. Here, I(T;P|V) tells us how much temperature and pressure are related, given that we know the volume (constant).

Conditional mutual information is non-negative:

$$I(X;Y|Z) \geq 0$$

Just like other information measures, it can't be negative: we can't have negative information, similar to how we can't have negative energy in physics.

We can also chain mutual information:

$$\underbrace{I(X;Y,Z)}_{\text{Total information}} = \underbrace{I(X;Z)}_{\text{Direct}} + \underbrace{I(X;Y|Z)}_{\text{Conditional}}$$

and more generally:

$$I(X_1^n; Y_1^m) = I(X_1; Y_1^m) + \sum_{i=2}^{n} I(X_i; Y_1^m \,|\, X_1^{i-1})$$

This is like decomposing forces in physics.

For example, if we have today's temperature Z, and we want to find out how much tomorrow's temperature X tells us about whether it will rain tomorrow Y, then we can use the conditional mutual information I(X;Y|Z).
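To see the chain rule $I(X;Y,Z) = I(X;Z) + I(X;Y|Z)$ hold numerically, here is a small sketch (my own, not from the original notes) on a randomly generated joint distribution over three binary variables, expressing each term through joint entropies:

```python
# Numeric check of I(X;Y,Z) = I(X;Z) + I(X;Y|Z) on a random 3-variable distribution.
import itertools
import random
from math import log2

random.seed(1)

outcomes = list(itertools.product((0, 1), repeat=3))
weights = [random.random() for _ in outcomes]
total = sum(weights)
p_xyz = {o: w / total for o, w in zip(outcomes, weights)}

def H(marginal_of):
    """Entropy of the marginal obtained by grouping outcomes with marginal_of(x, y, z)."""
    marginal = {}
    for (x, y, z), p in p_xyz.items():
        key = marginal_of(x, y, z)
        marginal[key] = marginal.get(key, 0) + p
    return sum(p * log2(1 / p) for p in marginal.values() if p > 0)

# I(A;B) = H(A) + H(B) - H(A,B); I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z)
I_x_yz = H(lambda x, y, z: x) + H(lambda x, y, z: (y, z)) - H(lambda x, y, z: (x, y, z))
I_x_z = H(lambda x, y, z: x) + H(lambda x, y, z: z) - H(lambda x, y, z: (x, z))
I_x_y_given_z = (H(lambda x, y, z: (x, z)) + H(lambda x, y, z: (y, z))
                 - H(lambda x, y, z: (x, y, z)) - H(lambda x, y, z: z))

print(f"I(X;Y,Z)          ≈ {I_x_yz:.6f}")
print(f"I(X;Z) + I(X;Y|Z) ≈ {I_x_z + I_x_y_given_z:.6f}")  # matches the line above
```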

Information identities

Basic identity relation

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

This is basically total uncertainty minus remaining uncertainty.

Joint entropy relation

H(X,Y)=H(X)+H(Y|X)=H(Y)+H(X|Y)

We can think of this like energy in a two-particle system, with total energy = energy of particle 1 + energy of particle 2 given particle 1, or if we take position and momentum, then $H(\text{position}, \text{momentum}) = H(\text{position}) + H(\text{momentum}\,|\,\text{position})$.

For example, we can take a weather system with temperature T and pressure P. Then:

H(T,P)=H(T)+H(P|T)

which means that the total uncertainty in temperature and pressure is the uncertainty in temperature, plus the uncertainty in pressure given that temperature.
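A small sketch (my own, not from the original notes) verifying this identity on the hot/cold vs clear/cloudy table from the Joint entropy section:

```python
# Check H(X,Y) = H(X) + H(Y|X) for the temperature/sky table, in bits.
from math import log2

joint = {
    ("hot", "clear"): 0.4, ("hot", "cloudy"): 0.1,
    ("cold", "clear"): 0.1, ("cold", "cloudy"): 0.4,
}

def entropy(probs):
    return sum(p * log2(1 / p) for p in probs if p > 0)

p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0) + p

# H(Y|X) = sum over x of p(x) * H(Y | X = x)
H_y_given_x = sum(
    px * entropy([joint[(x, y)] / px for y in ("clear", "cloudy")])
    for x, px in p_x.items()
)

print(f"H(X,Y)        ≈ {entropy(joint.values()):.3f} bits")               # ≈ 1.722
print(f"H(X) + H(Y|X) ≈ {entropy(p_x.values()) + H_y_given_x:.3f} bits")   # ≈ 1.722
```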

Chain rule for entropy

$$H(X_1^n) = H(X_1) + \sum_{i=2}^{n} H(X_i \,|\, X_1^{i-1})$$

We can think of this as being in a multi-particle system, and we're adding particles one by one: $H(\text{system}) = H(\text{particle}_1) + H(\text{particle}_2\,|\,\text{particle}_1) + H(\text{particle}_3\,|\,\text{particle}_2, \text{particle}_1) + \ldots$

Similar to the Joint entropy relation, the total entropy of the system is the initial entropy, plus the entropy of each subsequent variable given the previous ones.

Chain rule for mutual information

$$I(X_1^n; Y_1^m) = I(X_1; Y_1^m) + \sum_{i=2}^{n} I(X_i; Y_1^m \,|\, X_1^{i-1})$$

We can think of this like:

Information gained = Initial info + New info from 1st measurement + New info from 2nd measurement + ...

This is also similar to the Chain rule for entropy, where the total information of the system is the initial information we start with, plus the information we get from every subsequent state given the previous states.

Information inequalities/conditions for tightness

Non-negativity of entropy

$$H(X) \geq 0$$

with equality if $p_X(i)$ equals 0 or 1 for all $i \in \mathcal{X}$. This means that $p_X(x)$ equals 1 for one value, and 0 for everything else. Since there is no uncertainty in the distribution, entropy is therefore zero. For everything else, entropy can only be positive.

Log-cardinality bound on entropy

$$H(X) \leq \log(|\mathcal{X}|)$$

with equality if $p_X(i) = \frac{1}{|\mathcal{X}|}$ for all $i \in \mathcal{X}$. This means that $H(X) = \log(|\mathcal{X}|)$ if all probabilities are equal for all outcomes. This is maximum entropy, since the distribution doesn't favor any particular outcome. If there were any bias in the distribution, entropy would be lower than this.

Gibbs' inequality/non-negativity of KL divergence

$$D(p_X \| q_X) \geq 0$$

with equality if pX=qX for any pair of distributions pX and qX on the same alphabet X. Basically, since the KL divergence measures the difference between distributions, this difference can only be non-negative. The divergence is zero when the distributions are the same, since there is no difference between them, and positive otherwise.

Non-negativity of conditional entropies

$$H(X|Y) \geq 0$$

with equality if X is a deterministic function of Y. Basically, if X=f(Y) then the entropy of X given Y is zero, since X can be determined with full certainty if Y is known. For instance, if we take X to be momentum, and Y to be velocity, since momentum can be deterministically calculated from velocity (assuming known constant mass), the uncertainty in momentum X is 0. Otherwise, Non-negativity of entropy applies.

Non-negativity of mutual information

$$I(X;Y) \geq 0$$

with equality if X and Y are independent. Basically, if X and Y are completely unrelated, then the shared information between them is zero, since you can't know anything about X from Y, and vice versa. However, if they are related, then knowing something about Y will tell you something about X, and vice versa, so it is positive.

Additionally, I(X;X)=H(X), and so Non-negativity of entropy also applies.

Entropy bound on mutual information

$$I(X;Y) \leq \min[H(X),\, H(Y)]$$

with equality if either X is a deterministic function of Y, or Y is a deterministic function of X. If either variable determines the other, then knowing any one variable gives you knowledge of the full system, and so in that case I(X;Y)=H(X)=H(Y). However, in any other case, you can't get the full information of the system from any one variable, so the mutual information is less than any one of the variables.

Conditioning reduces entropy (C.R.E.)

$$H(X|Y) \leq H(X)$$

with equality if X is independent of Y. If X and Y are independent, then knowing Y gives no information about X, and so H(X|Y)=H(X). However, if X and Y are related, then knowing Y reduces the uncertainty about X, and so the uncertainty of X given Y is less than the uncertainty of X alone.

Convexity/concavity inequalities

WAT

I'm gonna be honest I have no clue what anything in this section means.

For a function $y = f(x)$, the gradient $\frac{dy}{dx}$ of a concave function decreases as $x$ increases. An example of a concave function is $\log$. On the other hand, the gradient of a convex function increases as $x$ increases.

Jensen's inequality

For a concave function f():

$$\sum_{i=1}^{k} \lambda_i f(x_i) \leq f\left(\sum_{i=1}^{k} \lambda_i x_i\right)$$

for $k$ vectors $x_1, x_2, \ldots, x_k \in \mathbb{R}$, and $\{\lambda_i\}_{i=1}^{k}$ such that $\lambda_i \geq 0$ for all $i$ and $\sum_i \lambda_i = 1$.

Let's think of a rope hanging in between two points:

A •         • B
   \       /
    \     / <- Actual rope position
     \   /
      \ /
       • C <- Average position

Imagine that's a smooth curve, I can't draw a curved line in ASCII art.

The average height of points on the rope, $\sum_i \lambda_i f(x_i)$, is less than the height of the rope at the average position, $f\left(\sum_i \lambda_i x_i\right)$. Here, $f(x)$ is a concave function that determines the height of the rope.
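Here's a tiny numeric check (my own, not from the original notes) of Jensen's inequality for the concave function $\log_2$, with some arbitrary example points and weights:

```python
# Jensen's inequality for the concave function log2: the weighted average of the logs
# never exceeds the log of the weighted average.
from math import log2

x = [1, 2, 4, 8]
weights = [0.4, 0.3, 0.2, 0.1]  # non-negative and summing to 1

avg_of_f = sum(w * log2(xi) for w, xi in zip(weights, x))
f_of_avg = log2(sum(w * xi for w, xi in zip(weights, x)))

print(f"sum λ_i log2(x_i) = {avg_of_f:.3f}")  # = 1.000
print(f"log2(sum λ_i x_i) = {f_of_avg:.3f}")  # ≈ 1.379, which is larger
```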

Convexity of KL divergence

$D(p_X \| q_X)$ is convex in the pair of distributions $(p_X, q_X)$. Since the KL divergence is convex, Jensen's inequality holds with the opposite direction:

$$\sum_i \lambda_i D(p_i \| q_i) \geq D\Big(\sum_i \lambda_i p_i \,\Big\|\, \sum_i \lambda_i q_i\Big)$$

So basically, the average divergence between pairs of distributions is at least the divergence between the averaged distributions.

Concavity/convexity of mutual information

If we take $p_{X,Y}(x,y) = p_X(x)\, p_{Y|X=x}(y)$:

  • For fixed $p_{Y|X=x}(\cdot)$, $I(X;Y)$ is concave in $p_X(x)$. This means that mixing input distributions gives at least the average of the individual mutual informations. Like entropy in thermodynamics, mixing two pure input states (one that always sends 1, and another that always sends 0) increases the uncertainty at the input, and the channel can then convey at least as much information as the average of the two pure cases (see the sketch after this list).
  • For fixed $p_X(x)$, $I(X;Y)$ is convex in $p_{Y|X}(y)$. This means that mixing channels is no better than the average of using each one consistently. It's like noise in measurements: if we randomise between two different measurement devices (say one gives consistently perfect results, and the other is complete noise), the result is generally worse than just using the better one. It can't get better by randomising between different devices.
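As a numeric illustration of the first bullet, here is a small sketch (my own, not from the original notes) using a binary symmetric channel with flip probability 0.1 (an assumed example channel): mixing two biased input distributions gives at least the average of their mutual informations.

```python
# Concavity of I(X;Y) in the input distribution for a fixed channel.
from math import log2

def mutual_information(p_x, channel):
    """I(X;Y) for input distribution p_x and channel rows channel[x][y] = p(y|x)."""
    p_y = [sum(p_x[x] * channel[x][y] for x in range(2)) for y in range(2)]
    return sum(
        p_x[x] * channel[x][y] * log2(channel[x][y] / p_y[y])
        for x in range(2) for y in range(2)
        if p_x[x] * channel[x][y] > 0
    )

bsc = [[0.9, 0.1], [0.1, 0.9]]  # fixed p(y|x): binary symmetric channel, flip prob 0.1

p1, p2 = [0.9, 0.1], [0.1, 0.9]
mix = [0.5 * a + 0.5 * b for a, b in zip(p1, p2)]  # = [0.5, 0.5]

avg = 0.5 * mutual_information(p1, bsc) + 0.5 * mutual_information(p2, bsc)
print(f"average of the two I's ≈ {avg:.3f} bits")                        # ≈ 0.211
print(f"I at the mixed input   ≈ {mutual_information(mix, bsc):.3f} bits")  # ≈ 0.531, larger
```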

Information inequalities for Markov chains

If we have a reversible Markov chain:

$$X \leftrightarrow Y \leftrightarrow Z$$

Data processing inequality (D.P.I.)

$$I(X;Z) \leq \min[I(X;Y),\, I(Y;Z)]$$

This is described in more detail in Markov chain. Basically, it means that information can only decrease as it's passed around. You can't get more information about X from Z than you had at Y.

Fano's inequality

$$\Pr(Z \neq X) \geq \frac{H(X|Y) - 1}{\log(|\mathcal{X}|)}$$

Let's say we have a Markov chain over a communication channel:

$$\underbrace{X}_{\text{Sent message}} \to \underbrace{Y}_{\text{Received signal}} \to \underbrace{Z}_{\text{Decoded message}}$$

Fano's inequality states that if the channel adds noise (i.e. increases $H(X|Y)$), there is a minimum error rate in decoding. It relates that minimum error rate to the noise in the channel and the size of the alphabet.
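As a quick numeric illustration (my own sketch, not from the original notes, with made-up values of $H(X|Y)$ and alphabet size):

```python
# Evaluating the Fano lower bound on the decoding error probability.
from math import log2

def fano_lower_bound(h_x_given_y, alphabet_size):
    """Pr(Z != X) >= (H(X|Y) - 1) / log2(|X|); clamped at 0 since probabilities can't be negative."""
    return max(0.0, (h_x_given_y - 1) / log2(alphabet_size))

# Example: a 16-symbol alphabet with 3 bits of remaining uncertainty after the channel
print(fano_lower_bound(h_x_given_y=3.0, alphabet_size=16))  # 0.5, so at least 50% errors
print(fano_lower_bound(h_x_given_y=1.0, alphabet_size=16))  # 0.0, the bound is vacuous here
```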

Primary use of information inequalities

Let's say we have a source coding problem. We need to take some data X, compress it into Y, and then decompress it to Z. This forms a Markov chain:

$$\underbrace{X}_{\text{Original data}} \to \underbrace{Y}_{\text{Compressed}} \to \underbrace{Z}_{\text{Decompressed data}}$$

The goal is to make Z as close to X as possible, while making Y as small as possible.

We can use Fano's inequality to show that it's impossible to compress data beyond a certain size without guaranteeing some errors. We can prove this by assuming we can compress better than the claimed limit, and then using Fano's inequality to show that this leads to a contradiction.

Say we have binary data $X$, and we try to compress $n$ bits to $n-1$ bits, then decompress back. If we apply Fano's inequality:

$$\Pr(Z \neq X) \geq \frac{H(X|Y) - 1}{\log(|\mathcal{X}|)}$$

If $X$ is uniformly distributed, $H(X) = n$ bits, and $H(X|Y) \geq 1$ bit since $Y$ carries at most $n-1$ bits. Fano's inequality then forces a nonzero error probability $\Pr(Z \neq X)$ whenever $H(X|Y)$ exceeds 1 bit, so some error must exist (a counting argument makes this exact: $2^n$ distinct messages cannot all be recovered from only $2^{n-1}$ codewords).

A physics analogy of this would be trying to prove perpetual motion is impossible. First, we assume that it works, then we show that it would violate energy conservation, and so we can prove that it is impossible.

The general strategy to prove something is impossible:

  1. Assume it's possible.
  2. Apply information inequalities:
    • Fano's inequality
    • Data processing inequality
    • Non-negativity of mutual information
  3. Reach contradiction.
  4. Therefore, the original assumption was wrong.