Week 2
Entropy
This is similar to the concept of entropy we know from physics. It measures the "uncertainty" or "randomness" in a signal. Like with physical entropy, the higher the value, the more disordered or uncertain the signal is. If we have a completely predictable signal (like always rolling a 6 on a die), then the entropy is 0. Maximum entropy occurs when all outcomes are equally likely (like a fair die).
Entropy measures disorder in a system. More possible states mean higher entropy, and a uniform distribution over the states gives maximum entropy. On the other hand, highly ordered states have low entropy. In information theory, the formula for entropy is:

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
Let's use a few examples to explain this:
- Let's say we have perfect certainty (like perfectly ordered crystals). Imagine a "signal" that's always the same value, like a broken thermometer that always shows 20ºC. Here, the probability of 20ºC = 1, and the probability of all other values is 0. Hence, the entropy is 0 (there is no uncertainty).
- Let's consider the binary case (like a two-state system). We can think of a coin flip. For a fair coin, $P(\text{heads}) = P(\text{tails}) = 0.5$. Hence, the entropy is $H = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1$ bit. This is the maximum entropy for a two-state system. For a biased coin, the entropy would be lower, as there is more certainty about the outcome.
- Let's consider a system with multiple states (like energy levels), where each state has a different probability. This produces a lower entropy than if all the states were equally likely: the entropy is still $H = -\sum_i p_i \log_2 p_i$, but the sum comes out below the uniform case. If all four states were equally likely (0.25 each), the entropy would be 2 bits, so as there is more "order" or predictability, the entropy is lower (see the sketch below).
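To make the arithmetic concrete, here is a minimal Python sketch of these examples. The four-state probabilities (0.5, 0.25, 0.125, 0.125) are made-up illustration values; only the certain, fair-coin, and uniform cases are pinned down above.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits: H = -sum p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([1.0]))                      # broken thermometer: 0 bits (no uncertainty)
print(entropy([0.5, 0.5]))                 # fair coin: 1 bit (maximum for two states)
print(entropy([0.9, 0.1]))                 # biased coin: ~0.47 bits (more certainty, less entropy)
print(entropy([0.5, 0.25, 0.125, 0.125]))  # hypothetical 4-level system: 1.75 bits
print(entropy([0.25] * 4))                 # uniform over 4 states: 2 bits (maximum)
```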
Some key properties:
- Minimum entropy ($H = 0$) occurs when the outcome is certain (probability = 1), like a system frozen in its ground state.
- Maximum entropy occurs when all outcomes are equally likely, like a system at infinite temperature.
- The entropy uses logarithms because that makes it additive (like in physics): the entropies of independent events can be summed, $H(X, Y) = H(X) + H(Y)$. Since the logarithms are in base 2, the entropy is measured in "bits".
For example, this concept can be used in weather forecasting. Imagine that on a summer day one outcome (say, sunny) is far more likely than the others: the forecast then has low entropy, whereas a day on which every kind of weather is equally likely has maximum entropy.
The big insight is that entropy measures the average uncertainty of our system. Just like in statistical mechanics where higher entropy means more possible microstates, in information theory higher entropy means that we need more "bits" on average to describe what's happening.
Joint entropy
This measures the total uncertainty when we have two related things, say $X$ and $Y$.
The joint entropy is defined as:

$$H(X, Y) = -\sum_{x, y} p(x, y) \log_2 p(x, y)$$

For example, if we take a simple weather system, with temperature (hot/cold) and sky (clear/cloudy) having the joint distribution:
|  | Clear | Cloudy | Total |
|---|---|---|---|
| Hot | 0.4 | 0.1 | 0.5 |
| Cold | 0.1 | 0.4 | 0.5 |
| Total | 0.5 | 0.5 | 1.0 |
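As a quick check, a minimal Python sketch that applies the joint-entropy definition to the table above (the 0.4/0.1 values are the ones in the table):

```python
from math import log2

# Joint distribution p(temperature, sky) from the table above.
p_joint = {("hot", "clear"): 0.4, ("hot", "cloudy"): 0.1,
           ("cold", "clear"): 0.1, ("cold", "cloudy"): 0.4}

# H(X, Y) = -sum over all (x, y) of p(x, y) * log2 p(x, y)
H_joint = -sum(p * log2(p) for p in p_joint.values() if p > 0)
print(H_joint)  # ~1.72 bits, less than H(temperature) + H(sky) = 1 + 1 = 2 bits
```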
Some key properties:
- Chain rule (like in physics, where we can break down total energy): $H(X, Y) = H(X) + H(Y \mid X)$. This basically just means that the total uncertainty is the uncertainty in $X$ plus the remaining uncertainty in $Y$ once $X$ is known, just like a total energy can be split into the energy of one part plus the energy of the rest given that part.
- Upper and lower bounds: $\max(H(X), H(Y)) \le H(X, Y) \le H(X) + H(Y)$, with equality on the upper bound only if $X$ and $Y$ are independent. This is similar to how the total energy of interacting particles can be less than the sum of the individual energies.
There are a few physical analogies we can use to explain different types of systems with joint entropy:
Independent systems: This is like two non-interacting particles. The total entropy is just the sum of the individual entropies: $H(X, Y) = H(X) + H(Y)$.
For example, the temperatures in New York and Tokyo can be considered independent systems.
Completely dependent systems: This is like rigidly connected particles, where knowing one variable fixes the other, so $H(X, Y) = H(X) = H(Y)$.
For example, the temperature in ºC and the temperature in ºF form a completely dependent system, since knowing one tells you the other.
Partially dependent systems: This is like weakly interacting particles, where $\max(H(X), H(Y)) < H(X, Y) < H(X) + H(Y)$.
For example, the temperature and pressure in a room form a partially dependent system, since knowing the temperature gives you some idea about the pressure, but not its exact value.
Let's look at a two-particle system, for example. Say each particle's spin can be "up" or "down" with equal probability, so $H(X) = H(Y) = 1$ bit. If these particles are completely independent, then each combination has probability $p(x, y) = p(x)\,p(y) = 0.25$, and the joint entropy is $H(X, Y) = -4 \times 0.25 \log_2 0.25 = 2$ bits, which is the maximum possible. If the particles are entangled so that their spins always agree, $p(\uparrow\uparrow) = p(\downarrow\downarrow) = 0.5$ and $p(\uparrow\downarrow) = p(\downarrow\uparrow) = 0$, then the joint entropy is $H(X, Y) = -2 \times 0.5 \log_2 0.5 = 1$ bit. The independent case also matches the rule from earlier: $H(X, Y) = H(X) + H(Y) = 1 + 1 = 2$ bits.
To visualise how joint entropy works, we can imagine the x-axis being the first variable, the y-axis being the second variable, and the z-axis being the weighted surprise $-p(x, y) \log_2 p(x, y)$ for each pair of values. Adding up the height of every bar (the "volume" under this surface) gives the joint entropy.
The key insight is that joint entropy tells us how much total uncertainty exists in a system with two variables, accounting for any relationships between them. The joint entropy can also be extended to any number of variables over possibly different alphabets:

$$H(X_1, \ldots, X_n) = -\sum_{x_1, \ldots, x_n} p(x_1, \ldots, x_n) \log_2 p(x_1, \ldots, x_n)$$
Conditional entropy
This asks, "If I know
It can be like measuring position after measuring momentum in quantum mechanics, or like knowing the remaining uncertainty in energy after measuring temperature.
$$H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x) = -\sum_{x, y} p(x, y) \log_2 p(y \mid x)$$

where $p(y \mid x)$ is the probability of $y$ given that $X = x$, and the entropy of each conditional distribution is weighted by how likely that value of $X$ is.
For example, we can consider a temperature and weather system: if it's clear, the temperature follows one distribution, and if it's cloudy, it follows another. The conditional entropy weights the entropy of each of these conditional distributions by how likely that weather is:

$$H(T \mid W) = \sum_{w} p(w)\, H(T \mid W = w)$$

We can therefore calculate the entropy of the temperature given the weather, as in the sketch below.
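As a sketch, we can reuse the hot/cold × clear/cloudy table from the joint-entropy section and compute $H(\text{temperature} \mid \text{sky})$ in Python; any other conditional distributions would work the same way.

```python
from math import log2

# Joint distribution p(temperature, sky) from the joint-entropy table.
p_joint = {("hot", "clear"): 0.4, ("hot", "cloudy"): 0.1,
           ("cold", "clear"): 0.1, ("cold", "cloudy"): 0.4}

skies = {s for (_, s) in p_joint}
H_cond = 0.0
for s in skies:
    p_s = sum(p for (t, sky), p in p_joint.items() if sky == s)      # p(sky = s)
    cond = [p / p_s for (t, sky), p in p_joint.items() if sky == s]  # p(temp | sky = s)
    H_cond += p_s * -sum(q * log2(q) for q in cond if q > 0)         # weight H(T | S = s) by p(s)

print(H_cond)  # H(temperature | sky) ~ 0.72 bits, less than H(temperature) = 1 bit
```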
Some key properties:
- Conditioning reduces entropy (like making a measurement reduces uncertainty): $H(Y \mid X) \le H(Y)$. Knowing $X$ never increases the uncertainty about $Y$. This is like how measuring position gives you some information about the momentum (though limited by the uncertainty principle).
- Perfect knowledge: $H(Y \mid X) = 0$ if $Y$ is completely determined by $X$. This is like how knowing velocity completely determines the momentum for a given mass, or how the temperature in ºC completely determines the temperature in ºF.
- Independence: $H(Y \mid X) = H(Y)$ if $X$ and $Y$ are independent. For example, measuring the spin in the $z$ direction tells us nothing about the spin in the $x$ direction, or the temperature in New York tells us nothing about the temperature in Tokyo.
Let's take an example of predicting the weather. Suppose we are trying to predict whether the evening will be hot or cold given the morning temperature: if the morning was cold, the evening is most likely cold but could still turn hot, and if the morning was hot, the evening is most likely hot but could still turn cold.
The conditional entropy here represents how uncertain we are about the evening temperature after knowing the morning temperature. This is lower than the total entropy of the evening temperature alone, but not zero (since the weather isn't perfectly predictable).
We can think of conditional entropy as starting with the total uncertainty $H(Y)$ and subtracting the information that $X$ provides about $Y$: $H(Y \mid X) = H(Y) - I(X; Y)$.
This connects to Mutual information.
Kullback-Leibler (KL) Divergence
KL divergence measures how different two probability distributions $P$ and $Q$ are:

$$D(P \,\|\, Q) = \sum_{x} P(x) \log_2 \frac{P(x)}{Q(x)}$$

For example, we can take two temperature distributions, a "true" one $P$ and a model $Q$ of it (the sketch below uses made-up numbers).
However, the KL divergence is asymmetrical: in general, $D(P \,\|\, Q) \ne D(Q \,\|\, P)$, so it is not a true distance.
Another property of the KL divergence is non-negativity, or Gibbs' inequality: $D(P \,\|\, Q) \ge 0$,
with equality if and only if $P = Q$.
Basically, we can think of KL divergence as the information lost when using $Q$ to approximate the true distribution $P$.
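A minimal Python sketch with two made-up temperature distributions (a "true" $P$ and a uniform model $Q$), showing both Gibbs' inequality and the asymmetry:

```python
from math import log2

def kl(p, q):
    """D(P || Q) = sum P(x) * log2(P(x) / Q(x)), in bits. Assumes Q(x) > 0 wherever P(x) > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.8, 0.15, 0.05]      # hypothetical "true" temperature distribution (cold, mild, hot)
q = [1 / 3, 1 / 3, 1 / 3]  # a uniform model of it

print(kl(p, q))  # ~0.70 bits
print(kl(q, p))  # ~0.88 bits -- different: D(P||Q) != D(Q||P)
print(kl(p, p))  # 0.0: equality in Gibbs' inequality when the distributions match
```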
Mutual information
This measures how much knowing about one thing tells you about another. It's like correlation in physics, but more general. For example, we can ask, "how much does knowing the air pressure tell us about the likelihood of rain?"
Mutual information is like measuring how much two variables "know about each other".
This can also be written in a few different ways:

$$I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) = H(X) + H(Y) - H(X, Y) = D\big(p(x, y) \,\|\, p(x)\,p(y)\big)$$
We can use the example of quantum entanglement to explain this. If we have perfect entanglement, measuring one particle tells us exactly what we would get from the other, so $I(X; Y) = H(X) = H(Y)$.
This has high mutual information. On the other hand, if there is no entanglement, the two measurement outcomes are independent, $p(x, y) = p(x)\,p(y)$.
So there is zero mutual information.
In classical physics, we can consider the ideal gas equation $PV = nRT$: for a fixed amount of gas in a fixed volume, knowing the pressure tells us a lot about the temperature, so the two variables share information.
Mutual information has a property of non-negativity: $I(X; Y) \ge 0$,
with equality if and only if $X$ and $Y$ are independent.
Mutual information is also symmetric: $I(X; Y) = I(Y; X)$.
This is like how correlation functions in physics are symmetric.
We can think of this like a Venn diagram, where the circles are entropies, and the overlap between them is the mutual information.
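A small Python sketch of this "overlap" picture, again using the hot/cold × clear/cloudy table from the joint-entropy section:

```python
from math import log2

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Joint distribution p(temperature, sky) from the joint-entropy table.
p_joint = {("hot", "clear"): 0.4, ("hot", "cloudy"): 0.1,
           ("cold", "clear"): 0.1, ("cold", "cloudy"): 0.4}

# Marginals p(temperature) and p(sky)
p_temp, p_sky = {}, {}
for (t, s), p in p_joint.items():
    p_temp[t] = p_temp.get(t, 0.0) + p
    p_sky[s] = p_sky.get(s, 0.0) + p

# I(T; S) = H(T) + H(S) - H(T, S): the overlap of the two entropy "circles"
I = H(p_temp.values()) + H(p_sky.values()) - H(p_joint.values())
print(I)  # ~0.28 bits of shared information
```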
For example, if we're trying to predict the weather for the next day from today's air pressure:
- Low pressure → likely rain
- High pressure → likely clear
However, this isn't perfect, since the weather is somewhat unpredictable: the mutual information $I(\text{pressure}; \text{weather})$ is positive, but smaller than the full uncertainty $H(\text{weather})$.
There is a data processing inequality for mutual information: if we have a Markov chain $X \to Y \to Z$ (each stage depends only on the previous one), then $I(X; Z) \le I(X; Y)$.
This shows that information/energy can only degrade through a system.
Markov chain
We can think of a Markov chain as the path of a particle where its next position depends only on where it is now, not how it got there.
For example, with a weather Markov chain, we can say that tomorrow's weather only depends on today's weather, not yesterday's: if it's sunny today, tomorrow is probably sunny with a small chance of rain, but if it's rainy today, the probabilities shift towards rain (the sketch below uses made-up transition probabilities).
A physical analogy could be a random walk, where the next step only depends on the current position of the particle, not how it got to that position. Another example could be quantum measurements, where each measurement only depends on the previous state, not on any further history.
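Here is a minimal Python sketch of such a weather Markov chain; the transition probabilities are made up purely for illustration:

```python
import random

# Hypothetical weather Markov chain: tomorrow depends only on today.
# transitions[today][tomorrow] = probability (made-up numbers).
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(today):
    """Sample tomorrow's weather given only today's state (no longer history)."""
    states, probs = zip(*transitions[today].items())
    return random.choices(states, weights=probs)[0]

weather = "sunny"
history = [weather]
for _ in range(10):
    weather = step(weather)   # the next state depends only on the current state
    history.append(weather)
print(history)
```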
For the data processing inequality (DPI) in Mutual information, we can think of it as: information can never be gained, only lost. For instance, let's say we're measuring temperature. We have a Markov chain: true temperature → thermometer reading → digital display.
Each step can lose information: the thermometer has limited precision and may be slightly inaccurate, and the digital display rounds numbers. Hence, we can't know more about the true temperature from the display than we could from the thermometer reading itself: $I(\text{temperature}; \text{display}) \le I(\text{temperature}; \text{reading})$.
Markov chains are memory-less: $P(X_{n+1} \mid X_n, X_{n-1}, \ldots, X_1) = P(X_{n+1} \mid X_n)$.
Only the current state of the system matters.
Markov chains can also be reversible. A reversible Markov chain, run in its stationary distribution $\pi$, satisfies detailed balance:

$$\pi(a)\, P(a \to b) = \pi(b)\, P(b \to a)$$

so the probability flow between any two states is the same in both directions, and the chain looks statistically the same run forwards or backwards.
This is like reversible processes in thermodynamics, or reversible chemical reactions. Reversibility means that the system has no preferred direction.
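As a sketch of what "no preferred direction" means, we can check detailed balance for the made-up two-state weather chain above: in its stationary distribution $\pi$, the flow sunny→rainy equals the flow rainy→sunny.

```python
# Detailed-balance check for the hypothetical 2-state weather chain above.
P = {("sunny", "sunny"): 0.8, ("sunny", "rainy"): 0.2,
     ("rainy", "sunny"): 0.4, ("rainy", "rainy"): 0.6}

# Stationary distribution: pi("sunny") * 0.2 = pi("rainy") * 0.4  =>  pi = (2/3, 1/3)
pi = {"sunny": 2 / 3, "rainy": 1 / 3}

forward = pi["sunny"] * P[("sunny", "rainy")]   # flow sunny -> rainy
backward = pi["rainy"] * P[("rainy", "sunny")]  # flow rainy -> sunny
print(forward, backward)  # both ~0.133: no preferred direction, so this chain is reversible
```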
Conditional mutual information
This measures how much $X$ tells us about $Y$ when we already know $Z$:

$$I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)$$

We can think of it as how much information $X$ and $Y$ share beyond what $Z$ already explains.
For example, let's say we have three particles: if two of them only interact through the third, then once we know the third particle's state, the other two tell us nothing extra about each other.
In classical physics, this can be temperature, pressure, and volume in a gas: pressure and volume are related, but part of that relationship is already accounted for once the temperature is known.
Conditional mutual information is non-negative: $I(X; Y \mid Z) \ge 0$.
This is just like the other information measures: we can't have negative information, just like how we can't have negative energy in physics.
We can also chain mutual information: $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$,
and more generally: $I(X; Y_1, \ldots, Y_n) = \sum_{i=1}^{n} I(X; Y_i \mid Y_1, \ldots, Y_{i-1})$.
This is like decomposing forces in physics.
For example, if we have today's temperature and, say, today's humidity, and we want to predict tomorrow's rain, the information they jointly provide about the rain splits into what the temperature tells us, plus what the humidity adds once the temperature is already known (see the sketch below).
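A minimal Python sketch of the chain rule $I(X; Y, Z) = I(X; Y) + I(X; Z \mid Y)$, computed from a made-up joint distribution over three binary variables (any joint distribution would do):

```python
from math import log2

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint distribution p(x, y, z) over three binary variables (sums to 1).
p = {
    (0, 0, 0): 0.20, (0, 0, 1): 0.10, (0, 1, 0): 0.05, (0, 1, 1): 0.15,
    (1, 0, 0): 0.05, (1, 0, 1): 0.10, (1, 1, 0): 0.20, (1, 1, 1): 0.15,
}

def marginal(keep):
    """Marginal distribution over the coordinates listed in `keep` (0 = X, 1 = Y, 2 = Z)."""
    out = {}
    for xyz, prob in p.items():
        key = tuple(xyz[i] for i in keep)
        out[key] = out.get(key, 0.0) + prob
    return out

Hx, Hy = H(marginal([0]).values()), H(marginal([1]).values())
Hxy, Hyz = H(marginal([0, 1]).values()), H(marginal([1, 2]).values())
Hxyz = H(p.values())

I_xy = Hx + Hy - Hxy                  # I(X; Y)
I_xz_given_y = Hxy + Hyz - Hy - Hxyz  # I(X; Z | Y), always non-negative
I_x_yz = Hx + Hyz - Hxyz              # I(X; Y, Z)

print(I_x_yz, I_xy + I_xz_given_y)    # chain rule: the two numbers agree
```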
Information identities
Basic identity relation
This is basically the statement that mutual information is the reduction in uncertainty about one variable from learning the other: $I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)$.
Joint entropy relation
The joint entropy relation is $H(X, Y) = H(X) + H(Y \mid X)$. We can think of this like energy in a two-particle system, with the total energy being the energy of the first particle plus the energy of the second given the first.
For example, we can take a weather system with temperature $T$ and pressure $P$: $H(T, P) = H(T) + H(P \mid T)$,
which means that the total uncertainty in temperature and pressure is the uncertainty in temperature, plus the uncertainty in pressure given that temperature.
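As a numerical sanity check of $H(X, Y) = H(X) + H(Y \mid X)$, here is a short Python sketch reusing the hot/cold × clear/cloudy table from the joint-entropy section:

```python
from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Same joint table p(temperature, sky) as in the joint-entropy section.
p_joint = {("hot", "clear"): 0.4, ("hot", "cloudy"): 0.1,
           ("cold", "clear"): 0.1, ("cold", "cloudy"): 0.4}

p_temp = {"hot": 0.5, "cold": 0.5}  # marginal distribution of temperature

# H(S | T) = sum over t of p(t) * H(S | T = t)
H_sky_given_temp = sum(
    p_t * H([p_joint[(t, s)] / p_t for s in ("clear", "cloudy")])
    for t, p_t in p_temp.items()
)

print(H(p_joint.values()))                    # H(T, S) ~ 1.72 bits
print(H(p_temp.values()) + H_sky_given_temp)  # H(T) + H(S | T) ~ 1.72 bits as well
```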
Chain rule for entropy
We can think of this as being in a multi-particle system, where we're adding particles one by one:

$$H(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})$$

Similar to the Joint entropy relation, the total entropy of the system is the entropy of the first variable, plus the entropy of each subsequent variable given all the previous ones.
Chain rule for mutual information
We can think of this like:

$$I(X_1, \ldots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})$$

This is also similar to the Chain rule for entropy: the total information the variables give about $Y$ is the information from the first one, plus the information each subsequent one adds given all the previous ones.
Information inequalities/conditions for tightness
Non-negativity of entropy
$H(X) \ge 0$, with equality if and only if $X$ is deterministic (one outcome has probability 1).
Log-cardinality bound on entropy
$H(X) \le \log_2 |\mathcal{X}|$, where $\mathcal{X}$ is the alphabet of $X$, with equality if and only if $X$ is uniformly distributed over $\mathcal{X}$.
Gibbs' inequality/non-negativity of KL divergence
$D(P \,\|\, Q) \ge 0$, with equality if and only if $P = Q$.
Non-negativity of conditional entropies
$H(X \mid Y) \ge 0$, with equality if and only if $X$ is completely determined by $Y$.
Non-negativity of mutual information
$I(X; Y) \ge 0$, with equality if and only if $X$ and $Y$ are independent.
Additionally, conditional mutual information is also non-negative: $I(X; Y \mid Z) \ge 0$.
Entropy bound on mutual information
$I(X; Y) \le \min(H(X), H(Y))$, with equality if either $X$ is completely determined by $Y$ or $Y$ is completely determined by $X$.
Conditioning reduces entropy (C.R.E.)
$H(X \mid Y) \le H(X)$, with equality if and only if $X$ and $Y$ are independent.
Convexity/concavity inequalities
WAT
I'm gonna be honest I have no clue what anything in this section means.
For a function $f$, convexity means $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$ for all $0 \le \lambda \le 1$ (the chord lies above the curve); concavity is the reverse inequality (the chord lies below the curve).
Jensen's inequality
For a concave function $f$, Jensen's inequality says $E[f(X)] \le f(E[X])$
for any random variable $X$ (for a convex function, the inequality is reversed).
Let's think of a rope hanging between two points:

```
A •           • B
   \         /
    \       /    <- Actual rope position
     \     /
      \   /
       • C       <- Average position
```

(Imagine that's a smooth curve; it's hard to draw a curved line in ASCII art.)

The average height of points on the rope is higher than the height of the rope at the average horizontal position (the sag at C), because a hanging rope is a convex curve. For a concave curve (an arch), the inequality flips, and that is exactly Jensen's inequality.
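A tiny Python sketch of Jensen's inequality for the concave function $f(x) = \log_2 x$, using made-up sample points: the average of $f$ comes out below $f$ of the average.

```python
from math import log2

# Jensen's inequality for the concave function f(x) = log2(x):
# the average of f is at most f of the average.
xs = [1, 2, 4, 8, 16]   # hypothetical sample points
ps = [0.2] * 5          # equal weights (any probabilities summing to 1 work)

avg_of_f = sum(p * log2(x) for p, x in zip(ps, xs))  # E[f(X)]
f_of_avg = log2(sum(p * x for p, x in zip(ps, xs)))  # f(E[X])

print(avg_of_f, f_of_avg)  # 2.0 <= log2(6.2) ~ 2.63, as Jensen predicts for a concave f
```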
Convexity of KL divergence
$D(P \,\|\, Q)$ is convex in the pair $(P, Q)$: $D(\lambda P_1 + (1 - \lambda) P_2 \,\|\, \lambda Q_1 + (1 - \lambda) Q_2) \le \lambda D(P_1 \,\|\, Q_1) + (1 - \lambda) D(P_2 \,\|\, Q_2)$. So basically, the divergence between the averaged distributions is at most the average of the divergences: mixing distributions can only bring them closer together in the KL sense.
Concavity/convexity of mutual information
If we take the mutual information $I(X; Y)$ as a function of the input distribution $p(x)$ and the channel $p(y \mid x)$:
- For a fixed channel $p(y \mid x)$, $I(X; Y)$ is concave in $p(x)$. This means that mixing input distributions gives at least the average of the informations of the individual inputs. Like entropy in thermodynamics, if we mix two pure states (like one that always returns 1, and another that always returns 0), this increases uncertainty (since now there's a chance of 1 and a chance of 0); the more uniform the mixed distribution is (closer to 50-50), the higher the entropy, and the mutual information of the mixture is at least as large as the average of the two.
- For a fixed input $p(x)$, $I(X; Y)$ is convex in $p(y \mid x)$. This means that randomly mixing channels is never better than the average of using each one consistently. It's like noise in measurements: if we randomise between two measurement devices (say one gives consistently perfect results, and the other is complete noise), the result is generally worse than just using the better one. It can't get better by randomising between devices.
Information inequalities for Markov chains
If we have a Markov chain $X \to Y \to Z$ (note that this is automatically reversible: it is equivalent to $Z \to Y \to X$), then the following holds.
Data processing inequality (D.P.I.)
$I(X; Z) \le I(X; Y)$ and $I(X; Z) \le I(Y; Z)$. This is described in more detail in Markov chain. Basically, it means that information can only decrease as it's passed along the chain: you can't get more information about $X$ from the processed output $Z$ than $Y$ already carries about $X$.
Fano's inequality
Let's say we have a Markov chain over a communication channel: $X \to Y \to \hat{X}$, where $Y$ is the received signal and $\hat{X}$ is our decoded guess of $X$.
Fano's inequality states that there is a minimum error rate in decoding if the channel loses information ($H(X \mid Y) > 0$). Writing $P_e = P(\hat{X} \ne X)$:

$$H_b(P_e) + P_e \log_2(|\mathcal{X}| - 1) \ge H(X \mid \hat{X}) \ge H(X \mid Y)$$

where $H_b$ is the binary entropy function and $\mathcal{X}$ is the alphabet of $X$, so $P_e$ cannot be pushed all the way to zero.
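As a sketch, we can numerically invert Fano's inequality to get the smallest error probability consistent with a given residual uncertainty; the 1.5 bits and the 8-symbol alphabet below are made-up numbers for illustration.

```python
from math import log2

def h_b(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def fano_min_error(H_X_given_Y, alphabet_size, steps=100000):
    """Smallest Pe consistent with Fano: H(X|Y) <= h_b(Pe) + Pe * log2(|X| - 1)."""
    for i in range(steps + 1):
        pe = i / steps
        if h_b(pe) + pe * log2(alphabet_size - 1) >= H_X_given_Y:
            return pe
    return 1.0

# Hypothetical numbers: 1.5 bits of residual uncertainty, 8 possible symbols.
print(fano_min_error(1.5, 8))  # ~0.25: no decoder can do better than ~25% error
```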
Primary use of information inequalities
Let's say we have a source coding problem. We need to take some data $X$, compress it into a shorter description, and then decode that description back into an estimate $\hat{X}$.
The goal is to make the compressed description as short as possible while keeping the probability of error $P(\hat{X} \ne X)$ small.
We can use Fano's inequality to show that it's impossible to compress data beyond a certain size without guaranteeing some errors. We can prove this by assuming we can compress better than the claimed limit, and then using Fano's inequality to show that this leads to contradiction.
Say we have binary data, for example $n$ fair coin flips, so its entropy is $n$ bits.
If we try to compress it into fewer than $n$ bits and then decode it, Fano's inequality forces the error probability to be bounded away from zero: there simply aren't enough short descriptions to cover all the equally likely sequences.
A physics analogy of this would be trying to prove perpetual motion is impossible. First, we assume that it works, then we show that it would violate energy conservation, and so we can prove that it is impossible.
The general strategy to prove something is impossible:
- Assume it's possible.
- Apply information inequalities:
- Fano's inequality
- Data processing inequality
- Non-negativity of mutual information
- Reach contradiction.
- Therefore, the original assumption was wrong.