Week 1
Big-O notation
There are different flavors of asymptotic complexity notation:
- $f(n) = O(g(n))$: $f$ grows no faster than $g$ as $n$ gets large. Formally: $\limsup_{n\to\infty} f(n)/g(n) < \infty$.
- $f(n) = o(g(n))$: $f$ grows strictly slower than $g$. Formally: $\lim_{n\to\infty} f(n)/g(n) = 0$.
- $f(n) = \Omega(g(n))$: $f$ grows no slower than $g$. Formally: $\liminf_{n\to\infty} f(n)/g(n) > 0$.
- $f(n) = \omega(g(n))$: $f$ grows strictly faster than $g$. Formally: $\lim_{n\to\infty} f(n)/g(n) = \infty$.
- $f(n) = \Theta(g(n))$: $f$ and $g$ grow at the same rate. Formally: $f(n) = \Theta(g(n))$ if $f(n) = O(g(n))$ and $f(n) = \Omega(g(n))$.
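As a quick worked example (the functions here are illustrative choices, not from the course):

$$3n^2 + 5n = O(n^2) \ \text{ since } \ \lim_{n\to\infty}\frac{3n^2 + 5n}{n^2} = 3 < \infty, \qquad 5n = o(n^2) \ \text{ since } \ \lim_{n\to\infty}\frac{5n}{n^2} = 0.$$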
This notation is crucial for analysing the complexity and scaling behavior of algorithms and information-theoretic quantities.
Stirling's approximation
Stirling's approximation for factorials:

$$n! \approx \sqrt{2\pi n}\,\left(\frac{n}{e}\right)^n, \qquad \text{equivalently} \qquad \log_2 n! = n\log_2 n - n\log_2 e + O(\log n).$$

This is a useful approximation when working with factorials, which often come up in combinatorics and information theory. The key takeaway is that for large $n$, $\log_2 n!$ behaves like $n\log_2 n - n\log_2 e$, with only a logarithmic correction.
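A quick numerical check (Python), comparing $\log_2 n!$ with the leading terms of Stirling's approximation:

```python
import math

# Compare log2(n!) with the Stirling-based estimate n*log2(n) - n*log2(e);
# the gap grows only like O(log n).
for n in [10, 100, 1000]:
    exact = math.lgamma(n + 1) / math.log(2)           # log2(n!)
    approx = n * math.log2(n) - n * math.log2(math.e)  # leading Stirling terms
    print(f"n={n:5d}  log2(n!)={exact:10.2f}  estimate={approx:10.2f}  gap={exact - approx:5.2f}")
```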
Binary random variables
Some specific results for binary random variables:

$\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$ is the sample mean, where $n$ is the total number of binary random variables, or elements in the binary vector, being analysed. For i.i.d. $X_i \sim \mathrm{Bernoulli}(p)$ we have $\mathbb{E}[\bar{X}_n] = p$.
This shows how the sample mean concentrates around the expected value for i.i.d. binary trials.
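A small simulation of this concentration (Python; the bias $p$ and the values of $n$ are illustrative choices):

```python
import random

# The sample mean of n i.i.d. Bernoulli(p) variables concentrates around p
# as n grows; p here is an illustrative value, not from the notes.
p = 0.3
for n in [10, 100, 10_000]:
    xs = [1 if random.random() < p else 0 for _ in range(n)]
    mean = sum(xs) / n
    print(f"n={n:6d}  sample mean={mean:.3f}  |mean - p|={abs(mean - p):.3f}")
```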
Types and type classes
For vectors $x^n \in \mathcal{X}^n$:
- The type $\hat{P}_{x^n}$ of a vector $x^n$ is a length-$|\mathcal{X}|$ probability vector, where the $a$-th component is the fraction of occurrences of the symbol $a$ in $x^n$.
- The type-class $T(\hat{P})$ is the set of all vectors $x^n \in \mathcal{X}^n$ such that $\hat{P}_{x^n} = \hat{P}$.
Some key results about the size of type-classes:
- The number of types of length-$n$ vectors over an alphabet $\mathcal{X}$ satisfies $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$.
- For a binary vector with $np$ ones, the size of its type-class is approximately $2^{nH_2(p)}$ (up to polynomial factors in $n$), where $H_2(p) = -p\log_2 p - (1-p)\log_2(1-p)$ is the binary entropy function.
Alphabet and vector space:
- $\mathcal{X}$ represents the alphabet or set of possible symbols and values. In the binary case, $\mathcal{X} = \{0, 1\}$.
- $x^n$ represents a vector of length $n$ containing elements from the alphabet $\mathcal{X}$, i.e. $x^n \in \mathcal{X}^n$.

Types:
- The type $\hat{P}_{x^n}$ of a vector $x^n$ is a length-$|\mathcal{X}|$ probability vector, where the $a$-th component is the fraction of occurrences of the symbol $a$ in $x^n$.
- The type captures the empirical distribution or the histogram of the symbols in the vector $x^n$.

Type classes:
- The type-class $T(\hat{P})$ is the set of all vectors $x^n \in \mathcal{X}^n$ such that $\hat{P}_{x^n} = \hat{P}$.
- The type-class contains all vectors that have the same empirical distribution or type.
Size of type-classes:
- The number of types of length-$n$ vectors over the alphabet $\mathcal{X}$ satisfies $|\mathcal{P}_n| \le (n+1)^{|\mathcal{X}|}$.
- This provides an upper bound on the number of possible types for vectors of length $n$ over the alphabet $\mathcal{X}$.
- For a binary vector $x^n$ with $np$ ones, the size of its type-class is approximately $2^{nH_2(p)}$, where $H_2$ is the binary entropy function.
- This shows that the size of the type-class grows exponentially with the length of the vector $n$, and the growth rate is determined by the binary entropy of the type $p$.
$p$:
- In the context of the binary vector $x^n$, $p$ represents the fraction or proportion of ones in the vector.
- Specifically, if the binary vector $x^n$ has $k$ ones, then $p = k/n$ is the ratio of the number of 1s to the total length of the vector.
- So $p$ is the empirical probability or relative frequency of the symbol $1$ appearing in the binary vector $x^n$.
$\mathcal{P}_n$:
- This represents the set of all possible types (empirical distributions) of length-$n$ vectors over the alphabet $\mathcal{X}$.
- In other words, $\mathcal{P}_n$ is the set of all length-$|\mathcal{X}|$ probability vectors that can arise as the type of some length-$n$ vector over the alphabet $\mathcal{X}$.
- The size of this set, $|\mathcal{P}_n|$, is bounded above by $(n+1)^{|\mathcal{X}|}$.
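A minimal sketch of these definitions in Python (the alphabet and vector are made-up examples, not from the notes):

```python
from collections import Counter
from itertools import permutations

# Compute the type (empirical distribution) of a vector and enumerate its
# type-class for a small, made-up example.
alphabet = ["a", "b", "c"]
x = ("a", "b", "a", "c", "a")   # length-5 vector over the alphabet

n = len(x)
counts = Counter(x)
type_of_x = {s: counts[s] / n for s in alphabet}
print("type:", type_of_x)                      # {'a': 0.6, 'b': 0.2, 'c': 0.2}

# The type-class is the set of distinct rearrangements of x (same histogram).
type_class = set(permutations(x))
print("size of type-class:", len(type_class))  # 5! / (3! 1! 1!) = 20
```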
Example
Let's say we have a binary alphabet $\mathcal{X} = \{0, 1\}$ and a length-5 vector with 3 ones and 2 zeros, for instance $x^5 = 11010$.
- Calculating the type $\hat{P}_{x^5}$:
  - The type $\hat{P}_{x^5}$ is a length-2 probability vector that represents the empirical distribution of the symbols in $x^5$.
  - In this case, there are 3 ones and 2 zeros in $x^5$.
  - The type $\hat{P}_{x^5} = (2/5, 3/5)$, since the fraction of 0's is $2/5$ and the fraction of 1's is $3/5$.
- Calculating the type-class $T(\hat{P}_{x^5})$:
  - The type-class is the set of all vectors that have the same type as $x^5$.
  - In other words, $T(\hat{P}_{x^5})$ contains all binary vectors of length 5 that have 3 ones and 2 zeros, in any order.
  - Some examples of vectors in $T(\hat{P}_{x^5})$ are: $11100$, $10101$, $01011$, etc.
- Size of the type-class $T(\hat{P}_{x^5})$:
  - For a binary vector $x^n$ with $np$ ones, the size of the type-class is approximately $2^{nH_2(p)}$, where $H_2$ is the binary entropy function.
  - In this case, $p = 3/5 = 0.6$, so $H_2(0.6) \approx 0.971$.
  - The size of the type-class is then approximately $2^{5 \times 0.971} \approx 2^{4.86} \approx 29$. However, the true value is $\binom{5}{3} = 10$.
WAT. The approximation doesn't seem to work. The issue is that $2^{nH_2(p)}$ only captures the exponential growth rate: the exact bounds are $\frac{1}{n+1}2^{nH_2(p)} \le \binom{n}{np} \le 2^{nH_2(p)}$, so the estimate is off by polynomial factors in $n$, which dominate for a vector as short as $n = 5$. As $n$ grows, $\frac{1}{n}\log_2\binom{n}{np} \to H_2(p)$ and the approximation becomes accurate (see the sketch below).
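A quick check of this in Python (the larger values of $n$ are illustrative choices):

```python
from math import comb, log2

def H2(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

# Compare the exact type-class size C(n, k) with 2^{n H2(k/n)}: for n = 5,
# k = 3 the estimate is poor (~29 vs 10), but the per-symbol exponents
# (1/n) log2 C(n, k) and H2(k/n) converge as n grows.
for n, k in [(5, 3), (50, 30), (500, 300)]:
    exact_bits = log2(comb(n, k))
    approx_bits = n * H2(k / n)
    print(f"n={n:4d}  exact = 2^{exact_bits:7.2f}  estimate = 2^{approx_bits:7.2f}")
```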
Probability of type
Building on the type-class concept, the probability that a vector $X^n$ drawn i.i.d. from $P$ falls in the type-class $T(\hat{P})$ satisfies, up to polynomial factors in $n$,

$$P^n\big(T(\hat{P})\big) \approx 2^{-n D(\hat{P}\,\|\,P)},$$

where $D(\hat{P}\,\|\,P) = \sum_{a\in\mathcal{X}} \hat{P}(a)\log_2\frac{\hat{P}(a)}{P(a)}$ is the KL divergence between the type and the true distribution.
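A small numerical illustration (Python; the bias $p$, length $n$ and the values of $k$ are my own choices):

```python
from math import comb, log2

def D(a, b):
    """Binary KL divergence D(a || b) in bits."""
    return a * log2(a / b) + (1 - a) * log2((1 - a) / (1 - b))

# Probability that n i.i.d. Bernoulli(p) bits have exactly k ones, compared
# with the exponent 2^{-n D(k/n || p)}; the gap is only polynomial in n.
p, n = 0.3, 200
for k in [40, 60, 80, 100]:        # empirical fractions 0.2, 0.3, 0.4, 0.5
    exact = comb(n, k) * p**k * (1 - p) ** (n - k)
    exponent = -n * D(k / n, p)
    print(f"k/n={k/n:.2f}  exact = 2^{log2(exact):8.2f}  estimate = 2^{exponent:8.2f}")
```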
(Binary) Pinsker's inequality
Pinsker's inequality is a mathematical relationship between two ways of measuring the "distance" between probability distributions. We can think of it like comparing two different ways to measure energy in physics.
For example, we can take a biased coin. The true probability of heads is $p$, while our model claims it is $q$. The left side of the inequality is the binary KL divergence $d(p\,\|\,q) = p\log_2\frac{p}{q} + (1-p)\log_2\frac{1-p}{1-q}$, which measures how badly the model $q$ describes data that really comes from $p$.
On the right side, we have a constant times the squared difference $(p - q)^2$, a simple geometric measure of how far apart the two probability values are.
The inequality tells us that the KL divergence is always at least this squared difference (up to the constant): in bits,

$$d(p\,\|\,q) \ge \frac{2}{\ln 2}\,(p - q)^2.$$

We can think of this like measuring distances: if two distributions are close in the KL sense, they must also be close in the ordinary numerical sense, because a small left-hand side forces $|p - q|$ to be small.
This is useful in physics when comparing theoretical predictions ($q$) with experimentally observed frequencies ($p$): a small KL divergence between them guarantees that the probabilities themselves are numerically close.
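A small numerical check of the inequality as stated above (Python; the grid of $(p, q)$ values is an arbitrary choice):

```python
from math import log, log2

def d_kl(p, q):
    """Binary KL divergence d(p || q) in bits."""
    return p * log2(p / q) + (1 - p) * log2((1 - p) / (1 - q))

# Check d(p || q) >= (2 / ln 2) * (p - q)^2 on a grid of interior (p, q) pairs.
worst = float("inf")
for i in range(1, 20):
    for j in range(1, 20):
        p, q = i / 20, j / 20
        if p == q:
            continue
        worst = min(worst, d_kl(p, q) / ((2 / log(2)) * (p - q) ** 2))
print("smallest ratio d(p||q) / bound over the grid:", round(worst, 3))  # >= 1
```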
Chernoff bound (additive form)
This is similar to uncertainty principles in physics. If we imagine that we are measuring the average value of many identical experiments (like measuring particle positions), the Chernoff bound tells us how unlikely it is that this measured average lands far from its expected value.
If we imagine that we are doing $n$ independent binary (or bounded in $[0,1]$) measurements $X_1, \dots, X_n$ with mean $\mu$ and tolerance $\epsilon$, the additive form of the bound states

$$\Pr\!\left[\left|\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right| \ge \epsilon\right] \le 2e^{-2n\epsilon^2}.$$
This shows us a few things:
- The more measurements we take (larger $n$), the less likely we are to be far from the expected value.
- The larger our tolerance $\epsilon$, the less likely we are to exceed it.
- The bound drops off exponentially in $n\epsilon^2$, which is very fast.
For example, let's say we're measuring electron spin, recording each outcome as a 0 or 1. If we do, say, $n = 1000$ measurements and allow a tolerance of $\epsilon = 0.1$, the bound gives $2e^{-2 \cdot 1000 \cdot 0.01} = 2e^{-20} \approx 4 \times 10^{-9}$,
which is very small.
Hence if the tolerance $\epsilon$ is held fixed, the failure probability vanishes exponentially in $n$; conversely, to keep a fixed confidence level the tolerance only needs to shrink like $\epsilon \sim 1/\sqrt{n}$.
This is similar to how measurement uncertainty in physics often scales as $1/\sqrt{n}$ (the standard error of the mean).
The Chernoff bound is useful for estimating how many measurements we would need for a desired confidence level, understanding the reliability of experimental results, and designing experiments with appropriate sample sizes.
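A Monte Carlo sanity check of the bound (Python; all parameter values are illustrative choices):

```python
import math
import random

# Monte Carlo check of Pr[|mean - p| >= eps] <= 2 exp(-2 n eps^2) for
# i.i.d. Bernoulli(p) bits.
p, n, eps, trials = 0.5, 200, 0.1, 20_000
hits = 0
for _ in range(trials):
    mean = sum(random.random() < p for _ in range(n)) / n
    if abs(mean - p) >= eps:
        hits += 1
print("empirical frequency:", hits / trials)                        # well below the bound
print("Chernoff/Hoeffding bound:", 2 * math.exp(-2 * n * eps**2))   # about 0.037
```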
Entropy perturbation bound (for general alphabets)
This is about how much the entropy (a measure of disorder, similar to thermodynamic entropy) changes when we slightly modify a probability distribution. If two distributions $P$ and $Q$ over the same alphabet $\mathcal{X}$ are close in $\ell_1$ distance, their entropies must also be close.
Let's take two probability distributions $P$ and $Q$ over $\mathcal{X}$, and let $\theta = \|P - Q\|_1 = \sum_{x}|P(x) - Q(x)|$ be their $\ell_1$ distance.
This is our perturbation size: it measures how much total probability mass has been moved.
The entropy measures uncertainty/randomness in the distribution. For a distribution $P$ it is $H(P) = -\sum_{x} P(x)\log_2 P(x)$.
This is similar to entropy in thermodynamics, and higher entropy means more uncertainty/randomness.
The left hand side of the bound is the entropy difference $|H(P) - H(Q)|$; the bound states that, provided $\theta \le \tfrac{1}{2}$,

$$|H(P) - H(Q)| \le \theta \log_2 \frac{|\mathcal{X}|}{\theta}.$$
Physically, this is like comparing two gases. If their molecular velocity distributions are similar (small $\theta$), then their entropies cannot differ by much.
This is useful in various fields:
- In physics, this tells us how much entropy can change when we slightly perturb a system.
- In information theory, it helps us understand how robust entropy measurements are.
- In statistical mechanics, it helps us bound entropy differences between similar states
Continuing the example above, suppose for instance that $|\mathcal{X}| = 4$ and $\theta = 0.1$. Then the bound gives $0.1 \log_2(4/0.1) \approx 0.53$ bits.
This means that the entropies of $P$ and $Q$ can differ by at most about half a bit, even though we have said nothing else about the two distributions.
This gives us a few key insights:
- Small changes in probability (small $\theta$) lead to small changes in entropy.
- The bound gets looser with larger alphabet size $|\mathcal{X}|$.
- The relation involves $\theta\log_2(1/\theta)$, not just $\theta$.
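A minimal sketch checking the $\theta\log_2(|\mathcal{X}|/\theta)$ form of the bound used above (Python; $P$ and $Q$ are made-up distributions):

```python
from math import log2

def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(p * log2(p) for p in dist if p > 0)

# Perturb a distribution on a 4-letter alphabet and compare the entropy
# change with theta * log2(|X| / theta).
P = [0.40, 0.30, 0.20, 0.10]
Q = [0.45, 0.25, 0.20, 0.10]
theta = sum(abs(p - q) for p, q in zip(P, Q))   # l1 distance, here 0.1
change = abs(entropy(P) - entropy(Q))
bound = theta * log2(len(P) / theta)
print(f"theta={theta:.2f}  |H(P) - H(Q)|={change:.4f}  bound={bound:.4f}")
```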
(Binary) High probability sets
We can think of this like finding the smallest region in phase space that contains almost all possible states of a system. In this case, we look for the smallest set $S$ of binary strings whose total probability is at least $1 - \delta$, for some small $\delta$.
For example, we can consider a quantum system. There are many possible states (like all possible binary strings), but most of the probability is concentrated in a small subset. This is like finding the most important energy levels that contain most of the system's probability.
To construct $S$, we sort the strings from most to least probable and keep adding strings until their cumulative probability reaches $1 - \delta$. For example, take the following distribution over 3-bit strings:
| string | probability |
|---|---|
| 111 | 0.3 |
| 110 | 0.25 |
| 101 | 0.2 |
| 100 | 0.15 |
| 011 | 0.025 |
| 010 | 0.025 |
| 001 | 0.025 |
| 000 | 0.025 |
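A minimal sketch of this greedy construction (Python), using the table above and an illustrative tolerance $\delta = 0.1$:

```python
# Greedily build the smallest set S whose total probability is at least
# 1 - delta, using the 3-bit distribution from the table above.
probs = {
    "111": 0.300, "110": 0.250, "101": 0.200, "100": 0.150,
    "011": 0.025, "010": 0.025, "001": 0.025, "000": 0.025,
}
delta = 0.1   # illustrative tolerance
S, total = [], 0.0
for s, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    S.append(s)
    total += p
    if total >= 1 - delta - 1e-9:   # small slack for float rounding
        break
print(S, round(total, 3))   # ['111', '110', '101', '100'] 0.9 -- 4 of 8 strings
```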
The size bound for such high-probability sets: for a length-$n$ i.i.d. binary source with bias $p$, the smallest set $S$ with total probability at least $1 - \delta$ has size $|S| \approx 2^{nH_2(p)}$ to first order in the exponent,
where $H_2(p)$ is the binary entropy of the source. In other words, out of $2^n$ possible strings, only about $2^{nH_2(p)}$ of them carry essentially all of the probability.
This is important in various fields:
- In physics, this helps us identify which states matter the most.
- In coding theory, this helps us design efficient codes.
- In statistical mechanics, this is similar to identifying important microstates.
Let's use a practical example. Imagine measuring $n$ qubits, each of which reads 0 or 1 with probabilities $1-p$ and $p$. Although there are $2^n$ possible outcome strings, a high-probability set of only about $2^{nH_2(p)}$ strings already captures almost all of the probability, so that is all we really need to describe or store.
This is used in statistical mechanics, and is similar to how a gas at fixed temperature occupies only a tiny fraction of all possible states, but that fraction contains most of the probability.
Strongly typical sets
This extends (Binary) High probability sets to more general alphabets, not just binary. A sequence is considered "typical" if its empirical distribution (what we actually observe) is close to the true probability distribution. This is similar to how, in statistical mechanics, most microstates of a system are "typical" and give rise to the expected macroscopic properties.
There are 3 basic components of this:
- Alphabet $\mathcal{X}$ (possible outcomes), for example for a 4-sided die $\mathcal{X} = \{1, 2, 3, 4\}$.
- $P$ is the true probability distribution. For a fair 4-sided die, $P = (1/4, 1/4, 1/4, 1/4)$.
- $\hat{P}_{x^n}$ is the type (empirical distribution) of the sequence $x^n$. For example, if $x^8 = 12341234$, the type is $\hat{P}_{x^8} = (1/4, 1/4, 1/4, 1/4)$.
A sequence $x^n$ is called strongly $\delta$-typical (written $x^n \in T_\delta^{(n)}$) if its empirical frequencies are all close to the true probabilities: $|\hat{P}_{x^n}(a) - P(a)| \le \delta$ for every symbol $a \in \mathcal{X}$.
For example, let's say we roll the fair 4-sided die $n = 100$ times and the four faces come up $27$, $24$, $25$ and $24$ times; with $\delta = 0.05$, every empirical frequency is within $0.05$ of $1/4$.
So this sequence is strongly typical.
Physically, this is like measuring particle positions.
Strongly typical sequences are "reasonable" measurement outcomes, like getting 501 heads in 1000 coin flips (typical), and unlike getting all heads (not typical).
Some properties:
- Most sequences are typical: $\Pr\big[X^n \in T_\delta^{(n)}\big] \to 1$ as $n \to \infty$.
- The number of typical sequences is roughly $2^{nH(P)}$, where $H(P)$ is the entropy of $P$.
- Each typical sequence has probability about $2^{-nH(P)}$.
The strongly typical set $T_\delta^{(n)}$ is the same subset of most likely strings as $S$ was in (Binary) High probability sets, except now with an alphabet of arbitrary size.
This is related to Sanov's Theorem, which tells us that the probability of observing an "atypical" sequence decreases exponentially with $n$.
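A small simulation of the first property above (Python; the die, $\delta$, and trial counts are illustrative choices):

```python
import random
from collections import Counter

# Estimate the probability that n rolls of a fair 4-sided die form a strongly
# typical sequence (every empirical frequency within delta of 1/4).
alphabet = [1, 2, 3, 4]
delta, trials = 0.05, 2000
for n in [20, 200, 2000]:
    typical = 0
    for _ in range(trials):
        counts = Counter(random.choice(alphabet) for _ in range(n))
        if all(abs(counts[a] / n - 0.25) <= delta for a in alphabet):
            typical += 1
    print(f"n={n:5d}  fraction typical = {typical / trials:.2f}")
```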
This is used in various fields:
- In physics, it describes the likely measurement outcomes.
- In coding, it helps design efficient compression.
- In statistics, it helps understand "normal" behavior.
- In information theory, it is fundamental to channel coding.
Sanov's Theorem
Sanov's Theorem is about the probability of observing empirical distributions that deviate from the true distribution.
Let's say we have a true probability distribution $P$ over an alphabet $\mathcal{X}$, we draw $n$ i.i.d. samples, and we record their empirical distribution (type) $\hat{P}$.
Sanov's Theorem states that, for a set of distributions $E$,

$$\Pr\big[\hat{P} \in E\big] \le (n+1)^{|\mathcal{X}|}\, 2^{-n \min_{Q \in E} D(Q\,\|\,P)},$$

where $D(Q\,\|\,P)$ is the KL divergence between a candidate empirical distribution $Q$ and the true distribution $P$.
Therefore, to first order in the exponent, $\Pr[\hat{P} \in E] \approx 2^{-n D(Q^*\|P)}$, where $Q^*$ is the distribution in $E$ closest to $P$ in KL divergence.
Using a physical example, imagine we're measuring particle positions. Let's say the true distribution $P$ of positions is known, but the histogram we actually record after $n$ measurements looks noticeably different from $P$.
Sanov says this probability decreases exponentially with:
- How many measurements you take ($n$).
- How "weird" your observation is (measured by the KL divergence $D(\hat{P}\,\|\,P)$).
Sanov's theorem is useful in various fields:
- In physics, it quantifies how unlikely "weird" measurements are. It is similar to how entropy increases in thermodynamics, and explains why we usually see "typical" behavior.
- In information theory, it shows that typical sets contain the most probability, justifies using typical sequences for coding, and gives error bounds for compression.
For a concrete numerical example, let's say we have a true distribution $P = (1/2, 1/2)$ (a fair coin) and we ask for the probability that the empirical distribution shows at least 75% heads after $n = 1000$ flips. The closest such distribution to $P$ is $Q = (0.75, 0.25)$, with $D(Q\,\|\,P) \approx 0.189$ bits.
So the probability bound is $(n+1)^{2}\, 2^{-1000 \times 0.189} \approx 10^{-51}$: seeing such a skewed empirical distribution is astronomically unlikely.
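A quick numerical check of this rate (Python; the same 75%-heads event as above, at smaller values of $n$ where the exact tail is easy to compute):

```python
from math import comb, log2

def D(a, b):
    """Binary KL divergence D(a || b) in bits."""
    return a * log2(a / b) + (1 - a) * log2((1 - a) / (1 - b))

# Empirical decay exponent of Pr[fraction of heads >= 0.75] for a fair coin,
# compared with the Sanov rate D(0.75 || 0.5); n is kept divisible by 4.
for n in [20, 80, 320, 1280]:
    tail_log2 = log2(sum(comb(n, k) for k in range(3 * n // 4, n + 1))) - n
    print(f"n={n:5d}  empirical exponent={-tail_log2 / n:.3f}  D(0.75||0.5)={D(0.75, 0.5):.3f}")
```

The empirical exponent converges (slowly, because of polynomial factors) to the Sanov rate.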
In statistical mechanics, this is similar to why gases spread out to fill containers, why temperature equalises between objects, or why we see macroscopic irreversibility.
Some key insights:
- Large deviations are exponentially unlikely.
- The rate of decay depends on how "strange" the deviation is.
- This explains why strongly typical sets have high probability.
- Gives quantitative bounds on atypical behavior.
Lexicographic encoding
This is about data compression. Imagine we have a string of symbols (like a message or a sequence of measurement outcomes). This encoding method allows us to compress the string efficiently by:
- First describing what type of sequence it is (like saying how many 0s and 1s)
- Then describing where exactly in the list of possible sequences of that type this particular sequence appears.
This method is efficient because it compresses the data to nearly the theoretical minimum length of about $nH(\hat{P})$ bits, where $\hat{P}$ is the empirical distribution (type) of the string.
Basically, instead of encoding the whole sequence directly, we first describe the type (frequency distribution), and then describe which specific sequence of that type it is.
For example, say $x^4 = 1101$:
- Describe the type:
  - Count: 3 ones, 1 zero.
  - Type $\hat{P} = (1/4, 3/4)$.
- Describe the position: Among all the sequences with 3 ones and 1 zero (our options are 1110, 1101, 1011, 0111), our sequence is number 2 in this list.
This works because:
- Type information is relatively small: it takes about $|\mathcal{X}|\log_2(n+1)$ bits, because there are at most $(n+1)^{|\mathcal{X}|}$ possible types.
- Position information takes about $\log_2|T(\hat{P})| \approx nH(\hat{P})$ bits, where $H(\hat{P})$ is the entropy of the type. This is optimal by Shannon's source coding theorem.
A more detailed example: Let $x^8 = 11010110$.
- Type description:
  - Count: 5 ones, 3 zeros.
  - Type $\hat{P} = (3/8, 5/8)$.
- Position in type-class:
- List all sequences with 5 ones and 3 zeros in order:
    11111000, 11110100, 11110010, ..., 00011111
  - Find the position of 11010110 in this list.
The computational complexity is modest: the index within the type-class can be computed directly from binomial coefficients, without listing the sequences, using a number of arithmetic operations that grows only polynomially in $n$.
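A minimal sketch of the ranking step (Python), using decreasing lexicographic order to match the listing above; the function name is mine:

```python
from itertools import combinations
from math import comb

def rank_in_type_class(bits):
    """1-based index of a binary string among all strings with the same number
    of ones, listed in decreasing lexicographic order (1 sorts before 0)."""
    n, ones_left, rank = len(bits), sum(bits), 0
    for i, b in enumerate(bits):
        m = n - i  # positions remaining, including this one
        if b == 0 and ones_left > 0:
            # every completion that puts a 1 here comes earlier in the list
            rank += comb(m - 1, ones_left - 1)
        elif b == 1:
            ones_left -= 1
    return rank + 1

x = [1, 1, 0, 1, 0, 1, 1, 0]            # the 11010110 example from the notes
print("count of ones:", sum(x))          # 5 -> the type is (3/8, 5/8)
print("index in type-class:", rank_in_type_class(x))

# Brute-force check against an explicit decreasing-lexicographic listing.
n, k = len(x), sum(x)
listing = sorted(
    ("".join("1" if i in pos else "0" for i in range(n))
     for pos in combinations(range(n), k)),
    reverse=True,
)
print("brute force:", listing.index("11010110") + 1)
```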
Physically, we can think of organising books. Instead of describing the exact location of each book, we first describe how many there are in each category, then describe the arrangement within that constraint.
This type of encoding approaches the theoretical minimum.