
Week 1

Big-O notation

There are different flavors of asymptotic complexity notation:

  • $f(n) \in O(g(n))$: $f(n)$ grows no faster than $g(n)$ as $n$ gets large. Formally:

    $f(n) \in O(g(n))$ if $\exists\, c > 0,\ N \in \mathbb{Z}$ such that $\forall\, n > N$, $|f(n)| < c\,|g(n)|$
  • $f(n) \in o(g(n))$: $f(n)$ grows strictly slower than $g(n)$. Formally:

    $f(n) \in o(g(n))$ if $\forall\, c > 0,\ \exists\, N_c \in \mathbb{Z}$ such that $\forall\, n > N_c$, $|f(n)| < c\,|g(n)|$
  • $f(n) \in \Omega(g(n))$: $g(n)$ grows no faster than $f(n)$. Formally:

    $f(n) \in \Omega(g(n))$ if $g(n) \in O(f(n))$
  • $f(n) \in \omega(g(n))$: $g(n)$ grows strictly slower than $f(n)$. Formally:

    $f(n) \in \omega(g(n))$ if $g(n) \in o(f(n))$
  • $f(n) \in \Theta(g(n))$: $f(n)$ and $g(n)$ grow at the same rate. Formally:

    $f(n) \in \Theta(g(n))$ if $f(n) \in O(g(n))$ and $f(n) \in \Omega(g(n))$
  • $f(n) \doteq g(n)$ if $\lim_{n\to\infty} \frac{\log f(n)}{\log g(n)} = 1$; $f(n) \mathrel{\dot\le} g(n)$ if $\lim_{n\to\infty} \frac{\log f(n)}{\log g(n)} \le 1$; $f(n) \mathrel{\dot\ge} g(n)$ if $\lim_{n\to\infty} \frac{\log f(n)}{\log g(n)} \ge 1$ (illustrated numerically below).

This notation is crucial for analysing the complexity and scaling behavior of algorithms and information-theoretic quantities.
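As a quick sanity check of the $\doteq$ notation, the following sketch (Python, with an arbitrarily chosen pair $f(n) = \binom{n}{n/2}$ and $g(n) = 2^n$, which is not from the notes) shows the log-ratio approaching 1 even though the plain ratio $f(n)/g(n)$ goes to 0:

```python
from math import comb, log2

# Check of the dot-equality notation: the claim binom(n, n/2) ≐ 2^n means
# log(f(n)) / log(g(n)) -> 1, even though f(n) / 2^n -> 0.
for n in [10, 100, 1000, 10000]:
    f = comb(n, n // 2)        # central binomial coefficient
    ratio = log2(f) / n        # log2(g(n)) = log2(2^n) = n
    print(n, round(ratio, 4))  # approaches 1 as n grows
```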

Stirling's approximation

Stirling's approximation for factorials:

$n! = \sqrt{2\pi n}\left(\frac{n}{e}\right)^n \left(1 + \Theta\!\left(\frac{1}{n}\right)\right)$

This is a useful approximation when working with factorials, which often come up in combinatorics and information theory. The key takeaway is that for large $n$, the factorial $n!$ grows super-exponentially (roughly like $(n/e)^n$), and Stirling's formula captures this growth up to a $\sqrt{2\pi n}$ prefactor and a $1 + \Theta(1/n)$ correction.
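A minimal numerical check (Python, standard library only) of how quickly the approximation converges:

```python
from math import factorial, sqrt, pi, e

# Compare n! with Stirling's approximation sqrt(2*pi*n) * (n/e)^n.
# The relative error shrinks like Theta(1/n), as stated above.
for n in [5, 10, 50, 100]:
    exact = factorial(n)
    stirling = sqrt(2 * pi * n) * (n / e) ** n
    rel_err = (exact - stirling) / exact
    print(f"n={n:>3}  relative error = {rel_err:.4%}")
```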

Binary random variables

Some specific results for i.i.d. binary random variables $X_1, \dots, X_n$ that take values in $\{0,1\}$, with $\Pr[X_i = 1] = p$:

  • $\mathbb{E}(\bar X) = pn$, where $\bar X \equiv \sum_{i=1}^n X_i$ is the number of ones and $n$ is the total number of binary random variables, or elements in the binary vector, being analysed.
  • $\Pr[\bar X = \mathbb{E}(\bar X)] = \binom{n}{pn}\, p^{pn} (1-p)^{(1-p)n} = \frac{1}{\sqrt{2\pi p(1-p)n}}\left(1 + \Theta\!\left(\frac{1}{n}\right)\right)$, assuming $pn$ is an integer.

This shows how the sum of i.i.d. binary trials concentrates around its expected value: the probability of hitting the mean exactly decays only polynomially, like $1/\sqrt{n}$. (A numerical check follows below.)
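A small sketch checking the stated formula for $\Pr[\bar X = \mathbb{E}\bar X]$, with an arbitrarily chosen $p = 0.4$ so that $pn$ is an integer for the chosen values of $n$:

```python
from math import comb, sqrt, pi

# Exact binomial pmf at the mean vs the 1/sqrt(2*pi*p*(1-p)*n) leading-order term.
p = 0.4
for n in [10, 100, 1000]:
    k = int(p * n)                                # E(Xbar) = pn
    exact = comb(n, k) * p**k * (1 - p)**(n - k)  # binomial pmf at the mean
    approx = 1 / sqrt(2 * pi * p * (1 - p) * n)   # leading-order approximation
    print(f"n={n:>4}  exact={exact:.5f}  approx={approx:.5f}")
```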

Types and type classes

For vectors $x \in \mathcal{X}^n$:

  • The type $T(x)$ of a vector $x$ is a length-$|\mathcal{X}|$ probability vector, where the $i$-th component is the fraction of occurrences of the symbol $i$ in $x$.
  • The type-class $\mathcal{T}(x)$ is the set of all vectors $x'$ such that $T(x') = T(x)$.

Some key results about the size of type-classes:

  • The number of types of length-$n$ vectors over an alphabet $\mathcal{X}$ satisfies $|\mathcal{P}_n(\mathcal{X})| = \binom{n + |\mathcal{X}| - 1}{|\mathcal{X}| - 1} \le (n+1)^{|\mathcal{X}|-1}$.
  • For a binary vector $x \in \{0,1\}^n$ with $qn$ ones, the size of its type-class is $|\mathcal{T}(x)| = \binom{n}{qn} \doteq 2^{nH(q)}$, where $H(q) \equiv q\log\frac{1}{q} + (1-q)\log\frac{1}{1-q}$ is the binary entropy function.

Alphabet and vector space:

  • $\mathcal{X}$ represents the alphabet, or set of possible symbols and values. In the binary case, $\mathcal{X} = \{0,1\}$.
  • $x$ represents a vector of length $n$ containing elements from the alphabet $\mathcal{X}$, i.e. $x \in \mathcal{X}^n$.

Types:

  • The type $T(x)$ of a vector $x$ is a length-$|\mathcal{X}|$ probability vector, where the $i$-th component is the fraction of occurrences of the symbol $i$ in $x$.
  • The type captures the empirical distribution or the histogram of the symbols in the vector x.

Type classes:

  • The type-class $\mathcal{T}(x)$ is the set of all vectors $x'$ such that $T(x') = T(x)$.
  • The type-class contains all vectors that have the same empirical distribution or type.

Size of type-classes:

  • The number of types of length-$n$ vectors over the alphabet $\mathcal{X}$ satisfies $|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|-1}$.
  • This provides an upper bound on the number of possible types for vectors of length $n$ over the alphabet $\mathcal{X}$.
  • For a binary vector $x \in \{0,1\}^n$ with $qn$ ones, the size of its type-class is approximately $|\mathcal{T}(x)| \doteq 2^{nH(q)}$, where $H(q)$ is the binary entropy function.
  • This shows that the size of the type-class grows exponentially with the length of the vector $n$, and the growth rate is determined by the binary entropy of the type $q$.

q:

  • In the context of the binary vector $x \in \{0,1\}^n$, $q$ represents the fraction or proportion of ones in the vector.
  • Specifically, if the binary vector $x$ has $qn$ ones, then $q$ is the ratio of the number of 1s to the total length $n$ of the vector.
  • So $q$ is the empirical probability or relative frequency of the symbol 1 appearing in the binary vector $x$.

$\mathcal{P}_n(\mathcal{X})$:

  • This represents the set of all possible types (empirical distributions) of length-$n$ vectors over the alphabet $\mathcal{X}$.
  • In other words, $\mathcal{P}_n(\mathcal{X})$ is the set of all length-$|\mathcal{X}|$ probability vectors that can arise as the type of some length-$n$ vector over the alphabet $\mathcal{X}$.
  • The size of this set, $|\mathcal{P}_n(\mathcal{X})|$, is bounded above by $(n+1)^{|\mathcal{X}|-1}$ (checked numerically below).
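A quick check of these counts (Python; the exact count is the stars-and-bars formula quoted earlier, and the alphabet sizes are arbitrary choices):

```python
from math import comb

# Exact number of types of length-n vectors over a k-symbol alphabet,
# C(n + k - 1, k - 1), compared with the (n + 1)^(k - 1) upper bound.
def num_types(n: int, k: int) -> int:
    return comb(n + k - 1, k - 1)   # stars-and-bars count of histograms

for n, k in [(5, 2), (100, 2), (100, 4)]:
    print(f"n={n:>3}, |X|={k}: exact={num_types(n, k):>8}  bound={(n + 1) ** (k - 1):>8}")
```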

Example

Let's say we have a binary alphabet $\mathcal{X} = \{0,1\}$, and we have a binary vector $x = [0,1,1,0,1]$ of length $n = 5$.

  1. Calculating the type $T(x)$:
    • The type $T(x)$ is a length-$|\mathcal{X}| = 2$ probability vector that represents the empirical distribution of the symbols in $x$.
    • In this case, there are 3 ones and 2 zeros in $x$.
    • The type is $T(x) = (2/5, 3/5)$, since the fraction of 0's is $2/5$ and the fraction of 1's is $3/5$.
  2. Calculating the type-class $\mathcal{T}(x)$:
    • The type-class $\mathcal{T}(x)$ is the set of all vectors $x'$ that have the same type as $x$.
    • In other words, $\mathcal{T}(x)$ contains all binary vectors of length 5 that have 3 ones and 2 zeros, in any order.
    • Some examples of vectors in $\mathcal{T}(x)$ are: $[0,1,1,0,1]$, $[1,0,1,1,0]$, $[1,1,0,0,1]$, etc.
  3. Size of the type-class $|\mathcal{T}(x)|$:
    • For a binary vector $x \in \{0,1\}^n$ with $qn$ ones, the size of the type-class $\mathcal{T}(x)$ is approximately $2^{nH(q)}$, where $H(q)$ is the binary entropy function.
    • In this case, $q = 3/5$, so $H(q) = H(3/5) \approx 0.971$.
    • The size of the type-class is then $|\mathcal{T}(x)| \approx 2^{5 \times 0.971} \approx 29$. However, the true value is $\binom{5}{3} = 10$.

WAT

The approximation looks way off here, but that is because $\doteq$ only matches the exponential growth rate. The exact count carries a polynomial prefactor, $\binom{n}{qn} = \frac{2^{nH(q)}}{\sqrt{2\pi q(1-q)n}}\bigl(1 + \Theta(1/n)\bigr)$ by Stirling, and for $n = 5$ that prefactor is far from negligible: $29 / \sqrt{2\pi \times 0.6 \times 0.4 \times 5} \approx 10.6$, close to the true value of 10. As $n$ grows, the prefactor becomes irrelevant on the scale of the exponent, which is exactly what the $\doteq$ notation captures (see the numerical check below).
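A minimal check (Python) of how the exact count and the exponential estimate converge as $n$ grows, with $q = 3/5$ held fixed so that $qn$ stays an integer:

```python
from math import comb, log2, sqrt, pi

# Exact type-class size C(n, qn) vs the 2^{nH(q)} estimate and the
# Stirling-corrected estimate with the 1/sqrt(2*pi*q*(1-q)*n) prefactor.
def H(q: float) -> float:
    return -q * log2(q) - (1 - q) * log2(1 - q)   # binary entropy in bits

q = 3 / 5
for n in [5, 50, 500]:
    exact = comb(n, int(q * n))
    corrected = 2 ** (n * H(q)) / sqrt(2 * pi * q * (1 - q) * n)
    print(f"n={n:>3}  exact=2^{log2(exact):.1f}  estimate=2^{n * H(q):.1f}  corrected≈{corrected:.3g}")
```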

Probability of type

Building on the type-class concept, the probability that a vector $x$ has a particular type $q_X$ (a binary type with $qn$ ones, when the true distribution is Bernoulli($p$)) is:

$\Pr[T(x) = q_X] = \binom{n}{qn}\, p^{qn} (1-p)^{(1-q)n} \doteq 2^{-nD(q\|p)}$

where $D(q_X\|p)$ is the Kullback-Leibler divergence between the type $q_X$ and the true underlying distribution $p$. This relates the probability of observing a particular type to the "distance" between the empirical distribution and the true distribution.

$D(q\|p) \equiv q\log\left(\frac{q}{p}\right) + (1-q)\log\left(\frac{1-q}{1-p}\right)$
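A sketch verifying this relation numerically, for an arbitrarily chosen true distribution $p = 0.3$ and type $q = 0.5$ (these values are illustrative, not from the notes):

```python
from math import comb, log2

# Exact Pr[T(x) = q] = C(n, qn) p^{qn} (1-p)^{(1-q)n} vs the 2^{-n D(q||p)} estimate.
def D(q: float, p: float) -> float:
    """Binary KL divergence in bits."""
    return q * log2(q / p) + (1 - q) * log2((1 - q) / (1 - p))

p, q = 0.3, 0.5
for n in [10, 100, 1000]:
    k = int(q * n)
    exact = comb(n, k) * p**k * (1 - p)**(n - k)
    estimate = 2 ** (-n * D(q, p))
    print(f"n={n:>4}  exact={exact:.3e}  2^(-nD)={estimate:.3e}")
```

The two agree to exponential order: the remaining gap is the same $\Theta(\sqrt{n})$ polynomial prefactor seen in the type-class example above.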

(Binary) Pinsker's inequality

Pinsker's inequality is a mathematical relationship between two ways of measuring the "distance" between probability distributions. We can think of it like comparing two different ways to measure energy in physics. $D(q\|p)$ is the KL divergence, and $|p - q|$ is just the absolute difference between the probabilities. The inequality states that the KL divergence is always greater than or equal to a constant times the square of the absolute difference:

$D(q\|p) \ge \frac{2}{\ln(2)}\,|p - q|^2$

For example, take a biased coin whose true probability of heads is $p = 0.6$, while the estimated probability of heads (from experiments) is $q = 0.55$.

On the right-hand side, we have $|p - q|^2 = |0.6 - 0.55|^2 = 0.05^2 = 0.0025$, and $\frac{2}{\ln(2)} \times 0.0025 \approx 0.0072$. On the left-hand side, the KL divergence is $D(q\|p) \approx 0.0074$. Hence the KL divergence is larger than the right-hand side, and the inequality holds.

The inequality tells us that the KL divergence is always at least $\frac{2}{\ln(2)}$ times the squared absolute difference. This means that if the absolute difference is large, the KL divergence must be even larger. Read in reverse, it gives a useful bound: if we know $D(q\|p)$, we can bound how different $p$ and $q$ can be.

We can think of this like measuring distances. $|p - q|$ is like measuring distance with a ruler, while $D(q\|p)$ is like measuring the actual path length when you have to go over hills and valleys. Pinsker's inequality tells us that the complex path is always at least a certain factor longer than the straight-line distance.

This is useful in physics when comparing theoretical predictions (p) with experimental results (q). It's also useful in information theory when analysing how much information is lost when using one distribution to approximate another.
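The worked coin example above can be reproduced in a few lines of Python:

```python
from math import log2, log

# D(q||p) vs (2/ln 2) |p - q|^2, both in bits, for the coin example above.
def D(q: float, p: float) -> float:
    return q * log2(q / p) + (1 - q) * log2((1 - q) / (1 - p))

p, q = 0.6, 0.55
lhs = D(q, p)
rhs = (2 / log(2)) * abs(p - q) ** 2
print(f"D(q||p) = {lhs:.4f} bits  >=  (2/ln2)|p-q|^2 = {rhs:.4f} bits")
```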

Chernoff bound (additive form)

This is similar to uncertainty principles in physics. Imagine that we are measuring the outcome of many identical experiments (like measuring particle positions): $\bar X$ is the measured total (the sum of the $n$ outcomes) and $\mathbb{E}\bar X$ is its true expected value. The Chernoff bound tells us how likely it is that our measurement will deviate from the true value by more than some amount ($2\epsilon n$). This bound shows that the probability decreases exponentially with the sample size $n$.

$\Pr[|\bar X - \mathbb{E}\bar X| \ge 2\epsilon n] \mathrel{\dot\le} \exp(-2\ln(2)\, n\epsilon^2)$

Imagine that we are doing $n$ identical measurements (like measuring particle positions), where each measurement gives a binary outcome (like "particle detected" or "not detected"). $\bar X$ is the measured total and $\mathbb{E}\bar X$ is the expected (theoretical) value. For example, we can measure something in a lab $n$ times, knowing the theoretical expected value (like mean energy) from calculations, and we want to know how likely it is that our measured total will be more than $2\epsilon n$ away from the theoretical expected value.

$\Pr[|\bar X - \mathbb{E}\bar X| \ge 2\epsilon n]$ is the probability that our measurement deviates from the expected value by more than $2\epsilon n$. $|\bar X - \mathbb{E}\bar X|$ is how far our measurement ($\bar X$) is from the expected value ($\mathbb{E}\bar X$). $\epsilon$ is like the "tolerance" – how much error we are willing to accept.

$\exp(-2\ln(2)\, n\epsilon^2)$ gives us the upper bound on that probability. It decreases exponentially with $n$ (the number of measurements) and $\epsilon^2$ (our squared tolerance).

This shows us a few things:

  • The more measurements we take (larger n), the less likely we are to be far from the expected value
  • The larger our tolerance $\epsilon$, the less likely we are to exceed it.
  • The bound drops off exponentially, which is very fast.

For example, let's say we're measuring electron spin. We do $n = 1000$ measurements with tolerance $\epsilon = 0.1$. The bound tells us that the probability of our measured total deviating from the expected value by more than $2\epsilon n$ is less than:

$\exp(-2\ln(2)\, n\epsilon^2) = \exp(-2\ln(2) \times 1000 \times 0.1^2) = \exp(-20\ln(2)) = 2^{-20} \approx 9.5 \times 10^{-7}$

which is very small.

Hence if the tolerance $\epsilon$ decreases more slowly than $1/\sqrt{n}$ (that is, $\epsilon \in \omega(1/\sqrt{n})$), then the measurements will almost certainly be within the tolerance. Formally:

if $\epsilon \in \omega(1/\sqrt{n})$ then $\Pr[|\bar X - \mathbb{E}\bar X| \le 2\epsilon n] \ge 1 - o(1)$

This is similar to how measurement uncertainty in physics often scales as $1/\sqrt{n}$.

The Chernoff bound is useful for estimating how many measurements we would need for a desired confidence level, understanding the reliability of experimental results, and designing experiments with appropriate sample sizes.
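A Monte Carlo sketch (Python, with arbitrarily chosen $p = 0.7$, $\epsilon = 0.05$, $n = 200$) comparing the empirical deviation probability with the bound as stated above:

```python
import random
from math import exp, log

# Estimate Pr[|Xbar - E Xbar| >= 2*eps*n] for a Bernoulli(p) coin by simulation
# and compare with exp(-2 ln(2) n eps^2). The bound should (loosely) dominate.
random.seed(0)
p, eps, n, trials = 0.7, 0.05, 200, 20000

deviations = 0
for _ in range(trials):
    xbar = sum(random.random() < p for _ in range(n))  # sum of n binary outcomes
    if abs(xbar - p * n) >= 2 * eps * n:
        deviations += 1

print("empirical probability :", deviations / trials)
print("stated bound          :", exp(-2 * log(2) * n * eps**2))
```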

Entropy perturbation bound (for general alphabets)

This is about how much the entropy (a measure of disorder, similar to thermodynamic entropy) changes when we slightly modify a probability distribution. If two distributions $p$ and $q$ are close to each other (their $\ell_1$ distance is small), then their entropies can't be too different. The bound grows with the size of the alphabet (the number of possible outcomes), but only logarithmically.

$|H(p_X) - H(q_X)| \le \epsilon\log\left(\frac{|\mathcal{X}|}{\epsilon}\right)$

Let's take two probability distributions $p_X$ and $q_X$ over the same set $\mathcal{X}$, for example $\mathcal{X} = \{1,2,3,4\}$, $p_X = (0.2, 0.3, 0.4, 0.1)$ and $q_X = (0.25, 0.25, 0.35, 0.15)$. Then the $\ell_1$-norm $\|p_X - q_X\|_1$ measures how different the distributions are by summing the absolute differences between probabilities. In this example:

$\|p_X - q_X\|_1 = |0.2-0.25| + |0.3-0.25| + |0.4-0.35| + |0.1-0.15| = 0.05 + 0.05 + 0.05 + 0.05 = 0.2$

This is our ϵ in the bound.

The entropy measures uncertainty/randomness in the distribution. For a distribution p:

$H(p) = -\sum_i p_i \log(p_i)$

This is similar to entropy in thermodynamics, and higher entropy means more uncertainty/randomness.

The left-hand side $|H(p_X) - H(q_X)|$ is how different the entropies are. The bound tells us that this difference can't be too large. Specifically, it's at most $\epsilon\log\left(\frac{|\mathcal{X}|}{\epsilon}\right)$, where $|\mathcal{X}|$ is the size of the alphabet (number of possible outcomes) and $\epsilon$ is how different the distributions are (the $\ell_1$-norm).

Physically, this is like comparing two gases. If their molecule velocities are similar (small ϵ), then their thermodynamic entropies must also be similar. This bound tells us exactly how similar they must be.

This is useful in various fields:

  • In physics, this tells us how much entropy can change when we slightly perturb a system.
  • In information theory, it helps us understand how robust entropy measurements are.
  • In statistical mechanics, it helps us bound entropy differences between similar states

Continuing the example above, $|\mathcal{X}| = 4$ (there are 4 possible outcomes) and $\epsilon = 0.2$ (our $\ell_1$ distance from earlier). The bound says that:

$|H(p_X) - H(q_X)| \le 0.2\log\left(\frac{4}{0.2}\right) = 0.2\log(20) \approx 0.86 \text{ bits}$

This means that the entropies of $p_X$ and $q_X$ cannot differ by more than 0.86 bits.
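A few lines of Python checking the bound on this example:

```python
from math import log2

# |H(p) - H(q)| vs eps*log(|X|/eps) for the 4-symbol example above.
def entropy(dist):
    return -sum(x * log2(x) for x in dist if x > 0)   # entropy in bits

p = (0.2, 0.3, 0.4, 0.1)
q = (0.25, 0.25, 0.35, 0.15)
eps = sum(abs(a - b) for a, b in zip(p, q))           # l1 distance = 0.2

lhs = abs(entropy(p) - entropy(q))
rhs = eps * log2(len(p) / eps)
print(f"|H(p) - H(q)| = {lhs:.4f} bits  <=  eps*log(|X|/eps) = {rhs:.4f} bits")
```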

This gives us a few key insights:

  • Small changes in probability (small ϵ) lead to small changes in entropy.
  • The bound gets looser with larger alphabet size |X|.
  • The bound involves $\log(|\mathcal{X}|/\epsilon)$, not just $\log(|\mathcal{X}|)$: the extra $\epsilon\log(1/\epsilon)$ term is the price paid for only controlling the $\ell_1$ distance, and it still vanishes as $\epsilon \to 0$.

(Binary) High probability sets

We can think of this like finding the smallest region in phase space that contains almost all possible states of a system. In this case, $B$ is the smallest set of binary strings that contains almost all likely outcomes. The size of this set is at most roughly $2^{nH(p) + n\epsilon\log(2/\epsilon)}$, where $H(p)$ is the entropy.

$B$ is a subset of all the possible binary strings of length $n$. For example, if $n = 3$, then the full space $\{0,1\}^3$ is:

$\{000, 001, 010, 011, 100, 101, 110, 111\}$

B would be some subset of these strings that includes the most likely outcomes.

$\Pr(X \in B) = 1 - o(1)$ means that the probability that our outcome $X$ is in the set $B$ is very high, very close to 1. Here $o(1)$ denotes something that approaches 0 as $n$ gets larger, so $1 - o(1)$ means "almost certainly". This is similar to physics, where we say a particle is "almost certainly" in its ground state.

For example, we can consider a quantum system. There are many possible states (like all possible binary strings), but most of the probability is concentrated in a small subset. This is like finding the most important energy levels that contain most of the system's probability.

To construct B "greedily", we start with an empty set, and keep adding the most probable strings one by one, until the total probability reaches 1o(1). For example, with n=3, say the probabilities were:

string   probability
  111    0.3
  110    0.25
  101    0.2
  100    0.15
  011    0.025
  010    0.025
  001    0.025
  000    0.025

$B$ might be $\{111, 110, 101, 100\}$ because these total 0.9 probability.

The size bound of B is:

$|B| \mathrel{\dot\le} 2^{nH(p) + n\epsilon\log(2/\epsilon)}$

where $|B|$ is the number of strings in our set, and $H(p)$ is the entropy of the probability distribution. This bound tells us that $B$ cannot be too large; it is at most about $2^{nH(p)}$ times a correction factor of $2^{n\epsilon\log(2/\epsilon)}$.

This is important in various fields:

  • In physics, this helps us identify which states matter the most.
  • In coding theory, this helps us design efficient codes.
  • In statistical mechanics, this is similar to identifying important microstates.

Let's use a practical example. Imagine measuring $n$ coin flips: let $n = 1000$ flips of a biased coin with $p = 0.7$ for heads. Here, $H(p) = -(0.7\log(0.7) + 0.3\log(0.3)) \approx 0.88$. The most likely outcomes will have around 700 heads. So, $B$ would contain strings with head counts close to 700, not all $2^{1000}$ possible strings.

This is used in statistical mechanics, and is similar to how a gas at fixed temperature occupies only a tiny fraction of all possible states, but that fraction contains most of the probability.
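A brute-force sketch of the greedy construction for a small example; the parameters here ($n = 12$, $p = 0.7$, target probability $0.9$) are arbitrary choices that keep full enumeration of $2^n$ strings cheap:

```python
from math import log2
from itertools import product

# Greedily collect the most probable length-n binary strings until their total
# probability under i.i.d. Bernoulli(p) bits reaches the target, then compare
# |B| with the 2^{nH(p)} scale.
n, p, target = 12, 0.7, 0.9

def prob(s):
    k = sum(s)                                   # number of ones in the string
    return p**k * (1 - p)**(n - k)

strings = sorted(product((0, 1), repeat=n), key=prob, reverse=True)
B, total = [], 0.0
for s in strings:                                # most probable strings first
    B.append(s)
    total += prob(s)
    if total >= target:
        break

H = -(p * log2(p) + (1 - p) * log2(1 - p))       # binary entropy, about 0.88 bits
print(f"|B| = {len(B)},  2^(nH(p)) = {2**(n*H):.0f},  covered probability = {total:.3f}")
```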

Strongly typical sets

This extends (Binary) High probability sets to more general alphabets, not just binary. A sequence is considered "typical" if its empirical distribution (what we actually observe) is close to the true probability distribution. This is similar to how, in statistical mechanics, most microstates of a system are "typical" and give rise to the expected macroscopic properties.

There are 3 basic components of this:

  • Alphabet $\mathcal{X}$ (possible outcomes); for example, for a 4-sided die $\mathcal{X} = \{1,2,3,4\}$.
  • $p_X$ is the true probability distribution. For a fair 4-sided die, $p_X = (0.25, 0.25, 0.25, 0.25)$.
  • $T(x)$ is the type (empirical distribution) of a sequence $x$. For example, if $x = 1,1,2,3,1,4,1,2$, the type is $T(x) = \left(\frac{4}{8}, \frac{2}{8}, \frac{1}{8}, \frac{1}{8}\right)$.

A sequence $x$ is $(\epsilon, n)$-strongly typical if its empirical distribution $T(x)$ is close to $p_X$. Here, "close" means that the $\ell_1$ distance satisfies $\|T(x) - p_X\|_1 \le \epsilon$, where the $\ell_1$ distance is the sum of absolute differences.

For example, let's say $\mathcal{X} = \{1,2,3\}$, $p_X = (0.5, 0.3, 0.2)$, $n = 10$ and $\epsilon = 0.1$. For the sequence $x = 1,1,1,1,1,2,2,2,3,3$, the type is $T(x) = (0.5, 0.3, 0.2)$. We can check if our sequence is typical:

$\|T(x) - p_X\|_1 = |0.5-0.5| + |0.3-0.3| + |0.2-0.2| = 0 \le \epsilon$

So this sequence is strongly typical.

Physically, this is like measuring particle positions. pX is the quantum mechanical probability distribution, and T(x) is what we actually measure in n trials.

Strongly typical sequences are "reasonable" measurement outcomes, like getting 501 heads in 1000 coin flips (typical), and unlike getting all heads (not typical).

Some properties:

  • Most sequences are typical: $\Pr(x \in A_\epsilon^{(n)}(X)) \to 1$ as $n \to \infty$.
  • The number of typical sequences is roughly $|A_\epsilon^{(n)}| \approx 2^{nH(p_X)}$, where $H(p_X)$ is the entropy of $p_X$.
  • Each typical sequence has probability about $2^{-nH(p_X)}$.

$A_\epsilon^{(n)}$ plays the same role as the subset $B$ of most likely strings in (Binary) High probability sets, except now with an alphabet of arbitrary size. A small simulation is sketched below.
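A Monte Carlo sketch (Python) of the first property: draw sequences from $p_X$ and check how often the empirical type lands within $\ell_1$ distance $\epsilon$ of $p_X$. The parameters ($p_X$, $n$, $\epsilon$, number of trials) are arbitrary illustrative choices:

```python
import random
from math import log2

# Fraction of sampled sequences whose type is within l1 distance eps of p_X.
random.seed(1)
p_X = (0.5, 0.3, 0.2)
symbols = (1, 2, 3)
n, eps, trials = 200, 0.1, 5000

def l1_to_p(seq):
    counts = [seq.count(s) / n for s in symbols]          # empirical type T(x)
    return sum(abs(c - p) for c, p in zip(counts, p_X))   # ||T(x) - p_X||_1

typical = sum(
    l1_to_p(random.choices(symbols, weights=p_X, k=n)) <= eps
    for _ in range(trials)
)
print(f"fraction of eps-typical sequences: {typical / trials:.3f}")
print(f"entropy H(p_X) = {-sum(p * log2(p) for p in p_X):.3f} bits")
```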

This is related to Sanov's Theorem, which tells us that the probability of observing an "atypical" sequence decreases exponentially with $n$, similar to how large fluctuations are exponentially suppressed in statistical mechanics.

This is used in various fields:

  • In physics, it describes the likely measurement outcomes.
  • In coding, it helps design efficient compression.
  • In statistics, it helps understand "normal" behavior.
  • In information theory, it is fundamental to channel coding.

Sanov's Theorem

Sanov's Theorem is about the probability of observing empirical distributions that deviate from the true distribution.

Let's say we have a true probability distribution pX, and we take n independent samples. We observe an empirical distribution T(x), and now we want to know the probability of seeing a "weird" distribution.

Sanov's Theorem states that, for a set of distributions $Q$:

$\Pr(T(x) \in Q) \mathrel{\dot\le} 2^{-n \min_{q \in Q} D(q\|p)}$

where $D(q\|p)$ is the KL divergence (in bits) we saw in (Binary) Pinsker's inequality, $n$ is the number of samples, and the minimum is taken over all distributions $q$ in the set $Q$.

For Strongly typical sets:

$Q = \{q : \|q - p_X\|_1 > \epsilon\}$

Q is the set of "atypical" distributions. By Pinsker's inequality:

$D(q\|p) \ge \frac{2}{\ln(2)}\,|p - q|^2$

Therefore:

$\Pr(x \notin A_\epsilon^{(n)}) \mathrel{\dot\le} \exp(-2\ln(2)\, n\epsilon^2)$

Using a physical example, imagine we're measuring particle positions. Let's say the true distribution is $p_X = (0.5, 0.3, 0.2)$. After $n = 100$ measurements, what is the probability of seeing $T(x) = (0.7, 0.2, 0.1)$?

Sanov says this probability decreases exponentially with:

  • How many measurements you take (n)
  • How "weird" your observation is (measured by D(T(x)p)).

Sanov's theorem is useful in various fields:

  • In physics, it quantifies how unlikely "weird" measurements are. It is similar to how entropy increases in thermodynamics, and explains why we usually see "typical" behavior.
  • In information theory, it shows that typical sets contain the most probability, justifies using typical sequences for coding, and gives error bounds for compression.

For a concrete numerical example, let's say we have a true distribution $p = (0.5, 0.5)$ and an observed distribution $q = (0.7, 0.3)$, over the course of $n = 100$ trials.

$D(q\|p) = 0.7\log\left(\frac{0.7}{0.5}\right) + 0.3\log\left(\frac{0.3}{0.5}\right) \approx 0.119 \text{ bits}$

So the probability bound is:

$2^{-100 \times 0.119} \approx 2.7 \times 10^{-4}$
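The same numbers in a few lines of Python:

```python
from math import log2

# Sanov-style bound 2^{-n D(q||p)} for observing the atypical type q = (0.7, 0.3)
# under the true distribution p = (0.5, 0.5) with n = 100 samples.
def D(q, p):
    return sum(qi * log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)   # KL in bits

p, q, n = (0.5, 0.5), (0.7, 0.3), 100
print(f"D(q||p) = {D(q, p):.3f} bits")
print(f"bound   = 2^(-n D) = {2 ** (-n * D(q, p)):.2e}")
```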

In statistical mechanics, this is similar to why gases spread out to fill containers, why temperature equalises between objects, or why we see macroscopic irreversibility.

Some key insights:

  • Large deviations are exponentially unlikely.
  • The rate of decay depends on how "strange" the deviation is.
  • This explains why strongly typical sets have high probability.
  • Gives quantitative bounds on atypical behavior.

Lexicographic encoding

This is about data compression. Imagine we have a string of symbols (like a message or a sequence of measurement outcomes). This encoding method allows us to compress the string efficiently by:

  • First describing what type of sequence it is (like saying how many 0s and 1s)
  • Then describing where exactly in the list of possible sequences of that type this particular sequence appears.

This method is efficient because it compresses the data to nearly the theoretical minimum length ($nH(X)$ bits, where $H(X)$ is the entropy) and can be implemented with reasonable computational complexity ($O(n^3\log^2(n))$ operations).

Basically, instead of encoding the whole sequence directly, we first describe the type (frequency distribution), and then describe which specific sequence of that type it is.

For example, say $n = 4$, with a binary alphabet $\{0,1\}$, and our original sequence is 1101. To encode the string:

  1. Describe the type:
    • Count: 3 ones, 1 zero
    • Type $T(x) = (0.25, 0.75)$
  2. Describe the position: Among all the sequences with 3 ones and 1 zero (our options are 1110, 1101, 1011, 0111), our sequence is number 2 in this list.

This works because:

  1. Type information is relatively small: it takes about $O(\log(n))$ bits for a fixed alphabet, because $|\mathcal{P}_n(\mathcal{X})| \le (n+1)^{|\mathcal{X}|-1}$.
  2. Position information takes about $nH(p)$ bits, where $H(p)$ is the entropy of the type. This is optimal by Shannon's source coding theorem.

A more detailed example: Let n=8, sequence = 11010110.

  1. Type description:
    • Count: 5 ones, 3 zeros
    • Type $T(x) = \left(\frac{3}{8}, \frac{5}{8}\right)$
  2. Position in type-class:
    • List all sequences with 5 ones and 3 zeros in order:
      11111000
      11110100
      11110010
      ...
      00011111
    • Find the position of 11010110 in this list

The computational complexity is $O(n^3\log^2(n))$ because the encoding needs to find the position of the sequence within its type class by counting (rather than explicitly listing) the same-type sequences that precede it, which requires arithmetic on large binomial coefficients, numbers with up to roughly $n$ bits (hence the $\log^2$ factor).

Physically, we can think of organising books. Instead of describing the exact location of each book, we first describe how many there are in each category, then describe the arrangement within that constraint.

This type of encoding approaches the theoretical minimum: the total length is $nH(X) + o(n)$ bits, where $nH(X)$ is Shannon's entropy (the theoretical minimum) and $o(n)$ is the extra overhead of describing the type, and the encoding is perfectly reversible (lossless compression). A sketch of the encoder is given below.
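A minimal sketch of the two ingredients for binary strings: the type description and the index within the type class, computed by counting rather than listing. The helper names are my own (not from the notes), and the ordering matches the descending listing used in the worked examples above:

```python
from math import comb, log2, ceil

# Part 1 records the type (the number of ones); part 2 records the rank of the
# string among all same-type strings, listed in descending order as above.
def rank_in_type_class(bits: str) -> int:
    n, k = len(bits), bits.count("1")
    ones_seen, greater = 0, 0
    for i, b in enumerate(bits):
        if b == "1":
            ones_seen += 1
        elif ones_seen < k:
            # same-type strings sharing this prefix but with a 1 here come earlier
            greater += comb(n - i - 1, k - ones_seen - 1)
    return greater + 1                      # 1-based position in the list

def encoded_length(bits: str) -> int:
    n, k = len(bits), bits.count("1")
    type_bits = ceil(log2(n + 1))           # which of the n+1 binary types
    index_bits = ceil(log2(comb(n, k)))     # position within the type class
    return type_bits + index_bits           # roughly log(n) + nH(k/n) bits

print(rank_in_type_class("1101"))                        # 2, as in the small example
print(rank_in_type_class("11010110"))                    # 14 under this ordering
print(encoded_length("1" * 4 + "0" * 28), "bits vs 32")  # a biased string compresses
```

For short or balanced strings the overhead of the type description can exceed the savings; the scheme pays off for large $n$, where the $O(\log n)$ type cost is negligible next to $nH(p)$.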