
6 Matrix Decompositions Everyone* Should Know

2023-03-27
interests: linear algebra

*everyone's conjugate transpose

[image: matrix decompositions: the movie]

The title is a joke, but was inspired by the 110-page tome What Every Programmer Should Know About Memory, which I think is not a joke?

I'm going to try explaining these 6 useful matrix decompositions:

[figure: hierarchy of decompositions]

The first 4 (SVD, Eigen, Schur, PSD eigen) help with geometry and exponentiation. The last 2 (LU and Cholesky) help calculate determinants and solve for X in systems of equations like AX = B. Scroll to the bottom if you need a reminder about any terminology or matrix classes.

Singular Value Decomposition (SVD)

  • Formula: A = W\Sigma V^*
  • Applies to: all matrices!
  • Useful for: geometry

Hot take: if you only know one matrix decomposition, make it SVD (forget Eigen).

A 2x2 matrix has 4 degrees of freedom that describe a linear distortion on 2D vectors. The usual representation of

\begin{bmatrix}a&b\\c&d\end{bmatrix}
can be unhelpful when reasoning geometrically. More intuitively, a 2x2 matrix acts on a vector by doing these 3 things in order:
  • rotates by an angle (1 degree of freedom)
  • stretches along the x and y axes by different ratios (2 degrees of freedom)
  • rotates by another angle (and maybe flips) (1 degree of freedom)

The SVD shows that every matrix does these same 3 things. In other words, for any m\times n matrix A, there exist unitary matrices W (m\times m) and V^* (n\times n) and a nonnegative diagonal matrix \Sigma (m\times n) such that

A = W\Sigma V^*
So A first rotates via V^* in n dimensions, then stretches along the coordinate axes and changes dimension to m via \Sigma (either truncating or adding on 0's), then rotates and flips via W in m dimensions.

The amounts \Sigma stretches by along the axes are called the singular values. If you take a (hyper)circle and apply a matrix to it, you get back a (hyper)ellipse whose radii are the singular values. This has far-reaching applications in dimensionality reduction and projective geometry.

[figure: The SVD converting a circle with axes v_0 and v_1 into an ellipse with axes sigma_0 w_0 and sigma_1 w_1]
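If you want to see this in action, here's a quick numpy sketch (a toy matrix of my own choosing, not anything from above): decompose a 2x2 matrix, then check that the image of the unit circle is an ellipse whose longest and shortest radii are the singular values.

```python
import numpy as np

# A toy 2x2 matrix; any matrix works here.
A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

# numpy returns A = W @ diag(sigma) @ Vh, where Vh plays the role of V^*.
W, sigma, Vh = np.linalg.svd(A)
assert np.allclose(W @ np.diag(sigma) @ Vh, A)

# Apply A to a bunch of points on the unit circle.
theta = np.linspace(0, 2 * np.pi, 1000)
circle = np.vstack([np.cos(theta), np.sin(theta)])
radii = np.linalg.norm(A @ circle, axis=0)

# The longest and shortest radii of the resulting ellipse
# match the largest and smallest singular values.
print(radii.max(), sigma[0])  # both ~3.26
print(radii.min(), sigma[1])  # both ~1.84
```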

At Hive, I once used SVD to help find interesting clusters of television shows, including this one:

  • Overhaulin'
  • Bitchin' Rides
  • Graveyard Carz

The data amounted to a giant matrix of (viewers x TV shows), over 1M x 10k in size, where each entry was a 1 if the specific viewer had watched the TV show and 0 otherwise. I wanted to apply k-means to cluster the TV shows, but the data was too big, so I first used SVD to reduce dimensionality via a technique called PCA. I used Apache Spark to get the r\approx30 biggest singular values and an m\times r slice W' of W. I then projected A onto the column vectors of W', thereby reducing the dimensionality of the shows from 10k x 1M to 10k x r. Applying k-means to the reduced data gave us lots of fun clusters to look at.
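The real pipeline ran in Spark on the full 1M x 10k matrix, but the shape of the recipe is easy to show at toy scale. Here's a sketch of mine using scikit-learn on made-up watch data (all the sizes and parameters below are placeholders, not the real ones):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Fake watch data: rows are viewers, columns are TV shows,
# 1 if the viewer watched the show. (The real matrix was ~1M x 10k.)
n_viewers, n_shows, r = 5000, 200, 30
watches = (rng.random((n_viewers, n_shows)) < 0.05).astype(np.float32)

# Each show is a column of the watch matrix, i.e. a viewer-dimensional
# vector. Truncated SVD projects those vectors down to r dimensions,
# keeping only the directions with the largest singular values (PCA-ish).
shows = watches.T                                   # (n_shows, n_viewers)
reduced = TruncatedSVD(n_components=r, random_state=0).fit_transform(shows)

# Now k-means is cheap: it clusters 200 points in 30 dimensions
# instead of 200 points in 5000 dimensions.
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(reduced)
print(labels[:20])
```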

Eigendecomposition

  • Formula: A = P^{-1}DP
  • Applies to: diagonalizable matrices
  • Useful for: exponentiation

The eigendecomposition explains what happens after repeated (or fractional, or inverse, or exponential) application of a matrix. That makes it good at describing how linear systems evolve over time or space, which is extremely important; any system of linear ordinary differential equations with constant coefficients can be solved by a matrix exponential. If

\mathbf{x}'(t) = A\mathbf{x}(t),\qquad \mathbf{x}(0) = \mathbf{x}_0
then we can solve \mathbf{x} at any point in time with
\mathbf{x}(t) = e^{At}\mathbf{x}_0
Engineers use the eigendecomposition to model how skyscrapers respond to earthquakes. Physicists use it to understand quantum systems. Biologists use it to model population genetics.

Intuitively, when the eigendecomposition is possible, there are certain eigenvectors that the matrix stretches by their respective eigenvalues without rotating. More concretely, there exist an invertible matrix P (whose inverse's columns are the eigenvectors) and a diagonal matrix D (containing the eigenvalues) such that

A = P^{-1}DP

As I hinted, its most important property is easy exponentiation. Just apply the exponents to the entries of the diagonal matrix:

A^r = P^{-1}D^rP
Similarly,
e^A = P^{-1}e^DP
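Here's a small numpy sketch of mine that solves \mathbf{x}'(t) = A\mathbf{x}(t) this way, for a made-up 2x2 system, and sanity-checks it against scipy's general-purpose matrix exponential:

```python
import numpy as np
from scipy.linalg import expm

# A toy system x'(t) = A x(t) with eigenvalues -1 and -2.
A = np.array([[0.0, 1.0],
              [-2.0, -3.0]])
x0 = np.array([1.0, 0.0])
t = 1.5

# numpy's eig returns the eigenvalues and a matrix whose *columns*
# are eigenvectors -- that matrix plays the role of P^{-1} here.
eigvals, P_inv = np.linalg.eig(A)
P = np.linalg.inv(P_inv)

# e^{At} = P^{-1} e^{Dt} P, where e^{Dt} just exponentiates the diagonal.
eAt = P_inv @ np.diag(np.exp(eigvals * t)) @ P

print(eAt @ x0)           # solution x(t)
print(expm(A * t) @ x0)   # same thing via scipy, as a sanity check
```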

But beware: the eigendecomposition can be confusing.

  • The eigenvectors aren't generally orthogonal. So if you imagine stretching the space along each eigenvector by its eigenvalue, one after another, you'd be wrong. Somehow the matrix stretches along all of the eigenvectors simultaneously.
  • Part of the decomposition is P, which has almost no useful properties.
  • The eigenvectors and values do not have any important geometric interpretation.
[figure: The eigendecomposition stretching two eigenvectors p_0 and p_1 to lambda_0 p_0 and lambda_1 p_1. They are not at right angles, and they do not correspond to the minor and major axes of the ellipse.]

Schur Decomposition

  • Formula: A = VUV^*
  • Applies to: square matrices
  • Useful for: exponentiation

What happens when you exponentiate a matrix that has no eigendecomposition? Polynomial growth (often mixed with more exponentials). For instance, \begin{bmatrix}1&1\\-1&3\end{bmatrix} is not diagonalizable. It turns out that exponentiating it gives

\begin{bmatrix}1&1\\-1&3\end{bmatrix}^r = 2^r\left(I + \frac12r\begin{bmatrix}-1&1\\-1&1\end{bmatrix}\right)
The Schur decomposition helps explain behavior like this.

According to the Schur decomposition, every square matrix is equivalent to an upper triangular matrix if we rotate (and flip) our axes carefully. Precisely, there exist a unitary matrix V and upper triangular matrix U such that

A = VUV^*

Similarly to the eigendecomposition, this has the property that

A^r = VU^rV^*
It is much harder to exponentiate an upper triangular matrix than a diagonal matrix, but it's still easier than exponentiating a square matrix directly. For instance, in the example above, the Schur decomposition gives V = \frac1{\sqrt{2}}\begin{bmatrix}1&-1\\1&1\end{bmatrix} and U = \begin{bmatrix}2&2\\0&2\end{bmatrix}. It is easy to show that U^r = 2^r\begin{bmatrix}1&r\\0&1\end{bmatrix}, and conjugating by V gives the result above. Even when exponentiating the upper triangular matrix by brute force is necessary, it requires roughly a quarter as much work as a square matrix.
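You can reproduce all of this with scipy (a quick sketch of mine; scipy may pick a slightly different V and U than the ones above, but with the same diagonal):

```python
import numpy as np
from scipy.linalg import schur

A = np.array([[1.0, 1.0],
              [-1.0, 3.0]])

# scipy returns A = V @ U @ V.T with V orthogonal (unitary and real here)
# and U upper triangular.
U, V = schur(A)
assert np.allclose(V @ U @ V.T, A)

# Exponentiate the triangular factor instead of A itself.
r = 5
A_r = V @ np.linalg.matrix_power(U, r) @ V.T
assert np.allclose(A_r, np.linalg.matrix_power(A, r))

# The diagonal of U holds the eigenvalues (both 2 for this matrix).
print(np.diag(U))
```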

The Schur decomposition also provides the (generalized!) eigenvalues! The main diagonal of U holds the same values as the eigendecomposition's D term.

Positive-semidefinite (PSD) Eigendecomposition

  • Formula: A = VDV^*
  • Applies to: PSD matrices
  • Useful for: geometry, exponentiation

The holy trinity of decompositions is the eigendecomposition of a PSD matrix. It gives you the best of 3 worlds: an SVD where W and V are the same matrix, an eigendecomposition where P is unitary, and a Schur decomposition where U is diagonal. That is, there exist a unitary matrix V and a nonnegative diagonal matrix D such that

A = VDV^*

This does it all. The singular values equal the eigenvalues. The n\times n PSD matrix can be decomposed into n orthogonal invariant subspaces, so with the right choice of axes, you can reason about each dimension independently from the others. And if A is real, then both V and D are as well.
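Here's a small numpy sketch of mine that checks all of those claims on a random PSD matrix:

```python
import numpy as np

rng = np.random.default_rng(0)

# B^T B is always positive-semidefinite (and here almost surely positive-definite).
B = rng.standard_normal((4, 4))
A = B.T @ B

# eigh is the eigendecomposition for Hermitian (here: real symmetric) matrices.
# It returns real eigenvalues in ascending order and an orthonormal V.
eigvals, V = np.linalg.eigh(A)

assert np.allclose(V @ np.diag(eigvals) @ V.T, A)   # A = V D V^*
assert np.allclose(V.T @ V, np.eye(4))              # V is unitary (orthogonal)

# The singular values are the same numbers, just sorted the other way.
print(np.linalg.svd(A, compute_uv=False))
print(eigvals[::-1])
```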

It is so special that we should have a different word for it. Maybe "Cauchy decomposition" would be historically accurate, but to abide by Stigler's Law, I propose "Keanu decomposition".

LU Decomposition (with partial pivoting)

  • Formula: \pi A = LU
  • Applies to: square matrices
  • Useful for: determinants, solving linear systems

These last 2 matrix decompositions are completely different from the first 4. Instead of applying rotations and stretches, they break square matrices into two triangular matrices. This enables quickly solving for X in systems of equations like AX=B, as well as calculating \det(A). In particular, LU with partial pivoting is what Lapack (and therefore numpy etc.) uses for solving general systems of equations.

If you harken back to Algebra II, you may remember doing tedious Gaussian elimination to solve linear systems. Computing the LU decomposition requires the same 2n^3/3-flop Gaussian elimination, but spits out some handy byproducts: a lower triangular matrix L, an upper triangular matrix U, and a permutation matrix \pi such that

\pi A = LU

One of the easiest things you can do with this is compute A's determinant. The determinant of a triangular matrix is the product of its diagonal, and the determinant of a permutation matrix is (-1)^\text{#swaps}, so

\det(A) = (-1)^\text{#swaps}\prod_{i=1}^n L_{ii}U_{ii}
It only took \mathcal{O}(n^3) flops to calculate this. Compare that to \mathcal{O}(n!) flops for the usual algorithm they teach in school.
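With scipy, that recipe looks roughly like this sketch of mine (scipy's L has 1's on its diagonal, so only U's diagonal contributes to the product):

```python
import numpy as np
from scipy.linalg import lu_factor

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))

# lu_factor packs L and U into one matrix (L has an implicit unit diagonal)
# and returns the row each pivot was swapped with.
lu, piv = lu_factor(A)

n_swaps = np.sum(piv != np.arange(len(piv)))
det = (-1.0) ** n_swaps * np.prod(np.diag(lu))

print(det, np.linalg.det(A))   # should match
```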

The other thing you can do is solve linear systems. That includes computing matrix inverses (solving AX=I)! Intuitively, it's relatively easy to solve for \mathbf{x} in a triangular system of equations like

2x_0 = 6,\qquad x_0 + 3x_1 = 9
Here one can trivially deduce the value of x_0, then substitute that into the equation for x_1 (i.e. "forward substitution"). This requires only \mathcal{O}(n^2) floating-point operations (flops). Once we have the LU decomposition, we can first solve for \mathbf{y} in L\mathbf{y} = \pi\mathbf{b} via forward substitution, then solve for \mathbf{x} in U\mathbf{x} = \mathbf{y} via backward substitution. Computing the LU decomposition and solving like this is exactly as hard as (and procedurally almost identical to) Gaussian elimination on an augmented matrix. Its advantage is that we can now solve multiple systems of equations AX_1 = B_1, AX_2=B_2, \ldots quickly, reusing the LU decomposition.
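In scipy, that factor-once-solve-many workflow looks roughly like this sketch (made-up matrices):

```python
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B1 = rng.standard_normal((4, 3))
B2 = rng.standard_normal((4, 3))

# Pay the O(n^3) factorization cost once...
lu_piv = lu_factor(A)

# ...then each solve is only O(n^2) per right-hand side.
X1 = lu_solve(lu_piv, B1)
X2 = lu_solve(lu_piv, B2)

assert np.allclose(A @ X1, B1)
assert np.allclose(A @ X2, B2)
```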

Details and caveats:

  • Partial pivoting refers to dynamically swapping rows during Gaussian elimination to avoid dividing by 0. With a bit more care, always dividing by the largest-magnitude number available also improves numerical stability. That's why the permutation matrix \pi is necessary.
  • If A is not invertible, L and/or U won't be either, so one should check for 0's on their diagonals before trying to solve linear systems.
  • If A is n\times n and B is n\times k with k\gg n, we can do better than the standard LU procedure by actually evaluating A^{-1}. The total cost is still \mathcal{O}(n^3 + n^2k), but computers are better (and more parallel) at matrix multiplication than at forward/backward substitution.

Cholesky Decomposition

  • Formula: A = LL^*
  • Applies to: positive-definite matrices
  • Useful for: determinants, solving linear systems

Positive-definite matrices make everything better, and the LU decomposition is no exception. It upgrades into the Cholesky decomposition, with these benefits:

  • 2x faster to compute the decomposition.
  • typically no need for a permutation matrix.
Method                   flops for r solves constituting k systems of n equations
Gaussian elimination     2rn^3/3 + 2n^2k
LU decomposition         2n^3/3 + 2n^2k
Cholesky decomposition   n^3/3 + 2n^2k

The biggest use case for the Cholesky decomposition is solving least squares problems. A linear regression \text{argmin}_{\beta} ||X\beta - \mathbf{y}|| has the same solution for \beta as the system of equations (X^TX)\beta = X^T\mathbf{y}. Assuming X has at least n linearly independent rows (i.e. full column rank), X^TX is positive-definite.
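Here's a sketch of that normal-equations approach with scipy's Cholesky solver, on made-up data (for badly conditioned X you'd reach for np.linalg.lstsq instead, but this shows the idea):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
n_samples, n_features = 100, 5
X = rng.standard_normal((n_samples, n_features))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(n_samples)

# X^T X is positive-definite (X has full column rank here), so Cholesky applies.
gram = X.T @ X
beta = cho_solve(cho_factor(gram), X.T @ y)

print(beta)
print(np.linalg.lstsq(X, y, rcond=None)[0])   # same answer, different route
```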

Least squares fits right into the form of the Cholesky decomposition, which makes sense - it's exactly what Cholesky designed it for. He worked as a survey officer for the French in WWI, using it to triangulate distances quickly. Unfortunately he died of battle wounds just a few months before the war ended. But his work was published posthumously, and even bigger names like Alan Turing have studied his decomposition's properties.

At Lyft we used the Cholesky decomposition in a contextual assortment bandit service. In other words, we would take a few pieces of numeric information (e.g. how urban the user's location is) and recommend a subset of possible options based on that (e.g. regular Lyft ride, bicycle, public transit). We learned how to go from inputs (contexts) to outputs (assortments) over time by observing some loss (e.g. conversion rate). The crux of this problem was semi-frequently evaluating terms that looked like

\mathbf{x}^T\left(\sum_v vv^T\right)^{-1}\mathbf{x}
Fortunately V=\sum_v vv^T is a positive-definite matrix! We were able to evaluate V^{-1}\mathbf{x} fairly quickly thanks to the Cholesky decomposition. And... user experiences were improved? Value was delivered to key stakeholders? Something like that.
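Stripped of all the bandit machinery, the core computation looked something like this sketch (made-up shapes and data, not the real service code):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
d = 8

# V = sum of outer products of observed context vectors -> positive-definite
# (as long as the contexts span the space).
contexts = rng.standard_normal((1000, d))
V = contexts.T @ contexts

# Factor once, then each x^T V^{-1} x is just a triangular solve + dot product.
V_chol = cho_factor(V)

x = rng.standard_normal(d)
quad = x @ cho_solve(V_chol, x)   # x^T V^{-1} x, without ever forming V^{-1}

print(quad)
```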

Appendix

I used complex matrices for this post to avoid a lot of duplicate terminology for real matrices. Some types of matrices are special cases of others:

[figure: hierarchy of matrices]

Definitions:

  • square: n\times n.
  • invertible: P is invertible if there exists an inverse P^{-1} such that PP^{-1} = P^{-1}P = I. This is equivalent to having only nonzero eigenvalues. Or to having a nonzero determinant.
  • conjugate transpose (noun): the result of flipping the matrix across its diagonal and flipping the sign of all imaginary components. A^* denotes the conjugate transpose of A.
  • conjugating (verb): conjugating A by an invertible matrix P means taking P^{-1}AP. I find it really unfortunate that this is based on the same word as the previous term.
  • similar: A is similar to B if you can conjugate B by some matrix to get A; A=P^{-1}BP for an invertible matrix P.
  • diagonalizable: similar to a diagonal matrix.
  • diagonal: has nonzero entries only on the diagonal.
  • Hermitian: equal to its conjugate transpose. If the matrix is real, this means it's symmetric.
  • unitary: V is unitary if V^* = V^{-1}. This really means that V acts on vectors by only rotations and flips, preserving distances and angles. We often conjugate by a unitary matrix and write the result as VAV^*.
  • positive-semidefinite (PSD): A is positive-semidefinite if for any vector v, v^*Av \ge 0. Any Hermitian matrix with nonnegative real eigenvalues has this property.
  • upper (or lower) triangular: only has nonzero entries on or above (or below) the diagonal.
  • permutation: a matrix of 0's and 1's where each column and each row has exactly one 1. During matrix multiplication, these just reorder the rows (or columns) of the other matrix.
  • flop: floating point operation. A single division, multiplication, addition, or subtraction.

Notable Mentions

After careful review, these matrix decompositions did not make the cut:

  • polar decomposition. It's equivalent to SVD and just breaks the matrix into fewer parts.
  • low-rank decomposition. I'm pretty sure people only go from decomposed form -> full matrix and never the other way around.
  • Jordan normal form. It's in between eigen and Schur, and I just can't put that much nuance into one blog post.