Spring 2018 18.06 Lecture Summaries
This document is a brief summary of what was covered in each 18.06 lecture, along with links and suggestions for further reading. It is not a good substitute for attending lecture, but may provide a useful study guide.
(You can also look at the analogous summaries from Fall 2017.)
I'll try to update it within a day of each lecture.
Machine Learning with Gaussian Elimination
PSET 1 .
Further reading: Textbook sections 2.1 and 2.2. Strang lecture 2 video.
In the first lecture, we perform Gaussian elimination the "high school" way, by putting A next to b and performing row operations. In the next lecture we will move into a more matrix computation view by bringing in Elimination matrices. At the end of the notebook above I did a little machine learning experiment where Gaussian elimination accurately finds the unknown parameters in a 3-vector w right away with one backslash, while traditional machine learning can take a long time. I'm not sure if traditional machine learning can be improved, but if any of you try running another software library and see better results, I'd be very interested.
As I'm sure you know, the world needs creative people who understand concepts and how to use them. This course does not emphasize hand computation like the old days.
The big message: the matmul you learned in high school is not the only, perhaps not even the best perspective on how to multiply matrices. There are many other ways. There are really two key facts that are important
- All the elementwise viewpoints can be reordered. Matmul has that kind of "symmetry."
- It may be better to view a matrix as a collection of columns or rows or blocks or just as a whole.
You should understand all the views. The column view. The row view. The outer product view. The block matrix view. You should unerstand why all 6 "ijk" choices are equally correct ways to multiply matrices.
Further reading: Textbook sections 2.3, 2.4, 2.6. Strang lecture 1 video, lecture 4 video.
On Friday at 5pm, there will be an optional tutorial session in 32-141 for the Julia computer software that we will be using in 18.06 this semester for homework and lecture demonstrations. Julia is a programming language, but no "real" programming will be required in 18.06: we will just be using it as a "fancy calculator" that happens to have lots of linear algebra and other capabilities. The tutorial session is optional; for the homework, we will mostly just give you code and tell you to run it, perhaps with minor tweaks, in order to perform computational experiments. Bring your laptops, and try logging into juliabox.com beforehand. More information:
Key Points to consider
- How the L matrix entries are just the multipliers from Gaussian elimination
- How in practice, one rarely "augments" the matrix with the right-hand side like in the old paper and pencil textbooks. Instead, you compute A=LU, substitute this into Ax=b=LUx, let c=Ux, solve Lc=b, then solve Ux=c. In particular, solving Lc=b is exactly the same as performing the Gaussian-elimination steps on c.
- Given A=LU, you can efficiently solve multiple right-hand sides, or equivalently the matrix equation AX=B.
- How row swaps lead to the factorization PA=LU: in practice, the computer always does row swaps, and always gives you a permutation matrix P (or its equivalent).
- Singular matrices = zero pivots (that can't be eliminated by row swaps) = no solutions or infinitely many solutions.
Further reading: Textbook sections 2.6, , 2.5. Strang lecture 4 video and lecture 5 video.
We did an example of Gauss-Jordan to compute inverses.
Then we explored the ordering of elimination matrices.
A Key point is that elimination matrices are for human understanding, in the real world the multipliers are computed and inserted into L.
In the real world, neither Julia, nor Python, nor MATLAB, nor R does Gaussian Elimination, but rather all these packages call LAPACK written in FORTRAN90.
-
Transposes,permutations,orthogonality Further reading: Textbook sections 2.5, 2.6 ("The cost of elimination"), and 11.1. Strang video lecture 3.
One way or another, we have now covered the material up to 2.6 and also 11.1. Also we covered the first four Gil Strang videos on OCW.
In today's lecture we finished the notebook on transposes, permutations, and orthogonality. We then looked at the common algebraic structure of vector spaces, playing the game of "Am I a vector space?"
If a set does not have a clear 0 vector, then it is easy to exclude it from being a vector space, but having a zero does not guarantee that a set is a vector space.
The most important fact is that linear combinations of vectors should exist in the space.
Reviewed vector spaces (a set V of anything you can add x+y and multiply by scalars αx) and introduced subspaces (a subset of V such that vector operations stay in the subspace). Examples of vector spaces include real n-component vectors (ℝⁿ, or ℂⁿ for complex numbers), real m×n matrices, functions f(x) (ℝ↦ℝ, real numbers to real numbers). Examples of subspaces includes planes or lines through the origin in ℝ³, or the origin itself. The goal of this is to break vector spaces into smaller pieces that we can still do linear algebra on (hence the need for a subspace, not just any arbitrary subset). Subspaces are especially important to help us understand the solutions (if any) of Ax=b for non-square matrices A.
Generalizations of vector spaces and subspaces. These aren't limited to the n-component vectors (ℝⁿ, or ℂⁿ for complex numbers) that we use in 18.06! Other examples include real m×n matrices, functions f(x) (ℝ↦ℝ, real numbers to real numbers). Examples of subspaces includes planes or lines through the origin in ℝ³, or the origin itself; continuous functions, or functions with f(0)=0. This generality is useful! It is quite common to have problems where the unknowns are matrices rather than vectors (e.g. a Sylvester equation or multilinear algebra) or problems where the unknowns are functions (e.g. partial differential equations like Maxwell's equations in electromagnetism and functional analysis), and the fact they they are vector spaces means that a lot of the tools and concepts of linear algebra can be applied. (This happens in many engineering and physics courses; in math, see e.g. 18.303.)
For an m×n matrix A, introduced two important subspaces.
-
First, the column space C(A): the set {Ax for all x ∈ ℝⁿ}. This is the set of right-hand sides b such that Ax=b is solvable, and is a subspace of ℝᵐ (not ℝⁿ unless m=n!). Equivalently, C(A) is all linear combinations of the columns of A, which we call the span of the columns.
- Did an 3×2 example in which C(A) was a plane in ℝ³, and noticed via elimination that the "dimensionality" ("2d") of this subspace matched the rank (2).
- Did another 3×2 example in which C(A) was a line in ℝ³, because the second column was a multiple of the first, and noticed via elimination that the "dimensionality" ("1d") of this subspace matched the rank (1).
-
Second, the null space (also called the "kernel") N(A): the set {x such that Ax=0} ⊆ ℝⁿ (i.e., a subspace of ℝⁿ). Given any solution x to Ax=b, then x+z is also a solution if z ∈ N(A) (i.e. Az=0 ⟹ A(x+z)=Ax+Az=Ax=b).
- In our 2×3 example, the null space was determined to be a line in ℝ³, and we saw that this gave a line of solutions.
These are very important subspaces because they tell us a lot about the matrix A and solutions to Ax=b. As a trivial example, if A is an n×n invertible matrix, C(A)=ℝⁿ and N(A)={0}.
Defined a basis for a vector space as a minimal set of vectors (we will later say that they have to be linearly independent) whose span (all linear combinations) produces everything in the space. The dimension of a vector space is the number of vectors in a basis. More on this later… for now, I mostly rely on your intuition that a plane is 2d, a line is 1d, a point is 0d, etcetera.
Further reading: Textbook, sections 3.1 and 3.2; Strang video lecture 6.
More on nullspace and column space. Reviewed the definitions, and the fact that Ax=b is solvable if and only if ("iff") b ∈ C(A). Given a particular solution xₚ satisfying Axₚ=b, xₚ+x is also a solution for any x ∈ N(A).
Showed that the nullspace is preserved by elimination (row) operations, but that the column space is not. So, to find N(A), we can instead do elimination and find N(U)=N(A) for the upper-triangular form U. Defined the rank as the number of (nonzero) pivots, the pivot columns, and the free columns. We now want to find all possible solutions to Ax=0.
For hand calculation, it is useful to go a step further: much like Gauss–Jordan, we take U and eliminate "upwards" in the pivot columns to turn them into I, i.e. to turn the pivot columns into an identity matrix. This is called the reduced row echelon form (abbreviated "rref") R. Again, N(R)=N(A).
We can solve Rx=0 by breaking up x=(p,f) into the pivot variables p and the free variables f, and we find that they satisfy p=-Ff. This means that we can choose any f we want and p is determined. If we choose the usual basis (1,0,0,…), (0,1,0,…) for f∈ℝⁿ⁻ʳ and solve for p, this gives us the special solutions, a basis for N(A). By the same reasoning we can see that the dimension of N(A) is n-r, #columns – #pivots. This is true for all matrices, and is known as the rank-nullity theorem.
Did an example of a 3×5 matrix A with rank 3 (3 pivots, which we will call full row rank), in which case N(A) is a 2-dimensional "plane" (in ℝ⁵), and we find two special solutions (a basis for the nullspace). Once you get used to this, you can just read off the special solutions from the rref form R.
Brought up another way to think about the nullspace. If N(A) contains a nonzero vector x≠0, then Ax=0 means that some linear combination of the columns of A, with at least one nonzero coefficient, is zero. We say that the columns of A are linearly dependent: at least one of the columns can be made from a linear combination of the other columns. If N(A)={0}, then we say that the columns are linearly independent.
Earlier, I defined a basis as a minimal set of vectors whose span gives an entire vector space, and the dimension of the space as the size of the basis. We now rephrase this as saying basis vectors must be be linearly independent. The dimension of a space is still the number of basis vectors.
Further reading: Textbook, sections 3.2-3.3, video lecture 6, lecture 7. Textbook sections 3.4, video lecture 9.
Bases, dimension, and independence. Earlier, I defined a basis as a minimal set of vectors whose span gives an entire vector space, and the dimension of the space as the size of the basis. Now, we want to think more carefully about the term "minimal". If we have too many vectors in our basis, the problem is that some of the vectors might be redundant (you can get them from the other basis vectors). We now rephrase this as saying that such vectors are linearly dependent: some linear combination (with nonzero multipliers) of them gives the zero vector, and we want every basis to be linearly independent. The dimension of a subspace is still the number of basis vectors.
What does it mean to be linearly independent? Given a set of n vectors {x₁, ⋯, xₙ}, if we matrix a matrix X whose columns are x₁⋯xₙ, then C(X) is precisely the span of x₁⋯xₙ. To check whether the x₁⋯xₙ form a basis for C(X), we need to check whether they are linearly independent. There are three equivalent ways to think about this:
-
We want to make sure that none of x₁⋯xₙ are "redundant": make sure that no xⱼ can be made from a linear combination of the other xᵢ's.
-
Equivalently, we don't want any linear combination of x₁⋯xₙ to give zero unless all the coefficients are zero.
-
Equivalently, we want N(X) = {0}.
In this way, we reduced the concept of independence to thinking about the null space. Since the nullspace is preserved by elimination, it follows that columns of A are dependent/independent if and only if the corresponding columns of R are dependent/independent. By looking at R, we can see by inspection that the pivot columns form a maximal set of independent vectors, and hence are a basis for C(R). Hence, the pivot columns of A (i.e. the columns of A corresponding to the columns of R or U where the pivots appear) are a basis for C(A).
It follows that the dimension of C(A) is exactly rank(A).
We can now find the complete solution (i.e., all solutions) to a non-square linear system Ax=b. Elimination turns this into Rx=d. We look for solutions in the form xₚ + xₙ: a particular solution xₚ plus any vector xₙ in N(A) (specified explicitly by giving a basis). The particular solution xₚ can be any solution to Ax=b, but the simplest one to find is usually to set the free variables to zero. That is, we write the solution xₚ=(p; 0) where p is the unknown values in the pivot rows, setting the other (free) rows to zero, then plug into Rxₚ=d to get p. Since this part of R is just I, we can easily solve it. Then we add in anything in N(A).
I gave a 3×6 example matrix A (from Strang, section 3.3) that was rank deficient: after elimination, we only had two pivots (rank r=2) in the first two rows, and a whole row of zeros. Furthermore, its pivot columns were the first and third columns, with the second and fourth columns being free — this is possible (albeit unusual)! Showed that we could still get the special solutions by solving for the pivot variables in terms of the free variables = (1,0,…) etcetera, and we still got dimension n-r (= 2) for the null space. When solving Ax=b by elimination into Ux=c, however, we only got a solution x if c was zero in the same rows as U. If the zero third row of U was matched by a zero third row of c, then we got a particular solution as before by setting the free variables to zero. If the zero third row of U was not matched by a zero third row of c, then there is no solution: b was not in the column space C(A). In general, problems that are not full row rank may not have solutions.
Went through four important cases for an m×n matrix A of rank r. (Note that we must have r ≤ m and n: you can't have more pivots than there are rows or columns.)
-
If r=n, then A has full column rank. We must have m ≥ n (it is a "tall" matrix), and N(A)={0} (there are no free columns). Hence, any solution to Ax=b (if it exists at all) must be unique.
-
If r=m, then A has full row rank. We must have n ≥ m (it is a "wide" matrix), and C(A)=ℝᵐ. Ax=b is always solvable (but the solution will not be unique unless m=n).
-
If r=m=n, then A is a square invertible matrix. Ax=b is always solvable and the solution x=A⁻¹b is unique.
-
If r < m and r < n, then A is rank deficient. Solutions to Ax=b may not exist and will not be unique if they do exist.
Cases (1)-(3) are called full rank: the rank is as big as possible given the shape of A. In practice, most matrices that one encounters are full rank (this is essentially always true for random matrices). If the matrix is rank deficient, it usually arises from some special structure of the problem (i.e. you usually want to look at where A came from to help you figure out why it is rank deficient, rather than computing the rank etcetera by mindless calculation). (A separate problem is that of matrices that are nearly rank deficient because the pivots are very small, but the right tools to analyze this case won't come up until near the end of the course.)
Further reading: Textbook sections 3.3–3.4, lecture 8 and lecture 9.
Note: On Wednesday February 28, we will have an in class review for Exam 1. Bring Questions. You may wish to look at previous Exam 1 problems, though note that the timing of Exam 1 can vary from semester to semester so topics can shift.
We looked at the general form of the RREF, considered the four fundamental subspaces, and used a graph as an example. We covered GS 3.5.
Review.
PSET 4 due March 7, 10:55am .
Exam 1
We defined orthogonal subpsaces and orthogonal complements.
Showed that the matrix subspaces have an orthogonality relationship: C(A)⟂N(Aᵀ), in the sense that every vector in C(A) is orthogonal to every vector in N(Aᵀ). In consequence, their intersection is {0} (0-dimensional), and C(A)+N(Aᵀ)=ℝᵐ. Similarly for C(Aᵀ) and N(A). This is the fundamental theorem of linear algebra.
Equivalently, showed that N(Aᵀ) is the orthogonal complement of C(A), defining the orthogonal complement S⟂ of a subspace S as {x such that xᵀy=0 for all y ∈ S}. Showed explicitly that if y is orthogonal to every vector in C(A), then y is necessarily in N(Aᵀ) (and vice versa).
This often gives an nice way to check if a vector is in C(A): b is in C(A) if and only if b is orthogonal to N(Aᵀ) (or to a basis thereof). Gave an example where C(A) is a plane in ℝ³, N(Aᵀ) is the line through 0 orthogonal to that plane, and the equation for b ∈ C(A) was yᵀb=0 for a y ∈ N(Aᵀ)
Considered a number of examples of matrices A, and considered the picture on the front cover of the book.
Concluded with the observation that n independent vectors in ℝᵐ span the space hence form a basis. Similaraly n spanning vector in ℝᵐ are independent, and thus are also a basis.
Further reading: Textbook 4.1; video lecture 14.
Orthogonal projection onto C(A) and other subspaces, and the projection matrix P.
- Projection matrix P = aaᵀ/aᵀa onto 1d subspaces with a basis vector a.
- Projection matrix P = A(AᵀA)⁻¹Aᵀ onto n-dimensional subspaces C(A), where A is m×n with full column rank (rank n).
- Equivalence between orthogonal projection and least-squares: minimizing ‖b-Ax‖ is equivalent to minimizing ‖b-y‖ over y∈C(A), and the solution is y=Ax̂=Pb, where AᵀAx̂=Aᵀb as before.
- Key properties P²=P, P=Pᵀ, C(P)=C(A), N(P)=N(Aᵀ).
- Projection I-P onto the orthogonal complement of C(A), i.e onto N(Aᵀ).
- Projection P = B(BᵀB)⁻¹Bᵀ onto an arbitrary subspace, where B is a matrix whose columns are the basis vectors. For example, if A is not full column rank, we can make a new matrix B out of the pivot columns and use B(BᵀB)⁻¹Bᵀ to project onto C(A)=C(B).
We mentioned that the SVD can be used when the subspace is not known, but is to be determined by the data, and deep neural nets (not 18.06) are used when we are fitting to a nonlinear space.
Further reading: Textbook 4.2; video lecture 15.
PSET 5 due March 14, 10:55am .
Solutions to PSET 4
We went through least squares with geometry, algebra, and calculus.
We mentioned that one can use A\b in Julia to find the least squares solution to Ax̂=b whose explicit solution is the same as (AᵀA)x̂=(Aᵀb) as long as A has independent columns.
We show how to view two dimensional line fitting which minimizes vertical distances squared to a line is equivalent to an orthgonal projection in ℝ^n where the A is an nx2 matrix.
We recommended memorizing a few basic gradients, not so much for 18.06 but other courses you may take. We took gradients of functions with vector inputs and scalar outputs. It is worth noting that the size of a gradient is always the size of the variable the gradient is taking the derivatives with respect to. (x in these examples)
∇ₓ(constant) = the zero vector in ℝ^n
∇ₓ( sum(x) ) = the ones vector in ℝ^n
∇ₓ( w ⋅ x ) = ∇ₓ( wᵀx ) = w
∇ₓ( xᵀMx ) = (M+Mᵀ)x (which is 2Mx if M is symmetric)
Finding extrema such as minimums is solved by setting a gradient to 0.
We then showed how if you find the gradient of (Ax-b)ᵀ(Ax-b) = xᵀAᵀAx - 2bᵀAx + bᵀb, and set it to 0 we recover (AᵀA)x̂=(Aᵀb).
In machine learning one often computes gradients with respect to matrices not vectors, and so the result is a matrix not a vector. The idea is the same, one computes the derivative of a scalar output with respect to every component of a matrix.
An important example in machine learning worth memorizing: ∇W(xᵀWx) = xxᵀ for a square W. More generally ∇W(yᵀWx) = yxᵀ. (Check that if W is m×n, so is yxᵀ).
There are lots of other least squares examples worth looking at in the least squares lecture notebook
Further reading: Textbook 4.3; video lecture 16.
We showed that projections and least squares simplify with orthonormal bases, discussed angles in higher dimensions (as defined by acos(x⋅y/(‖x‖‖y‖)).
We talked about orthogonal matrices being defined as matrices with orthonormal columns. Orthogonal matrices preserve length and dot products, hence angles.
We considered rotations, permutations, and reflections (I-2uuᵀ) where u is a unit vector.
We began a Gram Schmidt computation in a Julia notebook. Many students noticed that this seemed too much like a hand computation on a computer, and asked if this could be done in a better way on a computer. We will see that this is possible in Lecture 17.
Further reading: Textbook 4.4; video lecture 17
We discussed two forms of infinite dimensional vector space. The discrete v=(v₁,v₂,v₃,…) or continuous functions defined on an interval such as [0,2π], or [-1,1]. Dot products are infinite sums for the discrete case. We take integrals in the continuous case: (f,g) = ∫f(x)g(x)dx, where the integration is over the interval. In Hilbert space, we take only those vectors of finite length. This is a subspace. One can talk about angles between two functions or the length of a function, and this makes perfect sense.
We then talked about Fourier series as an example of P = UUᵀ. The purpose was not so very much to teach Fourier series, but to see Fourier series as the same mathematics as the finite matrix projection or expansion. We also briefly looked at the Julia notebooks for sine series and orthogonal polynomials:
sine series notebook
orthogonal polynomial notebook
Further reading: Textbook 10.5; video lecture 24, starting at 35:15
PSET 6 due Thursday !! March 22, 10:55am.
Solutions to PSET 5
We discussed how if A has independent columns, one can form A = QR, where Q and A have the same size and R is upper triangular. R encodes the fact that the span of the first k columns of Q matches that of the first k columns of A, for k=1,2,....
We compared and constrasted A,Q, and the U from the svd. They all have the same size. They all have the same column space. A contains columns in arbitrary directions. Q and U both have orthonormal columns but are not generally the same matrix. Q has the spanning property mentioned above. U has a best approximation property, which we will see later in the course.
We then began to talk about the properties of determinants. We mentioned that determinants are more useful for theory than for practical computations. Determinants are rarely used in practical computer programs.
A determinant is a scalar function of an n×n matrix. One can write down an explicit formula or the three basic properties:
- Det(I)=1
- If you interchange rows, the determinant flips sign
- The determinant is a linear function of each row separately.
For starters it is not clear that there is a function with the above three properties, or whether it is unique, and that is what we will investigate.
Other properties of determinants may be seen in this notebook:
Further reading: Textbook 5.1; video lecture 18
We continued to prove the various properties of the determinant found in
this notebook:
determinants notebook
In particular, we methodically showed step by step that the three axioms imply that the the determinant of a matrix must be (-1)^(# row exchanges) x (product of the pivots.) That's it, nothing else, it must be that.
Now that we know the three properties imply one unique function, then we can use this to show that D(A) defined as |AB|/|B| for any non-singular B, has the three properties so for sure D(A)=|A|, and thus |AB|=|A||B|. (If A is singular or B is singular, it is easy to see that AB is singular, so these cases can be checked easily.)
Further reading: Textbook 5.2; video lecture 19
We proved |Aᵀ|=|A| and discussed how |permutation matrix| = ±1. This allowed us to see that if you took the fifteen puzzle and you interchanged two pieces you would render th epuzzle insolvable. Here's the idea: 1) if one checkerboards (colors alternately black and red) the 4x4 array, it's easy to see that if the "hole" journeys to the same starting position, it would take an even number of steps. The state of the puzzle can be encoded as a 16x16 permutation matrix, and every step is a row exchange, hence switches the sign of the determinant. Thus if you interchange two pieces artifically you can never solve the puzzle. The same argument would basically be true if you interchanged two of the edge centered pieces of a Rubik's cube. The state is a 12x12 permutation matrix, and a 3d checkerboarding shows you need an even number of steps to return to where you start.
We showed that the pivots of a matrix are given by a ratio of determinants , and also the solution to Ax=b, is a ratio of determinants by Cramer's rule.
We discussed that if L(x) is a linear function of x with scalar values, then L(x) must have the form C ⋅ x for some vector C. We then wrote down C for 2x2 matrix determinants and 3x3 matrix determinants.
We wrote down the big formula for determinants in terms of permutations.
PSET 7 due Wednesday April 4, 10:55am.
Further reading: Textbook 5.3; video lecture 20
We derived an explicit formula for the elements of A⁻¹, discussed the cofactor rules, the cross product, and volumes. In particular the cofactor formula is the very Property 3. We explained that perhaps one can view it as a formula for determinants, but one should also understand that the very linearity in Propety 3 is given explicitly in that formula.
We saw that the cross product was really the cofactors and explained there could be a cross product in higher dimensions but it would take n-1 arguments and produce n cofactors.
Familiar formulas like if you double every length in a 3d object, the volume is multiplied by 2^3=8 generalize to higher dimensions and are captured by the determinant formula.
Further reading: Textbook 5.3.
Began Eigenvalues Ax=λx
A=XΛX⁻¹
Eigenvalues are ways to turn hard matrix problems into a system of easy scalar problems.
Considered eigenvalues of Projection Matrices, Rotation Matrices, Reflection Matrices.
Pure mathematicians make a big deal about the eigenvalues as the roots of the characteristic polynomial det(A–λI). This is one good way to think about eigenvalue problems (e.g. it tells you immediately to expect ≤ m eigenvalues, possibly complex, from an m×m matrix). But it is not really a good way to compute them except for tiny (e.g. 2×2) matrices.
In fact, it's actually the reverse: one of the best ways to compute roots of polynomials is to convert the polynomial into a matrix and find the eigenvalues.
The matrix factorization A=XΛX⁻¹ is also sometimes a better way to think about eigenvalues.
The actual way eigenvalues are computed is a topic for another class (e.g. 18.335); in 18.06, we will focus mainly on their properties and usage. The computer will handle finding them.
Further reading: Strang, section 6.1; video lecture 21.
Review
Exam 2
Similar Matrices
Flow Chart as to when matrices may be similar.
Further reading: Strang, sections 6.1 and 6.2; video lecture 21 and lecture 22
The matrices tA + (1-t)I all have the same eigenvectors for any scalar t. This Julia code is meant to be a fun illustration but it falls short. If one of you would like to fix this, it would be fantastic.
The idea of similar matrices as buckets and for diagonalizible matrices, a diagonal is a nice representative.
Analysis of Fibonacci numbers.
Analysis of Aᵏu₀
Markov Matrices (Positive Markov Matrices)
Markov matrices and eigenvalues.
Further reading: Strang, section 10.3 and video lecture 24.
We can now solve systems of ODEs dx/dt = Ax in terms of eigenvectors and eigenvalues. Each eigenvector is multiplied by exp(λt) (where exp(x)=eˣ), so that solutions decay if the eigenvalues have negative real parts (and approach a nonzero steady state if one eigenvalue is zero). We can also express this in terms of a new Matrix operation eᴬᵗ, the matrix exponential.
Further reading: Strang, section 6.3 and video lecture 23.
Symmetric matrices and Positive definite matrices.
We proved that symmetric matrices have real eigenvalues and eigenvectors for distinct eigenvalues are orthogonal. Further symmetric matrices are always diagonalizable.
Positive definite matrices have a number of equivalent characterizations some of which are not so obviously equivalent.
Worth noting that for symmetric matrices, the singular values are the absolute values of the eigenvalues, while for positive definite matrices, the singular values are the same as eigenvalues.
Further reading: Strang, sections 6.4–6.5, and video lecture 25 .
As an application of real-symmetric and positive-definite matrices, I returned to the system of masses and springs from lecture 23, but this time I considered N masses m and N+1 springs. I showed that Newton's laws take the form:
- mẍ = -kDᵀDx ⟹ ẍ = -Ax, where D is an incidence matrix and A=(k/m)DᵀD.
A is obviously real-symmetric, so its eigenvalues λ are real. With a little more work, we saw that it must be positive-definite. In particular, take any eigensolution Av=λv for v≠0. Taking a dot product vᵀAv=λvᵀv, we found vᵀAx=(k/m)vᵀDᵀDv=(Dv)ᵀDx=λvᵀv, so λ=(k/m)‖Dv‖²/‖v‖² ≥ 0 immediately follows. Since Dᵀ is upper triangular, by inspection it had rank Ν (N pivots), and hence D has full column rank so Dv≠0 and λ>0.
The fact that A is negative definite allowed us to derive that any such system of masses and springs has orthogonal oscillating solutions called the normal modes. In particular, given the eigenvectors vⱼ (orthogonal since A is real-symmetric!), satisfying Avⱼ=λⱼvⱼ with λⱼ>0, we expanded the solution x(t)=∑ⱼcⱼvⱼ in the basis of these eigenvectors. For each eigenvector component, the matrix A acts just like a number λ, allowing us to easily solve the equation c̈ⱼ=-λⱼcⱼ to get sines and cosines, and hence to get the general solution:
- x(t) = ∑ⱼ [αⱼ cos(ωⱼt) + βⱼ sin(ωⱼt)] vⱼ
where ωⱼ=√-λⱼ, and αⱼ and βⱼ are determined from the initial conditions x(0) and ẋ(0).
The key point is that the structure of the problem told us that λⱼ>0 and hence that the frequencies ωⱼ are real numbers. (If they were complex, we would have exponentially growing or decaying solutions, which would make no physical sense for a system of lossless springs and masses.) The moral of this story is that real-symmetric and definite matrices don't just fall down out of the sky, they arise from how the matrix was constructed, and that these matrix properties are often the key to understanding the physical properties of real problems.
Further reading: Strang, section 10.2. See also these notes on the springs-and-masses problem from 18.303 (you can ignore the last two pages, which go beyond 18.06, and ignore the Δx factor which is used in 18.303 to connect the discrete problem to a continuous problem).
In this lecture, I want to introduce you to a new type of matrix: circulant matrices. Like real-symmetric matrices, they have orthonormal eigenvectors, but unlike real-symmetric matrices we know exactly what their eigenvectors are! Moreover, their eigenvectors are closely related to the famous Fourier transform and Fourier series. Even more importantly, it turns out that circulant matrices and the eigenvectors lend themselves to incredibly efficient algorithms called FFTs, that play a central role in much of computational science and engineering.
Further reading: The textbook, sections 8.3 and 9.3, has some basic information on these topics. The Wikipedia articles on Circulant matrix, discrete Fourier transform, and fast Fourier transform aren't too bad. Some elementary lecture notes on FFTs from 18.095 talk more about the algorithms.
** Quiz 3 will cover material up to today's lecture. **
We went through a "real" proof of the Perron Frobenius theorem which states that a matrix will all positive entries has an eigenpair where every entry in Ax = tx is positive, and further all other eigenvalues have absolute value less than t.
The writeup of the proof along with an optimization demonstration that mimics the key part of the proof may be found in a Perron-Frobenius notebook:
We remembered the five characterizations of positive definite matrices and pointed out again they were all equivalent.
If S is positive definite we used the eigendecomposition to understand all the points x such that x'Sx=1, and also the image of the unit sphere under the mappsing x goes to Sx. Both are ellipsoids with axes in the direction of the eigenvectors, the semi-lengths of the axes being 1/sqrt(eigenvalues) in the first case, and the eigenvalues in the second case.
We then asked what is the image of the unit sphere for a general A? It can not be the eigenvalues as they may not be real. It is not even the absolute value of the eigenvalues. A quick interpretation of the SVD shows that we have an ellipsoid with columns in the direction of the columns of V and semi-lengths the singular values.
See GS p. 354-356,392,477
Test Topics: Mostly from Chapter 6 with a little bit from Chapter 7 and 10.3.
We explained the division by (n-1) in the sample covariance formula by studing the projection matrix P = eye(n) .- 1/n
. We saw that applying this to randn(n) and using the orthogonal invariance of randn(n) shows that the sum of squares of the projection is exactly n-1.
This may be found in the notebook:
(Sample variance division by n-1)[http://nbviewer.jupyter.org/github/stevengj/1806/blob/spring18/lectures/Sample%20Variance%20division%20by%20n-1.ipynb]
One can also see GS chapter 12.1 and perhaps a little 12.3.
We also talked about the approximation power of the SVD including image compression.
Review
Quiz 3.
When doing numerical calculations, we keep running into little roundoff errors that arise from the computer only keeping a finite number of digits in its arithmetic. Mostly, we've been ignoring these or waving our hands, but a huge branch of mathematics is devoted to the origin and propagation of errors in calculations.
In this lecture, we take a first step in that subject by asking a simple question: if we solve Ax=b and have a little error Δb in b (either due to roundoff, or measurement errors, or something else), how big is the resulting error in Δx? This deceptively simple question leads to lots of interesting topics, e.g. induced matrix norms and matrix condition numbers.
Further reading: Strang, section 9.2. For a much more thorough discussion of these topics, see classes like 18.335 or (to a lesser extent) 18.330, as well as more advanced textbooks like Numerical Linear Algebra by Trefethen and Bau.
In this and the next lecture, we discussed the more abstract approach of linear algebra that one finds in more advanced classes. The power of this approach is that it, in one fell swoop, applies to vector spaces of concrete vectors, function spaces, vectors of vectors, vectors of matrices, and so much more.
Another benefit of the abstract approach is that it explains why matrix multiply is defined as it is. Perhaps you wondered why this complicated matrix multiply is what it is. Now you know.
We defined linear transformations and pointed out that given bases there is always a matrix for the transformation. Nonetheless, the linear transformation can be simpler than the matrix.
We gave a few examples: 1) the concrete matrix times vector 2) stencils on a grid 3) rotation in x,y of polynomials of the form ax^2 + bxy + cy^2.
Further reading: Strang, all of Chapter 8. Video Lecture 30
More linear transformations. See the notebook
Of particular note is the derivative in two different bases and how one matrix must be similar to the other.
We pointed out that symbolically we could write T[v1,...,vn] for [Tv1,...,Tvn] where v1,...,vn is a basis for V. In this notation T[v1,...,vn]c = [w1,....,wm]Ac describes every linear transformation where c are the coordinates for a vector in V, and d=Ac are the coordinates in W.
On a numerical computer one would just work with c and d=Ac, and somewhere in the documentation one would understand the correspondence with the vector space and bases in mind.
The identity transformation when W=V and m=n is the transformation Tv=v for all v in V.
We consider two situations
1. rotation of ℜ³
2. polynomial shifting
where
one might have T=identity A≠I and T≠identity A=I.
- Let V=W=ℜ³, Q=[q1 q2 q3] be an orthogonal rotation matrix. The columns can be thought of as axes in a rotated coordinate system. We can call q1,q2,q3 the q-basis and the usual x,y,z axes the standard-basis.
1a. If the input and output bases are each the standard basis, and if T is rotation, then the matrix is Q. Literally d=Qc is a rotation of a vector with coordinates (c[1],c[2],c[3]) to a new rotated position (d[1],d[2],d[3])
1b. If the input basis is the standard basis and the output basis is the q-basis, then the same rotation in 1a is represented by the identity matrix, I.
1c. If the input basis is the q-basis and the output basis is the standard basis, then the identity transformation T is represented by the matrix Q. In this situation, no vector moves, but d=Qc represents taking a vector in the q frame of reference and writing it in the standard frame of reference.
- Let V=W=cubic polynomials (dim=4). We call the polynomials 1,(x+1),(x+1)^2,(x+1)^3 the translated basis or t-basis. The standard-basis will be 1,x,x^2,x^3.
Let P be the 4x4 upper triangular Pascal matrix .
2a. Input=Output=standard-basis. T(f(x))=f(x+1). Matrix = P. Here polynomials shift. The binomial theorem that you should have seen in high school proves the correctness of this matrix. (Note you should now be able to invert this matrix by expanding (x-1)^n, rather than the methods you learned at the start of the semester.)
2b. Input=standard basis, Output=t-basis, same shift transformation in 2a is represented as the identity matrix, I. (analogy with 1b above.) The polynomial is shifting, but it's coordinates are the same numbers interpreted in a new basis.
2c. Input=t-basis, Output=standard basis, then the identity transformation also has matrix P. (Note the analogy with 1c above.) Here we are using the pascal matrix rather than symbolic manipulation to rewrite c0 + c1(x+1) + c2(x+1)^2 + x3(x+1)^3 as d0 + d1 x + d2 x^2 + d3 x^3. Again note the polynomial has not shifted, it is just being rewritten in the monomial basis.
Summary
Input Basis | Output Basis | T | Matrix | |
---|---|---|---|---|
1a | standard | standard | Rotation | Q |
1b | standard | q | Rotation | I |
1c | q | standard | Identity | Q |
2a | standard | standard | shift | P |
2b | standard | t-basis | shift | I |
2c | t-basis | standard | Identity | P |
If you can appreciate how mathematicians see the analogy between the two examples as more than an analogy, but a common theoretical structure then you can appreciate why the linear transformation concept is more powerful than the matrix times vector concept.