This new MIT course (Spring 2025) introduces numerical methods and numerical analysis to a broad audience (assuming 18.03, 18.06, or equivalents, and some programming experience). It is divided into two 6-unit halves:
- 18.S190/16.S090 (first half-term “hub”): basic numerical methods, including curve fitting, root finding, numerical differentiation and integration, numerical differential equations, and floating-point arithmetic. Emphasizes the complementary concerns of accuracy and computational cost. Prof. Steven G. Johnson and Prof. Qiqi Wang.
- Second half-term: three options for 6-unit “spokes”
  - 18.S191/16.S091 — numerical methods for partial differential equations: finite-difference and finite-volume methods, boundary conditions, accuracy, and stability. Prof. Qiqi Wang.
  - 18.S097/16.S097 — large-scale linear algebra: sparse matrices, iterative methods, randomized methods. Prof. Steven G. Johnson.
  - 18.S192/16.S098 — parallel numerical computing: multi-threading and distributed-memory, and trading off computation for parallelism — may be taken simultaneously with other spokes! Prof. Alan Edelman.
- Taking both the hub and any spoke will count as an 18.3xx class for math majors, similar to 18.330. Taking both the hub and the PDE spoke will substitute for 16.90. Weekly homework, no exams, but spokes will include a final project.
This repository is for the "hub" course (currently assigned the temporary numbers 18.S190/16.S090).
Instructors: Prof. Steven G. Johnson and Prof. Qiqi Wang.
Lectures: MWF10 in 2-142 (Feb 3 – Mar 31), slides and notes posted below. Lecture videos posted in Panopto Video on Canvas.
Homework and grading: 6 weekly psets, posted Fridays and due Friday midnight; psets are accepted up to 24 hours late with a 20% penalty; for any other accommodations, speak with S3 and have them contact the instructors. No exams.
- Homework assignments will require some programming — you can use either Julia or Python (your choice; instruction and examples will use a mix of languages).
- Submit your homework electronically via Gradescope on Canvas as a PDF containing code and results (e.g. from a Jupyter notebook) and a scan of any handwritten solutions.
- Collaboration policy: Talk to anyone you want to and read anything you want to, with two caveats: First, make a solid effort to solve a problem on your own before discussing it with classmates or googling. Second, no matter whom you talk to or what you read, write up the solution on your own, without having their answer in front of you (this includes ChatGPT and similar). (You can use psetpartners.mit.edu to find problem-set partners.)
Teaching Assistants: Mo Chen and Shania Mitra (shania at mit.edu)
Office Hours: Wednesday 4pm in 2-345 (Prof. Johnson) and Thursday 5pm via Zoom (Prof. Wang).
Resources: Piazza discussion forum, math learning center, TSR^2 study/resource room, pset partners.
Textbook: No required textbook, but suggestions for further reading will be posted after each lecture. The book Fundamentals of Numerical Computation (FNC) by Driscoll and Braun is freely available online, has examples in Julia, Python, and Matlab, and is a valuable resource. Fundamentals of Engineering Numerical Analysis (FENA) by Moin is another useful resource (readable online with MIT certificates).
This document is a brief summary of what was covered in each lecture, along with links and suggestions for further reading. It is not a good substitute for attending lecture, but may provide a useful study guide.
- Overview and syllabus: slides and this web page
- Finite-difference approximations: Julia notebook and demo
Brief overview of the huge field of numerical methods, and outline of the small portion that this course will cover. The key new concerns in numerical analysis are (i) performance (traditionally, arithmetic counts, but now memory access often dominates) and (ii) accuracy (both floating-point roundoff errors and also convergence of intrinsic approximations in the algorithms). In contrast, the more pure, abstract mathematics of continuity is called "analysis", and is mainly concerned with (ii) but not (i): they are happy to prove that limits converge, but don't care too much how quickly they converge. Whereas traditional discrete computer science is mainly concerned with (i) but not (ii): they care about performance and resource usage, but traditional algorithms like sorting are either right or wrong, never approximate.
As a starting example, considered the convergence of finite-difference approximations to derivatives df/dx of given functions f(x), which appear in many areas of numerical analysis (such as solving differential equations) and are also closely tied to polynomial approximation and interpolation. By examining the errors in the finite-difference approximation, we immediately see two competing sources of error: truncation error from the non-infinitesimal Δx, and roundoff error from the finite precision of the arithmetic. Understanding these two errors will be the gateway to many other subjects in numerical methods.
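As an illustration (a minimal sketch of my own, not the posted notebook), the following Julia snippet shows both error regimes for a forward-difference approximation of the derivative of sin(x) at x = 1:

```julia
# Forward-difference approximation of f'(x) for shrinking δx: the error first
# decreases (truncation error ∝ δx) and then increases again (roundoff error).
f(x)  = sin(x)
fp(x) = cos(x)                 # exact derivative, used only to measure the error
x = 1.0
for δx in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12)
    fd = (f(x + δx) - f(x)) / δx
    println("δx = $δx:  |error| = $(abs(fd - fp(x)))")
end
```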
Further reading: FNC book: Finite differences, FENA book: chapter 2. There is a lot of information online on finite-difference approximations, e.g. these 18.303 notes or Section 5.7 of Numerical Recipes. The Julia FiniteDifferences.jl package provides lots of algorithms to compute finite-difference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floating-point arithmetic leads to catastrophic cancellation errors.
One of the most basic sources of computational error is that computer arithmetic is generally inexact, leading to roundoff errors. The reason for this is simple: computers can only work with numbers having a finite number of digits, so they cannot even store arbitrary real numbers. Only a finite subset of the real numbers can be represented using a particular number of "bits", and the question becomes which subset to store, how arithmetic on this subset is defined, and how to analyze the errors compared to theoretical exact arithmetic on real numbers.
In floating-point arithmetic, we store both an integer coefficient and an exponent in some base: essentially, scientific notation. This allows large dynamic range and fixed relative accuracy: if fl(x) is the closest floating-point number to any real x, then |fl(x)-x| < ε|x| where ε is the machine precision. This makes error analysis much easier and makes algorithms mostly insensitive to overall scaling or units, but has the disadvantage that it requires specialized floating-point hardware to be fast. Nowadays, all general-purpose computers, and even many little computers like your cell phones, have floating-point units.
Went through some simple definitions and examples in Julia (see notebook above), illustrating the basic ideas and a few interesting tidbits. In particular, we looked at error accumulation during long calculations (e.g. summation), as well as examples of catastrophic cancellation and how it can sometimes be avoided by rearranging a calculation.
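For example, here is a small illustration (in the spirit of the in-class examples, though not copied from the notebook) of catastrophic cancellation and how rearranging a formula can avoid it:

```julia
# Computing (1 - cos(x)) / x^2 for small x: the subtraction 1 - cos(x) cancels
# almost all the significant digits, but the mathematically equivalent form
# 2*sin(x/2)^2 / x^2 involves no cancellation.
x = 1e-8
naive  = (1 - cos(x)) / x^2    # ≈ 0.0: catastrophically wrong
stable = 2 * sin(x/2)^2 / x^2  # ≈ 0.5: correct to nearly machine precision
println(naive, "   ", stable)
```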
Further reading: FNC book: Floating-point numbers. Trefethen & Bau's Numerical Linear Algebra, lecture 13. What Every Computer Scientist Should Know About Floating Point Arithmetic (David Goldberg, ACM 1991). William Kahan, How Java's floating-point hurts everyone everywhere (2004): contains a nice discussion of floating-point myths and misconceptions. A brief but useful summary can be found in this Julia-focused floating-point overview by Prof. John Gibson. Because many programmers never learn how floating-point arithmetic actually works, there are many common myths about its behavior. (An infamous example is 0.1 + 0.2 giving 0.30000000000000004, which people are puzzled by so frequently it has led to a web site https://0.30000000000000004.com/!)
- Interpolation OneNote Notebook Code Complete Notes
- pset 1: due Feb 14
Discussed the important problem of interpolating a function from a finite set of data points, focusing first on piecewise-linear interpolation.
Further reading: FNC chapter 5 and FENA chapter 1. Piecewise linear interpolation is implemented in Python by numpy.interp, and several other interpolation schemes by scipy.interpolate. Interpolation packages in Julia include Interpolations.jl, Dierckx.jl (splines), BasicInterpolators.jl, and FastChebInterp.jl (high-order polynomials).
A basic overview of the Julia programming environment for numerical computations. This tutorial will cover what Julia is and the basics of interaction, scalar/vector/matrix arithmetic, and plotting — just as a "fancy calculator" for now (without the "real programming" features).
- Tutorial materials (and links to other resources)
If possible, try to install Julia on your laptop beforehand using the instructions at the above link. Failing that, you can run Julia in the cloud (see instructions above).
This won't be recorded, but you can find a video of a similar tutorial by Prof. Johnson last year (MIT only), as well as many other tutorial videos at julialang.org/learning.
- Notes: OneNote Notebook
One approach to generalizing piecewise-linear interpolation is to interpolate all $n+1$ data points at once with a single polynomial of degree $n$.
A general conceptual approach is to set up a system of linear equations for the polynomial coefficients: choosing a basis for the polynomials (e.g. the monomials $1, x, x^2, \ldots, x^n$) and requiring the polynomial to match the data at each point yields a "Vandermonde" matrix equation $Ac = y$ for the coefficient vector $c$. However, this runs into two problems for high degrees $n$:
- The matrix $A$ is nearly singular for large $n$ in the monomial basis, so floating-point roundoff errors are exacerbated. (We will say that it has a large condition number or is "ill-conditioned", and will define this more precisely next time.) Solution: It turns out that monomials are just a poorly behaved basis for high-degree polynomials, and it is much better to use orthogonal polynomials, most commonly Chebyshev polynomials, as a basis — with these, people regularly go to degrees of thousands or even millions.
- Polynomial interpolation from equally spaced points (in any basis!) can diverge exponentially from the underlying "true" smooth function in between the points (even in exact arithmetic, with no roundoff errors!). This is called a Runge phenomenon. Solution 1: Use carefully chosen non-equispaced points. A good choice that leads to exponentially good polynomial approximations (for smooth functions) is the Chebyshev nodes, which are clustered near the endpoints. Solution 2: use a lower-degree ($\ll n$) polynomial and perform an approximate fit to the given points, rather than requiring the polynomial to go through them exactly. (More on this soon.)
If you address both of these problems, high-degree polynomial approximation can be a fantastic tool for describing smooth functions. If you have noisy data, however, you should typically use much lower-degree polynomials to avoid overfitting (trying to match the noise rather than the underlying "true" function).
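A minimal sketch (my own illustration, not a posted demo) of the Runge phenomenon: interpolating $f(x) = 1/(1+25x^2)$ on $[-1,1]$ with a degree-20 polynomial from equispaced vs. Chebyshev points. For simplicity it solves a monomial-basis Vandermonde system, which is tolerable at this modest degree even though (as noted above) that basis becomes badly conditioned at much higher degrees:

```julia
using LinearAlgebra

runge(x) = 1 / (1 + 25x^2)

# maximum interpolation error of the degree-n polynomial through the given points
function interp_error(points)
    n = length(points) - 1
    V = [x^j for x in points, j in 0:n]   # Vandermonde matrix in the monomial basis
    c = V \ runge.(points)                # interpolating-polynomial coefficients
    p(x) = evalpoly(x, c)                 # Horner evaluation of the polynomial
    return maximum(abs(p(x) - runge(x)) for x in range(-1, 1; length=1001))
end

n = 20
equispaced = range(-1, 1; length=n+1)
chebpoints = [cos(π*k/n) for k in 0:n]    # Chebyshev points, clustered near ±1
interp_error(equispaced), interp_error(chebpoints)  # huge error vs. small error
```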
Further reading: FNC book, chapter 9. Beware that the FENA book starts with the "Lagrange formula" for the interpolating polynomial, but this formula is very badly behaved ("numerically unstable") for high degrees and should not be used; it is superseded by the "barycentric" Lagrange formula (see FNC book; reviewed in much more detail by Berrut and Trefethen, 2004). The subject of polynomial interpolation is the entry point into approximation theory; if you are interested, the book by Trefethen and accompanying video lectures are a great place to get more depth. The numpy.polynomial module contains a variety of functions for polynomial interpolation, including with Chebyshev polynomials. A package for multi-dimensional Chebyshev polynomial interpolation in Julia is FastChebInterp. A pioneering package that shows off the power of Chebyshev polynomial interpolation is chebfun in Matlab, along with related packages in other languages. (This approach is taken to supercomputing extremes by the Dedalus package to solve partial differential equations using exponentially converging polynomial approximations.) It turns out that if you are interpolating using Chebyshev polynomials, and you choose your points to be Chebyshev points, then the interpolation can be computed especially efficiently, in $O(n \log n)$ operations using fast Fourier transforms (FFTs).
The goal of this lecture is to precisely define the notion of a condition number, which quantifies the sensitivity of a function f(x) to small perturbations in the input. A function that has a "large" condition number is called "ill-conditioned", and any algorithm to compute it may suffer inaccurate results from any sort of error (whether it be roundoff errors, input noise, or other approximations) — it doesn't mean you can't use that function, but it usually means you need to be careful (about both implementation and about interpretation of the results).
For a given function $f(x)$ and input $x$, we define two condition numbers:
- absolute condition number $\kappa_a(f, x) = \lim_{\epsilon \to 0^+} \sup_{\Vert \delta x \Vert = \epsilon} \frac{\Vert \overbrace{f(x + \delta x) - f(x)}^{\delta f} \Vert}{\Vert \delta x \Vert}$. It is the worst-case $\Vert \delta f \Vert / \Vert \delta x \Vert$ for any arbitrarily small input perturbation $\delta x$. Unfortunately, if the inputs and outputs of $f$ have different scales ("units"), then $\kappa_a$ may be hard to interpret (it is "dimensionful").
- relative condition number $\kappa_r(f, x) = \lim_{\epsilon \to 0^+} \sup_{\Vert \delta x \Vert = \epsilon} \frac{\Vert \delta f \Vert / \Vert f(x) \Vert}{\Vert \delta x \Vert / \Vert x \Vert} = \kappa_a(f, x) \frac{\Vert x \Vert}{\Vert f(x) \Vert}$. This is a dimensionless quantity: it is the maximum ratio of the relative change in $f$ to the relative change in $x$. Most of the time, $\kappa_r$ is the quantity of interest.
All of these quantities involve the central concept of a norm ‖⋯‖, which is a way of measuring the "length" of a vector. The most familiar norm, and usually the default or implicit choice for column vectors $x \in \mathbb{R}^n$, is the Euclidean norm $\Vert x \Vert = \sqrt{\sum_i |x_i|^2}$.
For example, looked at the condition number of summation, $f(x) = \sum_i x_i$: its relative condition number becomes large when the sum nearly cancels (the result is much smaller in magnitude than the inputs), which is exactly the catastrophic-cancellation situation we saw with floating-point arithmetic.
If the function is differentiable, then the condition number simplifies even further:
- If $x$ and $f(x)$ are scalars, then $\kappa_a = |f'(x)|$ is simply the magnitude of the derivative.
- If $x \in \mathbb{R}^n$ and $f(x) \in \mathbb{R}^m$ are vectors, then $\kappa_a = \Vert J \Vert$ is the "operator" or "induced" norm of the Jacobian matrix $J$ ($J_{i,j} = \partial f_i /\partial x_j$), where the induced norm $\Vert A \Vert = \sup_{x \ne 0} \frac{\Vert A x \Vert}{\Vert x \Vert}$ measures "how big" a matrix $A$ is by how much it can stretch a vector. (We won't worry too much yet about how to compute this induced norm, but if you know the SVD it is the largest singular value of $A$.)
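As a quick numerical check of this formula (my own illustrative example, not from the notes), one can compute $\kappa_a = \Vert J \Vert$ and the corresponding $\kappa_r$ for a simple vector-valued function:

```julia
using LinearAlgebra

f(x) = [x[1] + x[2], x[1] * x[2]]
J(x) = [1.0 1.0; x[2] x[1]]        # Jacobian J[i,j] = ∂fᵢ/∂xⱼ, worked out by hand

x = [3.0, 4.0]
κa = opnorm(J(x))                  # induced 2-norm = largest singular value of J
κr = κa * norm(x) / norm(f(x))     # relative condition number at this x
(κa, κr)
```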
This leads us to our next important concept, the (relative) condition number of a matrix $A$, defined as $\kappa(A) = \Vert A \Vert \, \Vert A^{-1} \Vert$: it bounds how much relative errors in the right-hand side $b$ (or in $A$ itself) can be amplified in the solution $x$ when solving $Ax = b$.
For example, we can now explain why the monomial basis was so bad: it is easy to see that the Vandermonde matrix becomes nearly singular for large $n$, so its condition number $\kappa(A)$ grows enormously with the polynomial degree, and roundoff errors are amplified accordingly.
Further reading: FNC book: problems and conditioning and conditioning linear systems. 18.06 lecture notes on conditioning of linear systems. Advanced books like Numerical Linear Algebra by Trefethen and Bau (lecture 12) treat this subject in much more depth. See also Trefethen, lecture 3, for more in-depth coverage of norms. A fancy vocabulary word for a vector space with a norm (plus some technical criteria) is a Banach space.
Lecture 6 (Feb 14 💕)
- pset 1 solutions
- pset 2: due Feb 21
- Notes: OneNote Notebook
Introduced the topic of least-square fitting of data to curves. As long as the fitting function is linear in the unknown coefficients c (e.g. a polynomial, or some other linear combination of basis functions), showed that minimizing the sum of the squares of the errors corresponds to minimizing the norm of the residual, i.e. the "loss function" L(c) = ‖Ac - y‖², where A is a Vandermonde-like matrix (whose columns are the basis functions evaluated at the data points) and y is the vector of data values to be fit.
It is a straightforward calculus exercise to show that ∇L = 2Aᵀ(Ac - y), which means that the optimal coefficients c can be found by setting the gradient to zero, and ∇L=0 implies the "normal equations" AᵀAc = Aᵀy. In principle, these can be solved directly, but the normal equations square the condition number of A (κ(AᵀA)=κ(A)²), so they are normally solved in a different way, typically by QR factorization of A (or sometimes using the SVD of A); there are typically library functions that do this for you, e.g. c = numpy.linalg.lstsq(A, y) in Python or c = A \ y in Julia and Matlab.
More generally, minimizing L(c) is an example of an optimization problem. One simple way to attack such problems is by gradient descent: simply go downhill, in steps Δc = -s∇L where s is the "learning rate" or "step size" parameter (that has to be chosen carefully: too small and it will converge very slowly, but too large and it won't converge at all). It turns out that an ill-conditioned A also leads to gradient descent that converges slowly (because the local downhill direction -∇L doesn't point to the minimum, necessitating small steps). Nowadays, naive gradient descent (especially with a fixed learning rate) is a rather primitive technique that has been mostly superseded by better "accelerated" methods, but computing the gradient ∇L and using it to identify the downhill direction is still a key conceptual starting point for many algorithms. Of course, for least-square fitting where L(c) is linear in c, we can directly solve for c as shown above, but iterative optimization methods are crucial to solve more general problems where the unknowns enter in a nonlinear way (and/or there are constraints).
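Here is a minimal sketch (illustrative data and parameters of my own, not the in-class demo) contrasting the direct least-squares solve with naive gradient descent on the same loss L(c) = ‖Ac − y‖²:

```julia
using LinearAlgebra

x = range(0, 1; length=50)
y = 1 .+ 2 .* x .- 3 .* x.^2 .+ 0.01 .* randn(length(x))  # noisy samples of a quadratic
A = [x.^0 x.^1 x.^2]               # monomial-basis "Vandermonde" matrix (50×3)

c_direct = A \ y                   # least-squares solution via QR factorization

c = zeros(3)                       # gradient descent on L(c) = ‖Ac - y‖²
s = 0.4 / opnorm(A)^2              # step size below the 1/‖AᵀA‖ convergence limit
for _ in 1:10_000
    global c -= s * (2 * A' * (A*c - y))   # step Δc = -s∇L
end
norm(c - c_direct)                 # gradient descent slowly approaches the direct solution
```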
In both cases, compared the effect of the monomial basis vs. the Chebyshev-polynomial basis. Because the former leads to an ill-conditioned A, it hurts both the accuracy of the direct solution and the convergence rate of gradient descent, whereas the Chebyshev basis behaves much better.
Further reading: FNC book chapter 3. Strang's Introduction to Linear Algebra section 4.3 and 18.06 video lecture 16. There are many, many books and other materials on linear least-squares fitting, from many different perspectives (e.g. numerical linear algebra, statistics, machine learning…) that you can find online. The FastChebInterp.jl package in Julia does least-square fitting ("regression") with Chebyshev polynomials for you in an optimized way, including multi-dimensional fitting.
First, discussed the computational cost of interpolation and fitting.
- Solving the least-squares problem $\min_x \Vert Ax - b\Vert$ for an $m \times n$ matrix $A$ ($m$ points and $n$ basis functions) has $O(mn^2)$ cost, whether you use the normal equations $A^T A x = A^T b$ (which squares the condition number) or a more accurate method like QR factorization.
- Interpolation can be thought of as the special case $m=n$: solving an $n \times n$ linear system $Ax = b$ is $O(n^3)$ (usually done by LU factorization, which is the matrix-factor form of Gaussian elimination). However, for specific cases there are more efficient algorithms: for polynomial interpolation from arbitrary points, the barycentric Lagrange formula (mentioned above) has $O(n^2)$ cost, and using the Chebyshev-polynomial basis from Chebyshev points has $O(n \log n)$ cost (via FFTs).
The computational scaling becomes even more important when you go to higher dimensions: multivariate interpolation and fitting. We discussed a few cases, using two dimensions (2d) for simplicity:
- If you have a general basis $b_k(x,y)$ of 2d basis functions, you can form a Vandermonde-like matrix $A$ as above (whose columns are the basis functions and whose rows are the grid points), and solve it as before. The matrices get much larger in higher dimensions, though!
- If you can choose a Cartesian "grid" of points, also known as a "tensor product grid" (a grid in x by a grid in y), then it is convenient to use separable basis functions $p_i(x) p_j(y)$, e.g. products of polynomials. While you can still form a Vandermonde-like system as above, it turns out that there are much more efficient algorithms in this case. (The ideal case is a tensor product of Chebyshev points along each dimension, in which case you can interpolate products of Chebyshev polynomials in $O(n \log n)$ cost.)
- If you have a tensor product grid, you can treat it as a collection of rectangles, and do bilinear interpolation on each rectangle — this is the obvious generalization of piecewise-linear interpolation in 1d. It is fast and second-order accurate (and there are higher-order versions like bicubic interpolation too).
- If you have an irregular set of points, there is still an analogue of piecewise-linear interpolation. One first connects the set of points into triangles in 2d (or tetrahedra in 3d, or more generally "simplices"); this is a tricky problem of computational geometry, but a standard and robust solution is called Delaunay triangulation. Once this is done, you can interpolate within each triangle (or simplex) using an affine function (linear + constant, $a^T x + \beta$).
Further reading: FNC book section 2.5 on efficiency of solving Ax=b; much more detail on both this and the least-squares case can be found in e.g. Trefethen & Bau's Numerical Linear Algebra and many other sources. Links to the barycentric formula can be found above, along with fast algorithms for Chebyshev interpolation. Tensor-product-grid interpolation and fitting products are also closely related to Kronecker products of matrices, and there are often more efficient algorithms than simply forming the giant "Vandermonde" matrix and solving it.
- Radial basis function (RBF) interpolation: slides
- Quadrature: Notes on error analysis of the trapezoidal rule and Clenshaw-Curtis quadrature in terms of Fourier cosine series, and a quick review of cosine series.
To begin with, briefly touched on another popular method for multidimensional interpolation: radial basis functions (RBFs). This can be understood in the same computational framework as polynomial interpolation: we form a "Vandermonde" matrix equation Ac=b, where the rows are data points and the columns are basis functions, to solve for the coefficients c. In this case, the basis functions are radially symmetric "bumps" φ(‖x - xₖ‖) centered on the data points xₖ (e.g. Gaussians).
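A minimal sketch of this idea (my own illustration with Gaussian basis functions; the specific RBF and parameters in the slides may differ):

```julia
using LinearAlgebra

xk = range(-1, 1; length=15)                 # data points, which also serve as RBF centers
yk = [1 / (1 + 25x^2) for x in xk]           # sample the Runge function as test data
σ = 0.3                                      # (arbitrary) width of the Gaussian bumps
φ(r) = exp(-(r/σ)^2)

A = [φ(abs(xi - xj)) for xi in xk, xj in xk] # "Vandermonde"-like matrix A[i,j] = φ(|xᵢ - xⱼ|)
c = A \ yk                                   # solve for the RBF coefficients

interp(x) = sum(c[j] * φ(abs(x - xk[j])) for j in eachindex(xk))
interp(0.3), 1 / (1 + 25 * 0.3^2)            # interpolant vs. exact value at a test point
```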
Launched into a new topic: Numerical integration ("quadrature"). In general, a "quadrature rule" is a scheme for estimating a definite integral ∫f(x)dx by a sum ∑ₖf(xₖ)wₖ of f(x) evaluated at N (or N+1) quadrature points xₖ with quadrature weights wₖ. The goal is to choose these quadrature points and weights so that the estimate converges to the true integral as quickly as possible with N. (The typical assumption is that evaluating f(x) is the dominant computational cost, so we want to minimize function evaluations.) Quadrature is closely related to interpolation: most quadrature schemes (especially in 1d) proceed by picking points, interpolating the function between the points somehow (usually by polynomials), and then integrating the interpolant (which is easy with polynomials).
Began by analyzing the two simplest schemes with equally spaced points: piecewise-linear interpolation leads to a composite trapezoidal rule with error decreasing as O(1/N²) for smooth integrands, and piecewise-constant interpolation leads to a "rectangle rule" whose error generally decreases only as O(1/N).
These error estimates are upper bounds, but some functions do much better! For periodic functions, the trapezoidal and rectangle rules are equivalent, and for smooth ("analytic") periodic functions the error converges exponentially fast with N.
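A minimal sketch (not from the course notes) of the composite trapezoidal rule, showing the generic O(1/N²) convergence on a smooth non-periodic integrand and the much faster convergence on a smooth periodic one:

```julia
function trapezoid(f, a, b, N)
    h = (b - a) / N
    x = range(a, b; length=N+1)
    return h * (sum(f, x) - (f(a) + f(b))/2)   # interior points weight h, endpoints h/2
end

# smooth but non-periodic: error shrinks like O(1/N²)
for N in (10, 20, 40)
    println(N, "  ", abs(trapezoid(exp, 0, 1, N) - (exp(1) - 1)))
end

# smooth and periodic over a full period: error shrinks exponentially fast with N
g(x) = 1 / (2 + cos(x))
exact = 2π / sqrt(3)                            # ∫₀²π dx / (2 + cos x)
for N in (4, 8, 16)
    println(N, "  ", abs(trapezoid(g, 0, 2π, N) - exact))
end
```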
Further reading (RBFs): A very readable overview of radial basis functions can be found in this blog post by Shih-Chin. Much deeper coverage can be found in the book Meshfree Approximation Methods With Matlab (Fasshauer, 2007) or in this review by Buhmann (2000).
Further reading (quadrature): FNC section 5.6. Lloyd N. Trefethen, "Is Gauss quadrature better than Clenshaw-Curtis?," SIAM Review 50 (1), 67-87 (2008). Trefethen's Six Myths of Polynomial Interpolation and Quadrature (2011) is a shorter and more colloquial description of some of these ideas (and more!). A related polynomial interpolation method (in some sense a generalization of quadrature by Chebyshev polynomials/points) is Gaussian quadrature (and its many variants), whose accuracy is analyzed in the Trefethen papers above, and the state of the art for computing which is Hale and Townsend (2012); a suboptimal but beautifully simple algorithm was described by Golub & Welsch (1969).
- pset 2 solutions
- pset 3: due Friday, Feb 28.
Continued analysis from Lecture 8 (see notes). We related the convergence rate of the trapezoidal rule to the convergence rate of the Fourier cosine series, and showed (using integration by parts) that the convergence rate of the cosine series is determined by the behavior of the odd-order derivatives at the boundaries (assuming that the function is smooth in the interior). This reproduces the generic O(1/N²) convergence of the trapezoidal rule, as well as the much faster convergence for smooth periodic functions, where the boundary terms cancel.
Moreover, showed how we can arrange for this fast convergence to occur all the time: for an arbitrary smooth function f(x) on a finite interval, the change of variables x = cos θ produces a smooth function of θ that is automatically even and periodic, so its cosine series (and hence the corresponding trapezoidal rule in θ) converges rapidly. This is the idea behind Clenshaw–Curtis quadrature.
Finally, if we change variables back to x = cos θ, the equally spaced points in θ become Chebyshev points in x, and the cosine series becomes a series of Chebyshev polynomials: Clenshaw–Curtis quadrature is equivalent to integrating a Chebyshev-polynomial interpolant of f(x).
Further reading: See further reading for lecture 8, and for lecture 4 on Chebyshev polynomials.
- quadrature overview slides: now that we've analyzed trapezoidal rule and Clenshaw–Curtis, let's zoom out to survey the bigger picture
Overview of the big picture of quadrature algorithms: Clenshaw–Curtis is not the end!
Discussed Richardson extrapolation, which is a powerful general technique to compute limits numerically: given values $y(h_1), y(h_2), \ldots$ computed at a sequence of small but nonzero parameters $h_i$, it estimates the limit $y(0^+) = \lim_{h \to 0^+} y(h)$ by extrapolating polynomial fits of the $(h_i, y(h_i))$ data to $h = 0$. Key features of a good implementation:
- It computes many extrapolations, formed by degree-$q$ polynomial interpolations of all consecutive subsequences of $q+1$ $(h_i, y(h_i))$ points, for all possible degrees $q$.
- All of these extrapolations are computed efficiently at the same time (in $O(n^2)$ cost for $n$ points) by a linear recurrence relation where each degree-$q$ extrapolation is computed from two degree-$(q-1)$ extrapolations (forming a table called a "Neville tableau" or "Neville–Aitken tableau": it is a version of Neville's algorithm for polynomial interpolation).
- Each degree-$q > 0$ extrapolant comes with an error estimate, given by its difference from the degree-$(q-1)$ estimate ending at the same $h_i$ point. This allows Richardson's method to be robust and adaptive: by using the extrapolant with the smallest error estimate for the final $y(0^+)$ result, it automatically selects a good subsequence of $h_i$ values (neither so large that polynomial fitting doesn't work, nor so small that e.g. roundoff/cancellation errors dominate).
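Here is a minimal sketch of such a Neville-style Richardson table (my own simplified implementation, not the Richardson.jl algorithm), applied to extrapolating a forward-difference derivative to h → 0⁺:

```julia
# Extrapolate y(0⁺) from samples y(hᵢ) at a decreasing sequence h₁ > h₂ > …
function richardson(y, hs)
    n = length(hs)
    T = [y(h) for h in hs]                 # degree-0 "extrapolants": the raw samples
    best, besterr = T[end], Inf
    for q in 1:n-1                          # build degree-q extrapolants in place
        for i in n:-1:q+1
            Tnew = T[i] + (T[i] - T[i-1]) * hs[i] / (hs[i-q] - hs[i])  # Neville update
            err = abs(Tnew - T[i])          # error estimate: change from degree q-1
            if err < besterr
                best, besterr = Tnew, err
            end
            T[i] = Tnew
        end
    end
    return best, besterr
end

f(x) = sin(x)
yh(h) = (f(1 + h) - f(1)) / h              # crude O(h) forward-difference derivative
deriv, errest = richardson(yh, [0.1 / 2^k for k in 0:7])
(deriv - cos(1), errest)                    # extrapolated error is far below the raw differences
```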
Famous applications of Richardson extrapolation include Romberg integration (extrapolating low-order quadrature formulas), the Bulirsch–Stoer ODE method (extrapolating ODE solutions to zero stepsize), and an algorithm by Ridders (1982) for extrapolating finite-difference derivatives (also reviewed in Numerical Recipes sec. 5.7).
Further reading: Unfortunately, almost all descriptions of Richardson extrapolation combine it with a particular application, e.g. many textbooks only describe it in the context of ODEs, or integration, or differentiation. These course notes by Flaherty are in the ODE context, but are written in a general enough way that you can see the applicability to other problems, and they discuss adaptive error estimation briefly at the end. The Richardson.jl package in Julia implements the algorithm in a very general setting, and its documentation includes a number of examples; it's used by the Romberg.jl package for Romberg integration and by the FiniteDifferences.jl package for extrapolating finite-difference approximations. In Python, I'm not currently aware of any general-purpose implementation (though there are implementations in the context of e.g. derivatives or integration).
- pset 3 solutions
- pset 4: due Friday, March 7
Root finding is the problem of solving $f(x) = 0$: finding a root $x = r$ where the function $f$ vanishes, typically by some iterative procedure that improves an initial guess.
The most famous such algorithm is Newton's method, which you probably learned in first-year calculus. The key idea is to use the derivative $f'(x)$ to form a linear (first-order Taylor) approximation of $f$ around the current guess, and then take the root of that linear approximation as the next guess: $x_{k+1} = x_k - f(x_k)/f'(x_k)$.
In calculus, it was enough to set up this algorithm, try it out, and see that it works pretty well (if you have a reasonable starting point). In numerical analysis, we want to be more precise: how fast does it converge, once $x_k$ is close to the root? It turns out that Newton's method converges quadratically: asymptotically, the number of correct digits doubles on every iteration.
Gave a numerical demo in which we applied Newton's method to find square roots $\sqrt{a}$, i.e. roots of $f(x) = x^2 - a$, and observed this doubling of the number of correct digits.
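A minimal sketch of this kind of iteration (illustrative only; the in-class notebook may differ in details):

```julia
# Newton's method for f(x) = x² - a, i.e. computing √a: x ← x - f(x)/f'(x).
function newton_sqrt(a; x = float(a), tol = 1e-12, maxiter = 100)
    for _ in 1:maxiter
        xnew = x - (x^2 - a) / (2x)           # Newton step
        println(xnew)                          # watch the correct digits roughly double
        abs(xnew - x) ≤ tol * abs(xnew) && return xnew
        x = xnew
    end
    return x
end

newton_sqrt(2) ≈ sqrt(2)   # true
```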
Another important extension of Newton's method is to systems of $n$ nonlinear equations in $n$ unknowns, $f(x) = 0$ with $f: \mathbb{R}^n \to \mathbb{R}^n$: the derivative is then the Jacobian matrix $J$, and each Newton step solves the linear system $J(x_k)\,\delta = -f(x_k)$ for the update $x_{k+1} = x_k + \delta$.
Newton's method works well if you start with a guess that is "reasonably close" to the root. For starting points that are far from the root, however, it can be wildly unpredictable in fascinating ways, leading to Newton fractals.
Further reading: FNC book, chapter 4, especially section 4.3 on Newton's method. A key class of methods that we didn't cover involve what to do when you don't have access to the derivative (or in the case of large nonlinear systems the whole Jacobian matrix might be too expensive to compute or even store). In this case, there are many algorithms:
- Fixed-point iteration attempts to solve $g(x)=x$ (where $g(x) = f(x) + x$) by iterating $x_{k+1} = g(x_k)$. This may not converge at all unless $|g'(r)| < 1$, but it is the starting point for the much more robust and fast-converging Anderson-accelerated fixed-point iteration, which itself is closely related to "quasi-Newton" methods.
- In 1d, secant methods are Newton-like methods that perform linearization (or higher-order approximation) by interpolating two (or more) points. Generalization to multiple dimensions yields quasi-Newton methods such as Broyden's method and Anderson acceleration.
- Newton–Krylov methods only require you to compute Jacobian–vector products (jvp's) $f'(x_k)\delta$ for given $x_k, \delta$, which corresponds to a single directional derivative — this is much cheaper than computing the entire Jacobian in high dimensions, at the price of using an iterative linear solver to find each Newton step.

Another important class of methods are numerical continuation methods, which solve a family of root problems $f(x,\lambda) = 0$ for the roots $x(\lambda)$ at each value of the parameter $\lambda$. Given a root at one $\lambda$, the simplest algorithm is to use this as the starting point for Newton's method to find the root at a nearby $\lambda$, but there are much more sophisticated methods such as pseudo-arclength continuation. A modern implementation of numerical continuation can be found in, for example, the BifurcationKit.jl package in Julia. There are many books and reviews on this topic, e.g. Herbert B. Keller, Lectures on Numerical Methods in Bifurcation Problems (Springer, 1988).
- lecture notes/code: [One Note](https://mitprod-my.sharepoint.com/personal/qiqi_mit_edu/_layouts/15/Doc.aspx?sourcedoc={24321bd2-8c69-451a-b0cc-c2b42aed5743}&action=view&wd=target%28B.%20Initial%20Value%20Problems%2F20250303.one%7C6663d749-3747-45de-bb34-f8393732d800%2F%29&wdorigin=717)
Numerical methods for ordinary differential equations (ODEs). Introduced the concept of an initial value problem: given du/dt = f(u, t) and an initial condition u(0), compute an approximate solution u(t) at later times, typically by marching through a sequence of discrete timesteps Δt.
Looked at an example problem, du/dt = -u with u(0) = 1 (exact solution exp(-t)), and compared simple finite-difference discretizations, including the first-order forward-Euler scheme and a second-order centered-difference ("midpoint") scheme.
Moreover, we used the "stencil" algorithm from pset 1 to derive a 3rd-order finite difference approximation, and found that the behaviour was even worse than the midpoint rule. Even for a fixed time t, the numerical solution blew up as Δt was made smaller and smaller, a surprising "instability" that we will explain in the next lecture.
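A minimal sketch of forward-Euler time-stepping for the model problem du/dt = -u, u(0) = 1 (my own illustration; details of the in-class demo may differ), showing the first-order convergence of the error at a fixed time t = 1:

```julia
# uₖ₊₁ = uₖ + Δt · f(uₖ, tₖ)  (forward Euler), integrated for nsteps steps
function euler(f, u0, Δt, nsteps)
    u, t = u0, 0.0
    for _ in 1:nsteps
        u += Δt * f(u, t)
        t += Δt
    end
    return u
end

for n in (10, 20, 40)
    err = abs(euler((u, t) -> -u, 1.0, 1/n, n) - exp(-1))
    println("Δt = $(1/n):  error ≈ $err")      # error shrinks roughly ∝ Δt
end
```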
Further reading: FNC book, chapter 6, sections 6.1–6.2. FENA book, chapter 4 and section 4.1. You can also find hundreds of other web pages and videos on these topics. 3Blue1Brown has an entertaining introduction to the idea of a differential equation. And here is a nice video about the history of numerical ODE solvers that talks about the pioneering contributions of Katherine Johnson and her portrayal in the 2016 film Hidden Figures.
- notes and code: [One Note](https://mitprod-my.sharepoint.com/personal/qiqi_mit_edu/_layouts/15/Doc.aspx?sourcedoc={24321bd2-8c69-451a-b0cc-c2b42aed5743}&action=view&wd=target%28B.%20Initial%20Value%20Problems%2F20250305.one%7Cd1ccb4e9-d0d8-414c-98cd-f1392b3d5277%2F%29&wdorigin=717) [Code](https://colab.research.google.com/drive/1fmyyPSp5gImrcj42wxxn3kBqTDavq5qJ?usp=sharing)
Stability of numerical ODE methods. The term "stability" has special meanings in this context, different from other areas of numerical analysis where "stability" is mostly about the sensitivity of an algorithm to roundoff errors. For ODE methods, "stability" refers to whether the effects of (truncation) errors can be amplified without bound as you take more and more timesteps. We distinguished several related notions:
- Consistency: Do the local truncation errors of a single time step $u_{n+1} = \cdots$ go to zero as $O(\Delta t^{1+p})$ for $p > 0$? That is, are we correctly approximating $\frac{du}{dt} - f(u,t)$ to some positive order $O(\Delta t^p)$?
- Zero stability: With a right-hand side $f=0$, the method is zero-unstable if a nonzero initial condition can diverge ($\Vert u_k \Vert \to \infty$) as the step $k \to \infty$ (with $\Delta t$ cancelling from the analysis for $f=0$, so this can be viewed as a divergence as $\Delta t \to 0$ for a fixed $t$).
- Linear stability: Later, we will include the right-hand side in the stability analysis by linearizing it when $u$ is close to a root of $f$, especially for autonomous ODEs $f(u,t) = f(u)$. Through this analysis, we will show that some methods diverge for a fixed $\Delta t$ as you increase the total time $t$, perhaps unless $\Delta t$ is sufficiently small (conditional stability).
An important result is the Dahlquist equivalence theorem (closely related to the Lax (or Lax–Richtmyer) equivalence theorem):
- If a method is both consistent and zero-stable, then it is ("globally") convergent: the approximate solution $\tilde{u}(t)$ approaches the exact solution $u(t)$ at any time $t$, as $\Delta t \to 0$, with a convergence rate $\Vert\tilde{u}(t) - u(t)\Vert = O(\Delta t^p)$ matching the local truncation error.
That is, there are only two ways an ODE method could go badly wrong: either you have a bug in your finite-difference approximation (it's inconsistent with the equation you are discretizing, i.e. not approximating the right equation), or the solution diverges (it's unstable) as you take more and more timesteps ($\Delta t \to 0$ for a fixed time $t$).
All three of the schemes (Euler, midpoint, and third-order) from last lecture were consistent (with $p = 1, 2, 3$, respectively), so any difference in their behavior must come from stability.
The trick to analyze zero stability is to set $f = 0$ and write the resulting recurrence for $u_{k+1}$ (which may involve several previous steps) as repeated multiplication by a fixed matrix; whether the iterates can blow up is then governed by the eigenvalues of that matrix.
Showed by this analysis that Euler is zero-stable (its $f=0$ recurrence is simply $u_{k+1} = u_k$), as is the midpoint rule (eigenvalues $\pm 1$), but the third-order scheme from last lecture is zero-unstable: its recurrence matrix has an eigenvalue with magnitude greater than 1, which explains the divergence we observed.
Further reading: FNC book, section 6.8: zero stability (which used an alternative framing equivalent to eigenvalues, but without explicitly forming the matrix). FENA book, chapter 4.2 (which used the eigenvalue formulation). For the general problem of analyzing matrix powers and linear recurrences via eigenvalues, you may want to review some material from 18.06: Strang Intro. to Linear Algebra section 6.2, and 18.06 OCW lecture 22 (diagonalization and matrix powers).
- notes and code: One Note Code
- pset 4 solutions
- pset 5: due Friday March 14
Linear stability analysis of ODEs and discretization schemes.
- The exact ODE $\frac{du}{dt} = \lambda u$ has exponentially growing solutions for $\Re \lambda > 0$, and non-growing ("stable") solutions for $\Re \lambda \le 0$. This analysis extends to linear autonomous ODEs $\frac{du}{dt} = A u$ where $A$ is a (diagonalizable) matrix, since we can just check each eigenvalue $\lambda$ of $A$. (Later, we will also extend this analysis to nonlinear autonomous ODEs $\frac{du}{dt} = f(u)$ by approximately linearizing $f(u)$ around a root using the Jacobian of $f$.)
- When we discretize the ODE, we can plug $\lambda u$ (or $Au$) in for the right-hand side, for a fixed $\Delta t$, and again use eigenvalues to analyze whether $u_k \approx u(k\Delta t)$ is growing or decaying with $k$. (This reduces to "zero stability" analysis in the limit $\Delta t \to 0$, for which the right-hand side disappears from the formula for $u_k$.)
In this way, we find that certain discretization schemes are linearly stable (non-growing $u_k$) only for certain values of $\lambda \Delta t$: each scheme has a "stability region" in the complex $\lambda \Delta t$ plane.
To begin with, we analyzed the forward Euler ("explicit") scheme $u_{k+1} = u_k + \Delta t \, \lambda u_k = (1 + \lambda \Delta t) u_k$, which gives $u_k = (1 + \lambda \Delta t)^k u_0$: it is linearly stable (non-growing) if and only if $|1 + \lambda \Delta t| \le 1$.
- For $\frac{du}{dt} = -u$, $\lambda = -1$, so it is stable for $0 \le \Delta t \le 2$. (This is why it performed so well numerically.) (The fact that it is only stable for certain values of $\Delta t$ in this case is called conditional stability; note that this depends on the right-hand side of the ODE!)
- For $\frac{d^2 u}{dt^2} = -u$, we saw last time that this is equivalent to the $2 \times 2$ system $\frac{d}{dt} \begin{pmatrix} u \\ v \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}}_{A} \begin{pmatrix} u \\ v \end{pmatrix}$, where $A$ has eigenvalues $\lambda = \pm i$. In this case $|1 + \lambda \Delta t| > 1$ for all $\Delta t > 0$, and forward Euler is linearly unstable. It is still zero-stable, so it still converges as $\Delta t \to 0$ for any fixed time $t$! But instead of oscillating solutions (for the exact ODE), we have solutions that are oscillating and slowly exponentially growing (more and more slowly as $\Delta t$ gets smaller, which allows it to converge). Even though the solutions converge, people are often unhappy with ODE methods that generate exponential growth (not present in the exact ODE) as you run for longer and longer times!
Next, we analyzed the backward Euler ("implicit") scheme $u_{k+1} = u_k + \Delta t \, \lambda u_{k+1}$, i.e. $u_{k+1} = u_k / (1 - \lambda \Delta t)$: it is stable whenever $|1 - \lambda \Delta t| \ge 1$, which includes every $\Delta t > 0$ when $\Re \lambda \le 0$ (unconditional stability).
- This is why people use implicit ODE schemes: they are often more complicated to implement, because $f(u_{k+1})$ appears on the right-hand-side, requiring you to solve for $u_{k+1}$ (which gets expensive as $f$ gets more complicated), but they tend to be more stable than explicit schemes.
Further reading: FENA book, section 4.3.
- notes and code: see links above
Analyzed the linear stability of the midpoint rule, and found that it was only stable for a limited range of purely imaginary $\lambda \Delta t$ (marginally stable for $|\lambda \Delta t| < 1$ on the imaginary axis), and unstable whenever $\lambda$ has a nonzero real part.
To illustrate the utility of "implicit" schemes like backwards Euler, which are more difficult to implement and require a more expensive time-stepping procedure (solving a system of equations on each step, possibly even a nonlinear system), considered a simple example problem involving heat flow. In a system where heat can flow quickly between some components (e.g. metals in contact) but slowly between other components (e.g. between metals and air), one obtains a "stiff" ODE, characterized by a large ratio of timescales (eigenvalues).
With forward Euler (or other explicit methods) in a stiff problem, a small $\Delta t$ is required for stability, set by the fastest timescale (the largest-magnitude eigenvalue) even if that fast transient is not what you care about, so you are forced to take a huge number of timesteps. An implicit scheme like backward Euler, which is stable for any $\Delta t$ (when $\Re\lambda \le 0$), can instead take timesteps chosen only for the accuracy of the slow dynamics of interest.
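A tiny illustration of this (my own sketch, with an arbitrary λ = -1000) comparing the forward- and backward-Euler amplification factors on a fast-decaying mode:

```julia
# du/dt = λu with λ = -1000 (a "fast" decaying mode), stepped with Δt = 0.01,
# which is far above forward Euler's stability limit Δt ≤ 2/|λ| = 0.002.
function compare_euler(λ, Δt, nsteps)
    u_fwd = u_bwd = 1.0
    for _ in 1:nsteps
        u_fwd *= 1 + λ*Δt          # forward Euler factor: |1 + λΔt| = 9 > 1, blows up
        u_bwd /= 1 - λ*Δt          # backward Euler factor: 1/11 < 1, decays as it should
    end
    return u_fwd, u_bwd
end

compare_euler(-1000.0, 0.01, 100)  # forward Euler ≈ 1e95 (exploded), backward Euler ≈ 1e-104
```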
Further reading: The FENA book, section 4.10, has some further discussion of stiff systems. There are many sources online about methods for stiff equations and implicit ODE methods. See e.g. these course notes from MIT 18.337/6.338, which focus mainly on ways to construct the Jacobian (to linearize the right-hand-side) and/or efficiently perform the implicit solves on each step for the nonlinear case. The book by Griffiths and Higham (2010), along with many other similar books on numerical ODEs, contains a wealth of information, at a much more formal level than this course.
- notes and code: see links above
More numerical ODE schemes:
- trapezoidal rule (implicit) — analyzed order of accuracy (= 2) and stability (Re λ ≤ 0). For large Δt, it produces undesirable oscillations if λ is real and < 0, but for purely imaginary λ (oscillating ODEs) it has the nice property of "conserving energy" (oscillating solutions with no artificial numerical growth or dissipation).
- BDF (backward difference) rule (implicit) — analyzed an order-2 scheme, which has the same stability as trapezoidal rule, but does not introduce oscillations for real λ (while it introduces artificial dissipation for imaginary λ).
- Multi-step methods and Runge–Kutta schemes — began talking about schemes that operate in multiple "stages": first they "estimate" the future u at an intermediate timestep (e.g. k+½) to plug into the right-hand-side f(u,t), and then exploit this to get an improved value for u at timestep k+1. Gave an example of a second-order explicit scheme using this strategy.
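A minimal sketch (my own illustration, possibly not the exact scheme shown in class) of a two-stage, second-order explicit Runge–Kutta method of this type, applied to du/dt = -u:

```julia
# Stage 1 estimates the slope at tₖ; stage 2 re-evaluates f at an estimated midpoint
# (tₖ + Δt/2) and uses that slope for the full step, giving O(Δt²) accuracy.
function rk2(f, u0, Δt, nsteps)
    u, t = u0, 0.0
    for _ in 1:nsteps
        k1 = f(u, t)
        u += Δt * f(u + Δt/2 * k1, t + Δt/2)
        t += Δt
    end
    return u
end

for n in (10, 20, 40)
    err = abs(rk2((u, t) -> -u, 1.0, 1/n, n) - exp(-1))
    println("Δt = $(1/n):  error ≈ $err")   # error shrinks roughly ∝ Δt², unlike Euler's ∝ Δt
end
```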
Further reading: FNC books, sections 6.4 and 6.6. FENA book section 4.6 (trapezoidal) and 4.8–4.9 (Runge–Kutta and multistep methods).
- pset 5 solutions: coming soon
- pset 6: due 11am on Friday 3/21
The big picture of numerical linear algebra: amateurs think mostly about element-by-element algorithms oriented towards hand calculations, whereas practitioners think mostly in terms of factorizations (which express the results of these algorithms as a product of simpler matrices, which allow us to re-use and reason about them):
- Instead of Gaussian elimination (or Gauss–Jordan / matrix inversion), we think about LU factorization $PA = LU$: $U$ is upper triangular (the result of Gaussian elimination), $L$ is lower triangular (a record of the elimination steps), and $P$ is a permutation (a re-ordering of the rows, which happens more frequently on computers than by hand).
- Instead of Gram–Schmidt orthogonalization, we think about QR factorization $A = QR$: $Q$ has orthonormal columns and $R$ is upper triangular. (In practice, computers typically use a Householder algorithm rather than Gram–Schmidt).
- Instead of finding eigenvalues by looking for roots of characteristic polynomials, we think about diagonalization $A = X \Lambda X^{-1}$: the columns of $X$ are eigenvectors, and $\Lambda$ is a diagonal matrix of eigenvalues. The most important case is real-symmetric matrices $A = A^T$ (or complex Hermitian matrices), where $X=Q$ (orthonormal eigenvectors, different from the result of QR!) and $A = Q \Lambda Q^T$. Practical algorithms for diagonalization are very different from what you do by hand; the most important is the QR algorithm discovered around 1960.
- Other important factorizations include the singular value decomposition (SVD) $A = U \Sigma V^T$ (which is hugely important in practice, but unfortunately gets tacked on at the end of many introductory linear-algebra courses), the Schur factorization $A = QTQ^T$ (generalizing diagonalization to an upper triangular $T$ whose diagonal entries are the eigenvalues), Cholesky factorization $A = LL^T$ (equivalent to LU factorization in the special case of symmetric positive-definite $A$, but twice as fast), Hessenberg factorization, and others.
For this lecture, we focused on LU factorization, and in particular a few key points:
- Why is LU factorization equivalent to Gaussian elimination? By a simple 3x3 example, reviewed how Gaussian elimination steps produce an upper-triangular matrix $U$ from $A$, but working backwards yields $A = LU$ where $L$ is lower triangular (with 1s on the diagonal) and is obtained "for free": it is simply a record of the elimination steps. It is easy to see that the cost of elimination (hence LU factorization) scales as $O(m^3)$ for an $m \times m$ matrix.
  - More generally, the factorization is $PA = LU$, where $P$ represents row swaps. Trying the same 3x3 example on a computer, we found that the computer performed row swaps ($P \ne I$) even though it didn't "have" to (no zero pivots were encountered). It turns out that the computer swaps rows to make the pivots as big as possible (for each column, it looks for the entry with maximum magnitude), an algorithm called partial pivoting, and this is essential to reduce sensitivity to roundoff errors in cases where the matrix entries have very different magnitudes.
- How do we use an LU factorization? If you are solving $Ax = b$ and have $A = LU$, then you can equivalently solve $L(Ux) = b$ by (1) let $Ux=c$ and solve $Lc = b$ for $c$ and (2) solve $Ux=c$ for $x$. These two "triangular solves" are easily done by forward and backsubstitution. Moreover, solving $Lc=b$ by forward substitution is exactly equivalent to performing the Gaussian elimination steps (from $A$) on $b$, which is typically done in hand calculations by "augmenting" the matrix $A$ with an extra column $b$. The cost of each of these steps is $O(m^2)$, very similar to a matrix–vector multiplication. (So, solving new right-hand sides is cheap once you have the LU factors.)
Another important point is that having LU factors is as good as — or even better than — having the matrix inverse.
- Computing a matrix inverse is more than twice as costly as getting the LU factors. To find $A^{-1}$, you essentially solve $AX = I$ for $X = A^{-1}$: $m$ right-hand-sides given by the columns of $I$. This involves first finding the LU factorization of $A$, with $O(m^3)$ cost, and then solving $m$ right-hand sides, with an additional $m \times O(m^2) = O(m^3)$ cost for the triangular solves. (This process is equivalent to the Gauss–Jordan algorithm you may have learned by hand.) So, like LU, it is $O(m^3)$, but the "constant factor" is usually at least twice as bad.
- If the matrix is sparse (mostly zero), then the LU factors are often sparse as well, and can be computed and used very efficiently by skipping the zeros. But the matrix inverse is almost never sparse, so you lose that advantage. Gave the example of a tridiagonal matrix, for which the LU factors are actually bidiagonal (without row swaps) and can be computed and used in $O(m)$ time … but the inverse is all nonzero, and has $O(m^2)$ cost.
- A good (but not universal) rule of thumb is: never compute matrix inverses. When you see $x = A^{-1} b$, read it as "solve Ax=b in the best way you can". For example, if $A=LU$, we could formally write $A^{-1} b = U^{-1} L^{-1} b$, but you wouldn't compute the inverses of these triangular matrices — you would compute $c = L^{-1} b$ by forward-substitution on $Lc=b$, then compute $x = U^{-1} c$ by back-substitution on $Ux = c$. If you need to repeatedly apply $A^{-1}$ to many vectors, just LU-factorize $A$ and re-use the LU factors.
LU factorizations in Julia can be computed with F = lu(A), which returns a "factorization" object F that contains the triangular factors (F.L and F.U) and the permutation (F.p); solving Ax=b for a new right-hand side is then just F \ b (which does the permutation and 2 triangular solves). By default, it uses partial pivoting: always permuting rows to maximize the magnitude of the pivot. To compare to hand calculations (where you only do row swaps to avoid zero pivots), you can do F = lu(A, RowNonZero()), but never do this for "serious" work because it is "numerically unstable" and can cause roundoff errors to blow up. If you do x = A \ b in Julia, it will "solve Ax=b in the best way it can", by default using LU factorization.
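For example (a short self-contained snippet, not from the course notebooks), factorizing once and re-using the factors for several right-hand sides:

```julia
using LinearAlgebra

A = [2.0 1.0 1.0;
     4.0 3.0 3.0;
     8.0 7.0 9.0]
F = lu(A)                      # PA = LU with partial pivoting; inspect F.L, F.U, F.p
b1 = [1.0, 2.0, 3.0]
b2 = [0.0, 1.0, 0.0]
x1 = F \ b1                    # each solve re-uses the factors: two O(m²) triangular solves
x2 = F \ b2
norm(A*x1 - b1), norm(A*x2 - b2)   # both ≈ 0 (up to roundoff)
```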
In Python, the analogue of F = lu(A) is scipy.linalg.lu (or scipy.linalg.lu_factor, whose output is designed for re-use in solves), the analogue of F \ b is scipy.linalg.lu_solve (applied to the output of lu_factor), and the analogue of A \ b is numpy.linalg.solve.
Further reading: FNC book, section 2.4 on LU factorization (and section 3.3 on QR, and chapter 7). 18.065 OCW lecture 2, Strang 18.06 lecture 4 on LU factorization. Matrix factorizations are discussed in every linear-algebra textbook at various levels of sophistication and practicality.
A brief survey of other topics in ODE methods:
- Runge–Kutta schemes: the general structure is a "tableau" of coefficients where you make a sequence of estimates $u((k+a)\Delta t)$ for coefficients $a \in (0,1]$, and then make a linear combination of these estimates to estimate $u((k+1)\Delta t)$. Deriving this tableau is somewhat of an art form, and typically involves careful choices of simplifying assumptions. People continue to discover new Runge–Kutta schemes! For example, a state-of-the-art 5th-order scheme was found relatively recently by Tsitouras (2011).
  - Adaptive schemes: similar to adaptive quadrature methods, typically high-order Runge–Kutta methods are "nested": using a subset of the same function values, one gets a lower-order estimate "for free", and comparing it to the high-order estimate gives an error estimate. In this way, they can adaptively adjust Δt until the error estimate obeys a prescribed tolerance.
- Boundary-value problems: instead of providing a purely initial condition $u(0)$, here one specifies a final condition $u(T)$, or more generally a mix of constraints on $u(0)$ and $u(T)$ (where the total number of constraints equals the dimension of $u$). One approach to solving such problems is to reduce it to root-finding: one tries to solve for the $u(0)$ that satisfies the constraints on $u(T)$, sometimes called a "shooting" method.
  - To apply Newton's method, one then needs the Jacobian of $u(T)$ with respect to the initial conditions $u(0)$, which is a special example of the important problem of differentiating ODE solutions (with respect to parameters of the equations and/or initial conditions). Not only is this useful for boundary-value problems, but it is important for sensitivity analysis and quantifying uncertainties, and has become widely used for optimization of ODE solutions (e.g. to fit experimental data or to improve some other objective). See the links below.
- Differential–algebraic equations (DAEs) — a DAE couples an ODE $\frac{du}{dt} = f(u,v,t)$ with a set of (possibly nonlinear) equations $g(u,v,t)=0$ that have to be solved simultaneously with the ODE to find additional unknowns $v(t)$. Sometimes, you can explicitly solve $g=0$ to find $v$, and eliminate these unknowns, giving just an ODE in $u$, but in other cases this is impractical or inconvenient. DAE algorithms simultaneously evolve $u$ from the ODE and $v$ from the "algebraic" equations. In some ways these are analogous to stiff ODE solvers, because the $v$ equations respond "infinitely quickly" to changes in $u$.
- Integro-differential equations and delay differential equations (DDEs) are like ODEs but $du/dt$ depends explicitly not just on $u(t)$ but also on the solution $u(t')$ at times $t' < t$ in the past — either via a continuous integral or via a discrete sum of terms. They have specialized ODE solver methods, which in principle are similar to ordinary ODE schemes but need to additionally keep track of (and interpolate) the past solutions as needed.
- Stochastic differential equations (SDEs) are like ODEs where the right-hand side includes a random "noise" term. (Even defining precisely what this means requires a new form of calculus: stochastic calculus.) Don't just plug random numbers into the right-hand side of an ODE scheme! There are a variety of specialized SDE methods. Unfortunately, these methods are typically low-order: it is difficult to obtain a high-order "strong" SDE scheme that produces the correct distribution of solutions. But if you only care about the expected value ("average") of the solutions, there are higher-order "weak" SDE methods to accomplish this. See e.g. the SDE solver algorithms page in Julia's sophisticated DifferentialEquations.jl package.
Further reading (ODE methods): There are many books that give a more sophisticated treatment of numerical methods for ODEs and related problems, such as the textbook by Butcher or the books by Hairer on non-stiff (1993) and stiff/DAE (1996) algorithms, but the field continues to evolve with new algorithms. Fortunately, there are now many full-featured ODE solver packages, available in a variety of languages and often free/open-source, that implement sophisticated algorithms for all of the above, and more! A 2017 overview of available ODE software packages gives a useful comparison of what features and algorithms they implement — written by Dr. Chris Rackauckas (then at MIT), who developed the state-of-the-art DifferentialEquations.jl suite in Julia.
Further reading (differentiating ODE solutions): Computing derivatives of ODE solutions (and functions thereof) with respect to parameters or initial conditions was introduced in our IAP class 18.063: Matrix Calculus — see the course notes (chapter 9) and lecture video 5 (forward mode) and lecture video 6 (reverse mode). This has become an increasingly important topic because of the growth of optimization and machine learning, and it becomes critically important for large-scale problems to know the pros and cons of "forward" and "reverse" mode algorithms. A recent review article is Sapienza et al. (2024), and a classic reference on adjoint-method (reverse-mode/backpropagation) differentiation of ODEs (and DAEs) is Cao et al (2003) (pdf). See also the SciMLSensitivity.jl package for sensitivity analysis with Chris Rackauckas's amazing DifferentialEquations.jl software suite for numerical solution of ODEs in Julia, along with his notes from 18.337. There is a nice YouTube lecture on adjoint sensitivity of ODEs, again using a similar notation. A discrete version of this process is adjoint methods for recurrence relations (MIT course notes), in which case one obtains a reverse-order "adjoint" recurrence relation.
Back when we were solving least-square problems (minimizing ‖Ac − y‖ over the coefficients c), we mentioned that in practice the normal equations are usually avoided in favor of a QR factorization of A; in this lecture we looked at what the QR factorization is and how it is computed.
The "thin" QR factorization
- The "meaning" of this factorization is that not only do the columns of
$\hat{Q}$ form an orthonormal basis for column space$C(A)$ (the span of the columns of$A$ ), but the first k columns of$\hat{Q}$ are an orthonormal basis for the first k columns of$A$ . This is the meaning of the triangular structure of$\hat{R}$ .
Understanding the above structure of the QR factorization leads immediately to the Gram–Schmidt algorithm, where one forms the columns of $\hat{Q}$ one at a time: take each successive column of $A$, subtract its projections onto the previously computed columns of $\hat{Q}$, and normalize the result (the projection coefficients and normalizations being exactly the entries of $\hat{R}$). Unfortunately, the classical Gram–Schmidt procedure turns out to be numerically problematic: in floating-point arithmetic, roundoff errors can destroy the orthogonality of the computed columns.
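A minimal sketch of classical Gram–Schmidt (my own illustration; as noted, real libraries use Householder reflections, or at least modified Gram–Schmidt, for better numerical stability):

```julia
using LinearAlgebra

function gram_schmidt_qr(A)            # thin QR of an m×n matrix with full column rank
    m, n = size(A)
    Q = zeros(m, n)
    R = zeros(n, n)
    for j in 1:n
        v = A[:, j]
        for i in 1:j-1
            R[i, j] = Q[:, i]' * A[:, j]   # projection coefficient onto the earlier column qᵢ
            v -= R[i, j] * Q[:, i]         # subtract that component
        end
        R[j, j] = norm(v)
        Q[:, j] = v / R[j, j]              # normalize what's left to get the next column of Q̂
    end
    return Q, R
end

A = rand(5, 3)
Q, R = gram_schmidt_qr(A)
norm(Q*R - A), norm(Q'Q - I)               # both ≈ 0 for this small well-conditioned example
```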
Instead, computers typically apply a different approach, inspired by a different form of the QR factorization. The "full" QR factorization $A = QR$ extends $\hat{Q}$ to a square $m \times m$ orthogonal (unitary) matrix $Q$, whose additional columns are an orthonormal basis for the left null space of $A$, and correspondingly pads $\hat{R}$ with rows of zeros.
Gram–Schmidt works by turning the columns of $A$ into orthonormal columns via a sequence of triangular operations (which build up $R$); the Householder approach does the reverse, applying a sequence of orthogonal "reflection" operations to reduce $A$ to the triangular $R$, with the product of the reflections giving $Q$. This also illustrates a more general lesson about representing matrices implicitly:
- Just because we write a matrix $M$ in linear algebra doesn't mean that we need to store it as a matrix, i.e. store a 2d array of its entries explicitly. We often just need a way to compute matrix-vector products $Mx$ quickly for any given $x$, i.e. an algorithm implementing the linear operator $x \mapsto M x$. An explicit matrix is just one possible representation of a linear operator, and not always the best one. Examples:
  - The identity matrix $I$ rarely needs to be stored explicitly, since we can multiply it by vectors with no work at all!
  - Matrix inverses $A^{-1}$ should rarely be computed explicitly, because we can compute $A^{-1} b$ for any $b$ by solving $Ax = b$ by some other method, e.g. using the LU factors.
  - For matrices that are sparse (mostly zero), a 2d array of entries is very wasteful because you are storing lots of zeros. Instead, specialized sparse-matrix data structures store only the nonzero entries, and can use these nonzero entries to multiply by vectors quickly. This is especially important for large-scale linear algebra, where the dimensions can be millions or more — huge matrices are almost never stored explicitly as 2d arrays.
  - QR factorization returns an implicit representation of $Q$ as a collection of Householder reflector operations.
- Given an algorithm to act $M$ on a vector, you can compute the entries of $M$ explicitly simply by computing $M I = M$: multiplying $M$ by the columns of $I$. But often this is a waste of effort, as in the examples above.
Further reading: FNC book, sections 3.3 and 3.4, and Householder QR notes from 18.335 at MIT. Any textbook on numerical linear algebra will cover QR factorization, e.g. the book by Trefethen and Bau, section II.
Numerical methods for eigenvalue problems.