This new MIT course (Spring 2025) introduces numerical methods and numerical analysis to a broad audience (assuming 18.03, 18.06, or equivalents, and some programming experience). It is divided into two 6-unit halves:
- 18.S190/16.S090 (first half-term “hub”): basic numerical methods, including curve fitting, root finding, numerical differentiation and integration, numerical differential equations, and floating-point arithmetic. Emphasizes the complementary concerns of accuracy and computational cost. Prof. Steven G. Johnson and Prof. Qiqi Wang.
- Second half-term: three options for 6-unit “spokes”
  - 18.S191/16.S091 — numerical methods for partial differential equations: finite-difference and finite-volume methods, boundary conditions, accuracy, and stability. Prof. Qiqi Wang.
  - 18.S097/16.S097 — large-scale linear algebra: sparse matrices, iterative methods, randomized methods. Prof. Steven G. Johnson.
  - 18.S192/16.S098 — parallel numerical computing: multi-threading and distributed-memory, and trading off computation for parallelism — may be taken simultaneously with other spokes! Prof. Alan Edelman.
- Taking both the hub and any spoke will count as an 18.3xx class for math majors, similar to 18.330. Taking both the hub and the PDE spoke will substitute for 16.90. Weekly homework, no exams, but spokes will include a final project.
This repository is for the "hub" course (currently assigned the temporary numbers 18.S190/16.S090).
Instructors: Prof. Steven G. Johnson and Prof. Qiqi Wang.
Lectures: MWF10 in 2-142 (Feb 3 – Mar 31), slides and notes posted below. Lecture videos posted in Panopto Video on Canvas.
Homework and grading: 6 weekly psets, posted Fridays and due Friday midnight; psets are accepted up to 24 hours late with a 20% penalty; for any other accommodations, speak with S3 and have them contact the instructors. No exams.
- Homework assignments will require some programming — you can use either Julia or Python (your choice; instruction and examples will use a mix of languages).
- Submit your homework electronically via Gradescope on Canvas as a PDF containing code and results (e.g. from a Jupyter notebook) and a scan of any handwritten solutions.
- Collaboration policy: Talk to anyone you want to and read anything you want to, with two caveats: First, make a solid effort to solve a problem on your own before discussing it with classmates or googling. Second, no matter whom you talk to or what you read, write up the solution on your own, without having their answer in front of you (this includes ChatGPT and similar). (You can use psetpartners.mit.edu to find problem-set partners.)
Teaching Assistants: Mo Chen and Shania Mitra (shania at mit.edu)
Office Hours: Wednesday 4pm in 2-345 (Prof. Johnson) and Thursday 5pm via Zoom (Prof. Wang).
Resources: Piazza discussion forum, math learning center, TSR^2 study/resource room, pset partners.
Textbook: No required textbook, but suggestions for further reading will be posted after each lecture. The book Fundamentals of Numerical Computation (FNC) by Driscoll and Braun is freely available online, has examples in Julia, Python, and Matlab, and is a valuable resource. Fundamentals of Engineering Numerical Analysis (FENA) by Moin is another useful resource (readable online with MIT certificates).
This document is a brief summary of what was covered in each lecture, along with links and suggestions for further reading. It is not a good substitute for attending lecture, but may provide a useful study guide.
- Overview and syllabus: slides and this web page
- Finite-difference approximations: Julia notebook and demo
Brief overview of the huge field of numerical methods, and outline of the small portion that this course will cover. The key new concerns in numerical analysis are (i) performance (traditionally, arithmetic counts, but now memory access often dominates) and (ii) accuracy (both floating-point roundoff errors and also convergence of intrinsic approximations in the algorithms). In contrast, the more pure, abstract mathematics of continuity is called "analysis", and is mainly concerned with (ii) but not (i): they are happy to prove that limits converge, but don't care too much how quickly they converge. Whereas traditional discrete computer science is mainly concerned with (i) but not (ii): they care about performance and resource usage, but traditional algorithms like sorting are either right or wrong, never approximate.
As a starting example, considered the convergence of finite-difference approximations to derivatives df/dx of given functions f(x), which appear in many areas of numerical analysis (such as solving differential equations) and are also closely tied to polynomial approximation and interpolation. By examining the errors in the finite-difference approximation, we immediately see two competing sources of error: truncation error from the non-infinitesimal Δx, and roundoff error from the finite precision of the arithmetic. Understanding these two errors will be the gateway to many other subjects in numerical methods.
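As an illustration (a minimal sketch of my own, not the posted notebook), the following Julia snippet shows both error regimes for a forward-difference approximation of the derivative of sin(x) at x = 1:

```julia
# Forward-difference approximation of f'(x) for shrinking δx: the error first
# decreases (truncation error ∝ δx) and then increases again (roundoff error).
f(x)  = sin(x)
fp(x) = cos(x)                 # exact derivative, used only to measure the error
x = 1.0
for δx in (1e-2, 1e-4, 1e-6, 1e-8, 1e-10, 1e-12)
    fd = (f(x + δx) - f(x)) / δx
    println("δx = $δx:  |error| = $(abs(fd - fp(x)))")
end
```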
Further reading: FNC book: Finite differences, FENA book: chapter 2. There is a lot of information online on finite-difference approximations, e.g. these 18.303 notes or Section 5.7 of Numerical Recipes. The Julia FiniteDifferences.jl package provides lots of algorithms to compute finite-difference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floating-point arithmetic leads to catastrophic cancellation errors.
One of the most basic sources of computational error is that computer arithmetic is generally inexact, leading to roundoff errors. The reason for this is simple: computers can only work with numbers having a finite number of digits, so they cannot even store arbitrary real numbers. Only a finite subset of the real numbers can be represented using a particular number of "bits", and the question becomes which subset to store, how arithmetic on this subset is defined, and how to analyze the errors compared to theoretical exact arithmetic on real numbers.
In floating-point arithmetic, we store both an integer coefficient and an exponent in some base: essentially, scientific notation. This allows large dynamic range and fixed relative accuracy: if fl(x) is the closest floating-point number to any real x, then |fl(x)-x| < ε|x| where ε is the machine precision. This makes error analysis much easier and makes algorithms mostly insensitive to overall scaling or units, but has the disadvantage that it requires specialized floating-point hardware to be fast. Nowadays, all general-purpose computers, and even many little computers like your cell phones, have floating-point units.
Went through some simple definitions and examples in Julia (see notebook above), illustrating the basic ideas and a few interesting tidbits. In particular, we looked at error accumulation during long calculations (e.g. summation), as well as examples of catastrophic cancellation and how it can sometimes be avoided by rearranging a calculation.
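For example, here is a small illustration (in the spirit of the in-class examples, though not copied from the notebook) of catastrophic cancellation and how rearranging a formula can avoid it:

```julia
# Computing (1 - cos(x)) / x^2 for small x: the subtraction 1 - cos(x) cancels
# almost all the significant digits, but the mathematically equivalent form
# 2*sin(x/2)^2 / x^2 involves no cancellation.
x = 1e-8
naive  = (1 - cos(x)) / x^2    # ≈ 0.0: catastrophically wrong
stable = 2 * sin(x/2)^2 / x^2  # ≈ 0.5: correct to nearly machine precision
println(naive, "   ", stable)
```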
Further reading: FNC book: Floating-point numbers. Trefethen & Bau's Numerical Linear Algebra, lecture 13. What Every Computer Scientist Should Know About Floating Point Arithmetic (David Goldberg, ACM 1991). William Kahan, How Java's floating-point hurts everyone everywhere (2004): contains a nice discussion of floating-point myths and misconceptions. A brief but useful summary can be found in this Julia-focused floating-point overview by Prof. John Gibson. Because many programmers never learn how floating-point arithmetic actually works, there are many common myths about its behavior. (An infamous example is 0.1 + 0.2 giving 0.30000000000000004, which people are puzzled by so frequently it has led to a web site https://0.30000000000000004.com/!)
- Interpolation OneNote Notebook Code Complete Notes
- pset 1: due Feb 14
Discussed the important problem of interpolating a function from a finite set of data points, focusing first on piecewise-linear interpolation.
Further reading: FNC chapter 5 and FENA chapter 1. Piecewise linear interpolation is implemented in Python by numpy.interp, and several other interpolation schemes by scipy.interpolate. Interpolation packages in Julia include Interpolations.jl, Dierckx.jl (splines), BasicInterpolators.jl, and FastChebInterp.jl (high-order polynomials).
A basic overview of the Julia programming environment for numerical computations. This tutorial will cover what Julia is and the basics of interaction, scalar/vector/matrix arithmetic, and plotting — just as a "fancy calculator" for now (without the "real programming" features).
- Tutorial materials (and links to other resources)
If possible, try to install Julia on your laptop beforehand using the instructions at the above link. Failing that, you can run Julia in the cloud (see instructions above).
This won't be recorded, but you can find a video of a similar tutorial by Prof. Johnson last year (MIT only), as well as many other tutorial videos at julialang.org/learning.
- Notes: OneNote Notebook
One approach to generalizing piecewise-linear interpolation is to interpolate all $n+1$ data points at once with a single polynomial of degree $n$.
A general conceptual approach is to set up a system of linear equations for the polynomial coefficients: choosing a basis for the polynomials (e.g. the monomials $1, x, x^2, \ldots, x^n$) and requiring the polynomial to match the data at each point yields a "Vandermonde" matrix equation $Ac = y$ for the coefficient vector $c$. However, this runs into two problems for high degrees $n$:
- The matrix $A$ is nearly singular for large $n$ in the monomial basis, so floating-point roundoff errors are exacerbated. (We will say that it has a large condition number or is "ill-conditioned", and will define this more precisely next time.) Solution: It turns out that monomials are just a poorly behaved basis for high-degree polynomials, and it is much better to use orthogonal polynomials, most commonly Chebyshev polynomials, as a basis — with these, people regularly go to degrees of thousands or even millions.
- Polynomial interpolation from equally spaced points (in any basis!) can diverge exponentially from the underlying "true" smooth function in between the points (even in exact arithmetic, with no roundoff errors!). This is called a Runge phenomenon. Solution 1: Use carefully chosen non-equispaced points. A good choice that leads to exponentially good polynomial approximations (for smooth functions) is the Chebyshev nodes, which are clustered near the endpoints. Solution 2: use a lower-degree ($\ll n$) polynomial and perform an approximate fit to the given points, rather than requiring the polynomial to go through them exactly. (More on this soon.)
If you address both of these problems, high-degree polynomial approximation can be a fantastic tool for describing smooth functions. If you have noisy data, however, you should typically use much lower-degree polynomials to avoid overfitting (trying to match the noise rather than the underlying "true" function).
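A minimal sketch (my own illustration, not a posted demo) of the Runge phenomenon: interpolating $f(x) = 1/(1+25x^2)$ on $[-1,1]$ with a degree-20 polynomial from equispaced vs. Chebyshev points. For simplicity it solves a monomial-basis Vandermonde system, which is tolerable at this modest degree even though (as noted above) that basis becomes badly conditioned at much higher degrees:

```julia
using LinearAlgebra

runge(x) = 1 / (1 + 25x^2)

# maximum interpolation error of the degree-n polynomial through the given points
function interp_error(points)
    n = length(points) - 1
    V = [x^j for x in points, j in 0:n]   # Vandermonde matrix in the monomial basis
    c = V \ runge.(points)                # interpolating-polynomial coefficients
    p(x) = evalpoly(x, c)                 # Horner evaluation of the polynomial
    return maximum(abs(p(x) - runge(x)) for x in range(-1, 1; length=1001))
end

n = 20
equispaced = range(-1, 1; length=n+1)
chebpoints = [cos(π*k/n) for k in 0:n]    # Chebyshev points, clustered near ±1
interp_error(equispaced), interp_error(chebpoints)  # huge error vs. small error
```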
Further reading: FNC book, chapter 9. Beware that the FENA book starts with the "Lagrange formula" for the interpolating polynomial, but this formula is very badly behaved ("numerically unstable") for high degrees and should not be used; it is superseded by the "barycentric" Lagrange formula (see FNC book; reviewed in much more detail by Berrut and Trefethen, 2004). The subject of polynomial interpolation is the entry point into approximation theory; if you are interested, the book by Trefethen and accompanying video lectures are a great place to get more depth. The numpy.polynomial module contains a variety of functions for polynomial interpolation, including with Chebyshev polynomials. A package for multi-dimensional Chebyshev polynomial interpolation in Julia is FastChebInterp. A pioneering package that shows off the power of Chebyshev polynomial interpolation is chebfun in Matlab, along with related packages in other languages. (This approach is taken to supercomputing extremes by the Dedalus package to solve partial differential equations using exponentially converging polynomial approximations.) It turns out that if you are interpolating using Chebyshev polynomials, and you choose your points to be Chebyshev points, then the interpolation can be computed especially efficiently, in $O(n \log n)$ operations using fast Fourier transforms (FFTs).
The goal of this lecture is to precisely define the notion of a condition number, which quantifies the sensitivity of a function f(x) to small perturbations in the input. A function that has a "large" condition number is called "ill-conditioned", and any algorithm to compute it may suffer inaccurate results from any sort of error (whether it be roundoff errors, input noise, or other approximations) — it doesn't mean you can't use that function, but it usually means you need to be careful (about both implementation and about interpretation of the results).
For a given function $f(x)$ and input $x$, we define two condition numbers:
- absolute condition number $\kappa_a(f, x) = \lim_{\epsilon \to 0^+} \sup_{\Vert \delta x \Vert = \epsilon} \frac{\Vert \overbrace{f(x + \delta x) - f(x)}^{\delta f} \Vert}{\Vert \delta x \Vert}$. It is the worst-case $\Vert \delta f \Vert / \Vert \delta x \Vert$ for any arbitrarily small input perturbation $\delta x$. Unfortunately, if the inputs and outputs of $f$ have different scales ("units"), then $\kappa_a$ may be hard to interpret (it is "dimensionful").
- relative condition number $\kappa_r(f, x) = \lim_{\epsilon \to 0^+} \sup_{\Vert \delta x \Vert = \epsilon} \frac{\Vert \delta f \Vert / \Vert f(x) \Vert}{\Vert \delta x \Vert / \Vert x \Vert} = \kappa_a(f, x) \frac{\Vert x \Vert}{\Vert f(x) \Vert}$. This is a dimensionless quantity: it is the maximum ratio of the relative change in $f$ to the relative change in $x$. Most of the time, $\kappa_r$ is the quantity of interest.
All of these quantities involve the central concept of a norm ‖⋯‖, which is a way of measuring the "length" of a vector. The most familiar norm, and usually the default or implicit choice for column vectors $x \in \mathbb{R}^n$, is the Euclidean norm $\Vert x \Vert = \sqrt{\sum_i |x_i|^2}$.
For example, looked at the condition number of summation, $f(x) = \sum_i x_i$: its relative condition number becomes large when the sum nearly cancels (the result is much smaller in magnitude than the inputs), which is exactly the catastrophic-cancellation situation we saw with floating-point arithmetic.
If the function is differentiable, then the condition number simplifies even further:
- If $x$ and $f(x)$ are scalars, then $\kappa_a = |f'(x)|$ is simply the magnitude of the derivative.
- If $x \in \mathbb{R}^n$ and $f(x) \in \mathbb{R}^m$ are vectors, then $\kappa_a = \Vert J \Vert$ is the "operator" or "induced" norm of the Jacobian matrix $J$ ($J_{i,j} = \partial f_i /\partial x_j$), where the induced norm $\Vert A \Vert = \sup_{x \ne 0} \frac{\Vert A x \Vert}{\Vert x \Vert}$ measures "how big" a matrix $A$ is by how much it can stretch a vector. (We won't worry too much yet about how to compute this induced norm, but if you know the SVD it is the largest singular value of $A$.)
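As a quick numerical check of this formula (my own illustrative example, not from the notes), one can compute $\kappa_a = \Vert J \Vert$ and the corresponding $\kappa_r$ for a simple vector-valued function:

```julia
using LinearAlgebra

f(x) = [x[1] + x[2], x[1] * x[2]]
J(x) = [1.0 1.0; x[2] x[1]]        # Jacobian J[i,j] = ∂fᵢ/∂xⱼ, worked out by hand

x = [3.0, 4.0]
κa = opnorm(J(x))                  # induced 2-norm = largest singular value of J
κr = κa * norm(x) / norm(f(x))     # relative condition number at this x
(κa, κr)
```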
This leads us to our next important concept, the (relative) condition number of a matrix $A$, defined as $\kappa(A) = \Vert A \Vert \, \Vert A^{-1} \Vert$: it bounds how much relative errors in the right-hand side $b$ (or in $A$ itself) can be amplified in the solution $x$ when solving $Ax = b$.
For example, we can now explain why the monomial basis was so bad: it is easy to see that the Vandermonde matrix becomes nearly singular for large $n$, so its condition number $\kappa(A)$ grows enormously with the polynomial degree, and roundoff errors are amplified accordingly.
Further reading: FNC book: problems and conditioning and conditioning linear systems. 18.06 lecture notes on conditioning of linear systems. Advanced books like Numerical Linear Algebra by Trefethen and Bau (lecture 12) treat this subject in much more depth. See also Trefethen, lecture 3, for more in-depth coverage of norms. A fancy vocabulary word for a vector space with a norm (plus some technical criteria) is a Banach space.
Lecture 6 (Feb 14 💕)
- pset 1 solutions
- pset 2: due Feb 21
- Notes: OneNote Notebook
Introduced the topic of least-square fitting of data to curves. As long as the fitting function is linear in the unknown coefficients c (e.g. a polynomial, or some other linear combination of basis functions), showed that minimizing the sum of the squares of the errors corresponds to minimizing the norm of the residual, i.e. the "loss function" L(c) = ‖Ac - y‖², where A is a Vandermonde-like matrix (whose columns are the basis functions evaluated at the data points) and y is the vector of data values to be fit.
It is a straightforward calculus exercise to show that ∇L = 2Aᵀ(Ac - y), which means that the optimal coefficients c can be found by setting the gradient to zero, and ∇L=0 implies the "normal equations" AᵀAc = Aᵀy. In principle, these can be solved directly, but the normal equations square the condition number of A (κ(AᵀA)=κ(A)²), so they are normally solved in a different way, typically by QR factorization of A (or sometimes using the SVD of A); there are typically library functions that do this for you, e.g. c = numpy.linalg.lstsq(A, y) in Python or c = A \ y in Julia and Matlab.
More generally, minimizing L(c) is an example of an optimization problem. One simple way to attack such problems is by gradient descent: simply go downhill, in steps Δc = -s∇L where s is the "learning rate" or "step size" parameter (that has to be chosen carefully: too small and it will converge very slowly, but too large and it won't converge at all). It turns out that an ill-conditioned A also leads to gradient descent that converges slowly (because the local downhill direction -∇L doesn't point to the minimum, necessitating small steps). Nowadays, naive gradient descent (especially with a fixed learning rate) is a rather primitive technique that has been mostly superseded by better "accelerated" methods, but computing the gradient ∇L and using it to identify the downhill direction is still a key conceptual starting point for many algorithms. Of course, for least-square fitting where L(c) is linear in c, we can directly solve for c as shown above, but iterative optimization methods are crucial to solve more general problems where the unknowns enter in a nonlinear way (and/or there are constraints).
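Here is a minimal sketch (illustrative data and parameters of my own, not the in-class demo) contrasting the direct least-squares solve with naive gradient descent on the same loss L(c) = ‖Ac − y‖²:

```julia
using LinearAlgebra

x = range(0, 1; length=50)
y = 1 .+ 2 .* x .- 3 .* x.^2 .+ 0.01 .* randn(length(x))  # noisy samples of a quadratic
A = [x.^0 x.^1 x.^2]               # monomial-basis "Vandermonde" matrix (50×3)

c_direct = A \ y                   # least-squares solution via QR factorization

c = zeros(3)                       # gradient descent on L(c) = ‖Ac - y‖²
s = 0.4 / opnorm(A)^2              # step size below the 1/‖AᵀA‖ convergence limit
for _ in 1:10_000
    global c -= s * (2 * A' * (A*c - y))   # step Δc = -s∇L
end
norm(c - c_direct)                 # gradient descent slowly approaches the direct solution
```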
In both cases, compared the effect of the monomial basis vs. the Chebyshev-polynomial basis. Because the former leads to an ill-conditioned A, it hurts both the accuracy of the direct solution and the convergence rate of gradient descent, whereas the Chebyshev basis behaves much better.
Further reading: FNC book chapter 3. Strang's Introduction to Linear Algebra section 4.3 and 18.06 video lecture 16. There are many, many books and other materials on linear least-squares fitting, from many different perspectives (e.g. numerical linear algebra, statistics, machine learning…) that you can find online. The FastChebInterp.jl package in Julia does least-square fitting ("regression") with Chebyshev polynomials for you in an optimized way, including multi-dimensional fitting.
First, discussed the computational cost of interpolation and fitting.
- Solving the least-squares problem $\min_x \Vert Ax - b\Vert$ for an $m \times n$ matrix $A$ ($m$ points and $n$ basis functions) has $O(mn^2)$ cost, whether you use the normal equations $A^T A x = A^T b$ (which squares the condition number) or a more accurate method like QR factorization.
- Interpolation can be thought of as the special case $m=n$: solving an $n \times n$ linear system $Ax = b$ is $O(n^3)$ (usually done by LU factorization, which is the matrix-factor form of Gaussian elimination). However, for specific cases there are more efficient algorithms: for polynomial interpolation from arbitrary points, the barycentric Lagrange formula (mentioned above) has $O(n^2)$ cost, and using the Chebyshev-polynomial basis from Chebyshev points has $O(n \log n)$ cost (via FFTs).
The computational scaling becomes even more important when you go to higher dimensions: multivariate interpolation and fitting. We discussed a few cases, using two dimensions (2d) for simplicity:
- If you have a general basis $b_k(x,y)$ of 2d basis functions, you can form a Vandermonde-like matrix $A$ as above (whose columns are the basis functions and whose rows are the grid points), and solve it as before. The matrices get much larger in higher dimensions, though!
- If you can choose a Cartesian "grid" of points, also known as a "tensor product grid" (a grid in x by a grid in y), then it is convenient to use separable basis functions $p_i(x) p_j(y)$, e.g. products of polynomials. While you can still form a Vandermonde-like system as above, it turns out that there are much more efficient algorithms in this case. (The ideal case is a tensor product of Chebyshev points along each dimension, in which case you can interpolate products of Chebyshev polynomials in $O(n \log n)$ cost.)
- If you have a tensor product grid, you can treat it as a collection of rectangles, and do bilinear interpolation on each rectangle — this is the obvious generalization of piecewise-linear interpolation in 1d. It is fast and second-order accurate (and there are higher-order versions like bicubic interpolation too).
- If you have an irregular set of points, there is still an analogue of piecewise-linear interpolation. One first connects the set of points into triangles in 2d (or tetrahedra in 3d, or more generally "simplices"); this is a tricky problem of computational geometry, but a standard and robust solution is called Delaunay triangulation. Once this is done, you can interpolate within each triangle (or simplex) using an affine function (linear + constant, $a^T x + \beta$).
Further reading: FNC book section 2.5 on efficiency of solving Ax=b; much more detail on both this and the least-squares case can be found in e.g. Trefethen & Bau's Numerical Linear Algebra and many other sources. Links to the barycentric formula can be found above, along with fast algorithms for Chebyshev interpolation. Tensor-product-grid interpolation and fitting products are also closely related to Kronecker products of matrices, and there are often more efficient algorithms than simply forming the giant "Vandermonde" matrix and solving it.
- Radial basis function (RBF) interpolation: slides
- Quadrature: Notes on error analysis of the trapezoidal rule and Clenshaw-Curtis quadrature in terms of Fourier cosine series, and a quick review of cosine series.
To begin with, briefly touched on another popular method for multidimensional interpolation: radial basis functions (RBFs). This can be understood in the same computational framework as polynomial interpolation: we form a "Vandermonde" matrix equation Ac=b, where the rows are data points and the columns are basis functions, to solve for the coefficients c. In this case, the basis functions are radially symmetric "bumps" φ(‖x - xₖ‖) centered on the data points xₖ (e.g. Gaussians).
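A minimal sketch of this idea (my own illustration with Gaussian basis functions; the specific RBF and parameters in the slides may differ):

```julia
using LinearAlgebra

xk = range(-1, 1; length=15)                 # data points, which also serve as RBF centers
yk = [1 / (1 + 25x^2) for x in xk]           # sample the Runge function as test data
σ = 0.3                                      # (arbitrary) width of the Gaussian bumps
φ(r) = exp(-(r/σ)^2)

A = [φ(abs(xi - xj)) for xi in xk, xj in xk] # "Vandermonde"-like matrix A[i,j] = φ(|xᵢ - xⱼ|)
c = A \ yk                                   # solve for the RBF coefficients

interp(x) = sum(c[j] * φ(abs(x - xk[j])) for j in eachindex(xk))
interp(0.3), 1 / (1 + 25 * 0.3^2)            # interpolant vs. exact value at a test point
```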
Launched into a new topic: Numerical integration ("quadrature"). In general, a "quadrature rule" is a scheme for estimating a definite integral ∫f(x)dx by a sum ∑ₖf(xₖ)wₖ of f(x) evaluated at N (or N+1) quadrature points xₖ with quadrature weights wₖ. The goal is to choose these quadrature points and weights so that the estimate converges to the true integral as quickly as possible with N. (The typical assumption is that evaluating f(x) is the dominant computational cost, so we want to minimize function evaluations.) Quadrature is closely related to interpolation: most quadrature schemes (especially in 1d) proceed by picking points, interpolating the function between the points somehow (usually by polynomials), and then integrating the interpolant (which is easy with polynomials).
Began by analyzing the two simplest schemes with equally spaced points: piecewise-linear interpolation leads to a composite trapezoidal rule with error decreasing as O(1/N²) for smooth integrands, and piecewise-constant interpolation leads to a "rectangle rule" whose error generally decreases only as O(1/N).
These error estimates are upper bounds, but some functions do much better! For periodic functions, the trapezoidal and rectangle rules are equivalent, and for smooth ("analytic") periodic functions the error converges exponentially fast with N.
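A minimal sketch (not from the course notes) of the composite trapezoidal rule, showing the generic O(1/N²) convergence on a smooth non-periodic integrand and the much faster convergence on a smooth periodic one:

```julia
function trapezoid(f, a, b, N)
    h = (b - a) / N
    x = range(a, b; length=N+1)
    return h * (sum(f, x) - (f(a) + f(b))/2)   # interior points weight h, endpoints h/2
end

# smooth but non-periodic: error shrinks like O(1/N²)
for N in (10, 20, 40)
    println(N, "  ", abs(trapezoid(exp, 0, 1, N) - (exp(1) - 1)))
end

# smooth and periodic over a full period: error shrinks exponentially fast with N
g(x) = 1 / (2 + cos(x))
exact = 2π / sqrt(3)                            # ∫₀²π dx / (2 + cos x)
for N in (4, 8, 16)
    println(N, "  ", abs(trapezoid(g, 0, 2π, N) - exact))
end
```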
Further reading (RBFs): A very readable overview of radial basis functions can be found in this blog post by Shih-Chin. Much deeper coverage can be found in the book Meshfree Approximation Methods With Matlab (Fasshauer, 2007) or in this review by Buhmann (2000).
Further reading (quadrature): FNC section 5.6. Lloyd N. Trefethen, "Is Gauss quadrature better than Clenshaw-Curtis?," SIAM Review 50 (1), 67-87 (2008). Trefethen's Six Myths of Polynomial Interpolation and Quadrature (2011) is a shorter and more colloquial description of some of these ideas (and more!). A related polynomial interpolation method (in some sense a generalization of quadrature by Chebyshev polynomials/points) is Gaussian quadrature (and its many variants), whose accuracy is analyzed in the Trefethen papers above, and the state of the art for computing which is Hale and Townsend (2012); a suboptimal but beautifully simple algorithm was described by Golub & Welsch (1969).
- pset 2 solutions
- pset 3: due Friday, Feb 28.
Continued analysis from Lecture 8 (see notes). We related the convergence rate of the trapezoidal rule to the convergence rate of the Fourier cosine series, and showed (using integration by parts) that the convergence rate of the cosine series is determined by the behavior of the odd-order derivatives at the boundaries (assuming that the function is smooth in the interior). This reproduces the generic O(1/N²) convergence of the trapezoidal rule, as well as the much faster convergence for smooth periodic functions, where the boundary terms cancel.
Moreover, showed how we can arrange for this fast convergence to occur all the time: for an arbitrary smooth function f(x) on a finite interval, the change of variables x = cos θ produces a smooth function of θ that is automatically even and periodic, so its cosine series (and hence the corresponding trapezoidal rule in θ) converges rapidly. This is the idea behind Clenshaw–Curtis quadrature.
Finally, if we change variables back to x = cos θ, the equally spaced points in θ become Chebyshev points in x, and the cosine series becomes a series of Chebyshev polynomials: Clenshaw–Curtis quadrature is equivalent to integrating a Chebyshev-polynomial interpolant of f(x).
Further reading: See further reading for lecture 8, and for lecture 4 on Chebyshev polynomials.
- quadrature overview slides: now that we've analyzed trapezoidal rule and Clenshaw–Curtis, let's zoom out to survey the bigger picture
Overview of the big picture of quadrature algorithms: Clenshaw–Curtis is not the end!
Discussed Richardson extrapolation, which is a powerful general technique to compute limits numerically: given values $y(h_1), y(h_2), \ldots$ computed at a sequence of small but nonzero parameters $h_i$, it estimates the limit $y(0^+) = \lim_{h \to 0^+} y(h)$ by extrapolating polynomial fits of the $(h_i, y(h_i))$ data to $h = 0$. Key features of a good implementation:
- It computes many extrapolations, formed by degree-$q$ polynomial interpolations of all consecutive subsequences of $q+1$ $(h_i, y(h_i))$ points, for all possible degrees $q$.
- All of these extrapolations are computed efficiently at the same time (in $O(n^2)$ cost for $n$ points) by a linear recurrence relation where each degree-$q$ extrapolation is computed from two degree-$(q-1)$ extrapolations (forming a table called a "Neville tableau" or "Neville–Aitken tableau": it is a version of Neville's algorithm for polynomial interpolation).
- Each degree-$q > 0$ extrapolant comes with an error estimate, given by its difference from the degree-$(q-1)$ estimate ending at the same $h_i$ point. This allows Richardson's method to be robust and adaptive: by using the extrapolant with the smallest error estimate for the final $y(0^+)$ result, it automatically selects a good subsequence of $h_i$ values (neither so large that polynomial fitting doesn't work, nor so small that e.g. roundoff/cancellation errors dominate).
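Here is a minimal sketch of such a Neville-style Richardson table (my own simplified implementation, not the Richardson.jl algorithm), applied to extrapolating a forward-difference derivative to h → 0⁺:

```julia
# Extrapolate y(0⁺) from samples y(hᵢ) at a decreasing sequence h₁ > h₂ > …
function richardson(y, hs)
    n = length(hs)
    T = [y(h) for h in hs]                 # degree-0 "extrapolants": the raw samples
    best, besterr = T[end], Inf
    for q in 1:n-1                          # build degree-q extrapolants in place
        for i in n:-1:q+1
            Tnew = T[i] + (T[i] - T[i-1]) * hs[i] / (hs[i-q] - hs[i])  # Neville update
            err = abs(Tnew - T[i])          # error estimate: change from degree q-1
            if err < besterr
                best, besterr = Tnew, err
            end
            T[i] = Tnew
        end
    end
    return best, besterr
end

f(x) = sin(x)
yh(h) = (f(1 + h) - f(1)) / h              # crude O(h) forward-difference derivative
deriv, errest = richardson(yh, [0.1 / 2^k for k in 0:7])
(deriv - cos(1), errest)                    # extrapolated error is far below the raw differences
```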
Famous applications of Richardson extrapolation include Romberg integration (extrapolating low-order quadrature formulas), the Bulirsch–Stoer ODE method (extrapolating ODE solutions to zero stepsize), and an algorithm by Ridders (1982) for extrapolating finite-difference derivatives (also reviewed in Numerical Recipes sec. 5.7).
Further reading: Unfortunately, almost all descriptions of Richardson extrapolation combine it with a particular application, e.g. many textbooks only describe it in the context of ODEs, or integration, or differentiation. These course notes by Flaherty are in the ODE context, but are written in a general enough way that you can see the applicability to other problems, and they discuss adaptive error estimation briefly at the end. The Richardson.jl package in Julia implements the algorithm in a very general setting, and its documentation includes a number of examples; it's used by the Romberg.jl package for Romberg integration and by the FiniteDifferences.jl package for extrapolating finite-difference approximations. In Python, I'm not currently aware of any general-purpose implementation (though there are implementations in the context of e.g. derivatives or integration).
- pset 3 solutions
- pset 4: due Friday, March 7
Root finding is the problem of solving $f(x) = 0$: finding a root $x = r$ where the function $f$ vanishes, typically by some iterative procedure that improves an initial guess.
The most famous such algorithm is Newton's method, which you probably learned in first-year calculus. The key idea is to use the derivative $f'(x)$ to form a linear (first-order Taylor) approximation of $f$ around the current guess, and then take the root of that linear approximation as the next guess: $x_{k+1} = x_k - f(x_k)/f'(x_k)$.
In calculus, it was enough to set up this algorithm, try it out, and see that it works pretty well (if you have a reasonable starting point). In numerical analysis, we want to be more precise: how fast does it converge, once $x_k$ is close to the root? It turns out that Newton's method converges quadratically: asymptotically, the number of correct digits doubles on every iteration.
Gave a numerical demo in which we applied Newton's method to find square roots $\sqrt{a}$, i.e. roots of $f(x) = x^2 - a$, and observed this doubling of the number of correct digits.
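A minimal sketch of this kind of iteration (illustrative only; the in-class notebook may differ in details):

```julia
# Newton's method for f(x) = x² - a, i.e. computing √a: x ← x - f(x)/f'(x).
function newton_sqrt(a; x = float(a), tol = 1e-12, maxiter = 100)
    for _ in 1:maxiter
        xnew = x - (x^2 - a) / (2x)           # Newton step
        println(xnew)                          # watch the correct digits roughly double
        abs(xnew - x) ≤ tol * abs(xnew) && return xnew
        x = xnew
    end
    return x
end

newton_sqrt(2) ≈ sqrt(2)   # true
```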
Another important extension of Newton's method is to systems of $n$ nonlinear equations in $n$ unknowns, $f(x) = 0$ with $f: \mathbb{R}^n \to \mathbb{R}^n$: the derivative is then the Jacobian matrix $J$, and each Newton step solves the linear system $J(x_k)\,\delta = -f(x_k)$ for the update $x_{k+1} = x_k + \delta$.
Newton's method works well if you start with a guess that is "reasonably close" to the root. For starting points that are far from the root, however, it can be wildly unpredictable in fascinating ways, leading to Newton fractals.
Further reading: FNC book, chapter 4, especially section 4.3 on Newton's method. A key class of methods that we didn't cover involve what to do when you don't have access to the derivative (or in the case of large nonlinear systems the whole Jacobian matrix might be too expensive to compute or even store). In this case, there are many algorithms:
- Fixed-point iteration attempts to solve $g(x)=x$ (where $g(x) = f(x) + x$) by iterating $x_{k+1} = g(x_k)$. This may not converge at all unless $|g'(r)| < 1$, but it is the starting point for the much more robust and fast-converging Anderson-accelerated fixed-point iteration, which itself is closely related to "quasi-Newton" methods.
- In 1d, secant methods are Newton-like methods that perform linearization (or higher-order approximation) by interpolating two (or more) points. Generalization to multiple dimensions yields quasi-Newton methods such as Broyden's method and Anderson acceleration.
- Newton–Krylov methods only require you to compute Jacobian–vector products (jvp's) $f'(x_k)\delta$ for given $x_k, \delta$, which corresponds to a single directional derivative — this is much cheaper than computing the entire Jacobian in high dimensions, at the price of using an iterative linear solver to find each Newton step.

Another important class of methods are numerical continuation methods, which solve a family of root problems $f(x,\lambda) = 0$ for the roots $x(\lambda)$ at each value of the parameter $\lambda$. Given a root at one $\lambda$, the simplest algorithm is to use this as the starting point for Newton's method to find the root at a nearby $\lambda$, but there are much more sophisticated methods such as pseudo-arclength continuation. A modern implementation of numerical continuation can be found in, for example, the BifurcationKit.jl package in Julia. There are many books and reviews on this topic, e.g. Herbert B. Keller, Lectures on Numerical Methods in Bifurcation Problems (Springer, 1988).
- lecture notes/code: [One Note](https://mitprod-my.sharepoint.com/personal/qiqi_mit_edu/_layouts/15/Doc.aspx?sourcedoc={24321bd2-8c69-451a-b0cc-c2b42aed5743}&action=view&wd=target%28B.%20Initial%20Value%20Problems%2F20250303.one%7C6663d749-3747-45de-bb34-f8393732d800%2F%29&wdorigin=717)
Numerical methods for ordinary differential equations (ODEs). Introduced the concept of an initial value problem: given du/dt = f(u, t) and an initial condition u(0), compute an approximate solution u(t) at later times, typically by marching through a sequence of discrete timesteps Δt.
Looked at an example problem, du/dt = -u with u(0) = 1 (exact solution exp(-t)), and compared simple finite-difference discretizations, including the first-order forward-Euler scheme and a second-order centered-difference ("midpoint") scheme.
Moreover, we used the "stencil" algorithm from pset 1 to derive a 3rd-order finite difference approximation, and found that the behaviour was even worse than the midpoint rule. Even for a fixed time t, the numerical solution blew up as Δt was made smaller and smaller, a surprising "instability" that we will explain in the next lecture.
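A minimal sketch of forward-Euler time-stepping for the model problem du/dt = -u, u(0) = 1 (my own illustration; details of the in-class demo may differ), showing the first-order convergence of the error at a fixed time t = 1:

```julia
# uₖ₊₁ = uₖ + Δt · f(uₖ, tₖ)  (forward Euler), integrated for nsteps steps
function euler(f, u0, Δt, nsteps)
    u, t = u0, 0.0
    for _ in 1:nsteps
        u += Δt * f(u, t)
        t += Δt
    end
    return u
end

for n in (10, 20, 40)
    err = abs(euler((u, t) -> -u, 1.0, 1/n, n) - exp(-1))
    println("Δt = $(1/n):  error ≈ $err")      # error shrinks roughly ∝ Δt
end
```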
Further reading: FNC book, chapter 6, sections 6.1–6.2. FENA book, chapter 4 and section 4.1. You can also find hundreds of other web pages and videos on these topics. 3Blue1Brown has an entertaining introduction to the idea of a differential equation. And here is a nice video about the history of numerical ODE solvers that talks about the pioneering contributions of Katherine Johnson and her portrayal in the 2016 film Hidden Figures.
- notes and code: [One Note](https://mitprod-my.sharepoint.com/personal/qiqi_mit_edu/_layouts/15/Doc.aspx?sourcedoc={24321bd2-8c69-451a-b0cc-c2b42aed5743}&action=view&wd=target%28B.%20Initial%20Value%20Problems%2F20250305.one%7Cd1ccb4e9-d0d8-414c-98cd-f1392b3d5277%2F%29&wdorigin=717) [Code](https://colab.research.google.com/drive/1fmyyPSp5gImrcj42wxxn3kBqTDavq5qJ?usp=sharing)
Stability of numerical ODE methods. The term "stability" has special meanings in this context, different from other areas of numerical analysis where "stability" is mostly about the sensitivity of an algorithm to roundoff errors. For ODE methods, "stability" refers to whether the effects of (truncation) errors can be amplified without bound as you take more and more timesteps. We distinguished several related notions:
- Consistency: Do the local truncation errors of a single time step $u_{n+1} = \cdots$ go to zero as $O(\Delta t^{1+p})$ for $p > 0$? That is, are we correctly approximating $\frac{du}{dt} - f(u,t)$ to some positive order $O(\Delta t^p)$?
- Zero stability: With a right-hand side $f=0$, the method is zero-unstable if a nonzero initial condition can diverge ($\Vert u_k \Vert \to \infty$) as the step $k \to \infty$ (with $\Delta t$ cancelling from the analysis for $f=0$, so this can be viewed as a divergence as $\Delta t \to 0$ for a fixed $t$).
- Linear stability: Later, we will include the right-hand side in the stability analysis by linearizing it when $u$ is close to a root of $f$, especially for autonomous ODEs $f(u,t) = f(u)$. Through this analysis, we will show that some methods diverge for a fixed $\Delta t$ as you increase the total time $t$, perhaps unless $\Delta t$ is sufficiently small (conditional stability).
An important result is the Dahlquist equivalence theorem (closely related to the Lax (or Lax–Richtmyer) equivalence theorem):
- If a method is both consistent and zero-stable, then it is ("globally") convergent: the approximate solution $\tilde{u}(t)$ approaches the exact solution $u(t)$ at any time $t$, as $\Delta t \to 0$, with a convergence rate $\Vert\tilde{u}(t) - u(t)\Vert = O(\Delta t^p)$ matching the local truncation error.
That is, there are only two ways an ODE method could go badly wrong: either you have a bug in your finite-difference approximation (it's inconsistent with the equation you are discretizing, i.e. not approximating the right equation), or the solution diverges (it's unstable) as you take more and more timesteps ($\Delta t \to 0$ for a fixed time $t$).
All three of the schemes (Euler, midpoint, and third-order) from last lecture were consistent (with $p = 1, 2, 3$, respectively), so any difference in their behavior must come from stability.
The trick to analyze zero stability is to set $f = 0$ and write the resulting recurrence for $u_{k+1}$ (which may involve several previous steps) as repeated multiplication by a fixed matrix; whether the iterates can blow up is then governed by the eigenvalues of that matrix.
Showed by this analysis that Euler is zero-stable (its $f=0$ recurrence is simply $u_{k+1} = u_k$), as is the midpoint rule (eigenvalues $\pm 1$), but the third-order scheme from last lecture is zero-unstable: its recurrence matrix has an eigenvalue with magnitude greater than 1, which explains the divergence we observed.
Further reading: FNC book, section 6.8: zero stability (which used an alternative framing equivalent to eigenvalues, but without explicitly forming the matrix). FENA book, chapter 4.2 (which used the eigenvalue formulation). For the general problem of analyzing matrix powers and linear recurrences via eigenvalues, you may want to review some material from 18.06: Strang Intro. to Linear Algebra section 6.2, and 18.06 OCW lecture 22 (diagonalization and matrix powers).
- notes and code: One Note Code
- pset 4 solutions
- pset 5: due Friday March 14
Linear stability analysis of ODEs and discretization schemes.
- The exact ODE $\frac{du}{dt} = \lambda u$ has exponentially growing solutions for $\Re \lambda > 0$, and non-growing ("stable") solutions for $\Re \lambda \le 0$. This analysis extends to linear autonomous ODEs $\frac{du}{dt} = A u$ where $A$ is a (diagonalizable) matrix, since we can just check each eigenvalue $\lambda$ of $A$. (Later, we will also extend this analysis to nonlinear autonomous ODEs $\frac{du}{dt} = f(u)$ by approximately linearizing $f(u)$ around a root using the Jacobian of $f$.)
- When we discretize the ODE, we can plug $\lambda u$ (or $Au$) in for the right-hand side, for a fixed $\Delta t$, and again use eigenvalues to analyze whether $u_k \approx u(k\Delta t)$ is growing or decaying with $k$. (This reduces to "zero stability" analysis in the limit $\Delta t \to 0$, for which the right-hand side disappears from the formula for $u_k$.)
In this way, we find that certain discretization schemes are linearly stable (non-growing $u_k$) only for certain values of $\lambda \Delta t$: each scheme has a "stability region" in the complex $\lambda \Delta t$ plane.
To begin with, we analyzed the forward Euler ("explicit") scheme $u_{k+1} = u_k + \Delta t \, \lambda u_k = (1 + \lambda \Delta t) u_k$, which gives $u_k = (1 + \lambda \Delta t)^k u_0$: it is linearly stable (non-growing) if and only if $|1 + \lambda \Delta t| \le 1$.
- For $\frac{du}{dt} = -u$, $\lambda = -1$, so it is stable for $0 \le \Delta t \le 2$. (This is why it performed so well numerically.) (The fact that it is only stable for certain values of $\Delta t$ in this case is called conditional stability; note that this depends on the right-hand side of the ODE!)
- For $\frac{d^2 u}{dt^2} = -u$, we saw last time that this is equivalent to the $2 \times 2$ system $\frac{d}{dt} \begin{pmatrix} u \\ v \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 1 \\ -1 & 0 \end{pmatrix}}_{A} \begin{pmatrix} u \\ v \end{pmatrix}$, where $A$ has eigenvalues $\lambda = \pm i$. In this case $|1 + \lambda \Delta t| > 1$ for all $\Delta t > 0$, and forward Euler is linearly unstable. It is still zero-stable, so it still converges as $\Delta t \to 0$ for any fixed time $t$! But instead of oscillating solutions (for the exact ODE), we have solutions that are oscillating and slowly exponentially growing (more and more slowly as $\Delta t$ gets smaller, which allows it to converge). Even though the solutions converge, people are often unhappy with ODE methods that generate exponential growth (not present in the exact ODE) as you run for longer and longer times!
Next, we analyzed the backward Euler ("implicit") scheme $u_{k+1} = u_k + \Delta t \, \lambda u_{k+1}$, i.e. $u_{k+1} = u_k / (1 - \lambda \Delta t)$: it is stable whenever $|1 - \lambda \Delta t| \ge 1$, which includes every $\Delta t > 0$ when $\Re \lambda \le 0$ (unconditional stability).
- This is why people use implicit ODE schemes: they are often more complicated to implement, because $f(u_{k+1})$ appears on the right-hand-side, requiring you to solve for $u_{k+1}$ (which gets expensive as $f$ gets more complicated), but they tend to be more stable than explicit schemes.
Further reading: FENA book, section 4.3.
- notes and code: see links above
Analyzed the linear stability of the midpoint rule, and found that it was only stable for a limited range of purely imaginary $\lambda \Delta t$ (marginally stable for $|\lambda \Delta t| < 1$ on the imaginary axis), and unstable whenever $\lambda$ has a nonzero real part.
To illustrate the utility of "implicit" schemes like backwards Euler, which are more difficult to implement and require a more expensive time-stepping procedure (solving a system of equations on each step, possibly even a nonlinear system), considered a simple example problem involving heat flow. In a system where heat can flow quickly between some components (e.g. metals in contact) but slowly between other components (e.g. between metals and air), one obtains a "stiff" ODE, characterized by a large ratio of timescales (eigenvalues).
With forward Euler (or other explicit methods) in a stiff problem, a small $\Delta t$ is required for stability, set by the fastest timescale (the largest-magnitude eigenvalue) even if that fast transient is not what you care about, so you are forced to take a huge number of timesteps. An implicit scheme like backward Euler, which is stable for any $\Delta t$ (when $\Re\lambda \le 0$), can instead take timesteps chosen only for the accuracy of the slow dynamics of interest.
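A tiny illustration of this (my own sketch, with an arbitrary λ = -1000) comparing the forward- and backward-Euler amplification factors on a fast-decaying mode:

```julia
# du/dt = λu with λ = -1000 (a "fast" decaying mode), stepped with Δt = 0.01,
# which is far above forward Euler's stability limit Δt ≤ 2/|λ| = 0.002.
function compare_euler(λ, Δt, nsteps)
    u_fwd = u_bwd = 1.0
    for _ in 1:nsteps
        u_fwd *= 1 + λ*Δt          # forward Euler factor: |1 + λΔt| = 9 > 1, blows up
        u_bwd /= 1 - λ*Δt          # backward Euler factor: 1/11 < 1, decays as it should
    end
    return u_fwd, u_bwd
end

compare_euler(-1000.0, 0.01, 100)  # forward Euler ≈ 1e95 (exploded), backward Euler ≈ 1e-104
```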
Further reading: The FENA book, section 4.10, has some further discussion of stiff systems. There are many sources online about methods for stiff equations and implicit ODE methods. See e.g. these course notes from MIT 18.337/6.338, which focus mainly on ways to construct the Jacobian (to linearize the right-hand-side) and/or efficiently perform the implicit solves on each step for the nonlinear case. The book by Griffiths and Higham (2010), along with many other similar books on numerical ODEs, contains a wealth of information, at a much more formal level than this course.
- notes and code: see links above
More numerical ODE schemes:
- trapezoidal rule (implicit) — analyzed order of accuracy (= 2) and stability (Re λ ≤ 0). For large Δt, it produces undesirable oscillations if λ is real and < 0, but for purely imaginary λ (oscillating ODEs) it has the nice property of "conserving energy" (oscillating solutions with no artificial numerical growth or dissipation).
- BDF (backward difference) rule (implicit) — analyzed an order-2 scheme, which has the same stability as trapezoidal rule, but does not introduce oscillations for real λ (while it introduces artificial dissipation for imaginary λ).
- Multi-step methods and Runge–Kutta schemes — began talking about schemes that operate in multiple "stages": first they "estimate" the future u at an intermediate timestep (e.g. k+½) to plug into the right-hand-side f(u,t), and then exploit this to get an improved value for u at timestep k+1. Gave an example of a second-order explicit scheme using this strategy.
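A minimal sketch (my own illustration, possibly not the exact scheme shown in class) of a two-stage, second-order explicit Runge–Kutta method of this type, applied to du/dt = -u:

```julia
# Stage 1 estimates the slope at tₖ; stage 2 re-evaluates f at an estimated midpoint
# (tₖ + Δt/2) and uses that slope for the full step, giving O(Δt²) accuracy.
function rk2(f, u0, Δt, nsteps)
    u, t = u0, 0.0
    for _ in 1:nsteps
        k1 = f(u, t)
        u += Δt * f(u + Δt/2 * k1, t + Δt/2)
        t += Δt
    end
    return u
end

for n in (10, 20, 40)
    err = abs(rk2((u, t) -> -u, 1.0, 1/n, n) - exp(-1))
    println("Δt = $(1/n):  error ≈ $err")   # error shrinks roughly ∝ Δt², unlike Euler's ∝ Δt
end
```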
Further reading: FNC books, sections 6.4 and 6.6. FENA book section 4.6 (trapezoidal) and 4.8–4.9 (Runge–Kutta and multistep methods).
- pset 5 solutions: coming soon
- pset 6: due 11am on Friday 3/21
The big picture of numerical linear algebra: amateurs think mostly about element-by-element algorithms oriented towards hand calculations, whereas practitioners think mostly in terms of factorizations (which express the results of these algorithms as a product of simpler matrices, which allow us to re-use and reason about them):
- Instead of Gaussian elimination (or Gauss–Jordan / matrix inversion), we think about LU factorization $PA = LU$: $U$ is upper triangular (the result of Gaussian elimination), $L$ is lower triangular (a record of the elimination steps), and $P$ is a permutation (a re-ordering of the rows, which happens more frequently on computers than by hand).
- Instead of Gram–Schmidt orthogonalization, we think about QR factorization $A = QR$: $Q$ has orthonormal columns and $R$ is upper triangular. (In practice, computers typically use a Householder algorithm rather than Gram–Schmidt).
- Instead of finding eigenvalues by looking for roots of characteristic polynomials, we think about diagonalization $A = X \Lambda X^{-1}$: the columns of $X$ are eigenvectors, and $\Lambda$ is a diagonal matrix of eigenvalues. The most important case is real-symmetric matrices $A = A^T$ (or complex Hermitian matrices), where $X=Q$ (orthonormal eigenvectors, different from the result of QR!) and $A = Q \Lambda Q^T$. Practical algorithms for diagonalization are very different from what you do by hand; the most important is the QR algorithm discovered around 1960.
- Other important factorizations include the singular value decomposition (SVD) $A = U \Sigma V^T$ (which is hugely important in practice, but unfortunately gets tacked on at the end of many introductory linear-algebra courses), the Schur factorization $A = QTQ^T$ (generalizing diagonalization to an upper triangular $T$ whose diagonal entries are the eigenvalues), Cholesky factorization $A = LL^T$ (equivalent to LU factorization in the special case of symmetric positive-definite $A$, but twice as fast), Hessenberg factorization, and others.
For this lecture, we focused on LU factorization, and in particular a few key points:
- Why is LU factorization equivalent to Gaussian elimination? By a simple 3x3 example, reviewed how Gaussian elimination steps produce an upper-triangular matrix $U$ from $A$, but working backwards yields $A = LU$ where $L$ is lower triangular (with 1s on the diagonal) and is obtained "for free": it is simply a record of the elimination steps. It is easy to see that the cost of elimination (hence LU factorization) scales as $O(m^3)$ for an $m \times m$ matrix.
  - More generally, the factorization is $PA = LU$, where $P$ represents row swaps. Trying the same 3x3 example on a computer, we found that the computer performed row swaps ($P \ne I$) even though it didn't "have" to (no zero pivots were encountered). It turns out that the computer swaps rows to make the pivots as big as possible (for each column, it looks for the entry with maximum magnitude), an algorithm called partial pivoting, and this is essential to reduce sensitivity to roundoff errors in cases where the matrix entries have very different magnitudes.
- How do we use an LU factorization? If you are solving $Ax = b$ and have $A = LU$, then you can equivalently solve $L(Ux) = b$ by (1) let $Ux=c$ and solve $Lc = b$ for $c$ and (2) solve $Ux=c$ for $x$. These two "triangular solves" are easily done by forward and backsubstitution. Moreover, solving $Lc=b$ by forward substitution is exactly equivalent to performing the Gaussian elimination steps (from $A$) on $b$, which is typically done in hand calculations by "augmenting" the matrix $A$ with an extra column $b$. The cost of each of these steps is $O(m^2)$, very similar to a matrix–vector multiplication. (So, solving new right-hand sides is cheap once you have the LU factors.)
Another important point is that having LU factors is as good as — or even better than — having the matrix inverse.
- Computing a matrix inverse is more than twice as costly as getting the LU factors. To find $A^{-1}$, you essentially solve $AX = I$ for $X = A^{-1}$: $m$ right-hand-sides given by the columns of $I$. This involves first finding the LU factorization of $A$, with $O(m^3)$ cost, and then solving $m$ right-hand sides, with an additional $m \times O(m^2) = O(m^3)$ cost for the triangular solves. (This process is equivalent to the Gauss–Jordan algorithm you may have learned by hand.) So, like LU, it is $O(m^3)$, but the "constant factor" is usually at least twice as bad.
- If the matrix is sparse (mostly zero), then the LU factors are often sparse as well, and can be computed and used very efficiently by skipping the zeros. But the matrix inverse is almost never sparse, so you lose that advantage. Gave the example of a tridiagonal matrix, for which the LU factors are actually bidiagonal (without row swaps) and can be computed and used in $O(m)$ time … but the inverse is all nonzero, and has $O(m^2)$ cost.
- A good (but not universal) rule of thumb is: never compute matrix inverses. When you see $x = A^{-1} b$, read it as "solve Ax=b in the best way you can". For example, if $A=LU$, we could formally write $A^{-1} b = U^{-1} L^{-1} b$, but you wouldn't compute the inverses of these triangular matrices — you would compute $c = L^{-1} b$ by forward-substitution on $Lc=b$, then compute $x = U^{-1} c$ by back-substitution on $Ux = c$. If you need to repeatedly apply $A^{-1}$ to many vectors, just LU-factorize $A$ and re-use the LU factors.
LU factorizations in Julia can be computed with F = lu(A), which returns a "factorization" object F that contains the triangular factors (F.L and F.U) and the permutation (F.p); solving Ax=b for a new right-hand side is then just F \ b (which does the permutation and 2 triangular solves). By default, it uses partial pivoting: always permuting rows to maximize the magnitude of the pivot. To compare to hand calculations (where you only do row swaps to avoid zero pivots), you can do F = lu(A, RowNonZero()), but never do this for "serious" work because it is "numerically unstable" and can cause roundoff errors to blow up. If you do x = A \ b in Julia, it will "solve Ax=b in the best way it can", by default using LU factorization.
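For example (a short self-contained snippet, not from the course notebooks), factorizing once and re-using the factors for several right-hand sides:

```julia
using LinearAlgebra

A = [2.0 1.0 1.0;
     4.0 3.0 3.0;
     8.0 7.0 9.0]
F = lu(A)                      # PA = LU with partial pivoting; inspect F.L, F.U, F.p
b1 = [1.0, 2.0, 3.0]
b2 = [0.0, 1.0, 0.0]
x1 = F \ b1                    # each solve re-uses the factors: two O(m²) triangular solves
x2 = F \ b2
norm(A*x1 - b1), norm(A*x2 - b2)   # both ≈ 0 (up to roundoff)
```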
In Python, the analogue of F = lu(A) is scipy.linalg.lu (or scipy.linalg.lu_factor, whose output is designed for re-use in solves), the analogue of F \ b is scipy.linalg.lu_solve (applied to the output of lu_factor), and the analogue of A \ b is numpy.linalg.solve.
Further reading: FNC book, section 2.4 on LU factorization (and section 3.3 on QR, and chapter 7). 18.065 OCW lecture 2, Strang 18.06 lecture 4 on LU factorization. Matrix factorizations are discussed in every linear-algebra textbook at various levels of sophistication and practicality.
A brief survey of other topics in ODE methods:
- Runge–Kutta schemes: the general structure is a "tableau" of coefficients where you make a sequence of estimates $u((k+a)\Delta t)$ for coefficients $a \in (0,1]$, and then make a linear combination of these estimates to estimate $u((k+1)\Delta t)$. Deriving this tableau is somewhat of an art form, and typically involves careful choices of simplifying assumptions. People continue to discover new Runge–Kutta schemes! For example, a state-of-the-art 5th-order scheme was found relatively recently by Tsitouras (2011).
  - Adaptive schemes: similar to adaptive quadrature methods, typically high-order Runge–Kutta methods are "nested": using a subset of the same function values, one gets a lower-order estimate "for free", and comparing it to the high-order estimate gives an error estimate. In this way, they can adaptively adjust Δt until the error estimate obeys a prescribed tolerance.
- Boundary-value problems: instead of providing a purely initial condition $u(0)$, here one specifies a final condition $u(T)$, or more generally a mix of constraints on $u(0)$ and $u(T)$ (where the total number of constraints equals the dimension of $u$). One approach to solving such problems is to reduce it to root-finding: one tries to solve for the $u(0)$ that satisfies the constraints on $u(T)$, sometimes called a "shooting" method.
  - To apply Newton's method, one then needs the Jacobian of $u(T)$ with respect to the initial conditions $u(0)$, which is a special example of the important problem of differentiating ODE solutions (with respect to parameters of the equations and/or initial conditions). Not only is this useful for boundary-value problems, but it is important for sensitivity analysis and quantifying uncertainties, and has become widely used for optimization of ODE solutions (e.g. to fit experimental data or to improve some other objective). See the links below.
- Differential–algebraic equations (DAEs) — a DAE couples an ODE $\frac{du}{dt} = f(u,v,t)$ with a set of (possibly nonlinear) equations $g(u,v,t)=0$ that have to be solved simultaneously with the ODE to find additional unknowns $v(t)$. Sometimes, you can explicitly solve $g=0$ to find $v$, and eliminate these unknowns, giving just an ODE in $u$, but in other cases this is impractical or inconvenient. DAE algorithms simultaneously evolve $u$ from the ODE and $v$ from the "algebraic" equations. In some ways these are analogous to stiff ODE solvers, because the $v$ equations respond "infinitely quickly" to changes in $u$.
- Integro-differential equations and delay differential equations (DDEs) are like ODEs but $du/dt$ depends explicitly not just on $u(t)$ but also on the solution $u(t')$ at times $t' < t$ in the past — either via a continuous integral or via a discrete sum of terms. They have specialized ODE solver methods, which in principle are similar to ordinary ODE schemes but need to additionally keep track of (and interpolate) the past solutions as needed.
- Stochastic differential equations (SDEs) are like ODEs where the right-hand side includes a random "noise" term. (Even defining precisely what this means requires a new form of calculus: stochastic calculus.) Don't just plug random numbers into the right-hand side of an ODE scheme! There are a variety of specialized SDE methods. Unfortunately, these methods are typically low-order: it is difficult to obtain a high-order "strong" SDE scheme that produces the correct distribution of solutions. But if you only care about the expected value ("average") of the solutions, there are higher-order "weak" SDE methods to accomplish this. See e.g. the SDE solver algorithms page in Julia's sophisticated DifferentialEquations.jl package.
Further reading (ODE methods): There are many books that give a more sophisticated treatment of numerical methods for ODEs and related problems, such as the textbook by Butcher or the books by Hairer on non-stiff (1993) and stiff/DAE (1996) algorithms, but the field continues to evolve with new algorithms. Fortunately, there are now many full-featured ODE solver packages, available in a variety of languages and often free/open-source, that implement sophisticated algorithms for all of the above, and more! A 2017 overview of available ODE software packages gives a useful comparison of what features and algorithms they implement — written by Dr. Chris Rackauckas (then at MIT), who developed the state-of-the-art DifferentialEquations.jl suite in Julia.
Further reading (differentiating ODE solutions): Computing derivatives of ODE solutions (and functions thereof) with respect to parameters or initial conditions was introduced in our IAP class 18.063: Matrix Calculus — see the course notes (chapter 9) and lecture video 5 (forward mode) and lecture video 6 (reverse mode). This has become an increasingly important topic because of the growth of optimization and machine learning, and it becomes critically important for large-scale problems to know the pros and cons of "forward" and "reverse" mode algorithms. A recent review article is Sapienza et al. (2024), and a classic reference on adjoint-method (reverse-mode/backpropagation) differentiation of ODEs (and DAEs) is Cao et al (2003) (pdf). See also the SciMLSensitivity.jl package for sensitivity analysis with Chris Rackauckas's amazing DifferentialEquations.jl software suite for numerical solution of ODEs in Julia, along with his notes from 18.337. There is a nice YouTube lecture on adjoint sensitivity of ODEs, again using a similar notation. A discrete version of this process is adjoint methods for recurrence relations (MIT course notes), in which case one obtains a reverse-order "adjoint" recurrence relation.
Back when we were solving least-square problems (minimizing ‖Ac − y‖ over the coefficients c), we mentioned that in practice the normal equations are usually avoided in favor of a QR factorization of A; in this lecture we looked at what the QR factorization is and how it is computed.
The "thin" QR factorization
- The "meaning" of this factorization is that not only do the columns of
$\hat{Q}$ form an orthonormal basis for column space$C(A)$ (the span of the columns of$A$ ), but the first k columns of$\hat{Q}$ are an orthonormal basis for the first k columns of$A$ . This is the meaning of the triangular structure of$\hat{R}$ .
Understanding the above structure of the QR factorization leads immediately to the Gram–Schmidt algorithm, where one forms the columns of $\hat{Q}$ one at a time: take each successive column of $A$, subtract its projections onto the previously computed columns of $\hat{Q}$, and normalize the result (the projection coefficients and normalizations being exactly the entries of $\hat{R}$). Unfortunately, the classical Gram–Schmidt procedure turns out to be numerically problematic: in floating-point arithmetic, roundoff errors can destroy the orthogonality of the computed columns.
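A minimal sketch of classical Gram–Schmidt (my own illustration; as noted, real libraries use Householder reflections, or at least modified Gram–Schmidt, for better numerical stability):

```julia
using LinearAlgebra

function gram_schmidt_qr(A)            # thin QR of an m×n matrix with full column rank
    m, n = size(A)
    Q = zeros(m, n)
    R = zeros(n, n)
    for j in 1:n
        v = A[:, j]
        for i in 1:j-1
            R[i, j] = Q[:, i]' * A[:, j]   # projection coefficient onto the earlier column qᵢ
            v -= R[i, j] * Q[:, i]         # subtract that component
        end
        R[j, j] = norm(v)
        Q[:, j] = v / R[j, j]              # normalize what's left to get the next column of Q̂
    end
    return Q, R
end

A = rand(5, 3)
Q, R = gram_schmidt_qr(A)
norm(Q*R - A), norm(Q'Q - I)               # both ≈ 0 for this small well-conditioned example
```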
Instead, computers typically apply a different approach, inspired by a different form of the QR factorization. The "full" QR factorization $A = QR$ extends $\hat{Q}$ to a square $m \times m$ orthogonal (unitary) matrix $Q$, whose additional columns are an orthonormal basis for the left null space of $A$, and correspondingly pads $\hat{R}$ with rows of zeros.
Gram–Schmidt works by turning the columns of $A$ into orthonormal columns via a sequence of triangular operations (which build up $R$); the Householder approach does the reverse, applying a sequence of orthogonal "reflection" operations to reduce $A$ to the triangular $R$, with the product of the reflections giving $Q$. This also illustrates a more general lesson about representing matrices implicitly:
- Just because we write a matrix $M$ in linear algebra doesn't mean that we need to store it as a matrix, i.e. store a 2d array of its entries explicitly. We often just need a way to compute matrix-vector products $Mx$ quickly for any given $x$, i.e. an algorithm implementing the linear operator $x \mapsto M x$. An explicit matrix is just one possible representation of a linear operator, and not always the best one. Examples:
  - The identity matrix $I$ rarely needs to be stored explicitly, since we can multiply it by vectors with no work at all!
  - Matrix inverses $A^{-1}$ should rarely be computed explicitly, because we can compute $A^{-1} b$ for any $b$ by solving $Ax = b$ by some other method, e.g. using the LU factors.
  - For matrices that are sparse (mostly zero), a 2d array of entries is very wasteful because you are storing lots of zeros. Instead, specialized sparse-matrix data structures store only the nonzero entries, and can use these nonzero entries to multiply by vectors quickly. This is especially important for large-scale linear algebra, where the dimensions can be millions or more — huge matrices are almost never stored explicitly as 2d arrays.
  - QR factorization returns an implicit representation of $Q$ as a collection of Householder reflector operations.
- Given an algorithm to act $M$ on a vector, you can compute the entries of $M$ explicitly simply by computing $M I = M$: multiplying $M$ by the columns of $I$. But often this is a waste of effort, as in the examples above.
Further reading: FNC book, sections 3.3 and 3.4, and Householder QR notes from 18.335 at MIT. Any textbook on numerical linear algebra will cover QR factorization, e.g. the book by Trefethen and Bau, section II.
Numerical methods for eigenvalue problems.