We have to do it. Despite the popularity of software libraries like PyTorch, TensorFlow, NumPy/SciPy or scikit-learn, sometimes we cannot get around writing our own implementations of basic machine learning (ML) algorithms. To ensure that these implementations are working as expected, we write unittests.

It is difficult. Coming up with meaningful unittests for algorithms involving machine learning or computational statistics is tedious in the best case and extremely difficult in the worst. To make life easier, on this website I summarise recipes for meaningful unittests for a range of ML algorithms.

Some vital things are ignored. Although they are central to meaningful unittesting, I do not mention implementation-specific tests. For example: ''I want my parameter to be a float, hence for any other input the algorithm must raise an exception''. Of course, this does not mean that they should not be considered for unittests.

Help the cause. If you spot typos, errors, problems or have any suggestions, please do not hesitate to get in touch.

Contents

  1. Automatic differentiation
  2. (General) local minimisation
  3. Random search minimisation
  4. Gradient descent minimisation
  5. Newton minimisation
  6. Stochastic minimisation
  7. (General) Metropolis-Hastings
  8. Metropolis-Hastings with Langevin proposals
  9. Hamiltonian Monte Carlo
  10. Gibbs sampling
  11. Principal component analysis
  12. Numerical integration
  13. Ordinary differential equations
  14. Iteratively solving linear systems
  15. Covariance functions
  16. Data sets
  17. Supervised learning
  18. Neural networks
  19. Singular value decomposition
  20. Regression
  21. Probability distributions
  22. Kalman filter
  23. Gaussian process regression
  24. Preconditioners
  25. Acknowledgements

Automatic differentiation

When I say ''derivative'' I mean derivatives, gradients or Jacobians.
  • Assert that the automatically computed derivative is close to a finite difference approximation of the derivative; see the sketch after this list. For direction-based derivatives like gradients, make sure that this holds for several directions, e.g. each coordinate direction.
  • Test whether the Hessian of a convex function is positive semi-definite. To this end, pick a convex function, e.g. \(f(x)=3x^8 + x^4\), sample evaluation points \(x_1, ..., x_N\) and check that the smallest eigenvalue of the Hessian at each \(x_i, ~i=1, ..., N\), is non-negative. Remember to keep \(N\) as small as possible for speed reasons.
  • Potential corner cases:
    • What happens if the function that is to be differentiated does not allow a derivative?
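
A minimal sketch of the finite-difference check, assuming a toy objective; here grad_f is a hand-coded stand-in for the automatically computed derivative, which in your test would come from the autodiff implementation under test:

```python
import numpy as np

def finite_difference_gradient(f, x, h=1e-6):
    """Central finite-difference approximation of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

def test_gradient_matches_finite_differences():
    f = lambda x: np.sum(x**4 + 3.0 * x**2)
    grad_f = lambda x: 4.0 * x**3 + 6.0 * x   # stand-in for the autodiff gradient under test
    rng = np.random.default_rng(seed=1)
    for _ in range(5):
        x = rng.standard_normal(3)
        assert np.allclose(grad_f(x), finite_difference_gradient(f, x), atol=1e-5)
```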

(General) local minimisation

This section also applies to all minimisation algorithms below. Aspects unique to stochastic versions can be found in the Stochastic minimisation section below.

I consider steepest descent minimisation of a loss function \(\ell(x)\) with iterations of the form \(x_{i+1} = x_i + \alpha \delta_i\), where \(\delta_i\) is a descent direction and \(\alpha > 0\) is the learning rate.

  • Test whether for a sufficiently small learning rate \(\alpha\) the objective function is smaller after a single minimisation step; see the sketch after this list.
  • Assert that only the variable over which you want to optimise changes with each iteration.
  • Test that the optimiser finds the minimum of a few test functions. See e.g. here for an overview. Strictly speaking this may not be a unittest.
  • Potential corner cases:
    • What does the algorithm do if the learning rate \(\alpha\) is negative, zero or numerically zero, i.e. smaller than \( 10^{-16}\)?
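
A minimal sketch of the single-step check, assuming a smooth convex toy loss; gradient_descent_step is a stand-in for the optimiser step under test:

```python
import numpy as np

def gradient_descent_step(x, grad, alpha):
    """One steepest-descent step; stand-in for the optimiser under test."""
    return x - alpha * grad(x)

def test_small_step_decreases_objective():
    loss = lambda x: np.sum(x**2 + x**4)      # smooth convex test function
    grad = lambda x: 2.0 * x + 4.0 * x**3
    x0 = np.array([1.0, -2.0, 0.5])
    x1 = gradient_descent_step(x0, grad, alpha=1e-4)
    assert loss(x1) < loss(x0)
```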

Random search minimisation

  • Compute a few samples. Test that each loss is smaller than or equal to the previous one; see the sketch below.
  • Choose a convex problem and compute a few samples. Assert that for iterations where the loss is strictly smaller than the previous loss, the iteration is strictly closer to the true minimum.
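
A minimal sketch, with random_search as a stand-in for the implementation under test:

```python
import numpy as np

def random_search(loss, x0, n_iters=50, step=0.5, seed=0):
    """Minimal random search: a proposal is kept only if it does not increase the loss.
    Stand-in for the implementation under test."""
    rng = np.random.default_rng(seed)
    x, iterates = x0, [x0]
    for _ in range(n_iters):
        proposal = x + step * rng.standard_normal(x.shape)
        if loss(proposal) <= loss(x):
            x = proposal
        iterates.append(x)
    return iterates

def test_losses_are_non_increasing():
    loss = lambda x: np.sum(x**2)             # convex test problem
    iterates = random_search(loss, x0=np.array([3.0, -2.0]))
    losses = np.array([loss(x) for x in iterates])
    assert np.all(np.diff(losses) <= 0.0)
```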

Gradient descent minimisation

  • Assert that for convex problems, the negative gradient never points away from the minimum, i.e. it has a non-negative inner product with the direction towards the minimum; see the sketch after this list.
  • Potential corner cases:
    • What does the algorithm do if there is no gradient?
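
A minimal sketch of the convexity check, assuming a quadratic toy problem with known minimiser; grad is a stand-in for the gradient used by the optimiser under test:

```python
import numpy as np

def test_negative_gradient_never_points_away_from_minimum():
    A = np.array([[3.0, 1.0], [1.0, 2.0]])    # symmetric positive definite
    x_star = np.array([1.0, -1.0])            # known minimiser of the quadratic
    grad = lambda x: A @ (x - x_star)         # stand-in for the gradient under test
    rng = np.random.default_rng(seed=2)
    for _ in range(10):
        x = rng.standard_normal(2)
        # Convexity implies <-grad(x), x_star - x> >= 0.
        assert np.dot(-grad(x), x_star - x) >= 0.0
```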

Newton minimisation

  • Test whether for quadratic objective functions and learning rate \(\alpha=1\), Newton minimisation converges in a single step; see the sketch after this list.
  • Potential corner cases:
    • What does the algorithm do if there is no Hessian?
    • How does the algorithm react to singular Hessians?
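
A minimal sketch, assuming a quadratic toy objective; newton_step is a stand-in for the Newton update under test:

```python
import numpy as np

def newton_step(x, grad, hess):
    """One Newton step with learning rate 1; stand-in for the optimiser under test."""
    return x - np.linalg.solve(hess(x), grad(x))

def test_newton_converges_in_one_step_on_quadratics():
    A = np.array([[4.0, 1.0], [1.0, 3.0]])    # symmetric positive definite
    b = np.array([1.0, -2.0])
    grad = lambda x: A @ x - b                # gradient of 0.5 x^T A x - b^T x
    hess = lambda x: A
    x_star = np.linalg.solve(A, b)            # exact minimiser
    x1 = newton_step(np.array([10.0, -7.0]), grad, hess)
    assert np.allclose(x1, x_star)
```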

Stochastic minimisation

Here I summarise things unique to the stochastic versions of the minimisation methods mentioned above.
  • Test that with maximal batch size the stochastic method coincides with its deterministic counterpart.
  • Potential corner cases:
    • What does the algorithm do for batch-size zero?

(General) Metropolis-Hastings

In addition to the points mentioned below, one can use many different statistical hypothesis tests to assert whether the resulting samples have the desired invariant distribution. Bear in mind that for a meaningful qq-plot one may need many samples, which can make the automated testing slow.
  • Test that proposals in regions with higher probability are always accepted; see the sketch after this list.
  • Test that for samples from a distribution with known CDF, the qq-plot is a straight line. Try this with multiple distributions and aim to include distributions with different supports.
  • Potential corner cases:
    • What does the sampler do if some regions have zero probability? This can be investigated by sampling from e.g. the exponential or inverse gamma distribution.
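
A minimal sketch of the uphill-acceptance check; acceptance_probability is a stand-in for the Metropolis-Hastings acceptance step under test, assuming symmetric proposals:

```python
import numpy as np

def acceptance_probability(log_target, current, proposal):
    """Metropolis acceptance probability for a symmetric proposal;
    stand-in for the routine under test."""
    return min(1.0, np.exp(log_target(proposal) - log_target(current)))

def test_proposals_with_higher_probability_are_always_accepted():
    log_target = lambda x: -0.5 * x**2        # standard Gaussian, up to an additive constant
    rng = np.random.default_rng(seed=4)
    for _ in range(100):
        current, proposal = rng.normal(scale=3.0, size=2)
        if log_target(proposal) >= log_target(current):
            assert acceptance_probability(log_target, current, proposal) == 1.0
```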

Metropolis-Hastings with Langevin proposals

This is usually referred to as Metropolis-adjusted Langevin algorithm (MALA).
  • Assert that MALA is equivalent to Hamiltonian Monte Carlo (HMC) if HMC uses only one leapfrog step. Be careful: depending on your implementation the proposal widths may have to be adapted via either \(\sqrt{2 \rho} \leftrightarrow \rho \leftrightarrow \rho^2/2 \).

Hamiltonian Monte Carlo

The presentation is restricted to the case where the Hamiltonian dynamics are simulated with the leapfrog integrator. This can be phrased as an instance of a Metropolis-Hastings algorithm and the MH unittest-suggestions mentioned above apply.
  • Assert that HMC with one leapfrog step is equivalent to MALA; see above. Be careful: depending on your implementation the proposal widths may have to be adapted via either \(\sqrt{2 \rho} \leftrightarrow \rho \leftrightarrow \rho^2/2 \).

Gibbs sampling

  • Test that the conditional distribution is consistent with the joint distribution. To this end, generate some samples and test Equation 1 on page 4 of this paper.

Principal component analysis

  • Compute the reconstruction error when using the maximal number of principal components. Assert that it is zero (up to numerical precision); see the sketch after this list.
  • Assert that only the largest principal components are chosen. This is important as some library functions sort singular values and others do not.
  • Assert that each returned eigenvalue/eigenvector pair satisfies the eigenvalue equation of the covariance matrix.
  • Assert that the PCA algorithm at least raises a warning if the data is not normalised to some extent.
  • Potential corner cases:
    • What happens if the covariance matrix is singular?
    • What does the algorithm do if the number of principal components is zero?
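
A minimal sketch of the full-rank reconstruction test; pca_reconstruct is an SVD-based stand-in for the PCA implementation under test:

```python
import numpy as np

def pca_reconstruct(X, n_components):
    """Reconstruction from a PCA via the SVD of the centred data;
    stand-in for the implementation under test."""
    mean = X.mean(axis=0)
    X_centred = X - mean
    _, _, Vt = np.linalg.svd(X_centred, full_matrices=False)
    components = Vt[:n_components]
    return mean + (X_centred @ components.T) @ components

def test_reconstruction_with_all_components_is_exact():
    rng = np.random.default_rng(seed=5)
    X = rng.standard_normal((50, 4))
    X_hat = pca_reconstruct(X, n_components=4)   # maximal number of components
    assert np.allclose(X, X_hat)
```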

Numerical integration

  • Numerical integration rules based on polynomial interpolation have certain degrees of exactness, say \(N\). Test that this degree is satisfied: the error for a polynomial of degree \(\leq N\) should be roughly machine precision and the error for a polynomial of degree \(\geq N + 1\) should be larger; see the sketch after this list.
  • Assert that integrals of probability density functions are close to 1. Try distributions where the PDFs have different supports. For the Gaussian distribution, keep the 68-95-99.7 rule in mind when restricting the domain of integration to a compact interval.
  • Potential corner cases:
    • What is the result if the interval of integration consists of only a single point?
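
A minimal sketch of the degree-of-exactness test, using Simpson's rule (degree of exactness 3) as a stand-in for the quadrature rule under test:

```python
def simpson(f, a, b):
    """Simpson's rule on [a, b]; degree of exactness 3. Stand-in for the rule under test."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))

def test_degree_of_exactness():
    a, b = 0.0, 1.0
    # Exact (up to rounding) for all polynomials of degree <= 3 ...
    for degree in range(4):
        approx = simpson(lambda x: x**degree, a, b)
        exact = 1.0 / (degree + 1)
        assert abs(approx - exact) < 1e-14
    # ... but not for degree 4.
    assert abs(simpson(lambda x: x**4, a, b) - 0.2) > 1e-10
```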

Ordinary differential equations

  • Compare the approximate solution of an ODE with known solution to the true solution and test that the approximation error is small. One option is the logistic differential equation. Assert that the error gets smaller as the step size gets smaller; see the sketch after this list.
  • Assert that the approximate solution to a periodic system is periodic. For instance, test on the Lotka-Volterra equations or the FitzHugh-Nagumo model. For speed reasons, configure the parameters such that the period is short.
  • Compute the first few iterations of an ODE solver by hand and compare the result of the algorithm to your own calculations. Test that both coincide.
  • Potential corner cases:
    • How does the algorithm react to the time interval consisting of a single point?
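
A minimal sketch on the logistic differential equation, with an explicit Euler method as a stand-in for the ODE solver under test:

```python
import numpy as np

def euler(f, y0, t0, t1, n_steps):
    """Explicit Euler method; stand-in for the ODE solver under test."""
    h, y, t = (t1 - t0) / n_steps, y0, t0
    for _ in range(n_steps):
        y, t = y + h * f(t, y), t + h
    return y

def test_error_decreases_with_step_size():
    # Logistic ODE y' = y (1 - y), y(0) = 0.1, with known solution.
    f = lambda t, y: y * (1.0 - y)
    true_solution = lambda t: 1.0 / (1.0 + 9.0 * np.exp(-t))
    errors = [abs(euler(f, 0.1, 0.0, 2.0, n) - true_solution(2.0)) for n in (10, 20, 40)]
    assert errors[0] > errors[1] > errors[2]
```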

Iteratively solving linear systems

  • Test that the iterative solver achieves the desired relative or absolute solution accuracy used as an input to the algorithm.
  • Assert that dimension mismatches between matrix and right hand side are detected.
  • Test that initialising with the true solution converges in zero steps (or, depending on the convention, in a single step).
  • Test that for a system whose matrix has condition number 1, e.g. the identity matrix, the solution is returned in a single step; see the sketch after this list.
  • Potential corner cases:
    • What does the solver do if matrix, right hand side or initial guess are empty arrays?
    • What happens if the maximum number of iterations is set to zero?
    • What happens if the desired accuracy is set to zero?
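
A minimal sketch, with a plain conjugate gradient method standing in for the iterative solver under test:

```python
import numpy as np

def conjugate_gradient(A, b, x0, max_iters=100, tol=1e-12):
    """Plain conjugate gradient method; stand-in for the solver under test.
    Returns the approximate solution and the number of iterations used."""
    x, r = x0.copy(), b - A @ x0
    p, rs_old = r.copy(), r @ r
    for i in range(max_iters):
        if np.sqrt(rs_old) < tol:
            return x, i
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x, r = x + alpha * p, r - alpha * Ap
        rs_new = r @ r
        p, rs_old = r + (rs_new / rs_old) * p, rs_new
    return x, max_iters

def test_identity_system_is_solved_in_one_step():
    b = np.array([1.0, 2.0, 3.0])
    x, n_iters = conjugate_gradient(np.eye(3), b, x0=np.zeros(3))
    assert np.allclose(x, b) and n_iters == 1

def test_true_solution_as_initial_guess_needs_zero_steps():
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    x_true = np.array([1.0, -2.0])
    x, n_iters = conjugate_gradient(A, A @ x_true, x0=x_true)
    assert np.allclose(x, x_true) and n_iters == 0
```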

Covariance functions

I restrict this presentation to covariance functions that are reproducing kernels of a reproducing kernel Hilbert space.
  • Test that covariance matrices are symmetric; see the sketch after this list.
  • Assert that covariance matrices are positive definite. To this end, compute the minimal eigenvalue and make sure it is positive.
  • Compare evaluations of your covariance function against hand-computed values.
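
A minimal sketch, with a squared-exponential kernel as a stand-in for the covariance function under test:

```python
import numpy as np

def rbf_covariance_matrix(X, lengthscale=0.5):
    """Squared-exponential covariance matrix; stand-in for the covariance function under test."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def test_covariance_matrix_is_symmetric_and_positive_definite():
    rng = np.random.default_rng(seed=6)
    X = rng.standard_normal((10, 2))          # distinct evaluation points
    K = rbf_covariance_matrix(X)
    assert np.allclose(K, K.T)
    assert np.min(np.linalg.eigvalsh(K)) > 0.0
```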

Data sets

This section is concerned with ``debugging'' data sets, i.e. with finding anomalies in data sets which might lead to unexpected behaviour of ML algorithms.
  • Look up which data types to expect in the data set and look for anomalies.
  • Find and deal with missing values and NaNs.
  • Look for surprisingly large variances among the data within each category. Sometimes, data sets contain values like -9999 instead of a NaN. If you find a "suspicious" category, investigate further by computing summary statistics and plotting a histogram.
  • Sometimes it is important to identify whether the data is normalised. If it is supposed to be shifted to [-1, 1], test whether the minimum and maximum are within that range. If it is normalised via \(X \mapsto (X-\text{mean}(X))/\text{std}(X)\), test whether the mean is zero and the standard deviation is 1; see the sketch below.
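
A minimal sketch of such data-set checks; the synthetic array at the bottom stands in for a real data set, and the placeholder values are assumptions to be adapted to your data:

```python
import numpy as np

def check_no_missing_or_placeholder_values(data, placeholders=(-9999, -999)):
    """Basic data-set sanity checks; `data` is assumed to be a numeric numpy array."""
    assert not np.any(np.isnan(data)), "data contains NaNs"
    assert not np.any(np.isin(data, placeholders)), "data contains placeholder values"

def check_standardisation(data, atol=1e-8):
    """Check the column-wise normalisation X -> (X - mean(X)) / std(X)."""
    assert np.allclose(data.mean(axis=0), 0.0, atol=atol)
    assert np.allclose(data.std(axis=0), 1.0, atol=atol)

# Example usage on synthetic data standing in for a real data set.
data = np.random.default_rng(seed=7).standard_normal((100, 3))
data = (data - data.mean(axis=0)) / data.std(axis=0)
check_no_missing_or_placeholder_values(data)
check_standardisation(data)
```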

Supervised learning

  • Assert that the test set and the training set have similar distributions. To this end, compute and compare summary statistics like mean, (co)variance, median or others; see the sketch after this list.
  • Test that the loss function value decreases after a few iterations. This is taken from here.
  • Potential corner cases:
    • What happens if there is only a test set or only a training set?
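
A minimal sketch on a synthetic data set standing in for a real train/test split; the tolerances are assumptions and should be adapted to the data at hand:

```python
import numpy as np

def test_train_and_test_set_have_similar_statistics():
    rng = np.random.default_rng(seed=8)
    data = rng.standard_normal((1000, 3))     # placeholder for your data set
    train, test = data[:800], data[800:]
    # Crude check: feature-wise means and standard deviations should be close.
    assert np.allclose(train.mean(axis=0), test.mean(axis=0), atol=0.3)
    assert np.allclose(train.std(axis=0), test.std(axis=0), atol=0.3)
```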

Neural networks

This section would likely benefit the most from reader input.
  • Assert that on a small selection of data the model overfits after only a few epochs. This is taken from here.

Singular value decomposition

The notation assumes a singular value decomposition (SVD), \(M = U S V^\top\).
  • Assert that if \(M\) is square, symmetric and positive definite, the singular values are eigenvalues and the columns of \(U\) and \(V\) are eigenvectors.
  • Test that the identities \[M^\top M = V (S^\top S) V^\top, \quad M M^\top = U (S S^\top) U^\top,\] hold true; see the sketch after this list.
  • Test if the matrices \(U\) and \(V\) are orthogonal. That is, assert that they satisfy \[U^\top U = U U^\top = I, \quad V^\top V = V V^\top = I,\] where \(I\) is an identity matrix of respective size.
  • Test that if the matrix is a single column, \(M \in \mathbb{R}^{m \times 1} \), \(V\) is a scalar equal to \(\pm 1\). Test the same for \(M^\top\) and \(U\).
  • Potential corner cases:
    • What does the SVD do if the matrix \(M\) is an empty array?
    • What does the SVD do if the matrix \(M\) is a scalar?
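
A minimal sketch of these identity checks, using numpy's SVD as a stand-in for the implementation under test:

```python
import numpy as np

def test_svd_identities():
    rng = np.random.default_rng(seed=9)
    M = rng.standard_normal((5, 3))
    U, s, Vt = np.linalg.svd(M, full_matrices=True)   # stand-in for the SVD under test
    S = np.zeros((5, 3))
    S[:3, :3] = np.diag(s)
    # Reconstruction and the two Gram-matrix identities.
    assert np.allclose(M, U @ S @ Vt)
    assert np.allclose(M.T @ M, Vt.T @ (S.T @ S) @ Vt)
    assert np.allclose(M @ M.T, U @ (S @ S.T) @ U.T)
    # U and V are orthogonal.
    assert np.allclose(U.T @ U, np.eye(5)) and np.allclose(Vt @ Vt.T, np.eye(3))
```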

Regression

  • Linear regression: Test that if the data are noise-free samples from a linear function, the function is recovered exactly, i.e. with zero error; see the sketch below.
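
A minimal sketch, with an ordinary least-squares fit standing in for the regression implementation under test:

```python
import numpy as np

def test_noise_free_linear_data_is_recovered_exactly():
    rng = np.random.default_rng(seed=10)
    X = rng.standard_normal((30, 2))
    true_coeffs, true_intercept = np.array([2.0, -1.5]), 0.7
    y = X @ true_coeffs + true_intercept                 # noise-free linear data
    # Least-squares fit with an intercept column; stand-in for the regressor under test.
    X_aug = np.hstack([X, np.ones((30, 1))])
    coeffs, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    assert np.allclose(coeffs[:2], true_coeffs) and np.isclose(coeffs[2], true_intercept)
    assert np.allclose(X_aug @ coeffs, y)
```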

Probability distributions

What I mean by unittesting probability distributions is verifying that sampling methods and evaluations of probability density functions are correctly implemented.
  • Draw a sufficient number of samples and test that the empirical moments match the known moments, if available. For speed reasons, consider using as few samples as necessary; see the sketch after this list.
  • Use the same qq-plot tests I mentioned above under Metropolis-Hastings.
  • Potential corner cases:
    • What happens if the distribution does not support sampling? This can make it hard to unittest. Possible solutions are comparing integrals of the PDF to external implementations of the CDF or using Metropolis-Hastings methods to generate samples. When comparing to e.g. scipy.stats' cumulative density functions, be aware that scipy.stats often approximates these via quadrature, which makes them inexact.
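
A minimal sketch of the moment check, with numpy's exponential sampler standing in for the sampler under test; the tolerances are assumptions tied to the sample size:

```python
import numpy as np

def test_empirical_moments_match_known_moments():
    rng = np.random.default_rng(seed=11)
    # Exponential distribution with rate 2: mean 0.5, variance 0.25.
    samples = rng.exponential(scale=0.5, size=10_000)    # stand-in for the sampler under test
    assert np.isclose(samples.mean(), 0.5, atol=0.02)
    assert np.isclose(samples.var(), 0.25, atol=0.03)
```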

Kalman filter

  • Test that if the data stream consists of noise-free samples from the same state space model that the Kalman filter uses, the filtering result is exact. In other words, the correction applied to the predicted state must be zero.
  • Potential corner cases:
    • What does the filter do if the data consists of an empty array?

Gaussian process regression

  • For interpolation, which uses exact data: Test that after conditioning the Gaussian process, the variance at the data points is exactly zero; see the sketch after this list.
  • For regression, which uses noisy data: Test that after conditioning the Gaussian process, the variance at the data points is proportional to the noise level of the data.
  • To test whether the covariance matrix is accidentally computed as \(K + sI\) instead of \(K + s^2I\): choose a small \(s\) for which \(s^2\) is still above machine precision, e.g. \(s=10^{-7}\), and test whether, after conditioning the process, the average deviation from the data (see the bullet above) is proportional to \(s=10^{-7}\) or to \(s^2=10^{-14}\).
  • An alternative way to test the same case is to choose a large \(s\), e.g. \(s=100\), and investigate whether the smallest eigenvalue of the computed matrix is closer to \(s=100\) (indicating \(K + sI\)) or to \(s^2=10000\) (indicating \(K + s^2I\)).
  • If error estimates for the covariance function are available---which is e.g. the case for Matérn class covariances or Wendland kernels---test that the error estimate is roughly fulfilled. That is, for an error rate of e.g. \(h^2\), where \(h\) is the fill distance of the point set, test whether for a point set 4 times as dense (i.e. with a quarter of the fill distance) the error is roughly a 16th of the previous error. It is unlikely to obtain an exact 16th, so if you find something between a 10th and a 22nd, the test should pass.
  • Potential corner cases:
    • What happens if the covariance matrix \(K\) is singular?
    • What happens if the data consists of an empty array?
    • What does the conditioning do if the measurements are not taken at distinct timepoints/spacepoints?
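
A minimal sketch of the zero-variance check, with a small squared-exponential GP posterior standing in for the implementation under test:

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=0.3):
    """Squared-exponential kernel matrix between point sets A and B."""
    sq_dists = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior_variance(X_train, X_query):
    """Posterior variance of a zero-mean GP conditioned on noise-free data;
    stand-in for the implementation under test."""
    K = rbf_kernel(X_train, X_train)
    K_star = rbf_kernel(X_query, X_train)
    K_ss = rbf_kernel(X_query, X_query)
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)
    return np.diag(cov)

def test_interpolation_has_zero_variance_at_data_points():
    X_train = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
    var = gp_posterior_variance(X_train, X_train)
    assert np.allclose(var, 0.0, atol=1e-8)
```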

Preconditioners

In this section I treat all kinds of preconditioning: numerical linear algebra, optimisation, Markov chain Monte Carlo, and more.
  • Test that preconditioning with the identity matrix yields the same results as the unpreconditioned version.
  • Assert that preconditioning with the "true solution"---if available---yields the solution straightaway. For example, in the linear algebra setting, use the exact inverse as the preconditioner; see the sketch below.
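
A minimal sketch of both checks, with a preconditioned Richardson iteration standing in for the preconditioned solver under test:

```python
import numpy as np

def preconditioned_richardson(A, b, M_inv, x0, n_iters):
    """Preconditioned Richardson iteration x <- x + M_inv (b - A x);
    stand-in for the preconditioned solver under test."""
    x = x0.copy()
    for _ in range(n_iters):
        x = x + M_inv @ (b - A @ x)
    return x

def test_identity_preconditioner_matches_unpreconditioned_iteration():
    A = np.array([[2.0, 0.5], [0.5, 1.5]])
    b = np.array([1.0, -1.0])
    x_pre = preconditioned_richardson(A, b, np.eye(2), x0=np.zeros(2), n_iters=5)
    # Plain (unpreconditioned) Richardson iteration for comparison.
    x = np.zeros(2)
    for _ in range(5):
        x = x + (b - A @ x)
    assert np.allclose(x_pre, x)

def test_preconditioning_with_exact_inverse_solves_in_one_step():
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x = preconditioned_richardson(A, b, np.linalg.inv(A), x0=np.zeros(2), n_iters=1)
    assert np.allclose(x, np.linalg.solve(A, b))
```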

Acknowledgements

The design of this website is inspired by web design in 4 minutes.