Optimization (cont.)

Lecture 22

Dr. Colin Rundel

Method Variants

Method - CG in scipy

SciPy’s optimize module implements the conjugate gradient algorithm using a variant proposed by Polak and Ribiere; this version does not use the Hessian when calculating the next step. The specific changes are:

  • \(\alpha_k\) is calculated via a line search along the direction \(p_k\)

  • and the \(\beta_{k+1}\) calculation is replaced as follows

\[ \beta_{k+1} = \frac{ r^T_{k+1} \, \nabla^2 f(x_k) \, p_{k} }{p_k^T \, \nabla^2 f(x_k) \, p_k} \qquad \Rightarrow \qquad \beta_{k+1}^{PR} = \frac{\nabla f(x_{k+1})^T \left(\nabla f(x_{k+1}) - \nabla f(x_{k})\right)}{\nabla f(x_k)^T \, \nabla f(x_k)} \]
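
For intuition, a minimal sketch of a Polak-Ribiere CG iteration (an illustrative implementation, not SciPy's actual code) might look like the following, using scipy.optimize.line_search for the \(\alpha_k\) step:

import numpy as np
from scipy import optimize

def cg_pr(x, f, grad, max_iter=100, tol=1e-8):
  r = -grad(x)                                      # residual = negative gradient
  p = r                                             # initial search direction
  for _ in range(max_iter):
    alpha = optimize.line_search(f, grad, x, p)[0]  # alpha_k via a Wolfe line search
    if alpha is None:
      break
    x = x + alpha * p
    r_new = -grad(x)
    if np.linalg.norm(r_new) < tol:
      break
    beta = r_new @ (r_new - r) / (r @ r)            # Polak-Ribiere beta_{k+1}
    p = r_new + beta * p
    r = r_new
  return x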

Method - Newton-CG & BFGS

These are both variants of Newton’s method but do not require the Hessian (though it can optionally be provided to Newton-CG).

  • Newton-Conjugate Gradient (Newton-CG) algorithm uses a conjugate gradient algorithm to (approximately) invert the local Hessian

  • The Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm iteratively approximates the inverse Hessian (a sketch of this update is given after this list)

    • Gradient is also not required and can be approximated using finite differences
  • Newton-CG also comes in a trust-region variant ("trust-ncg")

    • Uses a trust-region approach to determine the step size rather than backtracking line search
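
As a rough sketch of the BFGS update mentioned above (the standard textbook form, not SciPy's actual code): given the step \(s_k = x_{k+1} - x_k\) and gradient change \(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\), the inverse-Hessian approximation \(H_k\) can be updated as,

import numpy as np

def bfgs_update(H, s, y):
  # Standard BFGS update of the inverse-Hessian approximation H
  rho = 1.0 / (y @ s)
  I = np.eye(len(s))
  return ((I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s))
          + rho * np.outer(s, s))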

Method - L-BFGS-B

Limited-memory BFGS (L-BFGS-B) is a variant of BFGS designed for problems with large numbers of parameters.

  • Rather than storing and updating a full \(n \times n\) approximation to the inverse Hessian, L-BFGS stores only the last \(m\) updates (typically \(m \approx 5\)–\(20\)) and uses these for the approximation

  • This reduces memory usage from \(O(n^2)\) to \(O(mn)\), making it practical for high-dimensional problems

  • The “B” in L-BFGS-B refers to support for bound constraints on the parameters (i.e. \(l_i \leq x_i \leq u_i\)), as in the example below
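
For example (a hedged illustration, not from the slides), bounds can be passed to optimize.minimize() as a list of (min, max) pairs, here using SciPy's built-in Rosenbrock function and gradient:

import numpy as np
from scipy import optimize

res = optimize.minimize(
  optimize.rosen, x0=np.array([1.6, 1.1]),
  jac=optimize.rosen_der,
  method="L-BFGS-B",
  bounds=[(0, 2), (0, 2)]   # l_i <= x_i <= u_i for each parameter
)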

Method - Nelder-Mead

This is a gradient-free method that uses a sequence of simplexes to iteratively bracket the minimum.

Method Summary


| SciPy Method | Description | Gradient | Hessian |
|---|---|---|---|
| | Gradient Descent (naive w/ backtracking) | | |
| | Newton’s method (naive w/ backtracking) | | |
| | Conjugate Gradient (naive) | | |
| "CG" | Nonlinear Conjugate Gradient (Polak and Ribiere variation) | | |
| "Newton-CG" | Truncated Newton method (Newton w/ CG step direction) | | Optional |
| "trust-ncg" | Trust-region Newton-CG method | | Optional |
| "BFGS" | Broyden, Fletcher, Goldfarb, and Shanno (Quasi-Newton method) | Optional | |
| "L-BFGS-B" | Limited-memory BFGS (Quasi-Newton method) | Optional | |
| "Nelder-Mead" | Nelder-Mead simplex reflection method | | |

R’s optim()

R’s optim() function (from the stats package) provides a similar set of general-purpose optimization methods:

| R Method | SciPy Equivalent | Description | Gradient | Bounds |
|---|---|---|---|---|
| "Nelder-Mead" | "Nelder-Mead" | Nelder-Mead simplex (default) | | |
| "BFGS" | "BFGS" | Quasi-Newton (Broyden-Fletcher-Goldfarb-Shanno) | Optional | |
| "CG" | "CG" | Conjugate gradient (Fletcher-Reeves, Polak-Ribiere, or Beale-Sorenson) | Optional | |
| "L-BFGS-B" | "L-BFGS-B" | Limited-memory BFGS with box constraints | Optional | Yes |
| "SANN" | (none) | Simulated annealing (stochastic global optimization) | | |

R’s optim() (cont.)

Key differences from SciPy’s optimize.minimize():

  • R’s optim() does not support providing a Hessian to any method (but can return a numerically approximated Hessian via hessian=TRUE)

  • R has no equivalent to SciPy’s "Newton-CG" method

  • R includes "SANN" (simulated annealing) - a stochastic global optimizer not available in SciPy’s minimize()

  • The optimx package provides access to additional methods beyond those in optim(); many more optimization packages are available on CRAN

Methods collection

def define_methods(x, f, grad, hess, tol=1e-8):
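  # Assumes newtons_method() and conjugate_gradient() (the naive implementations
  # from earlier) and scipy's optimize module are already available in scope.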
  return {
    "naive_newton":   lambda: newtons_method(x, f, grad, hess, tol=tol),
    "naive_cg":       lambda: conjugate_gradient(x, f, grad, hess, tol=tol),
    "CG":             lambda: optimize.minimize(f, x, jac=grad, method="CG", tol=tol),
    "newton-cg":      lambda: optimize.minimize(f, x, jac=grad, hess=None, method="Newton-CG", tol=tol),
    "newton-cg w/ H": lambda: optimize.minimize(f, x, jac=grad, hess=hess, method="Newton-CG", tol=tol),
    "trust-ncg":      lambda: optimize.minimize(f, x, jac=grad, hess=hess, method="trust-ncg", tol=tol),
    "bfgs":           lambda: optimize.minimize(f, x, jac=grad, method="BFGS", tol=tol),
    "bfgs w/o G":     lambda: optimize.minimize(f, x, method="BFGS", tol=tol),
    "l-bfgs-b":       lambda: optimize.minimize(f, x, method="L-BFGS-B", tol=tol),
    "nelder-mead":    lambda: optimize.minimize(f, x, method="Nelder-Mead", tol=tol)
  }

Method Timings

x = (1.6, 1.1)
f, grad, hess = mk_quad(0.7)
methods = define_methods(x, f, grad, hess)
df = pd.DataFrame({
  key: timeit.Timer(methods[key]).repeat(10, 100) for key in methods
})
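
One quick way to compare the resulting timings (an added example, not from the slides) is to look at the best observed time per call for each method:

# Each entry in df is the time for 100 calls (repeated 10 times per method),
# so the fastest run divided by 100 gives a per-call time (shown here in ms).
(df.min() / 100 * 1000).sort_values()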

Timings across cost functions

def time_cost_func(n, x, name, cost_func, *args):
  f, grad, hess = cost_func(*args)
  methods = define_methods(x, f, grad, hess)
  
  return ( pd.DataFrame({
      key: timeit.Timer(
        methods[key]
      ).repeat(n, n) 
      for key in methods
    })
    .melt()
    .assign(cost_func = name)
  )

x = (1.6, 1.1)  

time_cost_df = pd.concat([
  time_cost_func(10, x, "Well-cond quad", mk_quad, 0.7),
  time_cost_func(10, x, "Ill-cond quad", mk_quad, 0.02),
  time_cost_func(10, x, "Rosenbrock", mk_rosenbrock)
])

Random starting locations

pts = np.random.default_rng(seed=1234).uniform(-2,2, (20,2))
df = pd.concat([
  pd.concat([
    time_cost_func(3, x, "Well-cond quad", mk_quad, 0.7),
    time_cost_func(3, x, "Ill-cond quad", mk_quad, 0.02),
    time_cost_func(3, x, "Rosenbrock", mk_rosenbrock)
  ])
  for x in pts
])

Profiling - BFGS (cProfile)

import cProfile

f, grad, hess = mk_quad(0.7)
def run():
  for i in range(500):
    optimize.minimize(fun = f, x0 = (1.6, 1.1), jac=grad, method="BFGS", tol=1e-11)

cProfile.run('run()', sort="tottime")
         549504 function calls in 0.209 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      500    0.039    0.000    0.204    0.000 _optimize.py:1345(_minimize_bfgs)
    29000    0.018    0.000    0.018    0.000 {method 'reduce' of 'numpy.ufunc' objects}
    13000    0.010    0.000    0.031    0.000 _optimize.py:195(vecnorm)
     5000    0.009    0.000    0.020    0.000 <string>:3(f)
    18000    0.008    0.000    0.020    0.000 fromnumeric.py:66(_wrapreduction)
     5000    0.008    0.000    0.010    0.000 <string>:9(gradient)
     4500    0.007    0.000    0.103    0.000 _dcsrch.py:201(__call__)
    10000    0.007    0.000    0.018    0.000 numeric.py:2522(array_equal)
     4500    0.006    0.000    0.034    0.000 _linesearch.py:86(derphi)
    13000    0.005    0.000    0.020    0.000 fromnumeric.py:2304(sum)
     5000    0.005    0.000    0.019    0.000 _differentiable_functions.py:35(__call__)
     4500    0.004    0.000    0.109    0.000 _linesearch.py:100(scalar_search_wolfe1)
     9000    0.004    0.000    0.004    0.000 _dcsrch.py:269(_iterate)
     4500    0.004    0.000    0.113    0.000 _linesearch.py:37(line_search_wolfe1)
     5000    0.004    0.000    0.026    0.000 _util.py:600(__call__)
     4500    0.004    0.000    0.057    0.000 _linesearch.py:82(phi)
    15500    0.004    0.000    0.005    0.000 {built-in method numpy.array}
     5000    0.003    0.000    0.006    0.000 _delegation.py:34(atleast_nd)
     4500    0.003    0.000    0.017    0.000 _differentiable_functions.py:337(_update_x)
     4500    0.003    0.000    0.116    0.000 _optimize.py:1156(_line_search_wolfe12)
     5000    0.003    0.000    0.055    0.000 _differentiable_functions.py:391(fun)
    10500    0.002    0.000    0.004    0.000 shape_base.py:19(atleast_1d)
    10000    0.002    0.000    0.002    0.000 {built-in method numpy.arange}
      500    0.002    0.000    0.209    0.000 _minimize.py:54(minimize)
     5000    0.002    0.000    0.030    0.000 _differentiable_functions.py:397(grad)
     5000    0.002    0.000    0.006    0.000 _aliases.py:70(asarray)
    26500    0.002    0.000    0.002    0.000 {built-in method builtins.isinstance}
    32000    0.002    0.000    0.002    0.000 {built-in method numpy.asarray}
     5500    0.002    0.000    0.021    0.000 _differentiable_functions.py:371(_update_grad)
    10000    0.002    0.000    0.009    0.000 {method 'all' of 'numpy.ndarray' objects}
    10000    0.002    0.000    0.003    0.000 _helpers.py:683(_check_device)
      500    0.002    0.000    0.012    0.000 _differentiable_functions.py:218(__init__)
     5500    0.002    0.000    0.028    0.000 _differentiable_functions.py:360(_update_fun)
    10000    0.002    0.000    0.002    0.000 {method 'copy' of 'numpy.ndarray' objects}
     5000    0.002    0.000    0.008    0.000 fromnumeric.py:3127(amax)
    10000    0.001    0.000    0.007    0.000 _methods.py:64(_all)
    18500    0.001    0.000    0.001    0.000 {method 'items' of 'dict' objects}
    10500    0.001    0.000    0.003    0.000 _function_base_impl.py:924(copy)
    25500    0.001    0.000    0.001    0.000 multiarray.py:748(dot)
     5000    0.001    0.000    0.004    0.000 _aliases.py:97(astype)
     5000    0.001    0.000    0.001    0.000 {method 'astype' of 'numpy.ndarray' objects}
    21000    0.001    0.000    0.001    0.000 {built-in method numpy.asanyarray}
     6000    0.001    0.000    0.001    0.000 {built-in method builtins.getattr}
      500    0.001    0.000    0.002    0.000 _differentiable_functions.py:57(__init__)
    12000    0.001    0.000    0.001    0.000 {built-in method builtins.len}
     5000    0.001    0.000    0.001    0.000 enum.py:186(__get__)
    10000    0.001    0.000    0.001    0.000 {method 'get' of 'dict' objects}
    13000    0.001    0.000    0.001    0.000 fromnumeric.py:2299(_sum_dispatcher)
     5000    0.001    0.000    0.001    0.000 numeric.py:1962(isscalar)
    10000    0.001    0.000    0.001    0.000 numeric.py:2503(_array_equal_dispatcher)
    10500    0.001    0.000    0.001    0.000 _function_base_impl.py:920(_copy_dispatcher)
      500    0.001    0.000    0.001    0.000 _twodim_base_impl.py:176(eye)
     4500    0.001    0.000    0.001    0.000 _dcsrch.py:174(__init__)
    10500    0.001    0.000    0.001    0.000 shape_base.py:15(_atleast_1d_dispatcher)
      500    0.001    0.000    0.001    0.000 _linalg.py:2598(norm)
     8500    0.001    0.000    0.001    0.000 {built-in method builtins.callable}
     4500    0.000    0.000    0.000    0.000 _linesearch.py:27(_check_c1_c2)
     4500    0.000    0.000    0.000    0.000 {built-in method builtins.min}
     4500    0.000    0.000    0.000    0.000 {method 'pop' of 'dict' objects}
      500    0.000    0.000    0.012    0.000 _optimize.py:204(_prepare_scalar_function)
      500    0.000    0.000    0.001    0.000 fromnumeric.py:86(_wrapreduction_any_all)
      500    0.000    0.000    0.001    0.000 numerictypes.py:385(isdtype)
     4500    0.000    0.000    0.000    0.000 {built-in method builtins.abs}
        1    0.000    0.000    0.209    0.209 <string>:1(run)
      500    0.000    0.000    0.000    0.000 _minimize.py:1134(standardize_constraints)
      500    0.000    0.000    0.001    0.000 shape_base.py:78(atleast_2d)
      500    0.000    0.000    0.000    0.000 numerictypes.py:372(_preprocess_dtype)
     4500    0.000    0.000    0.000    0.000 _util.py:988(_call_callback_maybe_halt)
     5000    0.000    0.000    0.000    0.000 fromnumeric.py:3006(_max_dispatcher)
     5000    0.000    0.000    0.000    0.000 enum.py:1323(value)
      500    0.000    0.000    0.000    0.000 {method 'dot' of 'numpy.ndarray' objects}
      500    0.000    0.000    0.001    0.000 fromnumeric.py:2436(any)
      500    0.000    0.000    0.000    0.000 {method 'reshape' of 'numpy.ndarray' objects}
      500    0.000    0.000    0.000    0.000 {method 'flatten' of 'numpy.ndarray' objects}
      500    0.000    0.000    0.000    0.000 {method 'any' of 'numpy.ndarray' objects}
      500    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
      500    0.000    0.000    0.000    0.000 {built-in method _abc._abc_instancecheck}
      500    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
      500    0.000    0.000    0.000    0.000 _sparse.py:10(issparse)
      500    0.000    0.000    0.000    0.000 <frozen abc>:117(__instancecheck__)
     1000    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
      500    0.000    0.000    0.000    0.000 {method 'update' of 'set' objects}
      500    0.000    0.000    0.000    0.000 _optimize.py:161(_check_positive_definite)
      500    0.000    0.000    0.000    0.000 _methods.py:55(_any)
      500    0.000    0.000    0.000    0.000 _linalg.py:169(isComplexType)
     1000    0.000    0.000    0.000    0.000 {built-in method _operator.index}
      500    0.000    0.000    0.000    0.000 _differentiable_functions.py:20(__init__)
      500    0.000    0.000    0.000    0.000 {method 'setdefault' of 'dict' objects}
      500    0.000    0.000    0.000    0.000 {method 'lower' of 'str' objects}
      500    0.000    0.000    0.000    0.000 _util.py:595(__init__)
      500    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
      500    0.000    0.000    0.000    0.000 _optimize.py:176(_check_unknown_options)
      500    0.000    0.000    0.000    0.000 _differentiable_functions.py:325(nfev)
      500    0.000    0.000    0.000    0.000 _array_api_override.py:69(array_namespace)
      500    0.000    0.000    0.000    0.000 _optimize.py:88(_wrap_callback)
      500    0.000    0.000    0.000    0.000 fromnumeric.py:2431(_any_dispatcher)
      500    0.000    0.000    0.000    0.000 {method 'values' of 'dict' objects}
      500    0.000    0.000    0.000    0.000 _differentiable_functions.py:329(ngev)
      500    0.000    0.000    0.000    0.000 _optimize.py:299(hess)
      500    0.000    0.000    0.000    0.000 shape_base.py:74(_atleast_2d_dispatcher)
      500    0.000    0.000    0.000    0.000 _linalg.py:2594(_norm_dispatcher)
        1    0.000    0.000    0.209    0.209 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.209    0.209 <string>:1(<module>)

Profiling - BFGS (pyinstrument)

from pyinstrument import Profiler

f, grad, hess = mk_quad(0.7)

profiler = Profiler(interval=0.00001)

profiler.start()
opt = optimize.minimize(fun = f, x0 = (1.6, 1.1), jac=grad, method="BFGS", tol=1e-11)
p = profiler.stop()

profiler.write_html("Lec22_bfgs_quad.html")

Profiling - Nelder-Mead (pyinstrument)

from pyinstrument import Profiler

f, grad, hess = mk_quad(0.7)

profiler = Profiler(interval=0.00001)

profiler.start()
opt = optimize.minimize(fun = f, x0 = (1.6, 1.1), method="Nelder-Mead", tol=1e-11)
p = profiler.stop()

profiler.write_html("Lec22_nm_quad.html")

optimize.minimize() output

f, grad, hess = mk_quad(0.7)
optimize.minimize(
  fun = f, x0 = (1.6, 1.1), 
  jac=grad, method="BFGS"
)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: 1.2739256453439323e-11
        x: [-5.318e-07 -8.843e-06]
      nit: 6
      jac: [-3.510e-07 -2.860e-06]
 hess_inv: [[ 1.515e+00 -3.438e-03]
            [-3.438e-03  3.035e+00]]
     nfev: 7
     njev: 7
optimize.minimize(
  fun = f, x0 = (1.6, 1.1), 
  jac=grad, hess=hess, method="Newton-CG"
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 2.3418652989289317e-12
       x: [ 0.000e+00  3.806e-06]
     nit: 11
     jac: [ 0.000e+00  4.102e-06]
    nfev: 12
    njev: 12
    nhev: 11

optimize.minimize() output (cont.)

optimize.minimize(
  fun = f, x0 = (1.6, 1.1), 
  jac=grad, method="CG"
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 1.4450021261144105e-32
       x: [-1.943e-16 -1.110e-16]
     nit: 2
     jac: [-1.282e-16 -3.590e-17]
    nfev: 5
    njev: 5
optimize.minimize(
  fun = f, x0 = (1.6, 1.1), 
  jac=grad, method="Nelder-Mead"
)
       message: Optimization terminated successfully.
       success: True
        status: 0
           fun: 2.3077013477040082e-10
             x: [ 1.088e-05  3.443e-05]
           nit: 46
          nfev: 89
 final_simplex: (array([[ 1.088e-05,  3.443e-05],
                       [ 1.882e-05, -3.825e-05],
                       [-3.966e-05, -3.147e-05]]), array([ 2.308e-10,  3.534e-10,  6.791e-10]))

optimize.minimize() output (cont.)

optimize.minimize(
  fun = f, x0 = (1.6, 1.1),
  jac=grad, hess=hess, method="trust-ncg"
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: 1.4831619534744353e-09
       x: [ 0.000e+00  9.577e-05]
     nit: 9
     jac: [ 0.000e+00  3.097e-05]
    nfev: 10
    njev: 10
    nhev: 9
    hess: [[ 6.600e-01  0.000e+00]
           [ 0.000e+00  4.620e-01]]
optimize.minimize(
  fun = f, x0 = (1.6, 1.1),
  method="L-BFGS-B"
)
  message: CONVERGENCE: NORM OF PROJECTED GRADIENT <= PGTOL
  success: True
   status: 0
      fun: 3.417411495023028e-12
        x: [ 2.832e-06  2.183e-06]
      nit: 5
      jac: [ 1.872e-06  7.077e-07]
     nfev: 18
     njev: 6
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
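
As an aside (not shown in the slides), L-BFGS-B stores its inverse-Hessian approximation as a LbfgsInvHessProduct linear operator rather than a dense matrix; for small problems it can be materialized if needed:

opt = optimize.minimize(fun = f, x0 = (1.6, 1.1), method="L-BFGS-B")
opt.hess_inv.todense()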

Collect

def run_collect(name, x0, cost_func, *args, tol=1e-8, skip=[]):
  f, grad, hess = cost_func(*args)
  methods = define_methods(x0, f, grad, hess, tol)
  
  res = []
  for method in methods:
    if method in skip: continue
    
    x = methods[method]()
    time = timeit.Timer(methods[method]).repeat(10, 25)

    d = {
      "name":    name,
      "method":  method,
      "nit":     x["nit"],
      "nfev":    x["nfev"],
      "njev":    x.get("njev"),
      "nhev":    x.get("nhev"),
      "success": x["success"],
      "time":    [time]
    }
    res.append( pd.DataFrame(d, index=[1]) )
  
  return pd.concat(res)
df = pd.concat([
  run_collect(
    name, (1.6, 1.1), 
    cost_func, 
    arg, 
    skip=['naive_newton', 'naive_cg']
  ) 
  for name, cost_func, arg in zip(
    ("Well-cond quad", "Ill-cond quad", "Rosenbrock"), 
    (mk_quad, mk_quad, mk_rosenbrock), 
    (0.7, 0.02, None)
  )
])

Exercise 1

Try minimizing the following function using different optimization methods, starting from \(x_0 = [0,0]\). Which method(s) appear to work best?

\[ \begin{align} f(x) = \exp(x_1-1) + \exp(-x_2+1) + (x_1-x_2)^2 \end{align} \]
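
One possible starting point (a sketch, not a solution) is to define the objective and loop over a few methods:

import numpy as np
from scipy import optimize

def f(x):
  return np.exp(x[0] - 1) + np.exp(-x[1] + 1) + (x[0] - x[1])**2

for method in ["Nelder-Mead", "CG", "BFGS", "L-BFGS-B"]:
  res = optimize.minimize(f, x0=np.zeros(2), method=method)
  print(method, res.x, res.fun, res.nfev)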

MVN Example

MVN density cost function

For an \(n\)-dimensional multivariate normal we define
the \(n \times 1\) vectors \(x\) and \(\mu\) and the \(n \times n\)
covariance matrix \(\Sigma\),

\[ \begin{align} f(x) =& \det(2\pi\Sigma)^{-1/2} \\ & \exp \left[-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \right] \\ \\ \nabla f(x) =& -f(x) \Sigma^{-1}(x-\mu) \\ \\ \nabla^2 f(x) =& f(x) \left( \Sigma^{-1}(x-\mu)(x-\mu)^T\Sigma^{-1} - \Sigma^{-1}\right) \\ \end{align} \]

Our goal will be to find the mode (maximum) of this density.

def mk_mvn(mu, Sigma):
  Sigma_inv = np.linalg.inv(Sigma)
  norm_const = 1 / (np.sqrt(np.linalg.det(2*np.pi*Sigma)))
  
  # Returns the negative density (since we want the max not min)
  def f(x):
    x_m = x - mu
    return -(norm_const * 
      np.exp( 
        -0.5 * x_m.T @ Sigma_inv @ x_m
      ) 
    ).item()
  
  def grad(x):
    return (-f(x) * Sigma_inv @ (x - mu))
  
  def hess(x):
    n = len(x)
    x_m = x - mu
    return f(x) * (
      (Sigma_inv @ x_m).reshape((n,1)) 
      @ (x_m.T @ Sigma_inv).reshape((1,n))
      - Sigma_inv
    )
  
  return f, grad, hess

Gradient checking

One of the most common issues when using these optimizers is getting the gradient calculation wrong, which can produce problematic results. The gradient function can be checked numerically by comparing its output against finite differences of the objective function via optimize.check_grad().

# 5d
f, grad, hess = mk_mvn(
  np.zeros(5), np.eye(5,5)
)
optimize.check_grad(f, grad, np.zeros(5))
np.float64(2.6031257322754127e-10)
optimize.check_grad(f, grad, np.ones(5))
np.float64(1.725679820308689e-11)
# 10d
f, grad, hess = mk_mvn(
  np.zeros(10), np.eye(10)
)
optimize.check_grad(f, grad, np.zeros(10))
np.float64(2.8760747774580336e-12)
optimize.check_grad(f, grad, np.ones(10))
np.float64(2.850398669793798e-14)

Gradient checking (wrong gradient)

wrong_grad = lambda x: 5*grad(x)
# 5d
f, grad, hess = mk_mvn(
  np.zeros(5), np.eye(5)
)
optimize.check_grad(f, wrong_grad, np.zeros(5))
np.float64(2.6031257322754127e-10)
optimize.check_grad(f, wrong_grad, np.ones(5))
np.float64(0.007419234855235744)
# 10d
f, grad, hess = mk_mvn(
  np.zeros(10), np.eye(10)
)
optimize.check_grad(f, wrong_grad, np.zeros(10))
np.float64(2.8760747774580336e-12)
optimize.check_grad(f, wrong_grad, np.ones(10))
np.float64(8.703385925704049e-06)


Why does np.ones() detect an issue but np.zeros() does not?

Hessian checking

Note that since the gradient of the gradient is the Hessian, we can use this function to check our implementation of the Hessian as well — just use grad() as func and hess() as grad.

# 5d
f, grad, hess = mk_mvn(np.zeros(5), np.eye(5))
optimize.check_grad(grad, hess, np.zeros(5))
np.float64(3.878959614448864e-18)
optimize.check_grad(grad, hess, np.ones(5))
np.float64(3.8156075963144067e-11)
# 10d
f, grad, hess = mk_mvn(np.zeros(10), np.eye(10))
optimize.check_grad(grad, hess, np.zeros(10))
np.float64(4.2856853864461685e-20)
optimize.check_grad(grad, hess, np.ones(10))
np.float64(8.551196009381392e-14)

Unit MVNs

rng = np.random.default_rng(seed=1234)
runif = rng.uniform(-1,1, size=25)

df = pd.concat([
  run_collect(
    name, runif[:n], mk_mvn, 
    np.zeros(n), np.eye(n), 
    tol=1e-10, 
    skip=['naive_newton', 'naive_cg', 'bfgs w/o G', 'newton-cg w/ H', 'l-bfgs-b', 'nelder-mead']
  )
  for name, n in zip(
    ("5d", "10d", "25d"),
    (5, 10, 25)
  )
])

Performance (Unit MVNs)

Performance (Unit MVNs, excl. CG)

What’s going on? (good)

n = 5
f, grad, hess = mk_mvn(np.zeros(n), np.eye(n))
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="newton-cg", tol=1e-10
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -0.010105326013811651
       x: [ 5.071e-11 -1.274e-11  4.502e-11 -2.535e-11 -1.924e-11]
     nit: 4
     jac: [ 5.125e-13 -1.288e-13  4.550e-13 -2.562e-13 -1.945e-13]
    nfev: 5
    njev: 9
    nhev: 0
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="bfgs", tol=1e-10
)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -0.010105326013811651
        x: [-2.482e-13  6.237e-14 -2.203e-13  1.240e-13  9.422e-14]
      nit: 4
      jac: [-2.508e-15  6.303e-16 -2.227e-15  1.253e-15  9.521e-16]
 hess_inv: [[ 4.463e+01 -1.096e+01 ... -2.181e+01 -1.656e+01]
            [-1.096e+01  3.756e+00 ...  5.481e+00  4.161e+00]
            ...
            [-2.181e+01  5.481e+00 ...  1.190e+01  8.276e+00]
            [-1.656e+01  4.161e+00 ...  8.276e+00  7.283e+00]]
     nfev: 10
     njev: 10

What’s going on? (okay)

n = 10
f, grad, hess = mk_mvn(np.zeros(n), np.eye(n))
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="newton-cg", tol=1e-10
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -0.00010211745783654051
       x: [-8.223e-04  2.067e-04 -7.301e-04  4.111e-04  3.120e-04
            6.588e-04  4.454e-04  3.130e-04 -8.005e-04  4.077e-04]
     nit: 3
     jac: [-8.397e-08  2.110e-08 -7.455e-08  4.198e-08  3.186e-08
            6.727e-08  4.549e-08  3.196e-08 -8.174e-08  4.163e-08]
    nfev: 6
    njev: 9
    nhev: 0
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="bfgs", tol=1e-10
)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -0.00010211761384541865
        x: [-2.347e-09  5.898e-10 -2.084e-09  1.173e-09  8.906e-10
             1.880e-09  1.271e-09  8.933e-10 -2.285e-09  1.164e-09]
      nit: 2
      jac: [-2.396e-13  6.023e-14 -2.128e-13  1.198e-13  9.094e-14
             1.920e-13  1.298e-13  9.122e-14 -2.333e-13  1.188e-13]
 hess_inv: [[ 1.685e+04 -4.235e+03 ...  1.641e+04 -8.356e+03]
            [-4.235e+03  1.065e+03 ... -4.123e+03  2.100e+03]
            ...
            [ 1.641e+04 -4.123e+03 ...  1.597e+04 -8.134e+03]
            [-8.356e+03  2.100e+03 ... -8.134e+03  4.144e+03]]
     nfev: 14
     njev: 14

What’s going on? (bad)

n = 25
f, grad, hess = mk_mvn(np.zeros(n), np.eye(n))
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="newton-cg", tol=1e-10
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -1.2867324357023428e-12
       x: [ 9.534e-01 -2.396e-01 ...  2.220e-01 -8.797e-01]
     nit: 1
     jac: [ 1.227e-12 -3.083e-13 ...  2.857e-13 -1.132e-12]
    nfev: 1
    njev: 1
    nhev: 0
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="bfgs", tol=1e-10
)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.2867324357023428e-12
        x: [ 9.534e-01 -2.396e-01 ...  2.220e-01 -8.797e-01]
      nit: 0
      jac: [ 1.227e-12 -3.083e-13 ...  2.857e-13 -1.132e-12]
 hess_inv: [[1 0 ... 0 0]
            [0 1 ... 0 0]
            ...
            [0 0 ... 1 0]
            [0 0 ... 0 1]]
     nfev: 1
     njev: 1

All bad?

optimize.minimize(
  f, runif[:n], jac=grad, 
  method="nelder-mead", tol=1e-10
)
       message: Maximum number of function evaluations has been exceeded.
       success: False
        status: 1
           fun: -5.2161537392613975e-11
             x: [ 7.181e-02 -3.136e-01 ...  3.193e-01  3.222e-02]
           nit: 4136
          nfev: 5000
 final_simplex: (array([[ 7.181e-02, -3.136e-01, ...,  3.193e-01,
                         3.222e-02],
                       [ 7.275e-02, -3.237e-01, ...,  3.218e-01,
                         2.192e-02],
                       ...,
                       [ 8.238e-02, -3.143e-01, ...,  3.247e-01,
                         2.232e-02],
                       [ 8.105e-02, -3.178e-01, ...,  3.119e-01,
                         3.078e-02]], shape=(26, 25)), array([-5.216e-11, -5.216e-11, ..., -5.213e-11, -5.213e-11],
                      shape=(26,)))

Options (newton-cg)

optimize.show_options(solver="minimize", method="newton-cg")
Minimization of scalar function of one or more variables using the
Newton-CG algorithm.

Note that the `jac` parameter (Jacobian) is required.

Options
-------
disp : bool
    Set to True to print convergence messages.
xtol : float
    Average relative error in solution `xopt` acceptable for
    convergence.
maxiter : int
    Maximum number of iterations to perform.
eps : float or ndarray
    If `hessp` is approximated, use this value for the step size.
return_all : bool, optional
    Set to True to return a list of the best solution at each of the
    iterations.
c1 : float, default: 1e-4
    Parameter for Armijo condition rule.
c2 : float, default: 0.9
    Parameter for curvature condition rule.
workers : int, map-like callable, optional
    A map-like callable, such as `multiprocessing.Pool.map` for evaluating
    any numerical differentiation in parallel.
    This evaluation is carried out as ``workers(fun, iterable)``.

    .. versionadded:: 1.16.0

Notes
-----
Parameters `c1` and `c2` must satisfy ``0 < c1 < c2 < 1``.

Options (bfgs)

optimize.show_options(solver="minimize", method="bfgs")
Minimization of scalar function of one or more variables using the
BFGS algorithm.

Options
-------
disp : bool
    Set to True to print convergence messages.
maxiter : int
    Maximum number of iterations to perform.
gtol : float
    Terminate successfully if gradient norm is less than `gtol`.
norm : float
    Order of norm (Inf is max, -Inf is min).
eps : float or ndarray
    If `jac is None` the absolute step size used for numerical
    approximation of the jacobian via forward differences.
return_all : bool, optional
    Set to True to return a list of the best solution at each of the
    iterations.
finite_diff_rel_step : None or array_like, optional
    If ``jac in ['2-point', '3-point', 'cs']`` the relative step size to
    use for numerical approximation of the jacobian. The absolute step
    size is computed as ``h = rel_step * sign(x) * max(1, abs(x))``,
    possibly adjusted to fit into the bounds. For ``jac='3-point'``
    the sign of `h` is ignored. If None (default) then step is selected
    automatically.
xrtol : float, default: 0
    Relative tolerance for `x`. Terminate successfully if step size is
    less than ``xk * xrtol`` where ``xk`` is the current parameter vector.
c1 : float, default: 1e-4
    Parameter for Armijo condition rule.
c2 : float, default: 0.9
    Parameter for curvature condition rule.
hess_inv0 : None or ndarray, optional
    Initial inverse hessian estimate, shape (n, n). If None (default) then
    the identity matrix is used.
workers : int, map-like callable, optional
    A map-like callable, such as `multiprocessing.Pool.map` for evaluating
    any numerical differentiation in parallel.
    This evaluation is carried out as ``workers(fun, iterable)``.

    .. versionadded:: 1.16.0

Notes
-----
Parameters `c1` and `c2` must satisfy ``0 < c1 < c2 < 1``.

If minimization doesn't complete successfully, with an error message of
``Desired error not necessarily achieved due to precision loss``, then
consider setting `gtol` to a higher value. This precision loss typically
occurs when the (finite difference) numerical differentiation cannot provide
sufficient precision to satisfy the `gtol` termination criterion.
This can happen when working in single precision and a callable jac is not
provided. For single precision problems a `gtol` of 1e-3 seems to work.

Options (Nelder-Mead)

optimize.show_options(solver="minimize", method="nelder-mead")
Minimization of scalar function of one or more variables using the
Nelder-Mead algorithm.

Options
-------
disp : bool
    Set to True to print convergence messages.
maxiter, maxfev : int
    Maximum allowed number of iterations and function evaluations.
    Will default to ``N*200``, where ``N`` is the number of
    variables, if neither `maxiter` or `maxfev` is set. If both
    `maxiter` and `maxfev` are set, minimization will stop at the
    first reached.
return_all : bool, optional
    Set to True to return a list of the best solution at each of the
    iterations.
initial_simplex : array_like of shape (N + 1, N)
    Initial simplex. If given, overrides `x0`.
    ``initial_simplex[j,:]`` should contain the coordinates of
    the jth vertex of the ``N+1`` vertices in the simplex, where
    ``N`` is the dimension.
xatol : float, optional
    Absolute error in xopt between iterations that is acceptable for
    convergence.
fatol : number, optional
    Absolute error in func(xopt) between iterations that is acceptable for
    convergence.
adaptive : bool, optional
    Adapt algorithm parameters to dimensionality of problem. Useful for
    high-dimensional minimization [1]_.
bounds : sequence or `Bounds`, optional
    Bounds on variables. There are two ways to specify the bounds:

    1. Instance of `Bounds` class.
    2. Sequence of ``(min, max)`` pairs for each element in `x`. None
       is used to specify no bound.

    Note that this just clips all vertices in simplex based on
    the bounds.

References
----------
.. [1] Gao, F. and Han, L.
   Implementing the Nelder-Mead simplex algorithm with adaptive
   parameters. 2012. Computational Optimization and Applications.
   51:1, pp. 259-277

SciPy implementation

The following code comes from SciPy’s minimize() implementation:

if tol is not None:
  options = dict(options)
  if meth == 'nelder-mead':
      options.setdefault('xatol', tol)
      options.setdefault('fatol', tol)
  if meth in ('newton-cg', 'powell', 'tnc'):
      options.setdefault('xtol', tol)
  if meth in ('powell', 'l-bfgs-b', 'tnc', 'slsqp'):
      options.setdefault('ftol', tol)
  if meth in ('bfgs', 'cg', 'l-bfgs-b', 'tnc', 'dogleg',
              'trust-ncg', 'trust-exact', 'trust-krylov'):
      options.setdefault('gtol', tol)
  if meth in ('cobyla', '_custom'):
      options.setdefault('tol', tol)
  if meth == 'trust-constr':
      options.setdefault('xtol', tol)
      options.setdefault('gtol', tol)
      options.setdefault('barrier_tol', tol)
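
In other words, tol only fills in a method-specific default. Equivalently (a hedged example, not from the slides), the underlying option can be set directly and combined with other options:

from scipy import optimize

opt = optimize.minimize(
  optimize.rosen, x0=[1.6, 1.1], jac=optimize.rosen_der,
  method="BFGS", options={"gtol": 1e-11}   # same effect as tol=1e-11 for BFGS
)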

Fixing our tolerances?

n = 25
f, grad, hess = mk_mvn(np.zeros(n), np.eye(n))
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="newton-cg", tol=1e-16
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -1.2867324357023428e-12
       x: [ 9.534e-01 -2.396e-01 ...  2.220e-01 -8.797e-01]
     nit: 1
     jac: [ 1.227e-12 -3.083e-13 ...  2.857e-13 -1.132e-12]
    nfev: 1
    njev: 1
    nhev: 0
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="bfgs", tol=1e-16
)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.0537841099018322e-10
        x: [-2.645e-08  6.648e-09 ... -6.160e-09  2.441e-08]
      nit: 3
      jac: [-2.788e-18  7.006e-19 ... -6.492e-19  2.572e-18]
 hess_inv: [[ 9.790e+08 -2.461e+08 ...  2.280e+08 -9.034e+08]
            [-2.461e+08  6.184e+07 ... -5.730e+07  2.270e+08]
            ...
            [ 2.280e+08 -5.730e+07 ...  5.310e+07 -2.104e+08]
            [-9.034e+08  2.270e+08 ... -2.104e+08  8.336e+08]]
     nfev: 27
     njev: 27

Limits of floating point precision

Every floating point type has finite precision due to the limitations of how values are represented. This precision is typically characterized by the machine epsilon, the spacing between 1.0 and the next representable floating-point number.

np.finfo(np.float64).eps
np.float64(2.220446049250313e-16)
np.finfo(np.float32).eps
np.float32(1.1920929e-07)
np.finfo(np.float16).eps
np.float16(0.000977)
1+np.finfo(np.float64).eps > 1
np.True_
1+np.finfo(np.float64).eps/2 > 1
np.False_
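
To see why this matters here (a rough illustration, added for context): the normalizing constant of a 25-dimensional standard MVN is already on the order of 1e-10, so the density and its gradient near the starting point are tiny relative to the default tolerances and to float64 precision.

n = 25
1 / np.sqrt(np.linalg.det(2 * np.pi * np.eye(n)))   # ~1.1e-10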

Fixes?

def mk_prop_mvn(mu, Sigma):
  Sigma_inv = np.linalg.inv(Sigma)
  #norm_const = 1 / (np.sqrt(np.linalg.det(2*np.pi*Sigma)))
  norm_const = 1
  
  # Returns the negative (unnormalized) density (since we want the max not min)
  def f(x):
    x_m = x - mu
    return -(norm_const * 
      np.exp( 
        -0.5 * x_m.T @ Sigma_inv @ x_m
      ) 
    ).item()
  
  def grad(x):
    return (-f(x) * Sigma_inv @ (x - mu))
  
  def hess(x):
    n = len(x)
    x_m = x - mu
    return f(x) * (
      (Sigma_inv @ x_m).reshape((n,1)) 
      @ (x_m.T @ Sigma_inv).reshape((1,n))
      - Sigma_inv
    )
  
  return f, grad, hess

n = 25
f, grad, hess = mk_prop_mvn(np.zeros(n), np.eye(n))
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="newton-cg", tol=1e-10
)
 message: Optimization terminated successfully.
 success: True
  status: 0
     fun: -1.0
       x: [-3.701e-16  9.302e-17 ... -8.619e-17  3.415e-16]
     nit: 4
     jac: [-3.701e-16  9.302e-17 ... -8.619e-17  3.415e-16]
    nfev: 10
    njev: 14
    nhev: 0
optimize.minimize(
  f, runif[:n], jac=grad, 
  method="bfgs", tol=1e-10
)
  message: Optimization terminated successfully.
  success: True
   status: 0
      fun: -1.0
        x: [-3.040e-15  7.639e-16 ... -7.079e-16  2.805e-15]
      nit: 4
      jac: [-3.040e-15  7.639e-16 ... -7.079e-16  2.805e-15]
 hess_inv: [[ 1.000e+00 -6.660e-09 ...  6.171e-09 -2.445e-08]
            [-6.660e-09  1.000e+00 ... -1.551e-09  6.145e-09]
            ...
            [ 6.171e-09 -1.551e-09 ...  1.000e+00 -5.694e-09]
            [-2.445e-08  6.145e-09 ... -5.694e-09  1.000e+00]]
     nfev: 11
     njev: 11
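
Another common fix (an aside, not covered in the slides) is to minimize the negative log-density instead, which avoids the underflow entirely since the quadratic form remains numerically well behaved in high dimensions:

def mk_log_mvn(mu, Sigma):
  Sigma_inv = np.linalg.inv(Sigma)
  
  # Negative log-density, up to an additive constant
  def f(x):
    x_m = x - mu
    return 0.5 * x_m @ Sigma_inv @ x_m
  
  def grad(x):
    return Sigma_inv @ (x - mu)
  
  def hess(x):
    return Sigma_inv
  
  return f, grad, hess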

Performance

Some general advice

  • Having access to the gradient is almost always helpful / necessary

  • Having access to the Hessian can be helpful, but usually does not significantly improve things

  • The curse of dimensionality is real

    • Be careful with tol - it means different things for different methods

    • Be aware of the scale of your function and gradients

  • In general, BFGS or L-BFGS should be a first choice for most problems (either well- or ill-conditioned)

    • CG can perform better for well-conditioned problems with cheap function evaluations