Discussion:
[NLopt-discuss] Runtime errors for large gradients
s***@arcor.de
2011-02-16 12:39:27 UTC
Hi,

I would like to support the request that Alexander Riess has posted on
January 28. Just like him, I keep getting std::runtime_errors for large
gradients. I am using the C++ library, not the Python interface. I could
provide additional test cases, if necessary, but I suppose the simple
example posted by Alexander already illustrates the problem very well.

Best regards,
Peter
Thanks for the bugfix.
Today I am back with a class of new but similar test problems:
min -s*x on [-1,1]
with different values of s.
opt.set_lower_bounds( [-1.0 ] )
opt.set_upper_bounds( [ 1.0 ] )
opt.set_min_objective(f)
opt.set_xtol_abs(1e-10)
opt.set_xtol_rel(1e-10)
opt.set_ftol_abs(1e-10)
opt.set_ftol_rel(1e-10)
x0 = array([0.0])
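(For reference, the snippet above can be completed into a runnable
script roughly as follows. This is a sketch assuming the nlopt.opt
Python interface; the objective f and its in-place gradient are filled
in to match min -s*x, not copied from the original script.)

import numpy as np
import nlopt

s = 1.0e4  # gradient scale for this instance

def f(x, grad):
    # NLopt passes an array to be filled with the gradient in place;
    # it has size 0 when only the function value is needed.
    if grad.size > 0:
        grad[0] = -s
    return -s * x[0]

opt = nlopt.opt(nlopt.LD_SLSQP, 1)  # 1-dimensional SLSQP
opt.set_lower_bounds([-1.0])
opt.set_upper_bounds([1.0])
opt.set_min_objective(f)
opt.set_xtol_abs(1e-10)
opt.set_xtol_rel(1e-10)
opt.set_ftol_abs(1e-10)
opt.set_ftol_rel(1e-10)

x0 = np.array([0.0])
xopt = opt.optimize(x0)  # raises RuntimeError on failure
print xopt, opt.last_optimum_value()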
Algorithm: Sequential Quadratic Programming (SQP)

1.) s = 1.0
Iteration  x               f(x)            f'(x)
1          0.0             -0              -1.0
2          1.0             -1.0            -1.0
xopt = 1.0
fopt = -1.0

2.) s = 1.0E4
Iteration  x               f(x)            f'(x)
1          0.0             -0              -10000.0
2          0.999806062342  -9998.06062342  -10000.0
3          0.999873394578  -9998.73394578  -10000.0
xopt = 0.999873394578
fopt = -9998.73394578

3.) s = 1.0E5
Iteration  x               f(x)            f'(x)
1          0.0             -0              -100000.0
2          0.562577480974  -56257.7480974  -100000.0
3          1.0             -100000.0       -100000.0
xopt = 1.0
fopt = -100000.0

4.) s = 1.0E6
Iteration  x               f(x)            f'(x)
1          0.0             -0              -1000000.0
xopt = 0.0
fopt = -0.0

5.) s = 1.0E7
Iteration  x               f(x)            f'(x)
1          0.0             -0              -10000000.0
xopt = 0.0
fopt = -0.0

6.) s = 1.0E8
Iteration  x               f(x)            f'(x)
1          0.0             -0              -100000000.0
  File "XXX\minExample.py", line 35, in <module>
    xOpt = opt.optimize(x0)
  File "XXX\Python265\lib\site-packages\nlopt.py", line 231, in optimize
    def optimize(*args): return _nlopt.opt_optimize(*args)
RuntimeError: nlopt failure
Conclusion: SQP fails on this class of test problems if there is a
large gradient (abs(grad) >= 1.0E6). Can you please tell me what is
going wrong?
Note: Using MMA, everything works fine.
Kind regards
Alexander Riess
Steven G. Johnson
2011-03-07 20:58:39 UTC
Tracing through the code, this appears to be a problem in the original
SLSQP software. SLSQP works by minimizing an approximate quadratic
subproblem at each step. At the first step of Alexander's 1d test
case, SLSQP is minimizing:

min_x (x - s)^2
-1 <= x <= 1

which should give x = 1 for s > 0 -- i.e., it should converge in 1
step. However, for large s, SLSQP (in its subroutine LSQ) seems to be
suffering from extremely large round-off errors, as if it were solving
an ill-conditioned problem or as if it were using a numerically unstable
algorithm. I'm not sure why that should be the case here, however;
the problem doesn't seem too badly conditioned even for s=1e6, and the
QR algorithm that LSQ claims to be using should be stable, I think.
For s=1e6 the LSQ routine is "solving" the above subproblem with
x=-356.31518661149312, and that is screwing everything up in
subsequent steps.
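(A quick sanity check: for this 1d box-constrained QP the exact answer
is just the unconstrained minimizer clamped to the bounds, so there is
nothing intrinsically hard about the subproblem. A two-line sketch,
not the LSQ code:)

def qp_subproblem_1d(s, lo=-1.0, hi=1.0):
    # minimize (x - s)**2 subject to lo <= x <= hi: the solution is
    # the unconstrained optimum s clamped to the box.
    return min(max(s, lo), hi)

for s in [1.0, 1.0e4, 1.0e6, 1.0e8]:
    print s, qp_subproblem_1d(s)  # prints 1.0 for every s >= 1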

One thing that would be helpful would be if someone tried Alexander's
test problem in fmin_slsqp from SciPy, which also uses the SLSQP code,
to see if a similar problem (albeit probably with different symptoms)
occurs there.

Unfortunately, the guts of the LSQ routine in SLSQP are rather painful
to debug; this is pretty ugly spaghetti-style code (even before f2c
conversion from Fortran to C), and this appears to be a roundoff
susceptibility rather than a more straightforward bug. It almost
looks easier to rip it out completely and replace it with something
LAPACK-based (i.e. based on modern linear algebra libraries), but that
is not a small job.

For now, your options seem to be (a) rescale your problem so that
gradients etcetera are of order 1 or (b) use a different algorithm in
NLopt.
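(For option (a), the rescaling can often live in a thin wrapper rather
than in the model itself. A sketch assuming the NLopt-style f(x, grad)
calling convention, with the scale constant c chosen per problem:)

def scaled(f, c):
    # Minimizing f/c has the same minimizer as minimizing f, but the
    # gradients seen by the optimizer are divided by c, so choosing c
    # near the typical gradient magnitude brings them to order 1.
    def g(x, grad):
        val = f(x, grad)
        if grad.size > 0:
            grad[:] /= c
        return val / c
    return g

# e.g.: opt.set_min_objective(scaled(f, 1.0e6))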

Steven
s***@arcor.de
2011-03-08 09:29:32 UTC
Dear Steven,

Thank you for investigating.
Post by Steven G. Johnson
One thing that would be helpful would be if someone tried Alexander's
test problem in fmin_slsqp from SciPy, which also uses the SLSQP code,
to see if a similar problem (albeit probably with different symptoms)
occurs there.
Unfortunately, I cannot help you with that. I have no idea about Python.
Post by Steven G. Johnson
Tracing through the code, this appears to be a problem in the original
SLSQP software. SLSQP works by minimizing an approximate quadratic
subproblem at each step. At the first step of Alexander's 1d test case,
min_x (x - s)^2
-1 <= x <= 1
I wonder whether this is really the problem solved by the method. I
must admit, I have not checked the code, and I am not sure about
SLSQP. The original problem is

min_x f(x) = -s * x
g(x) = (x-1, -x-1)^T <= 0.

In a general SQP method, one would then solve the linearized subproblem

min_d -s * d
(x_k - 1, -x_k - 1)^T + (d, -d)^T <= 0,

where x_k is the previous iterate, and then set x_{k+1} = x_k + d.
(Actually, one would perform a line search with the Armijo rule along
the search direction d.) There is no quadratic term in the objective
in this case, since the second derivative of the Lagrange function

L(x) = f(x) + lambda^T g(x) = -s*x + lambda^T (x-1, -x-1)^T

is 0. Perhaps SLSQP somehow transforms this problem, whose subproblems
are quadratic in general, into the minimization problem you state. I
probably should have a look at the original papers by Dieter Kraft
describing SLSQP.
Post by Steven G. Johnson
Unfortunately, the guts of the LSQ routine in SLSQP are rather painful
to debug; this is pretty ugly spaghetti-style code (even before f2c
conversion from Fortran to C),
That is very true ;-).
Post by Steven G. Johnson
For now, your options seem to be (a) rescale your problem so that
gradients etcetera are of order 1 or (b) use a different algorithm in
NLopt.
My actual application is highly non-linear, 50-100 dimensional, and
unfortunately too complex to do any rescaling. While LBFGS works, it is
not really an option, since it needs too many gradient evaluations. SQP
is the only chance to get results within reasonable times (less than 12
hours).
Knowing that there may be numerical instabilities in what is, from my
point of view, the most important optimization algorithm offered by
NLopt is somewhat scary...


Best regards,
Peter
Steven G. Johnson
2011-03-08 16:25:35 UTC
Post by s***@arcor.de
Post by Steven G. Johnson
Tracing through the code, this appears to be a problem in the original
SLSQP software. SLSQP works by minimizing an approximate quadratic
subproblem at each step. At the first step of Alexander's 1d test case,
min_x (x - s)^2
-1 <= x <= 1
I wonder whether this is really the problem actually solved by the
method. I must admit, I have not checked the code, and I am not sure
about SLSQP. The original problem is
min_x f(x) = -s * x
g(x) = (x-1, -x-1)^T <= 0.
In a general SQP method, one would then solve
min_d -s * d
You need to realize that SLSQP is a quasi-Newton method, not a Newton
method: it is sequential quadratic programming with an *approximate*
Hessian (2nd derivative) that is built up iteratively.

That is, at each step, it solves a quadratic approximation of the
original objective, where the linear term is the gradient and the
quadratic term is an approximate Hessian. The Hessian is constructed
from BFGS updates, based on the gradients and function values at
subsequent steps, and like any BFGS method it only approaches the true
Hessian in the limit of many iterations near a minimum.

In particular, for the very first step the Hessian is initialized to
the identity matrix (divided by 2), hence the (x - s)^2 objective.
This is a typical choice for quasi-Newton methods. (Hence the initial
steps should depend on the magnitude of the gradient.)
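(To make that concrete, here is a sketch of the textbook BFGS update
and of the first step it produces on this test problem. SLSQP's actual
implementation uses a damped variant of the update, so this is
illustrative only.)

import numpy as np

def bfgs_update(B, sk, yk):
    # Textbook BFGS update of the Hessian approximation B, given the
    # step sk = x_{k+1} - x_k and gradient change yk = g_{k+1} - g_k.
    # Note: on a *linear* objective yk = 0 and the last term blows up,
    # so practical implementations damp or skip the update.
    Bs = B.dot(sk)
    return (B - np.outer(Bs, Bs) / sk.dot(Bs)
              + np.outer(yk, yk) / yk.dot(sk))

# First quasi-Newton step on min -s*x with B0 = I: the quadratic model
# q(d) = g*d + d**2/2 with g = -s is minimized at d = s (the (x - s)^2
# objective above, up to a constant), so the raw step scales with the
# gradient magnitude and is then clipped to the box [-1, 1].
s = 1.0e6
g = np.array([-s])
B0 = np.eye(1)
d = np.linalg.solve(B0, -g)        # array([ 1000000.])
d_clamped = np.clip(d, -1.0, 1.0)  # array([ 1.])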
Post by s***@arcor.de
My actual application is highly non-linear, 50-100 dimensional, and
unfortunately too complex to do any rescaling. While LBFGS works, it
is not really an option, since it needs too many gradient
evaluations. SQP is the only chance to get results within reasonable
times (less than 12 hours).
SLSQP is a BFGS method. The main differences between it and LBFGS are:

1) it uses the original BFGS method with dense matrices, rather than
the LBFGS sparse/low-rank/low-storage factorizations.

2) SLSQP handles nonlinear constraints (its subproblems are general
QPs: quadratic objectives, linear constraints) whereas LBFGS only
handles bound constraints.

So, if you have no nonlinear constraints I'm not sure why SLSQP would
converge significantly differently than any of the other quasi-Newton
methods.

Steven
Alexander Riess
2011-03-14 10:50:07 UTC
Hi
Post by Steven G. Johnson
One thing that would be helpful would be if someone tried Alexander's
test problem in fmin_slsqp from SciPy, which also uses the SLSQP code,
to see if a similar problem (albeit probably with different symptoms)
occurs there.
I tried to solve the test problem min -s*x on [-1.0,1.0] with SciPy.
The summary, program, and results are listed below.

If 1.0 <= s <= 1.0E2, everything works fine: 'Optimization terminated
successfully.' If s > 1.0E2, the result is 'Positive directional
derivative for linesearch'. If s > 1.0E5, the calculated value of xmin
is wrong: [0.0].

Summary:
------------------------------------------------------------------------

s = 1.0E0: [[1.0], -1.0, 2, 0, 'Optimization terminated successfully.']
s = 1.0E1: [[1.0000000000000924], -10.000000000000924, 2, 0, 'Optimization
terminated successfully.']
s = 1.0E2: [[0.99999999977386267], -99.999999977386267, 2, 0, 'Optimization
terminated successfully.']
s = 1.0E3: [[0.99999982147039645], -999.99982147039645, 6, 8, 'Positive
directional derivative for linesearch']
s = 1.0E4: [[0.99987339457766211], -9998.7339457766211, 7, 8, 'Positive
directional derivative for linesearch']
s = 1.0E5: [[1.1326320648076944], -113263.20648076944, 7, 8, 'Positive
directional derivative for linesearch']
s = 1.0E6: [[0.0], -0.0, 5, 8, 'Positive directional derivative for
linesearch']
s = 1.0E7: [[0.0], -0.0, 5, 8, 'Positive directional derivative for
linesearch']


Program:
------------------------------------------------------------------------

from scipy.optimize import fmin_slsqp
from numpy import array, asfarray, finfo, ones, sqrt, zeros

def f(x, *args):
    out = -s*x[0]
    return out

def df(x, *args):
    out = [-s]
    return out

s = 1.0E1

print "Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = ",s
print "=============================================================="
x = fmin_slsqp(f,
               x0=[0.0],
               fprime=df,
               bounds=[(-1.0, 1.0)],
               iprint=2,
               full_output=1)
print "=============================================================="
print "Results",x


Results:
------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 1.0
==============================================================
NIT FC OBJFUN GNORM
1 1 -0.000000E+00 1.000000E+00
2 2 -1.000000E+00 1.000000E+00
Optimization terminated successfully. (Exit mode 0)
Current function value: -1.0
Iterations: 2
Function evaluations: 2
Gradient evaluations: 2
==============================================================
Results [[1.0], -1.0, 2, 0, 'Optimization terminated successfully.']

------------------------------------------------------------------------


Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 10.0
==============================================================
NIT FC OBJFUN GNORM
1 1 -0.000000E+00 1.000000E+01
2 2 -1.000000E+01 1.000000E+01
Optimization terminated successfully. (Exit mode 0)
Current function value: -10.0
Iterations: 2
Function evaluations: 2
Gradient evaluations: 2
==============================================================
Results [[1.0000000000000924], -10.000000000000924, 2, 0, 'Optimization
terminated successfully.']

------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 100.0
==============================================================
NIT FC OBJFUN GNORM
1 1 -0.000000E+00 1.000000E+02
2 2 -1.000000E+02 1.000000E+02
Optimization terminated successfully. (Exit mode 0)
Current function value: -99.9999999774
Iterations: 2
Function evaluations: 2
Gradient evaluations: 2
==============================================================
Results [[0.99999999977386267], -99.999999977386267, 2, 0, 'Optimization
terminated successfully.']

------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 1000.0
==============================================================
NIT FC OBJFUN GNORM
1 1 -0.000000E+00 1.000000E+03
6 2 -9.999998E+02 1.000000E+03
Positive directional derivative for linesearch (Exit mode 8)
Current function value: -999.99982147
Iterations: 6
Function evaluations: 2
Gradient evaluations: 2
==============================================================
Results [[0.99999982147039645], -999.99982147039645, 6, 8, 'Positive
directional derivative for linesearch']

------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 10000.0
==============================================================
NIT FC OBJFUN GNORM
1 1 -0.000000E+00 1.000000E+04
3 2 -9.998061E+03 1.000000E+04
7 3 -9.998734E+03 1.000000E+04
Positive directional derivative for linesearch (Exit mode 8)
Current function value: -9998.73394578
Iterations: 7
Function evaluations: 3
Gradient evaluations: 3
==============================================================
Results [[0.99987339457766211], -9998.7339457766211, 7, 8, 'Positive
directional derivative for linesearch']

------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 100000.0
==============================================================
NIT FC OBJFUN GNORM
1 1 -0.000000E+00 1.000000E+05
3 2 -5.625775E+04 1.000000E+05
7 3 -1.132632E+05 1.000000E+05
Positive directional derivative for linesearch (Exit mode 8)
Current function value: -113263.206481
Iterations: 7
Function evaluations: 3
Gradient evaluations: 3
==============================================================
Results [[1.1326320648076944], -113263.20648076944, 7, 8, 'Positive
directional derivative for linesearch']

------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 1000000.0
==============================================================
NIT FC OBJFUN GNORM
5 1 -0.000000E+00 1.000000E+06
Positive directional derivative for linesearch (Exit mode 8)
Current function value: -0
Iterations: 5
Function evaluations: 1
Gradient evaluations: 1
==============================================================
Results [[0.0], -0.0, 5, 8, 'Positive directional derivative for linesearch']

------------------------------------------------------------------------

Optimization of -s*x on [-1.0,1.0] using fmin_slsqp | s = 10000000.0
==============================================================
NIT FC OBJFUN GNORM
5 1 -0.000000E+00 1.000000E+07
Positive directional derivative for linesearch (Exit mode 8)
Current function value: -0
Iterations: 5
Function evaluations: 1
Gradient evaluations: 1
==============================================================
Results [[0.0], -0.0, 5, 8, 'Positive directional derivative for linesearch']
Steven G. Johnson
2011-03-14 11:55:30 UTC
Post by Alexander Riess
Post by Steven G. Johnson
One thing that would be helpful would be if someone tried Alexander's
test problem in fmin_slsqp from SciPy, which also uses the SLSQP code,
to see if a similar problem (albeit probably with different symptoms)
occurs there.
I tried to solve the test problem min -s*x on [-1.0,1.0] with SciPy.
The summary, program, and results are listed below.
If 1.0 <= s <= 1.0E2, everything works fine: 'Optimization terminated
successfully.' If s > 1.0E2, the result is 'Positive directional
derivative for linesearch'. If s > 1.0E5, the calculated value of xmin
is wrong: [0.0].
Thanks for checking on this; this confirms that the problem is in the
original SLSQP, and not in some mistake I made in porting it.

Steven

s***@arcor.de
2011-03-08 17:07:57 UTC
Post by Steven G. Johnson
You need to realize that SLSQP is a quasi-Newton method, not a Newton
method: it is sequential quadratic programming with an *approximate*
Hessian (2nd derivative) that is built up iteratively.
[snip]
In particular, for the very first step the Hessian is initialized to the
identity matrix (divided by 2)
You are, of course, right. Thanks for pointing this out.
Post by Steven G. Johnson
SLSQP is a BFGS method.
So, if you have no nonlinear constraints I'm not sure why SLSQP would
converge significantly differently than any of the other quasi-Newton
methods.
I agree with that. In the NLopt implementation, however, SLSQP seems to
perform a line search, while LBFGS does not (?). LBFGS evaluates the
gradient every time the function is called. This also affects
convergence. For a very small, randomly picked test with my application,
I obtain the following results:
SLSQP: 54 gradients, 100 function evaluations, 28 seconds,
LBFGS: 116 gradients, 116 function evaluations, 67 seconds.
Perhaps I am missing something here? According to the manual, the
gradient must be computed if and only if the gradient argument is not
empty().

Apart from that, I would love to see the SQP method be reliable and
stable in all cases. At some point, I will have to add non-linear
constraints to my problem and then SQP will definitely be the method of
choice.


Best regards,
Peter
Steven G. Johnson
2011-03-08 17:46:36 UTC
Post by s***@arcor.de
I agree with that. In the NLopt implementation, however, SLSQP seems to
perform a line search, while LBFGS does not (?). LBFGS evaluates the
gradient every time the function is called. This also affects
convergence. For a very small, randomly picked test with my
application,
SLSQP: 54 gradients, 100 function evaluations, 28 seconds,
LBFGS: 116 gradients, 116 function evaluations, 67 seconds.
Perhaps I am missing something here? According to the manual, the
gradient must be computed if and only if the gradient argument is not
empty().
They both perform line searches (essentially any quasi-Newton method
must do this), but you are right in that SLSQP's line-search
implementation tries to avoid evaluating the gradient if it does not
need it. (The gradient is needed only for the first and last step of
the line search.)

It probably wouldn't be too hard to modify the LBFGS implementation to
do something similar.

Steven

PS. You can probably save a few more function evaluations if you check
for repeated calls with the same arguments. The reason is that, on
the line search, SLSQP often calls your function twice at the end of
the line search: once without the gradient, and then once more (when
it realizes ex post facto that the line search is done) with the
gradient. (An alternative interface would be to have separate
function calls for the objective and its gradient, but this is
problematic because the value and gradient usually share many
calculations.)
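(A cache of the most recent point is enough to absorb that duplicated
call. A sketch, again assuming the NLopt-style f(x, grad) convention:)

import numpy as np

def cached(f):
    # Remember the last evaluated point; if the optimizer asks for the
    # same x again and the gradient is already known, reuse the stored
    # value and gradient instead of re-evaluating.
    last = {'x': None, 'val': None, 'grad': None}
    def g(x, grad):
        hit = last['x'] is not None and np.array_equal(x, last['x'])
        if hit and (grad.size == 0 or last['grad'] is not None):
            if grad.size > 0:
                grad[:] = last['grad']
            return last['val']
        last['x'] = np.array(x, copy=True)
        if grad.size > 0:
            last['val'] = f(x, grad)
            last['grad'] = np.array(grad, copy=True)
        else:
            last['val'] = f(x, np.empty(0))
            last['grad'] = None
        return last['val']
    return g

# e.g.: opt.set_min_objective(cached(f))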
s***@arcor.de
2011-03-08 19:05:17 UTC
Post by Steven G. Johnson
PS. You can probably save a few more function evaluations if you
check for repeated calls with the same arguments. The reason is
that, on the line search, SLSQP often calls your function twice at
the end of the line search: once without the gradient, and then once
more (when it realizes ex post facto that the line search is done)
with the gradient.
Yes, I have noticed that. Unfortunately, I cannot make use of
this fact. I use automatic differentiation for the gradients, so
computing the derivative involves computing the function anyway.
Post by Steven G. Johnson
An alternative interface would be to have separate function calls for
the objective and its gradient, but this is problematic because the
value and gradient usually share many calculations.
Exactly! I think the interface is perfect as is, given that the gradient
computation is really avoided whenever possible.

In fact, for realistic parameter sets, computing the gradients (with
some automatic differentiation data type) consumes about 90% of the
total runtime of my program. Evaluating the function without
differentiation (data type double) is quite cheap.
Post by Steven G. Johnson
It probably wouldn't be too hard to modify the LBFGS implementation
to do something similar.
That would be great! Perhaps LBFGS would be competitive then (as long as
I have no non-linear constraints).


Best regards,
Peter
Gregory Maxwell
2011-03-08 19:14:28 UTC
Post by s***@arcor.de
Post by Steven G. Johnson
PS. You can probably save a few more function evaluations if you
check for repeated calls with the same arguments. The reason is
that, on the line search, SLSQP often calls your function twice at
the end of the line search: once without the gradient, and then once
more (when it realizes ex post facto that the line search is done)
with the gradient.
Yes, I have noticed that. Unfortunately, I cannot make use of
this fact. I use automatic differentiation for the gradients, so
computing the derivative involves computing the function anyway.
Perhaps a dumb question, but why are you using automatic
differentiation instead of a gradient-free algorithm?

Are the gradient-free algorithms in NLopt really so poor for your
application that automatic differentiation plus a with-gradient
algorithm is superior?
Steven G. Johnson
2011-03-08 19:49:25 UTC
Post by Gregory Maxwell
Post by s***@arcor.de
Post by Steven G. Johnson
PS. You can probably save a few more function evaluations if you
check for repeated calls with the same arguments. The reason is
that, on the line search, SLSQP often calls your function twice at
the end of the line search: once without the gradient, and then once
more (when it realizes ex post facto that the line search is done)
with the gradient.
Yes, I have noticed that. Unfortunately, I cannot make use of
this fact. I use automatic differentiation for the gradients, so
computing the derivative involves computing the function anyway.
Perhaps a dumb question, but why are you using automatic
differentiation instead of a gradient-free algorithm?
Are the gradient-free algorithms in NLopt really so poor for your
application that automatic differentiation plus a with-gradient
algorithm is superior?
Assuming Peter is using standard terminology, automatic
differentiation does *not* mean numerical differentiation with finite
differences. Automatic differentiation means calculating the exact
derivative analytically, just using a computer program rather than
doing it by hand: an AD tool differentiates your source code symbolically.

So, automatic differentiation can in principle be as efficient as
programming analytical derivatives by hand, and with it the
gradient-based algorithms are probably far superior to the derivative-
free algorithms for this many unknowns.

In principle, since AD takes source code to compute an objective and
generates source code to compute objective+gradient, you could
possibly edit the resulting code to remove stuff only needed for the
computation of the objective. But editing program-generated code by
hand is messy, and is fragile because you need to re-do it each time
you change your objective, so I can understand why Peter wouldn't want
to do it.

Steven
Gregory Maxwell
2011-03-08 19:53:20 UTC
Post by Steven G. Johnson
Assuming Peter is using standard terminology, automatic differentiation
does *not* mean numerical differentiation with finite differences.
Automatic differentiation means calculating the exact derivative
analytically, just using a computer program rather than doing it by
hand: an AD tool differentiates your source code symbolically.
So, automatic differentiation can in principle be as efficient as
programming analytical derivatives by hand, and with it the
gradient-based algorithms are probably far superior to the
derivative-free algorithms for this many unknowns.
In principle, since AD takes source code to compute an objective and
generates source code to compute objective+gradient, you could possibly
edit the resulting code to remove stuff only needed for the computation
of the objective. But editing program-generated code by hand is messy,
and is fragile because you need to re-do it each time you change your
objective, so I can understand why Peter wouldn't want to do it.
Ah, duh, I was tunnel-visioned on finite differences because of the
statements that the differentiation was very expensive compared to the
objective and that 'computing the derivative involves computing the
function'.

Don't mind me. :)
Steven G. Johnson
2011-03-08 19:58:34 UTC
Post by s***@arcor.de
In fact, for realistic parameter sets, computing the gradients (with
some automatic differentiation data type) consumes about 90% of the
total runtime of my program. Evaluating the function without
differentiation (data type double) is quite cheap.
Peter, this probably means that your automatic differentiation program
is poor for this kind of problem.

There is a theorem that you can always compute the gradient of f(x)
(any function from R^n to R) with about as many additional operations
as are required to compute f(x) **once**. The technique for doing so
is called an "adjoint method", or "reverse accumulation".

So you want an AD program that advertises "reverse mode"
differentiation or something similar.

In contrast, AD programs that use "forward mode" or "forward
accumulation" differentiation are poor for differentiating multi-
variate functions (R^n to R), since they have cost that grows
proportional to n. (Forward accumulation is good for differentiating
functions from R to R^m.)
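(As a toy illustration of reverse accumulation, here is a sketch of an
operator-overloading reverse-mode AD supporting only + and *. Real
tools such as CppAD handle general programs, but the key point is the
same: one backward sweep over the recorded graph yields the whole
gradient at a cost proportional to a single function evaluation,
independent of n.)

class Var(object):
    # Each operation records its inputs and local partial derivatives.
    def __init__(self, value, parents=()):
        self.value, self.parents, self.adjoint = value, parents, 0.0
    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

def gradient(output, inputs):
    # Topologically order the recorded graph, then propagate adjoints
    # backwards exactly once.
    order, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            order.append(v)
    visit(output)
    output.adjoint = 1.0
    for v in reversed(order):
        for p, local in v.parents:
            p.adjoint += v.adjoint * local
    return [x.adjoint for x in inputs]

x = [Var(1.0), Var(2.0), Var(3.0)]
y = x[0] * x[1] + x[1] * x[2]   # f(x) = x0*x1 + x1*x2
print gradient(y, x)            # [2.0, 4.0, 2.0]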

Steven
s***@arcor.de
2011-03-09 08:54:38 UTC
Post by Steven G. Johnson
Assuming Peter is using standard terminology, automatic
differentiation does *not* mean numerical differentiation with finite
differences. Automatic differentiation means calculating the exact
derivative analytically, just using a computer program rather than
doing it by hand: an AD tool differentiates your source code
symbolically.
Yes, that is precisely what I do.
Post by Steven G. Johnson
There is a theorem that you can always compute the gradient of f(x)
(any function from R^n to R) with about as many additional operations
as are required to compute f(x) **once**. The technique for doing so
is called an "adjoint method", or "reverse accumulation".
I do use a library which allows for reverse accumulation (namely CppAD),
and I think it is quite efficient. In particular, it usually requires
only a constant factor of additional computational time compared to the
function evaluation. This factor is roughly equal to 10 in my case.
The overhead is due to the fact that all arithmetic operations have to
be recorded to a 'tape' first, Taylor coefficients are computed and
propagated, and then an additional reverse sweep is performed. By
design, the function has to be evaluated before any differentiation can
be done.

Furthermore, CppAD works by overloading functions with a different data
type, not by code generation. This is essential, since my functions
contain external templated libraries relying on (large amounts of)
conditional expressions. Thus, the function I differentiate actually
varies with the parameters. Differentiating once and for all, which is
usually the goal of AD, does not work.

So, while a factor of 10 is great compared to the factor of 50 or even
100 (for 50 variables) you would get with finite differences, it still
means that most of the time is spent on the gradient.
Post by Gregory Maxwell
Perhaps a dumb question, but why are you using automatic
differentiation instead of a gradient-free algorithm? Are the
gradient-free algorithms in NLopt really so poor for your application
that automatic differentiation plus a with-gradient algorithm is
superior?
Actually, for most mathematical applications, there is a huge
performance difference between gradient-free algorithms and
gradient-based methods. If it is possible to obtain a gradient by any
means (sometimes even finite differences), the gradient-based methods
should easily beat other methods. This is particularly true if you
have more than, say, 5 variables, even with the overhead for
differentiation. For 50 variables in a non-linear problem,
gradient-free methods will perhaps never converge.
Post by Steven G. Johnson
In contrast, AD programs that use "forward mode" or "forward
accumulation" differentiation are poor for differentiating
multi-variate functions (R^n to R), since they have cost that grows
proportional to n. (Forward accumulation is good for differentiating
functions from R to R^m.)
Exactly ;-).


Best regards,
Peter
Steven G. Johnson
2011-03-09 16:00:42 UTC
Post by s***@arcor.de
Furthermore, CppAD works by overloading functions with a different
data type, not by code generation.
Ah, that's why it is slow.
Post by s***@arcor.de
For 50 variables in a non-linear problem, gradient-free methods will
perhaps never converge.
There are many gradient-free methods that are provably convergent (for
sufficiently smooth functions); this is not a general weakness of
gradient-free methods.

(The notorious counterexample is Nelder-Mead, but to be fair there
have also been historical gradient-based methods that were
subsequently found to not converge in certain cases.)

Steven