# Implementation of L-BFGS algorithm in liblbfgs

In the blog post "Optimization Algorithms - The L-BFGS Algorithm of the Quasi-Newton Method", the principle of the L-BFGS algorithm has been introduced in detail. This article reviews the algorithm and walks through its concrete implementation in the open-source library liblbfgs. liblbfgs also includes the OWL-QN algorithm for handling L1 regularization; for the detailed principle of OWL-QN, please refer to the blog post "Optimization Algorithms - OWL-QN".

# 1. liblbfgs overview

liblbfgs is a C library implementing the L-BFGS algorithm for solving nonlinear optimization problems. An introduction to the liblbfgs module can be found on its home page ( http://www.chokkan.org/software/liblbfgs/ ), and its code can be downloaded from the link below:

• https://github.com/chokkan/liblbfgs

# 2. liblbfgs source code analysis

## 2.1. The main structure of the source code

In liblbfgs, the main code consists of:

• liblbfgs-1.10/lib/lbfgs.c: the concrete implementation of the algorithm
• liblbfgs-1.10/lib/arithmetic_ansi.h: a set of vector utility functions (arithmetic_sse_double.h and arithmetic_sse_float.h are equivalent versions optimized with SSE instructions)
• liblbfgs-1.10/sample/sample.c: a test sample

## 2.2. Tools arithmetic_ansi.h

The file arithmetic_ansi.h mainly contains operations on vectors. The code in this part is relatively simple, so here is only a brief description of each function:

• Allocate a block of memory of the given size and zero-initialize it

```c
void* vecalloc(size_t size)
```

• Free the allocated memory

```c
void vecfree(void *memblock)
```

• Set every element of vector x to the constant c

```c
void vecset(lbfgsfloatval_t *x, const lbfgsfloatval_t c, const int n)
```

• Copy the values of vector x into vector y

```c
void veccpy(lbfgsfloatval_t *y, const lbfgsfloatval_t *x, const int n)
```

• Negate each value in vector x and store the result in vector y

```c
void vecncpy(lbfgsfloatval_t *y, const lbfgsfloatval_t *x, const int n)
```

• Add c times the corresponding element of vector x to each element of vector y

```c
void vecadd(lbfgsfloatval_t *y, const lbfgsfloatval_t *x, const lbfgsfloatval_t c, const int n)
```

• Compute the element-wise difference of vectors x and y, storing it in z

```c
void vecdiff(lbfgsfloatval_t *z, const lbfgsfloatval_t *x, const lbfgsfloatval_t *y, const int n)
```

• Scale vector y by the constant c

```c
void vecscale(lbfgsfloatval_t *y, const lbfgsfloatval_t c, const int n)
```

• Element-wise product of two vectors (y[i] *= x[i])

```c
void vecmul(lbfgsfloatval_t *y, const lbfgsfloatval_t *x, const int n)
```

• Dot product of two vectors

```c
void vecdot(lbfgsfloatval_t* s, const lbfgsfloatval_t *x, const lbfgsfloatval_t *y, const int n)
```

• 2-norm of a vector (square root of the dot product of x with itself)

```c
void vec2norm(lbfgsfloatval_t* s, const lbfgsfloatval_t *x, const int n)
```

• Reciprocal of the 2-norm of a vector

```c
void vec2norminv(lbfgsfloatval_t* s, const lbfgsfloatval_t *x, const int n)
```

## 2.3. Main functions of L-BFGS algorithm

In liblbfgs there is a good deal of code optimized with SSE instructions. These optimized paths are not considered here; for each of them the author also provides a basic ANSI C implementation.

### 2.3.1. Allocating and reclaiming memory space for variables

The function lbfgs_malloc is a function to allocate memory space for variables in the optimization function, and its specific form is:

```c
// allocate space for variables
lbfgsfloatval_t* lbfgs_malloc(int n)
{
// SSE-related rounding of n, not considered here
#if     defined(USE_SSE) && (defined(__SSE__) || defined(__SSE2__))
    n = round_out_variables(n);
#endif/*defined(USE_SSE)*/
    // allocate memory for n values
    return (lbfgsfloatval_t*)vecalloc(sizeof(lbfgsfloatval_t) * n);
}
```

The function lbfgs_malloc is used to allocate memory space with a length of n for variables, where lbfgsfloatval_t is the type of the defined variable, which is defined as float or double type:

```c
#if     LBFGS_FLOAT == 32
typedef float lbfgsfloatval_t;

#elif   LBFGS_FLOAT == 64
typedef double lbfgsfloatval_t;

#else
#error "libLBFGS supports single (float; LBFGS_FLOAT = 32) or double (double; LBFGS_FLOAT=64) precision only."

#endif
```

Corresponding to memory allocation is memory recovery, and its specific form is:

```c
// free space for variables
void lbfgs_free(lbfgsfloatval_t *x)
{
    vecfree(x);
}
```

### 2.3.2. Initialization of parameters in L-BFGS

The function lbfgs_parameter_init initializes an lbfgs_parameter_t structure with the default L-BFGS parameters.

In fact, the lbfgs function itself already falls back to the default parameters when none are supplied, so this method is somewhat redundant.

```c
// initialize the parameters of lbfgs with the defaults
void lbfgs_parameter_init(lbfgs_parameter_t *param)
{
    memcpy(param, &_defparam, sizeof(*param));// initialize with the default parameters
}
```

The function lbfgs_parameter_init copies the default parameters _defparam into param. lbfgs_parameter_t is the structure holding the L-BFGS parameters; its definition is omitted here, but the author's comments on its fields are particularly detailed, and a lot of important information for parameter tuning can be learned from them.

### 2.3.3. Overview of the core process of the L-BFGS algorithm

The function lbfgs is the core process of the optimization algorithm, and its parameters are:

```c
int n,// number of variables
lbfgsfloatval_t *x,// variables
lbfgsfloatval_t *ptr_fx,// objective function value
lbfgs_evaluate_t proc_evaluate,// callback that computes the objective value and gradient
lbfgs_progress_t proc_progress,// callback that reports on the progress of the computation
void *instance,// user data
lbfgs_parameter_t *_param// parameters of L-BFGS
```

Through the two callback functions proc_evaluate and proc_progress, this function evaluates the concrete objective function and handles bookkeeping during the computation.

In the function lbfgs, the basic process is described in the following subsections.

### 2.3.4. Declaration and initialization of parameters

Many parameters are declared in the initialization phase; the role of each will be introduced in detail below.

### 2.3.5. The process of loop solution

The iterative solution starts from the for loop. Within it, the first step is to use a line search strategy to select the optimal step size. The specific process is as follows:

```c
/* Store the current position and gradient vectors. */
// store the current variable values and gradient values
veccpy(xp, x, n);// copy the current variable values into the vector xp
veccpy(gp, g, n);// copy the current gradient values into the vector gp

/* Search for an optimal step. */
// line search strategy: search for the optimal step size
if (param.orthantwise_c == 0.) {// no L1 regularization
    ls = linesearch(n, x, &fx, g, d, &step, xp, gp, w, &cd, &param);// gp is the gradient
} else {// with L1 regularization
    ls = linesearch(n, x, &fx, g, d, &step, xp, pg, w, &cd, &param);// pg is the pseudo-gradient
    owlqn_pseudo_gradient(
        pg, x, g, n,
        param.orthantwise_c, param.orthantwise_start, param.orthantwise_end
    );
}
if (ls < 0) {// the termination condition has been met
    // since x and g are updated during the line search, restore the
    // variable values and gradient values from xp and gp
    /* Revert to the previous point. */
    veccpy(x, xp, n);
    veccpy(g, gp, n);
    ret = ls;
    goto lbfgs_exit;// release resources
}
```

Since the variable x and the gradient vector g are modified during the line search, both are saved into separate vectors before the search is performed. The parameter param.orthantwise_c is the coefficient of the L1 regularizer: if it is 0, no L1 regularization is used, i.e. the plain L-BFGS algorithm runs; if it is nonzero, L1 regularization is used, i.e. the OWL-QN algorithm runs.

For the specific line search methods, see section 2.3.8. The function owlqn_pseudo_gradient computes the pseudo-gradient used by OWL-QN: since the L1 term is not differentiable at some points, OWL-QN uses pseudo-gradients in place of the gradients of L-BFGS.

### 2.3.6. The termination condition of the loop

During the selection of the optimal step size, the variables are updated at the same time. The next step is to judge whether the update satisfies a termination condition. The termination conditions fall into the following three categories:

• Whether the iteration has converged

```c
vec2norm(&xnorm, x, n);// root of the sum of squares
if (param.orthantwise_c == 0.) {// no L1 regularization
    vec2norm(&gnorm, g, n);
} else {// L1 regularization
    vec2norm(&gnorm, pg, n);
}

// judge whether the iteration has converged
if (xnorm < 1.0) xnorm = 1.0;
if (gnorm / xnorm <= param.epsilon) {
    /* Convergence. */
    ret = LBFGS_SUCCESS;
    break;
}
```

The method of judging convergence is:

\frac{\left \| g\left ( x \right ) \right \|}{\max\left ( 1,\left \| x \right \| \right )}<\varepsilon

If the above formula is satisfied, it means that the convergence condition has been met.

• Whether the objective function value has dropped enough (for a minimization problem)

```c
if (pf != NULL) {// termination condition
    /* We don't test the stopping criterion while k < past. */
    // k is the iteration count; the criterion is only tested once k >= past,
    // where past is the number of past objective values that are retained
    if (param.past <= k) {
        /* Compute the relative improvement from the past. */
        // compute the ratio by which the objective function has decreased
        rate = (pf[k % param.past] - fx) / fx;

        /* The stopping criterion. */
        // whether the decrease ratio meets the condition
        if (rate < param.delta) {
            ret = LBFGS_STOP;
            break;
        }
    }

    /* Store the current value of the objective function. */
    // store the new objective function value
    pf[k % param.past] = fx;
}
```

In pf, the objective function values of the last param.past iterations are saved, and the criterion is:

\frac{f_{old}-f_{new}}{f_{new}}<\delta

where f_{old} is the objective value from param.past iterations ago and f_{new} is the current one.

• Whether the maximum number of iterations has been reached
```c
// the maximum number of iterations has been reached
if (param.max_iterations != 0 && param.max_iterations < k+1) {
    /* Maximum number of iterations. */
    ret = LBFGSERR_MAXIMUMITERATION;
    break;
}
```

If the conditions for termination are not met, then a new Hessian matrix needs to be fitted.

### 2.3.7. Fitting Hessian Matrix

In the BFGS algorithm (see "Optimization Algorithm - BFGS Algorithm of the Quasi-Newton Method"), the approximation of the inverse Hessian matrix is updated as:

H_{k+1}=\left ( I-\frac{s_ky_k^T}{y_k^Ts_k} \right )^TH_k\left ( I-\frac{y_ks_k^T}{y_k^Ts_k} \right )+\frac{s_ks_k^T}{y_k^Ts_k}

where,

y_k=\bigtriangledown f\left ( x_{k+1} \right )-\bigtriangledown f\left ( x_{k} \right )

s_k=x_{k+1}-x_{k}

In the course of this computation, the historical Hessian approximations would have to be computed and stored continuously. The L-BFGS algorithm instead uses only the information of the most recent m iterations to fit the Hessian matrix: rather than storing the full matrix H_k, it stores the vector sequences \left \{ s_k \right \} and \left \{ y_k \right \}. When H_k is required, it can be computed from these sequences, and not even the full sequences need to be saved, only the latest m pairs of vectors. The specific computation is the two-loop recursion; for the detailed principle of L-BFGS, see "Optimization Algorithm - L-BFGS Algorithm of Quasi-Newton Method".

In this process, the first loop runs over the most recent m correction pairs from newest to oldest, computing the coefficients \alpha_j; the second stage then iterates back from oldest to newest, using the coefficients computed above to obtain the current descent direction.

According to the above process, start to fit the Hessian matrix:

• Computing the vector sequences \left \{ s_k \right \} and \left \{ y_k \right \}

```c
// update the s vector and the y vector
it = &lm[end];// initially, end is 0
vecdiff(it->s, x, xp, n);// s = x - xp; xp is the previous value, x the current one
vecdiff(it->y, g, gp, n);// y = g - gp; gp is the previous gradient, g the current one
```
• The two loops

```c
// fitting the Hessian matrix of the quasi-Newton method
// the two loops of L-BFGS
j = end;
for (i = 0;i < bound;++i) {
    j = (j + m - 1) % m;    /* if (--j == -1) j = m-1; */
    it = &lm[j];
    /* \alpha_{j} = \rho_{j} s^{t}_{j} \cdot q_{k+1}. */
    vecdot(&it->alpha, it->s, d, n);// compute alpha
    it->alpha /= it->ys;// divide by y^T s, i.e. multiply by rho
    /* q_{i} = q_{i+1} - \alpha_{i} y_{i}. */
    vecadd(d, it->y, -it->alpha, n);
}

vecscale(d, ys / yy, n);

for (i = 0;i < bound;++i) {
    it = &lm[j];
    /* \beta_{j} = \rho_{j} y^t_{j} \cdot \gamma_{i}. */
    vecdot(&beta, it->y, d, n);
    beta /= it->ys;// divide by y^T s, i.e. multiply by rho
    /* \gamma_{i+1} = \gamma_{i} + (\alpha_{j} - \beta_{j}) s_{j}. */
    vecadd(d, it->s, it->alpha - beta, n);
    j = (j + 1) % m;        /* if (++j == m) j = 0; */
}
```

Among them, ys and yy are computed as follows:

```c
vecdot(&ys, it->y, it->s, n);// ys = y^T s
vecdot(&yy, it->y, it->y, n);// yy = y^T y
it->ys = ys;
```

bound and end are updated as follows:

```c
bound = (m <= k) ? m : k;// whether there are already m correction pairs
++k;
end = (end + 1) % m;
```

### 2.3.8. Line search strategy

A number of line search strategies are implemented in liblbfgs; their main job is to find the optimal step size. They will be analyzed in detail in the next post.

# 3. supplement

## 3.1. Callback function

A callback function is a process of using a function pointer to call a function. The advantage of the callback function is that the specific calculation process is passed to the calling place in the form of a function pointer, so that the calling function can be expanded more conveniently.

Suppose a function print_result needs to output the sum of two int values. If the sum is computed directly inside print_result, then changing the behavior to output the difference requires modifying print_result again. If instead a function pointer is passed as a parameter of print_result, with the concrete computation implemented in the pointed-to function, then the behavior can be extended without changing print_result, as in the following example:

• frame.h

```c
#include <stdio.h>

void print_result(int (*get_result)(int, int), int a, int b){
    printf("the final result is : %d\n", get_result(a, b));
}
```
• process.cc

```c
#include <stdio.h>
#include "frame.h"

int add(int a, int b){
    return a + b;
}

int sub(int a, int b){
    return a - b;
}

int mul(int a, int b){
    return a * b;
}

int max(int a, int b){
    return (a>b?a:b);
}

int main(){
    int a = 1;
    int b = 2;
    // pass different computations to print_result via function pointers
    print_result(add, a, b);
    print_result(sub, a, b);
    print_result(mul, a, b);
    print_result(max, a, b);
    return 0;
}
```