VFI in InfHorz: Howard's improvement, PFI, and relation to implicit finite differences

This post describes some of the various options for infinite horizon value fn iteration and what the toolkit does. It is largely built around discretized VFI, discussing pure discretization vs interpolation of the next period endogenous state. It also explains Howard's improvement, both iteration and greedy, and finally how discretized VFI with linear interpolation and greedy-Howard improvement is understood to deliver the same major gains as the implicit finite differences method in continuous time.


The problem we want to solve is the infinite horizon value fn problem given by the Bellman eqn,

V(a,z)=max_{a'} F(a',a,z) + \beta E[V(a',z')|z]

Let's just start out by thinking about the pure discretization version of the problem, so we say that there is a grid on the current endogenous state a given by a_grid=\{a_1,a_2,\dots,a_{n_a}\} and we require the next period endogenous state to be chosen on that same grid. We also have a grid on z given by z_grid=\{z_1,z_2,\dots,z_{n_z}\}, and a transition probability matrix for z given by pi_z=[p_{ij}], where p_{ij} is the probability of transitioning from z_i this period to z_j next period, i,j=1,2,\dots,n_z.

So our pure discretization infinite horizon value function problem is given by,

V(a_i,z_j)=max_{a' \in a_grid} F(a',a_i,z_j) + \beta E[V(a',z')|z_j]

for i=1,\dots,n_a, j=1,\dots,n_z.

We can solve this by standard value function iteration, which starts from an initial guess V_0, and then iterates,

V_m(a_i,z_j)=max_{a' \in a_grid} F(a',a_i,z_j) + \beta E[V_{m-1}(a',z')|z_j]

for m=1,2,\dots and we keep going until V_m converges, which we check by ||V_m(a,z)-V_{m-1}(a,z)||<tolerance, where tolerance is a criterion for convergence (say 10^{-9}).
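As a concrete reference point, here is a minimal MATLAB sketch of this pure-discretization VFI loop. The names are assumptions, not toolkit code: ReturnMatrix holds F(a',a,z) evaluated on the full grid, and for simplicity the sketch ignores the -Inf/NaN issue discussed near the end of the post.

% Minimal pure-discretization VFI sketch (hypothetical names).
% ReturnMatrix is n_a-by-n_a-by-n_z: F(a',a,z) on the grids.
% pi_z is n_z-by-n_z; beta is the discount factor.
V=zeros(n_a,n_z); % initial guess V_0
tolerance=1e-9; currdist=Inf;
while currdist>tolerance
    Vold=V;
    EV=Vold*pi_z'; % E[V(a',z')|z], n_a-by-n_z, rows index a'
    RHS=ReturnMatrix+beta*reshape(EV,[n_a,1,n_z]); % implicit expansion over a
    [V,Policy]=max(RHS,[],1); % the max, over a'
    V=shiftdim(V,1); % back to n_a-by-n_z
    currdist=max(abs(V(:)-Vold(:)));
end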

The problem with this baseline VFI is that it is slow. Within each iteration there are three main steps: evaluate the expectations, take the max, update the left-hand side. The most computationally expensive of these is the max, so if we can avoid doing it so many times we can be faster. One way to do this is a method like the Endogenous Grid Method (EGM), which replaces the max with a function inversion (and then requires an additional interpolation); this is very fast, but very tricky to code in more advanced models.

Howard's improvement algorithm is a way of reducing the number of times we do this computationally expensive max operation. The intuition is that we don't need to do the max and then use that policy (the argmax) to update V_m only once; instead we can use the policy to update multiple times before we next do the max. So we do iteration m,

V_m(a_i,z_j)=max_{a' \in a_grid} F(a',a_i,z_j) + \beta E[V_{m-1,H}(a',z')|z_j]

and then we do H Howard-improvement-iterations indexed by h=1,2,\dots,H,

V_{m,h}(a_i,z_j)= F(a*'_m,a_i,z_j) + \beta E[V_{m,h-1}(a*'_m,z')|z_j]

where a*'_m is the argmax from iteration m of the VFI (the optimal policy from the value function iteration immediately before we use the Howard improvement iterations), and V_{m,0} is defined to be V_m. After the H Howard-iterations we do the m+1 VFI iteration,

V_{m+1}(a_i,z_j)=max_{a' \in a_grid} F(a',a_i,z_j) + \beta E[V_{m,H}(a',z')|z_j]

Notice how the expectation for the m+1 iteration is based on the final Howard-iteration of the m-th iteration.
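Continuing the sketch above, a minimal version of the Howard improvement iterations might look as follows (hypothetical names again; this sits inside the main loop, right after the max, with Policy holding indexes into a_grid):

% Howard improvement iterations (hypothetical names, continuing the sketch above).
Policy=Policy(:); % n_a*n_z-by-1, ordered a first then z
Ftemp=ReturnMatrix(Policy+n_a*(0:n_a*n_z-1)'); % F(a*',a,z), one value per (a,z)
zind=kron((1:n_z)',ones(n_a,1)); % z index of each (a,z) point
for h=1:H
    EV=V*pi_z'; % E[V(a',z')|z] on the grid, (a',z)
    V=reshape(Ftemp+beta*EV(Policy+n_a*(zind-1)),[n_a,n_z]);
end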

If we consider the Howard improvement iterations as H \to \infty, we get greedy-Howard, often called PFI (Policy Function Iteration). This amounts to calculating V_{m,\infty}, which satisfies

V_{m,\infty}(a_i,z_j)=F(a*'_m,a_i,z_j) + \beta E[V_{m,\infty}(a*'_m,z')|z_j]

which is to say it is the value function that comes from evaluating the policy a*'_m; the 'policy-greedy' value fn is another way of saying this, which is why I refer to this method as greedy-Howard.

When it comes to coding, there is something important to notice about the greedy-Howard problem, namely that the equation,

V_{m,\infty}(a_i,z_j)=F(a*'_m,a_i,z_j) + \beta E[V_{m,\infty}(a*'_m,z')|z_j]

with pure discretization is just a system of linear equations. This can be seen more explicitly if we rewrite the expectation term, taking advantage of the fact that the expectations can be written as a matrix operation,

V_{m,\infty}(a_i,z_j)=F(a*'_m,a_i,z_j) + \beta [T_E V_{m,\infty}](a_i,z_j)

where T_E is the expectations operator expressed as a transition mapping from today's state (a,z) to tomorrow's state (a',z'). Rewriting in matrix notation we get

V_{m,\infty}=F_{a*'} + \beta T_E V_{m,\infty}

and by some matrix algebra

V_{m,\infty}=(\mathbb{I} - \beta T_E)^{-1} F_{a*'}

where V_{m,\infty} is an n_a-by-n_z matrix (stacked as a vector of length n_a*n_z for the matrix algebra), F_{a*'} is an n_a-by-n_z matrix (the return function, evaluated at the iteration-m optimal policy; also stacked), and T_E is n_a*n_z-by-n_a*n_z (it is the combination of the policy function and the exogenous shock transition probability matrix). The implementation of greedy-Howard (a.k.a. PFI) can thus be done as a single step by evaluating V_{m,\infty}=(\mathbb{I} - \beta T_E)^{-1} F_{a*'}.
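To make this concrete, here is a minimal MATLAB sketch of greedy-Howard as a single sparse linear solve (hypothetical names; Policy is the n_a*n_z-by-1 vector of optimal a' indexes and Ftemp is F evaluated at those choices, as in the earlier sketch):

% Greedy-Howard (PFI) as one sparse linear solve (hypothetical names).
T_E=sparse(repelem((1:n_a*n_z)',1,n_z),... % rows: today's state (a,z)
    Policy+n_a*(0:n_z-1),... % cols: tomorrow's state (a',z')
    repelem(pi_z,n_a,1),n_a*n_z,n_a*n_z); % values: pi_z(z,z')
V=(speye(n_a*n_z)-beta*T_E)\Ftemp; % V_{m,infinity}, stacked as a vector
V=reshape(V,[n_a,n_z]);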

Summing up what we have so far: we can do VFI to solve this infinite horizon value fn problem, and we can speed it up using Howard improvement iterations (doing H iterations) or the greedy-Howard improvement (implicitly infinitely many iterations, but we can do it as just one line of matrix algebra). Obviously we could do Howard's, greedy or iteration, between every iteration of VFI, or only between some of them.

So using Howard's will be faster than just VFI without it. But do we want Howard iteration or greedy? It turns out that the 'solve a system of linear equations' step in Howard-greedy is very fast but also suffers severely from the curse of dimensionality, which massively slows it down as the model size increases. Thus the answer is that for small problems we should do Howard-greedy, and for anything but small problems we should do Howard iteration.


I want to turn next to redoing all of this with interpolation. But before that, a quick digression to make a few points.

First, to do the VFI iterations, or the Howard improvement iterations (but not greedy-Howard), we need to evaluate E[V(a*',z')|z]. We have V(a',z'), a*', and pi_z (the next period value fn, the policy, and the transition probabilities for the exogenous shock). You would typically do this by first taking V(a',z')*pi_z^{transpose} to get E[V(a',z')|z] (this is just a matrix multiplication; note that V(a',z') is a matrix on (a',z'), pi_z is a matrix on (z,z'), and E[V(a',z')|z] is a matrix on (a',z)). Then you would use the policy a*' to index E[V(a',z')|z] to get E[V(a*',z')|z] (a*' is a matrix of size (a,z) containing indexes for a', so you need to include the appropriate z indexes while doing this).

The way we set up greedy-Howard points to an alternative way to calculate E[V(a*',z')|z]. Namely, you could combine a*' (which maps from (a,z) to a') with pi_z (which maps from z to z') to get T_E, which maps from (a,z) to (a',z'). Then you would just apply T_E to V (stacked as a vector) to get E[V(a*',z')|z]. T_E is very sparse, and so any implementation needs to be sure to set T_E up as a sparse matrix.

So we have two options: use pi_z and a*' in two steps, or create T_E and use it once.

All this might look a bit familiar. Once we have the optimal policy, we will want to iterate the agent distribution \mu. We do this by combining the optimal policy a*' with the transition matrix for the exogenous shocks pi_z to create a transition matrix on the model state-space, T, mapping from (a,z) to (a',z'). We then iterate by taking \mu_m=T^{transpose} \mu_{m-1}. We can massively speed this up by using the Tan improvement, which simply observes that instead of doing \mu_m=T^{transpose} \mu_{m-1}, we can do it as two steps: first \mu_{m,intermediate}=G_{a*'}^{transpose} \mu_{m-1} and second \mu_m=pi_z^{transpose} \mu_{m,intermediate}; G_{a*'} is just a*' rewritten as a (very sparse) transition matrix from (a,z) to a', rather than the usual a*' as a matrix over (a,z) containing the indexes for a'.
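A minimal MATLAB sketch of the Tan improvement, under the assumption that the policy transition is stored as one sparse n_a-by-n_a matrix Gz{z_c} per shock value (hypothetical names):

% Tan improvement sketch (hypothetical names). mu is n_a-by-n_z.
mu1=zeros(n_a,n_z);
for z_c=1:n_z
    mu1(:,z_c)=Gz{z_c}'*mu(:,z_c); % step 1: policy moves a to a', z unchanged
end
mu=mu1*pi_z; % step 2: exogenous shock transition moves z to z'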

Notice how the Tan improvement now just looks like what we would normally do to get the expected next period value function, rather than evaluating the next period value function using T_E. Except that when we go backwards in the VFI we do exogenous shock transitions, then policies, while going forwards in the agent dist we do policies, then exogenous shock transitions.

Two last side notes. First, rational expectations is just the assumption that T (in the agent dist iterations) is equal to T^*_E (used for agent expectations) [to be precise, T^*_E is for the solution to the VFI, not just one of the many m iterations]. Second, near the beginning I said there were three steps to VFI: expectations, maximization, and evaluating the left-hand side. But I have said nothing about the third. This is because when we use discretization to solve VFI the third step is so trivial that there is effectively nothing to do. But if you do fitted VFI (say, approximate the value fn with Chebyshev polynomials), then after you solve the right-hand side of the Bellman eqn (doing expectations, then max) you still need to do a fitting step, which is about evaluating the left-hand side.


Now, let's rethink everything above about VFI, but allowing for interpolation of a', so that we are no longer restricting it to take values on a_grid.

This time our value fn problem becomes,

V(a_i,z_j)=max_{a'} F(a',a_i,z_j) + \beta E[V(a',z')|z_j]

for i=1,\dots,n_a, j=1,\dots,n_z. Notice that a, z and z' are all on the grids, but a' is not required to be on the grid. I will however assume that a' can only take values between the min and max grid points of a_grid, so that we only need interpolation and no extrapolation.

There are a few different ways we could do the interpolation.

One is that we could use a root-finding method, like Newton-Raphson, to find a*' (which is just a number, conditional on (a,z)). This would involve trying lots of candidate values of a', evaluating the return fn F(a',a_i,z_j) exactly as it is a known function, and then interpolating E[V(a',z')|z_j] for values of a', as the expectation is known only on the grid on a. In principle, we could use any kind of interpolation for E[V(a',z')|z_j], such as Akima cubic splines, but here I will just cover linear interpolation.

Second is that once we assume linear interpolation, we can take derivatives (using finite differences) of the return fn and value fn (based on their values on the grid) to find the a*' from the first-order condition.

Third, which is what the grid interpolation layer in VFI Toolkit does, is that we could just consider n 'additional' grid points on a', evenly spaced between every two consecutive points on a_grid. We can evaluate the return fn F(a',a_i,z_j) exactly as it is a known function, and then interpolate E[V(a',z')|z_j] for values of a', as the expectation is known only on the grid on a. Again, in principle we could use any kind of interpolation for E[V(a',z')|z_j], such as Akima cubic splines, but here I will just cover linear interpolation, which is what VFI Toolkit uses.

Regardless of which of these three methods is used, we get a*' that (potentially/likely) takes values between the points on a_grid. The second of these methods is going to be faster than the first (as it does not have to evaluate lots of different candidate a*'), but has the weakness that the return fn is interpolated rather than exact, so it is less accurate. The third is much more GPU friendly than the first (but worse for CPU) as it is parallel rather than serial, although at the cost that a' is still on a 'fine' grid.
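A minimal MATLAB sketch of the third approach (the grid interpolation layer), with n_fine extra points between each pair of consecutive a_grid points; ReturnMatrixFine is an assumption, holding F(a',a,z) with a' on the fine grid, computed exactly from the known return fn:

% Grid interpolation layer sketch (hypothetical names).
a_fine=interp1(1:n_a,a_grid,linspace(1,n_a,(n_a-1)*(n_fine+1)+1)'); % fine grid on a'
EV=V*pi_z'; % E[V(a',z')|z] on the coarse grid, (a',z)
EVfine=interp1(a_grid,EV,a_fine); % linear interpolation in a', column by column
RHS=ReturnMatrixFine+beta*reshape(EVfine,[],1,n_z); % (a'fine,a,z)
[V,PolicyFine]=max(RHS,[],1); % max over the fine grid on a'
V=shiftdim(V,1); % n_a-by-n_z; PolicyFine indexes into a_fine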


Enough about how we can do the interpolation; we are solving a discretized value function problem but allowing a' to take 'any' value. [I distinguish 'pure discretization', where a' is on the grid, from 'discretization', where a' can be between grid points.] Let's go back to VFI and Howard's and see what changes now that we are interpolating.

VFI now becomes: start from an initial guess V_0, and then iterate,

V_m(a_i,z_j)=max_{a'} F(a',a_i,z_j) + \beta E[V_{m-1}(a',z')|z_j]

for m=1,2,\dots and we keep going until V_m converges, which we check by ||V_m(a,z)-V_{m-1}(a,z)||<tolerance, where tolerance is a criterion for convergence (say 10^{-9}). The only change relative to the pure discretization VFI that we saw earlier is that here a' is no longer constrained to be in a_grid. Not obvious in the equations is that E[V_{m-1}(a',z')|z_j] is now more complicated than before: it only exists on the grid, but we need to evaluate it for a' that are not on the grid.

Howard improvement iterations and greedy-Howard are also just the same as before, but with a' no longer constrained to be on the grid, and with evaluating the expectations term and T_E, respectively, now more complicated than before. For greedy-Howard, we again have,

V_{m,\infty}=(\mathbb{I} - \beta T_E)^{-1} F_{a*'}

and T_E is again a transition matrix mapping from (a,z) to (a',z'), all on the grid. It is still constructed by combining the policy a*' with the transition probabilities for the exogenous shock pi_z. But now our policy a*' takes values between the grid points. This is easy enough to deal with: just find the grid points below and above the a*' value (the lower and upper grid points) and then assign linear weights to these (so if a*' is halfway between the lower and upper grid points, we put a weight of half on each; if a*' is one quarter of the way from the lower to the upper grid point, we put a weight of 3/4 on the lower and 1/4 on the upper; more generally, put a weight of 1-(a*'-a_l)/(a_u-a_l) on the lower grid point, where a_l and a_u are the values of the lower and upper grid points). Recall that we saw how T_E for greedy-Howard is the familiar T matrix for the agent distribution iteration, and there this linear interpolation onto the grid (also known as the 'lottery' method) is well-known.
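In MATLAB the lottery weights can be computed in a couple of lines (a sketch; aprimeVal is a hypothetical column vector of off-grid optimal a*' values):

% Lottery weights for off-grid a*' values (hypothetical names).
jL=discretize(aprimeVal,a_grid); % index of the grid point just below each a*'
wL=(a_grid(jL+1)-aprimeVal)./(a_grid(jL+1)-a_grid(jL)); % weight on the lower point
% a*' is then: probability wL on a_grid(jL), probability 1-wL on a_grid(jL+1)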

For Howard iteration we again do H Howard-improvement-iterations indexed by h=1,2,\dots,H,

V_{m,h}(a_i,z_j)= F(a*'_m,a_i,z_j) + \beta E[V_{m,h-1}(a*'_m,z')|z_j]

where a*'_m is the argmax from iteration m of the VFI (the optimal policy from the value function iteration immediately before we use the Howard improvement iterations); V_{m,0} is defined to be V_m. The complication is that now a*'_m is no longer on a_grid. Because the return fn is a known function, there is no issue there and F(a*'_m,a_i,z_j) can just be evaluated. This leaves us with how to evaluate the expectations E[V_{m,h-1}(a*'_m,z')|z_j]. One option would be to create T_E and use that, but it is likely to be inefficient for the same reason the Tan improvement is better than using T, namely that T_E contains a lot of additional unnecessary calculations (as it ignores that z transitions are independent of a). The obvious alternative is to use the same first step as when we didn't do interpolation, so first taking V(a',z')*pi_z^{transpose} to get E[V(a',z')|z] (this is just a matrix multiplication; note that V(a',z') is a matrix on (a',z'), pi_z is a matrix on (z,z'), and E[V(a',z')|z] is a matrix on (a',z)). But now a*' can take values not on a_grid, while E[V(a',z')|z] only exists on a_grid. There are two roughly equivalent things we can do here. The first would be to simply linearly interpolate E[V(a',z')|z] onto the value of a*'. The second would be to switch a*' to a lower (and upper) grid index and associated lower (and upper) grid probabilities; we would then index E[V(a',z')|z] at the lower and upper grid points, and take the weighted sum of these using the lower and upper grid probabilities. Note, one way to do the latter would be to create the G_{a*'} matrix from a*', which would be a transition matrix from (a,z) to a' that has two non-zeros per row for the lower and upper grid points, with the values being the lower and upper grid probabilities (the only reason this might make sense is that it can be reused for all the H iterations).
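A minimal MATLAB sketch of the second of these (index at the lower and upper grid points and take the weighted sum), reusing the hypothetical jL, wL, zind and Ftemp from the earlier sketches:

% Howard iterations with an off-grid policy (hypothetical names).
% jL, wL: n_a*n_z-by-1 lower grid indexes and weights from the lottery step.
% zind: z index of each (a,z) point. Ftemp: F(a*',a,z), precomputed once.
for h=1:H
    EV=V*pi_z'; % E[V(a',z')|z] on the grid, (a',z)
    EVlow=EV(jL+n_a*(zind-1)); % EV at the lower bracketing grid point
    EVup=EV(jL+1+n_a*(zind-1)); % EV at the upper bracketing grid point
    V=reshape(Ftemp+beta*(wL.*EVlow+(1-wL).*EVup),[n_a,n_z]);
end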

One obvious, but seemingly little-used, improvement to discretized VFI with linear interpolation is to take a multi-grid approach: first just solve the pure-discretization VFI problem for a few iterations, and then switch to solving the linear interpolation discretized VFI problem for the remaining iterations. Multi-grid in this way looks like a cheap way to quickly get a decent initial guess, and in infinite-horizon VFI a good initial guess does wonders for runtimes.
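As a sketch, with hypothetical helper functions wrapping the two loops above:

% Multi-grid sketch (VFI_puredisc and VFI_gridinterp are hypothetical helper fns).
V0=zeros(n_a,n_z);
V0=VFI_puredisc(V0,ReturnMatrix,pi_z,beta,1e-4); % cheap pass, loose tolerance
[V,Policy]=VFI_gridinterp(V0,ReturnMatrixFine,pi_z,beta,1e-9); % full tolerance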


Okay, we have covered discretized VFI with pure discretization, and with (linear) interpolation of the next period endogenous state. You would definitely always want to use Howard's improvement, either iteration or greedy.

How does all of this relate to the implicit finite difference method in continuous time? Rendahl (2022) argues that the main speed gains of the implicit finite difference method come from solving a (sparse) system of linear equations, and that this step is essentially the same thing you are doing when solving discretized VFI with linear interpolation and greedy-Howard. He provides a bunch of runtime results to back this up (his discrete time is marginally faster than the continuous time). In terms of the interpolation, Rendahl uses the second of the three linear interpolation approaches described above, so he uses derivatives to linearly interpolate the return fn, rather than evaluating it exactly.

We can also relate all of this to 'Markov Chain Approximation methods'. Phelan & Eslami (2022) (with corrigendum) argue that Markov Chain Approximation methods are a good way to solve continuous time problems, and that implicit finite differences is just one specific example of a Markov Chain Approximation method. They also highlight how 'Markov Chain approximations' are similar to —but not the same as, see next paragraph— what is here called 'discretized VFI' (ignoring that they first have the additional step of getting from the continuous time problem to a discrete time one). For our purposes their point can best be understood as saying that if you solve a discretized VFI there are two main choices: (i) you need to choose an interpolation method (they follow the 'derivatives' approach like Rendahl; they call this a 'choice of chain', and while their paper is all based on fixed grid spacing, in general the grid is also part of the choice of chain), and (ii) you need to choose whether to use Howard-greedy or Howard iteration (they call them PFI and modified-PFI). They illustrate runtimes showing that implicit finite differences (which is Howard-greedy, and linear interpolation by derivatives) is faster for small problems, but that switching to Howard iteration is faster for even just a model with one endogenous state and two exogenous shocks.

Bakota & Kredler (2022WP) show how to adapt Markov Chain Approximation methods to discrete time, and this helps highlight the differences from discretized VFI. Specifically, in discretized VFI we (a) set a grid on the endogenous state, a_grid, (b) set a grid and transition matrix for the exogenous shocks, z_grid and pi_z, (c) take expectations and interpolate value functions (for a' values), (d) solve for the optimal policy (by a max operation), (e) update the value function. By contrast, for Markov Chain Approximation methods, (c) is done by defining a probability distribution over all points. [Note that in discretized VFI, if we use linear interpolation we are assigning weights/probabilities to a_grid, and for the expectations we are assigning weights/probabilities to z_grid; so any difference is just in how the probabilities are assigned, both methods are effectively assigning probabilities to all grid points.] MCA then takes a further step and does a linear moment approximation, taking a first-order approximation of the distribution vector over grid points in the 1st and 2nd moments of the stochastic process. [MCA can, if needed, take a further approximation of the law-of-motion/return-function.] So we can understand MCA (and the adaptation to discrete time by Bakota & Kredler) as follows: discretized VFI with linear interpolation implemented by the derivatives (the second linear interpolation method described above), with one further 'linear moment assumption' (and of course you want to use Howard iteration or Howard-greedy).

To understand this linear moment assumption, first consider pure discretized VFI (so the next period endogenous state is chosen on the grid). If we model 'years of education' as an endogenous state, it is an 'incremental endogenous state', as the next period/year's value is either today's value or today's value plus one (so it can only move by an 'increment' of one). Compared to a standard endogenous state where we have n_a values today and then consider all n_a values tomorrow, an incremental endogenous state is much faster to compute, as while we still have n_a values today, we have just 2 values tomorrow; to say the same thing with different words, the state-space is unchanged, but the choice-space is much smaller. The 'linear moment approximation' of MCA can here be understood as saying that the endogenous state is approximated as a bi-incremental endogenous state, so the next period value either stays the same, or goes up or down one grid point. Now we add the linear interpolation step back in, and we get that the 'linear moment approximation' of MCA is to assume that instead of linearly interpolating for all possible next period values, we impose that you can at most go up or down one grid point, so we just have to interpolate between 'this period's grid point and one grid point up' and between 'this period's grid point and one grid point down'. Again, this gives a speed increase (and memory decrease) for the same reason: we keep the same state-space, but massively reduce the next-period choice-space. [Their 'second' approximation can be roughly understood as saying that if you have two choices, then don't consider the interaction/cross-term; which is what continuous time does, always ignoring interactions.] Viewed in this way, the key assumption of the MCA method is that you impose only moving up or down at most one grid point; this cuts runtimes and memory use, but obviously will be inaccurate if the true policy moves further than that. Note that this (deliberately) looks a lot like the continuous-time assumption, which is that you can only move up or down an epsilon-amount.
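A minimal MATLAB sketch of the pure-discretization version of this restriction, which replaces the max over all n_a values of a' with a max over three candidate moves (hypothetical names; Fmove is an assumption, holding F for each of the three moves):

% Bi-incremental choice sketch (hypothetical names).
% Fmove is n_a-by-n_z-by-3: F(a',a,z) for the moves a'=a-1, a, a+1 (index-wise).
EV=V*pi_z'; % E[V(a',z')|z] on the grid, (a',z)
Vcand=-inf(n_a,n_z,3);
moves=[-1,0,1];
for m=1:3
    ap=min(max((1:n_a)'+moves(m),1),n_a); % candidate a' index, clipped at the edges
    Vcand(:,:,m)=Fmove(:,:,m)+beta*EV(ap,:);
end
[V,moveind]=max(Vcand,[],3); % max over just three candidates, not n_a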

More approximation comes with the usual pros and cons, namely faster but less accurate. The main limitation of their linear moment approximation is clearly the requirement that the continuous state does not move up or down more than one grid point in expectation. With one continuous state variable, if the assumption does hold, MCA will give the same solution as discretized VFI with linear interpolation (which we know converges to the true solution under standard assumptions). With multiple continuous endogenous states, even when the assumption does hold, the MCA solution will differ subtly from discretized VFI with linear interpolation. There are two tricks that can make the assumption more likely to hold (or at least close to holding, so the approximation errors are small). One is to use a coarse grid on the endogenous state —which comes with obvious limits to the accuracy of the algorithm— and the other is to make the model time period very small —which means the discount factor will be larger, which increases the runtimes for infinite horizon VFI problems.

In principle, the 'bi-incremental endogenous state' assumption (which they call the 'linear moment assumption') is a separate decision from how we do the interpolation, and from the choice of Howard iteration or greedy. So you could do any combination of these. This 'bi-incremental endogenous state' is also another way to see what the continuous-time assumption is implicitly imposing.


A few final details on implementing codes. When doing Howard improvement iteration, the return fn evaluated at the current policy, F(a*'_m,a_i,z_j), is the same for all iterations h=1,\dots,H, so it should be precomputed (it is just a matrix on (a,z)). When doing greedy-Howard improvement, T_E is very very sparse, so make sure you set it up as a sparse matrix (T in the agent distribution is also very very sparse, but there you want to use the Tan improvement so you won't create T).

Using Howard iteration (but not greedy) leads to minor differences in the solution to the value function problem, roughly of the same order of magnitude as our solution tolerance (you can see that it leads to different steps in the iteration, so it makes sense it might change the solution slightly). This is irrelevant if you just want to solve stationary general equilibrium, but if you want to solve transition paths it can matter, because you will solve the transition path without Howard's (Howard's doesn't exist in a finite horizon problem, which is what a transition path is in practice). So if you use the Howard's solution, and try to solve a transition path where 'nothing happens', those tiny differences in the solution can accumulate over the transition path and result in a tiny non-zero change in policies. As soon as you have any change in policy, the 'nothing happens' transition path will blow up. For this reason VFI Toolkit turns off Howard iteration once you get within an order of magnitude of the solution tolerance, so that the final solution is from VFI and not influenced by Howard's; this increases runtimes very slightly, but is more reliable for transition paths. This is implemented in VFI Toolkit code as only doing Howard's if 'currdist/Tolerance>10'.

Using Howard's can also be problematic in the first few iterations of VFI. Specifically, if the value of V can contain -\infty, then in the first few iterations of VFI these -\infty get spread out around the relevant parts of V. Using Howard's, iteration or greedy, while the -\infty are being spread can cause them to be spread to all sorts of places, contaminating your value fn with -\infty and breaking everything. For this reason you need to avoid Howard's in the first few iterations of VFI if/when the -\infty are being moved around. VFI Toolkit does this by checking that ||V_m(a,z)-V_{m-1}(a,z)|| is finite (so no new -\infty), and only if this is true is Howard's used (this is checked each m iteration of the VFI). This is implemented in VFI Toolkit code as only doing Howard's if 'isfinite(currdist)'.


This is a great post!
Related to this, here I have my own implementation of the Howard algo with linear interpolation, which loosely follows Pontus Rendahl's paper: https://github.com/aledinola/Aiyagari_Interpolation/blob/main/VFI_interp2.m

As of now, it is written for CPU, but I think it should work also on GPU once I manage to vectorize the loop over z.

Here is the complete implementation of Howard with linear interpolation. The main thing is to build a big transition matrix Q(x,x') where x=(a,z) is the vector of state variables and Q(x,x') gives the probability of moving from x to x'. This matrix Q is T_E in your notation, unless I am mistaken. It is a large matrix, with N_a*N_z rows and N_a*N_z columns, but most of its elements are zero, and Matlab is really good with sparse matrices.

%---------------------------------------------------------------------%
    % Howard update
    Ftemp = Aiyagari1994_ReturnFn_cpu(Policy,a_grid,z_grid',Params.alpha,Params.delta,Params.mu,Params.r);
    Ftemp_vec = reshape(Ftemp,[N_a*N_z,1]);

    % Find interp indexes and weights
    [aprime_opt,weight_opt] = find_loc_vec2(a_grid,Policy);

    % Build big transition matrix from (a,z) to (a',z')
    G = cell(N_z,1);
    NA = (1:N_a)';
    for z_c = 1:N_z
        G{z_c} = sparse(NA,aprime_opt(:,z_c),weight_opt(:,z_c),N_a,N_a)+...
        sparse(NA,aprime_opt(:,z_c)+1,1-weight_opt(:,z_c),N_a,N_a);
    end
    Q = cell(N_z,1);
    for z_c = 1:N_z
        % Kron product of (1,nz) with (na,na) gives (na,nz*na) matrix
        Q{z_c} = kron(pi_z(z_c,:),G{z_c}); %dim: (na,na*nz)
    end %close z
    Qmat = vertcat(Q{:});
    
    Ibig = speye(N_a*N_z,N_a*N_z);
    %V_howard = (Ibig-DiscountFactorParamsVec*Qmat)\Ftemp_vec;
    VKron = bicgstab((Ibig-DiscountFactorParamsVec*Qmat),Ftemp_vec,1e-6,100);
    VKron = reshape(VKron,[N_a,N_z]);
    %---------------------------------------------------------------------%

The code above can be seen as an implementation of what you call PFI or the greedy-Howard algorithm. This works very well on the CPU but I think it could be ported to the GPU following the steps described below.

Moving this from CPU to GPU
(1) The part on

Ftemp = Aiyagari1994_ReturnFn_cpu(Policy,a_grid,z_grid',Params.alpha,Params.delta,Params.mu,Params.r);

can be easily replaced with a call to arrayfun, after having the arrays Policy, a_grid and z_grid on the GPU.
(2) The construction of the sparse matrix should work well on GPU, especially if I vectorize the loop over z

for z_c = 1:N_z
   G{z_c} = sparse(NA,aprime_opt(:,z_c),weight_opt(:,z_c),N_a,N_a)+...
        sparse(NA,aprime_opt(:,z_c)+1,1-weight_opt(:,z_c),N_a,N_a);
end

Note that aprime_opt gives the left grid point that brackets the optimal choice a', while weight_opt is the corresponding linear weight (on the left point).

(3) bicgstab on the GPU. On the CPU, I started with the \ operator to solve the linear system but realized that bicgstab is faster, especially for larger problems (i.e. if N_a*N_z is large). Matlab documentation claims that bicgstab works also on gpu arrays, but I haven't tried yet (see here for documentation: https://uk.mathworks.com/help/matlab/ref/bicgstab.html)

(4) find_loc_vec2 is a simple function to calculate the left grid point and the associated weight, and internally calls the Matlab function discretize. I think discretize also works on gpu arrays as input arguments since I've seen it used in the toolkit somewhere else (for experience assets?)

NOTE: How is this approach related to Pontus Rendahl's paper

Essentially, what Pontus does in his paper (among many other things of course) is to implement the Howard algo for the case with linear interpolation (so he extends what is explained in Ljungqvist and Sargent's book for the discrete case). He then argues that you can reuse the matrix T_E to find the invariant distribution \mu(a,z), calling this "killing two birds with the same stone". This is however not efficient: to find the invariant distribution it is better to use the Tan method, which is a bit different.


Update

I have implemented the Howard acceleration algorithm with interpolation on the GPU here:

https://github.com/aledinola/Aiyagari_Interpolation/blob/main/VFI_interp2_gpu.m

The Aiyagari example with n_k=1000, n_z=21 and n_fine=30 (i.e. 30 extra grid points between points on the original grid for a') runs in about 1.1 seconds on my PC.

Main file: Aiyagari1994_interp.m
Grid sizes are: 1000 points for assets, and 21 points for exogenous shock 
Elapsed time is 1.120794 seconds.

To do Howard, I changed it a bit with respect to the code posted above. In particular, I follow the Tan improvement in reverse order: first I update V using the exogenous transition \pi(z,z'), thus getting EV(a',z). Then I update a using the policy function a'(a,z). The second step is done with sparse matrix multiplication: I build a sparse transition from (a,z) to (a',z).

Step 1 of Howard:
EV(a',z) = V(a',z')*\pi(z,z')^{T}

Step 2 of Howard:
V(a,z) = F(a^{\prime*}(a,z),a,z) + \beta Q((a,z),(a',z))*EV(a',z),

where * denotes matrix multiplication and Q((a,z),(a',z)) is a sparse matrix (on the GPU) with size [N_a*N_z, N_a*N_z].

The code in VFI_interp2_gpu could be further improved by vectorizing the loop over z.

Technical note: in the Howard, I reshape the gpuArray VKron, which is on gpu but is not sparse. The matrix Qmat is a sparse gpuArray but I don’t need to reshape it, so it’s fine.

Name          Size             Bytes  Class       Attributes

  VKron      1000x21            168000  gpuArray   

Name          Size                Bytes  Class       Attributes

  Qmat      21000x21000            588004  gpuArray    sparse    

If the link does not work, this should be ok:
https://github.com/aledinola/Aiyagari_Interpolation


Nice. Re your first post, the Rendahl (2022) paper now gets a mention in my more complete post.

Yeah, I’m currently building some runtime tests for both ‘pure discretization’ and ‘grid interpolation layer’, with both ‘Howard iteration’ and ‘Howard greedy’. I plan to put the runtimes in a table alongside those reported in Rendahl (2022) [Pontus Rendahl’s codes for continuous and discrete time are both available on this website]. I also want to look at how much the VFI Toolkit adds in terms of overhead runtime. Plan is to add a table of runtimes to my post.

I will have to try playing with 'bicgstab' while doing the Howard-greedy. Given the way the grid interpolation layer is currently implemented there is no need for 'find_loc_vec2', although experienceasset and the like do require that kind of operation.

PS. Pontus' code could be slightly faster, as he creates a 'griddedInterpolant' and then uses it only once (and does this every iteration), so it would be faster to just use 'interp1'. Plus, as Alessandro observed above, he iterates the agent distribution but isn't using the Tan improvement.


Is the Howard greedy algorithm implemented in the subfolder PFI?

The line in PolicyFnIter_Case1_NoD_Par2_raw.m

VKron=(eye(N_a*N_z)-beta*(Ptranspose'))\Ftemp

uses the backslash operator to solve the linear system. Two comments:

  1. The matrix Ptranspose should be sparse since it contains many zeros, and the identity matrix should be sparse as well. Also, here I would use bicgstab instead of the backslash.
  2. The matrix Ptranspose gives the transition from (a,z) to (a',z') if I read the code correctly. It is however faster to split this transition into two separate steps (same logic as for the Tan improvement).

Apologies if I misread your code :)

That is ex-code :) you won't find the implementation in the toolkit yet as it has not been pushed.
[was my last attempt years ago to do PFI, back when I was much worse at sparse matrices than nowadays]


Not strictly related to this post, but the new Matlab R2025a supports reshape of sparse gpu matrices

https://uk.mathworks.com/help/parallel-computing/release-notes.html?s_tid=CRUX_lftnav#mw_4b993cba-51fa-49b1-93ec-f4ac02b65a6c

Might be useful for Tan improvement and (maybe) for Howard


wow, cool! Will have to try it out on Tan improvement.

It might make memory a bottleneck, but otherwise it should cut runtimes even further (if nothing else, then because of saving the overhead of moving Policy from gpu to cpu).


I tested the new Matlab R2025a and it is slightly slower on the Aiyagari example. (Of course these things depend on operating system, cpu and gpu used, etc. I am not claiming that R2025a is significantly slower in general.)

While working on my model with entrepreneurs in infinite horizon (and pure discretization), I rewrote the function ValueFnIter_nod_raw and got a 50% speed improvement. The major change is in the implementation of the Howard improvement step. I share my code below.

As you can see, the function follows the original version as closely as possible. In my model with N_a=1000 and N_z=80, the running time of ValueFnIter_nod_raw shrinks from 3.8 seconds to 1.43 seconds.
If you trust me, you could replace the current version with this one in the toolkit :)

Of course, you may want to test the performance for different values of N_a and N_z (perhaps I have more points for the z variables than in most cases), but I doubt the runtime comparison would change much.
The main reason for the speed-up is using a sparse matrix in the Howard step.

function [VKron, Policy]=ValueFnIter_nod_raw(VKron, n_a, n_z, pi_z, DiscountFactorParamsVec, ReturnMatrix, Howards,Howards2,Tolerance,maxiter) % Verbose, a_grid, z_grid,
%Does pretty much exactly the same as ValueFnIter_Case1, only without any decision variable (n_d=0)

N_a=prod(n_a);
N_z=prod(n_z);

pi_z_transpose = pi_z';
pi_z_alt = shiftdim(pi_z_transpose,-1);
NaNzVec  = gpuArray.colon(1,N_a*N_z)';

% Precompute variables for Howard improvement step
ReturnMatrix1 = reshape(ReturnMatrix,[N_a,N_a*N_z]);
NaVec   = gpuArray.colon(1,N_a)';
NzVec   = gpuArray.colon(1,N_z)';
[a_ind,z_ind] = ndgrid(NaVec,NzVec);
a_ind = a_ind(:);
z_ind = z_ind(:);
ind = a_ind+(z_ind-1)*N_a;

%%
tempcounter=1;
currdist=Inf;
while currdist>Tolerance && tempcounter<=maxiter
    VKronold=VKron;

    % Calc the condl expectation term (except beta), which depends on z but not on control variables
    EV=VKronold.*pi_z_alt;
    EV(isnan(EV))=0; % multiplications of -Inf with 0 give NaN; this replaces them with zeros (as the zeros come from the transition probabilities)
    EV=sum(EV,2); % sum over z', leaving a singular second dimension

    entireRHS=ReturnMatrix+DiscountFactorParamsVec*EV; %aprime by a by z

    %Calc the max and its index
    [VKron,Policy]=max(entireRHS,[],1);

    VKron=shiftdim(VKron,1); % a by z

    VKrondist=VKron(:)-VKronold(:);
    VKrondist(isnan(VKrondist))=0;
    currdist=max(abs(VKrondist));
    %disp(currdist)

    % Use Howards Policy Fn Iteration Improvement 
    % (except for first few and last few iterations, as it is not a good idea there)
    if isfinite(currdist) && currdist/Tolerance>10 && tempcounter<Howards2

        Policy=Policy(:); % [N_a*N_z,1] (this shape is just convenient for Howards)
        tempmaxindex = Policy+(NaNzVec-1)*N_a; % like sub2ind
        Ftemp=ReturnMatrix1(tempmaxindex); % [N_a*N_z,1]
        indp = Policy+(z_ind-1)*N_a;
        Qmat = sparse(ind,indp,1,N_a*N_z,N_a*N_z);
        for Howards_counter=1:Howards
            EV_howard = VKron*pi_z_transpose; % (a',z)
            EV_howard = reshape(EV_howard,[N_a*N_z,1]);
            VKron = Ftemp+DiscountFactorParamsVec*Qmat*EV_howard;
            VKron = reshape(VKron,[N_a,N_z]);
        end
    end

    tempcounter=tempcounter+1;
end

Policy=reshape(Policy,[N_a,N_z]);

end %end function ValueFnIter_nod_raw

Cool. I will definitely adopt something like this, but will probably take a week or three until I do.


Did a bunch of runtime tests for the pure discretized VFI.

Results were:

  • Howard-greedy is fastest for very small problems; the VFI Toolkit default has been set to use greedy when N_a<400 || N_z<20
  • For anything but very small problems, VFI Toolkit uses Howard iteration as that is faster
  • Howard iteration is faster with indexing than with a sparse matrix, for the expectations.
  • Overhead costs of the ValueFnIter command versus the hardcode are negligible.

[Not covered in these codes, but by default VFI Toolkit will 'refine' away a 'd' (decision) variable. Then based on N_a and N_z it will use Howard greedy or iteration on what is left]

Codes: InfHorzVFIruntimes/purediscVFI_InfHorz_runtimes.m at main ¡ robertdkirkby/InfHorzVFIruntimes ¡ GitHub

Note to self: didn't try bicgstab for Howard-greedy yet. Should add that.


Interesting! The results look reasonable. I have also found that Howard greedy with the backslash operator is slow for large problems. Slightly surprised that Howard iterations with sparse matrix is slower than Howard iterations with full arrays.

Small comments:

Issue 1
on lines 212-213, shouldn’t you generate azind on the gpu? In the current version, when the code runs the line

 Q = sparse(azind(:),indp(:),1,N_a*N_z,N_a*N_z);

it has to convert azind from cpu to gpu, since azind is on cpu but indp is a gpu array

Issue 2
Also, on line 272, I would replace

spI = speye(N_a*N_z);

with

spI = gpuArray.speye(N_a*N_z);

as suggested here: https://www.mathworks.com/help/parallel-computing/work-with-sparse-arrays-on-a-gpu.html

Issue 3

This line

 T_E=sparse(repelem((1:1:N_a*N_z)',1,N_z),Policy(:)+N_a*(0:1:N_z-1),repelem(pi_z,N_a,1),N_a*N_z,N_a*N_z);

has variables that can be precomputed:

xx1 = repelem((1:1:N_a*N_z)',1,N_z);
xx2 = repelem(pi_z,N_a,1);

Probably these modifications will have a minor impact on the runtimes, but you never know :D

I think my runtimes differ from yours: in my case, Howard iterations with sparse matrix is faster than Howard iterations with indexing and without sparse

Howard iteration with indexing

time_ret = 0.23 
time_vfi = 3.32 
Total    = 3.55

Howard iteration with sparse:

time_ret = 0.39 
time_vfi = 1.23 
Total    = 1.62

Parameters:

n_a = 1001
n_z = [41 2]
tolerance: 1.0000e-09
howards: 80
maxhowards: 500

See update below

I redid my tests a bit better, using Robert’s InfHorzVFIruntimes as a template, comparing three methods of doing Howard:

  1. Howard iterations with indexing
  2. Howard iterations with sparse matrix
  3. Howard greedy with bicgstab

The results are something like time3>time1~time2. Method 3 is definitely slower than the first two, but 1 vs 2 is not very clear-cut: the two methods have similar running times.

RUNNING TIMES
fun_VFI_iter_indexing: 0.335993 
fun_VFI_iter_sparse:   0.262829 
fun_VFI_iter_bicgstab: 1.043208 
err2:   0.000000 
err3:   0.000000 

As a test model, I have considered the income fluctuation problem (basically, the partial equilibrium block of most Bewley models, as described on quantecon) and coded it up in IFP_ReturnFn:

function F = IFP_ReturnFn(aprime,a,z,r,w,gamma)

c = w*z + (1+r)*a - aprime;

F = -inf;
if c>0
    F = c^(1-gamma)/(1-gamma);
end

end

(I don't like using the model in Rendahl 2022 as a test case since it is hardcoded that z has two values).

I have coded the first method, based on Howard iterations and indexing, in the function fun_VFI_iter_indexing, the second method, based on Howard iteration with sparse matrix, in fun_VFI_iter_sparse and finally the bicgstab method in fun_VFI_iter_bicgstab.

The user can run the tests from the script main:

%% Method 1: Howard with iterations, indexing

tic
[Policy1] = fun_VFI_iter_indexing(V0,pi_z,N_a,N_z,ReturnMatrix,...
    DiscountFactorParamsVec,Tolerance,maxiter,maxhowards,H);
time1=toc;

%% Method 2: Howard with iterations, sparse matrix

tic
[Policy2] = fun_VFI_iter_sparse(V0,pi_z,N_a,N_z,ReturnMatrix,...
    DiscountFactorParamsVec,Tolerance,maxiter,maxhowards,H);
time2=toc;

%% Method 3: Howard greedy with bicgstab
tic
[Policy3] = fun_VFI_iter_bicgstab(V0,pi_z,N_a,N_z,ReturnMatrix,...
    DiscountFactorParamsVec,Tolerance,maxiter,maxhowards,H);
time3=toc;

err2 = max(abs(Policy2-Policy1),[],"all");
err3 = max(abs(Policy3-Policy1),[],"all");

disp('RUNNING TIMES')
fprintf('fun_VFI_iter_indexing: %f \n',time1)
fprintf('fun_VFI_iter_sparse:   %f \n',time2)
fprintf('fun_VFI_iter_bicgstab: %f \n',time3)
fprintf('err2:   %f \n',err2)
fprintf('err3:   %f \n',err3)

I find the running times very volatile from one run to the next. On average the two methods have similar performance. Maybe I should use the built-in gputimeit to time the performance.

Questions:

  • The bicgstab method seems unstable, in the sense that if I change the convergence criterion, the VFI might not converge. I wonder if I made some inaccuracy in the implementation (I wouldn't say mistake, since when it converges the results are correct).
  • The line EV(isnan(EV))=0 or similar, to get rid of NaN values, takes some time, especially in the Howard step. Is it really needed? Can it be eliminated at least from the Howard step?

My files are available on Github: GitHub - aledinola/InfHorzVFIruntimes: tests of runtimes for different variations of VFI in InfHorz

Interestingly, running your codes on my gpu, I get that iteration with indexing is 0.27s and with sparse matrix is 0.32s. My takeaway is that there is nothing really to be gained/lost from indexing vs sparse matrix for iteration, so I am just going to leave the toolkit doing indexing. Your runtimes also tell me that bicgstab() is not worthwhile.

Cleaned up my codes precomputing things (I had done this in the final implementations in toolkit, but not in my runtime tests).

EV gets NaN when V is -Inf and pi_z is 0 (as -Inf*0 gives a NaN). Because the toolkit needs to allow for V to be -Inf and pi_z to contain zeros, I can't see a way around this step (there is probably something clever, especially in the Howard iterations, that can mostly avoid it). [My codes tried skipping this step just for the Rendahl (2022) model, but the runtime gains were minor.]

Cleaned up so that there is now a vfoptions.howardsgreedy so the user can control which to use. The default is vfoptions.howardsgreedy=1 if N_a<400 || N_z<20, =0 otherwise.

I also realized that one of the advantages of MCA is that it essentially imposes a good initial guess. If you set V_0=0, then VFI chooses Policy_1(a_i,z_j)=1 (index, choosing the minimum aprime value, as the ReturnFn is decreasing in aprime), while MCA chooses Policy_1(a_i,z_j)=a_{i-1}, since it cannot go more than one point down the grid. The latter of these is much closer to the solution, and so it generates a value fn much closer to the solution after the first iteration. The following graph shows V_0, the V_1 for VFI vs MCA, and the V_\infty solution for Rendahl (2022) [solid line is z=1, dotted is z=0], and as is obvious MCA gets way closer on the first iteration. That said, runtime gains from this were unreliable (across grid sizes for a and z) and never exceeded a 15% reduction in runtimes, so it is not a big deal. One way to think of it is that with MCA, what you put for V_0 is largely irrelevant, as your Policy_1 is kind of chosen for you by the MCA assumption (that you can only move up/down one grid point), and this Policy_1 and associated V_1 is really the MCA initial guess.


So the criterion to choose the Howard algorithm (greedy vs iterations) is not based on the total number of points on the state space? If you have na=10000 but nz = 7, you want to choose Howard greedy?

I have no idea which to choose for na=10000 & nz=7, so for now it is greedy :P

If anyone wants to do some runtime tests for situations like this I'm happy to set a better default :D


Did a bunch of runtime tests for the linear interpolation with discretized VFI (gridinterplayer, which is the third linear interpolation approach described in the original post).

Results were:

  • Howard-greedy is always fastest [so it is the default]
  • Multi-grid is marginally faster, but by very little [so it is the default]
  • Howard iteration is faster with indexing than with a sparse matrix, for the expectations.
  • Overhead costs of the ValueFnIter command versus the hardcode are negligible.

[Not covered in these codes, but by default VFI Toolkit will 'refine' away a 'd' (decision) variable. Then based on N_a and N_z it will use Howard greedy or iteration on what is left]

Codes: InfHorzVFIruntimes/gridinterplayerVFI_InfHorz_runtimes.m at main ¡ robertdkirkby/InfHorzVFIruntimes ¡ GitHub


You can now use the grid interpolation layer to solve the Aiyagari (1994) model. I want to get it working for the transition path, and once that is done I will activate it in the example code.

Note-to-self: Still want to try relative/endogenous VFI (Bray, 2019), and maybe one of the more advanced update approaches (beyond Howard's).
