Aiyagari example on CPU, part I

I wanted to check whether the toolkit works on the CPU for some basic models. This might be useful for teaching, or whenever a coauthor does not have an NVIDIA GPU :slight_smile:

So I tried the example based on Aiyagari here: VFItoolkit-matlab-examples/HeterogeneousAgentModels/Aiyagari1994.m at master · vfitoolkit/VFItoolkit-matlab-examples · GitHub

and I discovered that the main bottleneck is the computation of the return matrix over (a’,a,z). I would like to propose a couple of fixes that greatly improve the speed of the code.

FIX 1: Compute ReturnMatrix
In the current version of the toolkit, this step is very slow. The computation is done by the function CreateReturnFnMatrix_Case1_Disc, and the bottleneck is here:

Fmatrix=zeros(N_a,N_a,N_z);
for i1=1:N_a
    for i2=1:N_a
        for i3=1:N_z
            tempcell=num2cell([a_gridvals(i1,:),a_gridvals(i2,:),z_gridvals(i3,:)]);
            Fmatrix(i1,i2,i3)=ReturnFn(tempcell{:},ParamCell{:});
        end
    end
end

The first thing to do is to reorder the loops to conform to MATLAB's column-major ordering: the innermost loop should run over a’, then a, and then z. After doing this, one can vectorize the innermost loop over a’. The speed gain is huge: from more than 5 minutes down to about 2 seconds.

Fmatrix=zeros(N_a,N_a,N_z);
for i3=1:N_z
    for i2=1:N_a
        % Vectorized over a' (the first dimension):
        Fmatrix(:,i2,i3)=ReturnFn(a_gridvals,a_gridvals(i2),z_gridvals(i3),ParamCell{:});
    end
end

The problem with this approach is that the function Aiyagari1994_ReturnFn needs to be vectorized as well, as I show below:

function F=Aiyagari1994_ReturnFn_cpu(aprime, a, s, alpha, delta, mu, r)
% Vectorized return function: aprime may be a vector, a and s are scalars.
% The return function is essentially the combination of the utility
% function and the constraints.

w = (1-alpha)*((r+delta)/alpha)^(alpha/(alpha-1)); % wage implied by r
c = w*s+(1+r)*a-aprime; % budget constraint

if mu==1
    F = log(c);
else
    F = (c.^(1-mu)-1)/(1-mu); % CRRA utility
end

F(c<=0) = -Inf; % infeasible choices get utility -Inf

end %end function
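To illustrate, here is a minimal call pattern for the vectorized return function (the grid and parameter values below are made up for illustration): it evaluates the utility of all a' choices at once, for a single (a,s) pair.

```matlab
% Made-up grid and parameters, just to show the vectorized call:
a_grid = linspace(0,20,500)';   % column vector of a' values
F_col  = Aiyagari1994_ReturnFn_cpu(a_grid, a_grid(10), 1.0, 0.36, 0.08, 2, 0.04);
% F_col is 500x1, with -Inf wherever consumption is non-positive
```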

A better option may be to set vfoptions.returnmatrix=1 in the toolkit. Unfortunately, it currently gives an error (I wrote another post on this).

FIX 2: Value function iteration
After making the computation of ReturnMatrix faster, the VFI is still slow. The code is here (recall that this is the case without d variable):

The code above is slow because there are loops over a and z that can be vectorized. I simply copied and pasted from the GPU version and the speedup is great. The GPU version is here:

Note however that the line

ccc=kron(ones(N_a,1,'gpuArray'),bbb);

needs to be replaced by

ccc=kron(ones(N_a,1),bbb);
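For context, kron(ones(N,1),bbb) simply stacks N vertical copies of bbb; a tiny standalone example:

```matlab
% kron with a column of ones replicates the matrix vertically:
bbb = [1 2; 3 4];
ccc = kron(ones(2,1), bbb);
% ccc = [1 2; 3 4; 1 2; 3 4]
```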

Also, in the old CPU version the input maxiter is missing.
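For intuition, the vectorized CPU update essentially replaces the loops over a and z with matrix operations. Here is a generic sketch of one Bellman update (not the toolkit's actual code; sizes and values are made up, and it assumes V is N_a-by-N_z and pi_z is the z transition matrix with rows summing to 1):

```matlab
% Generic sketch of one vectorized Bellman update on the CPU:
N_a = 4; N_z = 2; beta = 0.96;
Fmatrix = rand(N_a,N_a,N_z);         % return matrix over (a',a,z)
pi_z    = [0.9 0.1; 0.2 0.8];        % z transition matrix
V       = zeros(N_a,N_z);            % current value function guess

EV = reshape(V*pi_z.', [N_a,1,N_z]); % E[V(a',z')|z], N_a x 1 x N_z
[Vnew,Policy] = max(Fmatrix + beta*EV, [], 1); % max over a' (dim 1)
Vnew   = shiftdim(Vnew,1);           % back to N_a x N_z
Policy = shiftdim(Policy,1);         % index of optimal a'
```

The addition Fmatrix + beta*EV relies on implicit expansion (MATLAB R2016b or later); on older versions one would use bsxfun.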

After making all these changes, it turns out that the computation of the value function is fast enough. Below I compare the time on the GPU vs. the CPU (based on the two modifications described above).

GPU TIME

Creating return fn matrix
Time to create return fn matrix: 0.2239
Starting Value Function
Time to solve for Value Fn and Policy: 0.3411
Transforming Value Fn and Optimal Policy matrices back out of Kronecker Form
Time to create UnKron Value Fn and Policy: 0.0019

CPU TIME (parallel=0)
NOTE: When using CPU you can speed things up by giving return fn as a matrix; see vfoptions.returnmatrix=1 in VFI Toolkit documentation.
Time to create return fn matrix: 1.9314
Starting Value Function
Time to solve for Value Fn and Policy: 0.2940
Transforming Value Fn and Optimal Policy matrices back out of Kronecker Form
Time to create UnKron Value Fn and Policy: 1.4339

Note in particular that the CPU is as fast as the GPU on “Value Fn and Policy”. If you precompute the return matrix with the option vfoptions.returnmatrix=1, then the CPU code will be almost as fast as the GPU overall (see my second post on this).

Of course, the VFI Toolkit is mostly built to take advantage of parallelization on the GPU, but it would be nice to have some basic commands also work well without a GPU (for students, non-Windows users, etc.).

I have done the second of these two points (overhauled ValueFnIter_Case1_NoD_raw.m). But I don’t really want to do the first (overhaul CreateReturnFnMatrix_Case1_Disc.m) because, as you point out, it requires modifying the ReturnFn. I feel that keeping the CPU running the same code as the GPU is more important than making the CPU fast. The CPU is never going to perform as well as the GPU (given the kind of discretization algorithms the toolkit uses), and I feel the harmonization matters more than CPU speed, as the CPU is really just there as a demo.

If you see a way to get some of that speed without needing to modify the ReturnFn let me know and I am happy to do that.


Hi Robert, I agree with you (but see my next post)! I found a way to improve the running time of CreateReturnFnMatrix_Case1_Disc.m. On a Mac, the time to create the return matrix falls from 3.9 seconds to 2.8 seconds. This fix is not as good as vectorization, but it does not require changing the return function.

What I propose is to replace this part of the code:

Fmatrix=zeros(N_a,N_a,N_z);
for i1=1:N_a
    for i2=1:N_a
        for i3=1:N_z
            tempcell=num2cell([a_gridvals(i1,:),a_gridvals(i2,:),z_gridvals(i3,:)]);
            Fmatrix(i1,i2,i3)=ReturnFn(tempcell{:},ParamCell{:});
        end
    end
end

with the following:

Fmatrix=zeros(N_a,N_a,N_z);
for i3=1:N_z
    tempcell3=num2cell(z_gridvals(i3,:));
    for i2=1:N_a
        tempcell2=num2cell(a_gridvals(i2,:));
        for i1=1:N_a
            tempcell1=num2cell(a_gridvals(i1,:));
            Fmatrix(i1,i2,i3)=ReturnFn(tempcell1{:},tempcell2{:},tempcell3{:},ParamCell{:});
        end
    end
end
  • Please note that my proposed change refers to the case N_d=0; something similar has to be done for the case N_d>0.
  • The logic is the following: profiling shows that the num2cell call converting the numeric array to a cell array takes a long time, so it is better for performance to split it by a’, a and z, hoisting each conversion out of the inner loops.
  • The best solution would be to eliminate the cell arrays altogether and do the following (running time falls to 1 second):
Fmatrix=zeros(N_a,N_a,N_z);
for i3=1:N_z
    for i2=1:N_a
        for i1=1:N_a
            Fmatrix(i1,i2,i3)=ReturnFn(a_gridvals(i1),a_gridvals(i2),z_gridvals(i3),ParamCell{:});
        end
    end
end

I think you use a cell array to allow for a variable number of inputs in the return function, right? In the simple Aiyagari example we have only one a and one z variable, but in principle there could be more. This technique is called a “comma-separated list” and is used a lot in MATLAB to pass a variable number of inputs to a function. The same logic applies to ParamCell. Unfortunately, this trick is slow in MATLAB :frowning:
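To illustrate what a comma-separated list does (the parameter values here are made up, reusing the return function from above):

```matlab
% A cell array expanded with {:} becomes a comma-separated list of arguments:
params = {0.36, 0.08, 2, 0.04};   % e.g. {alpha, delta, mu, r}
F = Aiyagari1994_ReturnFn_cpu(0.5, 0.5, 1.0, params{:});
% params{:} expands to 0.36, 0.08, 2, 0.04 -- four separate arguments
```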

P.S. I’m sure you know this already, but I’m writing it to my future self and to document things :slight_smile:


Regarding CPU vs GPU, I fully agree with you that the toolkit really shines on the GPU, but I was pleasantly surprised that some important parts of the code also run fast on the CPU (I am referring to the VFI after your overhaul, and also to the Tan improvement, which is only on the CPU!). The real bottleneck on the CPU is the creation of the return matrix. arrayfun on the GPU does some magic! :slight_smile:

I ended up hardcoding that if d, a, and z are all 1D, then I use matrices. Hopefully that gives the speed you describe. The cells are only used for higher dimensions; as you say, the cells allow me to not have to know how many of each there are (which is especially important for how ParamCell is used).
