I wanted to test whether the toolkit works on the CPU for some basic models. This might be useful for teaching, or whenever a coauthor does not have an NVIDIA GPU.
So I tried the example based on Aiyagari here: VFItoolkit-matlab-examples/HeterogeneousAgentModels/Aiyagari1994.m at master · vfitoolkit/VFItoolkit-matlab-examples · GitHub
and I discovered that the main bottleneck is the computation of the return matrix over (a',a,z). I would like to propose a couple of fixes that greatly improve the speed of the code.
FIX 1: Compute ReturnMatrix
In the current version of the toolkit, this computation takes forever. It is carried out by the function CreateReturnFnMatrix_Case1_Disc, and the bottleneck is here:
Fmatrix=zeros(N_a,N_a,N_z);
for i1=1:N_a
    for i2=1:N_a
        for i3=1:N_z
            tempcell=num2cell([a_gridvals(i1,:),a_gridvals(i2,:),z_gridvals(i3,:)]);
            Fmatrix(i1,i2,i3)=ReturnFn(tempcell{:},ParamCell{:});
        end
    end
end
The first thing to do is to reorder the loops to conform to MATLAB's column-major ordering: the innermost loop should be over a', then a, and then z. After doing this, one can vectorize the innermost loop over a'. The speed gain is huge: from more than 5 minutes down to 2 seconds.
Fmatrix=zeros(N_a,N_a,N_z);
for i3=1:N_z
    for i2=1:N_a
        % Innermost loop over i1 replaced by a single vectorized call
        % over the whole a' grid:
        Fmatrix(:,i2,i3)=ReturnFn(a_gridvals,a_gridvals(i2),z_gridvals(i3),ParamCell{:});
    end
end
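In fact, when a' is a scalar variable (as in Aiyagari, where a_gridvals is an N_a-by-1 column vector), MATLAB's implicit expansion (R2016b+) lets you drop the loops entirely. A sketch, assuming the return function broadcasts over arrays as the vectorized version below does:

aprime = reshape(a_gridvals,[N_a,1,1]); % varies along dimension 1
a      = reshape(a_gridvals,[1,N_a,1]); % varies along dimension 2
z      = reshape(z_gridvals,[1,1,N_z]); % varies along dimension 3
Fmatrix = ReturnFn(aprime,a,z,ParamCell{:}); % N_a x N_a x N_z in one call

Whether this beats the single-loop version depends on memory pressure, but it avoids the loop overhead entirely.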
The caveat with this approach is that the function Aiyagari1994_ReturnFn needs to be vectorized as well, as I show below:
function F=Aiyagari1994_ReturnFn_cpu(aprime,a,s,alpha,delta,mu,r)
% Vectorized return function: aprime may be a vector (the whole a' grid).
% The return function is essentially the combination of the utility
% function and the constraints.
w = (1-alpha)*((r+delta)/alpha)^(alpha/(alpha-1)); % wage implied by r
c = w*s+(1+r)*a-aprime; % budget constraint (elementwise in aprime)
if mu==1
    F = log(c);
else
    F = (c.^(1-mu)-1)/(1-mu); % CRRA utility
end
F(c<=0) = -Inf; % infeasible choices get -Inf
end % end function
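A quick sanity check (with made-up parameter values, not calibrated ones) that the vectorized function agrees with evaluating it one a' at a time:

% Hypothetical parameter values, for testing only:
alpha=0.36; delta=0.08; mu=3; r=0.04;
a_grid = linspace(0,20,5)'; s = 1; a = a_grid(3);
Fvec = Aiyagari1994_ReturnFn_cpu(a_grid,a,s,alpha,delta,mu,r);
for i1 = 1:numel(a_grid)
    Fpt = Aiyagari1994_ReturnFn_cpu(a_grid(i1),a,s,alpha,delta,mu,r);
    assert(isequal(Fvec(i1),Fpt)); % also holds when both are -Inf
end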
A better option could be to use the option vfoptions.returnmatrix=1 in the toolkit. Unfortunately, it currently gives an error (I wrote another post on this).
FIX 2: Value function iteration
After making the computation of the ReturnMatrix faster, the VFI itself is still slow. The code is here (recall that this is the case without a d variable):
The code above is slow because the loops over a and z can be vectorized. I simply copied and pasted from the GPU version, and the speedup is great. The GPU version is here:
Note however that the line
ccc=kron(ones(N_a,1,'gpuArray'),bbb);
needs to be replaced by
ccc=kron(ones(N_a,1),bbb);
Also, in the old CPU version the input maxiter is missing.
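For readers who do not want to dig through the toolkit, the vectorized Bellman update I am referring to looks roughly like this. This is a sketch with assumed variable names (ReturnMatrix is N_a x N_a x N_z with a' on the first dimension, pi_z is the N_z x N_z transition matrix with rows indexed by today's z, beta is the discount factor, V is the N_a x N_z current guess); the toolkit's actual code also handles Howard improvement and convergence checks:

EV = V*pi_z';                          % E[V(a',z')|z], N_a x N_z
EV = reshape(EV,[N_a,1,N_z]);          % align a' with dimension 1
entireRHS = ReturnMatrix + beta*EV;    % implicit expansion over dimension 2
[Vnew,Policy] = max(entireRHS,[],1);   % max over a' for each (a,z)
Vnew = shiftdim(Vnew,1);               % N_a x N_z updated value function
Policy = shiftdim(Policy,1);           % index of the optimal a'

This runs on the CPU as-is, and on the GPU if the inputs are gpuArrays.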
After making all these changes, the computation of the value function is fast enough. Below I compare the time on the GPU vs. the CPU (based on the two modifications described above).
GPU TIME
Creating return fn matrix
Time to create return fn matrix: 0.2239
Starting Value Function
Time to solve for Value Fn and Policy: 0.3411
Transforming Value Fn and Optimal Policy matrices back out of Kronecker Form
Time to create UnKron Value Fn and Policy: 0.0019
CPU TIME (parallel=0)
NOTE: When using CPU you can speed things up by giving return fn as a matrix; see vfoptions.returnmatrix=1 in VFI Toolkit documentation.
Time to create return fn matrix: 1.9314
Starting Value Function
Time to solve for Value Fn and Policy: 0.2940
Transforming Value Fn and Optimal Policy matrices back out of Kronecker Form
Time to create UnKron Value Fn and Policy: 1.4339
Note in particular that the CPU is as fast as the GPU on "Value Fn and Policy". If you precompute the return matrix with the option returnmatrix=1, then the CPU code will be almost as fast as the GPU overall (see my second post on this).
Of course, the VFI Toolkit is mostly built to take advantage of parallelization on the GPU, but it would be nice to have some basic commands also work well without a GPU (for students, non-Windows users, etc.).