Grid interpolation in infinite-horizon models

I think I found a bug in ValueFnIter_Refine_postGI_raw. I wanted to disable the Howard improvement step to do some checks, so I set vfoptions.maxhowards = 0, as suggested by @robertdkirkby, but I got this error message:

Index exceeds matrix dimension.

Error in ValueFnIter_Refine_postGI_raw (line 276)
Policy(1,:,:)=reshape(dstar(temppolicyindex),[N_a,N_z]);
                      ^^^^^^^^^^^^^^^^^^^^^^
Error in ValueFnIter_GridInterpLayer (line 149)
                    [V,Policy]=ValueFnIter_Refine_postGI_raw(V0, n_d, n_a, n_z, d_gridvals, a_grid, z_gridvals, pi_z, ReturnFn, DiscountFactorParamsVec, ReturnFnParamsVec, vfoptions);
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in ValueFnIter_Case1 (line 465)
    [V,Policy]=ValueFnIter_GridInterpLayer(V0, n_d, n_a, n_z, d_gridvals, a_grid, z_gridvals, pi_z, ReturnFn, DiscountFactorParamsVec, ReturnFnParamsVec, vfoptions);
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Error in main (line 193)
[~, Policy] = ValueFnIter_Case1( ...
              ^^^^^^^^^^^^^^^^^^^^^^

Of course in practice one would never want to disable Howard, but I thought it was worth pointing out. This brings me to another issue I have encountered. I wanted to stress-test the routine I wrote for grid interpolation with a d variable and sparse Howard, and compare it to the default routine in the toolkit. The two routines are

  • ValueFnIter_Refine_postGI_raw (the default, which works well)
  • ValueFnIter_postGI_sparse_raw (the new routine)

In the Pijoan-Mas model, I found a specific parameterization where, unfortunately, ValueFnIter_postGI_sparse_raw fails to converge. Here is the code: PijoanMasTaxes/main.m at main · aledinola/PijoanMasTaxes · GitHub

I set the usual options:

vfoptions.howardsgreedy = 0;
vfoptions.howardssparse = 0;
vfoptions.gridinterplayer = 1;
vfoptions.ngridinterp     = 15;

This calls the default routine for interpolation, ValueFnIter_Refine_postGI_raw, and it runs well. When I set vfoptions.howardssparse = 1; the code instead calls ValueFnIter_postGI_sparse_raw and unfortunately it fails to converge: the distance (variable currdist) falls initially but then oscillates between 0.0002 and 0.0003.

@robertdkirkby I was very happy when you accepted this contribution into the toolkit, but it seems that my function, although faster, is less stable than yours :face_with_open_eyes_and_hand_over_mouth:. Do you happen to see some easy-to-spot inaccuracy or lack of robustness in the code of ValueFnIter_postGI_sparse_raw? VFIToolkit-matlab/ValueFnIter/InfHorz/GridInterpLayer/ValueFnIter_postGI_sparse_raw.m at master · vfitoolkit/VFIToolkit-matlab · GitHub

If I increase the number of grid points from n_a=600 to n_a=700, my function converges. However, I'm struggling to understand where the problem lies, given that the default function always converges no matter what.


I found another issue with the toolkit function ValueFnIter_Refine_postGI_raw.m (this is one of the default functions, I think). I noticed that we still use interpMatrix in a couple of places. For me it is fine, but based on our previous discussion we had decided not to use it for the time being, due to some potential inconsistencies with -Inf that @MichaelTiemann found?


I ran out of memory when I reached that line in the code (the multiplication of interpMatrix with EV). So I went back to the solution without interpolation, but it is a pity since it is less accurate.
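To quantify it, here is a rough back-of-envelope with hypothetical grid sizes (not necessarily the ones I used): both the dense interpMatrix and the product interpMatrix*EV get large quickly.

```matlab
% Back-of-envelope memory cost, with hypothetical grid sizes.
N_a=1000; N_z=21; ninterp=50;
N_fine=N_a+(N_a-1)*ninterp;            % 50950 fine grid points
GB_interpMatrix=N_fine*N_a*8/1e9;      % approx 0.41 GB when stored with full()
GB_product=N_fine*N_a*N_z*8/1e9;       % approx 8.56 GB for interpMatrix*EV
fprintf('interpMatrix: %.2f GB, product: %.2f GB\n',GB_interpMatrix,GB_product)
```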


Fixing/improving ValueFnIter_Refine_postGI_raw.m is a priority for me, in particular finding a good way to implement the interpolation step. But I am away on holidays for the next two weeks, so it is a priority that I won’t get to until I return.

@javier_fernan good to know that this can be a memory bottleneck, I was not aware.


I tried to run your Pijoan-Mas code with the grid interp layer and sparse Howard and it does not converge. If I set howards sparse to 0 it converges, but it is much slower. I had a look at the files but they are quite complicated for me…


I just fixed ValueFnIter_Refine_postGI_raw

The problem was the sparse matrix approach to interpolation, which cannot handle -Inf. I have switched back to the interp1() approach to implementing the grid interpolation layer.
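To illustrate the -Inf problem with toy numbers (not the toolkit's actual matrices): in IEEE arithmetic 0*(-Inf) is NaN, so any fine grid point whose zero weight multiplies a -Inf entry of EV comes out NaN instead of the finite value it should be.

```matlab
% Toy example: first coarse grid point has EV=-Inf (infeasible choice)
EV=[-Inf; 2; 3];
W=[1 0 0; 0.5 0.5 0; 0 1 0];   % rows = fine points, columns = coarse points
W*EV   % row 2: 0.5*(-Inf)+0.5*2 = -Inf (fine)
       % row 3: 0*(-Inf)+1*2+0*3 = NaN, although the correct value is 2
```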

Will look at the sparse howards shortly.


Grid interpolation layer is now back to being done with interp1().

Based on my runtime tests this was anyway the fastest. The sparse matrix version is close, but is actually marginally slower than interp1(), and it has the problem that it cannot handle -Inf.

Following is copy-paste of my runtime tests


% Some runtime tests to figure out a custom version of interp1() that can
% be used to implement grid interpolation layer.

% Everything is done on GPU

%% The place I use interp1() was for interpolating the expectations term,
N_aprime=101;
N_a=101;
N_z=21;

vfoptions.ngridinterp=3;

EV=rand(N_aprime,N_a,N_z,'gpuArray');

% EV(round(N_aprime*rand(5,1)))=-Inf;

T=100; % average runtimes over T iterations

%%
N_aprimefine=N_aprime+(N_aprime-1)*vfoptions.ngridinterp;


%% My interp1() implementation was
time1=zeros(T,1);

EVinterpindex1=gpuArray(1:1:N_aprime)';
EVinterpindex2=gpuArray.linspace(1,N_aprime,N_aprime+(N_aprime-1)*vfoptions.ngridinterp)'; % keep the query grid on the GPU before calling interp1()

% Interpolate EV over aprime_grid
for tt=1:T
    tic;
    EVinterp1=interp1(EVinterpindex1,EV,EVinterpindex2);
    time1(tt)=toc;
end

fprintf('Using interp1 the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time1),min(time1),max(time1))

%% Next, a version based on weights and sum of the dot-product.
time2=zeros(T,1);
% Because grid interpolation layer is always evenly-spaced points, the weights are easy to calculate.

lowergridweight=[repmat(gpuArray(linspace(1,1/(vfoptions.ngridinterp+1),vfoptions.ngridinterp+1))',N_aprime-1,1); 0];
uppergridweight=[repmat(gpuArray(linspace(0,1-1/(vfoptions.ngridinterp+1),vfoptions.ngridinterp+1))',N_aprime-1,1); 1];

% Interpolate EV over aprime_grid
for tt=1:T
    tic;
    
    tempL=[repelem(EV(1:end-1,:,:),vfoptions.ngridinterp+1,1,1); EV(end,:,:)];
    tempH=[EV(1,:,:); repelem(EV(2:end,:,:),vfoptions.ngridinterp+1,1,1)];
    EVinterp2=lowergridweight.*tempL+uppergridweight.*tempH;
    % EVinterp2(isnan(EVinterp2))=-Inf; % Otherwise zero weight on -Inf gives NaN

    time2(tt)=toc;
end

fprintf('Using weighted-sum the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time2),min(time2),max(time2))

%% Sparse matrix approach
time3=zeros(T,1);

EVinterpindex2=gpuArray.linspace(1,N_aprime,N_aprime+(N_aprime-1)*vfoptions.ngridinterp)';

% Pre-compute interpolation matrix
idx_low = floor(EVinterpindex2);
idx_low = max(1, min(idx_low, N_aprime-1));
idx_high = idx_low + 1;
weight_high = EVinterpindex2 - idx_low;
weight_low = 1 - weight_high;
i_indices = [gpuArray.colon(1,N_aprimefine)'; gpuArray.colon(1,N_aprimefine)'];
j_indices = [idx_low; idx_high];
weights = [weight_low; weight_high];
interpMatrix = full(sparse(i_indices, j_indices, weights, N_aprimefine, N_aprime));

% Interpolation of EV is performed as a matrix multiplication
% Interpolate EV over aprime_grid
for tt=1:T
    tic;

    EVinterp3 = reshape(interpMatrix *reshape(EV, [N_aprime,N_a*N_z]), [N_aprimefine,N_a,N_z]);
    % EVinterp3(isnan(EVinterp3))=-Inf; % Otherwise zero weight on -Inf gives NaN

    time3(tt)=toc;
end


fprintf('Using sparse matrix the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time3),min(time3),max(time3))


%% Compare runtimes
fprintf(' \n')
fprintf(' \n')
fprintf(' \n')
fprintf('Using interp1 the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time1),min(time1),max(time1))
fprintf('Using weighted-sum the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time2),min(time2),max(time2))

fprintf('weighted-sum as fraction of interp1 the mean runtime is %2.8f \n', mean(time2)/mean(time1))
fprintf('sparse as fraction of interp1 the mean runtime is %2.8f \n', mean(time3)/mean(time1))

% fprintf('weighted-sum versions, ratio of A to B %2.8f \n', mean(time2A)/mean(time2)) % B is better

max(abs(EVinterp2(:)-EVinterp1(:)))
max(abs(EVinterp3(:)-EVinterp1(:)))


%% Method also has to work in the presence of -Inf terms in EV (does this work when EV is -Inf and weight is zero?)

PS. It is important to the runtimes that ‘EVinterpindex1’ and ‘EVinterpindex2’ are on the gpu prior to calling interp1().


Interesting. I think the grid sizes are not very realistic though:

N_aprime=101;
N_a=101;
N_z=21;

N_a and N_aprime should be larger, no?

Moreover, if you run the Matlab profiler you will see that interp1 is the bottleneck of the VFI routine, and therefore it is worth speeding it up.


Fair point. I reran the runtime tests with a bunch of different grid sizes, and interp1() was fastest in all of the combos (not by much, but it is the fastest).

Why do you say that interp1() is the bottleneck? Here is the profiler output when I ran it on the GL2017 example, deleting everything except solving the value function problem. interp1() doesn’t even appear to make the list. Do you see something different when you profile?

Here is an updated version of the code that looks at a bunch of different grid sizes:


% Some runtime tests to figure out a custom version of interp1() that can
% be used to implement grid interpolation layer.

% Everything is done on GPU

N_aprime_vec=[51,101,201,501,1001];
N_a_vec=[51,101,201,501];
N_z_vec=[5,11,21];
ngridinterp_vec=[5,20,50];

MeanTimes=zeros(3,length(N_aprime_vec),length(N_a_vec),length(N_z_vec),length(ngridinterp_vec));

for i1=1:length(N_aprime_vec)
    for i2=1:length(N_a_vec)
        for i3=1:length(N_z_vec)
            for i4=1:length(ngridinterp_vec)
                N_aprime=N_aprime_vec(i1);
                N_a=N_a_vec(i2);
                N_z=N_z_vec(i3);
                vfoptions.ngridinterp=ngridinterp_vec(i4);


                %% The place I use interp1() was for interpolating the expectations term,

                EV=rand(N_aprime,N_a,N_z,'gpuArray');

                % EV(round(N_aprime*rand(5,1)))=-Inf;

                T=100; % average runtimes over T iterations

                %%
                N_aprimefine=N_aprime+(N_aprime-1)*vfoptions.ngridinterp;


                %% My interp1() implementation was
                time1=zeros(T,1);

                EVinterpindex1=gpuArray(1:1:N_aprime)';
                EVinterpindex2=gpuArray.linspace(1,N_aprime,N_aprime+(N_aprime-1)*vfoptions.ngridinterp)'; % keep the query grid on the GPU before calling interp1()

                % Interpolate EV over aprime_grid
                for tt=1:T
                    tic;
                    EVinterp1=interp1(EVinterpindex1,EV,EVinterpindex2);
                    time1(tt)=toc;
                end

                fprintf('Using interp1 the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time1),min(time1),max(time1))

                %% Next, a version based on weights and sum of the dot-product.
                time2=zeros(T,1);
                % Because grid interpolation layer is always evenly-spaced points, the weights are easy to calculate.

                lowergridweight=[repmat(gpuArray(linspace(1,1/(vfoptions.ngridinterp+1),vfoptions.ngridinterp+1))',N_aprime-1,1); 0];
                uppergridweight=[repmat(gpuArray(linspace(0,1-1/(vfoptions.ngridinterp+1),vfoptions.ngridinterp+1))',N_aprime-1,1); 1];

                % Interpolate EV over aprime_grid
                for tt=1:T
                    tic;

                    tempL=[repelem(EV(1:end-1,:,:),vfoptions.ngridinterp+1,1,1); EV(end,:,:)];
                    tempH=[EV(1,:,:); repelem(EV(2:end,:,:),vfoptions.ngridinterp+1,1,1)];
                    EVinterp2=lowergridweight.*tempL+uppergridweight.*tempH;
                    % EVinterp2(isnan(EVinterp2))=-Inf; % Otherwise zero weight on -Inf gives NaN
                    % UNCOMMENTING THE ABOVE LINE MEANS THAT THIS APPROACH CAN HANDLE -Inf (WHILE IT IS COMMENT THIS APPROACH DOES NOT HANDLE -Inf)

                    time2(tt)=toc;
                end

                fprintf('Using weighted-sum the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time2),min(time2),max(time2))

                %% Sparse matrix approach
                time3=zeros(T,1);

                EVinterpindex2=gpuArray.linspace(1,N_aprime,N_aprime+(N_aprime-1)*vfoptions.ngridinterp)';

                % Pre-compute interpolation matrix
                idx_low = floor(EVinterpindex2);
                idx_low = max(1, min(idx_low, N_aprime-1));
                idx_high = idx_low + 1;
                weight_high = EVinterpindex2 - idx_low;
                weight_low = 1 - weight_high;
                i_indices = [gpuArray.colon(1,N_aprimefine)'; gpuArray.colon(1,N_aprimefine)'];
                j_indices = [idx_low; idx_high];
                weights = [weight_low; weight_high];
                interpMatrix = full(sparse(i_indices, j_indices, weights, N_aprimefine, N_aprime));

                % Interpolation of EV is performed as a matrix multiplication
                % Interpolate EV over aprime_grid
                for tt=1:T
                    tic;

                    EVinterp3 = reshape(interpMatrix *reshape(EV, [N_aprime,N_a*N_z]), [N_aprimefine,N_a,N_z]);
                    % EVinterp3(isnan(EVinterp3))=-Inf; % Otherwise zero weight on -Inf gives NaN
                    % NOTE: THIS DOES NOT HANDLE -Inf, NOR DOES UNCOMMENTING THE LINE ABOVE WORK TO HANDLE -Inf

                    time3(tt)=toc;
                end


                fprintf('Using sparse matrix the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time3),min(time3),max(time3))


                %% Compare runtimes
                fprintf(' \n')
                fprintf(' \n')
                fprintf(' \n')
                fprintf('Using interp1 the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time1),min(time1),max(time1))
                fprintf('Using weighted-sum the mean runtime is %2.8f  [with min of %2.8f and max of %2.8f ]\n', mean(time2),min(time2),max(time2))

                fprintf('weighted-sum as fraction of interp1 the mean runtime is %2.8f \n', mean(time2)/mean(time1))
                fprintf('sparse as fraction of interp1 the mean runtime is %2.8f \n', mean(time3)/mean(time1))

                % fprintf('weighted-sum versions, ratio of A to B %2.8f \n', mean(time2A)/mean(time2)) % B is better

                max(abs(EVinterp2(:)-EVinterp1(:)))
                max(abs(EVinterp3(:)-EVinterp1(:)))

                %% Store just the mean time
                MeanTimes(1,i1,i2,i3,i4)=mean(time1);
                MeanTimes(2,i1,i2,i3,i4)=mean(time2);
                MeanTimes(3,i1,i2,i3,i4)=mean(time3);

            end
        end
    end
end

%%
ratio2to1=MeanTimes(2,:,:,:,:)./MeanTimes(1,:,:,:,:);
ratio3to1=MeanTimes(3,:,:,:,:)./MeanTimes(1,:,:,:,:);

ratio2to1(:)' % all are greater than 1
ratio3to1(:)' % all are greater than 1

%% Method also has to work in the presence of -Inf terms in EV (does this work when EV is -Inf and weight is zero?)

Hi Rob! I ran the profiler on the Guerrieri-Lorenzoni example, only the value function iteration, and obtained quite a different result (unless I misread the output).

Besides the particular timing results, which may vary from one machine to another, I think the issue is that repeatedly calling interp1 within the Howard loop (so basically 150 times) can hardly be a good idea in terms of performance. Since the policy for assets and the asset grid are fixed, you can precompute the interpolation indices and weights before the Howard loop and then do the linear interpolation with multiplications and additions (even without using a sparse interpolating matrix). Otherwise, in the current implementation, every call to interp1 incurs
(i) the overhead of calling interp1, with the associated input checks etc.;
(ii) locating the position of a' on the a grid;
(iii) calculating the interpolation weights;
(iv) computing EV = weight*EV(ind)+(1-weight)*EV(ind+1), very roughly.

I asked ChatGPT to summarize what I wanted to say and this is the output :slight_smile:

1. Howard update, naive implementation

Fix a policy a'(a_i, z_j).

For h = 1, \dots, H:

EV_h(a_i, z_j) = \sum_{z'} V_h(a_i, z') \, \pi(z_j, z')

Let a_k \le a'(a_i, z_j) \le a_{k+1} and define

\omega(i,j) = \frac{a_{k+1} - a'(a_i, z_j)}{a_{k+1} - a_k}

Then

EV_h(a'(a_i, z_j), z_j) = \omega(i,j) EV_h(a_k, z_j) + (1 - \omega(i,j)) EV_h(a_{k+1}, z_j)

This is done automatically by interp1 with the line

EV_h(a'(a_i, z_j), z_j) = interp1(a_{grid},EV_h(:, z_j),a'(a_i, z_j))

The Howard update is

V_{h+1}(a_i, z_j) = F(a'(a_i, z_j), a_i, z_j) + \beta \, EV_h(a'(a_i, z_j), z_j)

end h loop

2. Efficient implementation

Precompute once (given fixed a'):

F^*(a_i, z_j) = F(a'(a_i, z_j), a_i, z_j)

Find k(i,j) such that

a_{k(i,j)} \le a'(a_i, z_j) \le a_{k(i,j)+1}

and store

\omega(i,j) = \frac{a_{k(i,j)+1} - a'(a_i, z_j)} {a_{k(i,j)+1} - a_{k(i,j)}}

Before we start the Howard loop, we have the arrays \omega(i,j) and k(i,j).

Howard loop, for h=1,...,150:

EV_h(a_i, z_j) = \sum_{z'} V_h(a_i, z') \, \pi(z_j, z')
V_{h+1}(a_i, z_j) = F^*(a_i, z_j) + \beta \left[ \omega(i,j) EV_h(a_{k(i,j)}, z_j) + (1 - \omega(i,j)) EV_h(a_{k(i,j)+1}, z_j) \right]

Of course all of this should be vectorized.
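To make the idea concrete, here is a self-contained sketch of the efficient version with toy sizes (all names and values are illustrative, not the toolkit's):

```matlab
% Toy sketch of the Howard loop with precomputed interpolation weights.
N_a=5; N_z=2; beta=0.96; H=150;
a_grid=linspace(0,1,N_a)';
pi_z=[0.9 0.1; 0.2 0.8];
aprime=rand(N_a,N_z);               % fixed policy a'(a_i,z_j)
Fstar=rand(N_a,N_z);                % F(a'(a_i,z_j),a_i,z_j), fixed during Howard
V=zeros(N_a,N_z);

% Precompute bracketing index k(i,j) and weight omega(i,j) once
kk=discretize(aprime,a_grid);       % a_grid(kk) <= aprime < a_grid(kk+1)
kk=min(max(kk,1),N_a-1);            % keep kk+1 inside the grid
omega=(a_grid(kk+1)-aprime)./(a_grid(kk+1)-a_grid(kk));
zidx=repmat(1:N_z,N_a,1);           % z-column index for linear indexing into EV

for h=1:H                           % Howard loop: no interp1, just indexing and sums
    EV=V*pi_z';                     % EV(i,j) = sum over z' of V(i,z')*pi(z_j,z')
    EVq=omega.*EV(sub2ind([N_a,N_z],kk,zidx)) ...
       +(1-omega).*EV(sub2ind([N_a,N_z],kk+1,zidx));
    V=Fstar+beta*EVq;
end
```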

That being said, it might be that interp1 on the GPU is optimized so well that it still pays to call it 150 times. Or perhaps indexing arrays on the GPU is inefficient (but you index gpu arrays all the time in other parts of the toolkit). Or maybe \omega(i,j) and k(i,j) are such big arrays that it is costly to precompute them and keep them in memory.

In any case, I trust your expertise on the matter :smiley: I just wanted to explain my reasoning.


Your runtime profiler output looks similar to mine; I don’t see interp1() in the list either.

I agree that in principle precomputing the interpolation weights should be faster than calling interp1() every iteration. But my runtime tests disagree. Presumably there is something about how Matlab hardcodes the internals of interp1() that makes it hard to beat with a manual ‘apply weights and sum’? I don’t include the time to precompute the weights in my runtime tests (it is done before the tic), so this is the comparison you are describing conceptually.

It is possible that I just need to rewrite the code in some minor way to get the runtimes to change, but not obvious to me what that would be (difference in runtimes are all small enough that very minor changes in code steps could flip the results).

PS. This reminds me a little of Judd, Maliar, Maliar & Tsener (2017), “How to solve dynamic stochastic models computing expectations just once”. But having re-skimmed the article, I don’t think my current codes are in conflict with their point (which is kind of obvious for pure discretization with discretized shocks, but not obvious for the polynomial approximations they are talking about).

PPS. It was important for runtimes that when I call interp1(gridcoarse,EV,gridfine) all three of the inputs are already on the gpu. This used not to be the case for the ValueFnIter_Refine_postGI_raw command (gridcoarse was on the cpu), and this minor change gave important speed gains. Very likely there are various toolkit commands where this is not yet the case, and so they could be substantially improved.
