VFI Gridinterplayer with Howard sparse

New function that solves VFI with NO d variable using grid interp layer and sparse matrix with Howard.

To test it, choose the following options when calling the VFI command:

vfoptions.lowmemory       = 0; % already default
vfoptions.howardsgreedy   = 0; % already default
vfoptions.howardssparse   = 1;
vfoptions.gridinterplayer = 1; 
vfoptions.ngridinterp     = 15;

I have tested it in a model with entrepreneurs and it is about 5 times faster than the case vfoptions.howardssparse=0, so good news!
@robertdkirkby When you have time, you could test this new function on the Aiyagari example and let me know if you also see a substantial speed gain. In my test I used n_a=1000 and n_z=40, which is larger than what is needed for Aiyagari. Also, let me know if you want me to modify it, I should have some time before Tuesday when I start teaching.

Important note: in this function I loop over z, so this would naturally be a low-memory feature. However, since it is fast, in my PR I put it in as the normal case. I started with the ‘looping over z’ approach since it is easier to code. By the way, is there a lowmemory=1 option in ValueFnIter_GridInterpLayer? I also consider only the postGI case; for preGI, I left the error message ‘not implemented yet’. I do not plan to implement preGI since it is too memory intensive.

Question. In the toolkit function ValueFnIter_GridInterpLayer there is an if condition if isscalar(n_a) in a couple of places, but I don’t understand what happens if n_a is not scalar (i.e. if there is more than one endogenous state). If this case is not implemented yet, there should be an error message. By the way, my sparse improvement only covers a scalar a variable.

I actually tested the Aiyagari example (the basic version in the VFI Toolkit repo and a version with two z variables, n_z = [5,5]) and my new version is much faster even in this case: 5-6 times faster.

I will go on implementing the version with d variable(s).

I have pushed two changes:

  1. Added a function to do Howards sparse with a d variable. So now vfoptions.howardssparse works with and without a d variable for infinite horizon models. There is still the issue of what happens if isscalar(n_a) is false: at the moment nothing happens, while there should be an error.
  2. I have improved the existing function ValueFnIter_Refine_postGI_raw with the help of Claude Code. After running the MATLAB profiler, the bottleneck was line 177, the call to interp1 inside Howard. Claude replaced interp1 with a matrix multiplication (the interp matrix holds the weights) and delivered a robust 25% speedup.

Cool.

Can you point me to anything that explains nicely how to use Claude with MATLAB? I might try it out.

I don’t know how to integrate Claude within the MATLAB IDE. What I did was:

(1) Run the code with the profiler to identify the slow parts of the code. Typically it is always a “kernel” or “raw” function.

(2) Upload the function to Claude and ask it to optimize it, pointing to the results of the MATLAB profiler (e.g. focus on lines xx to yy).

(3) Claude will generate a new file that I copy back into MATLAB and test. Sometimes there is no performance improvement; then I tell Claude that it did not improve, and keep on iterating.

To avoid messing up the MATLAB code on my computer, I created a new branch. Then I commit only if I like what Claude did.

Tip: sometimes you have to tell Claude to use implicit expansion instead of repmat, perhaps because AI models are trained on a lot of old material that does not use implicit expansion.
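For what it’s worth, the same idea exists in NumPy, where implicit expansion is called broadcasting. A minimal illustration (array values are made up):

```python
import numpy as np

# NumPy analog of MATLAB implicit expansion: a column and a row combine
# elementwise without materializing tiled copies (the repmat approach would).
col = np.array([[1.0], [2.0], [3.0]])   # 3x1 column
row = np.array([[10.0, 20.0]])          # 1x2 row
summed = col + row                      # broadcasts to a 3x2 matrix
print(summed)
```

Same result as tiling both arrays to 3x2 first, but without the extra memory traffic.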

Here is the link to my claude chat:
https://claude.ai/chat/78ebf2a1-3b26-43fd-8faf-b37d4e818523

As always, trust but verify AI. I left a few comments in the merge related to this change. To wit: linear interpolation leaves -Inf as -Inf. Matrix multiplication leaves it as -Inf when multiplied by a nonzero number, but turns the -Inf into NaN when multiplied by zero. This is not the behavior we want. I propose a simple hack that works given that the kind of infinity we are dealing with starts as -Inf, but I don’t know what sort of shorthand or math rule one could expediently apply to ensure that zero * signed infinity retains the correct sign, short of doing a whole second multiplication using ones() and then doing something like:

result = interpMatrix * EV_reshaped;
signed_infinities = ones(size(interpMatrix)) * EV_reshaped; % all-ones weights, so every -Inf propagates with its sign
result(isnan(result)) = signed_infinities(isnan(result));

I’m hoping somebody has a better idea.
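To make the failure mode concrete, here is a small NumPy sketch (the array names just mirror the MATLAB snippet; the weights are made up) of both the 0 * -Inf problem and the two-multiplication patch:

```python
import numpy as np

# Two fine points interpolated from three coarse EV values; the explicit
# zero weights stand in for the zeros of a dense interp matrix.
interpMatrix = np.array([[0.5, 0.5, 0.0],
                         [0.0, 0.5, 0.5]])
EV_reshaped = np.array([1.0, 2.0, -np.inf])

result = interpMatrix @ EV_reshaped
# Row 0 multiplies -inf by an explicit 0, and 0 * -inf = nan poisons the sum;
# row 1 hits -inf with a nonzero weight and correctly stays -inf.
assert np.isnan(result[0]) and result[1] == -np.inf

# The patch: redo the multiply with all-ones weights so the sign of the
# infinity survives, then overwrite the NaNs with those signed values.
signed_infinities = np.ones_like(interpMatrix) @ EV_reshaped
result[np.isnan(result)] = signed_infinities[np.isnan(result)]
assert np.all(result == -np.inf)
```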

Hi @MichaelTiemann, thanks for spotting this. What about adding this line after the matrix multiplication:

EVKrontemp(isnan(EVKrontemp)) = -Inf;  % Fix: matrix mult creates NaN when -Inf*0

This doesn’t handle the +Inf case, though. It’s true that in the present case we are dealing with -Inf in our return function, but we don’t know that somebody won’t use +Inf in their coding. I came up with a better solution:

Add exp(-500) to the interpMatrix. This is a very small number, but non-zero. It will not change non-zero values (exp(-500) is about 7x10^-218, which rounds off against even a value as small as 10^-100). But it is large enough to turn a zero weight into a non-zero one when multiplying by Inf, so the sign of the Inf term is preserved. Adding a scalar to every term in the matrix is also faster than searching for zeros (or NaNs) to fix.

I propose creating a utility function to encapsulate the creation of the interp matrix. This way if ever we need to change it from dense to sparse, or GPU to CPU, there’s just one place that needs changing. It will also reduce the distraction of all the index and weight magic we don’t want to replicate in other parts of the code.

I like the idea of modular programming.

Something else that could be outsourced to a small function, to avoid repeating it all over in the context of the grid interp layer: converting the midpoint plus a second-layer index from 1 to 3+2*ngridinterp into a ‘lower grid point’ plus an index from 1 to ngridinterp+2. This is what I am talking about:

% Currently Policy(1,:) is the midpoint, and Policy(2,:) the second layer
% (which ranges -n2short-1:1:1+n2short). It is much easier to use later if
% we switch Policy(1,:) to 'lower grid point' and then have Policy(2,:)
% counting 0:nshort+1 up from this.
adjust=(Policy(2,:,:)<1+n2short+1); % if second layer is choosing below midpoint
Policy(1,:,:)=Policy(1,:,:)-adjust; % lower grid point
Policy(2,:,:)=adjust.*Policy(2,:,:)+(1-adjust).*(Policy(2,:,:)-n2short-1); % from 1 (lower grid point) to 1+n2short+1 (upper grid point)
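If we did outsource it, the helper might look something like this NumPy sketch (function and argument names are my own invention; the logic just mirrors the MATLAB lines above):

```python
import numpy as np

def split_policy(policy1, policy2, n2short):
    """Hypothetical helper: convert (midpoint, second-layer index in
    1..2*n2short+3) into (lower grid point, offset in 1..n2short+2)."""
    adjust = policy2 < n2short + 2                  # chose below the midpoint
    lower = policy1 - adjust                        # step down one coarse grid point
    offset = np.where(adjust, policy2, policy2 - n2short - 1)
    return lower, offset

# With n2short=3 the second layer runs 1..9 and the midpoint sits at index 5.
lower, offset = split_policy(np.array([10, 10, 10]), np.array([2, 5, 9]), 3)
# lower  -> [9, 10, 10]  (only the below-midpoint choice steps down)
# offset -> [2, 1, 5]    (counting up from the lower grid point)
```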

That being said, one should consider the cost in terms of function calling overhead.

I am understanding from this discussion that there should be a dedicated function in VFI Toolkit that replaces the use of interp1() in implementing the grid interpolation layer. Michael did a version here

I am going to try some minor variations of this to check speed, as the whole point of eliminating the use of interp1() is to be slightly faster.

I like the idea the two of you are proposing here, so I will implement a version of the pull request above. I will set it up as an entirely separate function.

Something else that could be outsourced to a small function, to avoid repeating it all over in the context of the grid interp layer: converting the midpoint plus a second-layer index from 1 to 3+2*ngridinterp into a ‘lower grid point’ plus an index from 1 to ngridinterp+2.

This one is always just the last lines of the script, so I think copy-paste is easy enough here.

Thanks for doing some tests! Yes, according to the profiler the bottleneck was the line with interp1, especially when called within the Howard loop. The interp1 function must have quite a bit of overhead since it has to handle many cases, so replacing it with something tailored to our problem should give some speedup. I think this makes sense only in the context of a hot loop: if interp1 is called only once I would not bother replacing it.

Once you are done with it, I look forward to testing it in the replication of Bruggeman (2021) with interpolation and, more importantly, in my own project.

@robertdkirkby If you want some help in running some tests, just let me know

Great point, thanks. I copy and paste the function you have written here:

function interpMatrix=getInterpMatrix(EVinterpindex, N_aprime, N_aprimediff)
    idx_low = floor(EVinterpindex);
    idx_low = max(1, min(idx_low, N_aprimediff-1));
    idx_high = idx_low + 1;
    weight_high = EVinterpindex - idx_low;
    weight_low = 1 - weight_high;

    i_indices = [gpuArray.colon(1,N_aprime)'; gpuArray.colon(1,N_aprime)'];
    j_indices = [idx_low; idx_high];
    weights = [weight_low; weight_high];
    % Adding exp(-500) to every element in the matrix makes it dense
    interpMatrix = sparse(i_indices, j_indices, weights, N_aprime, N_aprimediff)+exp(-500);
end

As you have noted, with your modification, the interpolation matrix is not sparse anymore. The expensive line is now

interpMatrix * reshape(EV, N_aprimediff, N_aNz)

This is matrix * full_matrix multiplication, where EV (the second matrix) is always full. So it is not clear to me whether it is faster to do sparse_matrix * full_matrix or full_matrix * full_matrix. Note also that we are dealing with matrix multiplication on the GPU. I guess this question goes beyond my expertise :)

Please, be my guest on running some tests.
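In case it helps with testing outside MATLAB: here is a rough NumPy/SciPy translation of the function (the name, argument names, and CSR format are my choices, not part of the toolkit), which makes it easy to time the sparse product against its dense counterpart M.toarray() @ EV:

```python
import numpy as np
from scipy import sparse

def get_interp_matrix(ev_interp_index, n_fine, n_coarse):
    """SciPy sketch of the proposed getInterpMatrix: ev_interp_index holds
    1-based fractional positions on the coarse grid (as in MATLAB); the
    result has two linear-interpolation weights per row."""
    idx_low = np.clip(np.floor(ev_interp_index).astype(int), 1, n_coarse - 1)
    w_high = ev_interp_index - idx_low
    rows = np.tile(np.arange(n_fine), 2)
    cols = np.concatenate([idx_low - 1, idx_low])     # shift to 0-based columns
    vals = np.concatenate([1.0 - w_high, w_high])
    return sparse.csr_matrix((vals, (rows, cols)), shape=(n_fine, n_coarse))

# Fine points halfway between coarse points reproduce linear interpolation.
M = get_interp_matrix(np.array([1.0, 1.5, 2.0, 2.5, 3.0]), 5, 3)
EV = np.array([10.0, 20.0, 30.0])
print(M @ EV)   # [10. 15. 20. 25. 30.]
```

Timing M @ EV against M.toarray() @ EV on representative sizes would answer the sparse-vs-full question directly.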

Because of where it will be used in VFI Toolkit for implementing the grid interpolation layer, it only needs to accept inputs that are gpuArray (full, not sparse).

I can see a few minor variations on how to do this, so it is worth writing two or three and seeing which is fastest, given that speed is the whole point of using a custom replacement for interp1().

Note also that there is no reason it needs the interface interp1(a_grid, EV, aprime_grid); the inputs could instead be (N_a, EV, N_aprime) or even just (EV, n2short). We know, because of how the grid interpolation layer works, that the interpolation is to evenly-spaced points.

I’ll also add: if it is true that -Inf is a special and important value in the return functions, but no return function ever meaningfully returns +Inf, and we know that because we have combed out NaNs ahead of time (which we do), then we can replace the NaNs produced by the multiplication with -Inf, preserving sparsity.

The MATLAB documentation shows how to measure the density of a matrix, and we can test whether we are truly working with matrices that are “sparse enough” for it to matter. MATLAB says that unless the matrix is at most 1% full, the overhead of sparseness is rarely worth it. They give the example of a 10,000x10,000 identity matrix (eye(10000)): that is 0.01% dense, so very well worth it in that case. If what makes our matrix sparse is just the ratio of fine to coarse gridpoints, then unless that is a 100x ratio (which it might be if it is 5x in each of three dimensions), dense matrices win again.
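The density measure MATLAB uses (nnz(A)/numel(A)) is easy to reproduce; here is the identity-matrix example in SciPy:

```python
from scipy import sparse

# Density in the sense of the MATLAB docs: nonzeros / total elements.
A = sparse.eye(10000, format="csr")          # 10,000 x 10,000 identity
density = A.nnz / (A.shape[0] * A.shape[1])
print(density)   # 0.0001, i.e. 0.01% dense -- well under the ~1% guideline
```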

Makes sense. I suspect that, with how the grid interpolation layer is used and what kind of values V and E[V] typically take, the sparseness is not worthwhile in this instance. As you say, it would only relate to the -Inf (I also cannot see any reason you would use +Inf, except perhaps as part of implementing Epstein-Zin preferences, where you do a lot of transforming 1/x in place of x and -x in place of x).

I wonder whether sparse GPU matrices vs full GPU matrices behave any differently from sparse CPU matrices vs full CPU matrices, i.e. whether 1% is still a useful guide.

I started measuring times with tic and toc, but then I found this in the MATLAB documentation: https://uk.mathworks.com/help/parallel-computing/measure-and-improve-gpu-performance.html

They suggest using gputimeit.

@robertdkirkby Could you please point me to where the function getInterpMatrix is called? I wanted to test it but I wasn’t able to find it in the VFI repo. Thanks

It is not in the repo as it does not yet exist (given that it replaces interp1, which already works, and is just about being a little faster, there was no point adding it until we have something written to be fast). You can find the version Michael proposed as the bottom lines of
