Test Howard repo

In a previous post, there was an interesting discussion on how to do the Howard step in an infinite-horizon model.

I have created a new repo where I compare two methods: the first uses interp1 and the second uses precomputation of interpolation weights and indices. Interestingly, I was able to create the repo and its contents using codex in VSCode, which feels amazing! The documentation and explanations are really well done, better than I could have done myself. For example, codex generated a super clear README file:

Howard VFI GPU Benchmark (MATLAB)

This repository benchmarks two versions of value function iteration (VFI) with Howard policy improvement for an Aiyagari-style problem on the GPU:

  • fun_vfi_howard_interp1.m: uses interp1 in Howard policy evaluation.
  • fun_vfi_howard_manual_precomp.m: uses manual interpolation with precomputed indices/weights.

The main script, main.m, builds grids and parameters, moves data to GPU, runs warm-up calls, times both solvers with gputimeit, compares solution differences, and plots the policy function.

Requirements

  • MATLAB with Parallel Computing Toolbox (GPU support)
  • VFIToolkit-matlab on your MATLAB path
  • Project functions available on path:
    • fun_vfi_howard_interp1.m
    • fun_vfi_howard_manual_precomp.m
    • Aiyagari1994_ReturnFn.m
    • veclinspace.m

Run

  1. Open MATLAB in this folder.
  2. Update the addpath(...) line in main.m if your VFIToolkit location differs.
  3. Run:
main

What main.m reports

  • GPU runtime of both methods (seconds)
  • Speed ratio (interp1/manual)
  • Max absolute difference between value/policy outputs
  • A policy plot for low and high productivity states

Notes

  • Timing uses gputimeit (appropriate for GPU workloads).
  • The script includes warm-up runs to reduce first-call/JIT timing bias.
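
As a minimal sketch of the timing pattern described in these notes (the workload here is a placeholder, not the repo's solver):

```matlab
% gputimeit takes a zero-argument function handle, synchronizes the GPU,
% runs the handle several times, and returns a robust time estimate.
A = rand(2000, 'gpuArray');   % placeholder workload on the GPU
f = @() A*A;                  % zero-argument handle to time
f();                          % warm-up call (first-call/JIT overhead)
t = gputimeit(f);             % preferred over tic/toc for GPU code
% A plain tic/toc would need explicit synchronization to be meaningful:
% tic; B = f(); wait(gpuDevice); tElapsed = toc;
```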

@robertdkirkby I would definitely recommend using Matlab from VSCode+codex to semi-automatically document many toolkit features and repositories. It is a boring and repetitive task at which LLMs nowadays excel.

2 Likes

Regarding the comparison of interp1 vs manual interpolation, this is the output of main.m

Benchmarking VFI with interp1 in Howard (GPU)...
Benchmarking VFI with manual interpolation in Howard (GPU)...
Time VFI with interp1 in Howard:              1.506464 s
Time VFI with manual interpolation in Howard: 0.578993 s
Speed ratio (interp1/manual):                 2.602
Difference b/w two methods:                   1.776357e-15

@robertdkirkby

  • I am not sure you are measuring runtimes correctly in your test. The Matlab documentation suggests using gputimeit for GPU code, not tic/toc.
  • For the manual interpolation, I use discretize to locate the interpolation indices, but it would be faster to use an algebraic formula, since the fine grid is equally spaced (as you do in your implementation). This means the advantage of manual interpolation over interp1 would be even more substantial.
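
For concreteness, a sketch of that algebraic alternative to discretize on an evenly spaced grid (the variable names here are illustrative, not the repo's):

```matlab
% Locate left bracketing indices and weights for linear interpolation
% on an evenly spaced grid, without calling discretize.
% Assumes a_grid = linspace(a_min, a_max, n_a)' and query points ap
% lying inside [a_min, a_max].
step  = (a_max - a_min)/(n_a - 1);        % constant grid spacing
ind   = floor((ap - a_min)/step) + 1;     % left bracketing index
ind   = min(max(ind, 1), n_a - 1);        % keep both ind and ind+1 in range
omega = (a_grid(ind + 1) - ap)/step;      % weight on the left node
% interpolated value: omega.*V(ind) + (1 - omega).*V(ind + 1)
```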
2 Likes

Hi @aledinola, I was able to run your files and I got similar timings: the version without interp1 is three times faster. In your file fun_vfi_howard_manual_precomp.m I see that on line 84 you still use interp1. Is that intentional?

1 Like

Thanks for confirming my results. Yes, I use interp1 in the maximization block of the VFI in both versions. This was intentional: I wanted to focus on the Howard block. In the maximization step, the policy function a'(a,z) changes at every iteration, so there is little advantage in precomputing the interpolation indices and weights.

Also, please bear in mind that the codes in this repo are purely for illustration, to make the point that it is faster to do Howard in a certain way; for any other purpose you should use the VFI Toolkit commands, which are pretty good in terms of performance and generality.

1 Like

Ok, I compared your version with the manual interpolation and also with the toolkit command ValueFnIter_Case1, and they have almost the same run time. I think you can make your version faster if you remove this loop.

for z_c = 1:n_z
    % EV_h_interp is [n_a,1]
    EV_h_interp = omega(:,z_c).*EV_h(ind(:,z_c),z_c) + (1-omega(:,z_c)).*EV_h(ind(:,z_c)+1,z_c);
    V_h(:,z_c) = U_pol(:,z_c) + beta*EV_h_interp;
end
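
If it helps, one loop-free way to write that update, as a sketch (this treats ind and omega as the [n_a, n_z] arrays from the snippet above, and uses sub2ind to address all z columns at once):

```matlab
% Vectorized Howard evaluation step: replace the z_c loop with linear
% indexing into EV_h. Since ind <= n_a-1, lin+1 stays within the same
% column (MATLAB arrays are column-major).
[n_a, n_z] = size(EV_h);
cols = repmat(1:n_z, n_a, 1);            % column index for each entry
lin  = sub2ind([n_a, n_z], ind, cols);   % linear index of left node
EV_h_interp = omega.*EV_h(lin) + (1 - omega).*EV_h(lin + 1);
V_h = U_pol + beta*EV_h_interp;
```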

2 Likes

Yes, you are right of course, but I just wanted to make an easy-to-understand example.

I made a minor modification to loop 100 times over the two solution methods, and then report the mean of the runtimes. I get:
Time VFI with interp1 in Howard: 0.294313 s
Time VFI with manual interpolation in Howard: 0.367992 s
Speed ratio (interp1/manual): 0.800
Difference b/w two methods: 1.332268e-15

I guess this leaves the question of why interp1 is faster on my computer but slower on both of yours.

Do you want to try with repeated runs? Just copy-paste the following modification to lines 63-67 of main.m.

T=100;
time_interp=zeros(1,T);
time_manual=zeros(1,T);
for tt=1:T
    disp('Benchmarking VFI with interp1 in Howard (GPU)...')
    time_interp(tt) = gputimeit(f_interp);

    disp('Benchmarking VFI with manual interpolation in Howard (GPU)...')
    time_manual(tt) = gputimeit(f_manual);
end

and then change the print out (lines 77-78) to be

fprintf('Time VFI with interp1 in Howard:              %.6f s\n', mean(time_interp))
fprintf('Time VFI with manual interpolation in Howard: %.6f s\n', mean(time_manual))

PS. When you do an operation just once you often get slightly misleading runtime results because of things ‘warming up’, like needing to open a parallel pool to run a parfor loop on the CPU. [Not claiming this happened here, just as a general rule for looking at runtimes.]

PPS. This is roughly the same runtime ratios as I was getting in my own tests.

PPPS. I got a new desktop last month with an NVIDIA RTX 4000 Ada, hence my fast runtimes :smiley: I’m running Matlab R2025b.

1 Like

I went and read the contents of “fun_vfi_howard_manual_precomp.m”. It is tightly written code. Impressive performance by codex.

1 Like

Thanks for running the test on your machine. I will modify the code as you suggest and run (1) using R2024b on a more powerful GPU and (2) using R2025b on that same GPU. It is interesting that run times vary across machines/Matlab versions.

Regarding averaging run times: I agree with you, but in main.m I already do a warm-up, see lines 54-61. However, your method is more accurate.

g = gpuDevice;
wait(g);

% Warm-up to avoid first-call/JIT overhead dominating the benchmark
f_interp();
wait(g);
f_manual();
wait(g);
1 Like

Congrats! 20 GB of GPU RAM :slight_smile:

The 20 GB of GPU RAM vs my previous 8 has taken two-endogenous-state models from ‘can be run’ to ‘easy to run even with quite a few exogenous shocks’, which is nice :slight_smile:

2 Likes

Let me know about the replication of Bruggeman’s paper!

@robertdkirkby On R2024b but on a more powerful GPU (NVIDIA GeForce RTX 5080), manual interpolation is still faster (0.85 vs 0.91 seconds) but the difference is much smaller. I report below the results of my test.

=== Benchmark log ===
MATLAB Release: (R2024b)
MATLAB Version: 24.2
Platform: win64
------------------------------------------------------------
Grid sizes: n_a=600, n_z=11, n_a_fine=20
Params: r=0.0300, w=1.0000, beta=0.9000, crra=3.0000, rho=0.6000, sig_eps=0.2000
VFI options: max_iter=1000, tol=1.0e-08, howard_iter=80
------------------------------------------------------------
GPU Device:
  Name: NVIDIA GeForce RTX 5080 Laptop GPU
  ComputeCapability: 12.0
  DriverVersion: 1.290000e+01
  ToolkitVersion: 1.220000e+01
  TotalMemory(GB): 15.920
------------------------------------------------------------
Timing repetitions: T=100
interp1 Howard (GPU): mean=0.911083 s, std=0.044176 s, median=0.921053 s
manual Howard (GPU): mean=0.852868 s, std=0.095932 s, median=0.886504 s
Speed ratio (interp1/manual): 1.068259
Max abs diff: errV=1.332268e-15, errP=0.000000e+00, err=max=1.332268e-15
------------------------------------------------------------
Raw times (seconds)
time_interp = [0.701185 0.854751 0.70937 0.68563 0.70006 0.92539 0.923885 0.917494 0.916735 0.922329 0.923277 0.92344 0.92229 0.923039 0.923857 0.921075 0.920173 0.918813 0.917881 0.917609 0.921319 0.921172 0.921154 0.921563 0.921908 0.919858 0.917206 0.919237 0.920886 0.921539 0.920481 0.922262 0.921953 0.919916 0.920194 0.922226 0.908994 0.919508 0.911048 0.921134 0.920665 0.924164 0.911264 0.918683 0.921865 0.922934 0.922898 0.924183 0.925429 0.921031 0.919983 0.922334 0.922292 0.922288 0.923053 0.920582 0.919754 0.92131 0.922153 0.911483 0.919421 0.913351 0.91829 0.918736 0.92182 0.922325 0.919931 0.919727 0.912927 0.920123 0.919132 0.921767 0.910406 0.922378 0.919554 0.919107 0.920974 0.918508 0.918515 0.921945 0.923021 0.918766 0.909918 0.93057 0.930541 0.926866 0.920986 0.91725 0.922171 0.921454 0.923407 0.922988 0.924373 0.922208 0.923211 0.922955 0.920688 0.920618 0.924517 0.922695 ]
time_manual = [0.56139 0.563565 0.584644 0.581705 0.902356 0.885826 0.886557 0.882369 0.639668 0.885513 0.888735 0.888972 0.88958 0.885478 0.885229 0.88792 0.888661 0.887262 0.886711 0.887517 0.891345 0.886451 0.889514 0.880644 0.886864 0.889638 0.888427 0.88548 0.886602 0.889866 0.886304 0.889077 0.884887 0.882756 0.882712 0.883652 0.880796 0.884711 0.886754 0.885083 0.886204 0.888367 0.884739 0.886407 0.889335 0.892437 0.896065 0.887029 0.888564 0.888446 0.892387 0.889994 0.890708 0.882083 0.889557 0.888106 0.890149 0.888718 0.582548 0.599581 0.575436 0.587417 0.882606 0.892524 0.890386 0.882725 0.886288 0.883574 0.885443 0.888149 0.561122 0.587961 0.88374 0.886692 0.883162 0.88144 0.876915 0.885618 0.883932 0.88069 0.886359 0.883487 0.776859 0.892701 0.891969 0.885752 0.886449 0.890691 0.891058 0.886728 0.889014 0.889318 0.885529 0.888399 0.889346 0.885468 0.891171 0.88696 0.896545 0.894499 ]

I will now run this test on the same GPU but with R2025b.

1 Like

Second set of results for R2025b. Manual interpolation is still better (0.55 vs 0.65 seconds).

=== Benchmark log ===
MATLAB Release: (R2025b)
MATLAB Version: 25.2
Platform: win64
------------------------------------------------------------
Grid sizes: n_a=600, n_z=11, n_a_fine=20
Params: r=0.0300, w=1.0000, beta=0.9000, crra=3.0000, rho=0.6000, sig_eps=0.2000
VFI options: max_iter=1000, tol=1.0e-08, howard_iter=80
------------------------------------------------------------
GPU Device:
  Name: NVIDIA GeForce RTX 5080 Laptop GPU
  ComputeCapability: 12.0
  DriverVersion: 1.290000e+01
  ToolkitVersion: 1.220000e+01
  TotalMemory(GB): 15.920
------------------------------------------------------------
Timing repetitions: T=100
interp1 Howard (GPU): mean=0.647486 s, std=0.019189 s, median=0.645313 s
manual Howard (GPU): mean=0.549022 s, std=0.004196 s, median=0.548201 s
Speed ratio (interp1/manual): 1.179345
Max abs diff: errV=1.332268e-15, errP=0.000000e+00, err=max=1.332268e-15
------------------------------------------------------------
Raw times (seconds)
time_interp = [0.648882 0.833523 0.645629 0.647817 0.645161 0.644819 0.645192 0.644971 0.643318 0.646122 0.646353 0.644967 0.645794 0.644648 0.640821 0.646683 0.647719 0.646674 0.645863 0.679621 0.645672 0.646119 0.643875 0.647027 0.645007 0.64688 0.646777 0.643791 0.645479 0.647136 0.647461 0.647944 0.646812 0.648914 0.649293 0.649268 0.64751 0.648675 0.647694 0.648714 0.646516 0.646123 0.645516 0.644619 0.645613 0.644805 0.645022 0.645353 0.643685 0.64585 0.645241 0.642112 0.643869 0.643564 0.644487 0.643449 0.643455 0.645399 0.643896 0.64455 0.643574 0.64392 0.643603 0.64409 0.644488 0.645867 0.645895 0.646652 0.64585 0.645154 0.643383 0.644154 0.642388 0.643726 0.643774 0.645027 0.645347 0.646046 0.645278 0.644608 0.644654 0.645507 0.645725 0.644988 0.645263 0.646379 0.644897 0.645447 0.645512 0.645734 0.645035 0.64255 0.642174 0.642609 0.641146 0.642385 0.641696 0.641271 0.645689 0.649394 ]
time_manual = [0.554324 0.558343 0.54883 0.557295 0.549078 0.557557 0.547968 0.547755 0.546547 0.549346 0.550197 0.552404 0.548562 0.548936 0.545087 0.555452 0.552683 0.560711 0.559356 0.557705 0.553126 0.547411 0.555522 0.56066 0.558137 0.557979 0.550092 0.548504 0.548069 0.548213 0.550343 0.548188 0.550417 0.549171 0.550156 0.553585 0.549764 0.546343 0.548159 0.548712 0.545853 0.549578 0.547055 0.552098 0.549211 0.550027 0.550452 0.546044 0.556266 0.546394 0.542729 0.546366 0.545165 0.545049 0.545532 0.545364 0.546212 0.546241 0.54481 0.548628 0.545978 0.544712 0.545651 0.54471 0.545088 0.549233 0.548167 0.548541 0.548338 0.545751 0.54553 0.545447 0.552885 0.546539 0.546091 0.549153 0.548016 0.547409 0.545059 0.544778 0.544131 0.545831 0.54522 0.545855 0.546298 0.548941 0.546136 0.556711 0.549342 0.549369 0.551177 0.547413 0.544975 0.545942 0.545179 0.54473 0.544692 0.542348 0.550731 0.550346 ]

A side thing I learnt from this comparison is that on my RTX 5080, R2025b is faster than R2024b.

Manual interpolation is a bit faster than interp1, by 18% (mean comparison, but the median is very similar). My earlier results showing a three-fold advantage of manual interpolation were probably biased by not averaging run times (but I was also on a different GPU, so who knows…)

@robertdkirkby What could be the difference b/w my results and your results? Maybe the operating system? I assume you are on Linux while I am on win64 (see results files on github). Also, we are using different GPUs. In any case, the advantage of replacing interp1 with precomputation of indices and weights is not as clear as I had hoped :frowning:

1 Like

Yeah, I’m on linux. Apparently it can make a small difference:
“MATLAB performance is similar on Windows, macOS, and Linux, although differences can occur among platforms for the following reasons: MathWorks builds its products with a different compiler on each platform, and each has its own performance characteristics.”

Not sure if that is what the difference is.

I suspect maybe there is also a lesson here about high-level programming languages, which is that more complicated built-in functions can still perform well on runtimes if their internals are carefully written in lower-level languages. I’m guessing that this is the key to why the runtime difference between the two is negligible.

My overall takeaway: manual interpolation is probably marginally faster than using interp1(), but the difference is small enough to be largely irrelevant. Given the whole toolkit is already coded around interp1(), I just don’t think any advantage from doing manual interpolation is worth re-coding anything.

1 Like