That’s cool! Look forward to using this new feature!
I was playing around with a simple life-cycle model with the following structure
- No d variable
- One endogenous state, a
- One exogenous stochastic state, z, age-dependent
- Permanent type, theta (income fixed effect)
I tested divide and conquer but it is roughly 4 times slower than the standard method. This holds for different grid sizes n_a (from 500 to 1500 points) and for different values of the option level1n (from 7 to 21). I was a bit puzzled. Here is my example on github:
IntroToLifeCycleModels/ALE/main.m at main · aledinola/IntroToLifeCycleModels · GitHub
EDIT
The only slightly non-standard feature is that I have only two grid points for z (since I took the example from one of yours where you had only employed vs unemployed). Divide and conquer should still be beneficial, though.
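For reference, this is roughly how I switched it on in vfoptions (I'm writing the option names from memory, so treat the exact spelling as my assumption rather than the definitive toolkit syntax):
vfoptions.divideandconquer = 1;  % use the divide-and-conquer algorithm for the endogenous state
vfoptions.level1n = 11;          % number of level-1 grid points; I tried values from 7 to 21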
Will look at this, but probably not until January (might fit it into early Dec)
I am not too surprised it is slower with n_a=500 and n_z=2 (gpus are just so good at parallelization that dividing it up is probably just not worthwhile), but I am surprised it is still slower with n_a=1500 and n_z=2.
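For anyone reading who is wondering what divide and conquer is doing here, a toy sketch of the idea (just my own illustration, not the actual toolkit code): the optimal aprime is monotone in a, so you first do a full search at a small 'level 1' subset of the a grid, and then for the a points in between you only search between the two bracketing level-1 optima.
n_a = 1500; a_grid = linspace(0,10,n_a)'; level1n = 11;
R = -(a_grid - 0.7*a_grid.').^2;        % toy objective R(aprime,a); the argmax over aprime is increasing in a
lvl1 = round(linspace(1,n_a,level1n));  % indices of the level-1 subset of the a grid
[~,polL1] = max(R(:,lvl1),[],1);        % level 1: full search over aprime, but only at level1n points
pol = zeros(n_a,1);
for jj = 1:level1n-1                    % level 2: fine grid, restricted search
    lo = polL1(jj); hi = polL1(jj+1);   % monotonicity: the optimum is bracketed by the level-1 optima
    for ii = lvl1(jj):lvl1(jj+1)
        [~,kk] = max(R(lo:hi,ii),[],1);
        pol(ii) = lo + kk - 1;
    end
end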
Great, thanks! If I find out something in the meantime, I'll let you know
Regarding the implementation of divide and conquer, I was looking at this code: VFIToolkit-matlab/ValueFnIter/InfHorz/DivideConquer/ValueFnIter_DC1_nod_raw.m at master · vfitoolkit/VFIToolkit-matlab · GitHub
I noticed that several arrays are accessed in row-major order, like
V(curraindex,:)
Instead of the column-major order
V(:,curraindex)
Matlab is like Fortran and Julia (not C or Python): arrays are stored column-wise. With the first form you are accessing array elements that are not contiguous in memory. This may seem an obscure and nerdy observation, but in my experience it matters a lot for performance on the cpu.
I don't know if it matters as much for arrays on the gpu, but I suspect it does. I'm not suggesting rewriting the whole code; it's just a consideration.
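Just to make the point concrete, a tiny example of what column-major storage means:
A = reshape(1:6,2,3)  % A = [1 3 5; 2 4 6]
A(:).'                % 1 2 3 4 5 6: memory order runs down each column in turn
% so A(:,j) reads a contiguous block of memory, while A(i,:) jumps by the number of rows between elements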
I am also not sure if it matters on gpu. [For anyone reading who is wondering what this ‘column vs row access’ thing is, see this example]
I would be interested to see if it matters on gpu, and if it does I will keep it in mind. I'll see how big the speed difference is before deciding if it is worth rewriting anything
Some more information on this here: https://uk.mathworks.com/matlabcentral/answers/406125-flattening-of-gpuarray-row-major-or-column-major-order
It says that gpu arrays are also stored in column-major order:
gpuArray data is stored in column-major format, just like CPU data in MATLAB.
This link provides an example that is spot on:
I tested this using the example on your website as a template (see my code below). The result is that on the cpu the loop order matters a lot, while on the gpu there is no significant difference.
I find this a bit surprising, but that's probably because I don't know much about gpus.
clear,clc,close all
M = 5000; % Try 1000, 2000, 5000
N = 10000;
%% On CPU
A1 = zeros(M,N);
tic
for n = 1:N
    for m = 1:M
        A1(m,n) = 1;
    end
end
time1=toc;
fprintf('CPU, Correct loop order, run time: %f \n',time1)
A1 = zeros(M,N);
tic
for m = 1:M
    for n = 1:N
        A1(m,n) = 1;
    end
end
time2=toc;
fprintf('CPU, Wrong loop order, run time: %f \n',time2)
%% On GPU
M = 100; % much smaller than the CPU test: assigning elements one by one into a gpuArray inside loops is very slow
N = 100;
A2 = zeros(M,N,'gpuArray');
tic
for n = 1:N
    for m = 1:M
        A2(m,n) = 1;
    end
end
time1=toc;
fprintf('GPU, Correct loop order, run time: %f \n',time1)
A2 = zeros(M,N,'gpuArray');
tic
for m = 1:M
    for n = 1:N
        A2(m,n) = 1;
    end
end
time2=toc;
fprintf('GPU, Wrong loop order, run time: %f \n',time2)
Cool to see
Seems reasonable. I suspect the GPU is more ‘access just those elements of the matrix, one to each gpu core’ and so where exactly they are is less important.
Of course, with the gpu the number one thing you want to do is avoid that kind of 'nested for loop' code as much as possible in the first place. So maybe it is also partly 'this is bad gpu code, and that outweighs the columns-vs-rows access' that is driving the result. [Nested for-loops are a lot of serial work; the gpu should be doing things in parallel as much as possible]
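Just to illustrate that point (a toy example, nothing to do with the toolkit code): on the gpu the whole fill can be done with no loop at all, e.g.
M = 10000; N = 10000;
A3 = zeros(M,N,'gpuArray');
tic
A3(:,:) = 1;       % one vectorized assignment, no loop at all
wait(gpuDevice)    % make sure the gpu has actually finished before stopping the timer
time3 = toc;
fprintf('GPU, no loop, run time: %f \n',time3)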
I modified the test, replacing the nested loops with a single loop, and now it seems that the correct column-major order does matter for performance, though not as much as on the cpu. But the difference is not big and not very stable, so I'm not sure it's worth rewriting anything.
Results
CPU, Correct loop order, run time: 0.118482
CPU, Wrong loop order, run time: 0.777604
GPU, Correct loop order, run time: 0.144062
GPU, Wrong loop order, run time: 0.188719
Code
clear,clc,close all
M = 10000; % Try 1000, 2000, 5000
N = 10000;
%% On CPU
A1 = zeros(M,N);
tic
for n = 1:N
    A1(:,n) = 1;
end
time1=toc;
fprintf('CPU, Correct loop order, run time: %f \n',time1)
A1 = zeros(M,N);
tic
for m = 1:M
    A1(m,:) = 1;
end
time2=toc;
fprintf('CPU, Wrong loop order, run time: %f \n',time2)
%% On GPU
M = 10000; % Try 1000, 2000, 5000
N = 10000;
A2 = zeros(M,N,'gpuArray');
tic
for n = 1:N
    A2(:,n) = 1;
end
time1=toc;
fprintf('GPU, Correct loop order, run time: %f \n',time1)
A2 = zeros(M,N,'gpuArray');
tic
for m = 1:M
    A2(m,:) = 1;
end
time2=toc;
fprintf('GPU, Wrong loop order, run time: %f \n',time2)
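One small caveat on the gpu timings (just a suggestion, assuming the numbers above came from plain tic/toc as in the code): gpuArray operations run asynchronously, so tic/toc can be a bit noisy unless the gpu is forced to finish before the timer stops. With M, N and A2 as above, something like
tic
for n = 1:N
    A2(:,n) = 1;
end
wait(gpuDevice)  % block until all queued gpu work has completed
time1 = toc;
(or wrapping the loop in a function and timing it with gputimeit) should give more stable numbers, which might be part of why the gpu difference looks unstable across runs.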