That’s cool! Look forward to using this new feature!
I was playing around with a simple life-cycle model with the following structure
- No d variable
- One endogenous state, a
- One exogenous stochastic state, z, age-dependent
- Permanent type, theta (income fixed effect)
I tested divide and conquer but it is roughly 4 times slower than the standard method. This holds for different grid sizes n_a (from 500 to 1500 points) and for different values of the option level1n (from 7 to 21). I was a bit puzzled. Here is my example on github:
IntroToLifeCycleModels/ALE/main.m at main · aledinola/IntroToLifeCycleModels · GitHub
EDIT
The only slightly non-standard feature is that I have only two grid points for z (since I took the example from one of yours where you had only employed vs unemployed). Divide and conquer should still be beneficial, though.
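For reference, this is roughly how I switched it on in vfoptions (I'm writing the option names from memory, so treat the exact spelling as my assumption rather than the definitive toolkit syntax):
vfoptions.divideandconquer = 1;  % use the divide-and-conquer algorithm for the endogenous state
vfoptions.level1n = 11;          % number of level-1 grid points; I tried values from 7 to 21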
Will look at this, but probably not until January (might fit it into early Dec)
I am not too surprised it is slower with n_a=500 and n_z=2 (gpus are just so good at parallelization that dividing it up is probably just not worthwhile), but I am surprised it is still slower with n_a=1500 and n_z=2.
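For anyone reading who is wondering what divide and conquer is doing here, a toy sketch of the idea (just my own illustration, not the actual toolkit code): the optimal aprime is monotone in a, so you first do a full search at a small 'level 1' subset of the a grid, and then for the a points in between you only search between the two bracketing level-1 optima.
n_a = 1500; a_grid = linspace(0,10,n_a)'; level1n = 11;
R = -(a_grid - 0.7*a_grid.').^2;        % toy objective R(aprime,a); the argmax over aprime is increasing in a
lvl1 = round(linspace(1,n_a,level1n));  % indices of the level-1 subset of the a grid
[~,polL1] = max(R(:,lvl1),[],1);        % level 1: full search over aprime, but only at level1n points
pol = zeros(n_a,1);
for jj = 1:level1n-1                    % level 2: fine grid, restricted search
    lo = polL1(jj); hi = polL1(jj+1);   % monotonicity: the optimum is bracketed by the level-1 optima
    for ii = lvl1(jj):lvl1(jj+1)
        [~,kk] = max(R(lo:hi,ii),[],1);
        pol(ii) = lo + kk - 1;
    end
end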
Great, thanks! If I find out something in the meantime, I'll let you know
Regarding the implementation of divide and conquer, I was looking at this code: VFIToolkit-matlab/ValueFnIter/InfHorz/DivideConquer/ValueFnIter_DC1_nod_raw.m at master · vfitoolkit/VFIToolkit-matlab · GitHub
I noticed that several arrays are accessed in row-major order, like
V(curraindex,:)
Instead of the column-major order
V(:,curraindex)
Matlab is like Fortran and Julia (not C or Python): arrays are stored column-wise. With the first form you are accessing array elements that are not contiguous in memory. This may seem an obscure and nerdy observation, but in my experience it matters a lot for performance on the cpu.
I don't know if it matters as much for arrays on the gpu, but I suspect it does. I'm not suggesting rewriting the whole code; it's just a consideration.
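Just to make the point concrete, a tiny example of what column-major storage means:
A = reshape(1:6,2,3)  % A = [1 3 5; 2 4 6]
A(:).'                % 1 2 3 4 5 6: memory order runs down each column in turn
% so A(:,j) reads a contiguous block of memory, while A(i,:) jumps by the number of rows between elements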
I am also not sure if it matters on gpu. [For anyone reading who is wondering what this ‘column vs row access’ thing is, see this example]
I would be interested to see if it matters on gpu, and if it does I will keep it in mind. I'll see how big the speed difference is before deciding if it is worth rewriting anything
Some more information on this here: https://uk.mathworks.com/matlabcentral/answers/406125-flattening-of-gpuarray-row-major-or-column-major-order
It says that gpu arrays are also stored in column-major order:
gpuArray data is stored in column-major format, just like CPU data in MATLAB.
This link provides an example that is spot on:
I tested this using the example on your website as a template (see my code below). The result is that on the cpu the loop order matters a lot, while on the gpu there is no significant difference.
I find this a bit surprising, but that's probably because I don't know much about gpus.
clear,clc,close all
M = 5000; % Try 1000, 2000, 5000
N = 10000;
%% On CPU
A1 = zeros(M,N);
tic
for n = 1:N
    for m = 1:M
        A1(m,n) = 1;
    end
end
time1=toc;
fprintf('CPU, Correct loop order, run time: %f \n',time1)
A1 = zeros(M,N);
tic
for m = 1:M
    for n = 1:N
        A1(m,n) = 1;
    end
end
time2=toc;
fprintf('CPU, Wrong loop order, run time: %f \n',time2)
%% On GPU
M = 100; % much smaller than the CPU test: assigning elements one by one into a gpuArray inside loops is very slow
N = 100;
A2 = zeros(M,N,'gpuArray');
tic
for n = 1:N
    for m = 1:M
        A2(m,n) = 1;
    end
end
time1=toc;
fprintf('GPU, Correct loop order, run time: %f \n',time1)
A2 = zeros(M,N,'gpuArray');
tic
for m = 1:M
    for n = 1:N
        A2(m,n) = 1;
    end
end
time2=toc;
fprintf('GPU, Wrong loop order, run time: %f \n',time2)
Cool to see
Seems reasonable. I suspect the GPU is more ‘access just those elements of the matrix, one to each gpu core’ and so where exactly they are is less important.
Of course, with the gpu the number one thing you want to do is avoid that kind of 'nested for loop' code as much as possible in the first place. So maybe it is also partly 'this is bad gpu code, and that outweighs the columns-vs-rows access' that is driving the result. [Nested for-loops are a lot of serial work; the gpu should be doing things in parallel as much as possible]
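Just to illustrate that point (a toy example, nothing to do with the toolkit code): on the gpu the whole fill can be done with no loop at all, e.g.
M = 10000; N = 10000;
A3 = zeros(M,N,'gpuArray');
tic
A3(:,:) = 1;       % one vectorized assignment, no loop at all
wait(gpuDevice)    % make sure the gpu has actually finished before stopping the timer
time3 = toc;
fprintf('GPU, no loop, run time: %f \n',time3)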
I modified the test, replacing the nested loops with a single loop, and now it seems that the correct column-major order does matter for performance, though not as much as on the cpu. But the difference is not big and not very stable, so I'm not sure it's worth rewriting anything.
Results
CPU, Correct loop order, run time: 0.118482
CPU, Wrong loop order, run time: 0.777604
GPU, Correct loop order, run time: 0.144062
GPU, Wrong loop order, run time: 0.188719
Code
clear,clc,close all
M = 10000; % Try 1000, 2000, 5000
N = 10000;
%% On CPU
A1 = zeros(M,N);
tic
for n = 1:N
    A1(:,n) = 1;
end
time1=toc;
fprintf('CPU, Correct loop order, run time: %f \n',time1)
A1 = zeros(M,N);
tic
for m = 1:M
    A1(m,:) = 1;
end
time2=toc;
fprintf('CPU, Wrong loop order, run time: %f \n',time2)
%% On GPU
M = 10000; % Try 1000, 2000, 5000
N = 10000;
A2 = zeros(M,N,'gpuArray');
tic
for n = 1:N
    A2(:,n) = 1;
end
time1=toc;
fprintf('GPU, Correct loop order, run time: %f \n',time1)
A2 = zeros(M,N,'gpuArray');
tic
for m = 1:M
    A2(m,:) = 1;
end
time2=toc;
fprintf('GPU, Wrong loop order, run time: %f \n',time2)
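One small caveat on the gpu timings (just a suggestion, assuming the numbers above came from plain tic/toc as in the code): gpuArray operations run asynchronously, so tic/toc can be a bit noisy unless the gpu is forced to finish before the timer stops. With M, N and A2 as above, something like
tic
for n = 1:N
    A2(:,n) = 1;
end
wait(gpuDevice)  % block until all queued gpu work has completed
time1 = toc;
(or wrapping the loop in a function and timing it with gputimeit) should give more stable numbers, which might be part of why the gpu difference looks unstable across runs.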