Tuning toolkits for GPUs

A recent commit added this error message to the toolkit:

On transtion paths you can only use vfoptions.lowmemory>0 when using transpathoptions.fastOLG=1, because otherwise the runtimes will anyway be so slow as to be essentially unusable

To that I say “yes, and no”. What I have measured is that for a large model, which needs a lowmemory option just to fit, fastOLG is just another dimension (age) that needs to be either prioritized or iterated as appropriate.

In my case, I have 32GB of GPU RAM and 265GB of CPU RAM (of which 128GB can be “shared” with the GPU). The statistics I measured were:

fastOLG: 26-31GB GPU RAM in use; 16-27GB shared GPU RAM, 74GB of CPU RAM (includes shared GPU RAM), and a GPU utilization rate that fluctuated mostly between 40% and 70%. Total runtime for a transition iteration: 2431 seconds. There was no indication that the GPU was thrashing (nothing was charged against time spent in Copy).

slowOLG: 9.5GB GPU RAM; no shared GPU RAM, 40GB of CPU RAM, and a GPU utilization rate above 98%. Total runtime for a transition iteration: 1025 seconds.
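
As a quick check on the two runtimes above, the arithmetic can be written out as below. This is only an illustrative sketch in Python (the toolkit itself is MATLAB); the variable names are made up for the example and the only inputs are the two measured runtimes.

```python
# Measured wall-clock time per transition iteration, taken from the two runs above.
fast_olg_seconds = 2431  # transpathoptions.fastOLG=1
slow_olg_seconds = 1025  # transpathoptions.fastOLG=0

# How many times faster the slowOLG run was, and the share of runtime saved.
speedup = fast_olg_seconds / slow_olg_seconds
saved_fraction = (fast_olg_seconds - slow_olg_seconds) / fast_olg_seconds

print(f"slowOLG is {speedup:.2f}x faster, {saved_fraction:.0%} less time")
# prints: slowOLG is 2.37x faster, 58% less time
```

So on this model and this hardware, iterating over age took well under half the time of vectorizing over it.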

I think that as long as the GPU compute cores are fully occupied, the toolkit is making best use of the cores, and there’s no need to enforce that a particular dimension be loaded into the GPU.

It might be useful to create a test harness that can benchmark various iteration plans so that the user can put tunings into their models, but that’s just a thought for the future.
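
To make the idea concrete, here is a minimal sketch of what such a harness could look like. It is written in Python purely for illustration and is not toolkit code: `benchmark_plans`, `fastest_plan`, and the two stand-in workloads are hypothetical names, and the workloads are trivial CPU loops rather than real value function iterations. A real version would also need to wait for the GPU to finish before stopping the clock.

```python
import time
from typing import Callable, Dict


def benchmark_plans(plans: Dict[str, Callable[[], object]], repeats: int = 3) -> Dict[str, float]:
    """Time each named plan and keep its best wall-clock time in seconds."""
    timings = {}
    for name, run in plans.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            run()  # a GPU plan would have to block here until the device is done
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


def fastest_plan(timings: Dict[str, float]) -> str:
    """Return the name of the plan with the smallest measured time."""
    return min(timings, key=timings.get)


# Hypothetical stand-ins for two iteration plans over the age dimension.
def all_ages_at_once():
    return sum(range(200_000))


def one_age_at_a_time():
    total = 0
    for _age in range(20):
        total += sum(range(10_000))
    return total


timings = benchmark_plans({
    "all_ages_at_once": all_ages_at_once,
    "one_age_at_a_time": one_age_at_a_time,
})
print(timings, "->", fastest_plan(timings))
```

Feeding in the runtimes I measured, `fastest_plan({"fastOLG": 2431, "slowOLG": 1025})` picks `"slowOLG"`, which is the kind of answer the user could then hard-code as a tuning for that model.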

Now that I have Claude to assist, I will re-enable vfoptions.lowmemory>0 with transpathoptions.fastOLG=0. Will aim to get to it later this week.

Great! Definitely worth discussing my discoveries from the past weekend regarding fastOLG et al. When you are ready :wink:

Expecting to be doing that mid-week :slight_smile:
I’ve been working hard on ‘finishing’ all the two endo state FHorz combos
FHorz TPath and FHorz PType are the next after that

Interesting. I had always worked on the rule of thumb that if I can fill up the GPU memory while running parallel operations on large matrices, then the GPU cores are being used to the max. But clearly that was an over-simplified view.

All lowmemory for both fastOLG=0 and fastOLG=1 is now implemented for FHorz TPath with one endogenous state. Will gradually roll out for the rest of FHorz TPath over the rest of this week.