math_(Double/Single)Tab allocation


I have noticed that the math_*Tab classes statically allocate a 512-element array on the stack and only allocate dynamically when the data does not fit into that array. Unfortunately, this can easily cause a stack overflow, as in the

AppParCurves_Function::Perform

method. On iOS a new thread has only 512 KB of stack memory, and AppParCurves_Function::Perform allocates 16 matrices and 21 vectors on the stack, causing a stack overflow in some cases. It is possible that I am the only one affected by this problem, but I believe that allocating such huge chunks of memory on the stack is undesirable.
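Roughly, the scheme described above corresponds to the following simplified sketch (not the actual OCCT source; the buffer size handling, member names and heap fallback details are only illustrative):

#include <Standard.hxx>
#include <Standard_TypeDef.hxx>

// Simplified illustration of the allocation scheme described above.
// THE_BUFFER_SIZE and the member names are made up for this sketch.
class SingleTabSketch
{
public:
  static const Standard_Integer THE_BUFFER_SIZE = 512;

  explicit SingleTabSketch (const Standard_Integer theLength)
  : myPtr (myBuf)
  {
    if (theLength > THE_BUFFER_SIZE)
    {
      // Only large arrays go to the heap; small ones live in myBuf,
      // which is part of the object itself and thus usually on the stack.
      myPtr = static_cast<Standard_Real*> (Standard::Allocate (theLength * sizeof (Standard_Real)));
    }
  }

  ~SingleTabSketch()
  {
    if (myPtr != myBuf)
    {
      void* aPtr = myPtr;
      Standard::Free (aPtr);
    }
  }

  Standard_Real& operator() (const Standard_Integer theIndex) { return myPtr[theIndex]; }

private:
  Standard_Real* myPtr;
  Standard_Real  myBuf[THE_BUFFER_SIZE]; // ~4 KB per object when Standard_Real is double
};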

P.S.: this probably also explains some of the crashes I reported that could not be reproduced by the OCCT team.

Roman Lygin:

Hi Istvan,

Using the stack for small arrays was introduced as part of #24044 (tracker.dev.opencascade.org/view.php?id=24044) to improve the performance of various modeling algorithms. In those scenarios allocation was on a critical path and was a noticeable hotspot.

The size was chosen on x86/Windows, where the stack size is 2 MB by default. You might want to decrease it based on your experiments on other architectures. Certainly there is a trade-off, and perhaps architecture-specific defaults could be chosen.

Roman
P.S. Forgot to say thanks for your work on the iOS port. We've been following that with great interest but never had time to constructively comment. Good luck!

István Csanády:

Couldn't we achieve this performance improvement by using memory pools? A simple lock-free queue like the Boost lock-free queue implementation (http://www.boost.org/doc/libs/1_53_0/doc/html/boost/lockfree/queue.html) could do the job, and we could get rid of the static allocation. Do you remember which modeling algorithms benefited from this approach?

I'm glad you are interested in the iOS port; thankfully it was pretty straightforward.

Roman Lygin:

Memory pools do not quite apply here (regardless of whether we introduce Boost or OCCT's own NCollection_IncAllocator).

The scenarios were something like these:

void Foo()
{
  while (a lot of iterations) {
    if (A) {
      math_SingleTab aSmallArray (1, 4);
      ...
    } else {
      math_SingleTab anAnotherSmallArray (1, 8);
      ...
    }
    Bar();
  }
}

void Bar()
{
  int n = ...; // something small but possibly not constant
  while (another lot of iterations) {
    math_DoubleTab anAnotherSmallArray (1, n);
    Bar2();
  }
}

void Bar2()
{
  // ... can use some medium-size arrays ...
}
...

So propagating any pool isn't quite feasible, as allocations happen in various places with complex call chains.
So the idea was to follow the SSO (small string optimization) pattern. It can be checked, for instance, here - http://stackoverflow.com/questions/10315041/meaning-of-acronym-sso-in-th...

AFAIR, the hotspot was at least in the surface-surface intersection algorithm.

István Csanády:

Couldn't we have a global memory pool for these classes that provides them with these 512-element arrays? If they need bigger arrays, they can simply allocate them as they do now. If the pool were based on a lock-free queue, no locking would be needed for adding/removing memory blocks. And since it would be global, the pool would not have to be passed around. I mentioned Boost not because of its allocator, but because of its lock-free queue implementation.
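A rough sketch of such a global pool (only boost::lockfree::queue is real API here; BlockPool, THE_BLOCK_SIZE and the fallback behavior are invented for illustration):

#include <boost/lockfree/queue.hpp>
#include <cstdlib>

// Hypothetical global pool of fixed-size blocks, roughly following the
// suggestion above.
const std::size_t THE_BLOCK_SIZE = 512;

class BlockPool
{
public:
  BlockPool() : myQueue (64) {} // pre-reserve space for 64 queued pointers

  double* Acquire()
  {
    double* aBlock = 0;
    if (!myQueue.pop (aBlock))
    {
      // Pool is empty: fall back to a plain heap allocation.
      aBlock = static_cast<double*> (std::malloc (THE_BLOCK_SIZE * sizeof (double)));
    }
    return aBlock;
  }

  void Release (double* theBlock)
  {
    if (!myQueue.push (theBlock))
    {
      std::free (theBlock); // could not enqueue - give it back to the heap
    }
  }

private:
  boost::lockfree::queue<double*> myQueue;
};

BlockPool THE_GLOBAL_POOL; // Acquire()/Release() need no external locking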

Mikhail Sazonov:

A new bug, 0025746, has recently been registered for this place in the code. We are going to use memory pools (NCollection_IncAllocator) to resolve the problem. With this approach, only one allocator instance is declared before entering the outermost loop, and its Reset method is called at the beginning of each iteration so that the same already allocated memory is reused.
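A minimal sketch of that usage pattern (the function, loop and buffer below are placeholders, not names from the actual fix):

#include <NCollection_IncAllocator.hxx>
#include <Standard_TypeDef.hxx>

// Sketch of the pattern described above.
void ComputeSketch (const Standard_Integer aNbIterations,
                    const Standard_Integer aRowSize)
{
  // One allocator instance created before the outermost loop.
  Handle(NCollection_IncAllocator) anAlloc = new NCollection_IncAllocator();

  for (Standard_Integer anIter = 1; anIter <= aNbIterations; ++anIter)
  {
    // Reset lets the next iteration reuse the memory already obtained
    // from the system instead of allocating it again.
    anAlloc->Reset();

    Standard_Real* aRow = static_cast<Standard_Real*>
      (anAlloc->Allocate (aRowSize * sizeof (Standard_Real)));
    // ... fill and use aRow; no per-allocation Free is needed ...
  }
}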

Roman Lygin:

Hi Michael,

That would be good, although I don't immediately see how it could be easily implemented.
After some recollection, I think the initial issue has to do with the multiple math_Vector objects created inside the math package. For instance:

math_FunctionRoot::math_FunctionRoot (math_FunctionWithDerivative& F,
                                      const Standard_Real Guess,
                                      const Standard_Real Tolerance,
                                      const Standard_Integer NbIterations)
{
  math_Vector V(1,1), Tol(1,1);
  // ...

It would indeed be good if the allocator could be propagated. In any case, this is *a very hot path* in multiple modeling algorithms (the Extrema-based ones). Given that nothing beats stack allocation performance-wise, you might still want to consider an SSO-like technique, perhaps with a reduced threshold. Unlike an allocator (even IncAllocator), SSO-like allocation will likely allow inlining and hence better compile-time optimization.

These two do not exclude each other, so you could (see the sketch after this list):
- allocate on stack for small sizes (smaller than current defaults)
- else use the allocator if supplied
- else fallback to Standard::Allocate() if the allocator has not been supplied
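
A rough sketch of such a combined scheme, assuming a math_*Tab-like class can receive an optional allocator handle (the class and all its names are illustrative, not part of OCCT):

#include <Standard.hxx>
#include <NCollection_BaseAllocator.hxx>

class SmallRealBuffer
{
public:
  static const Standard_Integer THE_STACK_LIMIT = 16;

  SmallRealBuffer (const Standard_Integer theSize,
                   const Handle(NCollection_BaseAllocator)& theAlloc = NULL)
  : myAlloc (theAlloc),
    myPtr   (myStackBuf)
  {
    if (theSize > THE_STACK_LIMIT)
    {
      myPtr = static_cast<Standard_Real*> (!myAlloc.IsNull()
        ? myAlloc->Allocate (theSize * sizeof (Standard_Real))    // pool, if supplied
        : Standard::Allocate (theSize * sizeof (Standard_Real))); // plain heap otherwise
    }
  }

  ~SmallRealBuffer()
  {
    // Memory taken from an allocator is returned when the allocator is
    // reset or destroyed; only the Standard::Allocate() fallback needs Free.
    if (myPtr != myStackBuf && myAlloc.IsNull())
    {
      void* aPtr = myPtr;
      Standard::Free (aPtr);
    }
  }

  Standard_Real* Data() { return myPtr; }

private:
  Handle(NCollection_BaseAllocator) myAlloc;
  Standard_Real*                    myPtr;
  Standard_Real                     myStackBuf[THE_STACK_LIMIT]; // small sizes stay on the stack
};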

Hope this helps.
Roman

Mikhail Sazonov:

Bug 25746 has been resolved and the fix will appear in OCCT version 6.9.0.
We decided to make a very simple fix:
- the size of the Buf field has been reduced from 512 to 16 items;
- we got rid of the indirection related to storing the addresses of array rows (the AddrBuf field, which had a size of 32 items, was removed).
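
As a rough illustration of what removing that indirection means (invented names, not the actual OCCT code): before the fix the table kept a 512-item buffer plus a 32-item table of row addresses; after the fix the inline buffer holds 16 items and the cell address is computed directly from the row/column indices.

#include <Standard_TypeDef.hxx>

struct DoubleTabSketch
{
  Standard_Real    Buf[16];   // inline storage for small tables
  Standard_Real*   Addr;      // points to Buf or to a heap block for larger tables
  Standard_Integer LowerRow, LowerCol, NbCols;

  Standard_Real& Value (const Standard_Integer theRow,
                        const Standard_Integer theCol)
  {
    // Index arithmetic replaces the lookup in the removed AddrBuf table.
    return Addr[(theRow - LowerRow) * NbCols + (theCol - LowerCol)];
  }
};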