Running OCCT in separate threads (multithreading is too slow)

Hi guys! I'm trying to run several series of OCCT functions (intersections, extrusions, and so on) in parallel, for example for 10 models, but the speed in this case is too slow... Why? If I process 10 models in a single thread, it takes ~100 seconds, but if I use 10 threads (one model per thread) it takes ~90 seconds... Why so slow? Does OCCT have internal mutex locks or something? Or can I build OCCT with one magic flag for multithreading?

Caleb Smith:

I see major slowdowns due to the function Standard_Transient::operator new, but why? Can I somehow disable the internal mutex?

Caleb Smith:

I think I should try the different options in this function:

Standard_Address Standard::Allocate(const Standard_Size theSize)
{
#ifdef OCCT_MMGT_OPT_FLEXIBLE
  // runtime-configurable manager (selected via the MMGT_OPT environment variable)
  return Standard_MMgrFactory::GetMMgr()->Allocate(theSize);
#elif defined OCCT_MMGT_OPT_JEMALLOC
  // jemalloc: thread-caching allocator, low lock contention
  Standard_Address aPtr = je_calloc(theSize, sizeof(char));
  if (!aPtr)
    throw Standard_OutOfMemory("Standard_MMgrRaw::Allocate(): malloc failed");
  return aPtr;
#elif defined OCCT_MMGT_OPT_TBB
  // Intel TBB scalable allocator
  Standard_Address aPtr = scalable_calloc(theSize, sizeof(char));
  if (!aPtr)
    throw Standard_OutOfMemory("Standard_MMgrRaw::Allocate(): malloc failed");
  return aPtr;
#else
  // plain CRT calloc (the default; may serialize threads on a global lock)
  Standard_Address aPtr = calloc(theSize, sizeof(char));
  if (!aPtr)
    throw Standard_OutOfMemory("Standard_MMgrRaw::Allocate(): malloc failed");
  return aPtr;
#endif // OCCT_MMGT_OPT_FLEXIBLE
}

 

Dmitrii Pasukhin:

Hello. Interesting investigation. Yes, in 7.8 the memory manager was reorganized with more options. I really recommend jemalloc. https://dev.opencascade.org/doc/overview/html/occt__upgrade.html#upgrade...

If you are working on Windows, you can find jemalloc as a package: https://github.com/Open-Cascade-SAS/OCCT/releases/tag/V7_8_0

Switching between allocator types is only available at the CMake configuration stage. Or, of course, you can change it manually in the code. TBB or jemalloc are preferable.
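As a sketch, the configure step would look something like this (the variable name USE_MMGR_TYPE is my reading of the 7.8 upgrade guide; please verify it against the options actually listed in your CMake cache):

```shell
# Hedged example: USE_MMGR_TYPE and its JEMALLOC value are assumptions
# from the 7.8 upgrade guide; /path/to/occt-source is a placeholder.
cmake -DUSE_MMGR_TYPE=JEMALLOC \
      -DCMAKE_BUILD_TYPE=Release \
      /path/to/occt-source
```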

You need a dynamic library of jemalloc. For static linking I can give you a hotfix, if needed.

Best regards, Dmitrii.

 

Caleb Smith:

Thanks for the reply, Dmitrii! Bad news: I'll have to rebuild the whole OCCT with new flags... But wait, what do you mean by "or for sure you can change manually in code"? Can I put an MMGT_OPT = 0 flag somewhere in my application (for example, before all OCCT includes), or did you just mean that I can define the flag I need directly in Standard.cxx before compiling OCCT?
And one more question: where can I set jemalloc in the CMake settings? I only see options for TBB.

Caleb Smith:

I found jemalloc in the CMake settings :)

Caleb Smith:

I rebuilt OCCT with jemalloc and the problem went away. Long live jemalloc :)

gkv311 n:

Caleb, could you share some details of the system on which you experience the slow behavior?

Caleb Smith:

My system is certainly far from top-end: Intel i7-10700K (8 cores / 16 threads), 32 GB DDR4 system memory, and an NVIDIA RTX 3060 (12 GB GDDR6 VRAM). But the problem was not my system; it was OpenCASCADE's internal mutex. If you don't change the memory allocator when building OpenCASCADE from source, the standard one will slow you down badly in multithreaded code, because it locks its internal mutex on every memory allocation. So I strongly recommend everyone rebuild OpenCASCADE with the jemalloc allocator type. Look at the image below:

gkv311 n:

I was actually curious about the OS and C++ build configuration / toolchain rather than the hardware, but I see from the screenshot that you are using some Windows platform.

"the problem was because internal mutex of opencascade"

This mutex doesn't come from OCCT, but rather from the memory allocator implemented by the C/C++ runtime, which depends on the compiler and platform. I guess it should be MSVCRT from some Visual Studio version in your case.

Modern systems are expected to provide a default memory allocator optimized for multithreaded environments; you shouldn't see 'one global mutex' behavior nowadays. From my personal experience, the default allocator in the Emscripten SDK suffers from a global mutex, but that is quite a specific platform.

In the past, we experimented with the TBB allocator, which showed slightly better multithreading performance but at the cost of considerably higher memory utilization. So be aware that you may also see a higher memory footprint.

gkv311 n:

It would also be interesting to have some minimalistic reproducer for this issue, to run more experiments on different platforms, but I guess you will be unable to share it...

gkv311 n:

About allocators in Emscripten:

The default system allocator in Emscripten, dlmalloc, is very efficient in a single-threaded program, but it has a single global lock which means if there is contention on malloc then you can see overhead. You can use mimalloc instead by using -sMALLOC=mimalloc, which is a more sophisticated allocator tuned for multithreaded performance. mimalloc has separate allocation contexts on each thread, allowing performance to scale a lot better under malloc/free contention.

Note that mimalloc is larger in code size than dlmalloc, and also uses more memory at runtime (so you may need to adjust INITIAL_MEMORY to a higher value), so there are tradeoffs here.

Dmitrii Pasukhin:

I tested 6 different memory managers on Linux and Windows (x64). For sure there are some memory overheads, but usually they are much cheaper than the time improvements.

I was planning to share my research, based on OCCT single-threaded DE and multithreaded boolean or incremental mesh functionality. But I forgot :-(

The tests measured heap size, CPU time, and elapsed (wall-clock) time.

Best regards, Dmitrii.

Dmitrii Pasukhin:

One point to improve: the ability to use dirty (uninitialized) memory for Handle pointers. For now only calloc is accepted, because some constructors do not initialize members of fundamental types. I checked them with clang-tidy and collected a list, but have had no time to upgrade those classes.