Fri, 10/11/2024 - 22:15
Forums:
Hi guys! I'm trying to run several series of OCCT functions (intersections, extrusions, and so on) in parallel, for example for 10 models, but the speedup is far too small... Why? If I process 10 models in a single thread, it takes ~100 seconds, but with 10 threads (one model per thread) it still takes ~90 seconds... Why so slow? Does OCCT have internal mutex locks or something? Or can I build OCCT with some magic flag for multithreading?
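For context, the pattern I'm using is roughly the following sketch. The real OCCT calls are replaced by a placeholder `processModel` (an illustrative name, not an OCCT function) that just does allocation-heavy work:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the real per-model OCCT work (intersections,
// extrusions, ...). It stands in for allocation-heavy geometry code.
std::size_t processModel(int modelId) {
    std::size_t checksum = 0;
    for (int i = 0; i < 1000; ++i) {
        // Many small heap allocations, similar to OCCT handle churn.
        int* p = new int(modelId + i);
        checksum += static_cast<std::size_t>(*p % 7);
        delete p;
    }
    return checksum;
}

// One model per thread, as described above.
std::vector<std::size_t> processAllModels(int nModels) {
    std::vector<std::size_t> results(nModels);
    std::vector<std::thread> workers;
    workers.reserve(nModels);
    for (int m = 0; m < nModels; ++m)
        workers.emplace_back([m, &results] { results[m] = processModel(m); });
    for (std::thread& t : workers) t.join();
    return results;
}
```

With 10 threads I expected close to 10x speedup, but I measure almost none.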
Fri, 10/11/2024 - 22:48
I see that the major slowdown is in Standard_Transient::operator new, but why? Can I somehow disable the internal mutex?
Fri, 10/11/2024 - 22:55
I think I should try the different options for this:
Sat, 10/12/2024 - 00:10
Hello. Interesting investigation. Yes, in 7.8 the memory manager was reorganized with more options. I really recommend jemalloc. https://dev.opencascade.org/doc/overview/html/occt__upgrade.html#upgrade...
If you are working on Windows, you can find jemalloc as a package: https://github.com/Open-Cascade-SAS/OCCT/releases/tag/V7_8_0
Switching between allocator types is available only at the CMake stage. Or, of course, you can change it manually in the code. TBB or jemalloc are preferable.
You need the dynamic library of jemalloc. For static linking I can give you a hotfix, if needed.
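The CMake-stage switch could look like the following sketch (assuming the `USE_MMGR_TYPE` cache variable from the 7.8 memory-manager rework; check the upgrade guide linked above, and adjust paths and generator to your setup):

```shell
# Configure OCCT 7.8 to route allocations through jemalloc.
# Other assumed values for USE_MMGR_TYPE: NATIVE, FLEXIBLE, TBB.
cmake -S . -B build \
  -DUSE_MMGR_TYPE=JEMALLOC \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
```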
Best regards, Dmitrii.
Sat, 10/12/2024 - 09:30
Thanks for the reply, Dmitrii! Bad news, I'll have to rebuild the whole OCCT with new flags... But wait, what do you mean by "Or for sure you can change manually in code"? Can I put an MMGT_OPT = 0 flag somewhere in my application (for example, before all OCCT includes), or did you just mean that I can define the flag I need directly in Standard.cxx before compiling OCCT?
And one more question: where can I set jemalloc in the CMake settings? I only see options for TBB.
Sat, 10/12/2024 - 09:58
I found jemalloc in the CMake settings :)
Sat, 10/12/2024 - 14:41
I rebuilt OCCT with jemalloc and the problem went away, long live jemalloc :)
Mon, 10/14/2024 - 07:39
Caleb, could you share some details of the system on which you experience the slow behavior?
Wed, 10/16/2024 - 08:14
My system is certainly far from top-end: Intel i7-10700K (8 cores / 16 threads), 32 GB DDR4 system memory, and an NVIDIA 3060 (12 GB GDDR6 VRAM). But the problem was not my system; it was OpenCASCADE's internal mutex. If you don't change the memory allocator when building OpenCASCADE from source, the standard one will slow your multithreaded code down a lot, because it locks its internal mutex on every memory allocation. So I strongly recommend everyone rebuild OpenCASCADE with the jemalloc allocator type. Look at the image below:
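To show the contention pattern without OCCT, here is a stdlib-only sketch (illustrative names, no OCCT dependency): the same total amount of allocation churn is split across N threads, so on an allocator with a global lock the multi-threaded run barely beats the single-threaded one, while a thread-caching allocator such as jemalloc scales much better:

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Allocation-heavy loop: every iteration goes through operator new,
// which is where a serialized allocator takes its lock.
std::size_t allocChurn(std::size_t nAllocs) {
    std::size_t sum = 0;
    for (std::size_t i = 0; i < nAllocs; ++i) {
        std::size_t* p = new std::size_t(i);
        sum += *p & 1u;
        delete p;
    }
    return sum;
}

// Run the total allocation work split across nThreads and return the
// wall-clock time in milliseconds.
double timedRun(unsigned nThreads, std::size_t totalAllocs) {
    const auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    std::vector<std::size_t> sums(nThreads, 0);
    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([t, &sums, nThreads, totalAllocs] {
            sums[t] = allocChurn(totalAllocs / nThreads);
        });
    for (std::thread& w : workers) w.join();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Comparing `timedRun(1, N)` against `timedRun(8, N)` for a large N reproduces the shape of my measurements.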
Wed, 10/16/2024 - 08:44
I was actually curious about the OS and the C++ build configuration / toolchain rather than the hardware, but I see from the screenshot that you are using some Windows platform.
This mutex doesn't come from OCCT, but rather from the memory allocator implemented by the C/C++ runtime, which depends on the compiler and platform. I guess it should be MSVCRT from some Visual Studio version in your case.
Modern systems are expected to provide a default memory allocator optimized for multi-threaded environments; you shouldn't see it behave like one global mutex nowadays. From my personal experience, the default allocator in the Emscripten SDK does suffer from a global mutex, but that is quite a specific platform.
In the past, we experimented with the TBB allocator, which showed slightly better multithreading performance, but at the cost of considerably higher memory utilization. So be aware that you may also see a higher memory footprint.
Wed, 10/16/2024 - 08:46
It would also be interesting to have some minimalistic reproducer for this issue, to run more experiments on different platforms, but I guess you will be unable to share this...
Wed, 10/16/2024 - 09:25
About allocators in Emscripten:
Wed, 10/16/2024 - 10:34
I tested 6 different memory managers on Linux and Windows in x64. For sure there are some memory overheads, but usually that is much cheaper than the time improvement.
I was planning to share my research, based on OCCT single-threaded DE and multithreaded Boolean or incremental mesh functionality. But I forgot :-(
The tested metrics were heap size, CPU time, and elapsed (wall) time.
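A stdlib-only sketch of how the two timings can be captured together (illustrative helper, not from my benchmark code; heap size needs platform-specific tools and is not shown):

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Measure wall-clock (elapsed) time and process CPU time of a callable,
// both in milliseconds. With N busy threads, CPU time is roughly N times
// the wall time. Caveat: std::clock() reports per-process CPU time on
// POSIX but wall time on Windows/MSVC, where GetProcessTimes is needed.
template <typename Fn>
std::pair<double, double> measure(Fn&& fn) {
    const auto wall0 = std::chrono::steady_clock::now();
    const std::clock_t cpu0 = std::clock();
    fn();
    const std::clock_t cpu1 = std::clock();
    const auto wall1 = std::chrono::steady_clock::now();
    const double wallMs =
        std::chrono::duration<double, std::milli>(wall1 - wall0).count();
    const double cpuMs =
        1000.0 * static_cast<double>(cpu1 - cpu0) / CLOCKS_PER_SEC;
    return {wallMs, cpuMs};
}
```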
Best regards, Dmitrii.
Wed, 10/16/2024 - 10:57
One of the points to improve is the ability to use dirty (uninitialized) memory for Handle pointers. For now only calloc is acceptable, because some constructors do not initialize members of default types. I checked and collected them with clang-tidy, but have had no time to upgrade those classes.