Feedback on 7.0 - putting cache out of B-Spline

As explained in http://dev.opencascade.org/index.php?q=node/1091 and in #24682, the motivation for moving the cache out of B-Spline was to improve performance by avoiding the mutex lock (which is required to ensure thread-safe access to the cache). Although the overhead was acceptable on x86 platforms, Istvan Csanady reported significant overhead on iOS (see http://dev.opencascade.org/index.php?q=node/1043).

To address the issue the OCC team moved the cache out into GeomAdaptor_Surface (and alike). Respectively, use of plain Geom_BSplineSurface revealed performance loss (as anticipated in http://dev.opencascade.org/index.php?q=node/1091) and required updates throughout OCC, which do not seem to be complete yet and perhaps never will be (as there are multiple simple curve/surface evaluations throughout the OCC source base).

Moreover, user code built with 7.0 may (and likely will in most cases) slow down out of the box and will require manual revision and updates, which is extra work for users. For instance, we observed 10% to 1200% performance loss on about 1/3 of the test cases in our product, and early analysis confirmed the root cause to be the changed B-Spline cache mechanism.

A more efficient alternative could be to keep the cache inside B-Splines but to protect it with another kind of mutex, a so-called spin mutex (see https://en.wikipedia.org/wiki/Spinlock). Instead of going into the OS kernel (which Standard_Mutex does, wrapping pthread_mutex_t on non-Windows platforms), a spin mutex quickly checks an atomic flag and either proceeds or spins (if the mutex is locked). TBB offers an efficient implementation in tbb::spin_mutex, which OCC could use (if the respective macro is defined) along with its own non-TBB implementation.
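
For illustration, a minimal spin mutex along these lines could look as follows (just a sketch built on std::atomic_flag; the SpinMutex name is hypothetical and not an existing OCC class):

#include <atomic>

// Minimal spin mutex: instead of entering the OS kernel, busy-wait on an
// atomic flag until it can be acquired (sketch only; SpinMutex is hypothetical).
class SpinMutex
{
public:
  void Lock()
  {
    // keep spinning while another thread holds the flag
    while (myFlag.test_and_set(std::memory_order_acquire))
    {
    }
  }

  void Unlock()
  {
    myFlag.clear(std::memory_order_release);
  }

private:
  std::atomic_flag myFlag = ATOMIC_FLAG_INIT;
};

Usage would follow the same Lock()/Unlock() pattern as Standard_Mutex, so inside Geom_BSplineCurve/Surface only the mutex type would need to change.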

Contention on B-Spline evaluation is an extremely rare case, so in most scenarios the mutex overhead would be negligible (and improved compared to the 6.x version). Yet it would not require any migration effort from users and thus would improve the out-of-the-box experience.

Andrey (or other OCC folks): has this option been considered, and if yes, what were the reasons to decline it?

István Csanády's picture

Actually, even simple atomic operations can be slow on ARM, as we discussed before. Wouldn't it be a better solution to identify the code that does B-Spline evaluation directly and use the adaptor class there as well?

Roman Lygin's picture

Hi Istvan,

There are too many of those cases where a B-Spline is used directly. Using the adaptor (creating it on the fly just to evaluate a few sample points) has its own overhead, which can likely exceed the cost of a simple atomic.

Can you remind me about the *atomic* (not Standard_Mutex) overhead? I can't recall/find that discussion.

The key point is not OCC itself but rather users' code: it is too much pain to analyze the performance loss and revisit the code. Some OCC users may not even be skilled enough to do that efficiently; they may just use OCC and evaluate curves/surfaces in their code.

Thanks,
Roman

István Csanády's picture

From the "Experiences with OCCT on iOS" topic:

"That was one of the most surprising experience I had during the development. It turned out, that atomic operations on ARM are not as effective as on Intel CPUs, resulting that even locking an unlocked mutex can be expensive. That led to big performance bottlenecks on iOS, since every BSpline curve/surface evaluation involves locking and unlocking a mutex, and that led to spending 30-50% of CPU time in locking/unlocking during meshing and intersection calculations. Yuck… And the worst part is, that you can not really do anything about this, because the BSpline caching requires locking. Finally I ended up simply removing mutexes from those classes, and disabling parallelism. But I am still struggling with this one, because now parallelism would be super useful for boolean operations, but on the other hand I would slow down other parts of the application. I will try to disable caching, and run some performance tests, maybe that is going to be the solution (but I don’t know how much does caching matters, is there any measurement for this?)
I also tried to replace mutexes with GCD (Grand Central Dispatch) synchronization primitives, but it was also slow."

The performance problems caused by mutexes on iOS are related to the performance of atomic operations, not the performance of system calls.

Roman Lygin's picture

Hi Istvan,

Could you please run the enclosed example using the current OCC git master (where B-Splines do not use the cache)?
Please compile and run it with three different flavors: no mutex, Standard_Mutex, and tbb::spin_mutex (see MUTEX_OPTION).

I attach a sample B-Spline curve (of degree 6) and results collected on Windows / vc12 / 32-bit / release.
The elapsed time is the minimum of 3 runs with each mutex option. As seen, tbb::spin_mutex is on par with or slightly better than Standard_Mutex; the average overhead is 4% to 10% vs the no-mutex version.

Your data would be very interesting to see the impact of atomics on iOS.

Many thanks in advance,
Roman

Andrey BETENEV's picture

Hello Roman,

I suppose that your example is aimed at measuring the relative cost that the two variants of mutex locks impose on B-Spline evaluation, is that correct? If yes, why are you using non-cached evaluation of the B-Spline in your example?

The only reason to use mutexes for B-Spline evaluation is to protect the shared cache against concurrent access. Thus a correct measurement should use the cached version, in my opinion. By using the non-cached calculation, you set an unrealistically high baseline for your measurement, and thus the overhead is predictably small.

Andrey

Roman Lygin's picture

Hi Andrey,

A fair comment indeed, thank you. The real purpose is to compare the overhead against the baseline and between the two mutex flavors.
I have updated the sample to run on OCC 6.9.1 (which uses the caches).

As expected, the overhead vs the baseline grows with cache hits. The example uses 11, 21 or 101 sample points (which is quite representative, I believe), which results in reusing the cache 3 to 25 times respectively. tbb::spin_mutex is consistently ~5% faster than Standard_Mutex.

The patch and the modified Geom_BSplineCurve_1.cxx are also included.

Istvan, could you please run the experiments on OCC 6.9.1?

Thanks,
Roman

István Csanády's picture

That would require a few hours of work, and we have a very strict deadline in a few weeks, but let me show you an example.

First case:
int value = 0;

for (int i = 0; i < 100000000; ++i)
{
    value++;
}

std::cout << value << std::endl;
Second case:

std::atomic<int> value(0); // requires #include <atomic>

for (int i = 0; i < 100000000; ++i)
{
    value++;
}

std::cout << value << std::endl;

So we are basically incrementing an atomic and a non-atomic integer on a single thread. I turned off optimization to be able to compare the pure atomic performance.
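
For reference, a minimal harness like the one below can be used to time the atomic loop (a sketch using std::chrono, not the exact code I ran):

#include <atomic>
#include <chrono>
#include <iostream>

int main()
{
  std::atomic<int> value(0);

  auto aStart = std::chrono::steady_clock::now();
  for (int i = 0; i < 100000000; ++i)
  {
    value++; // atomic read-modify-write on every iteration
  }
  auto aStop = std::chrono::steady_clock::now();

  std::cout << value << " increments in "
            << std::chrono::duration<double>(aStop - aStart).count()
            << " s" << std::endl;
  return 0;
}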

Runtimes:

Nonatomic Intel: 0.241018s
Atomic Intel: 0.94s
Ratio: 3.91x

Nonatomic ARM: 0.664850s
Atomic ARM: 14.328221s
Ratio: 21.55x

As you can see, on ARM the same operations with an atomic variable take about 20x more time than with a non-atomic variable, even for single-threaded execution.

This is indeed not the _same_ case, but it highlights the performance differences of atomics between ARM and Intel.

P.S.: actually, with OCCT 7.0 we experienced huge performance improvements instead of a performance loss.

István Csanády's picture

And we had this discussion on the OCE mailing list:

https://groups.google.com/forum/#!topic/oce-dev/7sj1XMxY06s

Where you can see in the attached profiler results that, during single-threaded execution, the top items are:

1. OSAtomicCompareAndSwap64Barrier
2. __mtx_droplock
3. _pthread_mutex_lock
4. ...
5. ...
6. ...
7. OSAtomicDequeue

About 25% of CPU runtime was spent in locking/unlocking mutexes and incrementing/decrementing atomic variables.

And as you can see in that thread, at that time I also tried to use C++11 spin locks, without getting the desired improvements. We would get the same results now.

Roman Lygin's picture

The discussion above on using various mutexes (standard vs spin) has currently stalled due to Istvan's unavailability. Anyway, even on x86 the tests show that single-threaded execution may incur about ~20% overhead in good cache hit ratio scenarios. That is, removal of the mutex (and the consequent need to move the cache out of B-Spline) might indeed be a worthwhile objective.

Let me bring another proposal for discussion.

BSplCLib::Bohm() is a typical hotspot when the cache hit ratio is poor, i.e. when points are evaluated in random order rather than in tight proximity to the previous point.
For instance, we recently conducted a few performance analysis studies using 6.9.1 (i.e. the older cache mechanism), and BSplCLib::Bohm() was often number one. We found that BSplCLib::Bohm() can have about 50 different call stacks leading to it. This is no surprise, as B-Spline evaluation obviously has broad usage: these stacks include curve projection, triangulation, curve length calculation, etc. Thus, a poor cache hit ratio is a frequent scenario.

The proposal is to create multiple caches in GeomAdaptor - one for *each* B-Spline span.

This would improve performance even when points are evaluated in random order, as no cache recalculation would take place. Cache creation can be done lazily, i.e. upon the first need to evaluate a point in span_i; see the sketch below.
GeomAdaptor always has a temporary lifespan (i.e. it lives within some compute-intensive algorithm, e.g. triangulation), so there is not much concern about the increased memory footprint.
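
To sketch the idea (the names below are purely illustrative, not an actual OCC API; SpanCache stands for whatever per-span data, e.g. polynomial coefficients, OCC would cache):

#include <memory>
#include <vector>

// Placeholder for the per-span data OCC caches (e.g. polynomial
// coefficients of one span); the real structure would come from BSplCLib.
struct SpanCache
{
  // ... coefficients of one polynomial span ...
};

// Hypothetical holder of one lazily built cache per B-Spline span.
class PerSpanCacheHolder
{
public:
  explicit PerSpanCacheHolder(int theNbSpans) : myCaches(theNbSpans) {}

  // Return the cache of the given span, building it on first use only.
  const SpanCache& CacheOfSpan(int theSpanIndex)
  {
    std::unique_ptr<SpanCache>& aSlot = myCaches[theSpanIndex];
    if (!aSlot)
    {
      aSlot.reset(new SpanCache()); // here: compute the span coefficients once
    }
    return *aSlot;
  }

private:
  std::vector<std::unique_ptr<SpanCache> > myCaches; // one (lazy) entry per span
};

Locating the span index from the parameter would reuse the existing knot locating code; only the slots of spans that are actually evaluated ever get filled.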

In this case migration to GeomAdaptor could be more tempting for users, as it would promise higher returns compared to the current single-cache case.

Any thoughts on this before filing a feature request?

Mikhail Sazonov's picture

> GeomAdaptor always has a temporary lifespan (i.e. it lives within some compute-intensive algorithm, e.g. triangulation), so there is not much concern about the increased memory footprint.

This statement is not correct. Look, e.g., at the class IntTools_Context. It caches various algorithms (in lazy mode) for each shape during the work of a BOP. Among those algorithms there are GeomAPI_ProjectPointOnSurf, GeomAPI_ProjectPointOnCurve, BRepClass3d_SolidClassifier, Geom2dHatch_Hatcher.

Each of these algorithms stores an instance of an Adaptor (Surface or Curve). So, during the work of one BOP, the number of alive adaptors depends only on the number of input sub-shapes.

Roman Lygin's picture

Hi Mikhail,

Thank you for the feedback.
My point is that any typical real-world scenario is like this:
- original shape
- algorithm
- modified shape

Adaptors live inside the "algorithm", and when it finishes, all temporary structures get destructed.

Certainly, a large cache will increase the peak footprint, but upon algorithm termination the footprint will go away.
As a way to limit the footprint, some threshold could be introduced. For instance, up to 10 caches could be maintained, so for longer (>10 spans) B-Splines only the 10 most recently used caches would be retained (see the sketch below). In any case the approach should help significantly increase performance by improving the cache hit ratio.
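
A sketch of such a capped cache (names are again purely illustrative; SpanCache is the same kind of placeholder as in my earlier sketch): keep the caches in a list ordered by recency and drop the tail when the limit is exceeded.

#include <list>
#include <memory>
#include <utility>

struct SpanCache { /* per-span coefficients (placeholder) */ };

// Keep at most theMaxCaches span caches; the least recently used one is
// dropped when the limit is exceeded (sketch only).
class CappedSpanCaches
{
public:
  explicit CappedSpanCaches(std::size_t theMaxCaches) : myMax(theMaxCaches) {}

  const SpanCache& CacheOfSpan(int theSpanIndex)
  {
    // an existing cache is moved to the front (most recently used)
    for (auto it = myCaches.begin(); it != myCaches.end(); ++it)
    {
      if (it->first == theSpanIndex)
      {
        myCaches.splice(myCaches.begin(), myCaches, it);
        return *myCaches.front().second;
      }
    }
    // not found: build it lazily, put it in front, drop the oldest if needed
    myCaches.emplace_front(theSpanIndex,
                           std::unique_ptr<SpanCache>(new SpanCache()));
    if (myCaches.size() > myMax)
    {
      myCaches.pop_back();
    }
    return *myCaches.front().second;
  }

private:
  std::list<std::pair<int, std::unique_ptr<SpanCache> > > myCaches;
  std::size_t myMax;
};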

Makes sense?

Mikhail Sazonov's picture

I agree with your explanation. It makes sense.
However, if some application stores algorithms the way it is done in BOP, the adaptors can end up having an unlimited lifespan.
Anyway, I vote for limiting the number of caches.

Andrey BETENEV's picture

Hello Roman,

You are perfectly right: the possibility to build the cache once for all spans of a b-spline and thus avoid its recalculation was one of the reasons why we decided to separate the cache from the data classes. Other reasons are:

- Less memory occupied by shapes (up to 20% saving). This is especially important when dealing with big industrial models, on 32-bit and/or mobile devices.

- Absence of contention when evaluating b-splines in parallel threads (real case: building pcurves on face with thousands of inner wires).

Regarding the multi-span cache, this is a subject for future development. As explained by Mikhail, there are situations where it can severely increase memory usage by the algorithm, thus experimentation and likely some compromise will be necessary.

Your contributions are welcome, and I encourage you to register an issue for this improvement, to start with.

Andrey

Benjamin Bihler's picture

Hello Andrey,

I am one of those working with "big industrial models". Right now we have no problems with memory usage, but severe problems with speed (probably because of the locality/cache problem). Therefore, if a multi-span cache is implemented, I would like to vote for cache sizes that can be modified by method/constructor arguments with reasonable default values, and not by compile-time constants.

Benjamin

Roman Lygin's picture

Hi Andrey,
Reported as #27074.