Experiences with OCCT on iOS

Since the beta release of our application is coming, I thought it would be interesting to summarize my experiences with OpenCascade on iOS.
First of all, I would like to say that I am very satisfied with OCCT. Despite all of its problems, it is still the best option for CAD development if you want an open source solution, and in many ways its capabilities compete with those of other (extremely expensive) CAD kernels.

So here are my experiences with OCCT on iOS:

1. Compilation
It was pretty straightforward to compile OCCT. I just had to modify the osutils.tcl WOK files to generate iOS project files instead of MacOS project files, regenerate the Xcode projects, and press compile. There were a few trivial compilation issues, coming mainly from C++11 compatibility (such as string syntax), but those were easy to fix. After the release of OCCT 7 this will be even easier with the new build system.

2. Performance
Well, that was the most exciting part, and here I made some improvements that I am of course going to publish. In my application, after a few days of profiling, I identified two major performance bottlenecks:

a.) Polynomial evaluation
The biggest spike in the profiler was the polynomial evaluation (PLib::EvalPolynomial). After reading Roman's great article on optimizing BSpline evaluation (http://opencascade.blogspot.com/2014/05/applying-vectorization-technique...), I found that these techniques could be applied on ARM too, since the architecture has the so-called NEON, a 128-bit floating point SIMD (single instruction, multiple data) unit. So I reimplemented the EvalPolynomial function, and after a few days of experimenting it turned out that this algorithm could benefit from C++'s dark magic: template metaprogramming. Implementing the Horner algorithm with template metaprogramming is not a very hard task, and the benefits were huge: I was able to get a 10-12x performance improvement in polynomial evaluation in real-life scenarios! The solution is a little bit hacky: template metaprogramming generates static programs while the parameters vary at run time, so I pregenerated many possible Horner algorithms (from degree 0 to degree 16, from the 0th derivative to the 8th derivative) like this:
EvalPoly,
EvalPoly,
EvalPoly,
EvalPoly,
EvalPoly,
EvalPoly,

where the signature of this template metaprogram is this:

template
void EvalPoly(Standard_Real x0, Standard_Real *coeff, Standard_Real *resultVec) {…}

And the actual implementation looks like this:

void EvalPoly(const Standard_Real U,
              const Standard_Integer DerivativeOrder,
              const Standard_Integer Degree,
              const Standard_Integer Dimension,
              Standard_Real& PolynomialCoeff,
              Standard_Real& Results)
{
  if (Degree > MaxDegree || Dimension > MaxDimension || DerivativeOrder > MaxDerivativeOrder)
    _EvalPolynomial(U, DerivativeOrder, Degree, Dimension, PolynomialCoeff, Results);
  else
    evalPolyFuncs[DerivativeOrder + MaxDerivativeOrder * Dimension + MaxDimension * MaxDerivativeOrder * Degree](U, &PolynomialCoeff, &Results);
}

So if there is no pregenerated algorithm, we fall back to the original implementation; otherwise we call the template metaprogram. Since the output of these metaprograms is essentially a long chain of nested SIMD instructions (like multiply_add(a, b, multiply_add(c, d, …))), this leaves the compiler room for insane optimizations, leading to great performance. Although I am pretty sure that such a solution will never make it to the main branch of OCCT, I will of course publish this implementation as well. For now it is ARM64 only, but basically only one line has to be changed to use it on Intel or on any other architecture that supports SIMD instructions. Yes, I know, template metaprogramming is dark magic and produces write-only code, but in this case it is really worth it.
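To make the idea above more concrete, here is a minimal sketch of a compile-time unrolled Horner evaluation with a table of pregenerated instantiations and a runtime fallback. This is not the code from the attached file: all names (Horner, kEvalTable, evalGeneric, evalPoly) are hypothetical, it handles a single dimension and no derivatives, it uses plain doubles instead of NEON intrinsics, and the table stops at degree 4 to keep it short.

```cpp
// Hypothetical names throughout; a sketch of the technique, not the posted code.

// Horner<N>::eval computes c[0] + x*(c[1] + x*(... + x*c[N])).
// The recursion is resolved at compile time, so the compiler sees one
// straight-line chain of multiply-adds for each degree.
template <int Degree>
struct Horner {
    static double eval(double x, const double* c) {
        return c[0] + x * Horner<Degree - 1>::eval(x, c + 1);
    }
};

// Degree 0 ends the recursion: the polynomial is just its constant term.
template <>
struct Horner<0> {
    static double eval(double, const double* c) { return c[0]; }
};

typedef double (*EvalFn)(double, const double*);

// Highest pregenerated degree in this sketch (the post goes up to 16).
const int kMaxDegree = 4;

// One pregenerated instantiation per supported degree, indexed by degree.
const EvalFn kEvalTable[kMaxDegree + 1] = {
    &Horner<0>::eval, &Horner<1>::eval, &Horner<2>::eval,
    &Horner<3>::eval, &Horner<4>::eval
};

// Generic runtime loop, standing in for the original implementation.
double evalGeneric(double x, const double* c, int degree) {
    double result = c[degree];
    for (int i = degree - 1; i >= 0; --i) {
        result = result * x + c[i];
    }
    return result;
}

// Dispatch: use the pregenerated version when one exists, otherwise fall back.
// Assumes degree >= 0 and that c holds degree + 1 coefficients, lowest first.
double evalPoly(double x, const double* c, int degree) {
    if (degree > kMaxDegree) {
        return evalGeneric(x, c, degree);
    }
    return kEvalTable[degree](x, c);
}
```

For example, with coefficients {1, 2, 3} (that is, 1 + 2x + 3x²), evalPoly(2.0, c, 2) goes through the table, while a degree 5 polynomial takes the fallback loop.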

b.) Locks
That was one of the most surprising experiences I had during development. It turned out that atomic operations on ARM are not as efficient as on Intel CPUs, so even locking an unlocked mutex can be expensive. That led to big performance bottlenecks on iOS: every BSpline curve/surface evaluation involves locking and unlocking a mutex, and 30-50% of CPU time was spent in locking/unlocking during meshing and intersection calculations. Yuck… And the worst part is that you cannot really do anything about this, because the BSpline caching requires locking. In the end I simply removed the mutexes from those classes and disabled parallelism. But I am still struggling with this one, because parallelism would now be super useful for boolean operations, yet restoring the mutexes would slow down other parts of the application. I will try to disable caching and run some performance tests; maybe that is going to be the solution (but I don't know how much caching matters; are there any measurements for this?)
I also tried to replace the mutexes with GCD (Grand Central Dispatch) synchronization primitives, but that was slow as well.
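If you want to check the cost of uncontended locking on your own device, a rough timing harness along these lines may help. It is only a sketch: GuardedCounter and timeBumps are hypothetical names, the guarded work is a trivial increment rather than a BSpline evaluation, and the optimizer may collapse the unlocked loop, so treat the numbers as an indication only.

```cpp
#include <chrono>
#include <mutex>

// Hypothetical stand-in for an object whose evaluation is guarded by a mutex.
struct GuardedCounter {
    std::mutex mutex;
    long value;

    GuardedCounter() : value(0) {}

    // Takes and releases the (uncontended) mutex around the work.
    void bumpLocked() {
        std::lock_guard<std::mutex> lock(mutex);
        ++value;
    }

    // Same work without any locking.
    void bumpUnlocked() { ++value; }
};

// Runs `iterations` increments, with or without locking, and returns the
// elapsed wall-clock time in nanoseconds.
long long timeBumps(GuardedCounter& counter, long iterations, bool useLock) {
    const std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();
    for (long i = 0; i < iterations; ++i) {
        if (useLock) {
            counter.bumpLocked();
        } else {
            counter.bumpUnlocked();
        }
    }
    const std::chrono::steady_clock::time_point stop =
        std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```

Calling timeBumps once with useLock set to true and once with false, on a single thread, gives a first estimate of what each lock/unlock pair costs on a given CPU.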

These two are the only significant changes I made in OCCT.

3. Bugs
I have not found many bugs in OCCT (at least not many that could not be worked around), but there is one that is really annoying: bug #25563 (pretty funny, I can't access it now, why?). It may be a compiler bug in Clang, but it may be a bug in the code as well. The problem is that if you compile AdvApp2Var_ApproxF2var.cxx with optimization level 2 or higher, it will crash, and many algorithms are based on this code. I spent two days trying to isolate the issue without any success, mainly because that code is impossible for me to understand :( But this can be worked around by simply setting the optimization level to O1 for that file.
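The workaround described above is a per-file compiler flag. A different and coarser tool, which may help when trying to narrow such a crash down to a single function, is Clang's `#pragma clang optimize off` / `on` pair: functions defined between the two pragmas are compiled without optimization. The sketch below is an assumption about how one might use it; fragileSum is a hypothetical placeholder, not code from AdvApp2Var_ApproxF2var.cxx, and note that this turns optimization off completely for the region rather than lowering it to O1.

```cpp
// The pragma is Clang-specific, so it is guarded for other compilers.
#if defined(__clang__)
#pragma clang optimize off
#endif

// Hypothetical stand-in for a routine that misbehaves when optimized.
// While the pragma above is active, Clang compiles it without optimization.
double fragileSum(const double* values, int count) {
    double total = 0.0;
    for (int i = 0; i < count; ++i) {
        total += values[i];
    }
    return total;
}

// Restore normal optimization for everything defined after this point.
#if defined(__clang__)
#pragma clang optimize on
#endif
```

Moving the pragma pair around individual functions and rebuilding at the higher optimization level is one way to bisect which function is affected.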

4. Allocations
That's just a note: it would sometimes be very useful to be able to provide your own allocator for some classes, because I found that allocations take a lot of time in some algorithms, especially in (GCPntsQuasiUniform/GCPntsTangential)_Deflection. Probably this is also not an issue on Intel CPUs.
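To illustrate what a user-supplied allocator could buy here, the sketch below shows a simple bump ("arena") allocator plugged into a standard container. It is an illustration with standard library types only, not OCCT classes: Arena, ArenaAllocator and sumSamples are hypothetical names, and the arena never frees individual blocks.

```cpp
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size arena: hands out memory by bumping an offset and
// releases everything at once when the arena is destroyed.
class Arena {
public:
    explicit Arena(std::size_t bytes)
        : buffer_(new char[bytes]), size_(bytes), used_(0) {}
    ~Arena() { delete[] buffer_; }

    void* allocate(std::size_t bytes, std::size_t align) {
        // Round the current offset up to the requested alignment.
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start + bytes > size_) throw std::bad_alloc();
        used_ = start + bytes;
        return buffer_ + start;
    }

    std::size_t used() const { return used_; }

private:
    Arena(const Arena&);             // not copyable
    Arena& operator=(const Arena&);  // not assignable
    char* buffer_;
    std::size_t size_;
    std::size_t used_;
};

// Minimal C++11 allocator that forwards to an Arena; deallocate is a no-op.
template <class T>
struct ArenaAllocator {
    typedef T value_type;
    Arena* arena;

    explicit ArenaAllocator(Arena* a) : arena(a) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) : arena(other.arena) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) {}
};

template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) {
    return a.arena != b.arena;
}

// Fills a vector of n sample parameters from the arena and returns their sum.
double sumSamples(Arena& arena, int n) {
    ArenaAllocator<double> alloc(&arena);
    std::vector<double, ArenaAllocator<double> > samples(alloc);
    samples.reserve(static_cast<std::size_t>(n));
    for (int i = 0; i < n; ++i) samples.push_back(i * 0.5);
    double total = 0.0;
    for (int i = 0; i < n; ++i) total += samples[static_cast<std::size_t>(i)];
    return total;
}
```

The point of the design is that an algorithm producing many short-lived buffers pays for one large allocation up front instead of one system allocation per buffer.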

5. Overall
OCCT has a lot of legacy code, but the team is working really hard to get rid of it. They are pretty responsive, and if you file a bug report, they will take a look at it. The whole library is getting better every week, and the development process and communication have improved a lot since OCCT went LGPL, so kudos to the OCCT team, and keep up the good work.

PS.: I have attached the template metaprogram, if someone wants to take a look, get it here: https://www.dropbox.com/s/qij7rdvuuirsh00/eval_poly_template_metaprogram....
I will make the whole branch available next week, but I think I have summarized everything that could be interesting in this post. I just have to make some cleanup. If anyone wants to compile OCCT on iOS feel free to contact me.

Andrey BETENEV's picture

Hello Istvan,

Your post is very interesting, thank you for sharing your experience and positive feedback on OCCT!

We are looking forward to seeing and trying your improvements, please feel free to submit them.

For the locks in BSpline cache, we have been working last year on redesigning that code so as to separate cache from BSpline objects. This version should not use mutexes for bspline evaluation, you can have a look at the current state in Git branch CR24682_2. We would appreciate knowing if it works for you.

Issue #25563 is still accessible, you perhaps do not see it in your list because it is closed (you can change "Hide status" option in the filters to have closed issues listed).

Andrey

István Csanády's picture

Hello Andrey,

I will take a look at the separate cache version, sounds very interesting. I will share the improvements this week.

Istvan

István Csanády's picture

Is there any reason why this separate caching has not been merged to the master yet?

Andrey BETENEV's picture

Hello,

The reason is that it has not been completely tested and debugged yet. There are some regressions in tests, mostly due to subtle changes in sequence of floating point operations that lead to slightly different results. However, it may happen that some actual bugs are hidden there. This is under investigation, but may take considerable time, that is why I am especially interested in knowing your experience if you try this.

Andrey

István Csanády's picture

I have given it a try, and... Well, it just works, and the mutex spikes are just gone from the profiler :) Works like a charm. Of course, I was only able to do a smoke test, but my app is doing tons of BSpline evaluations, and it seems to be working perfectly. I will keep using it, and report every problem that occurs. I will do some more precise measurements.

István Csanády's picture

The only issue I have noticed so far is that when the curve is embedded in some other geometric entity (like a Geom_SurfaceOfLinearExtrusion or Geom_OffsetCurve), there will be no caching. Are there any plans to make caching work even in these cases? I understand this is not a trivial task. Still, this solution seems to be better than the previous one.

István Csanády's picture

Hi,

Here you can access the mentioned modifications: https://bitbucket.org/istvancsanady/occt/src
This branch is basically the main branch of OCCT with very few modifications. This is what we are using for our project. Probably the most interesting part is the polynomial evaluation. 99% of the commits were temporary experiments.

If you are interested in what we are doing, check out our website: http://www.shapr3d.com

István Csanády's picture

I have noticed that you have implemented the EvalPoly function using a template metaprogram as I suggested, that's amazing! I was wondering how much performance improvement you have seen with the new evaluation function. Do you have some test results?

PS.: Why can't I access the ticket, does it contain customer data?

Mikhail Sazonov's picture

The tracker item 24285 is devoted to this problem. I think you can access it. It is open.

István Csanády's picture

I get an Access Denied error message.

Mikhail Sazonov's picture

Here is an extract from there:

Result of testing vs. master is (for most notable cases):

CPU boolean bcut_complex Q9: 6.6768428 / 10.1244649 [-34.05%]
CPU boolean bsection P4: 6.3648408 / 9.9684639 [-36.15%]
CPU bugs moddata_2 bug6862_3: 10.3116661 / 12.1212777 [-14.93%]
CPU bugs moddata_2 bug6862_4: 10.2648658 / 12.0588773 [-14.88%]
CPU bugs moddata_2 bug6862_6: 24.6793582 / 30.0769928 [-17.95%]
CPU de step_3 F2: 313.3748088 / 361.5791178 [-13.33%]
CPU de step_4 I1: 217.8865967 / 253.1428227 [-13.93%]
...

Overall CPU difference by all cases vs. master is reported as [-2%]

István Csanády's picture

Wow, that's great. Have you investigated whether any compiler is able to turn the instructions into SIMD instructions automatically? If not, it might be worth trying to organize them into SIMD instructions manually. But this is indeed a pretty significant improvement by itself.

Andrey BETENEV's picture

Hello Istvan,

Sorry the issue (#24285) had incorrect visibility flag; now it should be visible to everyone, please check.
Besides, have you tried to access it under your own account or anonymously?
In the former case there should be some problem in access rights settings, please let us know if it is the case so we could fix it.

We have not investigated what kind of code is actually generated, so I believe there could be room for further improvement here. It would be great if you could have a look at this and share your findings.

In fact, we have tried to use the version of the templated evaluator that you posted in the first message of this thread (2a), but could not get a good result on Windows due to memory alignment problems. The same version with fixes for alignment is the one located in branch CR24285_azn, and it does not bring more performance than the other variants. I guess we have somehow lost the SIMD features; we intend to investigate this in more depth. Sorry for not having had time yet to comment on your code, I hope we shall be able to give more details soon.

Andrey

István Csanády's picture

I tried to access it with my own account.

I will try to take a look at it later, thank you for sharing your experiences.

Maybe the magical ARM+Clang combination is a factor here; Clang's optimization features can be insanely effective.

ZAIKIN Alexander's picture

Hi Istvan,

Comparative performance testing has been conducted for various implementations of EvalPolynom function.
Testing has been performed for the following versions:
- occt 6.8.0 (without improvements);
- occt 6.8.0 (with your version ported on SSE2) [branch CR24285_Istvan];
- occt 6.8.0 (modified SSE2) [branch CR24285_azn];
- occt 6.9.0.

All branches have been compiled by Microsoft(R) Visual Studio 2012 C/C++ compiler 17.0.61030.0.
Processor : Intel(R) Core(TM) i5-3450
Memory (RAM) : 16.0 GB
OS : Microsoft Windows 7 Professional x64

Test results:

perf bspline intersect
occt 6.8.0 : 47.01188
occt 6.8.0 (Istvan SSE2) : 43.18880 ( -8.13% )
occt 6.8.0 (azn) : 40.47462 ( -13.9% )
occt 6.9.0 (master) : 40.94877 ( -12.89% )

boolean bcut_complex Q9
occt 6.8.0 : 16.26851
occt 6.8.0 (Istvan SSE2) : 15.72012 ( -3.37% )
occt 6.8.0 (azn) : 15.15565 ( -6.84% )
occt 6.9.0 (master) : 10.24492 ( -37.02% )

boolean bsection P4
occt 6.8.0 : 15.79673
occt 6.8.0 (Istvan SSE2) : 15.43582 ( -2.28% )
occt 6.8.0 (azn) : 14.86571 ( -5.89% )
occt 6.9.0 (master) : 9.97542 ( -36.85% )

bugs moddata_2 bug6862_3
occt 6.8.0 : 21.86812
occt 6.8.0 (Istvan SSE2) : 21.66858 ( -0.91% )
occt 6.8.0 (azn) : 20.72454 ( -5.22% )
occt 6.9.0 (master) : 20.70408 ( -5.32% )

bugs moddata_2 bug6862_4
occt 6.8.0 : 21.60477
occt 6.8.0 (Istvan SSE2) : 21.47732 ( -0.58% )
occt 6.8.0 (azn) : 20.86634 ( -3.41% )
occt 6.9.0 (master) : 19.86530 ( -8.05% )

bugs moddata_2 bug6862_6
occt 6.8.0 : 51.02871
occt 6.8.0 (Istvan SSE2) : 51.84278 ( +1.57% )
occt 6.8.0 (azn) : 50.84567 ( -0.35% )
occt 6.9.0 (master) : 45.31487 ( -11.19% )

Of course, we selected the most indicative tests with the highest performance progress.
At this moment the most performant is OCCT 6.9.0.
Could you please provide your results on ARM platform?

István Csanády's picture

Sure, I will run the tests as soon as I have some time.
So basically these results mean that the solution I provided before made it slower on Intel? Very interesting. Were these tests run in release configuration with the highest optimization level? What changes made 6.9.0 faster? The first measurements were much more promising.

István Csanády's picture

Fresh and crispy OCCT 7.0 for iOS.

https://bitbucket.org/istvancsanady/occt/

Modifications:
- cmake build system is tailored for the needs of iOS (Xcode project generates dynamic frameworks, and adds .plists to frameworks, and also does code signing)
- basically nothing else, only a small modification in ShapeUpgrade_UnifySameDomain, that we will submit as an integration request soon
- visualization, and CAF are still not compiled on iOS

Kirill Gavrilov's picture

Hello Istvan,

I'm just curious what is the actual purpose for this repository - are you just trying to meet LGPL requirements, or maybe bitbucket is more convenient for you?

- visualization, and CAF are still not compiled on iOS

do you mean that these components are not compilable by modified CMake tools or you are experiencing some general issues?

At least Visualization should build fine (this has been tested with projects generated by genproj some time ago); otherwise there is a regression in master which should be fixed.

István Csanády's picture

There are a few reasons:
1. Yes, LGPL
2. But more importantly: we quite often need fixes that cannot wait until they make it to the OCCT main branch. The latest example was the clang null reference crash issue, but there were many other examples, like using C++ atomics instead of OSAtomics, or other workarounds related to clang's latest optimization issues. These changes are always submitted as bug tickets or change requests, but sometimes our solution (which works well for us) needs refinement before it can be merged to the main repository. In that case we use our patches as a temporary solution, and drop them once the fixes are integrated into OCCT.
3. We need a custom build system to be able to build frameworks that can be deployed to the App Store.
4. We can try our ideas here that are specific to iOS/arm, like the template metaprogram for the polynomial evaluation.

We don't use the visualization frameworks, which is why we did not bother to build them, but yes, they could probably be compiled.

Bryan Oswalt's picture

Hi Istvan,

Excellent work. Thanks for providing this.

I tried building the tree with CMake, XCode 7.2, and version 9.2 of the iPhone SDK (Changed SDKVER to 9.2 in CMakeLists.txt) and ran into one issue, so I wanted to make sure that I'm doing things correctly.

After creating the XCode project files with CMake, I opened OCCT.xcodeproj and tried building the ALL_BUILD target. The build failed immediately because adm/mac/xcd/Info.plist was missing. I copied this file from one of the project folders into /adm/mac/xcd, and everything seemed to build fine after that.

So, my question is, am I missing some step that is supposed to copy the Info.plist file over automatically?

István Csanády's picture

No, you are not. You have to create the plist manually, since it depends on your project. It's a basic Info.plist.

Just out of curiosity, what are you working on? I would love to hear about other iOS+OCCT projects.

Bryan Oswalt's picture

Hi Istvan, thanks for the answer. Your project looks awesome by the way. I'm terrible at CAD modeling, and your app makes it look so easy. It sounds like from your YouTube posts that the app will be very affordable.

I wish I were doing something cool with OCCT. At this time, I work on an R & D team and am just evaluating the current level of support for OCCT on mobile devices.

zhai's picture

Hello,
About your problem: I have it too.
With Xcode 7.2 and iOS 9.2, Info.plist is missing and there is no mac folder in adm. Have you solved it?

nic

Bryan Oswalt's picture

Hi Nic,

First off, I'll call your source root folder OCCT_SRC and the folder that you specify in CMake for your binaries OCCT_BIN. I tried making OCCT_BIN the same as OCCT_SRC, but one of the XCode builds failed, so I think that these should be different.

Edit the OCCT_SRC/CMakeLists.txt file in your source root and change the SDK version on line 35 to read 'set(SDKVER "9.2")' or change it to match your actual SDK version. If you don't do this, you will get an error when you run CMake that the SDK folder was not found, and the projects won't be generated correctly.

Manually create the "mac/xcd" folder under OCCT_SRC/adm.

Run CMake, fill in the paths for OCCT_SRC and OCCT_BIN, configure for XCode, and generate the projects. CMake will create several copies of a generic Info.plist under your OCCT_BIN folder.

These should all be generic and identical, so you can use any one of them. I manually copied OCCT_BIN/src/TKBO/CMakeFiles/TKBO.dir/Info.plist to OCCT_SRC/adm/mac/xcd. There's nothing special about TKBO - I just picked that one as it was the first one that appeared in Finder. After doing those steps, the "ALL_BUILD" scheme should build.

Let me know if you need any more help.

Now I'm looking for a way to test that my build worked properly, as I don't see any sample apps built, only libraries.

Bryan Oswalt's picture

Hi Istvan,

What approach did you take in order to set up the OCCT environment on iOS? For example, the UnitsAPI class uses the CASROOT environment variable and the "/src/UnitsAPI/Units.dat" file. Do you set the CASROOT environment variable in your iOS project and copy the 'Units.dat' file to one of the iOS folders, or does your project not require these?

I'm just curious as to what you did before I take my own approach. I'm not very familiar with OCCT, but I assume that other parts of the framework require environment variables to be set and might also require configuration files as above.

István Csanády's picture

We don't use the units API currently. But it should be easy to define your iOS folder as CASROOT.
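As a rough sketch of what defining a folder as CASROOT could look like, the helper below sets the environment variable from inside the process with the POSIX setenv call. The function name setCasRoot and the example path are hypothetical; in a real app the path would come from wherever the resource files (such as src/UnitsAPI/Units.dat) were copied, and the call would have to happen before OCCT reads the variable.

```cpp
#include <stdlib.h>  // POSIX setenv
#include <string>

// Points the CASROOT environment variable at resourceDir.
// resourceDir is an assumption: the folder under which the OCCT resource
// files were copied, so that resourceDir + "/src/UnitsAPI/Units.dat" exists.
// Returns true if the variable was set.
bool setCasRoot(const std::string& resourceDir) {
    if (resourceDir.empty()) {
        return false;
    }
    // Third argument 1 means: overwrite any existing value.
    return setenv("CASROOT", resourceDir.c_str(), 1) == 0;
}
```

A call such as setCasRoot("/path/to/app/resources") early in application startup would then make the folder visible to any code that looks up CASROOT through the environment.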

Bryan Oswalt's picture

Hi Istvan,

I saw that you had asked a question similar to this in 2013. Do you know whether or not closed source applications that are built with the dynamic OCCT frameworks can be released on the iOS app store? From my understanding, the application has to be dynamically linked with OCCT to meet LGPL requirements, and I am also unclear as to whether applications that contain dynamic frameworks will be accepted by the app store.

Thanks

Roman Lygin's picture

Follow up on item 2b (use of locks in B-Splines) - http://dev.opencascade.org/index.php?q=node/1138