Sciencemadness Discussion Board

Speed of C programs, relative to Java

woelen - 11-12-2023 at 04:20

I am working on a library for polynomial root finding (soon, more will follow on that on sciencemadness). The software is written in Java, but I also make ports of the software to C, which is fairly easy (porting the other way around is more cumbersome).

The reason for making the C-ports is twofold. One reason is that it makes the software available to a larger group of people, and the second is that C-programs are said to perform better than Java programs in terms of computational speed.

However, I notice that my C ports consistently run slower than the Java ports. It hardly matters what software I port; the C code always runs between 1.5 and 2 times slower than the Java code with exactly the same input.

I run the software on a Core i7-13700 with DDR4-3200 memory, running Ubuntu 22.04. For the speed comparison, I run all software in a single thread, just to keep things simple. None of the software is really memory intensive; it all runs in a few MB at most.

The Java environment is a standard JDK, 11.0.x, as supplied with Ubuntu, or Oracle's HotSpot JDK, also for Java 11. I compile the programs using NetBeans. No additional frameworks besides standard Java SE are used.

The C environment is GCC, as supplied with Ubuntu 22.04, with the standard library for I/O and math. The code, however, uses math-library calls sparingly; most of it is just in-language multiplication, division, addition and subtraction. I compile the programs using optimization level -O2, -O3, or even -O6, but between those there hardly is any difference. Without optimization, -O0, the C software is nearly twice as slow as with -O2. The software is standard C, no C++. No additional frameworks are used; it runs with just the standard stdio.h, stdlib.h, and math.h.

Porting of the code is done in a really straightforward way. The Java code works on arrays of doubles, ints, etc. The C code does the same, using the same indexing scheme. No pointer arithmetic is used in the C code; the Java code is copied into the C code nearly 1 to 1.

I have the impression that I am missing something. On the internet I read many websites which claim that C is so much faster than Java in all kinds of applications, but I see things being the other way around. My C ports are consistently at least 1.5 times slower than the Java code, even when the C code is compiled with heavy optimization.

Could it be that my type of code (numerical mathematics, containing a lot of computation, but hardly any I/O, no database interaction, and hardly any heavy data-shifting throughout memory) is very atypical and is actually faster in Java, due to JIT-compiler optimization?

Any experts out there who could shed some light on this?

Tsjerk - 11-12-2023 at 05:23

Are you promoting a lot of ints to longs? That takes a lot of time and might be avoided by starting out with long instead of int.

woelen - 11-12-2023 at 06:12

No, I do not convert ints to longs. It is mostly floating-point (double) arithmetic, and a little integer arithmetic (e.g. computing indexes into arrays, that kind of thing).

Sulaiman - 11-12-2023 at 06:31

I expect that if you wrote the program in C and then ported it to any other language, it would run better in the C version.
That applies to all programming languages.

Keras - 11-12-2023 at 12:48

Quote: Originally posted by woelen  
I am working on a library for polynomial root finding (soon, more will follow on that on sciencemadness). The software is written in Java, but I also make ports of the software to C, which is fairly easy (porting the other way around is more cumbersome).

However, I notice that my C ports consistently run slower than the Java ports. It hardly matters what software I port; the C code always runs between 1.5 and 2 times slower than the Java code with exactly the same input.


This is absolutely not normal. C code is, barring assembly, the closest to binary you can get with a ‘high-level’ programming language. In comparison, Java is interpreted, which means that the program is not transformed into bytes directly ingestible by the CPU, but run by a specific process (the ‘interpreter’), itself written in C in the majority of cases, as with Java or Python.

First of all, why Java? Why not Python? Use Python + SciPy, and you have your root-finding function already coded, and probably better than anything you can code yourself.

If you insist on writing C code, maybe as a didactic exercise, I encourage you to get a copy of the book Numerical Recipes in C, which is a sort of Bible for a lot of mathematical algorithms.

There are, however, not a lot of good algorithms for finding polynomial roots, especially when those roots are multiple. The bisection method is slow and prone to roundoff accumulation. Newton's method can converge quadratically, but it needs a good estimate to start with. Your best bet is to use the QR method on the companion matrix to get those estimates, then use Newton's method to increase their precision.
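
To give an idea, one Newton step is only a few lines; a rough C sketch (my own illustration, with coef[0..n] holding the coefficients and coef[n] being the leading one):

Code:
/* One Newton iteration for a real polynomial, evaluating p(x) and
   p'(x) together with Horner's scheme.  Assumes p'(x) != 0. */
static double newton_step(const double *coef, int n, double x) {
    double p  = coef[n];   /* becomes p(x)  */
    double dp = 0.0;       /* becomes p'(x) */
    for (int i = n - 1; i >= 0; i--) {
        dp = dp * x + p;
        p  = p  * x + coef[i];
    }
    return x - p / dp;
}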

What version of the GCC compiler do you use?

[Edited on 11-12-2023 by Keras]

clearly_not_atara - 11-12-2023 at 14:12

Unfortunately we do not have much hope of answering your questions without seeing the code! If you are observing Java outperforming C, there are a few possible reasons:

- Your code is memory-bound, and your manual memory management is poor. The Java GC is mediocre for small tasks because it is optimized for large ones. It will beat C that is not written to a high standard on these loads. This happens sometimes when using a lot of data.

- Your code is parallel, and the load is distributed more efficiently across threads in the Java version than in the C version. This is a common reason if the code is parallel, and it is relatively easy to test by using a process monitor and seeing if your program utilizes all of your cores.

- Your code uses an efficient internal Java function for an important computation, but you used an inferior version in C, possibly one you wrote yourself.

- You failed to turn on compiler optimizations (unlikely).

- Your C code has mistakes.

The fix will depend on where the problem is.


woelen - 11-12-2023 at 14:46

@Keras: I actually am in the process of improving some of the best polynomial root finders available nowadays. I have specialized code for degree 2, 3, 4, and 5 polynomials, and I know of no code which beats what I achieve, both in terms of speed and accuracy. I tested it with an extensive test suite, using all kinds of nasties, including polynomials with multiple roots and closely clustered roots. A web page and paper will follow soon.
For general polynomials, I used the Madsen-Reid, Jenkins-Traub, and Aberth-Ehrlich methods. Best results are obtained with MR and AE; JT is quite disappointing and many polynomials simply cannot be solved at all by that code. I ported the MR code to Java and C (the original code is written in Fortran, modules PA16 and PA17) and made many modifications to it, in order to solve issues with certain classes of polynomials which cannot be solved by the original MR algorithm. Aberth-Ehrlich I also ported to Java, and made some small modifications to make it somewhat more robust. It is exceptionally useful for sparse polynomials and in particular trinomials; nothing can beat it in terms of speed and accuracy. In order to achieve this, I again modified the original algorithm, by selecting suitable/better starting values.

I use GCC version 11.4.0 (from Ubuntu 22). I also tried other, older versions of GCC, with similar results.


@clearly_not_atara: I do not need any memory management in my code. It is basically numerical algorithms, with the needed memory either simply on the stack, or pre-allocated before running the algorithm; there is no need to dynamically allocate and release memory. So, this is not an issue. The same is true for Java and C. Polynomial root finding is not memory-intensive at all (e.g. for solving a degree 100 polynomial I only need storage for a few hundred numerical values).
The code also is not parallel. Solving one polynomial equation takes microseconds, or milliseconds at most. There is no need to parallelize the algorithms for that. Of course, I can run the algorithm concurrently on different polynomials, but inside one polynomial-solving 'session' I use no parallel code.
As I wrote in the opening post, my code does not use any external library, nor is it dominated by some internal function. It mainly uses standard arithmetic and comparisons, which are part of the language (e.g. +, -, *, /, <, >, and an occasional sqrt()).
When I release the code upon publishing the upcoming web page, you can have a look at it.

I use Java because of its superior performance in many cases. It indeed produces byte code, but on modern JVMs the byte code in turn is converted to native machine code dynamically, while the program runs. This is done by the so-called JIT compiler, and on modern JVMs it does an amazing job; it is said that it can even outperform well-crafted, manually written assembly code. Also, many low-level Java functions nowadays are implemented by means of so-called intrinsics, specialized optimized pieces of assembly code, which the JIT compiler either calls or, if they are small, inlines into the generated code. But indeed, I expected optimized C code to be faster, especially when it does not use external functions from a fixed, pre-compiled library.


I am actually very content with the performance of my Java code. E.g. 4th degree equations can be solved reliably in only 400 ns or so in a single thread of a standard recent Core i5 or Core i7 processor (2.5 million such polynomials per second, at around the theoretically attainable accuracy, also for ill-conditioned ones). Polynomials of degree between 12 and 25 can be solved accurately in approximately 8 seconds per million polynomials on the same CPU in one thread.

I also wrote a simple library which allows me to do computations at 105 bits of precision (almost twice the accuracy of a standard double). This code is MUCH faster than stuff like GMP or MPFR in C, or BigDecimal in Java. Of course, its precision also is more limited, but it fills a gap between full high-precision arithmetic and the limited precision of doubles. I also implemented many algorithms in Java using this 105-bit code. I do not have a port to C for this; maybe I will do that in the future.

I expected more from my C port, but was disappointed by its performance. It might be that there is some mistake in it, but given that it is a nearly 1 to 1 copy of the Java code (just omitting all the OO stuff from Java), I can hardly imagine that there is a real bug: the numerical outcomes of the C code and Java code are frequently identical down to the bit level. Sometimes they differ slightly, because of the use of a different random generator for test data, or because of another implementation of a math function, which may have one or two ulps of difference in its output.

[Edited on 11-12-23 by woelen]

Rainwater - 11-12-2023 at 16:53

I would suggest running a profiler to see where the bottlenecks are.
This will direct you to where improvement can be made.

If you can add some macros before and after sections of your code and
time them to see how long they take, you will know where to start.
Assuming you're not using std::cerr for anything else and your program
is console based, you can pipe the error output to a file for review later.
Writing on my phone, the code might contain errors.
It's the simple things that make programs fast.
Thread blocking is also another major porting issue.
Again, Java has optimizations in place to make this easier.
I would be happy to take a look,
even add to my stack of NDAs if need be.
Code:
myProgram 2> debug_file.txt

Code:
#include <stdint.h>
#include <iostream>   // needed for std::cerr

// Windows
#ifdef _WIN32
#include <intrin.h>
uint64_t rdtsc() {
    return __rdtsc();
}
// Linux/GCC
#else
uint64_t rdtsc() {
    unsigned int lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}
#endif

#define DEBUG_CPU_CYCLES std::cerr << __FILE__ << ":" << __LINE__ << " " << rdtsc();

code source

The issue will be a simple one. C/C++ is as fast as you get.
The evolution of Java has removed so many design hurdles, aimed at making it easier to work with, that
porting from Java down to C will likely result in these optimizations being omitted.
For example, an array in Java is often compared to a std::vector in C++, but the implementation is very different.
To the user they both function the same, but in C++ the vector's storage can end up fragmented in places, acting as kind of a doubly linked list and making lookups slower. Or one can fragment the heap by not calling std::vector::reserve() before pushing items into it.
Java optimized this years ago into a binary tree for large arrays, making the lookups quicker.

[Edited on 12-12-2023 by Rainwater]


clearly_not_atara - 11-12-2023 at 18:20

Quote:
or using another implementation of a math-function, which may have one or two ulp's of difference in the output.

If you are noticing slight differences in the last few bits of floating-point results between C and Java, and the C code is slower, the first thing I would try is using -ffast-math. I would be surprised if Java didn't always stick to IEEE-compliant floating point, but differences like that in the output suggest that it does not. The goal of IEEE floating point is to ensure that computations are always exactly reproducible, while the unsafe operations enabled by -ffast-math will avoid things like correcting the least significant bit in favor of using fewer cycles and getting very close. Most of the standard math library is also IEEE-defined and should always be exact.

Also, there is no -O6:

https://gcc.gnu.org/legacy-ml/gcc-help/2011-06/msg00450.html
https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

-O3 will not generally enable -ffast-math because gcc's authors believe (correctly) that general compiler optimization flags should not alter the output of a program, while -ffast-math can definitely alter the output of the program. See:

https://kristerw.github.io/2021/10/19/fast-math/

EDIT: There is also -Ofast which enables all unsafe optimizations.

Also, modern gcc will typically use all available intrinsics, although this may not always be true on Windows or the newest processors, where MSVC or ICC (free for educational use IIRC) can be faster. But the difference between GCC and ICC usually isn't that big. ICC sucks on AMD processors usually (Intel's weird thing).

[Edited on 12-12-2023 by clearly_not_atara]

Keras - 11-12-2023 at 22:43

Okay, I apologise, I assumed (wrongly) you were simply dabbling with root finding. I see you’re venturing into advanced territory. Since you’re there, I would advise you to take a close look at rounding errors and their accumulation during the computations.

For your performance issue, if you want, I can benchmark your code with Clang/LLVM on ARM, which I use daily as a macOS developer. Also, try to analyse whether there is any possibility to vectorise your code; even if the gains are minimal, they may amount to something when the code is called repeatedly. It may be that Java is using some highly optimised routines that are unavailable at the C level, because GCC is unable to optimise the code beyond a certain point.

Isn’t 128-bit arithmetic already available at the ASM level? Why 105 bits? This seems an odd figure to me.

[Edited on 12-12-2023 by Keras]

Rainwater - 12-12-2023 at 03:50

User interface output can also be a major time consumer.
Java spawns a separate thread to handle writing to the OS;
this lets output calls return immediately. C does not.
Using platform-specific code, you can redirect std::cout, or in C the stdout handle,
to a buffer and reproduce the same optimization.
Thread safety is achieved by pausing the program thread,
swapping out the buffer, then
resuming the program thread while the buffer is printed.
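
A much simpler variant of just the buffering half of that idea, in plain C (same phone disclaimer as above, treat it as a sketch):

Code:
#include <stdio.h>

static char out_buf[1 << 20];   /* 1 MiB output buffer */

int main(void) {
    /* Fully buffer stdout so printf() rarely hits the OS;
       the buffer is flushed automatically at exit. */
    setvbuf(stdout, out_buf, _IOFBF, sizeof out_buf);
    for (int i = 0; i < 1000; i++)
        printf("line %d\n", i);
    return 0;
}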

[Edited on 12-12-2023 by Rainwater]

woelen - 12-12-2023 at 07:40

After searching a lot on the internet I think I found a possible answer to my question. It seems that the latest incarnations of the JVM are capable of detecting situations where the use of AVX instructions is beneficial.

A simple case is something like


Code:
for (int i = 0; i < n; i++) {
    a[i] += b[i] * c[i];
}


or

Code:
double zb = ....;
double sum = 0.0;
for (int i = 0; i < n; i++) {
    sum += Math.abs(zb - z[i]);
}



The JIT compiler can translate this into a loop which uses AVX or AVX2 instructions, which process 2 or 4 operands at once and thus do the same piece of work 2 or 4 times in parallel. This reduces computational time. Not by a factor of 4 (due to e.g. limitations on memory bandwidth), but it certainly can be significant (e.g. a factor 1.5 speedup is plausible). The newer JIT compilers also recognize more complicated loop structures, e.g. situations where multiple lines of code are executed after each other in a loop. For this reason, there is less and less need for specialized vectorizing code in Java (Java has a special SIMD API, but using it is quite low-level and makes code much harder to understand).
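
For comparison, a C counterpart of the second loop would look like this (my own sketch, names are illustrative). With plain -O3, GCC keeps such a floating-point reduction in source order and does not vectorize it; re-association has to be allowed explicitly, which ties in with the replies below:

Code:
#include <math.h>

/* Same reduction as the Java loop above.  Without -ffast-math (or at
   least -fassociative-math), GCC must keep the additions in source
   order and will not vectorize this reduction; with it, the loop can
   be turned into AVX/AVX2 code.  'restrict' tells the compiler the
   array is not aliased elsewhere. */
double sum_abs_diff(double zb, const double *restrict z, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += fabs(zb - z[i]);
    }
    return sum;
}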

Most likely this is not the complete answer, but I can imagine that this at least partly explains the effect I observe. On older CPUs without SIMD instructions this may not be the case. Unfortunately, I have no really old CPUs lying around on which I can test this (the oldest I have is a 3rd generation Core i5 from 2012, dusty, but most likely still working).

@Keras: Why do I use 105 bits and not 128 bits? This is because of performance. Beautiful algorithms exist for combining the precision of two doubles with 53-bit mantissas into a single numerical entity with a 106-bit mantissa. In practice, due to the loss of one or two ulps, this leads to approximately 105 bits of accuracy: https://www.davidhbailey.com/dhbpapers/qd.pdf
This works for combining two doubles to get 105 bits of precision, or combining four doubles to get 211 bits of precision. It works on any IEEE 754 or IEEE 854 architecture which adheres strictly to the standard. Java does this (not yet in Java 11, unless you use strictfp on methods and classes; Java 17 does it out of the box). Using FMA (fused multiply-add) instructions, this can be made even faster, and that is what I do in my Java implementation. This is MUCH faster than any multiprecision library I have seen, and 105 bits of precision is good enough for polynomials even up to degree 100. With the standard 53 bits of precision, polynomial root finding works well for polynomials with just a few terms, but for general dense polynomials, the limits are reached somewhere around degree 50, especially if the polynomial is somewhat ill-conditioned.
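
For those who do not want to dig through the paper: the two error-free building blocks are only a few lines each. A C sketch, with my own naming, not the code of my library:

Code:
#include <math.h>

/* A double-double value is an unevaluated sum hi + lo of two doubles,
   giving roughly 105-106 bits of significand. */
typedef struct { double hi, lo; } dd;

/* Error-free addition of two doubles (Knuth's two-sum). */
static dd two_sum(double a, double b) {
    double s   = a + b;
    double bb  = s - a;
    double err = (a - (s - bb)) + (b - bb);
    return (dd){ s, err };
}

/* Error-free product: a*b = p + err exactly, using one FMA. */
static dd two_prod(double a, double b) {
    double p   = a * b;
    double err = fma(a, b, -p);
    return (dd){ p, err };
}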

I also have a general full-precision implementation, using MPSOLVE, with GMP for multiprecision and JNI for use from Java code. This can be used to solve polynomial equations with tens of thousands of terms, but of course, this is much slower than using basic floating point with limited precision.

I also have experimented with real 128-bit arithmetic in GCC and did play around with it, but it is prohibitively slow. The arithmetic is done with integer/long arithmetic and inline assembly (it is available in GCC as a special floating-point type). This is MUCH slower than the 105-bit algorithm described in the paper mentioned above, mainly because of all the bit manipulation needed for exponent management, checking for overflow/underflow/NaN generation, and other bit-stuffing to represent floating-point numbers.




[Edited on 12-12-23 by woelen]

clearly_not_atara - 12-12-2023 at 07:44

^right, GCC will not enable this optimization without -fassociative-math, which is part of -ffast-math, even with -O3 enabled, because floating-point arithmetic is not associative and vectorizing it can change the final result (however slightly).

Compilers are very strict about consistency!

woelen - 12-12-2023 at 07:55

I'll try compiling with the -fassociative-math option or the -ffast-math option. Sounds interesting. I'll come back on that.

Keras - 12-12-2023 at 08:06

Did you experiment with fixed point arithmetic?

woelen - 12-12-2023 at 11:23

I tried the -ffast-math option. It makes the C-code somewhat faster.

Solving 100M quartics, plus analysing the error, relative to theory, takes 38 seconds with -ffast-math -O3 and it takes 42 seconds with only -O3. I used gcc version 11.4.0.
In Java, solving the 100 million quartics, plus doing the same analysis, on JDK 17.0.7, takes 34 seconds. This is with strict IEEE mathematics. (since Java 17, strictfp is the standard behavior in Java, loose IEEE conformance is not allowed anymore).

So, using the -ffast-math optimization has some effect, but it is not dramatic. Besides that, it introduces loose IEEE conformance.

@Keras: Fixed-point arithmetic is not interesting at all in polynomial solving. I need big ranges for storing intermediate results. Even if you look at polynomials with small roots in the range 0.1 to 10, the coefficients can become HUGE for polynomials of degree 50 or so. Think of (x+1)^50 and see what coefficients you get (the middle coefficient alone is already about 1.3*10^14). For anything with larger or smaller roots, the effect only gets worse.
I even consider extending my 105-bit floating-point library with large exponents (e.g. a Java int for storing exponents in the range -2 billion to +2 billion). This would make certain algorithms more robust, although at additional computational cost. Polynomial root finding really is a science of its own ;)

[Edited on 12-12-23 by woelen]

clearly_not_atara - 12-12-2023 at 13:22

A hundred million quartics? Are you sure that your application isn't spending the difference generating the quartics?

Maybe in C you call a function 100 million times and it actually calls a function by creating and destroying a stack frame while Java is just running the test program as one big loop? It does seem like there is room for some "non-arithmetic" optimization here because the actual math (a Newton iteration?) happens very quickly.

Also, doesn't the C rand() only generate integers? Are you doing something like
Code:
double x = rand() / ((double) RAND_MAX);


I just ran some code that generates 500 million random floating-point numbers and returns their sum and it took 3.4 seconds to run in C on my laptop. Since a quartic has five coefficients, that suggests this could be affecting your runtime for 100 million quartics. I was not able to beat this with pointer twiddling, but I didn't try very hard.

Keras - 13-12-2023 at 00:31

Quote: Originally posted by woelen  
Polynomial root finding really is a science on its own ;)


I don’t dispute that for a single second! ;)

Your pointer to the extended precision library is very interesting. I wonder if I’m not going to try to implement it in ARM assembly. I wish you could release your code; it would make benchmarking much easier.

Also, do all your methods work with multiple roots? I know you quickly answered that before, but I’m really curious about that. Especially the A-E method, which you seem to cherish :)

woelen - 13-12-2023 at 01:02

I only timed the actual solving of the generated quartics and the analysis of the error. The time needed for generating the quartics is not included in the numbers reported above.

I now timed the generation of the quartics as well. Generation of 100M quartics takes 4 seconds in C and approx. 2 seconds in Java, but this comparison is not fair. In C I use drand48(), and in Java I use a highly optimized random generator which I wrote myself, based on the new Xoshiro256 class of pseudo random generators: https://prng.di.unimi.it/ In both cases, however, I use a function call into a library (I put the Xoshiro256 generator in a separate library, in which it is implemented as a subclass of Java's standard Random class).
In both environments, I scale the random generator output such that the coefficients are in some range [coefMin .. coefMax], with each term having its own range:
Code:
double x = coefMin + drand48()*(coefMax - coefMin);
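
For reference, the xoshiro256** core from the page linked above is tiny; in C it looks like this (seeding omitted, and the conversion to a double in [0, 1) uses the top 53 bits, as recommended there). My Java class wraps the same recurrence:

Code:
#include <stdint.h>

static uint64_t s[4];   /* state; must be seeded to something nonzero */

static inline uint64_t rotl(uint64_t x, int k) {
    return (x << k) | (x >> (64 - k));
}

/* One step of xoshiro256** (Blackman & Vigna). */
static uint64_t next(void) {
    const uint64_t result = rotl(s[1] * 5, 7) * 9;
    const uint64_t t = s[1] << 17;
    s[2] ^= s[0];
    s[3] ^= s[1];
    s[1] ^= s[2];
    s[0] ^= s[3];
    s[2] ^= t;
    s[3] = rotl(s[3], 45);
    return result;
}

/* Uniform double in [0, 1), built from the top 53 bits. */
static double next_double(void) {
    return (next() >> 11) * 0x1.0p-53;
}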


The quartic solver does not use Newton iteration. It is a hybrid analytical/numerical method, built around the LDL^T decomposition of a 3x3 matrix. This method is very fast, and can also be made very accurate, also in all kinds of corner cases. The latter I describe in my paper, which I will soon put on my website.



woelen - 13-12-2023 at 01:27

Quote: Originally posted by Keras  

Your pointer to the extended precision library is very interesting. I wonder if I’m not going to try to implement it in ARM assembly. I wish you could release your code; it would make benchmarking much easier.

The code for the extended precision I can share. I'll see if I can put it somewhere here this evening. It is Java code, but it should be easy to port it to C.

Quote: Originally posted by Keras  

Also, do all your methods work with multiple roots? I know you quickly answered that before, but I’m really curious about that. Especially the A-E method, which you seem to cherish :)

The methods I use also work well for multiple roots. Accuracy for multiple roots does go down, but that is not a flaw of the software; it is a basic mathematical property. For example, if you have a limited precision, such that 1 and 1+eps are just not distinguishable anymore, then you cannot distinguish a multiple root with value 1 from a cluster of separate roots which differ from 1 by a margin with absolute value eps^(1/k), where k is the multiplicity. For IEEE floating point, eps is around 10^-16. This means that for double roots, the accuracy is limited to around 8 digits, for a triple root it is a little over 5 digits, etc.

For a quadratic (x+1-delta)*(x+1+delta) you get the polynomial x^2 + 2x + 1 - delta*delta. For delta equal to 10^-8, the perturbation delta*delta equals 10^-16, which just cannot be detected anymore with standard IEEE accuracy. For a cubic polynomial with three equal roots, the deviation in the coefficients is of order delta^3. For these, the detectable error bound is approx. 5*10^-6.
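
A quick way to see this effect (a small test of my own, not part of the library) is to solve that quadratic with the textbook formula and look at how far the computed roots end up from -1:

Code:
#include <math.h>
#include <stdio.h>

int main(void) {
    double delta = 1e-8;
    double c = 1.0 - delta * delta;   /* constant term, barely below 1 */
    double disc = 4.0 - 4.0 * c;      /* in exact arithmetic: 4*delta^2 */
    double r1 = (-2.0 - sqrt(disc)) / 2.0;
    double r2 = (-2.0 + sqrt(disc)) / 2.0;
    /* Each computed root sits about 1e-8 away from -1, illustrating
       the eps^(1/2) accuracy limit for a double root. */
    printf("disc = %g, roots: %.17g and %.17g\n", disc, r1, r2);
    return 0;
}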

Keras - 13-12-2023 at 03:39

Quote: Originally posted by woelen  
The methods I use also work well for multiple roots. Accuracy for multiple roots goes down though, but that is not a flaw of the software, but it is a basic mathematical property. […]


Yeah, I get that. Does your software return the root and its multiplicity?

And thanks for the multiprecision code! :)

[Edited on 13-12-2023 by Keras]

woelen - 13-12-2023 at 06:17

The software returns the roots. It is possible to detect multiple roots: one can compute guaranteed error bounds for each of the roots. These are small disks around the found roots. If there are overlapping disks, then these can be considered multiple roots. The multiple root then is the average of the cluster of roots with overlapping disks of uncertainty.

An example is a polynomial with a double root 1.2345, besides some other roots. The solver finds roots 1.23450001987... and 1.23449998012...
The average of these is 1.23449999999...
The error bound for the individual roots is 3e-8 or something close to that, while the error bound for the average is 1e-15 or so.
In this way, one can detect multiple roots. I did not implement this in the root finders I have, because it reduces the overall speed by 30% or so and also makes usage of the software more complicated. Just computing error bounds on single roots and leaving out the cluster analysis is simple, though. I did this in my test programs to validate the accuracy of my software.
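
Schematically, the cluster test looks like this (a C sketch with made-up helper names, not my actual test code):

Code:
#include <complex.h>
#include <stdbool.h>

typedef struct { double complex z; double radius; } root_t;

/* Two computed roots are treated as one cluster (a candidate multiple
   root) when their uncertainty disks overlap. */
static bool disks_overlap(root_t a, root_t b) {
    return cabs(a.z - b.z) <= a.radius + b.radius;
}

/* The reported multiple root is the mean of the cluster members;
   its error bound is much smaller than that of the individual roots. */
static double complex cluster_mean(const root_t *r, int n) {
    double complex sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += r[i].z;
    return sum / n;
}
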
Things can become much more complicated if there are no true multiple roots, but clusters of very close roots. A nasty situation is a real polynomial with the following roots:

z1 = 1.234
z2 = 4.567
z3 = 6.123
z4 = 6.123 + 0.001*i
z5 = 6.123 - 0.001*i
z6 = 6.124
z7 = -1.256 - 5.612*i
z8 = -1.256 + 5.612*i
This polynomial has 4 well-separated roots, which have relative error bounds on the order of magnitude of 1e-15 or so with standard IEEE floating-point arithmetic. The remaining 4 roots form a tight cluster, but are not multiple. The software may have difficulty separating these roots, and then a seemingly random mix of roots is produced, with values close to 6.123 and random deviations of magnitude 0.001 or so. The average of these 4 computed roots, however, is close to the average of the given 4 clustered roots.

Even polynomials with well-separated roots may pose serious problems. A notorious example is Wilkinson's polynomial (x-1)*(x-2)*(x-3)*...*(x-20). It has 20 well-separated roots, but with standard double precision (53-bit mantissa) one cannot compute all of these roots to much better than 10% accuracy. Just search the internet for info on this (search for "Wilkinson polynomial" in Google).
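
If you want to see part of the reason, expand the product: the leading coefficient is 1 while the constant term is 20! (about 2.4*10^18), and several of the coefficients exceed 2^53, so they cannot even be stored exactly in a double. A little sketch of my own, in C, that builds the expanded coefficients:

Code:
#include <stdio.h>

/* Expand (x-1)(x-2)...(x-20) into monomial coefficients c[0..20]
   by multiplying in one linear factor at a time. */
int main(void) {
    double c[21] = { 1.0 };            /* start with the polynomial "1" */
    int n = 0;                         /* current degree */
    for (int r = 1; r <= 20; r++) {    /* multiply by (x - r) */
        n++;
        c[n] = c[n - 1];
        for (int i = n - 1; i >= 1; i--)
            c[i] = c[i - 1] - r * c[i];
        c[0] = -r * c[0];
    }
    for (int i = 20; i >= 0; i--)
        printf("c[%d] = %.17g\n", i, c[i]);
    return 0;
}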

Keras - 13-12-2023 at 09:27

BTW, you don’t have access to 128-bit IEEE754 floats? Long double type?

About the Wilkinson polynomial. Yeah, I knew that example. But you can still use a QR method to find the roots as eigenvalues of a specially constructed matrix, right?

[Edited on 13-12-2023 by Keras]

woelen - 13-12-2023 at 12:10

In Java there is no 128-bit IEEE 754, at least not through standard code. There may be some software library out there, but it will be slow. In C, things are not much different. GCC has a 128-bit floating-point type, but it is prohibitively slow: OK for simple use cases where small amounts of work need to be done, but not useful for serious math.

Using matrix eigenvalue methods for polynomial root finding does not solve the issue of inherent ill-conditioning of the polynomial. The Wilkinson polynomials simply are ill-conditioned: when expanded, they have already lost some information in their coefficients, which leads to bad approximations of the zeros when factorized again, even with a perfect solver working at standard 64-bit floating-point precision. This loss of accuracy is a mathematical property of the polynomial itself, and no piece of software can magically restore the information.
Matrix eigenvalue methods can be interesting for polynomial root finding as a relatively stable means of solving, but they have some serious drawbacks. The polynomial only has N+1 coefficients, while the matrix methods work with N^2 elements. For somewhat larger polynomials (e.g. degree 50), this leads to much more computational effort for solving the equation. So, in modern software, the eigenvalue methods are no longer the preferred methods. Much faster methods, which also are stable, exist nowadays.

Two files are available for download. One is the code used for my polynomial solvers with 105-bit precision; the other is a very small Rnd package, which contains my own implementation of Java's Random class (much faster, and much better quality random numbers).


Attachment: Rnd.tgz (31kB)
This file has been downloaded 129 times

Attachment: polarith.tgz (4.2MB)
This file has been downloaded 136 times


Keras - 14-12-2023 at 01:46

I’m going to have a look! Thanks for that! And yeah, 128-bit IEEE is handled in software, so I guess this doesn’t help performance skyrocket. Concerning the eigenvalue method, I’m surprised you cannot use sparse matrix methods, since the matrices used to solve polynomials are inherently almost all zeros.

woelen - 14-12-2023 at 05:27

For solving polynomial equations, the so-called companion matrix is used. This indeed is a sparse matrix, but the methods for determining eigenvalues destroy this sparsity already after the first iterative step. There are optimized eigenvalue methods for tridiagonal matrices, but the companion matrix has a very different structure.
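
For those who have not seen it: the companion matrix of a monic polynomial x^n + c[n-1]*x^(n-1) + ... + c[0] has ones on the subdiagonal and the negated coefficients in the last column, and nothing else. A small C sketch (my own naming):

Code:
/* Fill the n-by-n (Frobenius) companion matrix A, row-major, for the
   monic polynomial x^n + c[n-1]*x^(n-1) + ... + c[1]*x + c[0].
   Its eigenvalues are exactly the roots of the polynomial. */
static void companion_matrix(int n, const double *c, double *A) {
    for (int i = 0; i < n * n; i++)
        A[i] = 0.0;
    for (int i = 1; i < n; i++)
        A[i * n + (i - 1)] = 1.0;       /* subdiagonal of ones */
    for (int i = 0; i < n; i++)
        A[i * n + (n - 1)] = -c[i];     /* last column: -c[0..n-1] */
}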

And conceptually, if there were a matrix method which preserves the sparsity structure of the companion matrix while it does its iterative computations, then what would be the difference with a method which just works on the polynomial coefficients? Such a matrix method, with O(n) storage instead of O(n^2) storage, would simply be another O(n) polynomial solving method. E.g. AE gives all roots simultaneously, just like most eigenvalue methods for matrices, and this removes the need for polynomial deflation, which has its own numerical stability issues.

@clearly_not_atara: The fact that the computations are very fast and the runtime may be dominated by "non-arithmetic" logic is an interesting observation. Indeed, modern JVMs use inlining, combined with so-called escape analysis, to strongly optimize the use of small objects which only have limited scope. These techniques can dramatically increase the performance of Java applications. Such things unfortunately are not easily achieved with statically compiled C programs.

teodor - 24-12-2023 at 07:36

I did some tests and I see that FMA instructions (for calculating a*b - c*d by Kahan's algorithm) are not used until you specify -march, like this: gcc -O3 -march=skylake ...
Instead, a software implementation of fma is used (a call to some library subroutine).
This can definitely slow execution down a lot.
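
For context, the Kahan trick I mean computes a*b - c*d with one extra FMA that captures the rounding error of one of the products. Roughly (a sketch; the fma() calls only pay off when they map to the hardware instruction):

Code:
#include <math.h>

/* a*b - c*d with much better accuracy than the naive expression. */
double diff_of_products(double a, double b, double c, double d) {
    double w = c * d;
    double e = fma(c, d, -w);   /* exact rounding error of c*d  */
    double f = fma(a, b, -w);   /* a*b - w, correctly rounded   */
    return f - e;               /* a*b - c*d, up to a couple of ulps */
}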

woelen - 27-12-2023 at 11:01

It does make some difference.

In my quartic solver code it hardly matters, but that code uses the FMA instruction only sparingly. The quintic solver benefits more from it.

Solving 100M quintics on a single core goes from 100 seconds to 85 seconds with the -march=skylake option added.
I already used -ffast-math besides the -O3 option. Adding that option also already made a difference of approx. 10% in the C code.

teodor - 28-12-2023 at 00:42

So, what is the time difference for Java vs C for different types of equations now?

woelen - 29-12-2023 at 02:12

For quartics, generating, solving 100M equations, and analysing the results:
- Java: 35 seconds
- C (with -O3 -ffast-math -march=skylake): 40 seconds

For quintics, same number of equations:
- Java: 89 seconds
- C (with -O3 -ffast-math -march=skylake): 85 seconds

As you can see, things have improved a lot for the C code. For quintics, it is even slightly faster.

However, I'll stick to Java as my main language. What I really like about the Java environment is that this good performance is obtained completely automatically. And the same compiled code works on all platforms I have. If a piece of hardware has FMA instructions, then these are used; otherwise a software implementation is used. I do not have to worry about the right compiler options and distributing the right version of the software for the right hardware. Most likely, the binaries compiled with -march=skylake will not run on an older CPU which does not have FMA instructions.

teodor - 29-12-2023 at 03:13

Indeed, I am impressed by the performance of the modern JIT. But this achievement is only possible because Java has been used intensively by large businesses for many years. This is not the case with any new hardware or custom libraries. Also, many scientific applications which require intensive calculation are written in C++, and using JVM libraries from C++ is not an option. But using a C library is possible from virtually any language, so I think the effort to make it fast is absolutely necessary.
I will look at the quartics, thank you for presenting the current status.
It would also be very interesting to compare the performance on the new Apple M2/M3 architecture.
As you know, I am interested in DSP applications. Do you have any example of how your solver could be used in such applications: audio signal filtering, processing, etc.?

Did you compare your solver with what MATLAB has? Maybe we can make an external library for MATLAB in C++ together, if you can decide on the license conditions.


[Edited on 29-12-2023 by teodor]

Rainwater - 29-12-2023 at 16:17

Are the data sets being processed identical for both platforms?

woelen - 31-12-2023 at 11:19

@Rainwater: The data sets are not identical, but comparable. Actually, even between two runs on the same platform, the data set is not the same. For each polynomial, a few pseudo random numbers are generated, uniformly distributed between 0 and 1, and from these, coefficients are derived. The random generator is seeded with the current system time in nanoseconds. Because of the large number of polynomials tested (100M), the properties of the data sets are, on average, so close to each other over different runs, and also between the C and Java versions, that a meaningful comparison of performance is possible.

@teodor: I compared my polynomial solvers with Matlab's roots function. My generic polynomial solver performs better than roots, especially for somewhat larger degrees. This is because roots uses a matrix eigenvalue method for finding the roots. It is fine with me if the C code is used for making a Matlab library. At the moment, I only have the Java code on my website, but I'll put the C code on it next week.

Random - 22-10-2024 at 04:16

I will now write this post in what we would call Coherent English

I have been dealing with programming since I was very young. My first serious programming started with C++.

Assembly Language is only one. That means, that if an application is performing faster than another one which does the same deed, then this slower application has data which is not necessary for performance of that deed.

It is not known what is incorporated into code. This again touches 1960s. What I would date, I mean, I would definitely date this to 20th century.

.

I used dot as a break in the text.

Original people who dealt with programming languages is what I classify as Psychiatry 1967.

.

Assembly Language is only one. That is why I posted this last post about smallest application that does the deed.

.

The question is, if assembly language itself is optimized enough and I will say, that there is probability, if I could call that, that assembly language itself could be more, how we could call that, efficient.

Last statement is touching Hardware - Software relation.

.

I have those documents from what I will date to 07 2020, concerning this pathway from source code into binary. Which is not from source code directly into binary, but also incorporates this, is it source code - assembly - machine code - binary pathway.

.

Fastest application would be most optimized one.

.

Depends, also what if the specific compiler is adding the unnecessary data?

I have been dealing with something what we would call, Digital Forensics.

.

You never know what is in the code unless you review it.

This again puts me into period 2007 - 2010 when I was in contact with various WarezGroups.

.

It took me, I would say, that I grew up in terms of thinking that I started dealing with this, what they call, low level language of assembly.

Again, assembly is assembly.

If every source code is eventually put into assembly then we should review what is making it slower.

Edit:

I was thinking inside 07 2024 how 64kB is not that small, in terms of that would be, ... Yes, I remember that Google provided incorrect answer how much is 64kB in bit amount, because I was thinking about how much would that FORMAT be if we multiplied a number with another number to get 64kB.

.

Was it 64kB I was researching.

[Edited on 22-10-2024 by Random]