I recently tried my hand at writing a ray tracer (something I haven't done for 15 years!) and, in particular, I to write the shortest possible ray tracer which was still comprehensible and interesting. The initial result was a 222-line OCaml program which incrementally renders a sphere-flake via OpenGL:
The cut-down versions don't do reflections, color or such a pretty sphere-flake and output a greyscale PGM but they do still use hierarchical spherical bounding volumes to accelerate rendering enough that they can render scenes containing millions of spheres.
Great work. What is default floating point type used by OCaml compiler? I got running time of about 36s for scene level 3 on Pentium3 machine and running time of about 25s when all references to "double" type in C++ code changed to "float" (and float is usually "good enough" for ray tracing)...
Jon Harrop wrote: > I then cut this program down to a 66-line OCaml program and ported it into a > 97-line C++ program. These programs are compared on this page:
tbp...@gmail.com writes: > Jon Harrop wrote: >> I then cut this program down to a 66-line OCaml program and ported it > into a >> 97-line C++ program. These programs are compared on this page:
> [demonstration that "inline" vector math makes C++ programs faster > elided]
> If your point was to show how to write unbelievably inneficient code in > C++, you've succeeded.
I think the point was to have some fun and to make a little demo that shows how a functional language like OCaml can give rise to nice short expressive programs while still delivering competitive performance. And for those goals, the OP succeeded!
Note also that he mentions on his web page that he's doing this for "the computer language shootout benchmarks", http://shootout.alioth.debian.org/. That page clearly points out all the caveats around benchmarks like these (to the point that they can be effectively meaningless in the real world.)
Specifically:
>> about The Language Shootout Benchmarks
>>Our goals are to learn about programming languages, compare their >>performance in various (possibly meaningless) ways and, most importantly, >>have some fun!
I've at least had some fun looking at his implementation!
-matt -- Matt Pharr m...@pharr.org <URL:http://graphics.stanford.edu/~mmp> ======================================================================= In a cruel and evil world, being cynical can allow you to get some entertainment out of it. --Daniel Waters
Matt Pharr wrote: > > [demonstration that "inline" vector math makes C++ programs faster > > elided]
You've missed the point. a) he's telling the OCaml compiler to inline everything it can while not giving its c++ counterpart opportunities to do so; fairness? b) more importantly he's passing large structures by _value_ and i've turned them into _references_.
In fact if you do the same for other hotpath functions as in: 19c21 < virtual void intersect(double &, Vec &, Ray) = 0; ---
> void intersect(double &lambda, Vec &normal, const Ray &ray) {
then the c++ version becomes 50% faster than originally (and in fact, unconditionally faster than the OCaml version), without adding a single line.
You'll notice i haven't even fixed the gratitious use of virtual functions, and other details that force standard compliant c++ compilers to pessimize a lot when facing such... hmm.. source.
I'm sure OCaml is a wonderful & expressive language, but crafting such a fishy c++ equivalent to make it shine (because you have an agenda) is a despicable practice.
> > If your point was to show how to write unbelievably inneficient code in > > C++, you've succeeded.
> I think the point was to have some fun and to make a little demo that shows > how a functional language like OCaml can give rise to nice short expressive > programs while still delivering competitive performance. And for those > goals, the OP succeeded!
> Note also that he mentions on his web page that he's doing this for "the > computer language shootout benchmarks",
> That page clearly points out all the caveats around benchmarks like these > (to the point that they can be effectively meaningless in the real
world.) I have nothing against such benchmarks when they are not clearly rigged.
But i guess the main point of the whole operation wasn't fair benchmarking, education, or fun but selling books: "This program has demonstrated some of the ways that OCaml improves upon C++. For a thorough introduction to OCaml, read our book "Objective CAML for Scientists".
tbp wrote: > Matt Pharr wrote: >> > [demonstration that "inline" vector math makes C++ programs faster >> > elided] > You've missed the point.
My post is more about clarity and less about performance. Your proposed alterations make the C++ implementation significantly longer and more obfuscated and your littering of the source code with "inline" actually slows the program down on my Athlon t-bird.
Moreover, your choices of initial optimisation are decidedly suboptimal. I'd have gone for:
1. Terminating the intersection of shadow rays when the first intersection is found (instead of finding the closest intersection).
2. Use an implicit scene, to avoid storing it explicitly.
3. Use single-precision storage (or no storage at all).
Both of these optimisations will give much bigger performance improvements when averaged over different platforms and architectures.
As it happens, I had already implemented all of your optimisations and all of these optimisations to both implementations before I made my original post. None of them add anything new. Sometimes C++ is faster, sometimes OCaml is faster. OCaml is always much more succinct. You can keep doing this until you are blue in the face but I don't think you'll learn much of interest.
> a) he's telling the OCaml compiler to inline everything it can while > not giving its c++ counterpart opportunities to do so; fairness?
Both ocamlopt and g++ were allowed to inline code.
As OCaml is for symbolic use, it performs no inlining by default (unlike g++) so it is common to ask OCaml to do some inlining on numerical code. Thus, specifying "-inline 100" actually makes for a fairer comparison.
> b) more importantly he's passing large structures by _value_ and i've > turned them into _references_.
Both implementations use pass by value as this is clearer, shorter and (as a consequence) more common in real code.
> In fact if you do the same for other hotpath functions as in: > ... > then the c++ version becomes 50% faster than originally
If you want a fair comparison then you should also make equivalent changes to the OCaml implementation.
> (and in fact, unconditionally faster than the OCaml version),
You are trying to compare optimised C++ against unoptimised OCaml, which would be unfair. More worryingly, you seem to have skipped the part of the experiment where you actually measure something.
> without adding a single line.
When restricted to 80 columns, your optimisations add several lines. Indeed, they push the C++ program over 100 LOC limit imposed by the creators of the shootout.
In fact, I'd already implemented your optimisations (and many more effective optimisations) and had to throw them out because of this limit. Note that the OCaml has 30 more lines with which to optimise. My optimised <100 LOC OCaml program is much faster than my optimised <100 LOC C++ on all platforms.
> You'll notice i haven't even fixed the gratitious use of virtual > functions,
What would you recommend instead?
I chose an inheritance hierarchy because this is the closest C++ equivalent to a variant type which I see in typical C++ programs.
Another optimisation that I made was to replace inheritance with a single Scene struct which represented a group of child scenes or a sphere when it had no children. I threw this out as well because it is not a fair equivalent to OCaml's variant type. For example, you could not specify a colour for each sphere without also specifying colours for all bounding volumes.
> and other details
Can you elaborate on these other "details"?
> that force standard compliant c++ > compilers to pessimize a lot when facing such... hmm.. source.
Manually implementing pass constant by reference is both obfuscating and error-prone whilst being theoretically unnecessary. The fact that the OCaml compiler does this optimisation for you when the GNU C++ compiler does not might interest some people.
> I'm sure OCaml is a wonderful & expressive language, but crafting such > a fishy c++ equivalent to make it shine (because you have an agenda) is > a despicable practice.
Everyone will be free to contribute to the shootout version of the ray tracer. Perhaps you would like to contribute a less "fishy" C++ implementation?
asamard...@matf.bg.ac.yu wrote: > Great work. What is default floating point type used by OCaml > compiler?
Double precision.
> I got running time of about 36s for scene level 3 on > Pentium3 machine and running time of about 25s when all references to > "double" type in C++ code changed to "float" (and float is usually > "good enough" for ray tracing)...
I agree that single precision is usually good enough for ray tracing. However, when I tried the same optimisation I found that it made the program much faster on 32-bit x86 and much slower on 64-bit AMD64. I assume this is a float alignment issue.
Also, it generates (numerically) different results, which makes it trickier to validate against the OCaml.
Jon Harrop <use...@jdh30.plus.com> wrote: > My post is more about clarity and less about performance. Your proposed > alterations make the C++ implementation significantly longer and more > obfuscated and your littering of the source code with "inline" actually > slows the program down on my Athlon t-bird.
Then your compiler is stupid. Get a better one.
The keyword 'inline' is not a command for the compiler that says it must inline the function or else it will be deleted immediately. And the other way around, the lack of 'inline' keyword by no means stops the compiler from inlining the function if it can and it estimates it would be useful.
In *theory* 'inline' is a hint to the compiler. In practice it has nothing to do with inlining. Most of the modern compilers will simply ignore that keyword when they are estimating whether a function should be inlined or not. In this context 'inline' is a no-op. It's like the 'register' keyword which is nowadays obsolete and a no-op.
What 'inline' *does* have some effect on is the linker. If a function with an identical declaration is implemented in more than one place then in normal circumstances the linker will give an error (because it can't know which one of them it should use). However, if the functions were declared 'inline', then the compiler will tell the linker that all of those functions are actually one and the same, and then the linker will just use one of them.
And by the way, if you are afraid of "litter" in C++ code, then you should switch to another programming language. That "litter", as you call it, is not useless junk as you seem to imply. It has a reason and a significance.
> > b) more importantly he's passing large structures by _value_ and i've > > turned them into _references_. > Both implementations use pass by value as this is clearer, shorter and (as a > consequence) more common in real code.
I hope you are not passing megabyte-sized vectors by value...
If your opinion of "clearer" and "shorter" is that the & symbol should not be used, then you might want to either check your priorities or change your programming language (because C++ is not for you).
> If you want a fair comparison then you should also make equivalent changes > to the OCaml implementation.
You want to compare which programming language runs inefficient code faster?
> > You'll notice i haven't even fixed the gratitious use of virtual > > functions, > What would you recommend instead? > I chose an inheritance hierarchy because this is the closest C++ equivalent > to a variant type which I see in typical C++ programs.
Inheritance has no overhead in C++. Dynamic binding has. While virtual function calls are pretty fast they are still slightly slower than regular function calls. If you want maximum speed you should not make time-critical functions virtual unless you are really forced to.
> Another optimisation that I made was to replace inheritance with a single > Scene struct which represented a group of child scenes or a sphere when it > had no children.
Inheritance causes no overhead. That kind of optimization is useless.
Jon Harrop wrote: > tbp wrote: > My post is more about clarity and less about performance. Your
proposed Oh? Silly me, i thought it was a benchmark.
> alterations make the C++ implementation significantly longer and more > obfuscated and your littering of the source code with "inline" actually > slows the program down on my Athlon t-bird.
I've put explicit inlines because, ie for Vec operators, you've put them outside of the class scope, hence screwing the common heuristic saying that a member function declared & defined within the class body should be inlined (and that's a much stronger hint than using an 'inline' directive). Anyway, that's decoration compared to other issues in your source.
> Moreover, your choices of initial optimisation are decidedly suboptimal. I'd > have gone for:
Hmm? I tinker with a handful of lines, cut the runtime in half and that's suboptimal? Interesting.
> 1. Terminating the intersection of shadow rays when the first intersection > is found (instead of finding the closest intersection).
> 2. Use an implicit scene, to avoid storing it explicitly.
> 3. Use single-precision storage (or no storage at all).
> Both of these optimisations will give much bigger performance improvements > when averaged over different platforms and architectures.
Your aproach & space partition (as presented) will never give runtime performance worth the time it takes to implement efficiently. Such a sphere flake should render in a blink of the eye on modern platforms, see the SPD for reference. Or this: http://www.uni-koblenz.de/~cg/publikationen/cp_raytrace.pdf I've merely fixed the most glaring flaws in your implementation. Period.
> As it happens, I had already implemented all of your optimisations and all > of these optimisations to both implementations before I made my original > post. None of them add anything new. Sometimes C++ is faster, sometimes > OCaml is faster. OCaml is always much more succinct. You can keep doing > this until you are blue in the face but I don't think you'll learn much of > interest.
I've never contested that OCaml is more succint, expressive or elegant. I'm just saying that, given the horrible c++ implementation you've presented, you have absolutely no authority or qualification to say anything about the relative merit of c++ & OCaml performance wise.
> Both ocamlopt and g++ were allowed to inline code.
It's all about exposing oportunities to do so.
> As OCaml is for symbolic use, it performs no inlining by default (unlike > g++) so it is common to ask OCaml to do some inlining on numerical code. > Thus, specifying "-inline 100" actually makes for a fairer
comparison. No. You've used none of the c++ idiom that actually helps a c++ compiler to guess what to inline or not. Know your compiler (or in that case, your language).
> Both implementations use pass by value as this is clearer, shorter and (as a > consequence) more common in real code.
Absolute non sense. I can't beleive you really think that, but if that's the case, then you should get back to your homework.
> If you want a fair comparison then you should also make equivalent changes > to the OCaml implementation.
You're the one trying to sell books about OCaml...
> You are trying to compare optimised C++ against unoptimised OCaml, which > would be unfair. More worryingly, you seem to have skipped the part of the > experiment where you actually measure something.
I've provided patches. See for yourself.
> When restricted to 80 columns, your optimisations add several lines. Indeed, > they push the C++ program over 100 LOC limit imposed by the creators of the > shootout.
You want to fit in those artificial constraints? Shorten names or define some macros. What was the fuss about already?
> In fact, I'd already implemented your optimisations (and many more effective > optimisations) and had to throw them out because of this limit. Note that > the OCaml has 30 more lines with which to optimise. My optimised <100 LOC > OCaml program is much faster than my optimised <100 LOC C++ on all > platforms.
Sure. I mean you've shown such a level of C++ wizardry that we're supposed to take your word for it?
> > You'll notice i haven't even fixed the gratitious use of virtual > > functions,
> What would you recommend instead?
Not using virtual functions in the hotpath. Primo, that's uncalled for as your "hierarchy" doesn't mix types at a given level, each object's type is known upfront. Secundo, virtual functions are expensive because they forbid inlining and are implemented as indirect branches , like: 401c83: callq *0x8(%rax) It doesn't make sense to use an expensive feature in the hotpath if you can avoid it. And you definitely can.
> I chose an inheritance hierarchy because this is the closest C++ equivalent > to a variant type which I see in typical C++ programs.
> Another optimisation that I made was to replace inheritance with a single > Scene struct which represented a group of child scenes or a sphere when it > had no children. I threw this out as well because it is not a fair > equivalent to OCaml's variant type. For example, you could not specify a > colour for each sphere without also specifying colours for all bounding > volumes.
That's totally orthogonal. Please do yourself a favor and read a good book about C++.
> > and other details > Can you elaborate on these other "details"?
Strictly speaking of your implementation details (and not algorithm or anything), you've also failed the const correctness test for example. But i suppose you're going to say say that it's cruft or obfuscation.
> > that force standard compliant c++ > > compilers to pessimize a lot when facing such... hmm.. source.
> Manually implementing pass constant by reference is both obfuscating and > error-prone whilst being theoretically unnecessary. The fact that the OCaml > compiler does this optimisation for you when the GNU C++ compiler does not > might interest some people.
I think you've just discovered that OCaml & C++ are two different language that require the programmer to express things in a totally different way. And before you propose ways to improve C++ you should learn the language first.
> Everyone will be free to contribute to the shootout version of the ray > tracer. Perhaps you would like to contribute a less "fishy" C++ > implementation?
I don't have books to sell. If/When i write a raytracer, i choose a path that at least has some chances to perform decently. See Wald & Havran.
Warp wrote: > Jon Harrop <use...@jdh30.plus.com> wrote: >> My post is more about clarity and less about performance. Your proposed >> alterations make the C++ implementation significantly longer and more >> obfuscated and your littering of the source code with "inline" actually >> slows the program down on my Athlon t-bird.
> Then your compiler is stupid. Get a better one. > ...
I agree that the compiler is stupid but I'm comparing g++ as this is a common compiler. A better C++ compiler is likely to produce faster code, of course.
> And by the way, if you are afraid of "litter" in C++ code, then you > should switch to another programming language.
Absolutely. :-)
> That "litter", as you call it, is not useless junk as you seem to imply. > It has a reason and a significance.
It also has unexpected platform dependent results, as "tbp" has discovered.
>> > b) more importantly he's passing large structures by _value_ and i've >> > turned them into _references_.
>> Both implementations use pass by value as this is clearer, shorter and >> (as a consequence) more common in real code.
> I hope you are not passing megabyte-sized vectors by value...
No, they are small, constant-sized vectors (3D), so pass by value does not affect the asymptotic algorithmic complexity.
> If your opinion of "clearer" and "shorter" is that the & symbol should > not be used, then you might want to either check your priorities or > change your programming language (because C++ is not for you).
Unfortunately, you cannot just use "&" because it will take a reference to a temporary, giving an error like:
ray.cpp: In member function `virtual void Sphere::intersect(double&, Vec&, Ray) ': ray.cpp:39: error: no match for 'operator+' in 'ray.Ray::orig + operator*(double, Vec)(((&ray) + 24))' ray.cpp:12: error: candidates are: Vec operator+(Vec&, Vec&) /usr/include/c++/3.3/bits/stl_bvector.h:261: error: std::_Bit_const_iterator std::operator+(int, const std::_Bit_const_iterator&) /usr/include/c++/3.3/bits/stl_bvector.h:202: error: std::_Bit_iterator std::operator+(int, const std::_Bit_iterator&) ray.cpp: In function `double ray_trace(double, Vec, Ray, Scene*)': ray.cpp:62: error: no match for 'operator+' in 'ray.Ray::orig + operator*(double, Vec)(((&ray) + 24))' ray.cpp:12: error: candidates are: Vec operator+(Vec&, Vec&) /usr/include/c++/3.3/bits/stl_bvector.h:261: error: std::_Bit_const_iterator std::operator+(int, const std::_Bit_const_iterator&) /usr/include/c++/3.3/bits/stl_bvector.h:202: error: std::_Bit_iterator std::operator+(int, const std::_Bit_iterator&)
You must either use "const ... &" instead, or make sure that all values are created explicitly as local variables in the caller (rather than being inline subexpressions in the function call).
It is precisely this obfuscation (and the fact that these are tiny structures, and that this is not necessarily an optimisation, and that it could be done by the compiler) that made me choose pass by value.
>> If you want a fair comparison then you should also make equivalent >> changes to the OCaml implementation.
> You want to compare which programming language runs inefficient code > faster?
If you completely optimise both implementations then you'll end up with the same assembler. There is only value in comparing simpler implementations in both languages. So yes, they will not be completely optimised but subjectively "like" real code.
>> I chose an inheritance hierarchy because this is the closest C++ >> equivalent to a variant type which I see in typical C++ programs.
> Inheritance has no overhead in C++. Dynamic binding has. > While virtual function calls are pretty fast they are still slightly > slower than regular function calls. If you want maximum speed you should > not make time-critical functions virtual unless you are really forced to.
I believe I am "really forced to" as virtual lookup is used instead of switching on the tag of the variant type/union. In C++, virtual functions are more elegant (and more common in real code) than C-style tagged unions.
>> Another optimisation that I made was to replace inheritance with a single >> Scene struct which represented a group of child scenes or a sphere when >> it had no children.
> Inheritance causes no overhead. That kind of optimization is useless.
Getting rid of inheritance gets rid of the virtual functions. However, it necessarily incurs an equivalent test somewhere else.
tbp wrote: >> When restricted to 80 columns, your optimisations add several lines. > Indeed, >> they push the C++ program over 100 LOC limit imposed by the creators > of the >> shootout. > You want to fit in those artificial constraints?
Yes, it actually says that at the top of the web page.
> Shorten names or define some macros. What was the fuss about already?
If you recall, the fuss was about clarity. So shortening names and defining macros doesn't really seem appropriate.
>> In fact, I'd already implemented your optimisations (and many more > effective >> optimisations) and had to throw them out because of this limit. Note > that >> the OCaml has 30 more lines with which to optimise. My optimised <100 > LOC >> OCaml program is much faster than my optimised <100 LOC C++ on all >> platforms. > Sure. I mean you've shown such a level of C++ wizardry that we're > supposed to take your word for it?
I'd be very happy to include a better C++ version, if you have one.
>> > You'll notice i haven't even fixed the gratitious use of virtual >> > functions,
>> What would you recommend instead? > Not using virtual functions in the hotpath. > Primo, that's uncalled for as your "hierarchy" doesn't mix types at a > given level, each object's type is known upfront.
So you're talking about specialising the scene tree so that it can only represent this scene, rather than scenes in general?
> Secundo, virtual functions are expensive because they forbid inlining > and are implemented as indirect branches , like: > 401c83: callq *0x8(%rax) > It doesn't make sense to use an expensive feature in the hotpath if you > can avoid it. > And you definitely can.
Actually, this is another optimisation that I had already done. It makes no significant difference to the performance because this isn't actually on the "hotpath", as you call it. Most of the time is spent elsewhere.
>> > and other details >> Can you elaborate on these other "details"? > Strictly speaking of your implementation details (and not algorithm or > anything), you've also failed the const correctness test for example. > But i suppose you're going to say say that it's cruft or obfuscation.
Yes, that was also done deliberately to squeeze the C++ into 100 LOC.
>> Everyone will be free to contribute to the shootout version of the > ray >> tracer. Perhaps you would like to contribute a less "fishy" C++ >> implementation? > I don't have books to sell. > If/When i write a raytracer, i choose a path that at least has some > chances to perform decently. > See Wald & Havran.
Although I'm sure that a production ray tracer will be a lot faster than my tutorial ray tracer, the paper you cite describes a high-level program which is not much faster than mine. The faster alternative is actually written in assembler (using SIMD) and also fails to reach "interactive frame rates" on such scenes.
Jon Harrop <use...@jdh30.plus.com> wrote: > I agree that the compiler is stupid but I'm comparing g++ as this is a > common compiler. A better C++ compiler is likely to produce faster code, of > course.
Are you using the optimization flags for your platform?
> Unfortunately, you cannot just use "&" because it will take a reference to a > temporary, giving an error like:
const &
You very seldom need non-const references.
> > Inheritance has no overhead in C++. Dynamic binding has. > > While virtual function calls are pretty fast they are still slightly > > slower than regular function calls. If you want maximum speed you should > > not make time-critical functions virtual unless you are really forced to. > I believe I am "really forced to" as virtual lookup is used instead of > switching on the tag of the variant type/union. In C++, virtual functions > are more elegant (and more common in real code) than C-style tagged unions.
I was not suggesting a "tagged union" as an alternative. That would actually be slower than a dynamically bound class. I was suggesting as an alternative a different design where you do not need dynamic binding.
> > Inheritance causes no overhead. That kind of optimization is useless. > Getting rid of inheritance gets rid of the virtual functions.
Well, yes, but you are using the wrong means. You don't need to drop inheritance simply to stop using dynamic binding (for the latter you simply need to drop the 'virtual' keywords; inheritance has nothing to do with that).
Jon Harrop wrote: > tbp wrote: > > Such a sphere flake should render in a blink of the eye on modern > > platforms, see the SPD for reference. > > Or this:
> Although I'm sure that a production ray tracer will be a lot faster than my > tutorial ray tracer, the paper you cite describes a high-level program > which is not much faster than mine. The faster alternative is
actually *cough*
> written in assembler (using SIMD) and also fails to reach "interactive > frame rates" on such scenes.
Not assembler, intrinsics. And the whole thing is not about SIMD but ray coherence & BVH which are notions that have nothing to do with a specific platform. It just happens to fit into the SIMD model quite well.
Jon Harrop wrote: > > You want to fit in those artificial constraints? > Yes, it actually says that at the top of the web page. > > Shorten names or define some macros. What was the fuss about already? > If you recall, the fuss was about clarity. So shortening names and defining > macros doesn't really seem appropriate.
If i sum it up, nothing's appropriate but what you did which incidentally happens to be the worst way to implement it in C++.
> I'd be very happy to include a better C++ version, if you have one.
Better in what metric? I'm cautious because it seems we live in two distinct reality.
> So you're talking about specialising the scene tree so that it can only > represent this scene, rather than scenes in general?
Scenes in general? Please correct me if i'm wrong but it seems that your program is about a sphere BVH enclosing... spheres. Is that your idea of generality? So, please focus on the topic at hand and you'll notice that using dynamic binding is superfluous.
> > Secundo, virtual functions are expensive because they forbid inlining > > and are implemented as indirect branches , like: > > 401c83: callq *0x8(%rax) > > It doesn't make sense to use an expensive feature in the hotpath if you > > can avoid it. > > And you definitely can.
> Actually, this is another optimisation that I had already done. It makes no > significant difference to the performance because this isn't actually on > the "hotpath", as you call it. Most of the time is spent elsewhere.
I see: struct Scene { // Abstract base class representing a scene virtual void intersect(double &, Vec &, Ray) = 0;
So you're saying intersections aren't part of the hotpath? Well you maybe right, after all it's only called a few million times.
> >> Can you elaborate on these other "details"? > > Strictly speaking of your implementation details (and not algorithm or > > anything), you've also failed the const correctness test for example. > > But i suppose you're going to say say that it's cruft or obfuscation.
> Yes, that was also done deliberately to squeeze the C++ into 100 LOC.
Bingo!
And i suppose that the fact that this emasculated C++ version performed slower than its OCaml counterpart was merely a side effect?
What's next? You're going to try to sell me a bridge? Oops. A book first, then a bridge.
Warp wrote: > Are you using the optimization flags for your platform?
Yes, I've played with various optimisation settings with both compilers on both platforms. I didn't see anything particularly interesting when tweaking them. The performance varies by about 20% but the ratio of the performances stays the same (between OCaml and C++).
> You very seldom need non-const references.
In point of fact, the program uses non-const references to return multiple results from functions. IMHO, this is often done in C++ (and pass by non-const pointer in C) because it is too tedious to define a new type for the return values of each function in C/C++.
As there are currently only two values being returned by the intersect routine, I could have used STL pair<> instead. I might investigate that (I think it will be simpler but slower). I have seen code which uses pair<pair<...>, ...> to return three values but I don't much like it.
>> I believe I am "really forced to" as virtual lookup is used instead of >> switching on the tag of the variant type/union. In C++, virtual functions >> are more elegant (and more common in real code) than C-style tagged >> unions.
> I was not suggesting a "tagged union" as an alternative. That would > actually be slower than a dynamically bound class. > I was suggesting as an alternative a different design where you do not > need dynamic binding.
I have actually already done this "optimisation" because I also thought it would speed things up. It turns out that the virtual lookup is taking an insignificant amount of time. I ditched the resulting code because it is less general and less clear whilst being no faster. It is significantly shorter though.
tbp wrote: > Jon Harrop wrote: >> If you recall, the fuss was about clarity. So shortening names and >> defining >> macros doesn't really seem appropriate. > If i sum it up, nothing's appropriate but what you did which > incidentally happens to be the worst way to implement it in C++.
Perhaps there is a better way but I haven't seen it.
>> I'd be very happy to include a better C++ version, if you have one. > Better in what metric? I'm cautious because it seems we live in two > distinct reality.
Better => Faster whilst remaining clear and satisfying the requirements of the shootout.
>> So you're talking about specialising the scene tree so that it can > only >> represent this scene, rather than scenes in general? > Scenes in general? > Please correct me if i'm wrong but it seems that your program is about > a sphere BVH enclosing... spheres. > Is that your idea of generality?
In both cases, the types of objects and bounding volumes are extensible.
To add a new kind of object to the C++, you create a new class derived from Scene which implements its own "intersect" member function.
To add a new kind of object to the OCaml, you supplement the variant type "scene" with another constructor and supplement the pattern match in the "intersect" function with a case to handle your new object.
Similarly for new types of bounding volume.
Perhaps it would be interesting to add a new kind of object to both implementations, to see how this is done in each language?
>> Actually, this is another optimisation that I had already done. It > makes no >> significant difference to the performance because this isn't actually > on >> the "hotpath", as you call it. Most of the time is spent elsewhere. > I see: > struct Scene { // Abstract base class representing a scene > virtual void intersect(double &, Vec &, Ray) = 0;
> So you're saying intersections aren't part of the hotpath? > Well you maybe right, after all it's only called a few million times.
Yes, exactly. It isn't really that surprising when you consider that "ray_sphere" is called ~10x more than "intersect" and it contains several branches, lots of floating point maths and function calls.
>> Yes, that was also done deliberately to squeeze the C++ into 100 LOC. > Bingo!
> And i suppose that the fact that this emasculated C++ version performed > slower than its OCaml counterpart was merely a side effect?
The C++ outperforms the OCaml on AMD64 (and IA64). I'd say that the performance is comparable in all cases.
tbp wrote: > Not assembler, intrinsics. > And the whole thing is not about SIMD but ray coherence & BVH which are > notions that have nothing to do with a specific platform. > It just happens to fit into the SIMD model quite well.
Ok. I don't know anything about intrinsics but they don't seem to be relevant to my comparison of a tutorial ray tracer in C++ and OCaml.
As I've said, if you want to optimise my ray tracer (or any other program) then I would recommend algorithmic optimisations over low-level ones like SIMD, cache coherency etc. There are plenty more algorithmic optimisations to be done (although I've already done the main one - hierarchical bounding volumes).
In this case, it looks as though you'd want to write specialised C code for certain platforms. This could be called from C++ or OCaml and it would speed both programs up by roughly the same amount. This might be a good idea though, as OCaml is clearly preferable for the higher-level, not- performance-critical bulk of the code. So you could get rid of C++ altogether. :-)
Jon Harrop <use...@jdh30.plus.com> wrote: > > You very seldom need non-const references. > In point of fact, the program uses non-const references to return multiple > results from functions. IMHO, this is often done in C++ (and pass by > non-const pointer in C) because it is too tedious to define a new type for > the return values of each function in C/C++.
Often? It is usually rare that you have to return several values from a function so that usually all of them are needed. The problem of making one function which returns many values versus making several functions, each one returning one value, is that when you only need that one value returning many is useless overhead and forces you to make clumsy code (you have to create dummy variables to hold those return variables you don't need and then immediately discard them).
When there's a logical reason to return many values at once, returning a class or struct as the return value of the function may in fact be a good idea. If you call the function like this:
ReturnValues values = theFunction(whatever);
where ReturnValues is a class or struct, C++ compilers (or at least gcc; probably also most of the other compilers) will actually make that very efficiently (they will perform the so-called return value optimization).
The return value optimization means that, in a code like the above, the function being called actually builds its return value directly into 'values' and will not create any temporaries nor will it call any copy constructors. Perhaps a bit surprisingly this works even if the function in question is not inline and its implementation is in a completely separate compilation unit (ie. object file) which is just linked later to the binary (and in fact it works even if the function in question is in a dynamically loaded library).
It might even be that the return value optimization technique produces slightly faster code than the function returning its values by writing to references given to it as parameters.
> As there are currently only two values being returned by the intersect > routine, I could have used STL pair<> instead. I might investigate that (I > think it will be simpler but slower). I have seen code which uses > pair<pair<...>, ...> to return three values but I don't much like it.
How about just creating your own struct or class?
> I have actually already done this "optimisation" because I also thought it > would speed things up. It turns out that the virtual lookup is taking an > insignificant amount of time.
Yes, virtual function calls are very fast in C++. However, what is fastest depends on what you are comparing it to. If you are comparing it to an equivalent functionality implemented with a big switch-case block, then you should definitely use the virtual functions instead.
However, if you can design the program so that the function to be called can be resolved at compilation time (instead of at runtime as happens with virtual functions or switch-case blocks) then you might get a slight speedup, specially if the function is short and called millions of times.
Warp wrote: > Jon Harrop <use...@jdh30.plus.com> wrote: >> In point of fact, the program uses non-const references to return >> multiple results from functions. IMHO, this is often done in C++ (and >> pass by non-const pointer in C) because it is too tedious to define a new >> type for the return values of each function in C/C++.
> Often? It is usually rare that you have to return several values > from a function so that usually all of them are needed.
It may seem rare when you use a programming language that makes it difficult, but lots of useful functions return multiple values.
> The problem of making one function which returns many values versus > making several functions, each one returning one value, is that when > you only need that one value returning many is useless overhead and > forces you to make clumsy code (you have to create dummy variables to > hold those return variables you don't need and then immediately discard > them).
In the case of the "intersect" function, you would not split it into two functions to compute the parameter and normal vector of the intersection, for example.
> When there's a logical reason to return many values at once, returning > a class or struct as the return value of the function may in fact be a > good idea. If you call the function like this:
> ReturnValues values = theFunction(whatever);
> where ReturnValues is a class or struct, C++ compilers (or at least gcc; > probably also most of the other compilers) will actually make that very > efficiently (they will perform the so-called return value optimization).
Creating structs for the return values of every function that returns multiple values may be a good idea, but the verbosity and clumsiness of doing this in C++ makes it quite undesirable.
> The return value optimization means that...
If it is performed, yes.
>> As there are currently only two values being returned by the intersect >> routine, I could have used STL pair<> instead. I might investigate that >> (I think it will be simpler but slower). I have seen code which uses >> pair<pair<...>, ...> to return three values but I don't much like it.
> How about just creating your own struct or class?
That would generate quite a bit more code and would probably break the 100 LOC limit.
> However, if you can design the program so that the function to be > called can be resolved at compilation time (instead of at runtime as > happens with virtual functions or switch-case blocks) then you might > get a slight speedup, specially if the function is short and called > millions of times.
I do not believe that can be done in this case (without breaking generality and the equivalence with the OCaml implementation).
Jon Harrop <use...@jdh30.plus.com> wrote: > Creating structs for the return values of every function that returns > multiple values may be a good idea, but the verbosity and clumsiness of > doing this in C++ makes it quite undesirable.
Jon Harrop wrote: > tbp wrote: > > If i sum it up, nothing's appropriate but what you did which > > incidentally happens to be the worst way to implement it in C++.
> Perhaps there is a better way but I haven't seen it.
> >> I'd be very happy to include a better C++ version, if you have one. > > Better in what metric? I'm cautious because it seems we live in two > > distinct reality.
> Better => Faster whilst remaining clear and satisfying the requirements of > the shootout.
I'll leave non objective criterions aside (how do you measure clarity?) and, as you insist on discarding my previous patches & measures, i've taken the time this morning to make this up: http://ompf.org/ray/miniray.cpp # grep -e virtual -e inline miniray.cpp # wc -l miniray.cpp 97 miniray.cpp # g++ -O2 -ffast-math miniray.cpp -o miniray # time ./miniray >pix.ppm
real 0m5.862s user 0m5.800s sys 0m0.006s # cmp pix.orig.ppm pix.ppm; echo $? 0 # composite -compose difference pix.orig.ppm pix.ppm diff.ppm # display diff.ppm [contemplate a black picture]
So, on my box, the original version took ~13s then ~10s after some patching, ~8s after some more patching and now we're at ~6s with a const correct non virtual rule compliant version.
Before you start nitpicking please consider that: a) this version complies with the shootout rules AFAIK. b) it outputs the exact same picture. c) i had to raise the epsilon a bit to avoid acnea, it has to do with the way i compute normals but i'm not going to fix that. d) i removed the virtualness in the most straightforward way and it's certainly not aesthetically pleasing e) i'm not going to spend ages on this, hence c) & d). f) it's algorithmically equivalent to your version (ie primary and shadow rays aren't special cased etc...)
> In both cases, the types of objects and bounding volumes are
extensible. That means nothing. Everything is extensible. And in this *particular case*, using dynamic binding is, again, superfluous. I've resolved that in a really crude way, but that's just one possibility.
> > So you're saying intersections aren't part of the hotpath? > > Well you maybe right, after all it's only called a few million times.
> Yes, exactly. It isn't really that surprising when you consider that > "ray_sphere" is called ~10x more than "intersect" and it contains several > branches, lots of floating point maths and function calls.
I was sarcastic; "ray_sphere" calls that interection routine which cannot be inlined because it's virtual etc... Also, the fact is that this kind of sphere intersection can easily be made branchless with something like: if ((d < 0.) | (t2 < 0.)) return infinity; else return t1 > 0. ? t1 : t2; But you want to branch away from the square root as it is what's wasting most cycles.
> >> Yes, that was also done deliberately to squeeze the C++ into 100 LOC. > > Bingo!
> > And i suppose that the fact that this emasculated C++ version performed > > slower than its OCaml counterpart was merely a side effect?
> The C++ outperforms the OCaml on AMD64 (and IA64). I'd say that the > performance is comparable in all cases.
Again, your benchmark as presented it totally unfair.
Jon Harrop wrote: > Ok. I don't know anything about intrinsics but they don't seem to be > relevant to my comparison of a tutorial ray tracer in C++ and OCaml.
True.
> As I've said, if you want to optimise my ray tracer (or any other program) > then I would recommend algorithmic optimisations over low-level ones
like Huh. Really?
> SIMD, cache coherency etc. There are plenty more algorithmic optimisations > to be done (although I've already done the main one - hierarchical bounding > volumes).
So what? The fact is that implementation quality matters just as much. A program has to run on a given machine at some point.
> In this case, it looks as though you'd want to write specialised C code for > certain platforms. This could be called from C++ or OCaml and it would > speed both programs up by roughly the same amount. This might be a good > idea though, as OCaml is clearly preferable for the higher-level, not- > performance-critical bulk of the code. So you could get rid of C++ > altogether. :-)
If you want to see it that way, that's fine. But i wouldn't say that spending most cycles in c++ called from "higher level" amounts to getting rid of it. Au contraire.
Warp wrote: > Jon Harrop <use...@jdh30.plus.com> wrote: >> Creating structs for the return values of every function that returns >> multiple values may be a good idea, but the verbosity and clumsiness of >> doing this in C++ makes it quite undesirable.
> You mean modular?
No.
> Have you ever heard of abstraction?
Yes.
For example, the following OCaml function returns the given number as well as its square and cube:
let f n = n, n*n, n*n*n;;
The C++ equivalent of this using a struct is:
struct res { int n, n2, n3; };
res f(int n) { res ans; ans.n = n; ans.n2 = n*n; ans.n3 = n*n*n; return ans;
tbp wrote: >> SIMD, cache coherency etc. There are plenty more algorithmic > optimisations >> to be done (although I've already done the main one - hierarchical > bounding >> volumes). > So what? The fact is that implementation quality matters just as much.
No, it doesn't. If you can reduce your algorithm from O(n) to O(log n) then you should do so before you rewrite it in assembler.
> A program has to run on a given machine at some point.
True. I'm not saying that it is not worthwhile to optimise a ray tracer with SIMD, just that it would be premature to do so on my ray tracer right now.
I would be very interested to see a highly optimised version of my ray tracer. I think a lot of people could learn from a set of web pages which show how to optimise the programs. If you're willing to help and I can find the time then I'd like to do this sometime.
>> In this case, it looks as though you'd want to write specialised C > code for >> certain platforms. This could be called from C++ or OCaml and it > would >> speed both programs up by roughly the same amount. This might be a > good >> idea though, as OCaml is clearly preferable for the higher-level, > not- >> performance-critical bulk of the code. So you could get rid of C++ >> altogether. :-) > If you want to see it that way, that's fine. > But i wouldn't say that spending most cycles in c++ called from "higher > level" amounts to getting rid of it. Au contraire.
I wouldn't call this kind of stuff C++:
a = _mm_load_ps(inp_sse1); b = _mm_load_ps(inp_sse2); c = _mm_add_ps(a, b); _mm_store_ps(out_sse, c);