tbp wrote: > I'll leave non objective criterions aside (how do you measure clarity?) > and, as you insist on discarding my previous patches & measures, i've > taken the time this morning to make this up:
I appreciate it, thank you.
On my 1.2GHz Athlon t-bird I get:
$ g++-3.4 -Wall -O3 -ffast-math -funroll-all-loops ray.cpp -o ray $ time ./ray >image.ppm
real 0m55.256s user 0m54.860s sys 0m0.080s $ g++-3.4 -Wall -O3 -ffast-math -funroll-all-loops miniray.cpp -o miniray miniray.cpp:18: warning: division by zero in `1.0e+0 / 0.' $ time ./miniray >image.ppm
real 0m28.587s user 0m28.400s sys 0m0.040s
So your optimised version is 1.9x faster.
My optimised version is basically the same speed as yours (although quite different):
> f) it's algorithmically equivalent to your version (ie primary and > shadow rays aren't special cased etc...)
My optimised version has a specialised intersect function for shadow rays (but still fits in <100 LOC and is just a clear, IMHO).
>> In both cases, the types of objects and bounding volumes are > extensible. > That means nothing. Everything is extensible. > And in this *particular case*, using dynamic binding is, again, > superfluous. > I've resolved that in a really crude way, but that's just one > possibility.
Can you describe how a new kind of object/bound would be implemented?
>> Yes, exactly. It isn't really that surprising when you consider that >> "ray_sphere" is called ~10x more than "intersect" and it contains > several >> branches, lots of floating point maths and function calls. > I was sarcastic; "ray_sphere" calls that interection routine which > cannot be inlined because it's virtual etc...
In my original implementation, "intersect" calls "ray_sphere" and not the other way around. Do you mean "ray_sphere" could not have been inlined?
> Also, the fact is that this kind of sphere intersection can easily be > made branchless with something like: > if ((d < 0.) | (t2 < 0.)) return infinity; > else return t1 > 0. ? t1 : t2; > But you want to branch away from the square root as it is what's > wasting most cycles.
I hadn't optimised "ray_sphere".
>> The C++ outperforms the OCaml on AMD64 (and IA64). I'd say that the >> performance is comparable in all cases. > Again, your benchmark as presented it totally unfair.
> On my 1.2GHz Athlon t-bird I get: > $ g++-3.4 -Wall -O3 -ffast-math -funroll-all-loops miniray.cpp -o miniray > miniray.cpp:18: warning: division by zero in `1.0e+0 / 0.' > $ time ./miniray >image.ppm
> real 0m28.587s > user 0m28.400s > sys 0m0.040s
> So your optimised version is 1.9x faster.
I've only quickly checked with g++-3.4, g++-4.x does a much better job (at least on my opteron on x86-64, haven't checked on my t-bird).
And i wouldn't consider my version as optimized by any stretch, but just one way to decently write a c++ implementation of that thing.
> My optimised version is basically the same speed as yours (although quite > different):
Hmm, need to check what it gives with a shadow ray shortcut, if i can cram it up in there.
> > c) i had to raise the epsilon a bit to avoid acnea, it has to do with > > the way i compute normals but i'm not going to fix that.
> Your epsilon is actually smaller than my delta (which was sqrt
epsilon). Ah! Missed that bit. Anyway that shouldn't make a difference as both epsilons are too small to cull any part of the hierarchy away.
> > d) i removed the virtualness in the most straightforward way and it's > > certainly not aesthetically pleasing
> I'll study this in more detail.
> > e) i'm not going to spend ages on this, hence c) & d).
> What better solution would you recommend for (d)?
If that's better in the more efficient sense, none. If that's better in the tigher memory foot print sense, you could shrink it a bit by removing the use of a bounding sphere for 'final' nodes (those sitting at the lowest lvl) and paying for that at runtime with a branch or something. Other than that it's just a matter of taste, and the virtual solution is the most concise.
It's a bit tighter than my solution, but both are variations on the same theme. Ideally i would use an explicit memory layout so things could be properly aligned etc... And it would certainly end up being much uglier.
> > f) it's algorithmically equivalent to your version (ie primary and > > shadow rays aren't special cased etc...)
> My optimised version has a specialised intersect function for shadow rays > (but still fits in <100 LOC and is just a clear, IMHO).
I'll try to bolt something like that on top of my version.
> Can you describe how a new kind of object/bound would be implemented?
A BVH is fine for that kind of scenes, but it's all in the construction of the hierarchy (or put another way, in its quality). Now i seriously doubt i could cram a SAH compiler within 100 LOC :)
> In my original implementation, "intersect" calls "ray_sphere" and not the > other way around. Do you mean "ray_sphere" could not have been
inlined? Sphere::intersect calls Sphere::ray_sphere, and that maybe inlined but Sphere::intersect itself then cannot be inlined because it's virtual and not statically resolved (and on top of that instead of a static call, you pay for a indirect one). Compare that to what happens in my version where the compiler has the choice to inline to its heart content; that is placing (or not) a call if that's cheaper at some point. Like i've said earlier exposing things to the compiler is the name of the game.
> > Also, the fact is that this kind of sphere intersection can easily be > > made branchless with something like: > > if ((d < 0.) | (t2 < 0.)) return infinity; > > else return t1 > 0. ? t1 : t2; > > But you want to branch away from the square root as it is what's > > wasting most cycles.
> I hadn't optimised "ray_sphere".
I was just saying that, in that function cycles are spent computing the square root, not branching; so it's cheaper to avoid with a branch than to unconditionnaly compute it (and then use some conditionnal moves).
> >> The C++ outperforms the OCaml on AMD64 (and IA64). I'd say that the > >> performance is comparable in all cases. > > Again, your benchmark as presented it totally unfair.
> In what way is it unfair?
Because you obstructed the c++ compiler in many many ways, the most prominent being passing structures by value instead of by reference (as show in previous patches). That's why i've also said the comparison was rigged as you didn't put as much effort on the c++ side as on the OCaml; it's not surprising my version is ~2x faster than the original.
Now if you present an amended version that at least fixes those issues, i'll have no objections to whatever conclusion you can come up to.
Jon Harrop <use...@jdh30.plus.com> wrote: > >> Creating structs for the return values of every function that returns > >> multiple values may be a good idea, but the verbosity and clumsiness of > >> doing this in C++ makes it quite undesirable.
> > You mean modular? > No.
Then IMO you are wrong.
> Do you see what I mean by verbose and clumsy?
No.
Granted, C++ does not offer some features other programming languages do, but you gave a misleading example. You gave the impression that if you want to return three values from a function you have to always write the entire struct code plus all the separate assignments. That is like saying that each time you want to use a string you have to write a string class explicitly.
C++ offers tools to make that code of yours simpler and shorter. The way you gave in your example is not the only nor by any means the best way.
Besides, it usually comes down to design. It is rare that a function just returns some values whose type and meaning are random. Usually those returned values have a close relation to each other and form a logical group of values. For example, if you have a function which returns an intersection point and a normal vector, does it make sense, from design point of view, to return these values using generic nameless structures with no higher logic in them? Or does it make more sense to create a logical data container called according to its usage, for example "Intersection", which contains the two (or more) things which are closely related to each other?
Which do you consider more logical, a function which just returns a pair of values (using a generic container), or a function which returns a value of type "Intersection"?
Jon Harrop wrote: > > So what? The fact is that implementation quality matters just as much.
> No, it doesn't. If you can reduce your algorithm from O(n) to O(log n) then > you should do so before you rewrite it in assembler.
As we're obviously both right, and in fine it's a matter of tradeoffs, it would pointless to argue further more.
> > A program has to run on a given machine at some point.
> True. I'm not saying that it is not worthwhile to optimise a ray tracer with > SIMD, just that it would be premature to do so on my ray tracer right
now. I beg to differ. You can't simply plug in some SIMD scheme on top of that as an afterthought, because it's tightly coupled with the way you represent things. That's one of the common pitfall of OO abstraction. Wald discuss that point a bit in his thesis and i couldn't agree more. But i know it's a not a very popular opinion atm.
> I would be very interested to see a highly optimised version of my ray > tracer. I think a lot of people could learn from a set of web pages which > show how to optimise the programs. If you're willing to help and I can find > the time then I'd like to do this sometime.
There's certainly lots of angles to attack that problem (algorithmically or not), but i doubt any of them could fit within 100 LOC. Beside C++ idiomatic issues, a better memory layout could certainly help. That's, perhaps, the lowest hanging fruit (that or 'cheating' with lower numerical precision, ie for square roots).
Anyway as i've already invested some time in a c++ version, i'm open to suggestions :)
> I wouldn't call this kind of stuff C++:
> a = _mm_load_ps(inp_sse1); > b = _mm_load_ps(inp_sse2); > c = _mm_add_ps(a, b); > _mm_store_ps(out_sse, c);
But that's legit C++. C++ is C on steroids while C itself is just a gloried assembler. No matter how much OO or syntactic sugar you pour onto it. I kinda apreciate to be able to code to-the-metal. To each, its own.
Warp wrote: > Besides, it usually comes down to design. It is rare that a function > just returns some values whose type and meaning are random.
No, that is my point. It is common, not rare. Look at the prevelance of tuple return types in ML programs and you'll see what I mean. You won't see it in C++ code because C++ makes it too tedious for people to bother coding it.
Many of the functions in the OCaml standard library return 2-tuples. None of them need to be explicitly defined.
tbp wrote: > You can't simply plug in some SIMD scheme on top of > that as an afterthought, because it's tightly coupled with the way you > represent things. > That's one of the common pitfall of OO abstraction. > Wald discuss that point a bit in his thesis and i couldn't agree more. > But i know it's a not a very popular opinion atm.
I agree, actually. I know nothing of cache coherency in the context of ray tracers though.
> There's certainly lots of angles to attack that problem > (algorithmically or not), but i doubt any of them could fit within 100 > LOC.
Yes, it wouldn't make for a very good introductory tutorial either. Although I'm sure a lot of people (me included) would be interested.
> Beside C++ idiomatic issues, a better memory layout could certainly > help. > That's, perhaps, the lowest hanging fruit (that or 'cheating' with > lower numerical precision, ie for square roots).
It would be interesting to compare equivalently optimised C++ and OCaml programs based upon float arrays. Handling alignment will be tricky in OCaml.
>> I wouldn't call this kind of stuff C++:
>> a = _mm_load_ps(inp_sse1); >> b = _mm_load_ps(inp_sse2); >> c = _mm_add_ps(a, b); >> _mm_store_ps(out_sse, c); > But that's legit C++.
Is that in the C++ standard or is it a common extension?
> C++ is C on steroids while C itself is just a gloried assembler. > No matter how much OO or syntactic sugar you pour onto it. > I kinda apreciate to be able to code to-the-metal. > To each, its own.
tbp wrote: > I've only quickly checked with g++-3.4, g++-4.x does a much better job > (at least on my opteron on x86-64, haven't checked on my t-bird).
I'll check it out when the Debian packages arrive. :-)
> And i wouldn't consider my version as optimized by any stretch, but > just one way to decently write a c++ implementation of that thing.
"more optimised" then. :-)
>> My optimised version is basically the same speed as yours (although > quite >> different): > Hmm, need to check what it gives with a shadow ray shortcut, if i can > cram it up in there.
I don't think it actually made mine much faster so it probably isn't worth it.
> It's a bit tighter than my solution, but both are variations on the > same theme. > Ideally i would use an explicit memory layout so things could be > properly aligned etc... And it would certainly end up being much > uglier.
The OCaml might be significantly easier to understand if suitable higher-order functions can be factored out which make the memory layout "look" like a tree.
>> Can you describe how a new kind of object/bound would be implemented? > A BVH is fine for that kind of scenes, but it's all in the construction > of the hierarchy (or put another way, in its quality). > Now i seriously doubt i could cram a SAH compiler within 100 LOC :)
What is an SAH compiler?
>> In my original implementation, "intersect" calls "ray_sphere" and not > the >> other way around. Do you mean "ray_sphere" could not have been > inlined? > Sphere::intersect calls Sphere::ray_sphere, and that maybe inlined but
That's the important one though.
> Sphere::intersect itself then cannot be inlined because it's virtual > and not statically resolved (and on top of that instead of a static > call, you pay for a indirect one). > Compare that to what happens in my version where the compiler has the > choice to inline to its heart content; that is placing (or not) a call > if that's cheaper at some point. > Like i've said earlier exposing things to the compiler is the name of > the game.
Yes. Although this is only useful to other people if it has a description of what information is being provided to the compiler. It would be nice to write this up.
>> In what way is it unfair? > Because you obstructed the c++ compiler in many many ways,
I think the OCaml compiler was equivalently obstructed.
> the most > prominent being passing structures by value instead of by reference (as > show in previous patches).
I think the compiler could (should) have spotted that optimisation without having to be told.
> That's why i've also said the comparison was rigged as you didn't put > as much effort on the c++ side as on the OCaml; it's not surprising my > version is ~2x faster than the original.
Well, I have a more optimised OCaml version which is just as fast on x86 and 20 lines shorter.
> Now if you present an amended version that at least fixes those issues, > i'll have no objections to whatever conclusion you can come up to.
I'll try to write another page describing both optimised versions. I also have a C version now, which is slower than both the C++ and the OCaml. :-)
Jon Harrop wrote: > tbp wrote: > > I've only quickly checked with g++-3.4, g++-4.x does a much better job > > (at least on my opteron on x86-64, haven't checked on my t-bird).
> I'll check it out when the Debian packages arrive. :-)
There's a 4.0.0 package, at least on sid 64. Otherwise building a snapshot is straightforward, yet a bit off topic :)
> > Hmm, need to check what it gives with a shadow ray shortcut, if i can > > cram it up in there.
> I don't think it actually made mine much faster so it probably isn't worth > it.
> The OCaml might be significantly easier to understand if suitable > higher-order functions can be factored out which make the memory layout > "look" like a tree.
I'm absolutely not qualified to comment on OCaml unfortunately :)
> >> Can you describe how a new kind of object/bound would be implemented? > > A BVH is fine for that kind of scenes, but it's all in the construction > > of the hierarchy (or put another way, in its quality). > > Now i seriously doubt i could cram a SAH compiler within 100 LOC :)
> What is an SAH compiler?
Surface Area Heuristic, a way to evaluate costs when building a hierarchy. Does wonders on BVH, kd-tree... you name it. But it's tricky to get it right and even more to make it not behave quadratically. Havran's thesis is a must read, Wald builds upon it (but there's a metric ton of litterature about it). http://www.cgg.cvut.cz/~havran/phdthesis.html http://www.mpi-sb.mpg.de/~wald/PhD/
> > Like i've said earlier exposing things to the compiler is the name of > > the game.
> Yes. Although this is only useful to other people if it has a description of > what information is being provided to the compiler. It would be nice to > write this up.
I'm pedagogically challenged :) On top of peremptoric statements like "know the language" and "know your compiler", it boils down to things like const correctness, tediously constraining things and so on. Or put another way, compilers are mental midgets and C++ is a language made for compilers and not humans.
Ok, that wasn't really helpful. But i'm sure someone already has put that in a nicer way somewhere, no? :)
> >> In what way is it unfair? > > Because you obstructed the c++ compiler in many many ways,
> I think the OCaml compiler was equivalently obstructed.
I can't say. But for sure, your C++ "style" was not typical.
> > the most > > prominent being passing structures by value instead of by reference (as > > show in previous patches).
> I think the compiler could (should) have spotted that optimisation without > having to be told.
That's wishful thinking. And it really really doesn't work that way. And don't blame g++ for that, as it's actually quite good at that game. But that's another topic.
> > That's why i've also said the comparison was rigged as you didn't put > > as much effort on the c++ side as on the OCaml; it's not surprising my > > version is ~2x faster than the original.
> Well, I have a more optimised OCaml version which is just as fast on x86 and > 20 lines shorter.
Bring it on ;)
> > Now if you present an amended version that at least fixes those issues, > > i'll have no objections to whatever conclusion you can come up to.
> I'll try to write another page describing both optimised versions. I also > have a C version now, which is slower than both the C++ and the
Jon Harrop wrote: > I agree, actually. I know nothing of cache coherency in the context of ray > tracers though.
First, i'd like to point out that those SIMD rt not only take care about mem/cache coherency, but also _rays_ coherency (dunno if that was clear).
Then cache coherency isn't an issue in your app, at least at the default level. Though, better alignement would certainly help wrt doubles on x86.
> > There's certainly lots of angles to attack that problem > > (algorithmically or not), but i doubt any of them could fit within 100 > > LOC.
> Yes, it wouldn't make for a very good introductory tutorial either. Although > I'm sure a lot of people (me included) would be interested.
Jacco Bikker has produced some interesting articles on flipcode about that. But his articles are now lagging way behind his own implementation. He's a lazy bastard :)
I've implemented, both a BVH & kdtree based coherent raytracer, but even the BVH one (which is simpler) would require lots of background description before addressing the meaty stuff. That's quite infortunate (and the litterature on the topic is absolutely horrible to read).
> > Beside C++ idiomatic issues, a better memory layout could certainly > > help. > > That's, perhaps, the lowest hanging fruit (that or 'cheating' with > > lower numerical precision, ie for square roots).
> It would be interesting to compare equivalently optimised C++ and OCaml > programs based upon float arrays. Handling alignment will be tricky in > OCaml.
Float arrays? I fail to grok the context. Could you elaborate a bit?
> >> I wouldn't call this kind of stuff C++:
> >> a = _mm_load_ps(inp_sse1); > >> b = _mm_load_ps(inp_sse2); > >> c = _mm_add_ps(a, b); > >> _mm_store_ps(out_sse, c); > > But that's legit C++.
> Is that in the C++ standard or is it a common extension?
Common x86 extension yes (seen in gcc, msvc, icc etc). An altivec equivalent is common on PPC platforms, i think. Admitedly, it's in phase with the "C as a glorified assembler" i alluded to earlier.
tbp wrote: > A quick hack at that (within the rules) shaved a second of runtime > here. [snip] > # time ./miniray >pix.ppm
> real 0m4.856s > user 0m4.812s > sys 0m0.004s
I'm replying to myself (don't you love that), because while tinkering with it, again - that's quite addictive -, i've noticed 2 things: a) the original deallocation routine, Group::del(), that i've shamelessly re-used is bogus and only delete the first level. b) more importantly, the BVH can be tighten a bit: at a given level, if you place a sphere of radius r centered at v(x,y,z), then instead of bounding the sub hierarchy with sphere(v, 3*r) you can bound it with sphere(v+(0,r,0), 2.4330128*r).
Then, # time ./miniray >pix.ppm real 0m3.734s user 0m3.721s sys 0m0.003s
tbp wrote: > tbp wrote: >> A quick hack at that (within the rules) shaved a second of runtime >> here. > [snip] >> # time ./miniray >pix.ppm
>> real 0m4.856s >> user 0m4.812s >> sys 0m0.004s
> I'm replying to myself (don't you love that), because while tinkering > with it, again - that's quite addictive -, i've noticed 2 things: > a) the original deallocation routine, Group::del(), that i've > shamelessly re-used is bogus and only delete the first level.
Oops, yes. How thoughtless of you to copy my mistake. :-)
> b) more importantly, the BVH can be tighten a bit: at a given level, if > you place a sphere of radius r centered at v(x,y,z), then instead of > bounding the sub hierarchy with sphere(v, 3*r) you can bound it with > sphere(v+(0,r,0), 2.4330128*r).
Yes, I hadn't optimised that either (just copied the idea from the real sphere-flake, for which the easy approach is more excusable). :-)
> Then, > # time ./miniray >pix.ppm > real 0m3.734s > user 0m3.721s > sys 0m0.003s
> Voilà.
I'm surprised it makes such a big difference. I shall check it out in more detail. I really have spent too much time playing with this now. :-)
Jon Harrop wrote: > Oops, yes. How thoughtless of you to copy my mistake. :-)
Let's say it was a feature.
> I'm surprised it makes such a big difference. I shall check it out in more > detail. I really have spent too much time playing with this now. :-)
All the magic is in the hierarchy. Now as i couldn't see how i could pack in the code for a better bvh and this tree is complete i've addressed other points this morning: . i turn the tree into an array with skip pointers - that can be done in 1 pass, the tree is complete - while i build in place. . then i can iterate over it when tracing. [uglified 100 LOC source annexed] [also here http://ompf.org/ray/bvh100.cpp]
So i was at
> > # time ./miniray >pix.ppm > > real 0m3.734s > > user 0m3.721s > > sys 0m0.003s
and now # g++ -O2 -ffast-math -o bvh bvh100.cpp # time ./bvh >pix.ppm real 0m2.985s user 0m2.966s sys 0m0.005s
but # cmp pix.orig.ppm pix.ppm pix.orig.ppm pix.ppm differ: byte 1040, line 4 even if a diff between the two results in a black picture for the eye.
It's not as fast as it could, but the source is stuffed enough as it is (cough).
> All the magic is in the hierarchy. > Now as i couldn't see how i could pack in the code for a better bvh and > this tree is complete i've addressed other points this morning:
What data structures and algorithms would you recommend? Perhaps they could be implemented in the OCaml version...
> . i turn the tree into an array with skip pointers - that can be done > in 1 pass, the tree is complete - while i build in place. > . then i can iterate over it when tracing.
> > # time ./miniray >pix.ppm > > real 0m3.734s > > user 0m3.721s > > sys 0m0.003s > and now > # g++ -O2 -ffast-math -o bvh bvh100.cpp > # time ./bvh >pix.ppm > real 0m2.985s > user 0m2.966s > sys 0m0.005s
On 1.2GHz Athlon t-bird, the unoptimised C++ on my webpage took:
$ g++-3.4 -Wall -O2 -ffast-math ray.cpp -o ray $ time ./ray >image.pgm
$ g++-3.4 -Wall -O2 -ffast-math bvh100.cpp -o bvh100 bvh100.cpp:17: warning: division by zero in `1.0e+0 / 0' $ time ./bvh100 >image.pgm
real 0m18.492s user 0m16.030s sys 0m0.070s
> but > # cmp pix.orig.ppm pix.ppm > pix.orig.ppm pix.ppm differ: byte 1040, line 4 > even if a diff between the two results in a black picture for the eye.
A couple of points here:
1. I think the -ffast-math can cause gcc to violate IEEE-complaint FP arithmetic, in which case it could theoretically alter the output and could be banned as a consequence.
2. If we're happy to produce similar-looking results then switching to single-precision takes over 40% off the time (on my T-bird, probably not on AMD64):
$ g++-3.4 -Wall -O2 -ffast-math bvh100_2.cpp -o bvh100_2 bvh100_2.cpp:17: warning: division by zero in `1.0e+0 / 0' $ time ./bvh100_2 >image2.pgm
real 0m11.282s user 0m9.380s sys 0m0.050s
For the shootout, the results will probably have to be identical. I'm more than happy to generate correct-looking results (as this is the point of a ray tracer!) for my web page though.
Jon Harrop wrote: > > All the magic is in the hierarchy.
> What data structures and algorithms would you recommend? Perhaps they could > be implemented in the OCaml version...
The present one has some nice properties if only for its 'compactness'. Forgetting about implementation complexity for a moment, there's 2 angles: a) the nature of the hierarchy; could use a kd tree (tiny nodes, cheap intersection, efficient culling) or a box bvh (maybe tighter); both can be cheaply traversed iteratively. b) regardless of the hierarchy type, use a cost driven scheme to build it; ultimately the full SAH thing.
That would give us a better hierarchy, but then it will certainly take more time to build it than to render... (ie my best SAH kdtree compiler takes ~30s to compile a 250k triangle soup).
I think my iterative version could be ammended for better memory alignment (at least on x86) just by splitting the node_t struct into 2 spheres on a side (2*32bytes) and skip pointers on the other. Would lend to some uglyfication. Haven't checked what it would give with floats.
On top of all that, you can also think of shooting more than 1 ray at once and things like that ;)
> On 1.2GHz Athlon t-bird, the unoptimised C++ on my webpage took: > $ g++-3.4 -Wall -O2 -ffast-math bvh100.cpp -o bvh100 > bvh100.cpp:17: warning: division by zero in `1.0e+0 / 0' > $ time ./bvh100 >image.pgm
> real 0m18.492s > user 0m16.030s > sys 0m0.070s
Something's wrong. That's >6x slower than on my box, an opteron 146 (2ghz), which is supposedly only, say, 2x faster than yours.
But i'm using a gcc snapshot in 64bit mode: a) there's more registers. b) more importantly, SSE math is used (not an option with doubles on your t-bird).
> A couple of points here:
> 1. I think the -ffast-math can cause gcc to violate IEEE-complaint FP > arithmetic, in which case it could theoretically alter the output and could > be banned as a consequence.
There's much more "violation" when using the x86 fpu (because of the internal 80bit representation etc) than SSE.
> 2. If we're happy to produce similar-looking results then switching to > single-precision takes over 40% off the time (on my T-bird, probably not on > AMD64):
It would help with the memory bandwidth. Other than that, only the sqrt would be a bit faster. Try changing the fpu precision on your t-bird.
> For the shootout, the results will probably have to be identical. I'm
more Yes.
> than happy to generate correct-looking results (as this is the point of a > ray tracer!) for my web page though.
Jon Harrop <use...@jdh30.plus.com> wrote: > 1. I think the -ffast-math can cause gcc to violate IEEE-complaint FP > arithmetic, in which case it could theoretically alter the output and could > be banned as a consequence.
I have my doubts that it will cause any difference in the final image (given that you round numbers correctly). Those optimizations, if they cause any computation to give a different result, probably cause the difference in the 15th decimal or so.
I don't know what kind of optimizations -ffast-math does, but one that I could imagine is this kind:
double d = 0.1234; double a = x/d, b = y/d, c = z/d;
That could be probably optimized internally to:
double d = 1/0.1234; double a = x*d, b = y*d, c = z*d;
Due to how floating point works a, b and c might get just a slightly different value (the difference would probably be in the least significant decimal or the two least, which with doubles is something like the 15th decimal or so).
Is there a reason why you're compiling "ray11" and then running "ray10" for the test?
-- Francois Labreque | The surest sign of the existence of extra- flabreque | terrestrial intelligence is that they never @ | bothered to come down here and visit us! videotron.ca | - Calvin
tbp wrote: > On top of all that, you can also think of shooting more than 1 ray at > once and things like that ;)
I might try this if I can find the time.
>> On 1.2GHz Athlon t-bird, the unoptimised C++ on my webpage took: >> $ g++-3.4 -Wall -O2 -ffast-math bvh100.cpp -o bvh100 >> bvh100.cpp:17: warning: division by zero in `1.0e+0 / 0' >> $ time ./bvh100 >image.pgm
>> real 0m18.492s >> user 0m16.030s >> sys 0m0.070s > Something's wrong. That's >6x slower than on my box, an opteron 146 > (2ghz), which is supposedly only, say, 2x faster than yours.
I think gcc is suffering from poor FP code generation. Interestingly, it is usually ocamlopt that suffers from this and not gcc. I've no idea why but this ray tracer is the only program I've ever written which produces these kind of results (faster OCaml on x86).
Someone has kindly ported it to Fortran on the shootout now. For the same functionality, the Fortran is much faster than both the C++ and the OCaml (and the SML compiled with Mlton, which is actually significantly faster than C++).
> But i'm using a gcc snapshot in 64bit mode: > a) there's more registers. > b) more importantly, SSE math is used (not an option with doubles on > your t-bird).
My AMD64 is also 6x faster (it is supposed to be 3x faster, according to other benchmarks).
>> A couple of points here:
>> 1. I think the -ffast-math can cause gcc to violate IEEE-complaint FP >> arithmetic, in which case it could theoretically alter the output and > could >> be banned as a consequence. > There's much more "violation" when using the x86 fpu (because of the > internal 80bit representation etc) than SSE.
True.
>> 2. If we're happy to produce similar-looking results then switching > to >> single-precision takes over 40% off the time (on my T-bird, probably > not on >> AMD64): > It would help with the memory bandwidth. Other than that, only the sqrt > would be a bit faster. > Try changing the fpu precision on your t-bird.
It doubles the performance. However, it almost halves the performance on AMD64.
Jon Harrop wrote: > I think gcc is suffering from poor FP code generation. Interestingly, it is > usually ocamlopt that suffers from this and not gcc. I've no idea why but > this ray tracer is the only program I've ever written which produces these > kind of results (faster OCaml on x86).
I haven't checked the 32bit side. I guess it's hitting some weakspot in gcc; that might be interesting to report.
It's interesting to note that on x86-64 ICC 8.1 & 9.0beta can't beat gcc (or more precisely a 4.1 snapshot).
> Someone has kindly ported it to Fortran on the shootout now. For the same > functionality, the Fortran is much faster than both the C++ and the OCaml > (and the SML compiled with Mlton, which is actually significantly faster > than C++).
I guess you really mean faster on x86 (as opposed to x86-64), and then i would put the blame on the 32bit fpu codegen thing too. I mean C/C++ being as low level as it is, it should be faster given a proper implementation; uglier, hackish, verbose etc but not slower :)
> > Try changing the fpu precision on your t-bird.
> It doubles the performance. However, it almost halves the performance on > AMD64.
Huh? On x86-64, SSE should be used and that op. set is impervious to FPU precision modes. If you're talking about swapping doubles for floats, i've checked that and i get a marginal speed improvement (after fixing stuff & memory layout). So, can you elaborate a bit?
Last week i've cleaned things a bit and put stub here, http://ompf.org/ray/sphereflake/, for a 100 LOC C++ version that generates a sphere flake, or something looking closer to the real thing. I've settled for a 2x2 rotated grid to get decent runtime while preserving image quality.
tbp wrote: >>Someone has kindly ported it to Fortran on the shootout now. > I mean C/C++ being as low level as it is, it should be faster given a > proper implementation; uglier, hackish, verbose etc but not slower :)
FORTRAN has various language rules (such as disallowing aliasing, allowing better optimizations) that C and C++ don't have. (FORTRAN originally was also very restricted in the forms that array indicies could take, ensuring the compiler could make use of index registers, for example.) So, no, FORTRAN is often faster than C for the same program.
-- Darren New / San Diego, CA, USA (PST) The samba was clearly inspired by the margarita.
Darren New wrote: > FORTRAN has various language rules (such as disallowing aliasing, > allowing better optimizations) that C and C++ don't have. (FORTRAN > originally was also very restricted in the forms that array indicies > could take, ensuring the compiler could make use of index registers, for > example.) So, no, FORTRAN is often faster than C for the same program.
But now C/C++ comes equiped with 'restrict' and generally a whole range of tightening aliasing rules as compiler options (with said compilers being smarter & smarter about aliasing); no silver bullet but it shows some progress on that side.
Anyway, if you said C/C++ was ugly, borked, byzantine or unproductive i would have agreed. But when it comes to runtime performance, i simply can't buy it: you can't get much lower than that without writing straight asm (with the corrolary that you may have to express your problem in compiler friendly way)... it all boils down to the implementation's quality.
PS: i really didn't meant to start yet another sterile religious debate about languages :)
Darren New wrote: > tbp wrote: >>>Someone has kindly ported it to Fortran on the shootout now.
>> I mean C/C++ being as low level as it is, it should be faster given a >> proper implementation; uglier, hackish, verbose etc but not slower :)
> FORTRAN has various language rules (such as disallowing aliasing, > allowing better optimizations) that C and C++ don't have. (FORTRAN > originally was also very restricted in the forms that array indicies > could take, ensuring the compiler could make use of index registers, for > example.) So, no, FORTRAN is often faster than C for the same program.
Not only is that untrue, it hasn't been true for many years. The main reason that non-trivial Fortran programs are typically much slower than their modern counterparts is that the verbosity and difficulty of implementing more data structures and algorithms in Fortran makes non-trivial approaches intractable.
Consider to Fortran port of my ray tracer. Not only is the resulting program 3x as long as the OCaml (and by far the longest implementation), not as general (the scene tree is prohibitively difficult to implement in Fortran without sacrificing its current genericity) but the resulting program is not significantly faster than implementations written in modern languages (slightly faster than OCaml, slightly slower than Mlton-compiled SML).
As TBP has said, this discussion doesn't belong here. If you want to know more about this then read my book on OCaml for scientists: