A couple of weeks ago I posted a question on comp.lang.c++ about a technicality of binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary formats for reasons of portability and human readability.
The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
1) Binary files are about 20-70% of the size of the corresponding text files, depending on the number of significant digits stored in the text files and on other formatting glyphs.
2) Text-formatted numerical data take significantly longer to read and write than binary formats.
Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.
I have therefore sketched a 'distilled' test (code below) to test what overheads are involved with formatting numerical data back and forth between text and binary formats. To eliminate the impact of peripheral devices, I have used a std::stringstream to store the data. The binary buffers are represented by vectors, and I have assumed that a memcpy from the file buffer to the destination memory location is all that is needed to import the binary format from the file buffer. (If there are significant run-time overheads associated with moving NATIVE binary formats to the destination, please let me know.)
The output on my computer is (do note the _different_ numbers of IO cycles in the two cases!):
Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed
A little bit of math produces *average*, *crude* numbers for IO cycles:
Binary: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w cycle
Text: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w cycle
which in turn means there is an overhead on the order of 160e-9/6e-9, or about 27x, associated with the text formats.
Add a little bit of other overheads, e.g. caused by the significantly larger text file sizes in combination with suboptimal buffering strategies, and the relative numbers easily hit the triple digits. Not at all insignificant when one works with large amounts of data under tight deadlines.
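The original test code is not reproduced in this thread. What follows is only a minimal sketch of what such a 'distilled' test might look like, under the assumptions stated in the post (a memcpy for the binary case, a std::stringstream for the text case). The function names binary_round_trip, text_round_trip and seconds_for are made up for illustration, and the text case writes a separating space after each value so the numbers can be parsed back.

```cpp
#include <cassert>
#include <chrono>
#include <cstring>
#include <sstream>
#include <vector>

// "Binary import": a plain memcpy from a source buffer to the destination.
void binary_round_trip(const std::vector<double>& src, std::vector<double>& dst) {
    dst.resize(src.size());
    if (!src.empty())
        std::memcpy(dst.data(), src.data(), src.size() * sizeof(double));
}

// "Text import": format every value into a stringstream, then parse it back.
void text_round_trip(const std::vector<double>& src, std::vector<double>& dst) {
    std::stringstream ss;
    ss.precision(17);  // enough significant digits to round-trip an IEEE 754 double
    for (double v : src) ss << v << ' ';
    dst.clear();
    dst.reserve(src.size());
    double v;
    while (ss >> v) dst.push_back(v);
}

// Wall-clock seconds spent running `work` a total of `cycles` times.
template <typename Work>
double seconds_for(Work work, int cycles) {
    assert(cycles > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < cycles; ++i) work();
    const std::chrono::duration<double> elapsed = std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

Timing each round trip with seconds_for (for example with a lambda that calls text_round_trip on a 1e6-element vector) gives numbers comparable in spirit to the ones above. Note that an optimizer may shorten or remove a memcpy whose result is never used, so the destination buffer should be inspected after the timed loop.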
So please: Shoot this demo down! Give it your best, and prove me and my numbers wrong.
And to the textbook authors who might be lurking: Please include a chapter on relative binary and text-based IO speeds in your upcoming editions. Binary file formats might not fit into your overall philosophies about human readability and universal portability of C++ code, but some of your readers might appreciate being made aware of such practical details.
> A couple of weeks ago I posted a question on comp.lang.c++ about some technicality > about binary file IO. Over the course of the discussion, I discovered to my > amazement - and, quite frankly, horror - that there seems to be a school of > thought that text-based storage formats are universally preferable to binary text > formats for reasons of portability and human readability.
I don't see textual formats "universally preferred". Who said that?
> The people who presented such ideas appeared not to appreciate two details that > counter any benefits text-based numerical formats might offer:
> 1) Binary files are about 70-20% of the file size of the text files, depending > on the number of significant digits stored in the text files and other > formatting text glyphs. > 2) Text-formatted numerical data take significantly longer to read and write > than binary formats.
Actual numbers may vary, but it is an established fact that text formats take more space and more processing time, and no one objected to that. So, if your application cannot afford that overhead, you don't have a choice, and you go binary. However, other applications may afford that overhead and instead enjoy the benefits that textual formats offer:
- human readability
- transparency
- portability (I'm not talking about preserving the exact precision, but about being free of issues such as encoding, endianness, etc.)
- flexibility (Upgrading from 32-bit int to 64-bit int is a breeze.)
- manipulability (You can use text-based utilities such as awk or perl, and even text editors to modify some parts.)
... especially when you consider that in many (not all) situations, storage is less of a problem nowadays than it used to be (and maybe processing time too), and that the difference in processing times of text and binary is only a fraction of the total processing time.
// I'm afraid I'm just repeating what has been discussed over there. :(
YMMV, of course. No one is telling you that you /should/ use a textual format, and you shouldn't tell others that they /should/ use a binary format, either. The decision is, as always, a trade-off between different values. No one knows your objectives and constraints better than you do, and while others can present the pros and cons of the options, it's your job to understand them and make the decision. (Just note that worrying about performance is justified only after an actual measurement.)
> A couple of weeks ago I posted a question on comp.lang.c++ about some > technicality > about binary file IO. Over the course of the discussion, I discovered > to my > amazement - and, quite frankly, horror - that there seems to be a > school of > thought that text-based storage formats are universally preferable to > binary text > formats for reasons of portability and human readability.
Please don't see it as a horror. You're right that binary files are faster but text files are nice for debugging and backward compatibility.
In one software product we used binary files to store configurations. Then suddenly we wanted to add an item to the configuration, which made the old configuration files incompatible with the new software version. To support the old configuration files we had to write a converter, and soon we realized that we couldn't write a version converter each time we wanted to add an item. That's where XML came in handy.
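To illustrate why a keyed text format tolerates added items, here is a minimal sketch. It uses plain key=value lines as a simpler stand-in for XML, and the names parse_config and get_or are hypothetical. Keys that an old file lacks fall back to a default, and keys that old software does not know about are simply never looked up.

```cpp
#include <map>
#include <sstream>
#include <string>

// Parse "key=value" lines; blank lines, '#' comments and malformed lines are skipped.
std::map<std::string, std::string> parse_config(const std::string& text) {
    std::map<std::string, std::string> settings;
    std::istringstream in(text);
    std::string line;
    while (std::getline(in, line)) {
        const std::string::size_type eq = line.find('=');
        if (line.empty() || line[0] == '#' || eq == std::string::npos) continue;
        settings[line.substr(0, eq)] = line.substr(eq + 1);
    }
    return settings;
}

// Look up a key, falling back to a default when an older file does not contain it.
std::string get_or(const std::map<std::string, std::string>& settings,
                   const std::string& key, const std::string& fallback) {
    const auto it = settings.find(key);
    return it == settings.end() ? fallback : it->second;
}
```

With this layout a new software version can read an old file by asking for the new item with a sensible default, so no converter is needed for a purely additive change.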
Rune Allnor wrote: > A couple of weeks ago I posted a question on comp.lang.c++ about some > technicality about binary file IO.
All files are binary. ;)
> Over the course of the discussion, I discovered to my amazement - and, > quite frankly, horror - that there seems to be a school of thought that > text-based storage formats are universally preferable to binary text > formats for reasons of portability and human readability.
This is the same school as the one that suggests not doing any early optimisations.
> The people who presented such ideas appeared not to appreciate two > details that counter any benefits text-based numerical formats might > offer:
> 1) Binary files are about 70-20% of the file size of the text files, > depending on the number of significant digits stored in the text files > and other formatting text glyphs.
Compression?
> 2) Text-formatted numerical data take significantly longer to read and > write than binary formats.
Do they? I don't really believe you. The point is that IO takes lots of time, so much that it dwarfs any simple parsing operation:
> Timings are difficult to compare, since the exact numbers depend on > buffering strategies, buffer sizes, disk speeds, network bandwidths > and so on.
...as you state yourself.
> I have therefore sketched a 'distilled' test (code below) to test what > overheads are involved with formatting numerical data back and forth > between text and binary formats. To eliminate the impact of peripherical > devices, I have used a std::stringstream to store the data.
Fair choice.
> The binary bufferes are represented by vectors, and I have assumed that a > memcpy from the file buffer to the destination memory location is all that > is needed to import the binary format from the file buffer. (If there are > significant run-time overheads associated with moving NATIVE binary > formats to the destination, please let me know.)
Not a fair choice. You have completely omitted the conversion from the on-disk representation to your in-memory representation. Things that differ are endianness, sizes, alignment and padding.
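As an illustration of the conversion step mentioned here, the following is a minimal sketch of writing and reading a double in a fixed little-endian byte order, independent of the host's endianness. The names put_double_le and get_double_le are made up for this example, and the sketch assumes that double is a 64-bit IEEE 754 type whose in-memory byte order matches that of 64-bit integers (checked in part by the static_assert).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>
#include <vector>

static_assert(sizeof(double) == sizeof(std::uint64_t) &&
                  std::numeric_limits<double>::is_iec559,
              "this sketch assumes 64-bit IEEE 754 doubles");

// Append `value` to `out` as 8 bytes, least significant byte first.
void put_double_le(std::vector<unsigned char>& out, double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // reinterpret without aliasing issues
    for (int i = 0; i < 8; ++i)
        out.push_back(static_cast<unsigned char>((bits >> (8 * i)) & 0xFF));
}

// Rebuild a double from 8 little-endian bytes starting at `p`.
double get_double_le(const unsigned char* p) {
    assert(p != nullptr);
    std::uint64_t bits = 0;
    for (int i = 0; i < 8; ++i)
        bits |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

The shifts cost a little more than a bare memcpy, which is the overhead the benchmark leaves out. It is still far less work than decimal formatting and parsing.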
> And to the textbook authors who might be lurking: Please include a > chapter on relative binary and text-based IO speeds in your upcoming > editions. Binary file formats might not fit into your overall > philosophies about human readability and universal portability of C++ > code, but some of your readers might appreciate being made aware of > such practical details.
IMHO less for file formats than for protocols; otherwise I agree that a comparison/warning would be useful.
> std::stringstream ss; [...] > for (m = 0; m < NumElements; ++m) > ss << SourceBuffer[m];
Wrong: You are writing the numbers without any separating character, making it impossible to read them afterwards.
Wrong: Use the idiomatic "while(s >> val)". Your loop will probably overflow the buffer by reading one past the end. Actually, with the error above, I have no clue what your loop does; you should have checked correctness, too.
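The two corrections can be sketched as follows. This is an illustrative example with hypothetical helper names (write_values, read_values), not the original poster's code: each value is followed by a separating space on output, and input uses the extraction itself as the loop condition so the loop stops as soon as parsing fails.

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <vector>

// Write each value followed by a space so adjacent numbers stay distinguishable.
void write_values(std::ostream& out, const std::vector<double>& values) {
    for (double v : values) out << v << ' ';
}

// Read values until extraction fails (end of data or malformed input).
std::vector<double> read_values(std::istream& in) {
    std::vector<double> values;
    double v;
    while (in >> v) values.push_back(v);
    return values;
}
```

Without the separator, writing 1.5 and then 2.5 produces the character sequence "1.52.5", which can no longer be split back into the original two numbers.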
Further notes:
1. C++ IOStreams are a complex formatting and parsing framework using plugins for pretty much any operation. Every use of a plugin amounts to a lookup of the plugin and a virtual function call, with all the restrictions that imposes on the optimizer. I would try to optimize that part first before dumping a textual file layout.
2. Apart from the two glitches above, which are easily caught, textual formatting is pretty easy to get right. However, I dare you to write portable code to write a sequence of double values to a "packed binary" file. This is far from trivial.
> A couple of weeks ago I posted a question on comp.lang.c++ about some > technicality > about binary file IO. Over the course of the discussion, I discovered > to my > amazement - and, quite frankly, horror - that there seems to be a > school of > thought that text-based storage formats are universally preferable to > binary text > formats for reasons of portability and human readability.
> The people who presented such ideas appeared not to appreciate two > details that > counter any benefits text-based numerical formats might offer:
Well, I can't speak for those people, but I would prefer text files for exactly the reasons you suggest, provided those are of overriding importance for the particular application. So if the application is concerned with data transfer, I would use XML for portability; if it requires a configuration file, I would use a text format to make it easy for users to read and edit.
However, if I wanted performance, I would use a binary format FOR THE FILES WHERE PERFORMANCE IS THE PRIMARY REQUIREMENT. I don't think that anyone is suggesting that a SQL database (for example) should be implemented using text files for its indexes and tables. It would make sense though for such a database to use text files for configuration etc.
You seem to have set up a straw man, and one that has very little to do with C++, I would add.
On 8 Nov, 20:22, Rune Allnor <all...@tele.ntnu.no> wrote:
> Hi all.
> A couple of weeks ago I posted a question on comp.lang.c++ about some > technicality > about binary file IO. Over the course of the discussion, I discovered > to my > amazement - and, quite frankly, horror - that there seems to be a > school of > thought that text-based storage formats are universally preferable to > binary text > formats for reasons of portability and human readability.
That's not a school of thought - It's a fact. They are preferred and for those reasons.
That doesn't mean that you never really need the performance, but it would have to be quite a large data set or quite a stringent performance requirement to make binary preferable.
> The people who presented such ideas appeared not to appreciate two > details that > counter any benefits text-based numerical formats might offer:
> 1) Binary files are about 70-20% of the file size of the text files, > depending > on the number of significant digits stored in the text files and > other > formatting text glyphs.
In 25 years of programming I have never come across a case (for files) where this has been a problem, and the rate at which storage capacities increase suggests to me that it never will be for any "normal" application.
> 2) Text-formatted numerical data take significantly longer to read and > write > than binary formats.
Again - Never in my experience. In network protocols YES because you can never have too much performance in low level general purpose protocols but in application files I have never had a problem.
A slight optimisation that you might be interested in is to use hex - This is still portable and readable but can be read and written without multiplications or divisions.
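A minimal sketch of the hex idea for 32-bit unsigned integers, with hypothetical function names to_hex and from_hex. Formatting and parsing use only shifts, masks and a table lookup, with no multiplication or division by 10.

```cpp
#include <cstdint>
#include <string>

// Format a 32-bit value as exactly 8 lowercase hex digits using shifts and masks.
std::string to_hex(std::uint32_t value) {
    static const char digits[] = "0123456789abcdef";
    std::string out(8, '0');
    for (int i = 7; i >= 0; --i) {
        out[i] = digits[value & 0xF];
        value >>= 4;
    }
    return out;
}

// Parse 1 to 8 hex digits; returns false (leaving `value` untouched) on bad input.
bool from_hex(const std::string& text, std::uint32_t& value) {
    if (text.empty() || text.size() > 8) return false;
    std::uint32_t result = 0;
    for (char c : text) {
        std::uint32_t nibble;
        if (c >= '0' && c <= '9') nibble = static_cast<std::uint32_t>(c - '0');
        else if (c >= 'a' && c <= 'f') nibble = static_cast<std::uint32_t>(c - 'a' + 10);
        else if (c >= 'A' && c <= 'F') nibble = static_cast<std::uint32_t>(c - 'A' + 10);
        else return false;
        result = (result << 4) | nibble;
    }
    value = result;
    return true;
}
```

The fixed width of 8 digits also means the values need no separator, at the cost of some readability for small numbers.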
> Timings are difficult to compare, since the exact numbers depend on > buffering > strategies, buffer sizes, disk speeds, network bandwidths and so on.
In other words, they are of minor significance; otherwise they would dwarf these things.
> I have therefore sketched a 'distilled' test (code below) to test what > overheads > are involved with formatting numerical data back and forth between > text and > binary formats. To eliminate the impact of peripherical devices, I > have used > a std::stringstream to store the data. The binary bufferes are
If you really worry about performance you will never use the C++ I/O library conversions - The fastest way to write an integer will almost certainly be itoa()/atoi() and (if you have it) read()/write()
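Since itoa() is not part of the C or C++ standard, the sketch below uses a hand-rolled integer formatter as a stand-in, paired with the standard std::atoi for parsing. The names append_int and parse_int are hypothetical, and no claim is made here about how much faster this is than the stream operators; that would have to be measured.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Append the decimal text of `value` to `out` without going through iostreams.
void append_int(std::string& out, int value) {
    char buf[24];  // room for sign and digits of up to a 64-bit int
    std::size_t pos = sizeof buf;
    // Work on the magnitude as unsigned so the most negative int does not overflow.
    unsigned int magnitude = value < 0 ? 0u - static_cast<unsigned int>(value)
                                       : static_cast<unsigned int>(value);
    do {
        buf[--pos] = static_cast<char>('0' + magnitude % 10);
        magnitude /= 10;
    } while (magnitude != 0);
    if (value < 0) buf[--pos] = '-';
    out.append(buf + pos, sizeof buf - pos);
}

// Parse leading decimal text; std::atoi does no error reporting, so input must be trusted.
int parse_int(const std::string& text) {
    return std::atoi(text.c_str());
}
```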
> represented > by vectors, and I have assumed that a memcpy from the file buffer to > the > destination memory location is all that is needed to import the binary > format > from the file buffer. (If there are significant run-time overheads > associated with > moving NATIVE binary formats to the destination, please let me > know.)
If you are really really really speed obsessed, the way to go is to map a binary file into memory rather than using ANY I/O library at all (mmap on POSIX systems; not sure about Windows).
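A minimal POSIX-only sketch of the memory-mapping approach, assuming the file holds doubles in the host's native format. The function name load_doubles_mmap is made up for illustration. For simplicity it copies the mapped bytes into a vector; a real application chasing speed would more likely work on the mapping directly and keep it alive.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map `path` read-only and copy out as many whole doubles as it contains.
// Returns an empty vector if the file cannot be opened, is empty, or cannot be mapped.
std::vector<double> load_doubles_mmap(const char* path) {
    std::vector<double> values;
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return values;

    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        const std::size_t bytes = static_cast<std::size_t>(st.st_size);
        void* mapping = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
        if (mapping != MAP_FAILED) {
            values.resize(bytes / sizeof(double));  // ignore any trailing partial value
            if (!values.empty())
                std::memcpy(values.data(), mapping, values.size() * sizeof(double));
            munmap(mapping, bytes);
        }
    }
    close(fd);
    return values;
}
```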
> The output on my computer is (do note the _different_ numbers of IO > cycles in the two cases!):
> Sun Nov 08 19:48:54 2009 : Binary IO cycles started > Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed > Sun Nov 08 19:49:00 2009 : Text-format IO cycles started > Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed
> A little bit of math produces *average*, *crude* numbers for IO > cycles:
> Text: 6 seconds / (1000 * 1e6) read/write cycles = 6e-9 s per r/w > cycle > Binary: 16 seconds / (100 * 1e6) read/write cycles = 160e-9 s per r/w > cycle
> which in turn means there is an overhead on the order of of > 160e-9/6e-9 = 26x > associated with the text formats.
> Add a little bit of other overheads, e.g. caused by the significantly > larger text file sizes in combination with suboptimal buffering > strategies, > and the relative numbers easily hit the triple digits. Not at all > insignificant when one works with large amounts of data under tight > deadlines.
> So please: Shoot this demo down! Give it your best, and prove me > and my numbers wrong.
They are not wrong. They are just irrelevant to 99% of all applications.
> And to the textbook authors who might be lurking: Please include a > chapter on relative binary and text-based IO speeds in your upcoming > editions. Binary file formats might not fit into your overall > philosophies about human readability and universal portability of C++ > code, but some of your readers might appreciate being made aware of > such practical details.
> Rune
The text book authors are writing for the 99% not the 1% so they are not going to change.
I enjoy your posts, Rune, but IMHO you really do get carried away with the wrong performance issues.
> A couple of weeks ago I posted a question on comp.lang.c++ about some technicality > about binary file IO. Over the course of the discussion, I discovered to my > amazement - and, quite frankly, horror - that there seems to be a school of > thought that text-based storage formats are universally preferable to binary text > formats for reasons of portability and human readability.
This is a classic example of the 'speed at any cost' school of thought. The trouble with binary formats is that they are not portable (sometimes to the extent of not being portable between releases of the same program -- I once met a young programmer who was going through the agony of having used binary data formats which could no longer be read correctly by the second release of his program, much to the dismay of his customers).
The space overhead is hardly important when even simple desktop machines can have over a terabyte of disk storage at less than it used to cost (in 1979) to buy a couple of boxes of single density 5.25" floppy disks.
Speed could be an issue in some cases, but measure first before choosing to optimise. I rarely use binary files other than for scratch files during a single execution of a program (where they are fine as long as the program does not crash and burn).
> A couple of weeks ago I posted a question on comp.lang.c++ about some technicality > about binary file IO. Over the course of the discussion, I discovered to my > amazement - and, quite frankly, horror - that there seems to be a school of > thought that text-based storage formats are universally preferable to binary text > formats for reasons of portability and human readability.
> The people who presented such ideas appeared not to appreciate two details that > counter any benefits text-based numerical formats might offer:
> 1) Binary files are about 70-20% of the file size of the text files, depending > on the number of significant digits stored in the text files and other > formatting text glyphs. > 2) Text-formatted numerical data take significantly longer to read and write > than binary formats.
Metrics only matter when they matter. Are the larger file size or increased processing time issues? If you are not constrained for space or time, and the text files add value (e.g., convenience, portability, etc.), then use them. Bigger and slower do not always equate to bad.
On 9 Nov, 15:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:
> The text book authors are writing for the 99% not the 1% so they are > not going to change.
I am working among the 1%. I have seen companies lose business because of poorly performing software that went undetected. That is, the people whose job it was to know, did not know about the major performance issues.
In one company, which did 24/7 survey jobs and stored the data in text format, merely reading 24 hrs worth of data from text-formatted files imposed some 3-5 hrs of idle time on the human operators. It wouldn't have been a big deal if those 3-5 hrs had come as one block (the operators in question could have had a long break if they had), but these 3-5 hrs were interspersed throughout the process, tying the operators down in front of their terminals.
From a human standpoint, there are several time scales. Most (all?) readers of this newsgroup are computer programmers, so they know what it means to be in 'The Zone' where time just flies and work gets done.
Now, if you can get a job done with operator idle time less than a second, the operator can stretch his neck, yawn, have a sip of coffee, and remain in 'The Zone' afterwards.
If the idle time is a couple of seconds, the waiting time starts to become noticeable and thus annoying. If the waiting time becomes ten seconds or more, an operator already in 'The Zone' is yanked out of 'The Zone'. If ten seconds operator idle time is commonplace in the application, the operator never reaches 'The Zone' in the first place.
Once we start talking about minutes of operator idle time, operators go away to have a cup of coffee, read the newspaper, surf the net, flirt with the 20-year-old blonde at the switchboard - whatever. Once that happens productivity numbers reach the point where companies go out of business.
> I enjoy your posts Rune but IMHO you realy do get carried away with > the wrong performance issues.
No. The performance issues I worry about are the ones that kick users out of business. No one cares if 15 seconds or 50 seconds is the most representative number for reading 100 MBytes of text-formatted numeric data, when the same amount of binary formatted data easily can be loaded in 0.3 seconds.
These details have a profound impact where I work. The only reason this is not recognized, is the omnipresent misconception that the slower time working with text-formatted numeric data is insignificant.
People who know their programming craft would know that one uses binary data formats for numeric data as default, and only deviates towards text-based formats where one can get away with them (file sizes less than about 5-10 MBytes).
>> I enjoy your posts Rune but IMHO you realy do get carried away with >> the wrong performance issues.
> No. The performance issues I worry about are the ones that > kick users out of business. No one cares if 15 seconds or 50 > seconds is the most representative number for reading 100 MBytes > of text-foematted numeric data, when the same amount of binary > formatted data easily can be loaded in 0.3 seconds.
You have a point, but if you asked me to solve the problem, I would probably try to keep data in the easiest way, i.e. still as text. Why? Because if it's supposed to be read by a human in the end, text is the native format. Just as much as a picture's native format is binary and not XML.
Then I would ask myself: how do we speed this up? As someone suggested, you could use mmap on *nix systems (if you think of it, most configuration files in *nix are actually text files, probably loaded with mmap).
Now, let's say mmap doesn't solve the problem; what do we do next? I would look into compressors. Then you can store compressed files on disk, still in their native text format. And suddenly you have made the disk access time disappear with minimal hassle. You don't have to come up with a strange binary format to work around disk latencies.
I used to be part of developing a real-time system where the disks just couldn't perform real-time transfer rates. Then we just made an adapter class called Compressor taking two pointers to src- and dst memory, chained the class with our File class, and solved the problem with minimal effort. When the disks got faster a couple of years later we just removed the compressor.
Well, there are also programmers out here working with HUGE amounts of data (say: satellites, meteorological models, simulations). Text files in these fields are just pure nonsense. We use binary formats, well documented and with a convenient API, to allow indexing, transformation to text, xml, code, I/O filters (say compress), missing values, units, etc. and A LOT of publicly available applications to view, plot, explore data.
If interested, have a look at, for example, the HDF4 or HDF5 format, the NetCDF format or the like.
If it is just a bunch of numbers (say up to some thousands), I would certainly go with a documented XML format.
Rune Allnor wrote: > On 9 Nov, 15:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:
>> The text book authors are writing for the 99% not the 1% so they are >> not going to change.
> I am working among the 1%. I have seen companies loose business > because of poorly performing software that went undetected. That > is, the people whose job it was to know, did not know about the > major performance issues.
So what? I, as well as other posters I guess, understand that your application may require performance that can't be satisfied by a textual format, and stated so. Did anyone say that you're not among the 1% or that you should switch from textual to binary? Do you have anything to refute from the other replies in this thread?
Frankly I don't understand your point. "But I..." or "But my..." isn't very meaningful when others didn't preclude your case.
On 10 Nov, 08:23, Graziano <graziano.giuli...@gmail.com> wrote:
> Well, out here also programmers working with HUGE amounts of data > (say: satellites, meteorological models, simulations). > Text files in these fields just pure nonsense. We use binary formats, > well documented and with convenient API, to allow indexing, > transformation to text, xml, code, I/O filters (say compress), missing > values, units, etc and A LOT of public available applications to view, > plot, explore data.
> If in case, have a look for example to HDF4 or HDF5 format, NetCDF > format or the like.
> If just a bunch of numbers (say up to some thousands) I will go for > sure with a documented XML.
And I would do exactly the same in your situation, because I've read about how big those files can be, but you are in the 1% (and I'd use memory mapping).
The important thing in this case is to provide the API for readers.
On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:
> Frankly I don't understand your point.
My point is that
1) The speed penalty imposed by using text formats for numerical data is totally absent in the USENET discussions and printed literature on C++.
2) The speed penalty imposed by using text formats for numerical data is one of the main bottlenecks in the data processing chains I have seen where I work.
3) The speed penalty imposed by using text formats for numerical data is totally unknown among the people whose job it is to set up said data processing chains.
4) The speed penalty imposed by using text formats for numerical data can easily be on the order of 100x or 200x relative to using binary data, depending on implementations of the software that accesses the file - not all applications are written in C++; not all C++ applications are efficient.
5) The speed penalty imposed by using text formats for numerical data is a *design* *choice*, on a par with choosing between O(NlgN) quick sort algorithms and O(N^2) bubble sort algorithms. As far as I am concerned, people are free to use text formats if
a) Portability is an *actual* issue - not always the case.
b) Speed *is* irrelevant - not always the case.
Once one or both of these conditions no longer hold, text-based formats are out of the picture. As for the "human readability" question, that's irrelevant unless the contents of the file are meant to be inspected by humans.
6) The speed penalty imposed by using text formats for numerical data should be mentioned in textbooks on C++, so that unsuspecting users have a fair chance of making informed choices on the matter, instead of the present situation, where "politically correct" textbook authors not only make the choices for them, but also avoid mentioning the alternatives.
> On 9 Nov, 15:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:
> > The text book authors are writing for the 99% not the 1% so they are > > not going to change.
> I am working among the 1%. I have seen companies loose business > because of poorly performing software that went undetected. That > is, the people whose job it was to know, did not know about the > major performance issues.
> In one company, which did 24/7 survey jobs and stored the data > on text format, merely reading 24 hrs worth of data from text- > formatted files imposed some 3-5 hrs idle time on behalf of human > operators. It wouldn't have been a big deal if those 3-5 hrs were > organized as one bulk (the operators in question could have had > a long break if it was), but these 3-5 hrs were intersped throughout > the process, tying the operators down in front of their terminals.
But how much of that 3-5 hours was reading and how much was processing the data? I can't imagine a process in which the reading was the bulk of that time.
> From a human standpoint, there are several time scales. Most > (all?) readers of this newsgroup are computer programmers, so > they know what it means to be in 'The Zone' where time just > flys and work is being made.
> Now, if you can get a job done with operator idle time less > than a second, the operator can stretch his neck, yawn, have > zip of coffee, and remain in 'The Zone' afterwards.
> If the idle time is a couple of seconds, the waiting time > start to become noticeable and thus annoying. If the waiting > time becomes ten seconds or more, an operator already in 'The > Zone' is yanked out of 'The Zone'. If ten seconds operator > idle time is commonplace in the application, the operator never > reaches 'The Zone' in the first place.
> Once we start talking about minutes of operator idle time, > operators go away to have a cup of coffee, read the newspaper, > surf the net, flirt with the 20-year-old blonde at the > swicthboard - whatever. Once that happens productivity numbers > reach the point where companies go out of business.
I agree with everything you say here. It's just that my experience is that you can't reduce minutes to seconds unless the fundamental design is bad.
> > I enjoy your posts Rune but IMHO you realy do get carried away with > > the wrong performance issues.
> No. The performance issues I worry about are the ones that > kick users out of business. No one cares if 15 seconds or 50 > seconds is the most representative number for reading 100 MBytes > of text-foematted numeric data, when the same amount of binary > formatted data easily can be loaded in 0.3 seconds.
You're mixing reading and loading. Loading = reading and processing. Processing is the biggest user of time in almost all systems; otherwise they aren't doing anything useful. If I take your figures at face value, you can't be doing anything with the data.
> These details have a profound impact where I work. The only > reason this is not recognized, is the omnipresent misconception > that the slower time working with text-formatted numeric data > is insignificant.
But you are undermining your own argument. The people writing on this thread would hardly be saying that it was insignificant if it had cost them their jobs. Therefore it hasn't cost them their jobs, and therefore it IS insignificant in all the projects that they've worked on. In other words, it IS insignificant for MOST people MOST of the time, just as I said.
> People who know their programing craft would know that one uses > binary data formats for numeric data as default, and only deviates > towards text-based formats where one can get away with them > (file sizes less than about 5-10 MBytes).
You have it backwards.
One uses text formats by default and only deviates towards binary formats when one knows that there is a problem and has demonstrated that binary formats will solve it.
Either that or you are right and Meyers, Stroustrup, all the C++ book writers and all the designers of the C and C++ I/O libraries are wrong.
P.S. As I think I already mentioned - If you really want the ultimate in speed then use memory mapping.
On 10 Nov, 09:23, Graziano <graziano.giuli...@gmail.com> wrote:
> Well, out here also programmers working with HUGE amounts of data > (say: satellites, meteorological models, simulations). > Text files in these fields just pure nonsense. We use binary formats, > well documented and with convenient API, to allow indexing, > transformation to text, xml, code, I/O filters (say compress), missing > values, units, etc and A LOT of public available applications to view, > plot, explore data.
> If in case, have a look for example to HDF4 or HDF5 format, NetCDF > format or the like.
I know what binary file format to use with the data in question.
My problem has been a bit more fundamental than that. When I ask decision-makers on what grounds text-based file formats were chosen, people either respond with "text files are so convenient" or a blank stare.
In other words, strategic decisions that directly affect the ability to meet deadlines are taken as a matter of course, without evaluating the operational impact on the process - or even without the awareness that an alternative existed at all.
Which is why I would like the trade-offs involved to at least be mentioned in upcoming textbooks on programming in general and C++ in particular.
On Nov 10, 7:59 am, Rune Allnor <all...@tele.ntnu.no> wrote:
> On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:
> > Frankly I don't understand your point.
> My point is that
I think the whole text vs. binary debate for large data files is a little facetious. Use what works. I work for a company that acquires spectral data and stores it in native binary format (for speed). We often acquire 100's of MB or very large fractions of GB's worth of data. If you can honestly tell me that a human is going to sift through a text file looking for inaccuracies in the data, then that person is deluding themselves and has way too much time on their hands.
We provide a DLL interface to read our data file format. One routine extracts the data in native binary format for speed; the other extracts the data and returns it in text format. This was done to promote language interoperability. In tests done long ago, when 300 MHz PCs ruled the earth, binary extraction was at least an order of magnitude faster. IIRC it was about 70 times faster.
Now that our instruments are being used in the medical community, we have additional tamper detection requirements. Storing data in text makes it very tempting (and easy) for a human to fire up a text editor and manipulate the data. It can still be done using a binary editor, but it's much harder to do.
So I would add that, for security reasons as well as speed, text file formats do not work for us.
Rune Allnor wrote: > On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:
>> Frankly I don't understand your point.
> My point is that
> 1) The speed penalty imposed by using text formats for numerical > data is totally absent in the USENET discussions and printed > literature on C++.
Because it is irrelevant to those involved. And any halfway competent programmer would understand that there is an overhead for using a text file.
> 2) The speed penalty imposed by using text formats for numerical > data is one of the main bottlenecks in the data processing > chains I have seen where I work.
Which means that a properly qualified programmer would recognise that this is one of the minority cases where a binary format would be useful. BTW in another post I mentioned that I do use binary formats for scratch files (exactly because there is no advantage in using a text format).
In addition it should be noted that a programmer who does not recognise when to use a binary format file probably does not understand the dangers in using such.
> 3) The speed penalty imposed by using text formats for numerical > data is totally unknown among the people whose job it is to > set up said data processing chains.
OK, so you are using insufficiently qualified people. Whose fault is that?
> 4) The speed penalty imposed by using text formats for numerical > data can easily be on the order of 100x or 200x relative to > using binary data, depending on implementations of the software > that accesses the file - not all applications are written in > C++; not all C++ applications are efficient.
I frankly do not believe that. Those kinds of performance hits are almost invariably the consequence of using the wrong algorithms.
> 5) The speed penalty imposed by using text formats for numerical > data is a *design* *choise*, on a par with using O(NlgN) quick > sort algorithms instead of O(N^2) bubble sorts algorithms. > What I am concerned, people are free to use text formats if
> a) Portability is an *actual* issue - not always the case. > b) Speed *is* irrelevant - not always the case.
The point is that for the overwhelming majority using text formats is win-win. I would expect to see information about when to use binary formats in specialist books on areas where it matters (authors of general texts have to trim the content to meet criteria provided by publishers, and something that is important to a very small minority would almost invariably be cut).
> Once one or both these factors is no longer relevant, text- > based formats are out of the picture. As for the "human > readability" question, that's irrelevant unless the contents > of the file is meant to be inspected by humans.
True but readability also extends to tools except that then we often call it portability.
> 6) The speed penalty imposed by using text formats for numerical > data should be mentioned in textbooks on C++, so that > unsuspecting users have a fair chance of making informed > choises on the matter, instead of the present situation, > where "politically correct" textbook authors not only make > the choises for them, but also avoid to mention the > alternatives.
See above. Unfortunately far too many programmers consider reading to be an arcane art and never read books. Worse they think they know all the answers and never listen to advice from others.
A programmer who cannot recognise that text formats affect performance is not fit for anything other than grunt-work.
Actually I would suggest that such things as choice of file formats are design issues and if an employer chooses not to employ a competent designer with knowledge of the area he gets all he deserves.
> On 9 Nov, 20:03, Rune Allnor <all...@tele.ntnu.no> wrote:
> > On 9 Nov, 15:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:
> > > The text book authors are writing for the 99% not the 1% so they are > > > not going to change.
> > I am working among the 1%. I have seen companies loose business > > because of poorly performing software that went undetected. That > > is, the people whose job it was to know, did not know about the > > major performance issues.
> > In one company, which did 24/7 survey jobs and stored the data > > on text format, merely reading 24 hrs worth of data from text- > > formatted files imposed some 3-5 hrs idle time on behalf of human > > operators. It wouldn't have been a big deal if those 3-5 hrs were > > organized as one bulk (the operators in question could have had > > a long break if it was), but these 3-5 hrs were intersped throughout > > the process, tying the operators down in front of their terminals.
> But how much of that 3-5 hours was reading and how much was processing > the data? > I can't imagine a process in which the reading was the bulk of that > time.
I don't remember the details, this was several years ago, but the gross numbers were more or less that the process in question produced some 200 files on the order of 100 MBytes / file during 24 hours of survey. Each file needed to pass through two or three read / write cycles during processing. That's about 600 read/writes, each taking some 30 seconds of operator idle time. That's 5 hours wasted, right there. With a 24-hr deadline, that hurts.
> > From a human standpoint, there are several time scales. Most > > (all?) readers of this newsgroup are computer programmers, so > > they know what it means to be in 'The Zone' where time just > > flys and work is being made.
> > Now, if you can get a job done with operator idle time less > > than a second, the operator can stretch his neck, yawn, have > > zip of coffee, and remain in 'The Zone' afterwards.
> > If the idle time is a couple of seconds, the waiting time > > start to become noticeable and thus annoying. If the waiting > > time becomes ten seconds or more, an operator already in 'The > > Zone' is yanked out of 'The Zone'. If ten seconds operator > > idle time is commonplace in the application, the operator never > > reaches 'The Zone' in the first place.
> > Once we start talking about minutes of operator idle time, > > operators go away to have a cup of coffee, read the newspaper, > > surf the net, flirt with the 20-year-old blonde at the > > swicthboard - whatever. Once that happens productivity numbers > > reach the point where companies go out of business.
> I agree with everything you say here. > It's just that my experience is that you can't reduce minutes to > seconds unless the fundamental design is bad.
The fundamental bad design is to use text-formatted files. I posted a demo I made in matlab, which is an increasingly popular language for these kinds of things, in a reply to Glassborow. Look at the numbers there - binary files are 100-200x faster than text formats.
> > > I enjoy your posts Rune but IMHO you realy do get carried away with > > > the wrong performance issues.
> > No. The performance issues I worry about are the ones that > > kick users out of business. No one cares if 15 seconds or 50 > > seconds is the most representative number for reading 100 MBytes > > of text-foematted numeric data, when the same amount of binary > > formatted data easily can be loaded in 0.3 seconds.
> You're mixing reading and loading. Loading = reading and processing. > Processing is the biggest user of time in almost all systems otherwise > they aren't doing anything useful. > If I take your figures at face value you can't be doing anything with > the data.
I only have so much time to get the job done. We can argue over semantics till the cows come home - the deadlines stand whether I am 'reading' or 'loading' the data.
> > These details have a profound impact where I work. The only > > reason this is not recognized, is the omnipresent misconception > > that the slower time working with text-formatted numeric data > > is insignificant.
> But you are undermining your own argument. > The people writing on this thread would hardly be saying that it was > insignificant if it cost them their jobs therefore it hasn't cost them > their jobs therefore it IS insignificant in all the projects that > they've worked on. In other words its IS insignificant for MOST people > MOST of the time just as I said.
Well, the *intention* of what people write might be benign, but that's not how it comes across. Some of the reactions I have received in similar threads here and on comp.lang.c++:
[RA] > As long as you keep two factors in mind: > 1) The user's time is not yours (the programmer) to waste. > 2) The users's storage facilities (disk space, network > bandwidth etc) are not yours (the programmer) to waste.
[JK] The user pays for your time. Spending it to do something which results in a less reliable program, and that he doesn't need, is irresponsible, and borders on fraud.
From such excerpts I can only conclude that most people are oblivious to the problem and its implications.
> > People who know their programing craft would know that one uses > > binary data formats for numeric data as default, and only deviates > > towards text-based formats where one can get away with them > > (file sizes less than about 5-10 MBytes).
> You have it backwards.
> One uses text formats by default and only deviate towards binary > format when you know that you have a problem and have demonstrated > that binary formats will solve it.
> Either that or you are right and Meyers, Stroustrop, all the C++ book > writers and all the designers of the C and C++ I/O libraries are > wrong.
No. The authors you list aren't wrong. In order to be wrong, one must make a statement that can be proved or demonstrated to be false.
The fact is that except for Stroustrup mentioning ios_base::binary more or less in passing in his chapter on file streams, none of the C++ textbooks I have seen mention the issue at all.
<francis.glassbo...@btinternet.com> wrote: > > 3) The speed penalty imposed by using text formats for numerical > > data is totally unknown among the people whose job it is to > > set up said data processing chains.
> OK, so you are using insufficiently qualified people. Whose fault is that?
*I* am not using anyone. I happen to work among people who ought to know these things, but don't.
> > 4) The speed penalty imposed by using text formats for numerical > > data can easily be on the order of 100x or 200x relative to > > using binary data, depending on implementations of the software > > that accesses the file - not all applications are written in > > C++; not all C++ applications are efficient.
> I frankly do not believe that. Those kind of performance hits are almost > invariably the consequence of using the wrong algorithms.
Below is a test I wrote in matlab, which is an increasingly popular language for these kinds of things. The script first generates ten million random numbers, and writes them to file on both ASCII and binary double precision floating point formats. The files are then read straight back in, hopefully mitigating effects of file caches etc:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;
d1=randn(N,1);
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])

t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])

t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])

t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------
Binary writes are 24.0/0.11 = roughly 220x faster than text writes. Binary reads are 42.2/0.33 = roughly 130x faster than text reads.
These numbers are representative for where I work.
> > 5) The speed penalty imposed by using text formats for numerical > > data is a *design* *choise*, on a par with using O(NlgN) quick > > sort algorithms instead of O(N^2) bubble sorts algorithms. > > What I am concerned, people are free to use text formats if
> > a) Portability is an *actual* issue - not always the case. > > b) Speed *is* irrelevant - not always the case.
> The point is that for the overwhelming majority using text formats is > win-win. I would expect to see information about when to use binary > formats in specialist books on areas where it matters (authors of > general texts have to trim the content to meet criteria provided by > publishers and something that is important to a very small minority > would almost invariable be cut.
My point is that the people who only know a little programming also would benefit from at least having seen these things mentioned.
I don't mind you and other textbook authors arguing fiercely for one approach and against the other, *provided* both approaches are at least mentioned, and preferably described in terms of pros & cons.
> > 6) The speed penalty imposed by using text formats for numerical > > data should be mentioned in textbooks on C++, so that > > unsuspecting users have a fair chance of making informed > > choises on the matter, instead of the present situation, > > where "politically correct" textbook authors not only make > > the choises for them, but also avoid to mention the > > alternatives.
> See above. Unfortunately far too many programmers consider reading to be > an arcane art and never read books. Worse they think they know all the > answers and never listen to advice from others.
> A programmer who cannot recognise that text formats affect performance > is not fit for anything other than grunt-work.
> Actually I would suggest that such things as choice of file formats are > design issues and if an employer chooses not to employ a competent > designer with knowledge of the area he gets all he deserves.
I might agree with you in both cases, if the programmers and designers had access to textbooks where these questions are discussed.
Rune Allnor wrote: > On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:
>> Frankly I don't understand your point.
> My point is that
> 1) The speed penalty imposed by using text formats for numerical > data [...]
> 2) The speed penalty imposed by using text formats for numerical > data [...]
> 3) The speed penalty imposed by using text formats for numerical > data [...]
> 4) The speed penalty imposed by using text formats for numerical > data [...]
> 5) The speed penalty imposed by using text formats for numerical > data [...]
> 6) The speed penalty imposed by using text formats for numerical > data [...]
As you have stated, the points you have made so far relate only to applications dealing heavily with numerical data. Then why are you accusing general C++ language book authors and Usenet participants? Take a numerical processing book, or go to a numerical processing group, and if it's said there that textual formats should be preferred, then that's where you should make accusations and claims. (And by the way, choosing data formats is only barely on-topic for a "C++ language" textbook or forum.)
I'm not saying that you should not discuss such matters elsewhere, such as here. It's just that you keep emphasizing your particular application area and its special needs, while others are stating more common cases and more general choices, and NOT refuting your points in your area in particular. You and others are going parallel to each other and not making any more progress. Your repeated claims here would make sense only if you were arguing that binary formats should be preferred *by default* in most, if not all, areas -- that's where you could refute others' claims -- but you have clearly stated above that your points are meant only for numerical applications. This is why I said I didn't understand your point: are you just explaining your situation, or trying to change others' opinions?
> The fact is that except for Stroustrup mentioning ios_base::binary > more or less in passing in his chapter on file streams, none of the > C++ textbooks I have seen mention the issue at all.
ios_base::binary does something completely different.
> Rune Allnor wrote: > > On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:
> >> Frankly I don't understand your point.
> > My point is that
> > 1) The speed penalty imposed by using text formats for numerical > > data [...]
> > 2) The speed penalty imposed by using text formats for numerical > > data [...]
> > 3) The speed penalty imposed by using text formats for numerical > > data [...]
> > 4) The speed penalty imposed by using text formats for numerical > > data [...]
> > 5) The speed penalty imposed by using text formats for numerical > > data [...]
> > 6) The speed penalty imposed by using text formats for numerical > > data [...]
> As you have stated, the points you have made so far relates only to > applications dealing heavily with numerical data. Then why are you > accusing general C++ language book authors and Usenet participants?
Because C++ is considered by most (although maybe not by the regulars here) as a language for the applications where high efficiency is paramount. Even so, how and when to handle binary files through C++ is not treated in any of the learning material I have seen.
> Take a numerical processing book, or go to a numerical processing > group, and if it's said there that textual formats should be preferred, > then that's where you should make accusations and claims. (And by the > way, choosing data formats is only barely on-topic for a "C++ language" > textbook or forum.)
Yesterday I summarized in another post a number of responses I have received on these questions over the past couple of weeks:
The general impression is that the C++ community as it comes across in this and other USENET groups, as well as the textbooks, is totally foreign to using binary file formats at all.
> I'm not saying that you should not discuss such matters elsewhere, > such as here. It's just that you keep emphasizing your particular > application area and its special needs, while others are stating > more common cases and more general choices, and NOT refuting your > points in your area in particular.
Correct. But there is a general trend towards dismissing the stated problem as irrelevant, or my solution as a misconception. Again, review the opinions expressed in the post I refer to above.
> You and others are going parallel > to each other and not making any more progress. Your repeated claims > here would make sense only if you were arguing that binary formats > should be preferred *by default* in most, if not all, areas -- that's > where you could refute others' claims -- but you have clearly stated > above that your points are meant only for numerical applications.
That's where it is easy to find the problem, because that's the application that is easy to find. These things don't show up unless the file sizes are 5-10 MBytes or more; only then do the delays associated with data loading become noticeable to human users.
Once e.g. XML files, which need to be parsed (and parsing takes at least as much time as merely converting the numbers), start reaching those kinds of sizes, text-based formats will become annoying in other applications as well.
> This is why I said I didn't understand your point: are you just > explaining your situation, or trying to change others' opinions?
I am trying to make influential people here - both regular posters on c.l.c++.m and textbook authors who might be lurking - aware of the problem. Only when the teachers start addressing a problem will it be reasonable to expect students to know.
I have posted numbers to demonstrate what I am talking about on a number of occasions, e.g.
The common first reaction is that "This is not C++, so this is irrelevant!" Then all the arguments we have seen in this and recent threads appear; that
- There are faster ways than operator>> and operator<<
- The algorithm must be wrong
- My measurements are not 'exact'
and so on.
The fact is that very few people even suspected these numbers to differ by orders of magnitude. The numbers referred to above, while not C++, are representative of the delays and bottlenecks where I work. We can quarrel about reducing the absolute numbers by a factor of 3 or maybe 5 by using efficient parsers, but that requires rewriting the file parsers in every single program already in use out there, as well as educating the programmers etc.
The net effect is far larger if one spends the same effort on educating the same programmers and designers about binary files.
But in order to do that, one needs to educate the educators.
> A couple of weeks ago I posted a question on comp.lang.c++ about some technicality about binary file IO. Over the course of the discussion, I discovered to my amazement - and, quite frankly, horror - that there seems to be a school of thought that text-based storage formats are universally preferable to binary text formats for reasons of portability and human readability.
> The people who presented such ideas appeared not to appreciate two details that counter any benefits text-based numerical formats might offer:
> 1) Binary files are about 70-20% of the file size of the text files, depending on the number of significant digits stored in the text files and other formatting text glyphs. > 2) Text-formatted numerical data take significantly longer to read and write than binary formats.
> Timings are difficult to compare, since the exact numbers depend on buffering strategies, buffer sizes, disk speeds, network bandwidths and so on.
> I have therefore sketched a 'distilled' test (code below) to test what overheads are involved with formatting numerical data back and forth between (...)
To everything that has been said in other replies I would like to add a small test sample of mine. It's as basic as I thought I could get and just as inaccurate and insufficient as every other single test. (Code follows at the end)
-- Run 1 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 279 ms
ASCII write = 2048 ms (Factor 7.3)
Binary read = 211 ms
ASCII read = 1283 ms (Factor 6.1)

-- Run 2 -- (1e7)
build type: RELEASE
generate data (10000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 11329 ms
ASCII write = 20014 ms (Factor 1.8)
Binary read = 2252 ms
ASCII read = 12922 ms (Factor 5.7)

-- Run 3 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 313 ms
ASCII write = 1911 ms (Factor 6.1)
Binary read = 212 ms
ASCII read = 1277 ms (Factor 6.0)
So what gives? Binary is unsurprisingly faster, apparently by somewhere between a factor of 5 and 10 on my box here. (And if something else is going on, as in Run 2, it may not even be that much faster.)
Bottom line for me:
1.) Binary *is* definitely faster.
2.) The difference is a small factor (small n).
3.) *If* you need that speed, use binary.
br, Martin
### CODE ###
int main()
{
    using namespace std;
    srand( (unsigned)time( NULL ) );
Rune Allnor wrote: > On 10 Nov, 22:30, Francis Glassborow > <francis.glassbo...@btinternet.com> wrote: >>> 4) The speed penalty imposed by using text formats for numerical >>> data can easily be on the order of 100x or 200x relative to >>> using binary data, depending on implementations of the software >>> that accesses the file - not all applications are written in >>> C++; not all C++ applications are efficient. >> I frankly do not believe that. Those kind of performance hits are almost >> invariably the consequence of using the wrong algorithms.
> Below is a test I wrote in matlab, which is an increasingly > popular language for these kinds of things. [...]
You can discuss it in a MATLAB forum, then. MATLAB measurements in a C++ forum don't mean much, because people here don't know (and may not be interested in) what's going on inside MATLAB. Gratuitous inefficiencies inside MATLAB, if any, cannot be used to justify any argument in C++.
> Output: > ------------------------------------ > Wrote ASCII data in 24.0469 seconds > Read ASCII data in 42.2031 seconds > Wrote binary data in 0.10938 seconds > Read binary data in 0.32813 seconds > ------------------------------------
> Binary writes are 24.0/0.1 = 240x faster than text write. > Binary reads are 42.2/0.32 = 130x faster than text read.
> These numbers are representative for where I work.
Again, if MATLAB is representative for where you work, please visit a MATLAB forum. Otherwise, there was a C++ test program by James Kanze in the comp.lang.c++ thread you mentioned, and the numbers given by that program are much more persuasive and convincing, at least here in a C++ newsgroup. Or you can suggest a better C++ program, of course.
> My point is that the people who only know a little programming > also would benefit from at least having seen these things > mentioned.
> I don't mind you and other textbook authors arguing fiercly > for one approach and against the other, *provided* both > approaches are at least mentioned, and preferably described > in terms of pros & cons.
It may not be a job for language textbooks, as I mentioned earlier, though it definitely is for numerical programming textbooks, or more general programming books dealing with choosing data formats.
It is very natural and acceptable that language textbooks focus on the language features and that for the sake of simplicity they default to a text data format that's easier to understand and debug. They don't want the readers to struggle on other issues when they don't understand the language features very well yet.
You are welcome to write a book of your own, of course.