Run-time overhead of text-based storage formats for numerical data
Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Sun, 8 Nov 2009 14:22:11 CST
Local: Sun 8 Nov 2009 20:22
Subject: Run-time overhead of text-based storage formats for numerical data
Hi all.

A couple of weeks ago I posted a question on comp.lang.c++ about some
technicality about binary file IO. Over the course of the discussion,
I discovered to my amazement - and, quite frankly, horror - that there
seems to be a school of thought that text-based storage formats are
universally preferable to binary formats for reasons of portability
and human readability.

The people who presented such ideas appeared not to appreciate two
details that counter any benefits text-based numerical formats might
offer:

1) Binary files are about 20-70% of the file size of the text files,
   depending on the number of significant digits stored in the text
   files and other formatting text glyphs.
2) Text-formatted numerical data take significantly longer to read and
   write than binary formats.

Timings are difficult to compare, since the exact numbers depend on
buffering strategies, buffer sizes, disk speeds, network bandwidths
and so on.

I have therefore sketched a 'distilled' test (code below) to test what
overheads are involved with formatting numerical data back and forth
between text and binary formats. To eliminate the impact of peripheral
devices, I have used a std::stringstream to store the data. The binary
buffers are represented by vectors, and I have assumed that a memcpy
from the file buffer to the destination memory location is all that is
needed to import the binary format from the file buffer. (If there are
significant run-time overheads associated with moving NATIVE binary
formats to the destination, please let me know.)
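
As a minimal sketch of the binary import assumed above: the function
name ImportNativeDoubles is hypothetical and is not part of the test
program below, and the bytes are assumed to come from a machine with
the same double representation and byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: interpret a raw file buffer as native doubles.
// No byte-order or format conversion is attempted; that is the assumption.
std::vector<double> ImportNativeDoubles(const std::vector<unsigned char>& FileBuffer)
{
        // The buffer must hold a whole number of doubles.
        assert(FileBuffer.size() % sizeof(double) == 0);

        std::vector<double> Destination(FileBuffer.size() / sizeof(double));
        if (!Destination.empty())
                std::memcpy(Destination.data(), FileBuffer.data(), FileBuffer.size());
        return Destination;
}
```

Whether a plain copy really is all that is needed depends on where the
bytes were written, which is the open question raised in parentheses
above.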

The output on my computer is (do note the _different_ numbers of IO
cycles in the two cases!):

Sun Nov 08 19:48:54 2009 : Binary IO cycles started
Sun Nov 08 19:49:00 2009 : 1000 Binary IO cycles completed
Sun Nov 08 19:49:00 2009 : Text-format IO cycles started
Sun Nov 08 19:49:16 2009 : 100 Text-format IO cycles completed

A little bit of math produces *average*, *crude* numbers for IO
cycles:

Binary:  6 seconds / (1000 * 1e6) read/write cycles =   6e-9 s per r/w cycle
Text:   16 seconds / (100  * 1e6) read/write cycles = 160e-9 s per r/w cycle

which in turn means there is an overhead on the order of
160e-9/6e-9 = 27x associated with the text formats.

Add a little bit of other overheads, e.g. caused by the significantly
larger text file sizes in combination with suboptimal buffering
strategies,
and the relative numbers easily hit the triple digits. Not at all
insignificant when one works with large amounts of data under tight
deadlines.

So please: Shoot this demo down! Give it your best, and prove me
and my numbers wrong.

And to the textbook authors who might be lurking: Please include a
chapter on relative binary and text-based IO speeds in your upcoming
editions. Binary file formats might not fit into your overall
philosophies about human readability and universal portability of C++
code, but some of your readers might appreciate being made aware of
such practical details.

Rune

/***************************************************************************/
#include <iostream>
#include <sstream>
#include <string>
#include <time.h>
#include <vector>

int main()
{
        const size_t  NumElements = 1000000;
        std::vector<double> SourceBuffer;
        std::vector<double> DestinationBuffer;

        for (size_t n=0;n<NumElements;++n)
        {
                SourceBuffer.push_back(n);
                DestinationBuffer.push_back(0);
        }

        time_t rawtime;
        struct tm * timeinfo;

        time( &rawtime );
        timeinfo = localtime( & rawtime );
        std::string message( asctime (timeinfo) );
        message.erase(message.size()-1);

        std::cout  << message.c_str() << " : Binary IO cycles started"
                    << std::endl;

        size_t NumBinaryIOCycles = 1000;
        for (size_t n = 0; n < NumBinaryIOCycles; ++n)
        {
                for (size_t m = 0; m<NumElements; ++m )
                {
                        DestinationBuffer[m] = SourceBuffer[m];
                }
        }

        time( &rawtime );
        timeinfo = localtime( & rawtime );
        message=std::string( asctime (timeinfo) );
        message.erase(message.size()-1);

        std::cout << message.c_str() << " : " << NumBinaryIOCycles
                << " Binary IO cycles completed " << std:: endl;

        std::stringstream ss;
        const size_t NumTextFormatIOCycles = 100;

        time( &rawtime );
        timeinfo = localtime( & rawtime );
        message=std::string( asctime (timeinfo) );
        message.erase(message.size()-1);

        std::cout  << message.c_str() << " : Text-format IO cycles started"
                   << std::endl;

        for (size_t n = 0; n < NumTextFormatIOCycles; ++n)
        {
                size_t m;
                for (m = 0; m < NumElements; ++m)
                        ss << SourceBuffer[m];

                m = 0;
                while(!ss.eof())
                {
                        ss >> DestinationBuffer[m];
                        ++m;
                }
        }

        time( &rawtime );
        timeinfo = localtime( & rawtime );
        message=std::string( asctime (timeinfo) );
        message.erase(message.size()-1);

        std::cout << message.c_str() << " : " << NumTextFormatIOCycles
                << " Text-format IO cycles completed " << std:: endl;

        return 0;

}

--
      [ See http://www.gotw.ca/resources/clcm.htm for info about ]
      [ comp.lang.c++.moderated.    First time posters: Do this! ]

Newsgroups: comp.lang.c++.moderated
From: Seungbeom Kim <musip...@bawi.org>
Date: Mon, 9 Nov 2009 08:50:56 CST
Local: Mon 9 Nov 2009 14:50
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:

> A couple of weeks ago I posted a question on comp.lang.c++ about some technicality
> about binary file IO. Over the course of the discussion, I discovered to my
> amazement - and, quite frankly, horror - that there seems to be a school of
> thought that text-based storage formats are universally preferable to binary
> formats for reasons of portability and human readability.

I don't see textual formats "universally preferred". Who said that?

> The people who presented such ideas appeared not to appreciate two details that
> counter any benefits text-based numerical formats might offer:

> 1) Binary files are about 20-70% of the file size of the text files, depending
>    on the number of significant digits stored in the text files and other
>    formatting text glyphs.
> 2) Text-formatted numerical data take significantly longer to read and write
>    than binary formats.

Actual numbers may vary, but it is an established fact that text formats
take more space and more processing time, and no one objected to that.
So, if your application cannot afford that overhead, you don't have a
choice, and you go binary. However, other applications may afford that
overhead and instead enjoy the benefits that textual formats offer:

- human readability
- transparency
- portability (I'm not talking about preserving the exact precision,
   but about being free of issues such as encoding, endianness, etc.)
- flexibility (Upgrading from 32-bit int to 64-bit int is a breeze.)
- manipulability (You can use text-based utilities such as awk or perl,
   and even text editors to modify some parts.)

... especially when you consider that in many (not all) situations,
storage is less of a problem nowadays than it used to be before (and
maybe processing time too), and that the difference in processing times
of text and binary is only a fraction of the total processing time.

// I'm afraid I'm just repeating what has been discussed over there. :(

If you're interested enough, see the section "The Importance of Being
Textual" from "The Art of Unix Programming" by Eric Steven Raymond,
at <http://www.catb.org/~esr/writings/taoup/html/ch05s01.html>.

YMMV, of course. No one tells you you /should/ use a textual format,
and you shouldn't tell others they /should/ use a binary format, either.
The decision is, as always, a trade-off between different values.
No one knows your objectives and constraints better than you do, and
while others can present the pros and cons of the options, it's your
job to understand them and make the decision. (Just note that worrying
about performance is justified only after an actual measurement.)

--
Seungbeom Kim

Newsgroups: comp.lang.c++.moderated
From: DeMarcus <use_my_alias_h...@hotmail.com>
Date: Mon, 9 Nov 2009 08:56:27 CST
Local: Mon 9 Nov 2009 14:56
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:
> Hi all.

> A couple of weeks ago I posted a question on comp.lang.c++ about some
> technicality about binary file IO. Over the course of the discussion,
> I discovered to my amazement - and, quite frankly, horror - that there
> seems to be a school of thought that text-based storage formats are
> universally preferable to binary formats for reasons of portability
> and human readability.

Please don't see it as a horror. You're right that binary files are
faster but text files are nice for debugging and backward compatibility.

In one piece of software we used binary files to store configurations.
Then suddenly we wanted to add an item to the configuration, which made
the old configuration files incompatible with the new software version.
To support the old configuration files we had to write a converter, and
soon we realized that we couldn't write a version converter each time we
wanted to add an item. That's where XML came in handy.

Cheers,
Daniel

Newsgroups: comp.lang.c++.moderated
From: Ulrich Eckhardt <eckha...@satorlaser.com>
Date: Mon, 9 Nov 2009 08:51:35 CST
Local: Mon 9 Nov 2009 14:51
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:
> A couple of weeks ago I posted a question on comp.lang.c++ about some
> technicality about binary file IO.

All files are binary. ;)

> Over the course of the discussion, I discovered  to my amazement - and,
> quite frankly, horror - that there seems to be a school of thought that
> text-based storage formats are universally preferable to binary
> formats for reasons of portability and human readability.

This is the same school as the one that suggests not doing any early
optimisations.

> The people who presented such ideas appeared not to appreciate two
> details that counter any benefits text-based numerical formats might
> offer:

> 1) Binary files are about 20-70% of the file size of the text files,
> depending on the number of significant digits stored in the text files
> and other formatting text glyphs.

Compression?

> 2) Text-formatted numerical data take significantly longer to read and
> write than binary formats.

Do they? I don't really believe you. The point is that IO takes lots of
time, so much that it dwarfs any simple parsing operation:

> Timings are difficult to compare, since the exact numbers depend on
> buffering strategies, buffer sizes, disk speeds, network bandwidths
> and so on.

...as you state yourself.

> I have therefore sketched a 'distilled' test (code below) to test what
> overheads are involved with formatting numerical data back and forth
> between text and binary formats. To eliminate the impact of peripheral
> devices, I have used a std::stringstream to store the data.

Fair choice.

> The binary buffers are represented by vectors, and I have assumed that a
> memcpy from the file buffer to the destination memory location is all that
> is needed to import the binary format from the file buffer. (If there are
> significant run-time overheads associated with moving NATIVE binary
> formats to the destination, please let me know.)

Not a fair choice. You have completely omitted the conversion from the
on-disk representation to your in-memory representation. Things that
differ are endianness, sizes, alignment and padding.

> And to the textbook authors who might be lurking: Please include a
> chapter on relative binary and text-based IO speeds in your upcoming
> editions. Binary file formats might not fit into your overall
> philosophies about human readability and universal portability of C++
> code, but some of your readers might appreciate being made aware of
> such practical details.

IMHO less for file formats than for protocols, otherwise I agree, a
comparison/warning would be useful.

> std::stringstream ss;
[...]
> for (m = 0; m < NumElements; ++m)
>    ss << SourceBuffer[m];

Wrong: You are writing the numbers without any separating character, making
it impossible to read them afterwards.

> while(!ss.eof())
> {
> ss >> DestinationBuffer[m];
> ++m;
> }

Wrong: Use the idiomatic "while(s >> val)". Your loop will probably overflow
the buffer by reading one past the end. Actually, with the error above, I
have no clue what your loop does, you should have checked correctness, too.
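
A minimal sketch of both corrections, for illustration only. The
function name RoundTripAsText is made up, and the precision of 17
digits is an added assumption so that IEEE 754 doubles survive the
round trip; neither appears in the original test.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <vector>

// Hypothetical helper: write doubles as text, then read them back.
// Returns the number of values successfully read.
std::size_t RoundTripAsText(const std::vector<double>& Source,
                            std::vector<double>& Destination)
{
        std::stringstream ss;
        ss.precision(17);  // assumption: enough digits for an IEEE 754 double
        for (std::size_t m = 0; m < Source.size(); ++m)
                ss << Source[m] << ' ';  // the separator keeps the numbers apart
        assert(ss.good());               // writing to memory should not fail

        Destination.clear();
        double val;
        while (ss >> val)                // stops at end of data or on bad input
                Destination.push_back(val);
        return Destination.size();
}
```

Because the loop tests the extraction itself, it cannot run one element
past the end the way a loop on eof() can.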

Further notes:
1. C++ IOStreams are a complex formatting and parsing framework using
plugins for pretty much any operation. Every use of a plugin amounts to a
lookup of the plugin and a virtual function call, with all the restrictions
that imposes on the optimizer. I would try to optimize that part first
before dumping a textual file layout.
2. Apart from the two glitches above, which are easily caught, textual
formatting is pretty easy to get right. However, I dare you to write
portable code to write a sequence of double values to a "packed binary"
file. This is far from trivial.
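
To give an idea of what is involved, here is a sketch that covers only
the common case. It assumes that double is a 64-bit IEEE 754 value and
that floating-point and integer byte order agree on the platform; the
names AppendDoubleLE and ReadDoubleLE are hypothetical. Platforms that
break either assumption are exactly where the non-trivial part starts.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes a 64-bit double");

// Append one double as 8 bytes, least significant byte first,
// independent of the byte order of the machine running the code.
void AppendDoubleLE(std::vector<unsigned char>& Out, double Value)
{
        std::uint64_t Bits;
        std::memcpy(&Bits, &Value, sizeof Bits);  // copy the bit pattern
        for (int i = 0; i < 8; ++i)
                Out.push_back(static_cast<unsigned char>((Bits >> (8 * i)) & 0xFF));
}

// Read back the double stored at byte offset Pos in the same layout.
double ReadDoubleLE(const std::vector<unsigned char>& In, std::size_t Pos)
{
        assert(Pos + 8 <= In.size());
        std::uint64_t Bits = 0;
        for (int i = 0; i < 8; ++i)
                Bits |= static_cast<std::uint64_t>(In[Pos + i]) << (8 * i);
        double Value;
        std::memcpy(&Value, &Bits, sizeof Value);
        return Value;
}
```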

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

Newsgroups: comp.lang.c++.moderated
From: Neil Butterworth <nbutterworth1...@gmail.com>
Date: Mon, 9 Nov 2009 08:51:58 CST
Local: Mon 9 Nov 2009 14:51
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Well, I can't speak for those people, but I would prefer text files for
exactly the reasons you suggest, provided those are of overriding
importance for the particular application. So if the application is
concerned with data transfer, I would use XML  for portability, if it
requires a configuration file, I would use a text format to make it easy
for users to read and edit.

However, if I wanted performance, I would use a binary format FOR THE
FILES WHERE PERFORMANCE IS THE PRIMARY REQUIREMENT. I don't think that
anyone is suggesting that a SQL database (for example) should be
implemented using text files for its indexes and tables. It would make
sense though for such a database to use text files for configuration etc.

You seem to have set up a straw man, and one that has very little to do
with C++, I would add.

Neil Butterworth

Newsgroups: comp.lang.c++.moderated
From: Nick Hounsome <nick.houns...@googlemail.com>
Date: Mon, 9 Nov 2009 08:56:57 CST
Local: Mon 9 Nov 2009 14:56
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 8 Nov, 20:22, Rune Allnor <all...@tele.ntnu.no> wrote:

> Hi all.

> A couple of weeks ago I posted a question on comp.lang.c++ about some
> technicality about binary file IO. Over the course of the discussion,
> I discovered to my amazement - and, quite frankly, horror - that there
> seems to be a school of thought that text-based storage formats are
> universally preferable to binary formats for reasons of portability
> and human readability.

That's not a school of thought - It's a fact. They are preferred and
for those reasons.

That doesn't mean that you never really need the performance,
but it would have to be quite a large data set or quite a stringent
performance requirement to make binary preferable.

> The people who presented such ideas appeared not to appreciate two
> details that counter any benefits text-based numerical formats might
> offer:

> 1) Binary files are about 20-70% of the file size of the text files,
>    depending on the number of significant digits stored in the text
>    files and other formatting text glyphs.

In 25 years of programming I have never come across a case (for files)
where this has been a problem, and the rate at which storage capacities
increase suggests to me that it never will be for any "normal"
application.

> 2) Text-formatted numerical data take significantly longer to read and
>    write than binary formats.

Again - Never in my experience.
In network protocols YES because you can never have too much
performance in low level general purpose protocols but in application
files I have never had a problem.

A slight optimisation that you might be interested in is to use hex -
This is still portable and readable but can be read and written
without multiplications or divisions.
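
A rough sketch of the idea for 32-bit unsigned integers, using only
shifts and masks. The names ToHex and FromHex are hypothetical, and the
parser assumes a character set such as ASCII in which 'a'..'f' and
'A'..'F' are contiguous.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Format a 32-bit value as exactly 8 hex digits: one mask and one
// shift per digit, no multiplication or division.
std::string ToHex(std::uint32_t Value)
{
        static const char Digits[] = "0123456789abcdef";
        std::string Text(8, '0');
        for (int i = 7; i >= 0; --i)
        {
                Text[i] = Digits[Value & 0xF];
                Value >>= 4;
        }
        return Text;
}

// Parse up to 8 hex digits; stops at the first non-hex character.
std::uint32_t FromHex(const std::string& Text)
{
        assert(Text.size() <= 8);  // more digits would not fit in 32 bits
        std::uint32_t Value = 0;
        for (std::size_t i = 0; i < Text.size(); ++i)
        {
                const char c = Text[i];
                unsigned Digit;
                if (c >= '0' && c <= '9')      Digit = c - '0';
                else if (c >= 'a' && c <= 'f') Digit = c - 'a' + 10;
                else if (c >= 'A' && c <= 'F') Digit = c - 'A' + 10;
                else break;
                Value = (Value << 4) | Digit;
        }
        return Value;
}
```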

> Timings are difficult to compare, since the exact numbers depend on
> buffering strategies, buffer sizes, disk speeds, network bandwidths
> and so on.

In other words they are of minor significance otherwise they would
dwarf these things.

> I have therefore sketched a 'distilled' test (code below) to test what
> overheads are involved with formatting numerical data back and forth
> between text and binary formats. To eliminate the impact of peripheral
> devices, I have used a std::stringstream to store the data. The binary
> buffers are

If you really worry about performance you will never use the C++ I/O
library conversions - The fastest way to write an integer will almost
certainly be itoa()/atoi() and (if you have it) read()/write()
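
As an illustration of converting without the C++ I/O library: itoa()
is not a standard function, so this sketch substitutes std::snprintf
for it and keeps std::atoi for the way back. The names IntToText and
TextToInt are made up, and no claim is made here about the actual
speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Format an int with the C library instead of IOStreams.
// std::snprintf stands in for the non-standard itoa().
std::string IntToText(int Value)
{
        char Buffer[32];
        const int Length = std::snprintf(Buffer, sizeof Buffer, "%d", Value);
        assert(Length > 0 && Length < static_cast<int>(sizeof Buffer));
        return std::string(Buffer, static_cast<std::size_t>(Length));
}

// Parse the text back. std::atoi reports no errors, so the input
// is assumed to be well formed.
int TextToInt(const std::string& Text)
{
        return std::atoi(Text.c_str());
}
```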

> represented by vectors, and I have assumed that a memcpy from the file
> buffer to the destination memory location is all that is needed to
> import the binary format from the file buffer. (If there are
> significant run-time overheads associated with moving NATIVE binary
> formats to the destination, please let me know.)

If you are really really really speed obsessed the way to go is to map
a binary file into memory rather than using ANY I/O library at all
(mmap on POSIX systems, Not sure about Windows).

Try it - You'll be impressed.
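
A minimal POSIX-only sketch, assuming the file holds doubles in the
machine's native format. SaveDoubles and LoadDoublesMapped are
hypothetical names; the copy into a vector is only there to keep the
example short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <vector>

// Write native doubles with plain fwrite, to have something to map.
bool SaveDoubles(const char* Path, const std::vector<double>& Data)
{
        std::FILE* File = std::fopen(Path, "wb");
        if (!File)
                return false;
        const std::size_t Written =
                std::fwrite(Data.data(), sizeof(double), Data.size(), File);
        return std::fclose(File) == 0 && Written == Data.size();
}

// Map the file read-only and read the doubles straight from the mapping.
// Returns an empty vector if the file cannot be opened or mapped.
std::vector<double> LoadDoublesMapped(const char* Path)
{
        assert(Path != nullptr);
        std::vector<double> Result;
        const int fd = open(Path, O_RDONLY);
        if (fd < 0)
                return Result;

        struct stat Info;
        if (fstat(fd, &Info) == 0 && Info.st_size > 0)
        {
                const std::size_t Bytes = static_cast<std::size_t>(Info.st_size);
                void* Map = mmap(nullptr, Bytes, PROT_READ, MAP_PRIVATE, fd, 0);
                if (Map != MAP_FAILED)
                {
                        // The mapping is page-aligned, so it is suitably
                        // aligned for double.
                        const double* First = static_cast<const double*>(Map);
                        Result.assign(First, First + Bytes / sizeof(double));
                        munmap(Map, Bytes);
                }
        }
        close(fd);
        return Result;
}
```

A real program would work on the mapped memory directly and unmap it
when done, rather than copying it out.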

They are not wrong. They are just irrelevant to 99% of all
applications.

> And to the textbook authors who might be lurking: Please include a
> chapter on relative binary and text-based IO speeds in your upcoming
> editions. Binary file formats might not fit into your overall
> philosophies about human readability and universal portability of C++
> code, but some of your readers might appreciate being made aware of
> such practical details.

> Rune

The text book authors are writing for the 99% not the 1% so they are
not going to change.

I enjoy your posts Rune but IMHO you really do get carried away with
the wrong performance issues.

Newsgroups: comp.lang.c++.moderated
From: Francis Glassborow <francis.glassbo...@btinternet.com>
Date: Mon, 9 Nov 2009 13:57:25 CST
Local: Mon 9 Nov 2009 19:57
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:
> Hi all.

> A couple of weeks ago I posted a question on comp.lang.c++ about some technicality
> about binary file IO. Over the course of the discussion, I discovered to my
> amazement - and, quite frankly, horror - that there seems to be a school of
> thought that text-based storage formats are universally preferable to binary
> formats for reasons of portability and human readability.

This is a classic example of the 'speed at any cost' school of thought.
The trouble with binary formats is that they are not portable (sometimes
to the extent of not being portable between releases of the same program
-- I once met a young programmer who was going through the agony of
having used binary data formats which could no longer be read correctly
by the second release of his program much to the dismay of his customers)

The space overhead is hardly important when even simple desktop machines
can have over a terabyte of disk storage at less than it used to cost
(in 1979) to buy a couple of boxes of single density 5.25" floppy disks.

Speed could be an issue in some cases, but measure first before choosing
to optimise. I rarely use binary files other than for scratch files
during a single execution of a program (where they are fine as long as
the program does not crash and burn)

Newsgroups: comp.lang.c++.moderated
From: REH <spamj...@stny.rr.com>
Date: Mon, 9 Nov 2009 13:58:06 CST
Local: Mon 9 Nov 2009 19:58
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On Nov 8, 3:22 pm, Rune Allnor <all...@tele.ntnu.no> wrote:

Metrics only matter when they matter. Are the larger file size or
increased processing time issues? If you are not constrained for space
or time, and the text files add value (e.g., convenience,
portability, etc.) then use them. Bigger and slower do not always
equate to bad.

REH

Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Mon, 9 Nov 2009 14:03:35 CST
Local: Mon 9 Nov 2009 20:03
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 9 Nov, 15:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:

> The text book authors are writing for the 99% not the 1% so they are
> not going to change.

I am working among the 1%. I have seen companies lose business
because of poorly performing software that went undetected. That
is, the people whose job it was to know, did not know about the
major performance issues.

In one company, which did 24/7 survey jobs and stored the data
in text format, merely reading 24 hrs worth of data from
text-formatted files imposed some 3-5 hrs idle time on the human
operators. It wouldn't have been a big deal if those 3-5 hrs had
come as one block (the operators in question could have had
a long break if they had), but these 3-5 hrs were interspersed
throughout the process, tying the operators down in front of their
terminals.

From a human standpoint, there are several time scales. Most
(all?) readers of this newsgroup are computer programmers, so
they know what it means to be in 'The Zone' where time just
flies and work gets done.

Now, if you can get a job done with operator idle time less
than a second, the operator can stretch his neck, yawn, have
a sip of coffee, and remain in 'The Zone' afterwards.

If the idle time is a couple of seconds, the waiting time
starts to become noticeable and thus annoying. If the waiting
time becomes ten seconds or more, an operator already in 'The
Zone' is yanked out of 'The Zone'. If ten seconds operator
idle time is commonplace in the application, the operator never
reaches 'The Zone' in the first place.

Once we start talking about minutes of operator idle time,
operators go away to have a cup of coffee, read the newspaper,
surf the net, flirt with the 20-year-old blonde at the
switchboard - whatever. Once that happens productivity numbers
reach the point where companies go out of business.

> I enjoy your posts Rune but IMHO you really do get carried away with
> the wrong performance issues.

No. The performance issues I worry about are the ones that
kick users out of business. No one cares if 15 seconds or 50
seconds is the most representative number for reading 100 MBytes
of text-formatted numeric data, when the same amount of binary
formatted data easily can be loaded in 0.3 seconds.

These details have a profound impact where I work. The only
reason this is not recognized, is the omnipresent misconception
that the slower time working with text-formatted numeric data
is insignificant.

People who know their programming craft would know that one uses
binary data formats for numeric data as default, and only deviates
towards text-based formats where one can get away with them
(file sizes less than about 5-10 MBytes).

Rune

Newsgroups: comp.lang.c++.moderated
From: DeMarcus <use_my_alias_h...@hotmail.com>
Date: Mon, 9 Nov 2009 16:06:46 CST
Local: Mon 9 Nov 2009 22:06
Subject: Re: Run-time overhead of text-based storage formats for numerical data
[...]

>> I enjoy your posts Rune but IMHO you really do get carried away with
>> the wrong performance issues.

> No. The performance issues I worry about are the ones that
> kick users out of business. No one cares if 15 seconds or 50
> seconds is the most representative number for reading 100 MBytes
> of text-formatted numeric data, when the same amount of binary
> formatted data easily can be loaded in 0.3 seconds.

You have a point, but if you asked me to solve the problem, I would
probably try to keep data in the easiest way, i.e. still as text. Why?
Because if it's supposed to be read by a human in the end, text is the
native format. Just as much as a picture's native format is binary and
not XML.

Then I would ask myself: how do we speed this up? As someone suggested
you could use mmap in *nix systems (if you think of it, most
configuration files in *nix are actually text files, probably loaded
with mmap).

Now, let's say mmap doesn't solve the problem, what do we do next? I
would look into compressors. Then you can store compressed files on
disk, still in their native text format. And suddenly you have made the
disk access time disappear with minimal hassle. You don't have to come
up with a strange binary format to work around disk latencies.

I used to be part of developing a real-time system where the disks just
couldn't deliver real-time transfer rates. Then we just made an adapter
class called Compressor taking two pointers, to source and destination
memory, chained the class with our File class, and solved the problem
with minimal effort. When the disks got faster a couple of years later
we just removed the compressor.

Cheers,
Daniel

Newsgroups: comp.lang.c++.moderated
From: Graziano <graziano.giuli...@gmail.com>
Date: Tue, 10 Nov 2009 02:23:36 CST
Local: Tues 10 Nov 2009 08:23
Subject: Re: Run-time overhead of text-based storage formats for numerical data
Well, out here there are also programmers working with HUGE amounts of
data (say: satellites, meteorological models, simulations).
Text files in these fields are just pure nonsense. We use binary formats,
well documented and with a convenient API, that allow indexing,
transformation to text, XML or code, I/O filters (say compression), missing
values, units, etc., and there are A LOT of publicly available applications
to view, plot and explore the data.

If need be, have a look for example at the HDF4 or HDF5 formats, NetCDF
or the like.

If it is just a bunch of numbers (say up to some thousands), I would
for sure go with a documented XML format.

Newsgroups: comp.lang.c++.moderated
From: Seungbeom Kim <musip...@bawi.org>
Date: Tue, 10 Nov 2009 02:22:33 CST
Local: Tues 10 Nov 2009 08:22
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:
> On 9 Nov, 15:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:

>> The text book authors are writing for the 99% not the 1% so they are
>> not going to change.

> I am working among the 1%. I have seen companies lose business
> because of poorly performing software that went undetected. That
> is, the people whose job it was to know, did not know about the
> major performance issues.

So what? I, as well as other posters I guess, understand that your
application may require performance that can't be satisfied by a textual
format, and stated so. Did anyone say that you're not among the 1% or
that you should switch from textual to binary? Do you have anything
to refute from the other replies in this thread?

Frankly I don't understand your point. "But I..." or "But my..."
isn't very meaningful when others didn't preclude your case.

--
Seungbeom Kim

Newsgroups: comp.lang.c++.moderated
From: Nick Hounsome <nick.houns...@googlemail.com>
Date: Tue, 10 Nov 2009 06:56:15 CST
Local: Tues 10 Nov 2009 12:56
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 10 Nov, 08:23, Graziano <graziano.giuli...@gmail.com> wrote:

> Well, out here there are also programmers working with HUGE amounts of
> data (say: satellites, meteorological models, simulations).
> Text files in these fields are just pure nonsense. We use binary formats,
> well documented and with a convenient API, that allow indexing,
> transformation to text, XML or code, I/O filters (say compression), missing
> values, units, etc., and there are A LOT of publicly available applications
> to view, plot and explore the data.

> If need be, have a look for example at the HDF4 or HDF5 formats, NetCDF
> or the like.

> If it is just a bunch of numbers (say up to some thousands), I would
> for sure go with a documented XML format.

And I would do exactly the same in your situation, because I've read
about how big those files can be. But you are in the 1%
(and I'd use memory mapping).

The important thing in this case is to provide the API for readers.

Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Tue, 10 Nov 2009 06:59:20 CST
Local: Tues 10 Nov 2009 12:59
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:

> Frankly I don't understand your point.

My point is that

1) The speed penalty imposed by using text formats for numerical
    data is totally absent in the USENET discussions and printed
    literature on C++.

2) The speed penalty imposed by using text formats for numerical
    data is one of the main bottlenecks in the data processing
    chains I have seen where I work.

3) The speed penalty imposed by using text formats for numerical
    data is totally unknown among the people whose job it is to
    set up said data processing chains.

4) The speed penalty imposed by using text formats for numerical
    data can easily be on the order of 100x or 200x relative to
    using binary data, depending on implementations of the software
    that accesses the file - not all applications are written in
    C++; not all C++ applications are efficient.

5) The speed penalty imposed by using text formats for numerical
    data is a *design* *choice*, on a par with using an O(N lg N)
    quicksort instead of an O(N^2) bubble sort.
    As far as I am concerned, people are free to use text formats if

       a) Portability is an *actual* issue - not always the case.
       b) Speed *is* irrelevant - not always the case.

    Once one or both of these conditions no longer hold, text-based
    formats are out of the picture. As for the "human
    readability" question, that's irrelevant unless the contents
    of the file are meant to be inspected by humans.

6) The speed penalty imposed by using text formats for numerical
    data should be mentioned in textbooks on C++, so that
    unsuspecting users have a fair chance of making informed
    choices on the matter, instead of the present situation,
    where "politically correct" textbook authors not only make
    the choices for them, but also avoid mentioning the
    alternatives.

Rune

Newsgroups: comp.lang.c++.moderated
From: Nick Hounsome <nick.houns...@googlemail.com>
Date: Tue, 10 Nov 2009 06:56:02 CST
Local: Tues 10 Nov 2009 12:56
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 9 Nov, 20:03, Rune Allnor <all...@tele.ntnu.no> wrote:

But how much of that 3-5 hours was reading and how much was processing
the data?
I can't imagine a process in which the reading was the bulk of that
time.

I agree with everything you say here.
It's just that my experience is that you can't reduce minutes to
seconds unless the fundamental design is bad.

> > I enjoy your posts Rune but IMHO you really do get carried away with
> > the wrong performance issues.

> No. The performance issues I worry about are the ones that
> kick users out of business. No one cares if 15 seconds or 50
> seconds is the most representative number for reading 100 MBytes
> of text-formatted numeric data, when the same amount of binary
> formatted data easily can be loaded in 0.3 seconds.

You're mixing reading and loading. Loading = reading and processing.
Processing is the biggest user of time in almost all systems otherwise
they aren't doing anything useful.
If I take your figures at face value you can't be doing anything with
the data.

> These details have a profound impact where I work. The only
> reason this is not recognized, is the omnipresent misconception
> that the slower time working with text-formatted numeric data
> is insignificant.

But you are undermining your own argument.
The people writing on this thread would hardly be saying that it was
insignificant if it cost them their jobs therefore it hasn't cost them
their jobs therefore it IS insignificant in all the projects that
they've worked on. In other words it IS insignificant for MOST people
MOST of the time just as I said.

> People who know their programming craft would know that one uses
> binary data formats for numeric data as default, and only deviates
> towards text-based formats where one can get away with them
> (file sizes less than about 5-10 MBytes).

You have it backwards.

One uses text formats by default and only deviates towards a binary
format when one knows there is a problem and has demonstrated
that a binary format will solve it.

Either that or you are right and Meyers, Stroustrup, all the C++ book
writers and all the designers of the C and C++ I/O libraries are
wrong.

P.S. As I think I already mentioned - If you really want the ultimate
in speed then use memory mapping.

Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Tue, 10 Nov 2009 07:03:28 CST
Local: Tues 10 Nov 2009 13:03
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 10 Nov, 09:23, Graziano <graziano.giuli...@gmail.com> wrote:

> Well, out here there are also programmers working with HUGE amounts of
> data (say: satellites, meteorological models, simulations).
> Text files in these fields are just pure nonsense. We use binary formats,
> well documented and with a convenient API, that allow indexing,
> transformation to text, XML or code, I/O filters (say compression), missing
> values, units, etc., and there are A LOT of publicly available applications
> to view, plot and explore the data.

> If need be, have a look for example at the HDF4 or HDF5 formats, NetCDF
> or the like.

I know what binary file format to use with the data in question.

My problem has been a bit more fundamental than that. When I ask
decision-makers on what grounds text-based file formats were
chosen, people either respond with "text files are so convenient"
or a blank stare.

In other words, strategic decisions that directly affect the
ability to meet deadlines are taken as a matter of course,
without evaluating the operational impact on the process - or
even without the awareness that an alternative existed at all.

Which is why I would like the trade-offs involved to at least
be mentioned in upcoming textbooks on programming in general
and C++ in particular.

Rune

Newsgroups: comp.lang.c++.moderated
From: mzdude <jsa...@cox.net>
Date: Tue, 10 Nov 2009 15:30:49 CST
Local: Tues 10 Nov 2009 21:30
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On Nov 10, 7:59 am, Rune Allnor <all...@tele.ntnu.no> wrote:

> On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:

> > Frankly I don't understand your point.

> My point is that

I think the whole text vs. binary debate for large data files is a little
facetious. Use what works. I work for a company that acquires spectral
data and stores it in native binary format (for speed). We often acquire
hundreds of MB, or very large fractions of a GB, worth of data. If you can
honestly tell me that a human is going to sift through a text file
looking for inaccuracies in the data, then that person is deluding
themselves and has way too much time on their hands.

We provide a DLL interface to read our data file format. One routine
extracts the data in native binary format for speed; the other
extracts the data and returns it in text format. This was done to promote
language interoperability. In tests done long ago, when 300 MHz PCs ruled
the earth, binary extraction was at least an order of magnitude faster.
IIRC it was about 70 times faster.

Now that our instruments are being used in the medical community, we have
additional tamper detection requirements. Storing data as text makes it
very tempting (and easy) for a human to fire up a text editor and manipulate
the data. It can still be done using a binary editor, but it's much harder
to do.

So I would add that, for security reasons as well as speed, text file
formats do not work for us.

Newsgroups: comp.lang.c++.moderated
From: Francis Glassborow <francis.glassbo...@btinternet.com>
Date: Tue, 10 Nov 2009 15:30:39 CST
Local: Tues 10 Nov 2009 21:30
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:
> On 10 Nov, 09:22, Seungbeom Kim <musip...@bawi.org> wrote:

>> Frankly I don't understand your point.

> My point is that

> 1) The speed penalty imposed by using text formats for numerical
>     data is totally absent in the USENET discussions and printed
>     literature on C++.

Because it is irrelevant to those involved. And any halfway competent
programmer would understand that there is an overhead for using a text file.

> 2) The speed penalty imposed by using text formats for numerical
>     data is one of the main bottlenecks in the data processing
>     chains I have seen where I work.

Which means that a properly qualified programmer would recognise that
this is one of the minority cases where a binary format would be useful.
BTW in another post I mentioned that I do use binary formats for scratch
files (exactly because there is no advantage in using a text format).

In addition it should be noted that a programmer who does not recognise
when to use a binary format file probably does not understand the
dangers in using such.

> 3) The speed penalty imposed by using text formats for numerical
>     data is totally unknown among the people whose job it is to
>     set up said data processing chains.

OK, so you are using insufficiently qualified people. Whose fault is that?

> 4) The speed penalty imposed by using text formats for numerical
>     data can easily be on the order of 100x or 200x relative to
>     using binary data, depending on implementations of the software
>     that accesses the file - not all applications are written in
>     C++; not all C++ applications are efficient.

I frankly do not believe that. Performance hits of that kind are almost
invariably the consequence of using the wrong algorithms.

> 5) The speed penalty imposed by using text formats for numerical
>     data is a *design* *choice*, on a par with using an O(N lg N)
>     quicksort instead of an O(N^2) bubble sort.
>     As far as I am concerned, people are free to use text formats if

>        a) Portability is an *actual* issue - not always the case.
>        b) Speed *is* irrelevant - not always the case.

The point is that for the overwhelming majority using text formats is
win-win. I would expect to see information about when to use binary
formats in specialist books on areas where it matters (authors of
general texts have to trim the content to meet criteria provided by
publishers and something that is important to a very small minority
would almost invariably be cut).

>     Once one or both these factors is no longer relevant, text-
>     based formats are out of the picture. As for the "human
>     readability" question, that's irrelevant unless the contents
>     of the file is meant to be inspected by humans.

True but readability also extends to tools except that then we often
call it portability.

> 6) The speed penalty imposed by using text formats for numerical
>     data should be mentioned in textbooks on C++, so that
>     unsuspecting users have a fair chance of making informed
>     choices on the matter, instead of the present situation,
>     where "politically correct" textbook authors not only make
>     the choices for them, but also avoid mentioning the
>     alternatives.

See above. Unfortunately far too many programmers consider reading to be
an arcane art and never read books. Worse they think they know all the
answers and never listen to advice from others.

A programmer who cannot recognise that text formats affect performance
is not fit for anything other than grunt-work.

Actually I would suggest that such things as choice of file formats are
design issues and if an employer chooses not to employ a competent
designer with knowledge of the area he gets all he deserves.

Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Tue, 10 Nov 2009 23:16:26 CST
Local: Wed 11 Nov 2009 05:16
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 10 Nov, 13:56, Nick Hounsome <nick.houns...@googlemail.com> wrote:

I don't remember the details, this was several years ago, but
the gross numbers were more or less that the process in question
produced some 200 files on the order of 100 MBytes / file during
24 hours of survey. Each file needed to pass through two or three
read / write cycles during processing. That's about 600 read/writes,
each taking some 30 seconds of operator idle time. That's 5 hours
wasted, right there. With a 24-hr deadline, that hurts.
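
The arithmetic above, spelled out as a small C++ sketch. The function name
idle_hours is made up for illustration; the post says "two or three" cycles
per file, and three is assumed here because that reproduces the 5-hour
figure. The 0.3-second value is the binary load time for a 100 MByte file
quoted earlier in the thread, used only to show the scale of the difference.

```cpp
// Operator idle time, in hours, for a batch of files that each pass
// through a number of read/write cycles of a given duration.
constexpr double idle_hours(int files, int cycles_per_file,
                            double seconds_per_cycle) {
    return files * cycles_per_file * seconds_per_cycle / 3600.0;
}

// Text files: 200 files x 3 cycles x 30 s = 18000 s.
constexpr double text_hours = idle_hours(200, 3, 30.0);

// Same workload at 0.3 s per cycle (assumed binary figure): 180 s.
constexpr double binary_hours = idle_hours(200, 3, 0.3);
```

With these inputs text_hours is 5 hours, while binary_hours is about 0.05
hours, or roughly three minutes.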

The fundamental bad design is to use text-formatted files.
I posted a demo I made in matlab, which is an increasingly
popular language for these kinds of things, in a reply to
Glassborow. Look at the numbers there - binary files are
100-200x faster than text formats.

> > > I enjoy your posts Rune but IMHO you really do get carried away with
> > > the wrong performance issues.

> > No. The performance issues I worry about are the ones that
> > kick users out of business. No one cares if 15 seconds or 50
> > seconds is the most representative number for reading 100 MBytes
> > of text-formatted numeric data, when the same amount of binary
> > formatted data easily can be loaded in 0.3 seconds.

> You're mixing reading and loading. Loading = reading and processing.
> Processing is the biggest user of time in almost all systems otherwise
> they aren't doing anything useful.
> If I take your figures at face value you can't be doing anything with
> the data.

I only have so much time to get the job done. We can argue
over semantics till the cows come home - the deadlines stand
whether I am 'reading' or 'loading' the data.

> > These details have a profound impact where I work. The only
> > reason this is not recognized, is the omnipresent misconception
> > that the slower time working with text-formatted numeric data
> > is insignificant.

> But you are undermining your own argument.
> The people writing on this thread would hardly be saying that it was
> insignificant if it cost them their jobs therefore it hasn't cost them
> their jobs therefore it IS insignificant in all the projects that
> they've worked on. In other words it IS insignificant for MOST people
> MOST of the time just as I said.

Well, the *intention* of what people write might be benign,
but that's not how it comes across. Some of the reactions I
have received in similar threads here and on comp.lang.c++:

http://groups.google.no/group/comp.lang.c++/msg/0abdc440e78f98d6

[RA] > As long as you keep two factors in mind:
      > 1) The user's time is not yours (the programmer) to waste.
      > 2) The users's storage facilities (disk space, network
      >    bandwidth etc) are not yours (the programmer) to waste.

[JK] The user pays for your time.  Spending it to do something which
      results in a less reliable program, and that he doesn't need, is
      irresponsible, and borders on fraud.

Paying attention to speed "borders on fraud."

http://groups.google.no/group/comp.lang.c++.moderated/msg/555d4053471...

[RA] > 2) Text-formatted numerical data take significantly longer to
      >    read and write than binary formats.

[NH] Again - Never in my experience.

http://groups.google.no/group/comp.lang.c++.moderated/msg/eed5649d9ba...

[FG]  This is a classic example of the 'speed at any cost' school of
      thought.

http://groups.google.no/group/comp.lang.c++.moderated/msg/8aec2b00e7a...

[NH]  You seem to be obsessed with speed

From such excerpts I can only conclude that most people are
oblivious to the problem and its implications.

> > People who know their programming craft would know that one uses
> > binary data formats for numeric data as default, and only deviates
> > towards text-based formats where one can get away with them
> > (file sizes less than about 5-10 MBytes).

> You have it backwards.

> One uses text formats by default and only deviates towards a binary
> format when one knows there is a problem and has demonstrated
> that a binary format will solve it.

> Either that or you are right and Meyers, Stroustrup, all the C++ book
> writers and all the designers of the C and C++ I/O libraries are
> wrong.

No. The authors you list aren't wrong. In order to be wrong,
one must make a statement that can be proved or demonstrated
to be false.

The fact is that except for Stroustrup mentioning ios_base::binary
more or less in passing in his chapter on file streams, none of the
C++ textbooks I have seen mention the issue at all.

Rune

Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Tue, 10 Nov 2009 23:15:23 CST
Local: Wed 11 Nov 2009 05:15
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 10 Nov, 22:30, Francis Glassborow <francis.glassbo...@btinternet.com> wrote:

> > 3) The speed penalty imposed by using text formats for numerical
> >     data is totally unknown among the people whose job it is to
> >     set up said data processing chains.

> OK, so you are using insufficiently qualified people. Whose fault is that?

*I* am not using anyone. I happen to work among people who
ought to know these things, but don't.

> > 4) The speed penalty imposed by using text formats for numerical
> >     data can easily be on the order of 100x or 200x relative to
> >     using binary data, depending on implementations of the software
> >     that accesses the file - not all applications are written in
> >     C++; not all C++ applications are efficient.

> I frankly do not believe that. Those kind of performance hits are almost
> invariably the consequence of using the wrong algorithms.

Below is a test I wrote in matlab, which is an increasingly
popular language for these kinds of things. The script first
generates ten million random numbers, and writes them to file
in both ASCII and binary double-precision floating-point formats.
The files are then read straight back in, hopefully mitigating
effects of file caches etc:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
N = 10000000;        % ten million samples
d1=randn(N,1);       % normally distributed test data

% Write the data as ASCII text and time it
t1=cputime;
save test.txt d1 -ascii
t2=cputime-t1;
disp(['Wrote ASCII data in ',num2str(t2),' seconds'])

% Read the ASCII file straight back in
t3=cputime;
d2=load('test.txt','-ascii');
t4=cputime-t3;
disp(['Read ASCII data in ',num2str(t4),' seconds'])

% Write the same data as raw binary doubles
t5=cputime;
fid=fopen('test.raw','w');
fwrite(fid,d1,'double');
fclose(fid);
t6=cputime-t5;
disp(['Wrote binary data in ',num2str(t6),' seconds'])

% Read the binary file straight back in
t7=cputime;
fid=fopen('test.raw','r');
d3=fread(fid,'double');
fclose(fid);
t8=cputime-t7;
disp(['Read binary data in ',num2str(t8),' seconds'])
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Output:
------------------------------------
Wrote ASCII data in 24.0469 seconds
Read ASCII data in 42.2031 seconds
Wrote binary data in 0.10938 seconds
Read binary data in 0.32813 seconds
------------------------------------

Binary writes are 24.05/0.109 = about 220x faster than text writes.
Binary reads are 42.2/0.328 = about 129x faster than text reads.

These numbers are representative for where I work.
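
For readers who would rather repeat the comparison in C++ than in matlab,
a minimal sketch might look like the following. The helper names
(write_text, read_text, write_binary, read_binary, seconds) are made up for
illustration and do not come from the thread. The binary path simply dumps
the in-memory doubles, so the file is only readable on machines with the
same byte order and floating-point format. No timings are claimed for this
code; the ratios will depend on the compiler, library and disk.

```cpp
#include <chrono>
#include <cstddef>
#include <fstream>
#include <iomanip>
#include <limits>
#include <string>
#include <vector>

// Write one value per line as decimal text, with enough digits that the
// value can be read back exactly.
void write_text(const std::string& path, const std::vector<double>& v) {
    std::ofstream out(path);
    out << std::setprecision(std::numeric_limits<double>::max_digits10);
    for (double x : v) out << x << '\n';
}

// Parse whitespace-separated decimal numbers until the stream fails.
std::vector<double> read_text(const std::string& path) {
    std::ifstream in(path);
    std::vector<double> v;
    for (double x; in >> x;) v.push_back(x);
    return v;
}

// Dump the raw object representation of the doubles; no formatting at all.
void write_binary(const std::string& path, const std::vector<double>& v) {
    std::ofstream out(path, std::ios::binary);
    out.write(reinterpret_cast<const char*>(v.data()),
              static_cast<std::streamsize>(v.size() * sizeof(double)));
}

// Read n raw doubles back into memory.
std::vector<double> read_binary(const std::string& path, std::size_t n) {
    std::ifstream in(path, std::ios::binary);
    std::vector<double> v(n);
    in.read(reinterpret_cast<char*>(v.data()),
            static_cast<std::streamsize>(n * sizeof(double)));
    return v;
}

// Wall-clock seconds taken by a callable.
template <class F>
double seconds(F&& f) {
    const auto t0 = std::chrono::steady_clock::now();
    f();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

A call such as seconds([&] { write_text("test.txt", d1); }) then gives the
figure corresponding to the first line of the output above, and likewise
for the other three helpers.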

> > 5) The speed penalty imposed by using text formats for numerical
> >     data is a *design* *choice*, on a par with using an O(N lg N)
> >     quicksort instead of an O(N^2) bubble sort.
> >     As far as I am concerned, people are free to use text formats if

> >        a) Portability is an *actual* issue - not always the case.
> >        b) Speed *is* irrelevant - not always the case.

> The point is that for the overwhelming majority using text formats is
> win-win. I would expect to see information about when to use binary
> formats in specialist books on areas where it matters (authors of
> general texts have to trim the content to meet criteria provided by
> publishers and something that is important to a very small minority
> would almost invariably be cut).

My point is that the people who only know a little programming
also would benefit from at least having seen these things
mentioned.

I don't mind you and other textbook authors arguing fiercely
for one approach and against the other, *provided* both
approaches are at least mentioned, and preferably described
in terms of pros & cons.

I might agree with you on both cases, if the programmers and
designers had access to textbooks where these questions are
discussed.

As of right now, they don't.

Rune

Newsgroups: comp.lang.c++.moderated
From: Seungbeom Kim <musip...@bawi.org>
Date: Tue, 10 Nov 2009 23:18:01 CST
Local: Wed 11 Nov 2009 05:18
Subject: Re: Run-time overhead of text-based storage formats for numerical data

As you have stated, the points you have made so far relate only to
applications dealing heavily with numerical data. Then why are you
accusing general C++ language book authors and Usenet participants?
Take a numerical processing book, or go to a numerical processing
group, and if it's said there that textual formats should be preferred,
then that's where you should make accusations and claims. (And by the
way, choosing data formats is only barely on-topic for a "C++ language"
textbook or forum.)

I'm not saying that you should not discuss such matters elsewhere,
such as here. It's just that you keep emphasizing your particular
application area and its special needs, while others are stating
more common cases and more general choices, and NOT refuting your
points in your area in particular. You and others are going parallel
to each other and not making any more progress. Your repeated claims
here would make sense only if you were arguing that binary formats
should be preferred *by default* in most, if not all, areas -- that's
where you could refute others' claims -- but you have clearly stated
above that your points are meant only for numerical applications.
This is why I said I didn't understand your point: are you just
explaining your situation, or trying to change others' opinions?

--
Seungbeom Kim

Newsgroups: comp.lang.c++.moderated
From: Ulrich Eckhardt <eckha...@satorlaser.com>
Date: Wed, 11 Nov 2009 13:23:55 CST
Local: Wed 11 Nov 2009 19:23
Subject: Re: Run-time overhead of text-based storage formats for numerical data
Rune Allnor wrote:

[ about "binary" files ]

> The fact is that except for Stroustrup mentioning ios_base::binary
> more or less in passing in his chapter on file streams, none of the
> C++ textbooks I have seen mention the issue at all.

ios_base::binary does something completely different.
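
To illustrate the distinction (a minimal sketch, not part of the original
post): opening a stream with std::ios_base::binary only switches off the
platform's text-mode translation, such as newline conversion on Windows.
operator<< still formats the number as characters. Storing the raw object
representation takes an unformatted call such as write(). The helper names
and file names are made up for the example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper: formatted output through a stream opened in binary
// mode. The file still receives the characters of the decimal representation.
void write_formatted(const char* path, double x)
{
    std::ofstream out(path, std::ios_base::out | std::ios_base::binary);
    out << x;
}

// Hypothetical helper: unformatted output of the object representation.
// The file receives exactly sizeof(double) bytes.
void write_raw(const char* path, double x)
{
    std::ofstream out(path, std::ios_base::out | std::ios_base::binary);
    out.write(reinterpret_cast<const char*>(&x), sizeof x);
}

// Read a whole file back as bytes so that the two results can be compared.
std::string slurp(const char* path)
{
    std::ifstream in(path, std::ios_base::in | std::ios_base::binary);
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

With these helpers, write_formatted("a.tmp", 3.14) leaves the four characters
"3.14" in the file, while write_raw("b.tmp", 3.14) leaves sizeof(double) bytes
that are only meaningful to a reader using the same floating-point layout.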

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932



Rune Allnor  
Newsgroups: comp.lang.c++.moderated
From: Rune Allnor <all...@tele.ntnu.no>
Date: Wed, 11 Nov 2009 13:26:51 CST
Local: Wed 11 Nov 2009 19:26
Subject: Re: Run-time overhead of text-based storage formats for numerical data
On 11 Nov, 06:18, Seungbeom Kim <musip...@bawi.org> wrote:

Because C++ is considered by most (although maybe not by the
regulars here) as a language for the applications where high
efficiency is paramount. Even so, how and when to handle binary
files through C++ is not treated in any of the learning material
I have seen.
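
For readers who have not seen it, here is a minimal sketch of what binary
file handling can look like in C++. It is an illustration only, not taken
from any textbook, and save_doubles and load_doubles are made-up names. The
file is a raw dump of the in-memory representation, so it assumes that the
reader and the writer share the same double format and byte order.

```cpp
#include <cstddef>
#include <fstream>
#include <stdexcept>
#include <vector>

// Write the values as one raw block. No header and no byte-order marker:
// this assumes reader and writer use the same representation of double.
void save_doubles(const char* path, const std::vector<double>& v)
{
    std::ofstream out(path, std::ios_base::out | std::ios_base::binary);
    if (!out)
        throw std::runtime_error("cannot open file for writing");
    if (!v.empty())
        out.write(reinterpret_cast<const char*>(&v[0]),
                  static_cast<std::streamsize>(v.size() * sizeof(double)));
    if (!out)
        throw std::runtime_error("write failed");
}

// Read the whole file back, deriving the element count from the file size.
std::vector<double> load_doubles(const char* path)
{
    std::ifstream in(path, std::ios_base::in | std::ios_base::binary);
    if (!in)
        throw std::runtime_error("cannot open file for reading");
    in.seekg(0, std::ios_base::end);
    const std::streamoff bytes = in.tellg();
    in.seekg(0, std::ios_base::beg);

    std::vector<double> v(static_cast<std::size_t>(bytes) / sizeof(double));
    if (!v.empty())
        in.read(reinterpret_cast<char*>(&v[0]),
                static_cast<std::streamsize>(v.size() * sizeof(double)));
    if (!in)
        throw std::runtime_error("read failed");
    return v;
}
```

A portable format would add at least a header with an element count and a
byte-order marker; the sketch leaves that out to keep the core idea visible.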

> Take a numerical processing book, or go to a numerical processing
> group, and if it's said there that textual formats should be preferred,
> then that's where you should make accusations and claims. (And by the
> way, choosing data formats is only barely on-topic for a "C++ language"
> textbook or forum.)

Yesterday I summarized in another post a number of responses
I have received on these questions over the past couple of
weeks:

http://groups.google.no/group/comp.lang.c++.moderated/msg/9421be7a5c6...

The general impression is that the C++ community as it comes
across in this and other USENET groups, as well as the textbooks,
is totally foreign to using binary file formats at all.

> I'm not saying that you should not discuss such matters elsewhere,
> such as here. It's just that you keep emphasizing your particular
> application area and its special needs, while others are stating
> more common cases and more general choices, and NOT refuting your
> points in your area in particular.

Correct. But there is a general trend towards dismissing
the stated problem as irrelevant, or my solution as a
misconception. Again, review the opinions expressed in the
post I refer to above.

> You and others are going parallel
> to each other and not making any more progress. Your repeated claims
> here would make sense only if you were arguing that binary formats
> should be preferred *by default* in most, if not all, areas -- that's
> where you could refute others' claims -- but you have clearly stated
> above that your points are meant only for numerical applications.

That's where the problem is easiest to find, because that's
the application where it surfaces first. These things don't
show up unless the file sizes are 5-10 MBytes or more; only
then do the delays associated with data loading become
noticeable to human users.

Once e.g. XML files, which need to be parsed (and the parsing takes
at least as much time as merely converting the numbers), start
reaching those kinds of sizes, text-based formats will become
annoying in other applications as well.

> This is why I said I didn't understand your point: are you just
> explaining your situation, or trying to change others' opinions?

I am trying to make influential people here - both regular posters
on c.l.c++.m and textbook authors who might be lurking - aware of
the problem. Only when the teachers start addressing a problem
will it be reasonable to expect students to know.

I have posted numbers to demonstrate what I am talking about
on a number of occasions, e.g.

http://groups.google.no/group/comp.lang.c++.moderated/msg/2863e5d312a...

The common first reaction is that "This is not C++, so this
is irrelevant!" Then all the arguments we have seen in this
and recent threads appear; that

- There are faster ways than operator>> and operator<<
- The algorithm must be wrong
- My measurements are not 'exact'

and so on.
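
As an illustration of the first objection, here is a sketch of one commonly
suggested alternative to operator>>: hold the text in memory and convert it
with std::strtod. parse_doubles is a made-up name, and whether this is
actually faster depends on the library implementation; no claim is made here
about the size of the gain.

```cpp
#include <cstdlib>
#include <string>
#include <vector>

// Convert whitespace-separated decimal numbers held in memory.
// Stops at the first token that std::strtod cannot convert.
std::vector<double> parse_doubles(const std::string& text)
{
    std::vector<double> values;
    const char* p = text.c_str();   // null-terminated, so strtod cannot overrun
    for (;;) {
        char* end = 0;
        const double x = std::strtod(p, &end);
        if (end == p)               // no conversion: end of input or garbage
            break;
        values.push_back(x);
        p = end;                    // continue after the converted token
    }
    return values;
}
```

Note that this still performs a decimal-to-binary conversion for every
number, which is the cost a binary format avoids altogether.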

The fact is that very few people even suspected these
numbers to differ by orders of magnitude. The numbers referred
to above, while not C++, are representative of the delays and
bottlenecks where I work. We can quarrel about reducing the
absolute numbers by a factor of 3 or maybe 5 by using efficient
parsers, but that requires rewriting the file parsers in
every single program already in use out there, as well as
educating the programmers etc.

The net effect is far larger if one spends the same effort
on educating the same programmers and designers about binary
files.

But in order to do that, one needs to educate the educators.

Rune



Martin B.  
Newsgroups: comp.lang.c++.moderated
From: "Martin B." <0xCDCDC...@gmx.at>
Date: Wed, 11 Nov 2009 18:59:13 CST
Local: Thurs 12 Nov 2009 00:59
Subject: Re: Run-time overhead of text-based storage formats for numerical data

To everything that has been said in other replies I would like to add a
small test sample of mine.
It's as basic as I thought I could get and just as inaccurate and
insufficient as every other single test. (Code follows at the end)

-- Run 1 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 279 ms
ASCII write  = 2048 ms (Factor 7.3)
Binary read  = 211 ms
ASCII read   = 1283 ms (Factor 6.1)

-- Run 2 -- (1e7)
build type: RELEASE
generate data (10000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 11329 ms
ASCII write  = 20014 ms (Factor 1.8)
Binary read  = 2252 ms
ASCII read   = 12922 ms (Factor 5.7)

-- Run 3 -- (1e6)
build type: RELEASE
generate data (1000000 doubles) ...
start writing ...
start reading ...
timings:
Binary write = 313 ms
ASCII write  = 1911 ms (Factor 6.1)
Binary read  = 212 ms
ASCII read   = 1277 ms (Factor 6.0)

So what gives?
Binary is unsurprisingly faster.
Apparently somewhere between factor 5 and 10 on my box here. (And if
something else is going on, such as in Run 2, it may not even be that
much faster.)

Bottom line for me:
1.) Binary *is* definitely faster.
2.) The difference is a small factor n (single digits here).
3.) *If* you need that speed, use binary.
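
For reference, the factors quoted above are simply the ASCII time divided by
the binary time. A small sketch that recomputes them and also turns a timing
into a throughput figure (slowdown and mbytes_per_second are made-up names,
and 8-byte doubles are assumed):

```cpp
// Ratio of text time to binary time; both arguments in milliseconds.
double slowdown(double ascii_ms, double binary_ms)
{
    return ascii_ms / binary_ms;
}

// Throughput in megabytes (10^6 bytes) per second, assuming 8-byte doubles.
double mbytes_per_second(double n_doubles, double elapsed_ms)
{
    const double bytes = n_doubles * 8.0;
    return (bytes / 1e6) / (elapsed_ms / 1000.0);
}
```

slowdown(2048, 279) is about 7.3 and slowdown(20014, 11329) about 1.8,
matching the printed factors. mbytes_per_second(1e6, 279) is roughly 29,
while mbytes_per_second(1e7, 11329) is roughly 7, which is consistent with
the remark that something else was going on during the Run 2 binary write.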

br,
Martin

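### CODE ###

// Assumed preamble (not in the posted code): the includes, the two typedefs
// and the declarations are inferred from how the names are used below.
#include <cstdio>      // fopen, fwrite, fread, fprintf, fscanf
#include <cstdlib>     // rand, srand
#include <ctime>       // time
#include <iostream>
#include <stdexcept>   // std::runtime_error
#include <vector>
#include <windows.h>   // DWORD
#include <mmsystem.h>  // timeGetTime (link with winmm.lib)

typedef std::vector<double> dvec_t;  // assumed container type
typedef DWORD tdiff_t;               // assumed: timeGetTime() returns DWORD milliseconds

tdiff_t now();
tdiff_t write_binary(dvec_t const& data);
tdiff_t read_binary(dvec_t & data);
tdiff_t write_ascii(dvec_t const& data);
tdiff_t read_ascii(dvec_t & data);

int main()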
{
        using namespace std;
        srand( (unsigned)time( NULL ) );

        cout << "build type: " <<
#ifndef NDEBUG
        "DEBUG"
#else
        "RELEASE"
#endif
        << endl;

        const int n = 1e6;

        cout << "generate data (" << n << " doubles) ...\n";
        dvec_t outdata(n, 3.14);
        for(int i=0; i<n; ++i) {
                outdata[i] *= double(rand());
        }

        cout << "start writing ...\n";
        const tdiff_t wbin = write_binary(outdata);
        const tdiff_t wtxt = write_ascii(outdata);

        dvec_t b_in, t_in;

        cout << "start reading ...\n";
        const tdiff_t rbin = read_binary(b_in);
        const tdiff_t rtxt = read_ascii(t_in);

        cout << "timings:\n";
        cout << "Binary write = " << wbin << " ms\n";
        cout << "ASCII write  = " << wtxt << " ms\n";
        cout << "Binary read  = " << rbin << " ms\n";
        cout << "ASCII read   = " << rtxt << " ms\n";

        // cout << "check results ...\n";
        // cout << "Binary w/r == equality: " << (outdata==b_in?"yes":"no") << endl;
        // cout << "ASCII  w/r == equality: " << (outdata==t_in?"yes":"no") << endl;
        return 0;

}

tdiff_t now()
{
        return timeGetTime(); // from winmm.lib + mmsystem.h (Windows)

}

tdiff_t write_binary(dvec_t const& data)
{
        const tdiff_t start = now();
        FILE* f = fopen("mydata.bin", "wb");
        if(!f)
                throw std::runtime_error("Cannot open file for writing!");
        // One fwrite call per value, mirroring the per-value fprintf below.
        for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
                fwrite(&(*i), sizeof(double), 1, f);
        }
        fclose(f);
        const tdiff_t stop = now();
        return stop-start;

}

tdiff_t read_binary(dvec_t & data)
{
        data.clear();
        const tdiff_t start = now();
        FILE* f = fopen("mydata.bin", "rb");
        if(!f)
                throw std::runtime_error("No file!");
        double dRead;
        while(fread(&dRead, sizeof(double), 1, f) == 1) {
                data.push_back(dRead);
        }
        fclose(f);
        const tdiff_t stop = now();
        return stop-start;

}

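tdiff_t write_ascii(dvec_t const& data)
{
        const tdiff_t start = now();
        FILE* f = fopen("mydata.text", "wb");
        if(!f)
                throw std::runtime_error("Cannot open file for writing!");
        for(dvec_t::const_iterator i=data.begin(), e=data.end(); i!=e; ++i) {
                // "%e" is the printf conversion for double; the 'l' length
                // modifier is only needed in the scanf family.
                fprintf(f, "%e\n", *i);
        }
        fclose(f);
        const tdiff_t stop = now();
        return stop-start;

}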

tdiff_t read_ascii(dvec_t & data)
{
        data.clear();
        const tdiff_t start = now();
        FILE* f = fopen("mydata.text", "rb");
        if(!f)
                throw std::runtime_error("No file!");
        double dRead;

        while(fscanf(f, "%le", &dRead) == 1) {
                data.push_back(dRead);
        }
        fclose(f);
        const tdiff_t stop = now();
        return stop-start;

}
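
One caveat about the ASCII path above (an added note, not part of the posted
test): the e conversion of printf prints six digits after the decimal point
by default, so the text file is not guaranteed to reproduce the doubles bit
for bit, and the ASCII equality check can fail. A sketch of a conversion that
does round-trip, assuming IEEE 754 doubles (for which 17 significant digits
are sufficient) and a correctly rounding C library; format_exact,
format_default and parse are made-up names:

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Format with 17 significant digits, enough to identify any IEEE 754 double.
std::string format_exact(double x)
{
    char buf[64];
    std::sprintf(buf, "%.17g", x);
    return buf;
}

// Default %e precision: six digits after the decimal point.
std::string format_default(double x)
{
    char buf[64];
    std::sprintf(buf, "%e", x);
    return buf;
}

// Convert a decimal string back to double.
double parse(const std::string& s)
{
    return std::strtod(s.c_str(), 0);
}
```

The price is longer text: up to 17 significant digits per value instead of 7,
which increases both the file size and the conversion work.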


Seungbeom Kim  
Newsgroups: comp.lang.c++.moderated
From: Seungbeom Kim <musip...@bawi.org>
Date: Wed, 11 Nov 2009 18:59:11 CST
Local: Thurs 12 Nov 2009 00:59
Subject: Re: Run-time overhead of text-based storage formats for numerical data

Rune Allnor wrote:
> On 10 Nov, 22:30, Francis Glassborow
> <francis.glassbo...@btinternet.com> wrote:
>>> 4) The speed penalty imposed by using text formats for numerical
>>>     data can easily be on the order of 100x or 200x relative to
>>>     using binary data, depending on implementations of the software
>>>     that accesses the file - not all applications are written in
>>>     C++; not all C++ applications are efficient.
>> I frankly do not believe that. Those kind of performance hits are almost
>> invariably the consequence of using the wrong algorithms.

> Below is a test I wrote in matlab, which is an increasingly
> popular language for these kinds of things. [...]

You can discuss it in a MATLAB forum, then. MATLAB measurements
in a C++ forum don't mean much, because people here don't know
(and may not be interested in) what's going on inside MATLAB.
Gratuitous inefficiencies inside MATLAB, if any, cannot be used to
justify any argument in C++.

> Output:
> ------------------------------------
> Wrote ASCII data in 24.0469 seconds
> Read ASCII data in 42.2031 seconds
> Wrote binary data in 0.10938 seconds
> Read binary data in 0.32813 seconds
> ------------------------------------

> Binary writes are 24.0/0.1 = 240x faster than text write.
> Binary reads are 42.2/0.32 = 130x faster than text read.

> These numbers are representative for where I work.

Again, if MATLAB is representative for where you work, please visit
a MATLAB forum. Otherwise, there was a C++ test program by James Kanze
in the comp.lang.c++ thread you mentioned, and the numbers given by
that program are much more persuasive and convincing, at least here
in a C++ newsgroup. Or you can suggest a better C++ program, of course.

> My point is that the people who only know a little programming
> also would benefit from at least having seen these things
> mentioned.

> I don't mind you and other textbook authors arguing fiercely
> for one approach and against the other, *provided* both
> approaches are at least mentioned, and preferably described
> in terms of pros & cons.

It may not be a job for language textbooks, as I mentioned earlier,
though it definitely is for numerical programming textbooks, or
more general programming books dealing with choosing data formats.

It is very natural and acceptable that language textbooks focus on
the language features and that for the sake of simplicity they default
to a text data format that's easier to understand and debug. They
don't want the readers to struggle on other issues when they don't
understand the language features very well yet.

You are welcome to write a book of your own, of course.

--
Seungbeom Kim


