Most programs don’t do much disk I/O, but for those that do, it is often the bottleneck. I discovered this firsthand while working on my Huffman tree encoding program. On small data files, it doesn’t matter much how you read or write your data, but once file sizes reach hundreds of megabytes, the implementation makes a big difference.
The three basic pairs of C functions for reading and writing unformatted text files are fgetc/fputc, fgets/fputs, and fread/fwrite. fgetc reads one character at a time, fgets one line at a time, and fread some number of ‘records’ at a time. Since a record is just a sequence of bytes of a defined length, what fread really lets you do is read data in chunks of (record size × number of records) bytes. The respective put/write functions do the reverse.
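To make the three styles concrete, here is a minimal sketch of a stream-to-stream copy written each way. The function names (copy_fgetc, copy_fgets, copy_fread) are made up for illustration and are not taken from the program discussed here; each returns 0 on success and -1 on failure.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy one character at a time with fgetc/fputc. */
int copy_fgetc(FILE *in, FILE *out)
{
    int c;  /* int, not char, so that EOF can be told apart from data */
    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Copy one line (or at most bufsize - 1 characters) at a time with
 * fgets/fputs. Suitable for text only: fputs stops at an embedded '\0'. */
int copy_fgets(FILE *in, FILE *out, int bufsize)
{
    assert(bufsize >= 2);
    char *buf = malloc((size_t)bufsize);
    if (buf == NULL)
        return -1;

    int rc = 0;
    while (fgets(buf, bufsize, in) != NULL) {
        if (fputs(buf, out) == EOF) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    free(buf);
    return rc;
}

/* Copy `chunk` bytes at a time with fread/fwrite. The record size is 1,
 * so fread's return value is the number of bytes actually read. */
int copy_fread(FILE *in, FILE *out, size_t chunk)
{
    assert(chunk > 0);
    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1;

    int rc = 0;
    size_t got;
    while ((got = fread(buf, 1, chunk, in)) > 0) {
        if (fwrite(buf, 1, got, out) != got) {
            rc = -1;
            break;
        }
    }
    if (ferror(in))
        rc = -1;
    free(buf);
    return rc;
}
```

Passing 1 as the record size and the chunk size as the record count is a deliberate choice in this sketch: it lets the final, shorter read at the end of the file be written back out exactly.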
To get a handle on the performance differences, I wrote a small program to read in and write back out a 150 MB file using all three methods. On a relatively old Linux system (1.6 GHz, 160 GB ATA/66 drive), here’s what I found. The chunk size is just the number of bytes read or written in a single operation.
| Method | Chunk size of data | Time |
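A measurement along these lines could be set up roughly as follows. This is not the original test program, only a sketch under a few assumptions: the helper names time_chunked_copy and sweep_chunk_sizes are invented, the list of chunk sizes is arbitrary, and timing uses C11 timespec_get, which gives wall-clock time and can therefore be thrown off if the system clock is adjusted mid-run.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Wall-clock time in seconds, via C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: rewind both streams, copy `in` to `out` with
 * fread/fwrite using the given chunk size, and return the elapsed
 * wall-clock seconds, or -1.0 on error. */
double time_chunked_copy(FILE *in, FILE *out, size_t chunk)
{
    assert(chunk > 0);
    char *buf = malloc(chunk);
    if (buf == NULL)
        return -1.0;

    rewind(in);
    rewind(out);

    int ok = 1;
    size_t got;
    double start = now_seconds();
    while ((got = fread(buf, 1, chunk, in)) > 0) {
        if (fwrite(buf, 1, got, out) != got) {
            ok = 0;
            break;
        }
    }
    /* Flush so that buffered output is counted in the timing. */
    if (ferror(in) || fflush(out) != 0)
        ok = 0;
    double elapsed = now_seconds() - start;

    free(buf);
    return ok ? elapsed : -1.0;
}

/* Print one timing line per chunk size (sizes chosen arbitrarily). */
void sweep_chunk_sizes(FILE *in, FILE *out)
{
    static const size_t sizes[] = { 1, 16, 256, 4096, 16384, 65536 };
    for (size_t i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        double t = time_chunked_copy(in, out, sizes[i]);
        printf("chunk %6zu bytes: %.3f s\n", sizes[i], t);
    }
}
```

To use it, open the input with fopen(path, "rb") and the output with fopen(path, "wb"), then call sweep_chunk_sizes on the pair. Keep in mind that on a repeated run the operating system may serve the input from its cache, so the first pass over a file can look slower than later ones for reasons unrelated to chunk size.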
The bottom line is that if you can accommodate reading large chunks of data at a time, you’ll get substantial savings. The difference between operating on 16-kilobyte chunks with fread/fwrite and 1-byte chunks with fgetc/fputc is ~9x. fread/fwrite in particular lets you read in chunks of your choosing, and the sweet spot seems to be around 16 KB. Interestingly, fread/fwrite is actually slower than fgetc/fputc for very small chunk sizes.
On the whole, fgets/fputs with a medium-sized buffer does pretty well. But if speed is the primary consideration, fread/fwrite is still the better choice, unless you can guarantee that your files have very long lines in them: fgets stops reading at the newline character, no matter how large the buffer is.
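The effect of the newline rule is easy to see by counting calls. The sketch below uses a made-up helper, count_fgets_calls, that reads a stream to the end and reports how many fgets calls returned data for a given buffer size.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read `in` to EOF with fgets using a buffer of
 * `bufsize` bytes and return the number of calls that returned data,
 * or -1 on allocation or read failure. */
long count_fgets_calls(FILE *in, int bufsize)
{
    assert(bufsize >= 2);
    char *buf = malloc((size_t)bufsize);
    if (buf == NULL)
        return -1;

    long calls = 0;
    /* Each call returns after a newline or after bufsize - 1 characters,
     * whichever comes first. */
    while (fgets(buf, bufsize, in) != NULL)
        calls++;

    int failed = ferror(in);
    free(buf);
    return failed ? -1 : calls;
}
```

For a stream containing the three short lines "a\nb\nc\n", a 4096-byte buffer still takes three calls, one per line. For a single 10-character line with no newline and a 4-byte buffer, it takes four calls, because each call can return at most bufsize - 1 characters. With short lines, then, a bigger fgets buffer does not reduce the number of calls the way a bigger fread chunk does.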
Incidentally, the speedups in the table are about in line with what I observed migrating my Huffman program from single-character I/O to chunk-based I/O.