Disk I/O in C – avoid fgetc/fputc

Most programs don’t do much disk I/O, but for those that do, it is often the bottleneck. I discovered this firsthand while working on my Huffman tree encoding program. On small data files, it doesn’t matter much how you read or write your data, but once file sizes reach hundreds of megabytes, the implementation makes a big difference.

The 3 basic pairs of C functions for reading (and writing) unformatted text files are fgetc/fputc, fgets/fputs, and fread/fwrite. fgetc reads one character at a time, fgets one line at a time (up to the size of the buffer you give it), and fread some number of ‘records’ at a time. Since a record is just a sequence of bytes of a defined length, what fread really lets you do is read data in chunks of (record size * number of records) bytes. The respective put/write functions do the reverse.
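To make the record-size arithmetic concrete, here is a minimal sketch of a chunked copy loop built on fread/fwrite. It is not the code from my Huffman program; the function name copy_chunked and its chunk_size parameter are made up for illustration. The record size is fixed at 1, so the record count is simply the chunk size in bytes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Copy everything from `in` to `out` in chunks of `chunk_size` bytes.
 * Returns the number of bytes copied, or -1 on error.
 * (copy_chunked is a hypothetical helper, not a standard function.) */
long copy_chunked(FILE *in, FILE *out, size_t chunk_size)
{
    assert(in != NULL && out != NULL && chunk_size > 0);

    unsigned char *buf = malloc(chunk_size);
    if (buf == NULL)
        return -1;

    long total = 0;
    size_t n;
    /* Record size 1, record count chunk_size: fread returns the number
     * of records (here, bytes) actually read, which is smaller than
     * chunk_size on the final, partial chunk. */
    while ((n = fread(buf, 1, chunk_size, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            free(buf);
            return -1;
        }
        total += (long)n;
    }

    int read_failed = ferror(in);
    free(buf);
    return read_failed ? -1 : total;
}
```

Calling copy_chunked(in, out, 16384) on two open streams would copy in 16 KB chunks; passing 1 instead gives the 1-byte fread/fwrite case.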

To get a handle on the performance differences, I wrote a small program to read in and write back out a 150 MB file using all 3 methods. On a relatively old Linux system (1.6 GHz, 160 GB ATA/66 drive), here’s what I found. The chunk size is the number of bytes read or written in a single operation.
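For reference, a test like this can be sketched roughly as follows for the fgetc/fputc case. This is an assumption about how one might structure it, not the actual benchmark program: the names copy_bytewise and time_copy_bytewise are hypothetical, and wall-clock timing via C11 timespec_get is my choice here.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Copy `in` to `out` one character at a time.
 * Returns the number of bytes copied, or -1 on error. */
long copy_bytewise(FILE *in, FILE *out)
{
    assert(in != NULL && out != NULL);

    long total = 0;
    int c; /* int, not char, so EOF can be told apart from a data byte */
    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out) == EOF)
            return -1;
        total++;
    }
    return ferror(in) ? -1 : total;
}

/* Wall-clock seconds spent in copy_bytewise, including the final flush
 * of `out`. Returns -1.0 if the copy or the flush fails. */
double time_copy_bytewise(FILE *in, FILE *out)
{
    struct timespec start, end;

    timespec_get(&start, TIME_UTC);
    long copied = copy_bytewise(in, out);
    if (copied < 0 || fflush(out) != 0)
        return -1.0;
    timespec_get(&end, TIME_UTC);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

Timings from a sketch like this depend heavily on whether the input file is already in the operating system’s cache, so repeated runs on the same file are not directly comparable to a cold run.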

Method         Chunk size      Time
fgetc/fputc    1 byte          5.90
fgets/fputs    64 bytes        1.71
fread/fwrite   1 byte          18.37
fread/fwrite   4 bytes         5.22
fread/fwrite   16 bytes        1.88
fread/fwrite   64 bytes        1.06
fread/fwrite   256 bytes       0.79
fread/fwrite   1024 bytes      0.75
fread/fwrite   4096 bytes      0.71
fread/fwrite   16384 bytes     0.64
fread/fwrite   65536 bytes     0.63
fread/fwrite   262144 bytes    0.66

The bottom line is that if you can accommodate reading large chunks of data at a time, you’ll get substantial savings. Going from 1-byte operations with fgetc/fputc (5.90) to 16 KB chunks with fread/fwrite (0.64) is a ~9x speedup. fread/fwrite in particular lets you pick the chunk size, and the sweet spot seems to be around 16 KB; beyond that the times are essentially flat. Interestingly, fread/fwrite with 1-byte chunks is actually about 3x slower than fgetc/fputc (18.37 vs. 5.90).

On the whole, fgets/fputs with a medium-sized buffer does pretty well. If speed is the primary consideration, though, fread/fwrite is still a win unless you can guarantee that your files have very long lines in them: fgets stops reading at the newline character, so short lines mean small chunks no matter how large the buffer is.
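The effect of line length on fgets can be checked with a small sketch like the one below. The helper count_fgets_calls is a hypothetical name, and the 64-byte buffer simply mirrors the fgets/fputs row in the table. It counts how many fgets calls it takes to consume a stream; dividing the bytes read by the number of calls gives the effective chunk size.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Read `in` to the end with fgets and a 64-byte buffer.
 * Returns the number of successful fgets calls and stores the number of
 * bytes read in *bytes. Assumes text input without embedded NUL bytes,
 * since strlen is used to measure each chunk. */
long count_fgets_calls(FILE *in, long *bytes)
{
    assert(in != NULL && bytes != NULL);

    char buf[64];
    long calls = 0;

    *bytes = 0;
    /* Each call returns at most 63 characters (one slot is reserved for
     * the terminating NUL) and stops early right after a newline. */
    while (fgets(buf, sizeof buf, in) != NULL) {
        calls++;
        *bytes += (long)strlen(buf);
    }
    return calls;
}
```

With 4-byte lines such as "abc\n", every call returns only 4 bytes even though the buffer holds 64, whereas a single 630-character line is consumed in 10 calls of 63 characters each.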

Incidentally, the speedups in the table are about in line with what I observed migrating my Huffman program from single-character I/O to chunk-based I/O.
