Disk I/O in C – avoid fgetc/fputc

Most programs don’t do much disk I/O, but for those that do, it is often the bottleneck. I discovered this firsthand working on my Huffman tree encoding program. On small data files, it doesn’t matter much how you read or write your data, but when file sizes get into the hundreds of megabytes, the implementation makes a big difference.

The three basic pairs of C functions for reading and writing unformatted files are fgetc/fputc, fgets/fputs, and fread/fwrite. fgetc reads one character at a time, fgets one line at a time, and fread some number of ‘records’ at a time. Since a record is just a sequence of bytes of a defined length, what fread really lets you do is read data in chunks whose size is record size * number of records. The respective put/write functions do the reverse.
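To make the three styles concrete, here is a minimal sketch of a copy loop written with each pair. The function names are made up for this example, and the buffer sizes (64 bytes for fgets, 16384 bytes for fread) are arbitrary choices, not anything the standard library requires.

```c
#include <stdio.h>

/* Copy one character at a time. Returns 0 on success, -1 on error. */
int copy_fgetc(FILE *in, FILE *out)
{
    int c; /* int rather than char, so EOF can be told apart from data */

    while ((c = fgetc(in)) != EOF) {
        if (fputc(c, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Copy one line (or one buffer-full of a long line) at a time.
 * Only suitable for text: fputs stops at the first NUL byte. */
int copy_fgets(FILE *in, FILE *out)
{
    char line[64]; /* illustrative size */

    while (fgets(line, sizeof line, in) != NULL) {
        if (fputs(line, out) == EOF)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}

/* Copy in chunks of up to sizeof buf bytes: record size 1, many records. */
int copy_fread(FILE *in, FILE *out)
{
    char buf[16384]; /* illustrative size */
    size_t n;

    while ((n = fread(buf, 1, sizeof buf, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n)
            return -1;
    }
    return ferror(in) ? -1 : 0;
}
```

Passing a record size of 1 to fread makes its return value a plain byte count, which is what the final, partially filled chunk needs.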

To get a handle on the performance differences, I wrote a small program to read in and write back out a 150 MB file using all 3 methods. On a relatively old Linux system (1.6 GHz, 160 GB ATA/66 drive), here’s what I found. The chunk size is just the number of bytes read or written in a single operation.
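The test program itself isn’t shown here, but the fread/fwrite part of such a measurement could be sketched roughly as follows. The helper name time_chunked_copy is hypothetical, and clock() is an assumption about how to time it: it reports processor time, which leaves out time spent waiting on the disk, so a wall-clock timer would be a better fit for real measurements.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/*
 * Copy `in` to `out` with fread/fwrite, moving up to `chunk` bytes per call.
 * Returns the elapsed processor time in seconds, or -1.0 on error.
 * Hypothetical helper for illustration; not the original benchmark program.
 */
double time_chunked_copy(FILE *in, FILE *out, size_t chunk)
{
    char *buf;
    size_t n;
    int failed = 0;
    clock_t start, end;

    if (chunk == 0)
        return -1.0;
    buf = malloc(chunk);
    if (buf == NULL)
        return -1.0;

    start = clock();
    while ((n = fread(buf, 1, chunk, in)) > 0) {
        if (fwrite(buf, 1, n, out) != n) {
            failed = 1;
            break;
        }
    }
    /* Flush so buffered output is counted in the measurement. */
    if (fflush(out) == EOF)
        failed = 1;
    end = clock();

    if (ferror(in))
        failed = 1;
    free(buf);
    return failed ? -1.0 : (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling it in a loop over chunk sizes (1, 4, 16, and so on), reopening or rewinding the files between runs, would produce one timing per chunk size.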

Method	Chunk size	Time
fgetc/fputc	1 byte	5.90
fgets/fputs	64 bytes	1.71
fread/fwrite	1 byte	18.37
fread/fwrite	4 bytes	5.22
fread/fwrite	16 bytes	1.88
fread/fwrite	64 bytes	1.06
fread/fwrite	256 bytes	0.79
fread/fwrite	1024 bytes	0.75
fread/fwrite	4096 bytes	0.71
fread/fwrite	16384 bytes	0.64
fread/fwrite	65536 bytes	0.63
fread/fwrite	262144 bytes	0.66

The bottom line is that if you can accommodate reading large chunks of data at a time, you’ll get substantial savings. The difference between operating on 16 KB chunks with fread/fwrite (0.64) and 1-byte chunks with fgetc/fputc (5.90) is ~9x. fread/fwrite in particular lets you read in chunks of your choosing, and the sweet spot seems to be around 16 KB, beyond which the times level off. Interestingly, fread/fwrite is actually slower than fgetc/fputc for very small chunk sizes: at 1 byte per call it takes about three times as long.

On the whole, fgets/fputs with a medium-sized buffer does pretty well. Still, if speed is the primary consideration, fread/fwrite wins unless you can guarantee that your files have very long lines. fgets stops reading at the newline character, so short lines mean small chunks no matter how large the buffer is.
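One way to see the effect of line length is to count how many calls each function needs to consume the same input. The sketch below uses two hypothetical helpers with a 64-byte buffer (an arbitrary choice); it counts calls only and does not measure time.

```c
#include <stdio.h>

/* Count the fgets calls that return data when reading `in` to the end.
 * Each call stops at a newline or after 63 characters, whichever is first. */
long count_fgets_calls(FILE *in)
{
    char buf[64];
    long calls = 0;

    while (fgets(buf, sizeof buf, in) != NULL)
        calls++;
    return calls;
}

/* Count the fread calls that return data when reading `in` to the end.
 * Each call fills the 64-byte buffer regardless of newlines. */
long count_fread_calls(FILE *in)
{
    char buf[64];
    long calls = 0;

    while (fread(buf, 1, sizeof buf, in) > 0)
        calls++;
    return calls;
}
```

On an input of ten 3-byte lines, fgets needs ten calls where fread needs one. On a single 201-byte line, both need four.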

Incidentally, the speedups in the table are about in line with what I observed migrating my Huffman program from single-character I/O to chunk-based I/O.
