Skip to content

LOAD DATA INFILE (and mysqldump)

A colleague of mine has been benchmarking mysqldump data load vs. various versions of LOAD DATA INFILE. He created a sample data as a text file with either 100k or 20M rows of five integers each, the first column of which is the pk.
CODE:
perl -MList::Util=shuffle -e '@k=shuffle(1..20e6);
  for (@k) {
    print $_, "    ", join("    ", map int(rand(1e9)), 0..3), "\n";
}' > loadme_nonpkorder.txt

perl -e 'print ++$i, "    ", join("    ", map int(rand(1e9)), 0..3), "\n" 
  for 1..20e6' > loadme_pkorder.txt

All insertion has been done on empty and new tables. The text files we read at least once before to warm up the OS disk cache. The tables have two non-unique single-column indexes. All happens on a idle-ish DB master with some substantial memory and a NetApp hosting the datadir (via XFS and LVM).

He benchmarked four cases:
  1. Insertion in PK order.
  2. Insertion in PK order, dropping indexes before insertion and re-adding them later.
  3. Insertion in random order.
  4. Insertion in random order, dropping indexes before insertion and re-adding them later.

Summary: The result is not surprising: Both using PK order and dropping/re-adding indexes improves performance considerably. The PK order insertion becomes more and more crucial with a larger dataset (which is not at all surprising if you think about what happens when adding a record to the innodb PK tree).

First the results for the small (100k rows) dataset:

Benchmark: timing 20 iterations of nopk_with_drop, nopk_without_drop, pk_with_drop, pk_without_drop...
  1. nopk_with_drop: 26.4343 wallclock secs
  2. nopk_without_drop: 31.8272 wallclock secs
  3. pk_with_drop: 25.4144 wallclock secs
  4. pk_without_drop: 29.0066 wallclock secs

Note that these are the timings for 20 repetitions. That's a 17% speedup due to dropping/readding the indexes (non-pk order), a 4% speedup due to the PK order insertion (drop/readd case) or a 20% speed-up for both effects.

The PK order will be much more important on the large dataset:

Benchmark: timing 1 iterations of nopk_with_drop, nopk_without_drop, pk_with_drop, pk_without_drop...
  1. nopk_with_drop: 257.328 wallclock secs
  2. nopk_without_drop: 389.941 wallclock secs
  3. pk_with_drop: 141.483 wallclock secs
  4. pk_without_drop: 278.169 wallclock secs

Here, the effect of PK order is 45% speedup. The effect of dropping indexes is on the same order of magnitude. The two effects are also not entirely independent and using both PK order insertion and the drop/re-setup for indexes makes one hell of a difference, cutting over half of the run time.

For comparison: After dumping the large table using mysqldimp, then dropping it table, the reloading of it takes 405 wallclock seconds.

All benchmarks have been done against the InnoDB plugin on MySQL 5.1.49.

Trackbacks

No Trackbacks

Comments

Display comments as Linear | Threaded

Justin Swanhart on :

Do the times without the indexes also reflect the time to add the index back after loading?

Steffen Mueller on :

Yup, the times are all-inclusive. Including dropping the indexes (which is negligible since the table is empty), LOAD, and the re-adding of indexes.

Ralf Engler on :

csv storage engine is much faster, if
you can rely on your import file structure.

Add Comment

E-Mail addresses will not be displayed and will only be used for E-Mail notifications.

To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly.
CAPTCHA

BBCode format allowed