Thursday, November 13, 2008

Why does MySQL Cluster take so long for --initial?

This week we had a very interesting problem at a customer: They complained that their MySQL Cluster takes about 1 hour for an --initial start-up. After some tuning of the hardware and the config.ini they brought it down to around 40 minutes. But this is still unacceptably long...

This sounded a little strange to me. But let us have a look at their config.ini first. It looked more or less like the following (already tuned!):

#
# config.ini
#

[NDB_MGMD DEFAULT]
DataDir = /localdisk/cluster

[NDBD DEFAULT]
DataDir = /localdisk/cluster
NoOfReplicas = 2
LockPagesInMainMemory = 1
DataMemory = 12000M
IndexMemory = 2000M

FragmentLogFileSize = 256M
NoOfFragmentLogFiles = 42

DiskCheckpointSpeedInRestart = 100M
ODirect = 1

BackupMaxWriteSize = 1M
BackupDataBufferSize = 16M
BackupLogBufferSize = 4M
BackupMemory = 20M

[NDB_MGMD]
id = 1
hostname = 192.168.0.1

[NDBD]
id = 10
hostname = 192.168.0.10

[NDBD]
id = 11
hostname = 192.168.0.11

[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]

So the config.ini looks fine. But why does it take so long? Let us have a look at who the evil guy is:

# grep phase ndb_1_cluster.log | grep "Node 10"

2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 1 completed
2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 2 completed (initial start)
2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 3 completed (initial start)
2008-10-24 13:03:03 [MgmSrvr] INFO -- Node 10: Start phase 4 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 5 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 6 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 7 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 8 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 9 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 100 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 101 completed (initial start)

Start phase 4 is the one! But what is start phase 4? According to the MySQL Cluster documentation start phase 4 means:

Phase 4. For an initial start or initial node restart, the redo log files are created. The number of these files is equal to NoOfFragmentLogFiles.

Creating the redo log files is the problem! I have never created a MySQL Cluster with 42 Gbyte of redo log files (42 x 4 x 256 Mbyte) myself, so I have no practical experience of how long it should take.

What I did first was a rough calculation of the value I would expect:

43'008 Mbyte : 50 Mbyte/s = 860 s ≈ 15 minutes

After about 15 minutes I would expect the cluster to be up and running (plus some seconds for the other phases). But what I expected seems not to be the same as what happened... :(

I turned on iostat (by the way: IMHO iostat (from the sysstat package) belongs on every server that does some kind of I/O!):

# iostat -x 1

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.12 18.35 0.00 81.52

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 193.00 0.00 55.00 0.00 2016.00 0.00 1008.00 36.65 2.13 38.55 18.18 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.25 18.88 0.00 80.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 192.00 0.00 48.00 0.00 2088.00 0.00 1044.00 43.50 1.72 35.50 20.83 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 19.12 0.00 80.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 200.00 0.00 47.00 0.00 2088.00 0.00 1044.00 44.43 1.69 35.91 21.28 100.00

OK. As we can see, the I/O device is fully utilized. Good to know. Looks like we have an I/O problem! We do only around 50 w/s, which is really bad (normal disks can do around 150 - 250 w/s), and we write with a throughput of only 1 Mbyte/s (about 20 kbyte per write) instead of 50 Mbyte/s.

What came to my mind is that earlier this year I wrote an article about sparse files. Back then I found that MySQL Cluster writes the redo log files to disk as sparse files. Could this have an impact? But we had never had any complaints about this from other customers...
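
By the way, whether a file really is sparse can be verified by comparing its logical size with the space actually allocated on disk. A minimal Perl sketch of such a check (the script name is made up, it is not part of MySQL Cluster):

#!/usr/bin/perl
#
# sparse_check.pl - report whether a file is sparse
# (hypothetical helper, only for illustration)
#
use strict;
use warnings;

my $file = shift or die "usage: $0 <file>\n";

# stat()[7] is the logical size in bytes, stat()[12] the number of allocated 512-byte blocks
my ($size, $blocks) = (stat($file))[7, 12];
defined $blocks or die "stat $file: $!\n";

printf "%s: logical size %d bytes, allocated %d bytes => %s\n",
    $file, $size, $blocks * 512,
    $blocks * 512 < $size ? "sparse" : "not sparse";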

So what I did next: I tried on completely different hardware from another vendor and it took only 13 minutes! So the problem seems NOT to be (only) MySQL Cluster related! On our machines we also have an NFS mount, so let us try there: even worse (for our local disks): 10 minutes! Then we did a synthetic test on the hardware:

# dd if=/dev/zero of=redo_log.42g count=0 obs=1 seek=42G
# dd if=/dev/zero of=redo_log.42g bs=32K count=1376256 oflag=direct

This ran in 12 minutes! I was confused. Maybe my simulation had to be more complex:

#!/bin/bash

# write 1376256 blocks of 32K one by one (= 42 Gbyte in total),
# one dd call per block
let i=1
while [[ $i -le 1376256 ]] ; do
  dd if=/dev/zero of=big_file.42g bs=32K count=1 seek=$i conv=notrunc 2> /dev/null
  let i=$i+1
done

Still no results that came close to what I had seen from MySQL Cluster...

Let us summarize:

Our servers with MySQL Cluster: 37 minutes
Test servers with MySQL Cluster: 13 minutes
Our servers with dd: 12 minutes
Our servers with NFS: 10 minutes

This does not make sense to me. I need the PERFECT simulation!

To learn more about how the redo log files are created, I ran a trace on the ndbd process during their creation:

# strace -o ndbd.trace -ff -s 128 -p <pid of ndbd>

As a result I got this:

open("/localdisk/cluster/ndb_11_fs/D10/DBLQH/S3.FragLog", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0666) = 17
lseek(17, 268402688, SEEK_SET) = 268402688 # 256M - 32k
write(17, "..."..., 32768) = 32768
fsync(17) = 0
lseek(17, 0, SEEK_SET) = 0
write(17, "..."..., 32768) = 32768
fsync(17) = 0
lseek(17, 1048576, SEEK_SET) = 1048576
write(17, ""..., 32768) = 32768
fsync(17) = 0
lseek(17, 2097152, SEEK_SET) = 2097152
write(17, ""..., 32768)= 32768
fsync(17) = 0

...
lseek(17, 267386880, SEEK_SET) = 267386880
write(17, ""..., 32768) = 32768
fsync(17) = 0
close(17) = 0

OK. I see, my simulation up to now was not yet good enough... I needed a Perl script to simulate this more exactly. You can find the script here: cluster_initial_test.pl.
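
For those who do not want to download the script, here is a rough sketch of the write pattern it simulates, as seen in the strace above: one 32k block at the end of the 256 Mbyte file, then one 32k block every Mbyte, each followed by an fsync. This is only a simplified illustration of one file (no O_DIRECT, see the addendum below); the real cluster_initial_test.pl may differ in its details:

#!/usr/bin/perl
#
# redo_log_sketch.pl (made-up name) - simplified sketch of the redo log file
# creation pattern seen in the strace: write one 32k block at the end of the
# 256M file, then one 32k block per Mbyte from the beginning, with an fsync
# after every single write.
#
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT O_TRUNC SEEK_SET);
use IO::Handle;                         # for $fh->sync (fsync)

my $file_size  = 256 * 1024 * 1024;     # FragmentLogFileSize = 256M
my $block_size = 32 * 1024;             # 32k per write
my $stride     = 1024 * 1024;           # one write per Mbyte
my $block      = "\0" x $block_size;

my $path = shift or die "usage: $0 <fraglog-file>\n";
sysopen(my $fh, $path, O_RDWR|O_CREAT|O_TRUNC, 0666) or die "sysopen $path: $!";

# last block first: this makes the file its full (sparse) size of 256M
sysseek($fh, $file_size - $block_size, SEEK_SET) or die "sysseek: $!";
syswrite($fh, $block) == $block_size or die "syswrite: $!";
$fh->sync or die "fsync: $!";

# then one 32k block per Mbyte, each followed by an fsync
for (my $offset = 0; $offset < $file_size; $offset += $stride) {
    sysseek($fh, $offset, SEEK_SET) or die "sysseek: $!";
    syswrite($fh, $block) == $block_size or die "syswrite: $!";
    $fh->sync or die "fsync: $!";
}

close($fh) or die "close: $!";

An --initial start has to create 42 x 4 such files per data node (NoOfFragmentLogFiles x 4 x FragmentLogFileSize), so the simulation has to repeat this for all of them.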

Before I ran the script I was pretty nervous: Would it work? But after a few minutes it was already clear: it had the desired effect. After around 1/3 of the redo log files had been created I extrapolated a total time of about 59 minutes, and when it finished it had taken around 50 minutes. Perfect!

After I came back from a short jig I started thinking about why it took longer than the 40 minutes... What I can see from the system statistics is that the I/O system is more relaxed with my script than with the original MySQL Cluster --initial restart.

# iostat -x 1

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 161.39 0.00 49.50 0.00 1798.02 0.00 899.01 36.32 1.05 20.40 14.32 70.89
sda 0.00 176.00 0.00 51.00 0.00 1920.00 0.00 960.00 37.65 1.07 21.18 14.04 71.60
sda 0.00 170.71 0.00 46.46 0.00 1882.83 0.00 941.41 40.52 0.99 21.22 15.74 73.13

So it is about 25-33% less utilized. MySQL Cluster creates the redo log files with 4 threads in parallel, which for me is a sufficiently exact explanation for the moment of why it takes longer.
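
To come a bit closer to what ndbd does, one could run the single-file sketch from above for several files in parallel, for example with a small wrapper like this (the file names are just examples):

#!/usr/bin/perl
#
# Run the single-file sketch (redo_log_sketch.pl, made-up name) for 4 files
# in parallel, similar to the 4 threads ndbd uses to create the redo log files.
#
use strict;
use warnings;

my @files = map { "S$_.FragLog" } 0 .. 3;   # example file names
my @pids;

for my $file (@files) {
    my $pid = fork();
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        # child: create one fragment log file
        exec('perl', 'redo_log_sketch.pl', $file) or die "exec: $!";
    }
    push @pids, $pid;
}

# wait for all children to finish
waitpid($_, 0) for @pids;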

We are now able to decouple the hardware problem from the software product. I cannot really help the hardware guys to solve their problem, but at least we were able to give them a simulation of it, so they can run their tests without us.

On the other hand, MySQL Cluster is doing exactly what I always tell my customers not to do: "COMMIT after every INSERT". This is the worst thing you can do to an I/O system. MySQL Cluster does a sync to disk after every 32k block (1 block per Mbyte).
So I was talking to Jonas, one of our MySQL Cluster developers, and told him the whole story. He just asked me: "Is 14:00 early enough for the patch?"

Before 14:00 the patch was already in my mailbox. We applied it, tested it and...

5 minutes 12 seconds!

So I think now everybody should be happy?

Addendum:

  • The Perl script could be improved by opening the file with O_DIRECT:
sysopen(HANDLE, $path, O_RDWR|O_CREAT|O_EXCL|O_DIRECT) or die "sysopen $path: $!";
  • Maybe the problem could also be simulated much more easily with dd, like this:
# dd if=/dev/zero of=redo_log.42g bs=1M count=43008 oflag=direct conv=fsync

4 comments:

Anonymous said...

Will this performance improvement patch make it into the normal releases or will it be available for Enterprise customers only?

Anonymous said...

Thanks for sharing this. This is a very good read!

Hope the patch made it into the production code.

Shinguz said...

Hey guys,

Have a look at the addendum. I have updated it.

Regards,
Oli

Jonas Oreland said...

FYI: I also added a new config parameter in 6.3.19 that makes the redo log files fully initialized:

InitFragmentLogFiles=full