Tuesday, December 23, 2008

MySQL licenses for dummies

The following summary shows my personal understanding of MySQL 5.1 licenses, packages and products. It does not necessarily reflect 100% the way MySQL understands it, but after all the discussions I hope it is as close to reality as possible:

MySQL Embedded Database Server (Download: enterprise.mysql.com -> OEM Software)
Classic (OEM license, -MEM -InnoDB)
Pro (= Classic +InnoDB)
Advanced (= Pro +Partitioning)

MySQL Community Server (Download: www.mysql.com -> Downloads)
Community (GPL, -NDB)

MySQL Enterprise Server (Download: enterprise.mysql.com -> Enterprise Software)
Pro (GPL or commercial, -NDB +InnoDB +MEM, Basic + Silver customer, MRU + QSP)
Advanced (= Pro +Partitioning, Gold + Platinum customer)

MySQL Cluster (Download: http://dev.mysql.com/downloads/cluster/)
Community Edition (GPL, all features)
Com (ex CGE?) (OEM or commercial, -InnoDB +NDB)
Com-Pro (Com, all features)
Standard Edition (= Com, -NDB-API -Cluster-Repl, -LDAP)

Upgrade

EP customers should follow the QSP trail unless it is critical for them to install an MRU to get a quick bug fix to hold them over until the next QSP is released.

Month  Version / release
0      5.1.30
1      5.1.30-MRU1
2      5.1.30-MRU2
3      5.1.31
4      5.1.31-MRU1 and 5.1.30-QSP
5      5.1.31-MRU2
6      5.1.32
7      5.1.32-MRU1 and 5.1.31-QSP

Legend

CE  - Community Edition
EP - Enterprise Edition (why not EE?)
MRU - Monthly Rapid Update (EP only)
QSP - Quarterly Service Pack (EP only)
OEM - Original Equipment Manufacturer
MEM - MySQL Enterprise Monitor
CGE - Carrier Grade Edition
Please correct me if I am wrong. And if you have more questions, let me know and I will try to clarify them.

Thursday, November 13, 2008

Why does MySQL Cluster take so long for --initial?

This week we had a very interesting problem at a customer: They complained that their MySQL Cluster takes about 1 hour for an --initial start-up. After some tuning of the hardware and the config.ini they brought it down to around 40 minutes. But this is still unacceptably long...

This sounds a little strange to me. But let us have a look at their config.ini first. It looked more or less like the following (which is already tuned!):

#
# config.ini
#

[NDB_MGMD DEFAULT]
DataDir = /localdisk/cluster

[NDBD DEFAULT]
DataDir = /localdisk/cluster
NoOfReplicas = 2
LockPagesInMainMemory = 1
DataMemory = 12000M
IndexMemory = 2000M

FragmentLogFileSize = 256M
NoOfFragmentLogFiles = 42

DiskCheckpointSpeedInRestart = 100M
ODirect = 1

BackupMaxWriteSize = 1M
BackupDataBufferSize = 16M
BackupLogBufferSize = 4M
BackupMemory = 20M

[NDB_MGMD]
id = 1
hostname = 192.168.0.1

[NDBD]
id = 10
hostname = 192.168.0.10

[NDBD]
id = 11
hostname = 192.168.0.11

[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]

So the config.ini looks fine. But why does it take so long? Let us have a look at who the culprit is:

# grep phase ndb_1_cluster.log | grep "Node 10"

2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 1 completed
2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 2 completed (initial start)
2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 3 completed (initial start)
2008-10-24 13:03:03 [MgmSrvr] INFO -- Node 10: Start phase 4 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 5 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 6 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 7 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 8 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 9 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 100 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 101 completed (initial start)

Start phase 4 is the one! But what is start phase 4? According to the MySQL Cluster documentation start phase 4 means:

Phase 4. For an initial start or initial node restart, the redo log files are created. The number of these files is equal to NoOfFragmentLogFiles.

Creating the redo log files is the problem! I have never created a MySQL Cluster with 42 GByte of redo log files (42 x 4 x 256 MByte) myself, so I have no practical experience of how long it should take.

What I did first was a rough calculation of the value I would expect:

43'008 Mbyte : 50 Mbyte/s = 860 s = 15 minutes

After 15 minutes I would expect the cluster to be up and running (plus some seconds for the other phases). What I expect seems not to be the same as what happens... :(

I turned on iostat (by the way: IMHO iostat (from the sysstat package) belongs on every server which does any kind of I/O!):

# iostat -x 1

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.12 18.35 0.00 81.52

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 193.00 0.00 55.00 0.00 2016.00 0.00 1008.00 36.65 2.13 38.55 18.18 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.25 18.88 0.00 80.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 192.00 0.00 48.00 0.00 2088.00 0.00 1044.00 43.50 1.72 35.50 20.83 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 19.12 0.00 80.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 200.00 0.00 47.00 0.00 2088.00 0.00 1044.00 44.43 1.69 35.91 21.28 100.00

OK. As we can see, the I/O device is fully utilized. Good to know. Looks like we have an I/O problem! We do only around 50 w/s, which is really bad (normal disks can do around 150 - 250 w/s), and we write with a throughput of about 1 MByte/s (roughly 20 KByte per write) instead of 50 MByte/s.

What came to my mind is that earlier this year I wrote an article about sparse files. I found then that MySQL Cluster writes the redo log files as sparse files to disk. Could this have an impact? But we have never had any complaints about this from other customers...

So what I did next: I tried on completely different hardware from another vendor and it took only 13 minutes! So the problem seems NOT to be related to MySQL Cluster alone! On our machines we also have an NFS mount. So let us try there: Even worse: 10 minutes! Then we did a synthetic test on the hardware:

# dd if=/dev/zero of=redo_log.42g count=0 obs=1 seek=42G
# dd if=/dev/zero of=redo_log.42g bs=32K count=1376256 oflag=direct

This was running in 12 minutes! I was confused. Maybe my simulation has to be more complex:

#!/bin/bash

let i=1
while [[ $i -le 1376256 ]] ; do
    # write one 32 KByte block per iteration at an increasing offset
    dd if=/dev/zero of=big_file.42g bs=32K count=1 seek=$i conv=notrunc 2>/dev/null
    let i=$i+1
done

No results came close to what I had seen from MySQL Cluster...

Let us summarize:

Our servers with MySQL Cluster: 37 minutes
Test servers with MySQL Cluster: 13 minutes
Our servers with dd: 12 minutes
Our servers with NFS: 10 minutes

This does not make sense to me. I need the PERFECT simulation!

To learn more about how the redo log files are created, I ran a trace on the ndbd process during their creation:

# strace -o ndbd.trace -ff -s 128 -p <pid>

As a result I got this:

open("/localdisk/cluster/ndb_11_fs/D10/DBLQH/S3.FragLog", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0666) = 17
lseek(17, 268402688, SEEK_SET) = 268402688 # 256M - 32k
write(17, "..."..., 32768) = 32768
fsync(17) = 0
lseek(17, 0, SEEK_SET) = 0
write(17, "..."..., 32768) = 32768
fsync(17) = 0
lseek(17, 1048576, SEEK_SET) = 1048576
write(17, ""..., 32768) = 32768
fsync(17) = 0
lseek(17, 2097152, SEEK_SET) = 2097152
write(17, ""..., 32768) = 32768
fsync(17) = 0

...
lseek(17, 267386880, SEEK_SET) = 267386880
write(17, ""..., 32768) = 32768
fsync(17) = 0
close(17) = 0

OK. I see, my simulation was not yet good enough... I need a Perl script to simulate this more exactly. You can find the script here: cluster_initial_test.pl.
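
The Perl script itself is linked above. Just to illustrate the I/O pattern it has to reproduce (one 32 KByte write per MByte, each followed by an fsync, as seen in the strace above), here is a rough bash sketch for a single 256 MByte fragment log file; the file name and the use of dd's notrunc/dsync flags are my assumptions:

#!/bin/bash
# Sketch only: mimic the redo log creation pattern for ONE 256 MByte file,
# one 32 KByte block per MByte, each write synced to disk immediately.
FILE=S3.FragLog
MB=256

# pre-create the file at its full size (sparse), like the initial lseek/write
dd if=/dev/zero of=$FILE bs=1 count=0 seek=${MB}M 2>/dev/null

let i=0
while [[ $i -lt $MB ]] ; do
    # one 32 KByte write at offset i MByte, synced right after the write
    dd if=/dev/zero of=$FILE bs=32K count=1 seek=$((i * 32)) \
       conv=notrunc oflag=dsync 2>/dev/null
    let i=$i+1
done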

Before I ran the script I was pretty nervous: Would it work? But after a few minutes it was already clear: It had the desired effect. After around 1/3 of the redo log files were created I estimated the total run time at 59 minutes, and when it finished it had taken around 50 minutes. Perfect!

After I came back from a short jig I started wondering why it took longer than the 40 minutes... What I can see from the system statistics is that the I/O system is more relaxed with my script than with the original MySQL Cluster --initial restart.

# iostat -x 1

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 161.39 0.00 49.50 0.00 1798.02 0.00 899.01 36.32 1.05 20.40 14.32 70.89
sda 0.00 176.00 0.00 51.00 0.00 1920.00 0.00 960.00 37.65 1.07 21.18 14.04 71.60
sda 0.00 170.71 0.00 46.46 0.00 1882.83 0.00 941.41 40.52 0.99 21.22 15.74 73.13

So it is about 25-33% less utilized. MySQL Cluster creates the redo log files with 4 threads in parallel. For the moment this is a sufficiently exact explanation for me of why it takes longer.

We are now able to decouple the hardware problem from the software product. I cannot really help the hardware guys solve their problem, but at least we were able to give them a simulation of it, so they can run the tests without us.

On the other hand MySQL Cluster is doing exactly what I always tell my customers not to do: "COMMIT after every INSERT". This is the worst thing you can do to an I/O system. MySQL Cluster does a sync to disk after every 32 KByte block (one block per MByte).
So I was talking to Jonas, one of our MySQL Cluster developers, and told him the whole story. He just asked me: "Is 14:00 early enough for the patch?"

Before 14:00 I already had the patch in my mailbox. We applied it, tested it and...

5 minutes 12 seconds!

So I think now everybody should be happy?

Addendum:

  • The Perl script could be improved by using:
sysopen(HANDLE, $path, O_RDWR|O_CREAT|O_EXCL|O_DIRECT) or die "sysopen $path: $!";
  • Maybe the problem could also be simulated much more easily with dd like this:
# dd if=/dev/zero of=redo_log.42g bs=1M count=43008 oflag=direct conv=fsync

Monday, October 13, 2008

MySQL Multi-Master – Single-Slave – Replication (Episode 2)

Introduction

One of the features that make MySQL so great is its easy replication set-up. If you are experienced and know how to do it, it takes you about 15 minutes to set up a slave. What you have in the end is a replication from one master to one or several slaves. So you can build a top-down data stream pyramid and spread your data over many slaves.


From time to time some customers ask for the other way around: many masters replicating to one slave (which is also called multi-source replication). For this requirement MySQL replication cannot help you directly.

Possibilities

You can circumvent this situation in the following ways:

  1. Implement your own data transfer mechanism.

  2. Use this ugly hack I have written down 2 years ago: Multi-Master-Single-Slave-Replication

  3. Test the approach described in the following article.

  4. Wait until MySQL has implemented it...


Possibility number 4 would be the nicest one, because then the solution would be properly supported by your database vendor and you would not have to care much about problems, as long as you have a support contract.

When we look at the MySQL work log and search for replication we find a work log entry with the title: WL#1697: Multi-source replication. But for whatever reason it has the following status: Affects: Server-7.0 — Status: On-Hold — Priority: Low. :-(

What can we do now:
a) Resign.
b) Become an important customer (by paying a lot of money), complain about the priority and, if nothing changes, escalate it to the top management of MySQL.
c) Find many other fellow sufferers, unite and make your database vendor's management aware of your desire.
d) Help yourself (it is, after all, an Open Source product...).

Possibilities number 1 to 3 from above are some kind of Help yourself.

Help yourself

Because I am not a programmer and I have no clue about programming, possibility number 1 is out of the question for me. But I am sure there are many other MySQL users out there in the world who would appreciate your effort.
Possibility number 2 is a quick and ugly hack but may work in some situations.
And number 3 I was pointed to by a user called Kwame, who wrote me an email (thanks Kwame for the cool hint!).

A possible solution

One and a half years ago I wrote a little article about MySQL Active - Active Clustering, because customers kept asking for a replacement for a well known but very expensive product from another big database vendor.
Peter Zaitsev was not very happy with its content: MySQL MyISAM Active Active Clustering - looking for trouble?. But the comments on that critique already hinted at the solution: When you combine MySQL Active-Active Clustering with MySQL replication you can finally get a many-master single-slave replication!

So what I did was the following:

I have 3 servers which act as masters, and on the 4th server I install 3 MySQL instances (mysqld) running on the same datadir. Please make sure that you replicate ONLY ONE SCHEMA per master-slave pair!


As configuration file for my slaves I have used something like this:

#
# my.cnf
#

[mysqld]

port = 3308
socket = /home/mysql/tmp/mysql-3308.sock
pid-file = /home/mysql/data/mysqld5127/mysql-3308.pid
datadir = /home/mysql/data/mysqld5127

skip-innodb
external-locking
log-error = /home/mysql/data/mysqld5127/error_4.log

server_id = 4
master-info-file = /home/mysql/data/mysqld5127/master.info
relay-log = /home/mysql/data/mysqld5127/mysql-3308-relay-bin
relay-log-index = /home/mysql/data/mysqld5127/mysql-3308-relay-bin
relay-log-info-file = /home/mysql/data/mysqld5127/relay-log.info
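
Each of the three slave instances is then pointed at its own master. A minimal sketch of this step (the host names, the ports 3308-3310, the socket paths and the replication credentials are assumptions; for the initial set-up you would of course also provide MASTER_LOG_FILE/MASTER_LOG_POS as usual):

# point every local slave instance at its own master
for i in 1 2 3 ; do
    PORT=$((3307 + $i))
    mysql --socket=/home/mysql/tmp/mysql-$PORT.sock -e "
        CHANGE MASTER TO MASTER_HOST='master$i', MASTER_PORT=3306,
                         MASTER_USER='replication', MASTER_PASSWORD='replication';
        START SLAVE;"
done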

Requirements

Make sure that:

  • You only write to one schema per master (schema_a in master 1, schema_b in master 2 and so on...)

  • You comply with all the requirements described in the document MySQL Active - Active Clustering

  • You flush the tables on ALL slaves before reading.
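
This last point can easily be scripted, for example like this (the socket paths are assumptions derived from the example my.cnf above):

# flush the tables on every local slave instance before reading
for SOCKET in /home/mysql/tmp/mysql-3308.sock \
              /home/mysql/tmp/mysql-3309.sock \
              /home/mysql/tmp/mysql-3310.sock ; do
    mysql --socket=$SOCKET -e "FLUSH TABLES;"
done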


Very simple first tests showed that it is possible to "aggregate" the data from many masters into a single slave. But please verify this proof of concept very carefully before you deploy it to your production system with your real data. This approach is neither recommended nor supported by MySQL.

Drawback

A further replication from this slave system to other slaves is not possible (or at least no feasible solution comes to my mind), unless you use MySQL Cluster as the "aggregator". Then it could be possible...

Thursday, October 9, 2008

Test application for MySQL high availability (HA) set-up

When I set up a MySQL HA environment for customers I usually do some final fail-over tests after configuring the whole beast.

To check whether the application behaves as expected I always run my little test application (test.sh) from the server(s) where the customer's application runs. It displays "graphically" how the application behaves, and you can show the customer immediately what is going on...

Make sure that you point it at the VIP (virtual IP) or the LB (load balancer).
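
The script itself is linked above. Just to give an idea, a minimal loop in the same spirit could look like this (the VIP address, the credentials and the interval are assumptions):

#!/bin/bash
# poll the database through the VIP once a second and print one character
# per attempt: "." for success, "x" for failure
VIP=192.168.0.100

while true ; do
    if mysql -h $VIP -u app -papp -e "SELECT 1;" > /dev/null 2>&1 ; then
        echo -n "."
    else
        echo -n "x"
    fi
    sleep 1
done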

It was really useful for me, and I recommend that you do your HA fail-over tests at least with this little tool to avoid evil surprises in the future.

Thursday, September 25, 2008

Citation of the week

"Das dreieckige Rad hat gegenüber dem viereckigen einen gewaltigen Vorteil: Ein Rumms weniger pro Umdrehung!"

Translation:
"The triangular wheel has one enormous advantage over the quadrangular: One knock less per revolution!"

Maybe not new, but I had not heard it before and I love it. It was about reinventing functionality in a well-known product...

Tuesday, September 23, 2008

MySQL Cluster: No more room in index file

Recently we were migrating an InnoDB/MyISAM schema to NDB. I was too lazy to calculate all the needed MySQL Cluster parameters (for example with ndb_size.pl) and just took my default config.ini template.
Because I am really lazy I have a little script doing this for me (alter_engine.sh).
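
The real script is linked above; the core idea is roughly the following sketch (the schema name and the connection options are assumptions):

#!/bin/bash
# generate and execute an ALTER TABLE ... ENGINE=NDBCLUSTER statement for
# every MyISAM/InnoDB table of one schema
SCHEMA=test

mysql -N -B -e "SELECT CONCAT('ALTER TABLE ', table_schema, '.', table_name,
                              ' ENGINE=NDBCLUSTER;')
                  FROM information_schema.tables
                 WHERE table_schema = '$SCHEMA'
                   AND engine IN ('MyISAM', 'InnoDB');" | mysql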

But my euphoria was stopped abruptly by the following error:

MySQL error code 136: No more room in index file

The usual command that helps me in such a situation is as follows:

# perror 136
MySQL error code 136: No more room in index file

But in this case it is not really helpful. Also

# perror --ndb 136

does not bring us any further. Strange: index file... We are converting from MyISAM/InnoDB to NDB. Why the hell is it using an index file for this operation? It clearly seems to be a mysqld error message and not a MySQL Cluster error message. And we are not using MySQL Cluster disk data tables either.

After bothering MySQL support a bit I had the idea to do the following:

# ndb_show_tables | grep -ic orderedindex
127

The MySQL online documentation clearly states:

MaxNoOfOrderedIndexes
...
The default value of this parameter is 128.

So this could be the reason! After I changed this parameter, followed by the usual rolling restart of the MySQL Cluster, I could continue migrating my schema into the cluster...
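
In the config.ini this looks for example like this (the value 512 is just an assumption, size it according to your schema, for example with ndb_size.pl):

# config.ini (excerpt)
[NDBD DEFAULT]
MaxNoOfOrderedIndexes = 512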

Conclusion
MySQL errors can be related to cluster errors and do not necessarily point to the source of the problem. The error:

MySQL error code 136: No more room in index file


simply means that MaxNoOfOrderedIndexes is too small!


I hope I can save you some time with this little article.

Possible memory leak in NDB-API applications?

A customer recently experienced a possible memory leak in his NDB-API application. What he did was something like
# ps aux | grep <pid>

over time, and he saw the RSS increasing. Had he looked a little longer, he would have seen that the RSS consumption increases up to a certain level and then becomes stable, which is the expected behaviour.


But how do you explain to the customer that his application, which was in fact not doing anything, consumes more and more RSS?
With a diff over time on /proc/<pid>/smaps we found that this area was the reason:
b67b7000-b6fca000 rw-p b67b7000 00:00 0 (8 Mbyte)
Size: 8268 kB
Rss: 148 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
Private_Dirty: 148 kB
Referenced: 148 kB

But what does this mean? To find the answer we did a strace on the program and got the following system calls:
...
read(5, "127.0.0.1 localhost\n\n# The follo"..., 4096) = 450
close(5) = 0
munmap(0xb7acb000, 4096) = 0
mmap2(NULL, 2117632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb69bf000 - 0xb6bc4000 (2068 kByte)
mmap2(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb67be000 - 0xb69bf000 (2052 kByte)
mmap2(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ac4000
mprotect(0xb7ac4000, 4096, PROT_NONE) = 0
clone(child_stack=0xb7acb4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_C
...

OK. Somebody is allocating 2 times 2 chunks of about 2 MByte of memory. But what the hell could this be??? During the night I found the solution: It is the SendBufferMemory and the ReceiveBufferMemory, which I have configured in the config.ini to exactly that size...

When you experience similar behaviour on your processes, maybe this little script can help you to find the problem: mem_tracker.sh
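
The script is linked above; the core of the idea is roughly this (the interval and the output format are assumptions):

#!/bin/bash
# track the RSS of one process over time by summing up the Rss lines of
# /proc/<pid>/smaps; usage: ./mem_tracker.sh <pid>
PID=$1

while [ -d /proc/$PID ] ; do
    RSS=$(awk '/^Rss:/ { sum += $2 } END { print sum }' /proc/$PID/smaps)
    echo "$(date '+%Y-%m-%d %H:%M:%S')  Rss: $RSS kB"
    sleep 10
done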

By the way, with another customer we found some other nice behaviour. But this time it was a mysqld...

Friday, September 5, 2008

Active/active fail over cluster with MySQL Replication

Electing a slave as new master and aligning the other slaves to the new master

In a simple MySQL Replication set-up you have high availability (HA) on the read side (r). But for the master, which covers all the writes (w) and the time-critical reads (rt), there is no HA implemented. For some situations this can be OK, for example if you rarely have writes or if you can wait until a new master is set up.

But in other cases you need a fast fail-over to a new master.

The following article shows how to implement the election of a new master and how to align the other slaves to the new master.

We can have two possible scenarios:
  1. This scenario assumes that every slave can become the new master.
  2. This scenario assumes that only one dedicated slave will become the new master.

The advantages and disadvantages of both scenarios:


Scenario 1
+ You can choose the slave which is the most up to date.
+ You do not need an extra spare slave.
- Higher possibility of errors if not automated.
- More binlog writing on all slaves.

Scenario 2
+ You do not have to choose the new master, you have already defined it beforehand.
- You might not end up with the slave that has the most recent data applied.

Important: All the slaves which can become master have to run with log-bin and log-slave-updates enabled.

Electing a slave to become the new master
Scenario 1: Compare the output of SHOW SLAVE STATUS and decide which slave will become the new master.
Scenario 2: Not necessary because it was already decided beforehand.

Aligning the other slaves to the new master
The officially recommended way to set up replication again when the master fails is as follows:
  1. Set-up the new master (is skipped in our case because a slave becomes master).
  2. Do a consistent backup of the master (which takes time and, depending on the used storage engines, blocks writing).
  3. Set-up all slaves one by one and point them to the new master (takes also time).
During these steps your production environment runs with only limited resources.
To avoid or at least reduce this problem we are looking for a shortcut for the whole process:

Step 1: Is obsolete in our scenario.
Step 2: Can be circumvented when we use a storage engine which allows us to make consistent backups (for example InnoDB) or when we use a very fast backup method (for example LVM snapshots).
Step 3: We can re-use all the slaves which have the same or older information than the newly elected master. Slaves which have newer information, or slaves in some other exceptional cases (see below), have to be set up as recommended anyway.

How to do this?
IMHO it is best to show this in a little demo. For this I have set up an environment like in scenario 1 and/or 2. There I have created my favourite table test as follows:
CREATE TABLE test (
`id` int(11) NOT NULL AUTO_INCREMENT,
`data` varchar(32) DEFAULT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`));
As the next step we simulate the application as follows:
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
To simulate a lag on one or more of the slaves we stop the replication there:
STOP SLAVE;
Then we do some more application activity on the master:
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
From this point on we have the situation which could happen in the real world: the master has crashed and different slaves (can) have different positions relative to the master:

Find the most recent slave
To find the slave which is the most recent one you have to gather some information on the slaves. This can be done as follows:
mysql> PAGER grep Master_Log;
mysql> SHOW SLAVE STATUS\G
mysql> PAGER ;
You also have to do this step in scenario 2 because we want to know which slaves can be kept and which ones have to be set up from scratch.
Then we get some output for example like this:
Slave 1:
Master_Log_File: bin-3311.000006
Read_Master_Log_Pos: 929
Relay_Master_Log_File: bin-3311.000006
Exec_Master_Log_Pos: 929
Slave 2:
Master_Log_File: bin-3311.000006
Read_Master_Log_Pos: 635
Relay_Master_Log_File: bin-3311.000006
Exec_Master_Log_Pos: 635
What we have to ensure first is that all the slaves have caught up with applying the data from the relay log. This is verified by comparing Master_Log_File/Read_Master_Log_Pos with Relay_Master_Log_File/Exec_Master_Log_Pos. If these values are the same, then the slave has caught up. Otherwise wait until they become the same.
When this is done we have to find which slave is the most recent one. This is simple: a higher value means newer information (also consider the log file, not only the position!).
In scenario 1 this one (or one of these) is elected as the new master.
In our scenario this is Slave 1!
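
This gathering can also be scripted, for example like this (the host names, the port and the credentials are assumptions):

# show the relevant replication coordinates of every slave
for SLAVE in slave1 slave2 ; do
    echo "--- $SLAVE"
    mysql -h $SLAVE -P 3306 -u root -e "SHOW SLAVE STATUS\G" \
      | grep -E 'Master_Log_File|Master_Log_Pos'
done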

In scenario 2 all slaves which are newer than the pre-elected new master must be rebuilt from the new master.
Slave 1 is newer than slave 2. If slave 2 was pre-elected as the new master, slave 1 must be rebuilt from the new master.
For all the slaves which have a different position than the new master, calculate the delta to the new master:

Calculate delta: 929 (file 000006) - 635 (file 000006) = 294

When the log file is different we cannot use this information and we have to rebuild that slave from the new master.
Now we have defined which one will become the new master and which slaves are in line with, which are ahead of and which are behind the new master.

Set up the new environment
To avoid any trouble we do a STOP SLAVE on all slaves first.
Then we do a RESET SLAVE on the new master.
Now for every slave which is not rebuilt from the new master we have to calculate the position to start the replication from. To do this we have to gather the current position of the new master:
SHOW MASTER STATUS;
+-----------------+----------+--------------+------------------+
| File | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+----------+--------------+------------------+
| bin-3312.000002 | 2857 | | |
+-----------------+----------+--------------+------------------+
And for every slave we can calculate its new starting position by applying the delta:

==> 2857 - 294 = 2563

When the value becomes negative, this means that we would have to start in an older log file than the current one. I did not find any rule to calculate the exact position in this case, so unfortunately we also have to set up these slaves from the backup.

As soon as we have calculated these values we can point the application at the new master, and we can also start now with the new consistent backup for all the slaves we have to set up again from the backup.
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
On the slaves which are OK for aligning with the new master we now have to change the master and set the new position:
CHANGE MASTER TO MASTER_HOST='laptop', MASTER_USER='replication', MASTER_PORT=3312, MASTER_PASSWORD='replication';
CHANGE MASTER TO MASTER_LOG_FILE='bin-3312.000002', MASTER_LOG_POS=2563;
START SLAVE;
That's it!
If you would like to hear more about such stuff please let me know. We are glad to help you with some consulting...
I also have most of this stuff in some scripts, so this could easily be automated...

Monday, August 25, 2008

Typical automated MySQL maintenance jobs

The following maintenance jobs are typically run against a MySQL database:

  • “Backup”

  • Clean-up binary logs.

  • Optimize tables

  • Purge query cache

  • Rotate binary logs

Backup

A backup is not a typical maintenance job. But it behaves more or less like one. The backup should be done regularly depending on the restore/PITR (Point in Time Recovery) requirements.

Make sure that all the necessary files (data files, transaction log files, configuration files and binary log files) are included in the backup. To prove that the backup process is working properly, a restore should be performed regularly. This can ideally be combined with the set-up of new database instances for developers or for testing.

Clean-up the binary logs

The binary logs can be cleaned-up in two ways:

a) Passively, by MySQL itself:


# my.cnf

expire_logs_days = 7


b) Actively, by the customer's environment:

mysql> PURGE MASTER LOGS TO 'binarylog.000999';

mysql> PURGE MASTER LOGS BEFORE '2008-07-29 22:46:26';


Make sure NO binary logs are purged which are still needed by a slave. Otherwise the slave is lost and has to be set up from scratch.
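
Which binary logs are still needed can be checked on the slaves before purging, for example like this (the host names and the credentials are assumptions):

# find out which binary log each slave is currently reading from the master;
# purge only up to the OLDEST file reported here
for SLAVE in slave1 slave2 ; do
    mysql -h $SLAVE -e "SHOW SLAVE STATUS\G" | grep ' Master_Log_File:'
done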

Also make sure binary logs are not removed by a file system operation (rm bin-log.*). Otherwise the database gets confused.

Optimize table

After large UPDATE or INSERT/DELETE operations, or simply after a long time, tables become bloated and contain a lot of unused space. This unused space can be partially reclaimed by optimizing the table again.



mysql> OPTIMIZE TABLE <table_name>;


This operation internally copies the whole table and therefore can take a long time!

Purge query cache

When there are SELECT queries with result sets of different sizes, the query cache gets fragmented. This shows up as a lot of free space in the query cache but also a lot of queries not being cached. Here it makes sense to defragment the query cache from time to time.


mysql> FLUSH QUERY CACHE;


Binlog rotate

Binary logs are only rotated automatically by size. Sometimes you want to have them rotated by time as well. You can do this as follows (for example with a cron job):



mysql> FLUSH LOGS;
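
Such a cron job could look like the following sketch (the schedule and the credential handling via a ~/.my.cnf are assumptions):

# crontab entry: rotate the binary logs every night at 03:00
0 3 * * * mysql --defaults-extra-file=/home/mysql/.my.cnf -e "FLUSH LOGS;"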


What other MySQL maintenance jobs (not application related) are you performing? I am very interested...

Saturday, June 28, 2008

My thoughts about MySQL (Cluster) replication

According to Johan's wishes I am writing down my concerns about MySQL (Cluster) replication. These are things I run into again and again with customers:

  • SQL nodes still lose the connection to the cluster too easily after a data node or management node restart (which leads to gaps, see the next point). Automatic fail-over or reconnection is just a dream (maybe it works in about 90% of the cases at least)...

  • Gaps: Whenever I do a cluster reconfiguration (which should not be necessary too often, see losing the connection above) or a mysqld restart I get a gap (these are the planned ones, see also automatic channel fail-over). Then we have the unplanned ones...
    I cannot understand why we are not able to keep at least a certain amount of traffic in a binlog-injector buffer to avoid the majority of these gaps. This could be something like a round-robin buffer which stores all the cluster information which should be sent to the binlog-injector thread. When we configure this buffer to 1 GByte we can keep close to 2 minutes of traffic (at 10 MByte/s) which should be fine to cover just hiccups and quick restarts of mysqld. In other environments where we have much less traffic we could even store hours of traffic there, which should really make us look unbreakable.
    Together with an automatic channel fail-over this should be much more robust.

  • Automatic channel fail-over: It should not be too difficult to automatically switch from one channel to another (or a third) one. If the issues above are solved, it is also less likely that a channel fail-over is necessary.
    A gap event should trigger a channel fail-over and a STOP SQL THREAD command and not just stop the replication with an error. It should also be possible to automatically switch back to the original slave again if this slave is still talking to its master after the gap, to see if it could possibly continue working...

  • Atomic operations: They do not necessarily have to be durable, but they have to be atomic. We have this problem on the master side (as described in issue 27201) and also on the slave side. When we guarantee atomicity we should not lose too much performance (should we?) but can avoid a lot of trouble in replication: The relay log is not atomic with the slave at all. If I crash a slave under heavy load it will get stuck in trouble (in about 25% of the cases). This is reproducible and a bug has been filed for it.

  • The binary log mechanism is a nightmare for every DB operator: Making errors in MySQL replication and restore/recovery is very easy (including Cluster PITR). We should make the mysql restore/recovery mechanism aware of the last applied transaction (global trx id?) and by default it should refuse to apply binary logs with the wrong file or position (with some other databases you cannot even force it to do it wrong!).
    The Cluster recovery/PITR is furthermore not documented and should also follow these rules/this mechanism. It would also be nice if, after ndb_restore -m -r -e is done, it would ask for the binary log to apply next (at the right position!).

  • Rolling restart: From time to time a rolling restart is necessary. This has to be done manually for each node (some customers have written scripts for this), but even worse, you also have to switch tools to do it.
    My suggestion is: a) implement a ROLLING RESTART command and b) also allow restarting SQL nodes from the mgmd (mgmd and mysqld talk to each other anyway (see the binlog injector thread), so it should not be too difficult to send a SIGHUP from the mgmd to the mysqld).

  • Monitoring: MySQL Cluster monitoring without the MGMD-API is a nightmare too. And even with the MGMD-API it is so poor! What I would like to have is a standardized interface to my system metrics (namely SQL!), and then all the important information (performance metrics, buffer fill degree, etc.) in some cluster tables. Basically all the stuff which is written to the cluster log with ALL REPORT MemoryUsage, ALL REPORT BackupStatus and ALL CLUSTERLOG STATISTICS=15 (or maybe all events?), so I can access it easily.

  • Just a little (maybe?) bug I ran into with recent MySQL Cluster releases: A gap now requires more than one (at least 2) SET GLOBAL sql_slave_skip_counter=1 statements to get the slave running again... This makes the replication less predictable for automated processes.

All these points endanger my operative production life and can cause errors in enterprise and carrier grade environments.

Thursday, May 29, 2008

Some more details about DiskSyncSize

The parameter DiskSyncSize is a MySQL Cluster parameter and was added in MySQL 5.1.23.

After this amount of data has been written to a file, the data node will fsync (flush) the LCP file to disk, even if an fsync is not needed for consistency.
This is done because the OS would otherwise buffer all the writes, and when an fsync is really needed, it could take a lot of time...

Originally this parameter was hard coded. Now it defaults to 4 Mbyte.

The parameter DiskSyncSize is related to the parameters NoOfDiskPagesToDiskAfterRestartTUP and NoOfDiskPagesToDiskAfterRestartACC, which are now deprecated. It does NOT replace the parameter TimeBetweenLocalCheckpoints.

This parameter should not be changed on any OS (with reasonable settings). With ODirect enabled it is not used at all.

Thanks to Jonas for the help.

Sunday, May 18, 2008

With MySQL Enterprise Monitor through firewalls

Sometimes it is nice to show customers the functionality of MySQL Enterprise Monitor (aka Merlin). I install the agents on the servers and the dashboard runs on my laptop. But very often only ssh is open to these servers.
So how do we dig a hole through the firewall for MySQL Enterprise Monitor?
laptop> ssh -R 18080:localhost:18080 oli@where_the_agent_sits
Maybe trivial for you, but for me it's hard to remember...

Wednesday, April 9, 2008

Move my blog

To look a little bit nicer, and because I do not want to spend too much time on my own blog/feed software, I have decided to move all my writing here.

But the most interesting part for me is the readers' feedback...