Shinguz' Blog: September 2008

Thursday, September 25, 2008

Quote of the week

"Das dreieckige Rad hat gegenüber dem viereckigen einen gewaltigen Vorteil: Ein Rumms weniger pro Umdrehung!"

Translation:
"The triangular wheel has one enormous advantage over the square one: one bump fewer per revolution!"

Maybe not new, but I had not heard it before and I love it. It was about reinventing functionality in a well-known product...

Tuesday, September 23, 2008

MySQL Cluster: No more room in index file

Recently we were migrating an InnoDB/MyISAM schema to NDB. I was too lazy to calculate all the needed MySQL Cluster parameters (for example with ndb_size.pl) and just took my default config.ini template.
Because I am really lazy I also have a little script that does the engine conversion for me (alter_engine.sh).

But my euphoria came to an abrupt end with the following error:

MySQL error code 136: No more room in index file

The usual command that helps me in such a situation is as follows:

# perror 136
MySQL error code 136: No more room in index file

But in this case it is not really helpful. Also

# perror --ndb 136

does not get us any further. Strange: Index file... We are converting from MyISAM/InnoDB to NDB. Why the hell is it using an index file for this operation? It clearly seems to be a mysqld error message and not a MySQL Cluster error message. And we are not using MySQL Cluster disk data tables either.

After bothering MySQL support a bit I had the idea to do the following:

# ndb_show_tables | grep -ic orderedindex
127

The MySQL online documentation clearly states:

MaxNoOfOrderedIndexes
...
The default value of this parameter is 128.

So this could be the reason! After I had increased this parameter and done the usual rolling restart of the MySQL Cluster, I could continue to migrate my schema into the cluster...
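
To catch this situation before the migration fails, the count above can be compared with the configured limit. The following is only a minimal sketch: check_ordered_indexes is a made-up helper name, the limit has to be passed in by hand (it is not read from config.ini), and warning one below the limit is an assumption based on the case above, where the error showed up with 127 ordered indexes and a limit of 128.

```shell
#!/bin/sh
# Sketch: report how many ordered indexes exist compared to the configured limit.
# check_ordered_indexes is a hypothetical helper name, not part of MySQL Cluster.
# It reads the output of ndb_show_tables from stdin.
check_ordered_indexes() {
    limit="$1"    # value of MaxNoOfOrderedIndexes from config.ini
    # Same count as above: lines mentioning OrderedIndex, case-insensitive.
    # grep exits non-zero when nothing matches, so ignore its exit status.
    count=$(grep -ic orderedindex || true)
    # Assumption: warn one below the limit, because the error above
    # appeared with 127 indexes and a limit of 128.
    if [ "$count" -ge $((limit - 1)) ]; then
        echo "WARNING: $count of $limit ordered indexes in use"
    else
        echo "OK: $count of $limit ordered indexes in use"
    fi
}

# Usage (128 is the documented default):
#   ndb_show_tables | check_ordered_indexes 128
```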

Conclusion
MySQL errors can be related to cluster errors and do not necessarily point to the source of the problem. The error:

MySQL error code 136: No more room in index file

can simply mean that MaxNoOfOrderedIndexes is too small!

I hope that I can save you some time with this little article.

Possible memory leak in NDB-API applications?

A customer recently experienced a possible memory leak in his NDB-API application. What he did was something like

# ps aux | grep <pid>

over time, and he saw the RSS increasing. Had he watched a little longer, he would have seen that the RSS consumption increases up to a certain level and then becomes stable, which is the expected behaviour.

But how do we explain to the customer why his application, which was in fact not doing anything, consumes more and more RSS?
With a diff over time on /proc/<pid>/smaps we found that this area was the reason:

b67b7000-b6fca000 rw-p b67b7000 00:00 0 (8 Mbyte)
Size:               8268 kB
Rss:                 148 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:       148 kB
Referenced:          148 kB
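
A diff over time like the one described above can be sketched as follows. This is not the mem_tracker.sh script mentioned further down, just a minimal illustration; smaps_rss is a made-up helper name and the interval in the usage comment is arbitrary.

```shell
#!/bin/sh
# Sketch: reduce smaps content (read from stdin) to one line per mapping:
#   <address range> <Rss in kB>
# smaps_rss is a hypothetical helper name.
smaps_rss() {
    awk '
        /^[0-9a-f]+-[0-9a-f]+ / { range = $1 }       # mapping header line
        /^Rss:/                 { print range, $2 }  # Rss value of that mapping
    '
}

# Usage, with PID set to the process to watch:
#   smaps_rss < /proc/$PID/smaps > before.txt
#   sleep 600
#   smaps_rss < /proc/$PID/smaps > after.txt
#   diff before.txt after.txt
```

Mappings whose Rss grows between the two snapshots show up as changed lines in the diff.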

But what does this mean? To find the answer we ran strace on the program and got the following system calls:

...
read(5, "127.0.0.1 localhost\n\n# The follo"..., 4096) = 450
close(5) = 0
munmap(0xb7acb000, 4096) = 0
mmap2(NULL, 2117632, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb69bf000 - 0xB6BC4000 (2068 kbyte)
mmap2(NULL, 2101248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb67be000 - 0xb69bf000 (2052 kbyte)
mmap2(NULL, 32768, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7ac4000
mprotect(0xb7ac4000, 4096, PROT_NONE) = 0
clone(child_stack=0xb7acb4b4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_C
...

OK. Somebody is allocating 2 times 2 chunks of about 2 Mbyte of memory. But what the hell could this be??? During the night I found the solution: it is the SendBufferMemory and ReceiveBufferMemory, which I had configured in the config.ini to that size...
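
The sizes of the two mappings can be checked with a bit of shell arithmetic on the addresses from the strace output. A small sketch (range_kb is a made-up helper name):

```shell
#!/bin/sh
# Size of a mapping in kbyte, given its start and end address (hex with 0x prefix).
# range_kb is a hypothetical helper name.
range_kb() {
    echo $(( ($2 - $1) / 1024 ))
}

range_kb 0xb69bf000 0xb6bc4000    # first mmap2 call:  prints 2068
range_kb 0xb67be000 0xb69bf000    # second mmap2 call: prints 2052
```

Both values are a little more than 2 Mbyte (2048 kbyte), which fits buffers configured to about 2 Mbyte plus some overhead.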

When you experience similar behaviour on your processes, maybe this little script can help you to find the problem: mem_tracker.sh

By the way, with another customer we found some other nice behaviour. But this time it was a mysqld:

Friday, September 5, 2008

Active/active fail over cluster with MySQL Replication

Electing a slave as new master and aligning the other slaves to the new master

In a simple MySQL Replication set-up you have high-availability (HA) on the read side (r). But for the master, which covers all the writes (w) and the time-critical reads (rt), there is no HA implemented. For some situations this can be OK, for example if you rarely have writes or if you can wait until a new master is set up.

But in other cases you need a fast fail-over to a new master.

This article shows how to implement the election of a new master and how to align the slaves to the new master.

We can have two possible scenarios:

Scenario 1: Every slave can become the new master.
Scenario 2: Only one dedicated slave will become the new master.

The advantages and disadvantages of both scenarios:

Scenario 1
+ You can choose the slave which is the most up-to-date one.
+ You do not need an extra spare slave.
- Higher possibility of errors if not automated.
- More bin log writing on all slaves.

Scenario 2
+ You do not have to choose the new master, it has already been defined beforehand.
- The pre-elected slave is possibly not the one with the most recent data applied.

Important: All the slaves which can become master have to run with log-bin and log-slave-updates enabled.

Electing a Slave to become the new master
Scenario 1: Compare the output of SHOW SLAVE STATUS on all slaves and decide which one will become the new master.
Scenario 2: Not necessary because it has already been decided beforehand.

Aligning the other slaves to the new master
The officially recommended way to set up replication again when the master fails is as follows:

1. Set up the new master (is skipped in our case because a slave becomes master).
2. Do a consistent backup of the master (which takes time and, depending on the storage engines used, blocks writing).
3. Set up all slaves one by one and point them to the new master (which also takes time).

During these steps your production environment runs with limited capacity.
To avoid or at least reduce this problem we are looking for a shortcut for the whole process:

Step 1: is obsolete in our scenario.
Step 2: Can be circumvented when we use a storage engine which allows us to make consistent backups (for example InnoDB) or when we use a very fast backup method (for example LVM snapshots).
Step 3: We can re-use all the slaves which have the same or older information than the newly elected master. Slaves which have newer information, and slaves in some other exceptional cases (see below), have to be set up as recommended anyway.

How to do this?
IMHO the best way is to show this in a little demo. For this I have set up an environment like in scenario 1 and/or 2. There I have created my favourite table test as follows:

CREATE TABLE test (
`id` int(11) NOT NULL AUTO_INCREMENT,
`data` varchar(32) DEFAULT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`id`));

As next step we simulate the application as follows:

INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);

To simulate a lag on one or more of the slaves we stop the replication on these:

STOP SLAVE;

Then we do some more application action on the master:

INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);

And then we crash the master!
From this point on we have the situation which could happen in the real world: the master has crashed and the slaves may be at different positions relative to the master.

Find the most up-to-date slave
To find the slave which is the most up-to-date one you have to gather some information on the slaves. This can be done as follows:

mysql> PAGER grep Master_Log;
mysql> SHOW SLAVE STATUS\G
mysql> PAGER ;

You also have to do this step in scenario 2 because we want to know which slaves can be re-used and which ones have to be set up from scratch.
Then we get some output for example like this:

Slave 1:
Master_Log_File: bin-3311.000006
Read_Master_Log_Pos: 929
Relay_Master_Log_File: bin-3311.000006
Exec_Master_Log_Pos: 929

Slave 2:
Master_Log_File: bin-3311.000006
Read_Master_Log_Pos: 635
Relay_Master_Log_File: bin-3311.000006
Exec_Master_Log_Pos: 635

First we have to make sure that all the slaves have finished applying the data from their relay log. This is checked by comparing Master_Log_File/Read_Master_Log_Pos with Relay_Master_Log_File/Exec_Master_Log_Pos. If these values are the same then the slave has caught up. Otherwise wait until they become the same.
When this is done we have to find out which slave is the most recent one. This is simple: a higher value means newer information (also consider the log file, not only the position!).
In scenario 1 the most recent slave (or one of them, if several are equally recent) is elected as new master.
In our scenario this is Slave 1!
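
The comparison can be sketched in a few lines of shell and awk. This is only an illustration under some assumptions: most_recent_slave is a made-up helper name, the input format (slave name, Relay_Master_Log_File, Exec_Master_Log_Pos per line) is invented for this example, and all slaves are expected to have caught up with their relay log already.

```shell
#!/bin/sh
# Sketch: pick the slave with the highest (log file number, position) pair.
# most_recent_slave is a hypothetical helper name.
# Input lines on stdin: <slave name> <Relay_Master_Log_File> <Exec_Master_Log_Pos>
most_recent_slave() {
    awk '{
        n = split($2, parts, ".")   # "bin-3311.000006" -> suffix "000006"
        seq = parts[n] + 0          # numeric log file number
        pos = $3 + 0
        # Compare the log file first, then the position within the file
        if (NR == 1 || seq > best_seq || (seq == best_seq && pos > best_pos)) {
            best = $1; best_seq = seq; best_pos = pos
        }
    } END { if (NR > 0) print best }'
}

# The two slaves from the output above:
printf 'slave1 bin-3311.000006 929\nslave2 bin-3311.000006 635\n' | most_recent_slave    # prints slave1
```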

In scenario 2 all slaves which are newer than the pre-elected new master must be rebuilt from the new master.
Slave 1 is newer than slave 2. If slave 2 was pre-elected as new master, slave 1 must be rebuilt from the new master.
For all the slaves which have a different position than the new master, calculate the delta to the new master:

Calculate delta: 6.929 - 6.635 = 294

When the log file is different we cannot use this information and we have to rebuild this slave from the new master.
Now we have defined, which one will become the new master and which slaves are in line, which are ahead and which are behind the new master.
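
The delta rule from above can be written down as a small sketch. binlog_delta is a made-up helper name; it takes the Relay_Master_Log_File/Exec_Master_Log_Pos values of the new master and of one slave.

```shell
#!/bin/sh
# Sketch of the delta rule. binlog_delta is a hypothetical helper name.
# Arguments: <new master log file> <new master pos> <slave log file> <slave pos>
binlog_delta() {
    if [ "$1" != "$3" ]; then
        echo "rebuild"         # different log file: no delta can be derived
    else
        echo $(( $2 - $4 ))    # a negative value means the slave is ahead of the new master
    fi
}

# Slave 1 (new master) compared with slave 2:
binlog_delta bin-3311.000006 929 bin-3311.000006 635    # prints 294
```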

Set-up the new environment
To avoid any trouble we do a STOP SLAVE on all slaves first.
Then we do a RESET SLAVE on the new master.
Now for every slave which is not rebuilt from the master we have to calculate the position where to start the replication from. To do this we have to gather the current position of the new master:

SHOW MASTER STATUS;
+-----------------+----------+--------------+------------------+
| File            | Position | Binlog_Do_DB | Binlog_Ignore_DB |
+-----------------+----------+--------------+------------------+
| bin-3312.000002 |     2857 |              |                  |
+-----------------+----------+--------------+------------------+

And for every slave we can calculate the start position by subtracting its delta from the position of the new master:

==> 2857 - 294 = 2563

When the value becomes negative this means that we would have to start in an older log file than the current one. I did not find any rule to calculate the exact position in this case. So unfortunately we also have to set up these slaves from the backup.
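
As a sketch, the start position rule looks like this. start_position is a made-up helper name, and the calculation relies on the assumption made in this article that the byte offsets in the bin log of the new master line up with the delta measured against the old master.

```shell
#!/bin/sh
# Sketch of the start position rule. start_position is a hypothetical helper name.
# Arguments: <position of new master from SHOW MASTER STATUS> <delta of the slave>
start_position() {
    pos=$(( $1 - $2 ))
    if [ "$pos" -lt 0 ]; then
        echo "rebuild"    # would point into an older log file: set up from backup
    else
        echo "$pos"
    fi
}

start_position 2857 294    # prints 2563
```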

As soon as we have these values calculated we can start the application running against the new master and we can also start now with the new consistent backup for all the slaves we have to set-up again from the backup.

INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);
INSERT INTO test VALUES (NULL, 'Insert on Master', NULL);

On the slaves which are OK for aligning with the new master we have to change the master and the new positions now:

CHANGE MASTER TO MASTER_HOST='laptop', MASTER_USER='replication', MASTER_PORT=3312, MASTER_PASSWORD='replication';
CHANGE MASTER TO MASTER_LOG_FILE='bin-3312.000002', MASTER_LOG_POS=2563;
START SLAVE;

That's it!
If you would like to hear more about such stuff please let me know. We are glad to help you with some consulting...
I have also most of this stuff in some scripts so this could be easily automated...