Monday, October 19, 2009

MySQL useful add-on collection using UDF

I really like this new toy (new for me) called UDF, so I am trying to provide some more, hopefully useful, functionality.

The newest extension I like is the possibility to write to the MySQL error log from the application. Oracle has been able to do this for a long time. Now we can do it as well...

You can find a list of what I have done so far here:

If you have some more suggestions, please let me know.

You can find the complete details here.

Thursday, October 15, 2009

Using MySQL User-Defined Functions (UDF) to get MySQL internal information

In one of my previous posts I wrote about how to read another process's memory [1]. As an example I tried to get the value of the hard-coded MySQL-internal InnoDB variable spin_wait_delay (srv_spin_wait_delay).

In that example we used gdb or the operating system's ptrace function to retrieve the value. This method has the disadvantage of being pretty invasive.

When I was working on a customer support case I had the idea to solve this by the much less invasive method of User-Defined Functions (UDF).

UDF were introduced in MySQL 5.0 [2]. They make it possible to extend MySQL's functionality by adding external code.

The trick is that you can also use this external code to do some MySQL-internal stuff.

My idea was, instead of using gdb/ptrace to get the value of spin_wait_delay, to write a UDF to get and set this value.

You can find more details about the UDF itself, and about how to compile and load it, on my website [3].

Then the UDF has to be loaded and activated in the database:

mysql> CREATE FUNCTION spin_wait_delay RETURNS INTEGER SONAME "udf_spin_wait_delay.so";

To remove the UDF again you can use the following command:

mysql> DROP FUNCTION spin_wait_delay;

To check whether a UDF is installed, or to see which ones are installed, the following command gives you the right answer:

mysql> SELECT * FROM mysql.func;
+-----------------+-----+------------------------+----------+
| name            | ret | dl                     | type     |
+-----------------+-----+------------------------+----------+
| spin_wait_delay |   2 | udf_spin_wait_delay.so | function |
+-----------------+-----+------------------------+----------+

When the UDF is compiled and properly loaded into the database you can get the value of spin_wait_delay as follows:

mysql> SELECT spin_wait_delay();
+-------------------+
| spin_wait_delay() |
+-------------------+
|                 5 |
+-------------------+

And now the really nice thing is that you can even set this value as follows:

mysql> SELECT spin_wait_delay(8);
+--------------------+
| spin_wait_delay(8) |
+--------------------+
|                  8 |
+--------------------+

With this function we can make a static, hard-coded InnoDB value dynamically changeable. To make it permanent across a database restart, the init_file functionality could possibly help you further [4].
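As a sketch of that idea (the file paths here are assumptions): init_file points to an SQL script that is executed at every server start, so the setting would be reapplied automatically:

```
# In my.cnf:
[mysqld]
init_file = /etc/mysql/init.sql

# /etc/mysql/init.sql then contains the one-liner:
SELECT spin_wait_delay(8);
```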

With this concept we can think about implementing many missing things without touching the MySQL code itself or recompiling MySQL. Please let me know what is missing in your opinion and I will try to implement it. Because I am not a programmer, the help of those who are would be very much appreciated.

If anybody sees a problem with this method, please let me know. I do not know much about things like thread safety and mutexes. But I think at least reading should do no harm.

Caution: If your UDF crashes, the whole MySQL server will crash with it. So be careful and test it intensively!

Binary

udf_spin_wait_delay.so (md5 807c6bc09b5dc88a8005788519f2483a)

Friday, October 2, 2009

Determine in MySQL if we are in summer time or winter time (daylight saving time, DST)

Recently a colleague at Sun asked me whether MySQL can tell him if we are currently in summer time or winter time. He was doing some data analysis of his house, where he has installed solar panels.

I do not know exactly what he wants to do, but possibly he wants all the data in solar time. So UTC could help him, because UTC does not shift over the year.

The next thing that came to my mind is that the better place for such calculations is possibly the application code and not the database.

But nevertheless I was interested in how to solve this IN the database.

By default your MySQL server relies on your server's time zone [1].

So if your server is set up correctly, you should be able to determine whether you are in summer time or winter time from your current time, the UTC time and your offset to UTC (the offset of 2 hours checked here matches Central European Summer Time):
mysql> SELECT IF(ROUND(TIME_TO_SEC(SUBTIME(TIME(SYSDATE()), UTC_TIME())) / 3600, 0) = 2, 'summer time', 'winter time') AS time;
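As noted above, the application code may be the better place for this kind of calculation. For comparison, a minimal application-side check could look like this (a sketch using Python's standard time module, which exposes the C library's DST flag):

```python
import time

def is_dst_now():
    """Return True if the local time zone is currently observing DST.

    tm_isdst is 1 during daylight saving time, 0 outside it, and -1
    when the C library cannot tell; for simplicity this sketch treats
    the -1 case as winter time.
    """
    return time.localtime().tm_isdst == 1

print("summer time" if is_dst_now() else "winter time")
```

Like the SQL above, this relies on the server's time zone being configured correctly.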
Have fun calculating how much power your solar panels produce according to winter or summer time...

If you have smarter solutions please let me know.

[1] Time zone support
[2] Date and time functions

Friday, August 14, 2009

Reading other processes' memory

As you have probably already experienced, MySQL does not always provide all the internal information you might want, and that you are used to getting from other RDBMS.

MySQL plans to improve this by implementing a performance schema, and it is probably already partly done in MySQL 5.4. But who knows when this will be finished and what it will contain at all...

What is not provided to me I want to gather myself... But how? Other RDBMS provide interfaces to attach applications directly to their memory to retrieve information. But MySQL does not. So I was looking for a way to read another process's memory.

I have no clue about programming, and thus changing the MySQL code was out of the question. Further, I am looking for a solution you can use immediately on running systems at consulting gigs. Some attempts to read /proc/<pid>/mem with a little PHP script failed.

An article by Domas M. helped me. I do not have to write something myself; I can use an already existing tool to do the work. But gdb is not installed on every machine, and usually not at all on production machines. Further, gdb is probably overkill just to read the memory of other processes.

But I did not find another application to do this job. I just found some comments that ptrace is the way to do it. Ptrace (man ptrace) is not a program (as, for example, strace is) but an operating system function call.

If you are interested in how I found out how to do it, please continue reading here.

Tuesday, December 23, 2008

MySQL licenses for dummies

The following summary shows my personal understanding of MySQL 5.1 licenses, packages and products. It does not necessarily reflect 100% the way MySQL understands it. But after all the discussions, I hope it is as close to reality as possible:

MySQL Embedded Database Server (Download: enterprise.mysql.com -> OEM Software)
Classic (OEM license, -MEM -InnoDB)
Pro (= Classic +InnoDB)
Advanced (= Pro +Partitioning)

MySQL Community Server (Download: www.mysql.com -> Downloads)
Community (GPL, -NDB)

MySQL Enterprise Server (Download: enterprise.mysql.com -> Enterprise Software)
Pro (GPL or commercial, -NDB +InnoDB +MEM, Basic + Silver customer, MRU + QSP)
Advanced (= Pro +Partitioning, Gold + Platinum customer)

MySQL Cluster (Download: http://dev.mysql.com/downloads/cluster/)
Community Edition (GPL, all features)
Com (ex CGE?) (OEM or commercial, -InnoDB +NDB)
Com-Pro (Com, all features)
Standard Edition (= Com, -NDB-API -Cluster-Repl, -LDAP)

Upgrade

EP customers should follow the QSP trail, unless it is critical for them to install an MRU to get a quick bugfix to hold them over until the next QSP is released.

Month  Version / release
    0  5.1.30
    1  5.1.30-MRU1
    2  5.1.30-MRU2
    3  5.1.31
    4  5.1.31-MRU1 and 5.1.30-QSP
    5  5.1.31-MRU2
    6  5.1.32
    7  5.1.32-MRU1 and 5.1.31-QSP

Legend

CE  - Community Edition
EP - Enterprise Edition (why not EE?)
MRU - Monthly Rapid Update (EP only)
QSP - Quarterly Service Pack (EP only)
OEM - Original Equipment Manufacturer
MEM - MySQL Enterprise Monitor
CGE - Carrier Grade Edition
Please correct me if I am wrong. And if you have more questions, let me know and I will try to clarify.

Thursday, November 13, 2008

Why does MySQL Cluster take so long for --initial?

This week we had a very interesting problem at a customer: They complained that their MySQL Cluster took about 1 hour for an --initial start-up. After some tuning of the hardware and the config.ini they brought it down to around 40 minutes. But this is still unacceptably long...

This sounds a little strange to me. But let us have a look at their config.ini first. It looked more or less like the following (which is already tuned!):

#
# config.ini
#

[NDB_MGMD DEFAULT]
DataDir = /localdisk/cluster

[NDBD DEFAULT]
DataDir = /localdisk/cluster
NoOfReplicas = 2
LockPagesInMainMemory = 1
DataMemory = 12000M
IndexMemory = 2000M

FragmentLogFileSize = 256M
NoOfFragmentLogFiles = 42

DiskCheckpointSpeedInRestart = 100M
ODirect = 1

BackupMaxWriteSize = 1M
BackupDataBufferSize = 16M
BackupLogBufferSize = 4M
BackupMemory = 20M

[NDB_MGMD]
id = 1
hostname = 192.168.0.1

[NDBD]
id = 10
hostname = 192.168.0.10

[NDBD]
id = 11
hostname = 192.168.0.11

[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]

So the config.ini looks fine. But why does it take so long? Let us have a look at who the evil guy is:

# grep phase ndb_1_cluster.log | grep "Node 10"

2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 1 completed
2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 2 completed (initial start)
2008-10-24 12:24:16 [MgmSrvr] INFO -- Node 10: Start phase 3 completed (initial start)
2008-10-24 13:03:03 [MgmSrvr] INFO -- Node 10: Start phase 4 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 5 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 6 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 7 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 8 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 9 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 100 completed (initial start)
2008-10-24 13:03:08 [MgmSrvr] INFO -- Node 10: Start phase 101 completed (initial start)

Start phase 4 is the one! But what is start phase 4? According to the MySQL Cluster documentation start phase 4 means:

Phase 4. For an initial start or initial node restart, the redo log files are created. The number of these files is equal to NoOfFragmentLogFiles.

Creating the redo log files is the problem! I have never created a MySQL Cluster with 42 Gbyte of redo log files (42 x 4 x 256 Mbyte) myself, so I have no practical experience of how long it should take.

What I did first was a rough calculation of the value I would expect:

43'008 Mbyte : 50 Mbyte/s = 860 s ≈ 15 minutes

After 15 minutes I would expect the cluster to be up and running (plus some seconds for the other phases). What I expect seems not to be the same as what happens... :(
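The same back-of-the-envelope calculation as a quick sanity check (the 50 Mbyte/s sequential write speed is the assumption from the text):

```python
# Expected redo log creation time from the figures in the post:
# NoOfFragmentLogFiles = 42, 4 file sets per data node,
# FragmentLogFileSize = 256M, assumed write speed 50 MB/s.
no_of_fragment_log_files = 42
file_sets = 4
fragment_log_file_size_mb = 256
write_speed_mb_s = 50

total_mb = no_of_fragment_log_files * file_sets * fragment_log_file_size_mb
seconds = total_mb / write_speed_mb_s

print(f"{total_mb} MB / {write_speed_mb_s} MB/s = {seconds:.0f} s")
```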

I turned on iostat (by the way: IMHO iostat (from the sysstat package) belongs on every server which does any kind of I/O!):

# iostat -x 1

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.12 18.35 0.00 81.52

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 193.00 0.00 55.00 0.00 2016.00 0.00 1008.00 36.65 2.13 38.55 18.18 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.25 18.88 0.00 80.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 192.00 0.00 48.00 0.00 2088.00 0.00 1044.00 43.50 1.72 35.50 20.83 100.00

avg-cpu: %user %nice %system %iowait %steal %idle
0.00 0.00 0.00 19.12 0.00 80.88

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 200.00 0.00 47.00 0.00 2088.00 0.00 1044.00 44.43 1.69 35.91 21.28 100.00

OK. As we can see, the I/O device is fully utilized. Good to know: looks like we have an I/O problem! We do only around 50 w/s, which is really bad (normal disks can do around 150 - 250 w/s), and we write with a throughput of 1 Mbyte/s (about 20 kbyte per write) instead of 50 Mbyte/s.
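These figures follow directly from the iostat sample above; as pure arithmetic:

```python
# Back-of-the-envelope check of one iostat sample line:
# 48 writes/s at 1044 wkB/s.
writes_per_s = 48
wkb_per_s = 1044

kb_per_write = wkb_per_s / writes_per_s   # ~21.8 kB per write
mb_per_s = wkb_per_s / 1024               # ~1 MB/s instead of 50 MB/s

print(round(kb_per_write, 1), round(mb_per_s, 2))
```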

What came to my mind is that earlier this year I wrote an article about sparse files. I found then that MySQL Cluster writes the redo log files to disk as sparse files. Could this have an impact? But we have never had any complaints about this from other customers...

So what I did next: I tried on completely different hardware from another vendor, and it took only 13 minutes! The problem seems NOT to be MySQL Cluster (alone) related! On our machines we also have an NFS mount. So let us try there: Even worse: 10 minutes! Then we did a synthetic test on the hardware:

# dd if=/dev/zero of=redo_log.42g count=0 obs=1 seek=42G
# dd if=/dev/zero of=redo_log.42g bs=32K count=1376256 oflag=direct

This ran in 12 minutes! I was confused. Maybe my simulation had to be more complex:

#!/bin/bash

let i=1
while [[ $i -le 1376256 ]] ; do
  dd if=/dev/zero of=big_file.42g obs=512 count=1 seek=32K
  let i=$i+1
done

No results came close to what I had seen from MySQL Cluster...

Let us summarize:

Our servers with MySQL Cluster: 37 minutes
Test servers with MySQL Cluster: 13 minutes
Our servers with dd: 12 minutes
Our servers with NFS: 10 minutes

This does not make sense to me. I need the PERFECT simulation!

To learn more about how the redo log files are created, I ran a trace on the ndbd process during the creation:

# strace -o ndbd.trace -ff -s 128 -p <pid_of_ndbd>

As a result I got this:

open("/localdisk/cluster/ndb_11_fs/D10/DBLQH/S3.FragLog", O_RDWR|O_CREAT|O_TRUNC|O_DIRECT, 0666) = 17
lseek(17, 268402688, SEEK_SET) = 268402688 # 256M - 32k
write(17, "..."..., 32768) = 32768
fsync(17) = 0
lseek(17, 0, SEEK_SET) = 0
write(17, "..."..., 32768) = 32768
fsync(17) = 0
lseek(17, 1048576, SEEK_SET) = 1048576
write(17, ""..., 32768) = 32768
fsync(17) = 0
lseek(17, 2097152, SEEK_SET) = 2097152
write(17, ""..., 32768) = 32768
fsync(17) = 0

...
lseek(17, 267386880, SEEK_SET) = 267386880
write(17, ""..., 32768) = 32768
fsync(17) = 0
close(17) = 0

OK, I see: my simulation was not good enough yet... I need a Perl script to simulate this more exactly. You can find the script here: cluster_initial_test.pl.
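The write pattern from the trace can be sketched as follows (in Python rather than Perl here, and with a much smaller file than the real 256 Mbyte fragment log so it runs quickly; the file name and sizes are assumptions, and O_DIRECT is left out for portability):

```python
import os

# Sketch of the pattern seen in the strace output: write the last
# 32 KB block first (extending the file to full size), then one
# 32 KB block per megabyte, with an fsync after every write.
BLOCK = 32 * 1024          # 32 KB written per step
STRIDE = 1024 * 1024       # one block per 1 MB of file
FILE_SIZE = 8 * STRIDE     # 8 MB demo file instead of 256 MB

def create_fraglog(path, size=FILE_SIZE):
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o666)
    try:
        # Last block first, as in the trace (lseek to size - 32k)
        os.lseek(fd, size - BLOCK, os.SEEK_SET)
        os.write(fd, b"\0" * BLOCK)
        os.fsync(fd)
        # Then one 32 KB block per megabyte from the beginning
        for offset in range(0, size - BLOCK, STRIDE):
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, b"\0" * BLOCK)
            os.fsync(fd)   # this per-megabyte sync is what hurts
    finally:
        os.close(fd)

create_fraglog("S3.FragLog.demo")
print(os.path.getsize("S3.FragLog.demo"))
```

Scaled up to 256 Mbyte files and 42 x 4 of them, this is essentially what the Perl script reproduces.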

Before I ran the script I was pretty nervous: would it work? But after a few minutes it was already clear: it has the wanted effect. After around 1/3 of the redo log files were created I estimated the end time at 59 minutes, and when it ended it had taken around 50 minutes. Perfect!

After I came back from a short jig I thought about why it took longer than 40 minutes... What I can see from the system statistics is that the I/O system is more relaxed with my script than with the original MySQL Cluster --initial restart.

# iostat -x 1

Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util
sda 0.00 161.39 0.00 49.50 0.00 1798.02 0.00 899.01 36.32 1.05 20.40 14.32 70.89
sda 0.00 176.00 0.00 51.00 0.00 1920.00 0.00 960.00 37.65 1.07 21.18 14.04 71.60
sda 0.00 170.71 0.00 46.46 0.00 1882.83 0.00 941.41 40.52 0.99 21.22 15.74 73.13

So it is about 25-33% less utilized. MySQL Cluster creates the redo log files with 4 threads in parallel. For me this is, for the moment, a sufficiently exact explanation of why my script takes longer.

We are now able to decouple the hardware problem from the software product. I cannot really help the hardware guys solve their problem, but at least we were able to give them a simulation of it, so they can run the tests without us.

On the other hand, MySQL Cluster is doing exactly what I always tell my customers not to do: "COMMIT after every INSERT". This is the worst thing you can do to an I/O system. MySQL Cluster does a sync to disk after every 32k block (one block per Mbyte).

So I was talking to Jonas, one of our MySQL Cluster developers, and told him the whole story. He just asked me: "Is 14:00 early enough for the patch?"

Before 14:00 I already had the patch in my mailbox. We applied it and tested it and...

5 minutes 12 seconds!

So I think now everybody should be happy?

Addendum:

  • The perl script could be improved by:
sysopen(HANDLE, $path, O_RDWR|O_CREAT|O_EXCL|O_DIRECT) or die "sysopen $path: $!";
  • Maybe the problem could be also simulated much easier with dd like this:
# dd if=/dev/zero of=redo_log.42g bs=1M count=43008 oflag=direct conv=fsync

Monday, October 13, 2008

MySQL Multi-Master – Single-Slave – Replication (Episode 2)

Introduction

One of the features that makes MySQL so great is its easy replication set-up. If you are experienced and know how to do it, it takes you about 15 minutes to set up a slave. What you have in the end is a replication from one master to one or several slaves. So you can build a top-down data stream pyramid and spread your data over many slaves.


From time to time customers ask for the other way around: many masters replicating to one slave (also called multi-source replication). For this requirement, MySQL replication cannot help you directly.

Possibilities

You can circumvent this situation in the following ways:

  1. Implement your own data transfer mechanism.

  2. Use this ugly hack I wrote down 2 years ago: Multi-Master-Single-Slave-Replication

  3. Test the approach described in the following article.

  4. Wait until MySQL has implemented it...


Possibility number 4 would be the nicest one. Because then the solution would be properly supported by your database vendor, and you would not have to care much about problems, as long as you have a support contract.

When we look at the MySQL work log and search for replication we find a work log entry with the title: WL#1697: Multi-source replication. But for whatever reason it has the following status: Affects: Server-7.0 — Status: On-Hold — Priority: Low. :-(

What can we do now:
a) Resign.
b) Become an important customer (by paying much money), complain about the priority and if nothing changes escalate it to the top management of MySQL.
c) Find many many other fellow sufferers, unite and make your database vendors management aware of your desire.
d) Help yourself (it is, after all, an Open Source product...).

Possibilities number 1 to 3 from above are some kind of Help yourself.

Help yourself

Because I am not a programmer and I have no clue about programming, possibility number 1 is out of the question for me. But I am sure there are many other MySQL users out there in the world who would appreciate your effort.
Possibility number 2 is a quick and ugly hack but may work in some situations.
And I was pointed to number 3 by a user called Kwame, who wrote me an email (thanks, Kwame, for the cool hint!).

A possible solution

One and a half years ago I wrote a little article about MySQL Active - Active Clustering, because again and again customers were asking for a replacement for a well-known but very expensive product from another big database vendor.
Peter Zaitsev was not very happy with its content: MySQL MyISAM Active Active Clustering - looking for trouble?. But the comments on that critique already hinted at the solution: when you combine MySQL Active-Active Clustering with MySQL replication, you can finally get a multi-master single-slave replication!

So what I did was the following:

I have 3 servers which act as masters, and on the 4th server I install 3 MySQL instances (mysqld) running on the same datadir. Please make sure that you replicate ONLY ONE SCHEMA per master-slave pair!


As the configuration file for my slave instances I used something like this:

#
# my.cnf
#

[mysqld]

port = 3308
socket = /home/mysql/tmp/mysql-3308.sock
pid-file = /home/mysql/data/mysqld5127/mysql-3308.pid
datadir = /home/mysql/data/mysqld5127

skip-innodb
external-locking
log-error = /home/mysql/data/mysqld5127/error_4.log

server_id = 4
master-info-file = /home/mysql/data/mysqld5127/master.info
relay-log = /home/mysql/data/mysqld5127/mysql-3308-relay-bin
relay-log-index = /home/mysql/data/mysqld5127/mysql-3308-relay-bin
relay-log-info-file = /home/mysql/data/mysqld5127/relay-log.info

Requirements

Make sure that:

  • You only write to one schema per master (schema_a in master 1, schema_b in master 2 and so on...)

  • You comply with all the requirements described in the document MySQL Active - Active Clustering

  • You flush the tables on ALL slave instances before reading.


Very simple first tests showed that it is possible to "aggregate" the data from many masters into a single slave. But please verify this proof of concept very carefully before you deploy it to your production system with your real data. This approach is neither recommended nor supported by MySQL.

Drawback

A further replication from this slave system to other slaves is not possible (or at least no feasible solution comes to my mind), unless you use MySQL Cluster as the "aggregator". Then it could be possible...