Friday, January 29, 2010

What is CHECK TABLE doing with InnoDB tables?

Recently we had a case where a customer got some corrupted blocks in his InnoDB tables. His largest tables were quite big, about 30 to 100 Gbyte. Why he got these corrupted blocks we have not found out yet (a broken disk?).

When you have corrupted blocks in InnoDB, it is mandatory to get rid of them again. Otherwise your database can crash suddenly.
If you are lucky, only "normal" tables are affected. Then you can dump, drop, recreate and reload them as described in the InnoDB recovery procedure in the MySQL documentation [1].
If you are not so lucky, you have to recreate your complete database, or go back to an old backup and do a restore with a Point-in-Time Recovery (PITR).

To find out if some tables are corrupted, MySQL provides two tools: the innochecksum utility [2] and the mysqlcheck utility [3]. Alternatively, you can run the CHECK TABLE command manually (which is what mysqlcheck uses underneath).

I wanted to know how CHECK TABLE works in detail, so I first looked in the MySQL documentation [4]. But unfortunately, on such specific questions the MySQL documentation very often does not go into much detail.

So I dug into the code. The interesting lines can be found in the files handler/ha_innodb.cc and row/row0mysql.c. In the following snippets I have cut out a lot of the detail stuff.

The function ha_innobase::check is the interface between the CHECK TABLE command and the InnoDB storage engine, and it calls the InnoDB table check:

// handler/ha_innodb.cc

int ha_innobase::check( THD* thd )
{
    /* Build a row template covering the whole row, so that all
       columns are read during the check. */
    build_template(prebuilt, NULL, table, ROW_MYSQL_WHOLE_ROW);

    /* Let InnoDB validate the table and all of its indexes. */
    ret = row_check_table_for_mysql(prebuilt);

    if (ret == DB_SUCCESS) {
        return(HA_ADMIN_OK);
    }

    return(HA_ADMIN_CORRUPT);
}

The function row_check_table_for_mysql performs the different checks on an InnoDB table:

  • First it checks whether the table's .ibd file is missing.

  • Then the first index (dict_table_get_first_index) is checked for consistency (btr_validate_index) by walking through all levels of the page tree. In InnoDB the first (primary) index is always equal to the table itself, because the table is organized as a clustered index (= the data).

  • If the index is consistent, several other checks are performed (row_scan_and_check_index):

    • Whether the entries are in ascending order.

    • Whether any unique constraint is broken.

    • And the number of index entries is counted.

  • Then the next and all other (secondary) indexes of the table are checked in the same way. Their entry counts must match the number of rows found in the primary index.

  • At the end a check of the WHOLE Adaptive Hash Index for ALL InnoDB tables (btr_search_validate) is done for every CHECK TABLE!

// row/row0mysql.c

ulint row_check_table_for_mysql( row_prebuilt_t* prebuilt )
{
    /* A missing .ibd file makes the table unusable. */
    if ( prebuilt->table->ibd_file_missing ) {
        fprintf(stderr, "InnoDB: Error: ...", prebuilt->table->name);
        return(DB_ERROR);
    }

    /* The first index is the clustered (primary) index = the table data. */
    index = dict_table_get_first_index(table);

    while ( index != NULL ) {

        /* Validate the index B-tree, walking through all page tree levels. */
        if ( ! btr_validate_index(index, prebuilt->trx) ) {
            ret = DB_ERROR;
        }
        else {

            /* Scan the index: check ordering and uniqueness, count the entries. */
            if ( ! row_scan_and_check_index(prebuilt, index, &n_rows) ) {
                ret = DB_ERROR;
            }

            if ( index == dict_table_get_first_index(table) ) {
                /* Remember the row count of the primary index ... */
                n_rows_in_table = n_rows;
            }
            else if ( n_rows != n_rows_in_table ) {
                /* ... every secondary index must contain the same
                   number of entries. */
                ret = DB_ERROR;

                fputs("Error: ", stderr);
                dict_index_name_print(stderr, prebuilt->trx, index);
                fprintf(stderr, " contains %lu entries, should be %lu\n",
                        n_rows, n_rows_in_table);
            }
        }

        index = dict_table_get_next_index(index);
    }

    /* Validate the WHOLE Adaptive Hash Index for ALL InnoDB tables. */
    if ( ! btr_search_validate() ) {
        ret = DB_ERROR;
    }

    return(ret);
}

A little detail which is NOT shown in the code above: during CHECK TABLE the fatal lock wait timeout is raised from 600 seconds (10 min) to 7800 seconds (2 h 10 min).

/* Enlarge the fatal lock wait timeout during CHECK TABLE. */
mutex_enter(&kernel_mutex);
srv_fatal_semaphore_wait_threshold += 7200; /* 2 hours */
mutex_exit(&kernel_mutex);
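
When the check is finished, row_check_table_for_mysql subtracts the same amount again at its very end, so the threshold returns to its old value (as far as I can see in the same file):

/* Restore the fatal lock wait timeout after CHECK TABLE. */
mutex_enter(&kernel_mutex);
srv_fatal_semaphore_wait_threshold -= 7200; /* 2 hours */
mutex_exit(&kernel_mutex);

Note that this is a plain increment/decrement, not a save/restore: two overlapping CHECK TABLE commands keep the threshold raised until the last one finishes.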

As far as I understand this has 2 impacts:
  1. CHECK TABLE for VERY large tables (> 200 - 400 Gbyte) will most probably fail because it will exceed the fatal lock wait timeout, which makes InnoDB crash the server on purpose (see the sketch below). At an effective scan rate of, say, 30 - 60 Mbyte/s, 7800 seconds are only enough for roughly 230 - 470 Gbyte. This becomes more probable when you have bigger tables, slower disks, less memory, or do not make use of your memory appropriately.

  2. Because srv_fatal_semaphore_wait_threshold is a global variable, every CHECK TABLE raises the fatal lock wait timeout for the whole system. Long-enduring InnoDB locks will be detected late, or not at all, while a long-running CHECK TABLE command is active.
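
Why is exceeding this threshold fatal at all? The InnoDB error monitor thread periodically compares the longest semaphore wait with srv_fatal_semaphore_wait_threshold and intentionally crashes the server when it is exceeded, assuming mysqld is hung. The following is my own much simplified paraphrase of that logic (see srv/srv0srv.c and sync/sync0arr.c), NOT the literal source; longest_semaphore_wait_secs() is an invented placeholder for the real sync array bookkeeping:

/* Simplified paraphrase of the InnoDB error monitor loop. */

for (;;) {

    if (longest_semaphore_wait_secs() > srv_fatal_semaphore_wait_threshold) {
        /* InnoDB prints diagnostics and intentionally crashes the
           server, because it assumes mysqld is hung. */
        ut_error;
    }

    os_thread_sleep(1000000); /* sleep 1 second */
}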


Whether this is something which should be fixed to achieve a higher reliability of the system I cannot judge; that is up to the InnoDB developers. But when you hit such symptoms during long-running CHECK TABLE commands, keep this in mind.
For the first finding I have filed a feature request [5]. This "problem" was introduced long ago with bug #2694 [6] in MySQL 4.0, in September 2004. Thanks to Axel and Shane for their comments.
If you want to circumvent this situation you either have to recompile MySQL with higher values, or you can use the concept of a pluggable User Defined Function (UDF), which I have described earlier [7], [8], [9]. A minimal sketch of such a UDF is shown below.
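
The sketch assumes quite a lot: the function name set_fatal_timeout, the file name and the build command are my own inventions, and it only works if the mysqld binary exports the srv_fatal_semaphore_wait_threshold symbol to dynamically loaded modules (see [7], [8], [9] for the complete UDF concept):

/* set_fatal_timeout_udf.c -- minimal sketch, NOT production code.
   Assumed build command (include path depends on your installation):
   gcc -shared -fPIC -I/usr/include/mysql \
       -o set_fatal_timeout.so set_fatal_timeout_udf.c */

#include <string.h>
#include <my_global.h>
#include <mysql.h>

/* InnoDB global (a ulint in srv/srv0srv.c). We assume the symbol is
   resolvable when the library is loaded into mysqld. */
extern unsigned long srv_fatal_semaphore_wait_threshold;

my_bool set_fatal_timeout_init(UDF_INIT* initid, UDF_ARGS* args, char* message)
{
    if (args->arg_count != 1 || args->arg_type[0] != INT_RESULT) {
        strcpy(message, "usage: set_fatal_timeout(seconds)");
        return 1;
    }
    return 0;
}

/* Sets the fatal semaphore wait threshold (in seconds) and returns the
   new value. The server serializes its own updates of this variable
   with kernel_mutex; this sketch simply assigns it. */
long long set_fatal_timeout(UDF_INIT* initid, UDF_ARGS* args,
                            char* is_null, char* error)
{
    if (args->args[0] == NULL) {
        *is_null = 1;
        return 0;
    }

    srv_fatal_semaphore_wait_threshold =
        (unsigned long) *((long long*) args->args[0]);

    return (long long) srv_fatal_semaphore_wait_threshold;
}

After copying the library into a directory where the server finds it (plugin_dir), it would be activated with CREATE FUNCTION set_fatal_timeout RETURNS INTEGER SONAME 'set_fatal_timeout.so'; and called with SELECT set_fatal_timeout(28800); before starting the long-running CHECK TABLE.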

Another detail is that at the end of each CHECK TABLE command a check of all Adaptive Hash Indexes of all tables is done. I do not know how expensive it is to check all Adaptive Hash Indexes, especially when they are large. But more optimized code there could possibly speed up the CHECK TABLE command by a small percentage.

This information is valid up to MySQL/InnoDB 5.1.41 and the InnoDB Plugin 1.0.5.

Literature

[1] Forcing InnoDB Recovery
[2] innochecksum — Offline InnoDB File Checksum Utility
[3] mysqlcheck
[4] CHECK TABLE
[5] Bug #50723: InnoDB CHECK TABLE fatal semaphore wait timeout possibly too short for big table
[6] Bug #2694: CHECK TABLE for Innodb table can crash server
[7] MySQL useful add-on collection using UDF
[8] Using MySQL User-Defined Functions (UDF) to get MySQL internal informations
[9] MySQL useful add-on collection using UDF


3 comments:

kabel said...

Wow. This post could not have been more timely. I have a (legacy) server with a large (600G) InnoDB tablespace that is slightly corrupted. A new machine has been rotated into production while I clean up this monster, but you just saved me a lot of heartache with the fatal lock timeout catch. It's probably going to end up being an incremental table dump and restore.

Thanks!

Shinguz said...

Hello kabel,

Thanks for the flowers!

Anyhow, logical backups of such an amount of data are nearly impossible to restore (in a timely manner)!

But if you rely on physical backup methods you do not get rid of the corruption. So one is trapped in this situation. :-(

Do you have an idea why you got the corruptions? Playing around with DRBD or anything similar on the file system level? Or did you just upgrade to 5.1?
How does the corruption manifest itself?

Regards,
Oli

qu1j0t3 said...

ZFS can prevent corruption from ever reaching MySQL (or any application).

For mission critical data I wouldn't use anything else.