欢迎您光临本站,如有问题请及时联系我们。

MySQL Waiting for table metadata lock故障分析

  源起

  线上执行ALTER时,通过show processlist查看到出现Waiting for table metadata lock ,导致后面的查询都无法执行。

  5217122 | create_table_04 | 172.100.207.148:31291 | finance | Query | 1829 | Waiting for table metadata lock | ALTER TABLE `Pay` MODIFY COLUMN `pay` smallint(6) NULL DEFAULT 0 COMMENT '付??'

  5217155 | bx_live_dml | 172.100.210.4:34730 | finance | Prepare | 1714 | Waiting for table metadata lock | select * from Pay where (pay =4 or pay =8) and projectId ='CSZY0'

  由于当时直接把ALTER给kill了,所以基本没有现场,所以下面模拟什么情况MySQL产生这种事故。

  MySQL元数据锁

  MySQL DBA对于Waiting for table metadata lock肯定不会陌生,一般都是进行alter操作时被堵住了,导致了我们在show processlist 时,看到线程的状态是在等metadata lock。

  为了在并发环境下维护表元数据的数据一致性,在表上有活动事务(显式或隐式)的时候,不可以对元数据进行写入操作。因此从MySQL5.5版本开始引入了MDL锁(metadata lock),来保护表的元数据信息,用于解决或者保证DDL操作与DML操作之间的一致性。对于引入MDL,其主要解决了2个问题,一个是事务隔离问题,比如在可重复隔离级别下,会话A在2次查询期间,会话B对表结构做了修改,两次查询结果就会不一致,无法满足可重复读的要求;另外一个是数据复制的问题,比如会话A执行了多条更新语句期间,另外一个会话B做了表结构变更并且先提交,就会导致slave在重做时,先重做alter,再重做update时就会出现复制错误的现象。

  所以在对表进行上述操作时,如果表上有活动事务(未提交或回滚),请求写入的会话会等待在Metadata lock wait 。例如下面的这种情形:

  若没有MDL锁的保护,则事务2可以直接执行DDL操作,并且导致事务1出错,5.1版本即是如此。5.5版本加入MDL锁就在于保护这种情况的发生,由于事务1开启了查询,那么获得了MDL锁,锁的模式为SHARED_READ,事务2要执行DDL,则需获得EXCLUSIVE锁,两者互斥,所以事务2需要等待。

  注:支持事务的InnoDB引擎表和不支持事务的MyISAM引擎表,都会出现Metadata Lock Wait等待现象。一旦出现Metadata Lock Wait等待现象,后续所有对该表的访问都会阻塞在该等待上,导致连接堆积,业务受影响。

  引起Metadata lock wait出现的场景:Waiting for table metadata lock

  场景一:当前有执行DML操作时执行ALTRE操作。

  # SESSION A

  mysql> insert into sbtest2 select * from sbtest1;

  # SESSION B

  mysql> alter table sbtest2 add test1 int; //等待SESSION A执行完;

  # SESSION C

  mysql> show processlist;

  +-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+

  | Id | User | Host | db | Command | Time | State | Info |

  +-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+

  | 267 | root | localhost | sbtest | Query | 7 | Sending data | insert into sbtest2 select * from sbtest1 |

  | 271 | root | localhost | sbtest | Query | 3 | Waiting for table metadata lock | alter table sbtest2 add test1 int |

  | 272 | root | localhost | NULL | Query | 0 | starting | show processlist |

  +-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+

  3 rows in set (0.00 sec)

  # SESSION D

  mysql> select * from sbtest2 limit 10; //等待元数据锁;

  # SESSION E

  mysql> show processlist;

  +-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+

  | Id | User | Host | db | Command | Time | State | Info |

  +-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+

  | 267 | root | localhost | sbtest | Query | 20 | Sending data | insert into sbtest2 select * from sbtest1 |

  | 271 | root | localhost | sbtest | Query | 13 | Waiting for table metadata lock | alter table sbtest2 add test1 int |

  | 272 | root | localhost | NULL | Query | 0 | starting | show processlist |

  | 308 | root | localhost | sbtest | Query | 3 | Waiting for table metadata lock | select * from sbtest2 limit 10 |

  +-----+------+-----------+--------+---------+------+---------------------------------+-------------------------------------------+

  4 rows in set (0.00 sec)

  从上述例子可以看出,我们在执行DDL语句的时候得事先看一下,进程中是否已经存在某些DML语句占用了表的元数据锁,这样会导致DDL语句处于锁等待状态。一旦出现Waiting for table metadata lock等待现象,后续所有对该表的访问都会阻塞在该等待上,包括读操作,导致连接堆积,业务受影响。

  场景二:当前有对表的长时间查询或使用mysqldump/mysqlpump时,使用alter会被堵住。

  # SESSION A

  mysql> select *,sleep(10) from sbtest2;

  # SESSION B

  mysql> alter table sbtest2 add test2 int; //等待SESSION A执行完;

  # SESSION C

  mysql> show processlist;

  +-----+------+-----------+--------+---------+------+---------------------------------+---------------------------------------+

  | Id | User | Host | db | Command | Time | State | Info |

  +-----+------+-----------+--------+---------+------+---------------------------------+---------------------------------------+

  | 267 | root | localhost | sbtest | Query | 12 | User sleep | select *,sleep(10) from sbtest2 |

  | 271 | root | localhost | sbtest | Query | 8 | Waiting for table metadata lock | alter table sbtest2 add test3 int |

  | 272 | root | localhost | NULL | Query | 0 | starting | show processlist |

  | 311 | root | localhost | NULL | Query | 3 | Waiting for table metadata lock | select * from sbtest.sbtest2 limit 10 |

  +-----+------+-----------+--------+---------+------+---------------------------------+---------------------------------------+

  4 rows in set (0.00 sec)

  场景三:显示或者隐式开启事务后未提交或回滚,比如查询完成后未提交或者回滚,使用alter会被堵住。

  # SESSION A

  mysql> begin;

  mysql> select * from sbtest2;

  # SESSION B

  mysql> alter table sbtest2 add test2 int; //等待SESSION A执行完;

  # SESSION C

  mysql> show processlist;

  +-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+

  | Id | User | Host | db | Command | Time | State | Info |

  +-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+

  | 267 | root | localhost | sbtest | Sleep | 36 | | NULL |

  | 271 | root | localhost | sbtest | Query | 30 | Waiting for table metadata lock | alter table sbtest2 add test2 int |

  | 272 | root | localhost | NULL | Query | 0 | starting | show processlist |

  +-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+

  3 rows in set (0.00 sec)

  场景四:表上有失败的查询事务,比如查询不存在的列,语句失败返回,但是事务没有提交,此时alter仍然会被堵住。

  # SESSION A

  mysql> begin;

  mysql> select error from sbtest2;

  ERROR 1054 (42S22): Unknown column 'error' in 'field list'

  # SESSION B

  mysql> alter table sbtest2 add test3 int; //等待SESSION A提交或回滚;

  # SESSION C

  mysql> show processlist;

  +-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+

  | Id | User | Host | db | Command | Time | State | Info |

  +-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+

  | 267 | root | localhost | sbtest | Sleep | 7 | | NULL |

  | 271 | root | localhost | sbtest | Query | 3 | Waiting for table metadata lock | alter table sbtest2 add test3 int |

  | 272 | root | localhost | NULL | Query | 0 | starting | show processlist |

  | 311 | root | localhost | NULL | Sleep | 413 | | NULL |

  +-----+------+-----------+--------+---------+------+---------------------------------+-----------------------------------+

  4 rows in set (0.00 sec)

  # SESSION D

  mysql> select * from information_schema.innodb_trx;

  Empty set (0.00 sec)

  其实SESSION A中的事务并未开启,但是由于select获取表元数据的语句,语法上是有效的,虽然执行失败了,但是任然不会释放元数据锁,故而导致SESSION B的alter动作被阻塞。

  通过SESSION D查看当前打开事务时,你会发现没有,从而找不到原因。所以当出现这种场景时,如何判断是哪个进程导致的呢,我们可以尝试查看表performance_schema. events_statements_current,分析进程状态来进行判断。

  mysql> select * from performance_schema. events_statements_current\G

  *************************** 1. row ***************************

  THREAD_ID: 293

  EVENT_ID: 32

  END_EVENT_ID: 32

  EVENT_NAME: statement/sql/select

  SOURCE: socket_connection.cc:101

  TIMER_START: 212721717099954000

  TIMER_END: 212721717213807000

  TIMER_WAIT: 113853000

  LOCK_TIME: 0

  SQL_TEXT: select error from sbtest2

  DIGEST: 0bbb2d5d1be45e77debea68111264885

  DIGEST_TEXT: SELECT ERROR FROM `sbtest2`

  CURRENT_SCHEMA: sbtest

  OBJECT_TYPE: NULL

  OBJECT_SCHEMA: NULL

  OBJECT_NAME: NULL

  OBJECT_INSTANCE_BEGIN: NULL

  MYSQL_ERRNO: 1054

  RETURNED_SQLSTATE: 42S22

  MESSAGE_TEXT: Unknown column 'error' in 'field list'

  ERRORS: 1

  然后找到其sid, kill掉该session,也可以kill掉DDL所在的session解决可以解决此问题。

  另外,测试时SESSION A要显式开启一个事务,否则查询会隐式回滚结束,无法重现上面的场景。SESSION B执行alter后,没有立即阻塞住,而是立马开始copy to tmp table,这个过程结束后,才进行了MDL锁等待。这怎么解释呢,应该是执行alter操作主要分为创建临时新表->插入老表的数据->临时新表rename to老表三个步骤,在这种情况下,到最后一步才需要MDL锁,所以copy过程中不会阻塞。由于没有查询在进行,而且查询也没有进入innodb层 (失败返回),所以show processlist和information_schema.innodb_trx没有可以参考的信息。

  出现以上几种情况时,这个时候如果进行如下操作就会引起MDL:

  创建、删除索引。

  修改表结构。

  表维护操作(optimize table、repair table等)。

  删除表。

  获取表上表级写锁 (lock table tab_name write)。

  使用Profile分析场景三:显示或者隐式开启事务后未提交或回滚,比如查询完成后未提交或者回滚,使用alter会被堵住

  # SESSION A

  mysql> set profiling=on;

  mysql> begin;

  mysql> select * from sbtest.sbtest2 limit 1;

  # SESSION B

  mysql> set profiling=on;

  mysql> alter table sbtest.sbtest5 add test2 int; //等待SESSION A执行完;

  # SESSION C

  mysql> set profiling=on;

  mysql> select * from sbtest.sbtest2 limit 1;

  # SESSION D

  mysql> set profiling=on;

  mysql> show processlist;

  +-----+------+-----------+------+---------+------+---------------------------------+------------------------------------------+

  | Id | User | Host | db | Command | Time | State | Info |

  +-----+------+-----------+------+---------+------+---------------------------------+------------------------------------------+

  | 325 | root | localhost | NULL | Query | 25 | Waiting for table metadata lock | alter table sbtest.sbtest2 add test5 int |

  | 326 | root | localhost | NULL | Query | 3 | Waiting for table metadata lock | select * from sbtest.sbtest2 limit 1 |

  | 327 | root | localhost | NULL | Query | 0 | starting | show processlist |

  | 328 | root | localhost | NULL | Sleep | 50 | | NULL |

  +-----+------+-----------+------+---------+------+---------------------------------+------------------------------------------+

  4 rows in set (0.00 sec)

  然后回滚SESSION A,等待SESSION B和SESSION C执行完,查看profile。

  查看SESSION A

  # SESSION A

  mysql> show profiles;

  +----------+-------------+------------------------------------------+

  | Query_ID | Duration | Query |

  +----------+-------------+------------------------------------------+

  | 1 | 42.81646375 | alter table sbtest.sbtest2 add test5 int |

  +----------+-------------+------------------------------------------+

  1 row in set, 1 warning (0.00 sec)

  mysql> show profile for query 1;

  +--------------------------------+----------+

  | Status | Duration |

  +--------------------------------+----------+

  | starting | 0.000069 |

  | checking permissions | 0.000004 |

  | checking permissions | 0.000004 |

  | init | 0.000005 |

  | Opening tables | 0.000128 |

  | setup | 0.000032 |

  | creating table | 0.000654 |

  | After create | 0.000053 |

  | Waiting for table metadata loc | 1.000069 |

  .....................

  | After create | 0.000013 |

  | Waiting for table metadata loc | 0.871435 |

  | After create | 0.000042 |

  | System lock | 0.000013 |

  | preparing for alter table | 0.002475 |

  | altering table | 9.752928 |

  | committing alter table to stor | 0.185624 |

  | end | 0.000021 |

  | query end | 0.000010 |

  | closing tables | 0.000007 |

  | freeing items | 0.000020 |

  | cleaning up | 0.000011 |

  +--------------------------------+----------+

  83 rows in set, 1 warning (0.00 sec)

  查看SESSION C

  mysql> show profiles;

  +----------+-------------+--------------------------------------+

  | Query_ID | Duration | Query |

  +----------+-------------+--------------------------------------+

  | 1 | 10.75216050 | select * from sbtest.sbtest2 limit 1 |

  +----------+-------------+--------------------------------------+

  1 row in set, 1 warning (0.00 sec)

  mysql> show profile for query 1;

  +--------------------------------+-----------+

  | Status | Duration |

  +--------------------------------+-----------+

  | starting | 0.000080 |

  | checking permissions | 0.000007 |

  | Opening tables | 0.000012 |

  | Waiting for table metadata loc | 10.751829 |

  | Opening tables | 0.000094 |

  | init | 0.000019 |

  | System lock | 0.000010 |

  | optimizing | 0.000004 |

  | statistics | 0.000011 |

  | preparing | 0.000009 |

  | executing | 0.000003 |

  | Sending data | 0.000040 |

  | end | 0.000006 |

  | query end | 0.000008 |

  | closing tables | 0.000008 |

  | freeing items | 0.000014 |

  | cleaning up | 0.000009 |

  +--------------------------------+-----------+

  17 rows in set, 1 warning (0.00 sec)

  从上述测试可以看出,SESSION C需要打开表时碰到了元数据锁。MySQL不论SESSION A执行的是select还是delete,此时alter table语句无法获取到metadata独占锁,会进行等待;所以会影响SESSION C的读取。

  这是最基本的一种情形,这个和MySQL 5.6中的online ddl并不冲突。一般alter table的操作过程中,在after create步骤会获取metadata独占锁,当进行到altering table的过程时(通常是最花时间的步骤),对该表的读写都可以正常进行,这就是online ddl的表现,并不会像之前在整个alter table过程中阻塞写入,当然并不是所有ALTER语句都支持online ddl。

  总之,alter table的语句是很危险的(其实他的危险其实是未提交事物或者长事务导致的),在操作之前最好确认对要操作的表没有任何进行中的操作、没有未提交事务、也没有显式事务中的报错语句。如果有alter table的维护任务,在无人监管的时候运行,最好通过lock_wait_timeout设置好超时时间,避免长时间的metedata锁等待。


来源:本文由E8运维原创撰写,欢迎分享本文,转载请保留出处和链接!