MySQL desu: December 2015

Monday, December 28, 2015

Finding international phone numbers mixed in with other numbers

Today I had a developer ask me to help write a "fun" query. He is new to MySQL wasn't sure how to do the same things he is familiar doing in PHP. The developer said he wanted to get a count for all the rows in a table where we store phone numbers to see which entries would meet the criteria for an international number. He wanted to first strip out all the extra stuff people might enter in a phone field such as commas, dashes, spaces, parenthesis, then filter out any places where the client had entered in words instead of numbers, blank numbers, null values, exclude numbers that start with +1 (because that is USA number), then trim off any numbers that start with 1 but don't already have a + in the beginning and don't count numbers that are 10 digits after stripping out the extra characters (because those should be considered USA numbers).

Here is an example table and sample data:

CREATE TABLE `contacts` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`phone` varchar(40) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `idx_phone` (`phone`)
) ENGINE=InnoDB;

INSERT INTO `contacts` (`id`, `phone`)
VALUES
(1, NULL),
(2, '(123)'),
(3, '+222'),
(4, '+44 20 7895 1232'),
(5, '+448717923642'),
(6, ',123'),
(7, '1 (818) 555-1211'),
(8, '119999999999'),
(9, '123numbers'),
(10, '1555555555555'),
(11, '55555555'),
(12, '5555555555'),
(13, 'numbers'),
(14, 'numbers123');

Here is the query that gets my count:

SELECT database() as `database`, COUNT(*) as count
FROM
( SELECT database() as `database`,
phone,
TRIM(
TRIM(LEADING '1' FROM
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
phone,
'(',''),
')',''),
'-',''),
' ',''),
',','')
)
) as phone_changed,
CHAR_LENGTH(
TRIM(
TRIM(LEADING '1' FROM
REPLACE(
REPLACE(
REPLACE(
REPLACE(
REPLACE(
phone,
'(',''),
')',''),
'-',''),
' ',''),
',','')
)
)
) as phone_changed_length
FROM contacts
WHERE phone NOT LIKE '+1%'
AND phone <> ''
AND phone IS NOT NULL
AND phone REGEXP '[0-9]+'
AND phone REGEXP '^[+]?[-(), 0-9]+$'
HAVING phone_changed_length <> 10
) as t

Trimming 1 off of the numbers may not be the perfect solution for getting rid of the 1 when someone puts it at the beginning of a US number but it worked fine for our situation. There are not any US area codes that start with 1 and it is mainly the US that uses the 1 for dialing a US phone number so we were not concerned that this count would be wrong. Example 2 and 6 in the above sample data will still get trimmed even though they are obviously not a valid US number and also not a international number and will still get counted.

The inner query is longer than it actually needs to be just to get this count, however, running the inner query by itself will allow you to see how the TRIM and REPLACE is stripping out some of the data and compare it to the original.

And to give credit where credit is due (needed to reference a few examples):

http://stackoverflow.com/questions/12652234/sql-query-to-remove-special-characters
http://stackoverflow.com/questions/5064977/detect-if-value-is-number-in-mysql
https://techjourney.net/remove-or-trim-first-or-last-few-characters-in-mysql-database-with-sql/
http://stackoverflow.com/questions/8844880/mysql-regexp-removing-white-space-and-non-numeric-characters
http://dev.mysql.com/doc/refman/5.7/en/regexp.html
http://stackoverflow.com/questions/15005916/get-number-of-rows-returned-in-mysql-query
http://www.w3resource.com/mysql/string-functions/mysql-trim-function.php
http://www.w3resource.com/mysql/string-functions/mysql-char_length-function.php

Tuesday, December 8, 2015

mysql dump triggers and stored procedures

I've been working on a little project to fix situations where we get incorrect definers on certain databases. The incorrect definer could be on an event, view, stored procedure, trigger, or a function. When fixing these problems it requires some effort because you cannot just edit a definer for triggers (even though you can for events and procedures). The trigger must be fully dropped and re-created with the new definer. To automate this task I've been writing a bash script to do the work.

Initially I had written a little bash function to DROP each function/procedure/trigger before restoring it. After writing the function I discovered that MySQL was automatically including a DROP IF EXISTS in the dump file for each stored procedure and each function. My bash function was redundant and not necessary. However, I wasn't seeing an equivalent DROP TRIGGER IF EXISTS command when I took a mysqldump of the triggers. After reading up on this, a lot of people were requesting this many years ago before I became a DBA: https://bugs.mysql.com/bug.php?id=34325

Lo and behold this feature was already in the version of MySQL I was running. I just needed to tell MySQL I wanted those statements included.

Here is the mysqldump command I've created for this task:

mysqldump -u user_name -p -h test_server.awesome.com --add-drop-trigger --databases my_test_db --no-create-info --no-data -R > backup_triggers_routines.sql

Explanation of each option:
--add-drop-trigger (this option tells it to add the DROP TRIGGER IF EXSTS syntax)
-- databases (this option allows me to specify more than one database, I'm only do one at a time)
--no-create-info (tell MySQL not to include CREATE TABLE info)
--no-data (tells MySQL not to include data because I only want the triggers/procs/functions)
-R (this is the same as --routines, tell it to also dump procedures/functions)

Note: Triggers are included by default but stored procedures and functions are not. If you wanted to you could include --triggers but it is not necessary.

If you wanted to dump only the routines (stored procedures/functions) you can use the --skip-triggers option to exclude triggers.

Seems silly that MySQL doesn't automatically include routines and then silly that it automatically includes the DROP syntax for routines but doesn't automatically include the DROP syntax for triggers.

Monday, December 7, 2015

pt-online-schema-change

Percona has a tool called the Percona Online Schema change which is part of the Percona Toolkit. I've heard it referred to as OSC. It is super useful for adding indexes, changing the storage engine of a table, adding a column, etc, when you want to be able to do the operation "online" without causing the table to lockup and essentially be unavailable for the duration of the operation. On large tables this can take a long time, I've done it on tables with 500 million rows and it has taken 6 hours to perform the operation.

The tool works by creating a new table, performing the changes on the empty table and then moving the data from the old table to the new table in chunks. Once the data is all moved over, the tool syncs up the two tables one last time so that the new data written to the old table while it was moving the other data gets moved over and then it renames the new table to the same name as the old table and then drops the old table. One of the added bonuses of doing all this work is the table also get optimized at the same time. If you do something on a large table that sees a lot of DELETES and UPDATES then is is likely fragmented. I've seen situations where a 100 GB table shrunk down to 50 GB after adding an index because it was so highly fragmented. However, within a few weeks the fragmentation was back.

The syncing of data works by means of triggers so the tool cannot be used on any tables that already have triggers. There are a number of options you can use when running the tool such as "Dry Run", and telling the tool if the server gets too busy to pause. You can also change the chunk size to be as small or large as you see fit. I typically use 10000 as the chunk size. I've tried a chunk size of 1000 and it is much slower for large tables. You could get away with using 50000 but I'd be careful of something too large.

I set a thresholds if 400 Threads_running which is really high in my examples. Threads_running of 50 might be high enough on your server that you would want the tool to pause and wait until the server is less busy to continue syncing data. The tool won't actually perform the operation unless you specify "--execute" (see below).

The tool also has a number of safety features where it checks replication lag on all the slaves and it won't run until replication lag catches up. Example output from using Online Schema Change is below.

Here are two examples I've done frequently:

Example adding an index to a table:

# non prod
## dry run
./pt-online-schema-change --host=non_prod.awesome.com D=my_non_prod_db,t=my_table --ask-pass -u'!me' --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --dry --alter 'ADD INDEX idx_colum1_column2 (column_1, column_1)' > np_add_index_dry_results.log

## actual
./pt-online-schema-change --host=non_prod.awesome.com D=my_non_prod_db,t=my_table --ask-pass -u'!me' --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --execute --alter 'ADD INDEX idx_column_1_column_2 (column_1, column_2)' > np_add_index_actual_results.log

# prod
## dry run
./pt-online-schema-change --host=prod.awesome.com D=my_prod_db,t=my_table --ask-pass -u'!me' --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --dry --alter 'ADD INDEX idx_column_1_column_2 (column_1, column_2)' > add_index_dry_results.log

## actual
./pt-online-schema-change --host=prod.awesome.com D=my_prod_db,t=my_table --ask-pass -u'!me' --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --execute --alter 'ADD INDEX idx_column_1_column_2 (column_1, column_2)' > add_index_actual_results.log

Example changing storage engine from MyISAM to InnoDB:

# non prod
## dry run
./pt-online-schema-change --host=non_prod.awesome.com D=my_database,t=my_table -u'!me' --ask-pass --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --dry --alter "ENGINE = InnoDB" > non_prod_dry_my_table_change_table_engine.log
# actual
./pt-online-schema-change --host=non_prod.awesome.com D=my_database,t=my_table -u'!me' --ask-pass --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --execute --alter "ENGINE = InnoDB" > non_prod_actual_my_table_change_table_engine.log

# prod
## dry run
./pt-online-schema-change --host=prod.awesome.com D=my_database,t=my_table -u'!me' --ask-pass --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --dry --alter "ENGINE = InnoDB" > prod_dry_my_table_change_table_engine.log
## actual
./pt-online-schema-change --host=prod.awesome.com D=my_database,t=my_table -u'!me' --ask-pass --critical-load Threads_running=400 --max-load Threads_running=375 --chunk-size=10000 --execute --alter "ENGINE = InnoDB" > prod_dry_my_table_change_table_engine.log

Found 5 slaves:
db_server_2
db_server_3
db_server_4
db_server_5
db_server_6
Will check slave lag on:
db_server_2
db_server_3
db_server_4
db_server_5
db_server_6

Example output when changing table from MyISAM to InnoDB:

Operation, tries, wait:
copy_rows, 10, 0.25
create_triggers, 10, 1
drop_triggers, 10, 1
swap_tables, 10, 1
update_foreign_keys, 10, 1
Altering `my_database`.`my_table`...
Creating new table...
Created new table my_database._my_table_new OK.
Altering new table...
Altered `my_database`.`_my_table_new` OK.
2015-12-07T17:53:43 Creating triggers...
2015-12-07T17:53:43 Created triggers OK.
2015-12-07T17:53:43 Copying approximately 188366 rows...
2015-12-07T17:54:07 Copied rows OK.
2015-12-07T17:54:07 Swapping tables...
2015-12-07T17:54:07 Swapped original and new tables OK.
2015-12-07T17:54:07 Dropping old table...
2015-12-07T17:54:08 Dropped old table `my_database`.`_my_table_old` OK.
2015-12-07T17:54:08 Dropping triggers...
2015-12-07T17:54:08 Dropped triggers OK.
Successfully altered `my_database`.`my_table`.

Tuesday, December 1, 2015

Why do developers sometimes forget to add a primary key to tables?

I work a lot with developers and I've discovered that their experience with data modeling or just database work in general ranges from knowing nearly nothing to being experts beyond my own experience. Some have taken courses in college, some are self trained and some have just picked things up on the job.

One of the first things I did two years ago when I started working at my company was to create a list of standards that all new database development in MySQL would adhere to. For example, all tables would use InnoDB as the default storage engine, utf8 as the default collation, no more enums allowed, all tables must have primary keys and lots of other standards.

The item about developers not putting primary keys on tables has been a mystery to me. It just seems very basic to me. We've got a legacy app that largely uses MyISAM tables and the developers that created these tables long before I ever got hired never put primary keys on some of them. There was no database code review process at that time and so they just got away with it.

I've been preaching ever since I got here that all the tables need to be converted to InnoDB and those tables need primary keys. But 24 months later, it still hasn't changed. If it ain't broke, don't fix it right?

I don't have any benchmarks but this is Percona's take scalability problems if your InnoDB tables don't have primary keys:

https://www.percona.com/blog/2013/10/18/innodb-scalability-issues-tables-without-primary-keys/
https://dzone.com/articles/innodb-scalability-issues-due

Here are some others:
http://www.psce.com/blog/2012/04/04/how-important-a-primary-key-can-be-for-mysql-performance/
http://blog.jcole.us/2013/05/02/how-does-innodb-behave-without-a-primary-key/

Finding tables is pretty easy, I copied some code from the data charmer blog to get this:

SELECT table_schema, table_name
FROM information_schema.tables
WHERE (table_catalog, table_schema, table_name) NOT IN
(SELECT table_catalog, table_schema, table_name
FROM information_schema.table_constraints
WHERE constraint_type in ('PRIMARY KEY'))
AND table_schema NOT IN ('information_schema', 'mysql');

Here I want to find all tables that do not have a PRIMARY key. Even if it has a UNIQUE key, I still want to get the table name.

One of the big use case on why the tables need to be converted is that several development managers want to try using Percona Cluster but having all these tables without Primary Keys isn't going to fly.