Wednesday, July 27, 2016

Finding pesky utf8 data in your latin1 tables

For whatever reason, sometimes you get utf8 data written to a table or column that doesn't support it. It might get double encoded, turn into mojibake (garbled text), and so on. These values might break your application or cause other weirdness. Finding them can be a pain. Here is one way to do it.

CREATE TABLE `test` (
  `id` int(10) NOT NULL AUTO_INCREMENT,
  `name` varchar(100) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

insert into `test` (name) values ('x'), (unhex('c2a3'));


SELECT
name,
length(`name`) as ln,
char_length(`name`) as cl,
length(CONVERT(`name` USING utf8)) as u_ln,
char_length(CONVERT(`name` USING utf8)) as u_cl,
hex(name) as hx,
CONVERT(`name` USING utf8) as u_nm,
hex(CONVERT(`name` USING utf8)) as h_nm
FROM test;

After running this query you will notice that for plain latin/ASCII characters the byte length (LENGTH) and character length (CHAR_LENGTH) are the same, and they stay the same after converting to utf8. For utf8 data, however, the byte length grows after the value is converted to utf8: in the sample table, unhex('c2a3') is the raw utf8 encoding of the pound sign (£), so it shows a length of 2 but a converted length (u_ln) of 4, because each of the two bytes is treated as a separate latin1 character and becomes two bytes in utf8. This makes it easy to identify where you have utf8 data.



With this understanding, you can add a WHERE clause to show only the rows where the lengths differ, like this:

SELECT
name,
length(`name`) as ln,
char_length(`name`) as cl,
length(CONVERT(`name` USING utf8)) as u_ln,
char_length(CONVERT(`name` USING utf8)) as u_cl,
hex(name) as hx,
CONVERT(`name` USING utf8) as u_nm,
hex(CONVERT(`name` USING utf8)) as h_nm
FROM test
WHERE length(`name`) <> length(CONVERT(`name` USING utf8));
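
If you need to sweep a whole schema rather than a single table, one option is to let information_schema generate the per-column checks for you. This is only a sketch (your_schema is a placeholder for the real schema name), but it builds one check query for every latin1 text column:

SELECT CONCAT(
  'SELECT ''', table_name, '.', column_name, ''' AS col, COUNT(*) AS suspect_rows ',
  'FROM `', table_schema, '`.`', table_name, '` ',
  'WHERE LENGTH(`', column_name, '`) <> LENGTH(CONVERT(`', column_name, '` USING utf8));'
) AS check_query
FROM information_schema.columns
WHERE table_schema = 'your_schema'
  AND character_set_name = 'latin1';

Run the generated statements; any column reporting suspect_rows greater than zero deserves a closer look.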


Wednesday, July 20, 2016

TokuDB vs InnoDB

I had heard a great deal about TokuDB over the last couple of years and I've been dying to find a good use case for it. Ever since Percona bought TokuDB and included it in their distribution, I've been excited to test it.

I have a table that stores millions of rows, and hundreds of thousands to a million more get added every day. I thought it would be a good candidate for TokuDB. As a test, I installed Percona Server 5.6.25 and then installed the TokuDB plugins. Everything was working fine up to that point. My test VM has 4 GB of RAM and 4 CPUs. I left all the TokuDB settings at their defaults and let it automatically take 2 GB of RAM (50% of RAM) for the TokuDB cache. I changed all the tables from InnoDB to TokuDB and then lowered the InnoDB buffer pool. I'm no expert in TokuDB tuning, so I wasn't really sure what to change to make it perform better.
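
For reference, the engine switch itself is simple. Here is a minimal sketch (the table name is just an example, and it assumes the TokuDB plugins are already installed and enabled):

-- switch an existing table to TokuDB
ALTER TABLE stats_daily ENGINE=TokuDB;

-- check how much memory TokuDB will use for its cache (defaults to roughly 50% of RAM)
SHOW VARIABLES LIKE 'tokudb_cache_size';

-- and how much InnoDB is still holding on to
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';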

My process for loading data into the large table involves collecting stats from thousands of data sources, importing those data sources into thousands of tiny staging tables, and then merging all of that data into a single table. Every night when my job to import that data into the staging tables kicked off, the MySQL test instance with TokuDB would use up all the RAM and then crash. When I restarted the instance, TokuDB recovery would take several minutes and then it would happen again. I didn't have a lot of time to figure out what was going on, so I switched all my tables back to InnoDB, re-ran the tests, and didn't have any problems. I thought TokuDB would perform better "out of the box" with the default settings, but that was not the case.

I've read several of Baron Schwartz's blog posts, and he praises InnoDB for how reliable it is and how well it works. At this point I don't know what I would need to do to make TokuDB work better, but it appears that creating thousands of tiny tables (each with only a few hundred or a few thousand rows) that all use the TokuDB storage engine isn't a good use case. It seems the frequent dropping and creating of thousands of tiny TokuDB tables might cause RAM problems. I wish I had more time to dig deeper.

Friday, July 15, 2016

Killing lots of linux processes from the command line

I run a lot of bash scripts from the crontab to automate database tasks. Sometimes my bash scripts get carried away and don't finish. When this happens I need to kill them. I use pkill.

How to kill processes from the command line that match a pattern:

pkill -f my_pattern
Thank you to this post:
http://stackoverflow.com/questions/8987037/how-to-kill-all-processes-with-a-given-partial-name
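
If you are not sure what will match, it can be worth previewing first. A small sketch (my_pattern stands for whatever string shows up in the command line of the runaway scripts):

# list the PIDs whose full command line matches the pattern
pgrep -f my_pattern

# ask them to exit (SIGTERM), then force-kill anything that refuses
pkill -f my_pattern
sleep 5
pkill -9 -f my_pattern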

Wednesday, July 6, 2016

Two ways to convert to utf8

I'm in the middle of a project to convert lots of legacy tables from latin1 to utf8. Here are two ways of converting specific tables to utf8. The same commands work for utf8mb4; just change the CHARACTER SET and COLLATE values to their utf8mb4 equivalents.
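
For example, with utf8mb4 the schema-level statement from Option 1 below becomes:

ALTER DATABASE databaseA CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;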

Option 1:

-- Change the schema default to utf8

ALTER DATABASE databaseA CHARACTER SET utf8 COLLATE utf8_unicode_ci;



-- Convert the defaults for the table to utf8

ALTER TABLE tableA CHARACTER SET = utf8, COLLATE = utf8_unicode_ci;



-- Convert specific column(s) to utf8

ALTER TABLE tableA MODIFY value VARCHAR(40) CHARACTER SET utf8 COLLATE utf8_unicode_ci;

What is good about this option is that it preserves the other character sets on the same table. Maybe you need to keep latin1 on a specific column, or you want a case-sensitive column that uses some special collation; doing the change like this will not touch those other columns.
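
To see what each column is currently set to (before or after the change), you can query information_schema, using the same databaseA/tableA names as in the examples:

SELECT column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'databaseA'
  AND table_name = 'tableA';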

Option 2:

-- Change the schema default to utf8

ALTER DATABASE databaseA CHARACTER SET utf8 COLLATE utf8_unicode_ci;



-- Convert the table and all columns in it to utf8

ALTER TABLE tableA CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

This option converts all columns to utf8. This is nice because it catches every column, so you don't have to do each one separately. However, if you want to keep a specific column at a different character set or collation, then you would want to use Option 1.
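
Whichever option you use, a quick way to confirm the result is to look at the table definition afterward; columns that still carry their own CHARACTER SET in the output are the ones that were left on a different character set:

SHOW CREATE TABLE tableA;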


For an excellent primer on utf8 and MySQL, this is a great source:

http://mysql.rjweb.org/doc.php/charcoll