I have a table at work in a legacy database. It is part of a sharded database setup, so there are thousands upon thousands of copies of this table with different data across all our customers. We allow customers, via the app, to put whatever they want in this table, so it ends up with crap data sometimes. What is worse, the table doesn't support utf8 characters, but clients somehow still end up trying to write utf8 characters into it. Below is an example of the table and the type of data that might be in it. As you can see, there is a unique constraint across the "custom_id" and "value" columns. If you look through the list you will see what looks like a space at the end of some of the words; it is actually some type of blank character, which means those rows don't break the unique constraint.
SET NAMES 'latin1';
CREATE TABLE `my_table` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`custom_id` int(10) unsigned NOT NULL DEFAULT '0',
`value` varchar(100) NOT NULL DEFAULT '',
PRIMARY KEY (`id`),
UNIQUE KEY `idx_uq_cutom_id_value` (`custom_id`,`value`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `my_table` (`custom_id`, `value`)
VALUES
(34, 'Outside of the U.S.'),
(142, 'AutomotiveÊ'),
(142, 'ConstructionÊ'),
(142, 'EnergyÊ'),
(142, 'InsuranceÊ'),
(142, 'ManufacturingÊ'),
(142, 'Non-ProfitÊ'),
(142, 'Real EstateÊ'),
(142, 'RetailÊ'),
(142, 'TelecommÊ'),
(142, 'Manufacturing'),
(142, 'Telecomm'),
(142, 'Automotive'),
(34, 'Outside of the U.S.Ê'),
(142, 'Insurance'),
(142, 'Real Estate'),
(142, 'Construction'),
(142, 'Non-Profit'),
(142, 'Energy'),
(142, 'Retail');
This is what the data looks like after it is sorted by custom_id and value: every junk row lands right next to its clean twin.
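A quick way to see those near-duplicates side by side is to sort and hex-dump the column; the HEX() call is my addition, just to expose the stray trailing byte:
SELECT custom_id, value, HEX(value)
FROM my_table
ORDER BY custom_id, value;
-- 142  Manufacturing    4D616E75666163747572696E67
-- 142  ManufacturingÊ   4D616E75666163747572696E67CA
Every "blank" suffix shows up as a trailing 0xCA byte, which is why the unique key treats each pair as two distinct values.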
The app has code which writes a MySQL statement like this:
SET NAMES 'utf8';
SELECT value
FROM my_table
WHERE custom_id = 142 AND value = 'Manufacturing�'
Because of that funky utf8 character at the end of the SELECT statement, we frequently get the following error:
Illegal mix of collations (latin1_swedish_ci,IMPLICIT) and (utf8_general_ci,COERCIBLE) for operation '='
From a DBA perspective, what am I supposed to do here to fix this problem? The table and column the query is working against don't support utf8, so I keep getting that error.
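To see where the two collations in the error are coming from, you can compare the connection character set with the column's; something like:
SHOW VARIABLES LIKE 'character_set%';
-- character_set_client / _connection / _results are utf8 after SET NAMES 'utf8'
SHOW FULL COLUMNS FROM my_table;
-- the value column reports latin1_swedish_ci
The string literal arrives with the connection's utf8_general_ci collation (COERCIBLE), the column carries latin1_swedish_ci (IMPLICIT), and the '=' comparison refuses to mix the two.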
If I convert the table and column to utf8, the error stops occurring:
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
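As a quick check that the conversion took, SHOW CREATE TABLE should now report utf8 on the table (and, implicitly, on the value column):
SHOW CREATE TABLE my_table;
-- ... ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci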
Then my data would look the same as before: the conversion keeps every character intact, so the junk 'Ê' values are still there, just stored as utf8.
But then I will still have all that leftover crap data (which might be fine). I came up with a couple of ideas, though I am not sure how good they are: either delete the rows with the bad data, or update them to clean up the bad data and then append something to keep them unique.
Here I clean up the data by converting it from latin1 to binary and then to utf8. This gets rid of the junk characters, but it creates duplicates, so I can't add the unique key back as-is. In my case there are child tables that reference the id column, so I cannot delete the rows; instead I append a period to the later duplicate (the row with the higher id) in each group.
ALTER TABLE `my_table` DROP INDEX `idx_uq_cutom_id_value`;
ALTER TABLE my_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
UPDATE my_table SET
value = CONVERT(CAST(CONVERT(value USING latin1) AS BINARY) USING utf8);
UPDATE my_table SET
value = CONCAT(value,'.')
WHERE id IN (SELECT * FROM (SELECT MAX(id) AS max_id
                            FROM my_table
                            GROUP BY custom_id, value
                            HAVING COUNT(*) > 1) t);
ALTER TABLE `my_table` ADD UNIQUE INDEX `idx_uq_cutom_id_value` (`custom_id`, `value`);
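Two sanity checks are worth running alongside that script: a dry run of the round-trip conversion before the UPDATE, to preview what the values will become, and a duplicate count before the final ALTER re-adds the unique key, which should return zero rows:
SELECT value,
       CONVERT(CAST(CONVERT(value USING latin1) AS BINARY) USING utf8) AS cleaned
FROM my_table;
SELECT custom_id, value, COUNT(*) AS cnt
FROM my_table
GROUP BY custom_id, value
HAVING cnt > 1;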
If it were permissible to DELETE the duplicate rows, I could do so like this:
DELETE FROM my_table
WHERE id IN (SELECT * FROM (SELECT MAX(id) AS max_id
                            FROM my_table
                            GROUP BY custom_id, value
                            HAVING COUNT(*) > 1) t);
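If the child rows were the only thing standing in the way of the DELETE, another option would be to re-point them at the surviving duplicate first. A rough sketch, assuming a hypothetical child table my_child_table with a my_table_id column (both names are made up for illustration):
UPDATE my_child_table c
INNER JOIN my_table dup ON dup.id = c.my_table_id
INNER JOIN (SELECT custom_id, value, MIN(id) AS keeper_id
            FROM my_table
            GROUP BY custom_id, value) k
        ON k.custom_id = dup.custom_id AND k.value = dup.value
SET c.my_table_id = k.keeper_id
WHERE dup.id <> k.keeper_id;
After that, the DELETE above can remove the duplicates without orphaning anything.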