MySQL desu: May 2016

Wednesday, May 25, 2016

enums in MySQL - why you should not use them

ENUMS are evil. When I started working as a MySQL DBA at my current company, our tables were rampant with enums. The dev managers wanted me to review all the database changes that went from dev to alpha and eventually to production for all the developers. As part of my standards, I prohibited developers to add enums to any new tables.

I've copied from this blog the following 8 reason not to use enums:
http://komlenic.com/244/8-reasons-why-mysqls-enum-data-type-is-evil/

1. Data isn't being treated like data.

Male/Female, Mr/Mrs/Ms, Africa/Asia/etc: these bits of text that people use ENUM columns for are data. When you use an ENUM column, you're technically moving data from where it belongs (in actual database fields), to somewhere it doesn't (into the database metadata, specifically a column definition). This is different than putting constraints on the data, which is what we are doing when we say that a numeric column can only hold an integer, or that a date column can't be null - that's fine and quite necessary. With an ENUM we're actually storing pieces of data in a place that was only intended to hold crucial information about the model. In short, an ENUM column violates the rules of normalization. This may seem academic or pedantic, but it is actually why a lot of the other reasons on this list can be problems!

2. Changing the member list of ENUM columns is very expensive.

Invariably, what happens is this: you create an ENUM column and say "no way NEVER will this list change or need added to". But humans are really poor at estimating the entire scope of something, and even worse at predicting the future. R&D dreams up a whole new product type. Your company adds another shipping method. North America crashes into Asia.

The problem is that changing the member list for an ENUM column restructures the entire table with ALTER TABLE, which can be very expensive on resources and time. If you have ENUM('red', 'blue', 'black') but need to change it toENUM('red', 'blue', 'white'), MySQL needs to rebuild your table and look through every record to check for the now-invalid value 'black'. MySQL is literally dumb and will even do this when all you did was add a new value to the end of the member list! (It is rumored that appending an ENUM member list will be handled better in the future, but I doubt that this is a high priority feature.)

A full-table rebuild may not cause much pain on a small table, but on a large one it is possible to peg your resources for a long time. If you use a reference table instead of ENUM, changing the list is as simple as INSERT, UPDATE, or DELETE, laughably-cheap operations by comparison. It's also important to note that when altering an ENUM member list, MySQL converts any existing record values that are not included in the new ENUM definition to ' ' (an empty string). With a reference table, you have greater flexibility when renaming or eliminating list choices (more on this below).

3. It's impossible to add additional attributes or related info.

Adding related info to a reference table

There just isn't any sane way to add related information to an ENUM column, which is a common scenario that often comes up. In our country/continent example, what happens when we need to store something like land area for each continent? We didn't envision needing that attribute, but now we do. With a reference table, we can simply extend the continent table to include a 'land_area' column and query this new data any way we would like. With an ENUM? Forget it.

Other awesome flexibilities exist due to the ability to easily extend a reference table. One common scenario is adding a column to set a flag to denote whether a choice in the reference table is discontinued. So, when your company stops selling black widgets, you can add an 'is_discontinued' column to the reference table and flag the old 'black' row. You can still query a list of currently offered colors, and maintain info about all your old orders of black widgets! Try that with an ENUM column.

4. Getting a list of distinct ENUM members is a pain.

A very common need is to populate a select-box or drop down list with possible values from the database. Like this:

Select color:

If these values are stored in a reference table named 'colors', all you need is:SELECT * FROM colors ...which can then be parsed out to dynamically generate the drop down list. You can add or change the colors in the reference table, and your sexy order forms will automatically be updated. Awesome.

Now consider the evil ENUM: how do you extract the member list? You could query the ENUM column in your table for DISTINCT values but that will only return values that are actually used and present in the table, not necessarily all possible values. You can query INFORMATION_SCHEMA and parse them out of the query result with a scripting language, but that's unnecessarily complicated. In fact, I don't know of any elegant, purely SQL way to extract the member list of an ENUM column.

5. ENUM columns may only offer limited or negligible effects on optimization.

The usual justifications for using ENUM, are centered around optimization, in the conventional sense of performance gains, and sometimes in the sense of simplifying a complicated model to be more comprehensible.

Let's look at performance. You can do a surprising number of un-optimized things with databases, but most won't affect performance until a certain scale is reached, and often our applications are never asked to scale up that far. This is important to remember because DB devs should aspire to design fully-normalized and only de-normalize when a performance problem becomes real. If you're concerned that a reference table is going to slow things down, benchmark it out both ways in your unique application on an actual dataset (or a realistic high-estimate fake dataset) and see. Just don't automatically assume a join or a reference table is going to be a bottleneck, because it probably isn't. (There is also evidence to support that ENUM isn't always appreciably faster than alternatives.)

The second optimization argument for ENUM is that it reduces the number of tables and foreign keys in your database. This is a valid argument, in the sense that it's one more little box joined to another box with some lines, and in large systems the effect of normalization can already tax the limits of human comprehension and complicate queries. This is however, why we make models, and why those models employ abstraction so we can understand them. Go ahead and draw up a new representation of your model or ER diagram that leaves out some of the little details and reference tables. Sometimes it may just seem easier to use an ENUM, but the fact that you think another reference table makes things too complicated isn't a good reason by itself.

6. You can't reuse the member-list of an ENUM column in other tables.

When you create a list of possible members in an ENUM column, there's no easy and consistent way to re-use that list in other tables. With a reference table, the same set of data can be related to as many other tables as required. Changing the list in the lone reference table, will change the available options in every other table that it is linked or joined to.

A reference table can easily be linked to multiple tables

With separate ENUM columns, you would have identical duplicate member lists on two different columns in at least two different tables (that would all require consistent updating).

7. ENUM columns have noteable gotchas.

Suppose you have ENUM('blue', 'black', 'red') and you attempt to insert 'purple': MySQL actually truncates the illegal value to ' ' (an empty string). This is correct, but if we had used a reference table with a foreign key, we would have more robust data integrity enforcement.

Also, MySQL stores enum values internally as integer keys to reference ENUM members. It's easy to end up referencing the index instead of the value and vice-versa. Consider:

CREATE TABLE test (foobar ENUM('0', '1', '2'));

mysql> INSERT INTO test VALUES ('1'), (1);

Query OK, 2 rows affected (0.00 sec)

Records: 2  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM test;

+--------+

| foobar |

+--------+

| 1      |

| 0      |

+--------+

2 rows in set (0.00 sec)

We inserted '1' (a string), and accidentally also inserted 1 (as a number, without quotes). MySQL correctly (but confusingly) uses our number input as an internal reference to the first item in the member list (which is actually the value '0').

8. ENUM has limited portability to other DBMS.

The ENUM data type isn't standard SQL, and beyond MySQL not many other DBMS's have native support for it. PostgreSQL, MariaDB, and Drizzle (the latter two are forks of MySQL anyway), are the only three that I know of. Should you or someone else want to move your database to another system, someone is going to have to add more steps to the migration procedure to deal with all of your clever ENUMs. If it's you, you'll undoubtedly feel less clever than you once did - and if it's someone else, they may not like you. Generally, migrating to a different database system is something that just doesn't happen that often and everybody assumes will bring out demons anyway, which is why this just squeaks in at number 8 on the list.

Criteria for when it might be okay to use enum:

1. When you're storing distinct, unchanging value sets...

A fairly good example that meets this criteria is our list of the continents. These are well-defined. Other commonly-given examples are salutations: Mr/Mrs/Ms, or playing card suits Spades/Hearts/Diamonds/Clubs. However, consider that even these examples have scenarios where you may need to extend the member list (such as when someone demands that you now need a 'Dr.' salutation, or when your card game app needs to accommodate a non-suited card like the Joker).

AND 2. You will never need to store additional related info...

Consider again Spades/Hearts/Diamonds/Clubs. There are popular card games that rely on the fact that clubs/spades are black and hearts/diamonds are red (Euchre, for example.) What happens when we need to store additional info related to suit, such as its color? If we had used a reference table, it would be a trivial matter to add this color data to the reference table in an additional column. If we use an ENUM to represent suit, it becomes much more difficult to represent the color/suit model accurately, and we're going to have to enforce it on the application level.

AND 3. The member list will contain more than 2 and less than 20 items.

If you're using an ENUM for only two values, you can always replace the ENUM with a very efficient TINYINT(1) or the even-better BIT(1) available since MySQL 5.0.3. For example: gender ENUM('male', 'female') can be changed to: is_male BIT(1). When you only have two choices, they can always be expressed as a Boolean true/false by prepending "is" to one of the member strings and renaming the column. As for less than 20: Yes, ENUM can store up to 65,535 values. No, you shouldn't try it. More than 20 becomes unwieldy and certainly more than 50 is just insane to manage and work with.

If you really still want to use ENUM:

1. Never use numbers as enum member values.

There's a reason ENUM is a string data type. Not only should you be using a numericdata type to store numbers, but ENUM has some well-documented gotchas related to the fact that MySQL references ENUM members internally using a numerical index. (See #7 above.) Just don't ever store numbers in an ENUM data type, ok?

2. Consider using strict mode.

Strict mode will at least throw an error when you try to insert an invalid value into an ENUM column. Otherwise only a warning is thrown and the value is simply set to an empty string ' ' (referenced internally as 0). Note: Errors can still be suppressed in strict mode if you use IGNORE.

Conclusion

Do what makes sense from a development/maintenance perspective, and optimize only once a performance problem becomes real - in most cases that is a strong argument for using reference tables over MySQL's ENUM datatype.

Thursday, May 19, 2016

Why I don't like doing in place database upgrades

I've copied this directly from here since I think it is such a good answer:

http://serverfault.com/questions/224771/reasons-why-in-place-upgrades-are-bad

I would be surprised if any system administrator likes in-place upgrades. The simple truth is that they are not and can never be reliable. First you have to make the huge assumption that every software package initially installed on a system was packaged 100% correctly. This is never true. There are always packages with logic errors in them or that write to files they forget to mention in their package specfile. Writing package scripts is an incredibly unglamorous job, and is always assigned to the most junior developer on any software team.

Then, you have to assume that the host was correctly managed over time. Nobody ever went in and manually modified configuration or manually installed binaries that conflict with the packaged binaries. That never happens, right?

Remember that installing all these new packages means shutting down the old versions of the packages first. Package shutdown scripts are among the most poorly tested of all software. Nobody wants to deal with the incredibly boring and unthinkable case of their precious software package being deactivated and discarded. Plus there's a huge gotcha in this - shutting down packages which contain shared libraries means doing some very clever work to shut down packages which have those libraries loaded in to memory, or swapping running libraries in memory. That is extremely hard to get right.

Finally you have to assume that the people who produced the new release for the system thoroughly tested all possible combination of upgrades and not just the simplest case of upgrading the immediately previous version to this one. You have to believe that somehow they were able to devote the essentially infinite resources necessary to deal with every possible upgrade path.

As you can see, that is one giant (steaming) heap of assumptions. In practice none of those conditions are ever true. People install their own binaries on machines all the time. Package install and removal scripts are always buggy (I could write a whole dissertation on that one issue alone). The net result is that in-place upgrades are a fool's errand.

I should also note that my team performed a major OS and software upgrade of 7500 servers at my work last year. We tried so hard to make an in-place upgrade work but ultimately it just didn't, for many of the reasons listed above. We ended up wiping all the machines and installing the OS and all software from scratch. Problem solved.

All of this leads to my joke that the software release manager's mantra is Nuke it from orbit -- it's the only way to be sure. Completely wiping and reinstalling systems is the only sure path to success. You can get by with incremental upgrades most of the time, but eventually it's time to start fresh.

This is more SQL Server focused but still a good answer:

http://dba.stackexchange.com/questions/29328/are-sql-server-in-place-upgrades-as-ill-advised-as-they-used-to-be

Why I Prefer A Migration to an In-Place Upgrade

Easier Rollback - If something goes wrong you can rollback by simple saying "we aborted the upgrade.. Please change connection strings to old server while we resolve this". With an in-place you are fixing it or you are down.
Refresh Hardware - Hardware changes rapidly. You can easily get stuck on hardware that was right for your company 4 years ago but not right for today and the next four years with an in-place upgrade. You will likely have to do a migration at some point anyway for the new hardware.
Feel Better - Sure... This one is subjective, but it feels good knowing you are starting with a new OS installation, a new SQL installation with no cobwebs from the person on the job before you (or you before you knew what you knew today) that could possibly cause you headaches in the future.
New OS - A migration gives you a chance to start with a new OS version if you aren't on the latest and greatest today.
You Can Test It - Ever want to get a set of baselines on a new machine before you install SQL and cloud it up with databases and usage? You can do that now.
Sometimes it is easier to sneak in best practices - Maybe things have been sort of hacked together to make it work before. You can fix that all and start fresh.
Free test environment and extra sleep - It is a great benefit to have an environment you can work in ahead of the actual cutover day when you make this new environment live. Doing a migration to a new environment means you can build it during business hours, well ahead of your actual cutover day and test it in many ways ahead of time. You can run full regression testing on all applications and systems for days and have a great peace of mind before you actually do the final set of restores/attaches and cutover all applications and access to the new environment.
You don't have to do it all at once - A very common situation I bump into is an environment that is trying to consolidate to just a few instances. Perhaps one per version, perhaps one per "tier" and version. A lot of these projects have different timelines for various applications and databases based on testing, project plans and vendor certification timeliness. Doing the migration means you can move those databases that are ready, when they are ready and still handle requests for those databases that can't move for one reason or another.

Friday, May 13, 2016

Finding the time difference on two date time or timestamp columns

Many times I need to get the difference between two times on a MySQL table or even multiple tables. Here are couple ways to do that...

Here is one I wrote this week:

SELECT *, TIME_TO_SEC(TIMEDIFF(end, started)) as diff_sec
FROM my_table
WHERE `started` >= '2015-12-17 05:00:00' AND `started` <= '2015-12-17 06:00:00'
HAVING diff_sec > 5

Thanks to stack overflow for the examples:
http://stackoverflow.com/questions/3528219/mysql-how-to-get-the-difference-between-two-timestamps-in-seconds

I have a bash script that I run everyday against all our database tables to collect sizes and row counts. Part of the script is determining how long since the script last ran to collect all that information. I store that time difference with the data so I know that the growth happened in the last 24 hours. This is how I do it:

SET @time_diff=0;
SET @time_diff=(SELECT TIMESTAMPDIFF(MINUTE,MAX(ts1.time_created),stgts.time_created) as minutes_since_last_timestamp
FROM today_stats_table_a stgts
INNER JOIN yesterday_stats_table_a ts1 ON (stgts.table_name = ts1.table_name AND stgts.schema_name = ts1.schema_name AND stgts.server_name = ts1.server_name));

UPDATE today_stats_table_a stgts
SET stgts.minutes_since_last_timestamp = @time_diff;

Here is an amazing tutorial with examples on this page for using TIMESTAMPDIFF:
http://www.w3resource.com/mysql/date-and-time-functions/mysql-timestampdiff-function.php

And there lots of other functions that deal with time to play with also:
http://dev.mysql.com/doc/refman/5.7/en/date-and-time-functions.html

Tuesday, May 10, 2016

Binary logs eating up your space? Turning off logging and purging logs

I work a lot with developers who have non prod testing environments for their application. Frequently we set up replication for testing something and then the developer forgets about it. Weeks later the test system is full of binary logs because they are not expiring them or they performed an operation that wrote a lot of binary log files.

You may see a large amount of bin files in the MySQL data directory named "server-bin.n" or mysql-bin.00000n, where n is a number that increments. These are typically in a directory at /var/lib/mysql.

The binary log stores all the DDL and DML operations (INSERT, UPDATE, DELETE, CREATE, DROP...).

The binary log can be used for Data Recovery if you haven't taken a backup in the last few hours and you need to revert back to a specific point in the time before something happened within the last time you took a backup. The binary logs also allows replication to work by recording the DDL/DML statements so they can replayed on the slave servers.

If you are not using replication and don't care about performing a data recovery then you should not have bin logging turned on and should turn it off by editing the my.cnf file. Look for an entry called "log_bin" and comment it out. There may also be entires related to binary logging that start "bin" or "expire_log_days" or "max_binlog_size". These can all be commented out or removed (be sure to backup the file before making changes).

After saving the file restart MySQL.

If you want to simply leave binary logging on but purge the files there are two ways to do this.

1. Purge up to a specific file

Example:
PURGE BINARY LOGS TO 'mysql-bin.09';

2. Purge up to date

Example:
PURGE BINARY LOGS BEFORE '2015-12-17 11:11:11';