2011-12-09

The dangers of max_connections

We have a database that's drastically scaling up its query rate from many different clients. This typically means more client jobs and more parallelization across connections, so the first limit we hit is max_user_connections (in the mysql.user table), which we use to provide isolation between runaway users/jobs. We increase that, and we eventually start hitting server-level max_connection limits. We increased that in an emergency push last night, and several hours later performance fell off a cliff.

SHOW PROCESSLIST showed tons of processes in "Opening tables" and "closing tables" state. Our InnoDB IO stats had dropped dramatically; we clearly weren't making significant progress actually answering queries. A quick search turned up a Percona post about this, which pointed to table_cache. A quick look at SHOW VARIABLES LIKE 'table_cache' showed 91. Our config has this set to 2048; something had clearly gone wrong.

One hint was the error log message:
111209 11:08:21 [Warning] Changed limits: max_open_files: 8192  max_connections: 8000  table_cache: 91
It turns out that MySQL tries to be smart about table_cache, based on the open file rlimit. The formula is (max_open_files - max_connections - 10) / 2, which in this case is (8192 - 8000 - 10) = 182 / 2 = 91. 91 active tables is smaller than the hot set in this database, and when query rate crossed a line, the global open/close table lock and underlying syscalls couldn't keep up, and performance death-spiraled from there. We run-time set the limit higher with:
SET GLOBAL table_cache = 2048;
This overrides the automatic setting but opens the possibility of hitting the rlimit when opening tables. The real fix is to increase the rlimit to accommodate both pools of file descriptors.

2011-11-29

Converting subselects to joins, part 2

I previously discussed this in depth. However, today I saw a case that I didn't cover:

You have a table of Users and a table of Logins, with a row for each user login event. You're looking for users that have logged in within the last 31 days. The initial version of this I saw used a derived table:
SELECT
    UserId,
    LastLogin
  FROM Users
    JOIN (
      SELECT
          UserId,
          DATEDIFF(NOW(), MAX(TimeStamp)) AS LastLogin
        FROM Logins
        GROUP BY UserId
    ) AS Temp USING (UserId)
  WHERE LastLogin <= 31;
We can convert this to a simple JOIN with the magic of HAVING. HAVING is like WHERE, but applies after aggregation:
SELECT
    UserId,
    DATEDIFF(NOW(), MAX(TimeStamp)) AS LastLogin
  FROM Users
    JOIN Logins USING (UserId)
  GROUP BY UserId
  HAVING LastLogin <= 31;

Safe(r) data changes

It's a great goal to avoid making manual changes to your database. It never works 100%, though; there are always software bugs, unexpected interactions and other events that muck up your data, and you have to do one-off corrections. These are inherently hazardous: hard to test for unexpected data corruption, hard to apply consistently and hard to model the application behavior that results from them. Here are some strategies for the first issue, avoiding unexpected data corruption.
  1. Don't run one-off executables against your database. Instead, have the executable print out SQL that it would have run to update the database. If something goes wrong, you don't have to model the behavior of the program; you can just look at the SQL.
  2. Check the SQL files into source control somewhere. Manual changes tend to breed more manual changes to fix the fixes, so you never know when you'll want a record of what you twiddled in the past.
  3. Include all fields from the primary key in the WHERE clause. This ensures that each statement only modifies one row. Even if this results in a huge list of changes, at least you know exactly what changed.
  4. Include as many additional gating clauses as possible, linked with AND. For example, if you have a table of products and you want to set the price to 0.99 for everything that is currently set to 1.00, do:
    UPDATE Products SET Price=0.99 WHERE ProductId=2762 AND Price=1.00;
    This ensures that if something else changes Price just before you run your change, you don't destroy that update.
  5. Record the number of rows affected by each statement, in case something unexpected happens.
  6. Use transactions sensibly. Overly huge grouping of statements can block replication, but consider whether your changes will be toxic if partially applied.
  7. Stop running changes on errors or warnings and let a human examine the output. Warnings like string truncation can be a sign of a broken change.
#4 can be a challenge if your verification statements are complex. Consider, for example, if you want to update rows in table A for which there is exactly one row for a particular CustomerId. It's relatively easy to do a SELECT statement to verify this by hand:
SELECT
    CustomerId,
    COUNT(CustomerId)
  FROM A
  WHERE
    CustomerId IN (15, 16)
  GROUP BY CustomerId;
To verify this at UPDATE time, however, you either need a subselect or an intermediate table. We'll use the latter:
CREATE TABLE ScratchTable AS
  SELECT
      CustomerId,
      COUNT(CustomerId) AS Customers
    FROM A
    WHERE
      CustomerId IN (15, 16)
    GROUP BY CustomerId;

UPDATE A
  JOIN ScratchTable USING (CustomerId)
  SET Updated=1
  WHERE
    A.Id=3
    AND Customers=1;
The same trick works if your data change inserts new rows:
INSERT INTO A (CustomerId)
  SELECT CustomerId
    FROM ScratchTable
    WHERE
      CustomerId=15
      AND Customers=1;

2011-08-09

InnoDB as the default table type

We recently switched from MyISAM to InnoDB as the default table type. This affects CREATE TABLE without an explicit ENGINE=, as well as implicitly-created tables for sorting and such. Mark had previously discussed issues with this here, but we thought it was worth giving another try.

We've found a far more basic problem: KILL takes a long time. Consider the following:
CREATE TABLE scratch.MyShortLivedTable
  SELECT * FROM A JOIN B JOIN C....;

There's no ENGINE= there, so it uses the default table type. It seems to run at about the same speed as it did with MyISAM, but then you kill it. If it's MyISAM, the KILL takes effect immediately. If it's InnoDB, the CREATE is non-transactional and stays, but the implicit INSERTs have to be rolled back. If that query took 10 minutes, it may take 10 minutes to roll back, during which time the query is unkillable and MySQL won't shut down cleanly.

Our failovers rely on being able to kill all connections to a slave that they're promoting to master. Our backups rely on being able to shut down MySQL. We don't have a good answer for this, short of hardwiring ENGINE=MyISAM onto writes we know are going to be large.

2011-08-08

Database best practices for future scalability

There’s a perpetual debate about how much effort to put into scalability when first designing and building a modern web application. The opposing arguments are roughly: “Scalability doesn’t matter if your app isn’t successful; features should come first” vs. “When you know that you need to scale, you won’t have time to do it.” Below are some suggestions for scalable MySQL schema design that should get you on the right path, without being onerous enough to slow you down. Let’s start with things that are almost free:

  1. Use InnoDB. You’re eventually likely to want transactions, foreign keys, row-level locking or relative crash safety. You can Google and find lots of InnoDB vs. MyISAM comparisons, but if you don’t want to dig too deeply, “just use InnoDB” is a safe place to start.

  2. Don’t store data that you don’t intend to use relationally. Using MySQL as a key/value store or to store encoded data (XML, protocol buffers, etc.) in BLOB/TEXT may work, but dedicated key/value stores are likely to be more efficient.

  3. Try to design your schema with as few hierarchies as possible. For example, take the following table layout, with arrows representing many-to-one relationships:Perhaps A=”Users”, B=”DVDs Owned”, C=”Logins”, D=”Times Watched”, while E=”Administrators” and F=”Changes”. These are two hierarchies, since A and E have no links to each other. Minimizing the number of hierarchies (keeping it to just one is awesome!) makes it easier to shard later. Schemas with cross-links (say F also links to A, or a table records transfers between two different users, linking to A twice) are very difficult to shard.

  4. Use BIGINT UNSIGNED NOT NULL (64-bit required numeric) primary keys on every table. AUTO_INCREMENT is fine, at least to start. You can skip this for many-to-many link tables; just put both link columns in the primary key. Having single-column, numeric primary keys makes it easier to do things like drift checking and traversing between tables.

  5. Use BIGINT instead of INT for all keys. The cost in space (4 vs. 8 bytes) and compute time is so small that you’ll never notice it, but the cost of a schema change later to increase the field size, or an accidental wraparound, can be enormous. As you expand shards, your key space becomes sparse and grows rapidly, so wide keys are critical.

  6. Split off your data access code into an ORM layer, even if it’s very simple to begin with. This will be where your read/write split, shard lookups and change tracking will live later.

  7. Don't use triggers or stored routines. Keep this logic in your ORM layer instead, to give yourself a single point of truth and versioning system.

  8. Avoid subselects; use joins instead.

  9. Don’t use views, unless you’re using a third-party ORM (rails, django) that mandates a schema structure that isn’t ideal.

  10. Avoid network round-trips whenever possible. Use the multi-row insert syntax where possible. Enable the CLIENT_MULTI_STATEMENTS flag at connect time, then send groups of statements separated by ";".

Then, things that cost development time, in increasing order of complexity:
  1. Use foreign keys. Don’t make references nullable; use a flag field to mark whole (sub-)hierarchies deleted instead. Combined with the hierarchy rule above, this means that you guarantee yourself that you’ll never end up with orphaned rows.

  2. Write to masters; read from slaves. This can be quite complex, since you have to worry about replication delay. For example, you can’t have one web page hit cause a write, then the next hit render the results by reading from the database, because the result might not have replicated. However, this enables significant scaling, because hooking up more slaves is much easier than sharding.

  3. Don’t store event-based data as one row per event. If you record page views or clicks in the database, aggregate that data into one row per hour, or per day. You can keep logs of events outside of the database in case you need to change aggregation and re-generate historical data, but don’t keep every event in a hot table.

  4. Stop using AUTO_INCREMENT. Instead, keep a table IdSequences, and do something like:
    BEGIN;
    UPDATE IdSequences SET LastId=LAST_INSERT_ID(LastId+Increment)
      WHERE TableName='A' AND ColumnName='b';
    INSERT INTO A (b, c) VALUES (LAST_INSERT_ID(), ‘foo’);
    COMMIT;
    This lets you change IdSequences later to modify your sharding scheme.

  5. Create an empty shard (new database, same schema, no data) and add test rows. Teach your application to choose which shard to talk to. This will require some method to look up a shard for the root of each hierarchy; keep all data linked to a particular root on the same shard, so you can JOIN it. At its simplest, the lookup can be (ID mod NumShards). If you have uneven shard growth, you may need an indirection table to map from virtual shard (determined by modular division with a large divisor) to physical database.

2011-07-12

Converting subselects to joins

mysql> SELECT Title
-> FROM Articles
-> WHERE ArticleId IN (
-> SELECT ArticleId
-> FROM Views
-> );
+-------------------------+
| Title |
+-------------------------+
| Interesting things |
| More interesting things |
+-------------------------+
2 rows in set (0.00 sec)
What's wrong with this statement? It looks like it's trying to get a list of article names that have been viewed, and it seems to be doing its job. It's easy to read and to tell what's going on, even for someone with limited SQL experience. So what's there to fix?
Notice that there are two SELECT statements above. The latter is called a subselect or subquery. Just like parentheses in mathematical expressions ("5 * (2 + 8)"), you're walling off part of your statement and asking for it to be completely executed first. If that inner statement produces a huge data set (imagine this is views of, say, reddit articles), it's bad if you have to store that entire result before moving on to finding the associated articles. In reality, database engines can optimize this and be smarter than storing the whole result set, but there are no guarantees.
Fortunately, most subselects can be converted directly to joins. Let's look at a few simple examples. Given the tables:
CREATE TABLE Articles (
ArticleId bigint(20) unsigned NOT NULL AUTO_INCREMENT,
Title varchar(255) NOT NULL,
PRIMARY KEY (ArticleId)
) ENGINE=InnoDB;

CREATE TABLE Views (
ArticleId bigint(20) unsigned NOT NULL,
ViewedAt timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
KEY ArticleId (ArticleId),
CONSTRAINT Views_ibfk_1 FOREIGN KEY (ArticleId) REFERENCES Articles (ArticleId)
) ENGINE=InnoDB;
We'll start with a simple query:
SELECT Title
FROM Articles
WHERE ArticleId IN (
SELECT ArticleId
FROM Views
);
This is easy because it's a positive relationship: "IN" as opposed to "NOT IN". As a join, it looks like:
SELECT DISTINCT Title
FROM Articles
JOIN Views USING (ArticleId);
"DISTINCT" is required because Views -> Articles is many -> one, and we only want each article title once. We can use "USING" instead of "ON" because the column name is the same in both tables.
So, what if we have a negative query? Say we're looking for unviewed articles:
SELECT Title
FROM Articles
WHERE ArticleId NOT IN (
SELECT ArticleId
FROM Views
);
We can turn this into a join by using something called an outer join. Outer joins give us back all the rows in one table, then matching rows from another, or NULLs if they don't exist. An outer join between these two tables would look like:
mysql> SELECT Title, ViewedAt
-> FROM Articles
-> LEFT JOIN Views USING (ArticleId);
+-------------------------+---------------------+
| Title | ViewedAt |
+-------------------------+---------------------+
| Interesting things | 2011-07-12 14:09:28 |
| More interesting things | 2011-07-12 14:09:29 |
| More interesting things | 2011-07-12 14:09:31 |
| Rather boring things | NULL |
+-------------------------+---------------------+
4 rows in set (0.00 sec)
We can then filter back down to just unread articles. We'll also avoid referencing any columns but the ones we're already joining on:
mysql> SELECT Title
-> FROM Articles
-> LEFT JOIN Views USING (ArticleId)
-> WHERE Views.ArticleId IS NULL;
+----------------------+
| Title |
+----------------------+
| Rather boring things |
+----------------------+
1 row in set (0.00 sec)
Take an example query that is looking for all articles that have not been read since a certain timestamp:
mysql> SELECT Title
-> FROM Articles
-> WHERE ArticleId NOT IN (
-> SELECT ArticleId
-> FROM Views
-> WHERE ViewedAt > '2011-07-12 14:09:30'
-> );
+----------------------+
| Title |
+----------------------+
| Interesting things |
| Rather boring things |
+----------------------+
2 rows in set (0.00 sec)
This is slightly more complex to convert, because the naive conversion returns the wrong answer:
mysql> SELECT Title
-> FROM Articles
-> LEFT JOIN Views USING (ArticleId)
-> WHERE ViewedAt > '2011-07-12 14:09:30'
-> AND Views.ArticleId IS NULL;
Empty set (0.00 sec)
To solve this, we have to remember that we want the ViewedAt condition to be before the join, while the Views.ArticleId condition should be after. We can rewrite this to:
mysql> SELECT Title
-> FROM Articles
-> LEFT JOIN Views ON (Articles.ArticleId = Views.ArticleId
-> AND ViewedAt > '2011-07-12 14:09:30')
-> WHERE Views.ArticleId IS NULL;
+----------------------+
| Title |
+----------------------+
| Interesting things |
| Rather boring things |
+----------------------+
2 rows in set (0.00 sec)

2011-05-16

Paris

I'm on a two-week trip to Dublin for work and decided to break it up with a trip to a city that I'd never been closer to than the airport: Paris. Luckily, I have a French-speaking friend who's great company, so off we went.
As the airfare to Dublin was covered by Google, I decided to splurge a bit on Paris. We ended up staying at the Hotel Raphael, which was excellent throughout. It's beautiful, if a bit overwhelming for modern tastes. Having a great concierge and a room with everything at hand, however, makes a world of difference. Also, a balcony with a view of the Eiffel Tower: awesome.
Morgan arrived later than I did on Friday, so I ended up venturing out on my own to a restaurant with concierge directions and the ability to speak about 3 words of French. I ended up at Jamin for my first taste of real French food. Lesson #1: no one orders wine by the glass. Lesson #2: holy shit, this is good! French has never been my favorite genre for restaurants in the US, but this was amazing.
Morgan arrived around midnight, and after figuring out the horribly baroque lighting system in the hotel room, we turned in and got up early. We went first to the Jardin des Plantes, which is beautiful if a bit ill-cared-for. The menagerie is actually better kept than the botanical gardens, probably because it attracts a bigger crowd. If you go, do visit the greenhouses though.
We ate lunch looking out at Notre Dame Cathedral, then walked around it. The line was too long and we didn't head in. Instead, we headed up to Sacré-Cœur, with its amazing panoramic view of Paris. We stepped inside, but weren't terribly enthused; I don't think either of us is really a church person, even architecturally.
Of course, you can't go to Paris for the first time and not see the Eiffel Tower. Again we didn't go in/up because of the throng, but it's still impressive to see.
Then we were back to the hotel to clean up and change, and off to dinner with a quick stop at the rooftop bar.
We had dinner reservations at Restaurant Victor, described by the concierge as a "bistro"; Morgan scoffed. We arrived, the food arrived, Morgan no longer scoffed. I had an appetizer of morel mushrooms in cream sauce that I will remember for quite some time.
The next morning, we checked out and headed off to a tiny, unassuming building that marks the entrance to the Catacombs of Paris. Down a long staircase, though long, small tunnels, and then past the bones of 6 million former Parisians artfully arranged. The experience is surreal; while there are plenty of pictures online, the sheer scale of the thing in person is stunning. It seems to go on and back forever. What's also stunning is something that would never happen in the US: the only things separating you from the walls of bones are signs are signs reminding you not to touch.
After emerging into the sunlight slightly stunned, we took a quick trip to see the Louvre from the outside, then the hotel to pick up bags and off in our respective directions. I think I'll be headed back to Paris for a longer stay in the future. I seem to enjoy it very much.