Monday, September 20, 2010

MySQL part 3: optimization hints.

Database performance is crucial for the usability of a website. Everyone knows that response time is one of the most important usability factors, and everyone also knows that the database is often the bottleneck. In the worst case it may even make the website inaccessible.

Let’s have a closer look at this problem.

I again used my sample Lost database, but with a greater number of records (the pictures table now has 1 000 000 records and the people_pictures table has 3 000 000 records).

[Diagram: the Lost database schema (lost-basic)]

The size of the database is now 118.3 MB.

Here you can find a simple PHP script that generates this number of records.
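The linked script itself is not reproduced here. As a rough sketch of the same idea done purely in SQL (column names assumed from the schema diagram), a stored procedure can generate random rows in a loop:

```sql
-- Hypothetical bulk-data generator; column names and value ranges are assumptions.
DELIMITER $$
CREATE PROCEDURE fill_pictures(IN how_many INT)
BEGIN
  DECLARE i INT DEFAULT 0;
  WHILE i < how_many DO
    INSERT INTO pictures (filename, date, type_id, place_id)
    VALUES (CONCAT('picture', i, '.jpg'),
            NOW() - INTERVAL FLOOR(RAND() * 365) DAY,  -- random date in the last year
            FLOOR(1 + RAND() * 4),                     -- random type
            FLOOR(1 + RAND() * 3));                    -- random place
    SET i = i + 1;
  END WHILE;
END$$
DELIMITER ;

CALL fill_pictures(1000000);
```

Inserting row by row like this is slow for a million records; wrapping the loop in a transaction or using multi-row INSERTs speeds it up considerably.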

Imagine that we are now about to create a simple website for this collection. The typical queries we could use are as follows:

Query 1: Show all types and the number of pictures of each type. (We want to use this to create a side menu where each type name is a link, with its number of pictures given in brackets.)

select types.name, count(pictures.id) from types, pictures where pictures.type_id=types.id group by types.name;

Query 2: Show all characters with the number of pictures in which they are tagged. (This query can also be used for another type of side menu.)

select people.name, count(people_pictures.id) from people join people_pictures on (people.id=people_pictures.people_id) group by people.name;

Query 3: Show the 20 newest pictures (by creation date) with all the characters tagged in each picture (this can be the main element of the website).

select filename, group_concat(name) from people_pictures join pictures on (people_pictures.picture_id=pictures.id) join people on (people_pictures.people_id=people.id) group by filename order by date desc limit 20;

These queries are very simple, but they perform very badly. (On my computer: Query 1 – 22 s, Query 2 – 84 s, and Query 3 reaches the time-out of 240 s.)

Design makes a difference

Let's start by looking at the data types we use. I purposely created the first version of the database with "default" types and sizes. Most of them can be replaced with something lighter. For example: in the tables types, places and people, the id column will rarely exceed 100 distinct values, so we can use TINYINT instead of INT. We can also limit all the name columns to 25 characters. All the changes are shown in the picture below:
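As a sketch (the exact original column definitions are assumed, and existing foreign-key constraints may require extra steps), the narrowing could be done with ALTER TABLE statements like these:

```sql
-- Shrink surrogate keys and name columns; assumes the originals were INT / VARCHAR(255).
SET FOREIGN_KEY_CHECKS = 0;  -- child columns are changed in the same batch

ALTER TABLE types  MODIFY id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
                   MODIFY name VARCHAR(25) NOT NULL;
ALTER TABLE places MODIFY id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
                   MODIFY name VARCHAR(25) NOT NULL;
ALTER TABLE people MODIFY id TINYINT UNSIGNED NOT NULL AUTO_INCREMENT,
                   MODIFY name VARCHAR(25) NOT NULL;

-- The referencing columns must match the new parent type.
ALTER TABLE pictures        MODIFY type_id   TINYINT UNSIGNED,
                            MODIFY place_id  TINYINT UNSIGNED;
ALTER TABLE people_pictures MODIFY people_id TINYINT UNSIGNED;

SET FOREIGN_KEY_CHECKS = 1;
```

Each ALTER rewrites the whole table, so on a large database this is best done once, up front.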

[Diagram: the optimized column types (fistopt)]

This way we already gain some time on the first two queries: Query 1 now takes 2 s and Query 2 takes 26 s. Query 3 still reaches the time limit.

More information about MySQL data types is here.

Indexes

Indexes should be created carefully. In most cases, when added correctly, they can dramatically speed up queries. Generally speaking, we should create indexes on the columns we use in JOINs and in WHERE, ORDER BY and GROUP BY clauses.

Query 3 joins the people_pictures table with people and pictures (a many-to-many relation). It seems reasonable to create indexes on the people_id and picture_id columns of the people_pictures table. The result set is grouped by pictures.filename and ordered by pictures.date, so we can create two more indexes on these columns. This modification decreases the execution time to 196 s.
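The four indexes described above could be created like this (the index names are my own choice):

```sql
-- Join columns of the many-to-many table:
CREATE INDEX idx_pp_people  ON people_pictures (people_id);
CREATE INDEX idx_pp_picture ON people_pictures (picture_id);

-- Columns used by GROUP BY and ORDER BY in Query 3:
CREATE INDEX idx_pic_filename ON pictures (filename);
CREATE INDEX idx_pic_date     ON pictures (date);
```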

In some cases, indexes can actually increase the execution time. Let's have a look at Query 1. It uses the pictures.type_id column to join pictures with their types. However, if we create an index on pictures.type_id, we notice that the execution time of this query increases to 13 s.

Using indexes to join dictionary-like tables produces overhead: the index may take more space than the actual table data. In such cases it is faster to read the table sequentially, because this minimizes disk seeks (especially when the dictionary table is short).

For Query 1 we could just drop the index on pictures.type_id. But what happens when the same index is used by several queries and it increases the execution time only for some of them? Fortunately there is an easy solution: we can modify the query by adding an IGNORE INDEX hint.
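For example, Query 1 could keep the index available to other queries but skip it itself (the index name idx_pic_type is an assumption; use the name reported by SHOW INDEX FROM pictures):

```sql
-- The hint attaches to the table reference whose index should be ignored.
select types.name, count(pictures.id)
from types, pictures ignore index (idx_pic_type)
where pictures.type_id = types.id
group by types.name;
```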

The final results of this optimization phase are: Query 1: 2 s, Query 2: 20 s and Query 3: 196 s – still too much.

This is as far as general rules of thumb can take us. The better way is to look closely at each particular query.


Explain and other techniques.

One of the most powerful tools for optimization is the EXPLAIN statement. It helps to analyze a particular query by showing its execution plan.

Query 1

Let’s have a look at the EXPLAIN output for this query.

id | select_type | table    | type | possible_keys | key  | key_len | ref  | rows    | Extra
1  | SIMPLE      | pictures | ALL  | NULL          | NULL | NULL    | NULL | 1000000 | Using temporary; Using filesort
1  | SIMPLE      | types    | ALL  | PRIMARY       | NULL | NULL    | NULL | 4       | Using where; Using join buffer

From the query execution plan we can see that an unnecessary filesort is used. This is caused by the GROUP BY clause: if you use GROUP BY, output rows are sorted according to the GROUP BY columns as if you had an ORDER BY for the same columns. To avoid the overhead of this sorting, add ORDER BY NULL.
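Applied to Query 1, that looks like this:

```sql
-- ORDER BY NULL tells MySQL we do not need the implicit sort done by GROUP BY.
select types.name, count(pictures.id)
from types, pictures
where pictures.type_id = types.id
group by types.name
order by null;
```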

Most of the execution time is spent processing the pictures table, while the table that is actually interesting for this query is types.

To overcome this we can add an extra column to the types table – types._count. Every time a picture is added, deleted or updated, this column will be updated as well for the proper picture type. This way it always stores the current number of pictures of every type.
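Adding the column and initializing it from the existing data might look like this (a sketch; the column type is my choice):

```sql
ALTER TABLE types ADD COLUMN _count INT UNSIGNED NOT NULL DEFAULT 0;

-- Seed the counters from the pictures that already exist:
UPDATE types
SET _count = (SELECT COUNT(*) FROM pictures WHERE pictures.type_id = types.id);
```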

To keep the data coherent we can use procedures and triggers; they work inside the database and do not depend on any external interface.

We need two procedures – one to increase the counter by 1 and another to decrease it by 1.

DELIMITER $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `inc_pictures_type_counter`(IN type_id INT)
BEGIN
  UPDATE types SET _count = _count + 1 WHERE id = type_id;
END$$

CREATE DEFINER=`root`@`localhost` PROCEDURE `dec_pictures_type_counter`(IN type_id INT)
BEGIN
  UPDATE types SET _count = _count - 1 WHERE id = type_id;
END$$
DELIMITER ;

Now the triggers. When we add a new picture of a certain type, the database has to call the procedure that increases the counter for that type – inc_pictures_type_counter.

DROP TRIGGER IF EXISTS lost.on_insert_picture_type;
DELIMITER $$
CREATE TRIGGER lost.on_insert_picture_type AFTER INSERT ON lost.pictures
FOR EACH ROW BEGIN
  CALL inc_pictures_type_counter(NEW.type_id);
END$$
DELIMITER ;

When we delete a picture, the database has to call the procedure that decreases the counter – dec_pictures_type_counter.

DROP TRIGGER IF EXISTS lost.on_delete_picture_type;
DELIMITER $$
CREATE TRIGGER lost.on_delete_picture_type AFTER DELETE ON lost.pictures
FOR EACH ROW BEGIN
  CALL dec_pictures_type_counter(OLD.type_id);
END$$
DELIMITER ;

When the type of a picture is updated, the database has to call dec_pictures_type_counter for the old type and then inc_pictures_type_counter for the new one.

DROP TRIGGER IF EXISTS lost.on_update_picture_type;
DELIMITER $$
CREATE TRIGGER lost.on_update_picture_type AFTER UPDATE ON lost.pictures
FOR EACH ROW BEGIN
  CALL dec_pictures_type_counter(OLD.type_id);
  CALL inc_pictures_type_counter(NEW.type_id);
END$$
DELIMITER ;

The new version of Query 1 uses only one table:

select name, _count from types;

and takes less than 0.0001 s.

Query 2

The execution plan for this query:

id | select_type | table           | type | possible_keys | key       | key_len | ref            | rows   | Extra
1  | SIMPLE      | people          | ALL  | PRIMARY       | NULL      | NULL    | NULL           | 12     | Using temporary; Using filesort
1  | SIMPLE      | people_pictures | ref  | people_id     | people_id | 1       | lost.people.id | 250000 |

Here we can see that the database first reads all people.id values and then searches the people_pictures table for rows where people_pictures.people_id equals the current people.id. This means the huge people_pictures table is searched 12 times, and each time approximately 250 000 rows are processed.

Of course, we sense that it would be faster to process the huge table once and join it with the people table only when necessary. Let's try to create a new version of Query 2 that works this way:

select name, count(people_id) from people_pictures left join people on (people_pictures.people_id=people.id) group by people_id; 

This query takes 1.5 s, and the query execution plan is now as follows:

id | select_type | table           | type   | possible_keys | key       | key_len | ref                            | rows    | Extra
1  | SIMPLE      | people_pictures | index  | NULL          | people_id | 1       | NULL                           | 3000000 | Using index
1  | SIMPLE      | people          | eq_ref | PRIMARY       | PRIMARY   | 1       | lost.people_pictures.people_id | 1       |

The people_pictures table is now processed only once. Still, the execution time is a bit too long. For this query we can use SQL_CACHE: the result set will be cached and recomputed only when one of the involved tables changes. (If the caching mode is set to query_cache_type = 2, you can use the SQL_CACHE hint to tell MySQL which queries to cache.) With the SQL_CACHE hint the execution time drops to less than 0.0001 s.
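The hint simply goes right after SELECT. A sketch, assuming the server is configured with query_cache_type = 2 (on-demand caching) and a non-zero query_cache_size:

```sql
-- Cached until pictures, people_pictures or people is modified.
select sql_cache name, count(people_id)
from people_pictures
left join people on (people_pictures.people_id = people.id)
group by people_id;
```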

Query 3

Query 3 is the most complex and takes much more time than the others. Let's see its execution plan:

id | select_type | table           | type   | possible_keys        | key       | key_len | ref                             | rows   | Extra
1  | SIMPLE      | people          | ALL    | PRIMARY              | NULL      | NULL    | NULL                            | 12     | Using temporary; Using filesort
1  | SIMPLE      | people_pictures | ref    | picture_id,people_id | people_id | 1       | lost.people.id                  | 250000 |
1  | SIMPLE      | pictures        | eq_ref | PRIMARY              | PRIMARY   | 3       | lost.people_pictures.picture_id | 1      |

The first table used is people. The optimizer reads the people_pictures table for every people.id – 12 times. Only then is people_pictures joined with pictures.

Since we want to limit the result set to the pictures that fulfill a certain condition, it seems more reasonable to start with the pictures table. Then we can join only the remaining rows with the other two tables. The new version of Query 3 may look like this:

select filename, group_concat(people_id) from pictures left join people_pictures on (people_pictures.picture_id=pictures.id) left join people on (people_pictures.people_id=people.id) group by filename order by date desc limit 20;

This query takes 42.5 s.

Not bad, but also not good enough. The execution plan:

id | select_type | table           | type   | possible_keys | key        | key_len | ref                            | rows | Extra
1  | SIMPLE      | pictures        | index  | NULL          | date       | 8       | NULL                           | 7    | Using temporary
1  | SIMPLE      | people_pictures | ref    | picture_id    | picture_id | 3       | lost.pictures.id               | 3    |
1  | SIMPLE      | people          | eq_ref | PRIMARY       | PRIMARY    | 1       | lost.people_pictures.people_id | 1    | Using index

If we think about it some more, we may notice that it could be even better to select only the ids of the 20 newest pictures and join people_pictures and people with just those. This way we could replace one of the joins with a subquery. It would be something like this:

select picture_id, group_concat(name) from people_pictures join people on (people_pictures.people_id=people.id) where picture_id in (select id from pictures order by date desc limit 20) group by picture_id;

"Would be" – because MySQL does not allow LIMIT in this kind of subquery.

What we can do is split this into two queries and combine them in, e.g., PHP, like this:

$db = new PDO('mysql:host=localhost;dbname=lost-basic', 'root', '',
    array(PDO::ATTR_PERSISTENT => false, PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION));

// First query: the ids of the 20 newest pictures.
$query = 'select id from pictures order by date desc limit 20';
$res = $db->query($query);

$pictures_array = array();
foreach ($res as $row) {
	$pictures_array[] = $row['id'];
}

// The ids come straight from our own database, so they can be inlined safely.
$pictures = implode(', ', $pictures_array);

// Second query: the characters tagged in exactly those pictures.
$query = "select picture_id, group_concat(name) from people_pictures join people on (people_pictures.people_id=people.id) where picture_id in ($pictures) group by picture_id";
$res = $db->query($query);
$res->setFetchMode(PDO::FETCH_NUM);
$rows = $res->fetchAll();

$db = null;

This works really well and takes only 0.003 s.

Summary

As we can see, there are many factors that influence the performance of the database, but also many ways to improve the execution time. We managed to decrease the time of all three queries to less than 0.01 s.

The ideas above are just a few among many. It is very important to watch the slow query log. Slow queries can then be analyzed with the EXPLAIN statement, and the best solution for each particular query can be applied.


Wednesday, September 15, 2010

MySQL part 2: a bit more advanced JOINs.

Last time I explained the very basics of JOINs. Now I will go a bit further and show how to build more advanced queries. (I will continue with the same database.)

First, let's recall the order of clauses in a SELECT query:

SELECT select_expr
FROM table_references
[WHERE where_condition]
[GROUP BY {col_name | expr | position} [ASC | DESC], ... [WITH ROLLUP]]
[HAVING where_condition]
[ORDER BY {col_name | expr | position} [ASC | DESC], ...]
[LIMIT {[offset,] row_count | row_count OFFSET offset}]

WHERE specifies the condition or conditions that rows must satisfy to be selected. GROUP BY groups the result set by one or more columns and is needed whenever aggregate functions should be computed per group. HAVING works only together with GROUP BY and can filter on aggregate functions (which the WHERE clause cannot do). ORDER BY sorts the result set by the desired column or columns.

Many-To-Many

Let's choose all pictures with a list of the people tagged in each of them. This is a many-to-many relation, which requires two joins.

select filename, name from people_pictures join pictures on (people_pictures.picture_id = pictures.id) join people on (people_pictures.people_id=people.id);

picture1.jpg Jack
picture1.jpg Ben
picture1.jpg Claire
picture2.jpg John
picture2.jpg Jack
picture2.jpg Ben
picture3.jpg Jin
picture3.jpg Jack
picture5.jpg John
picture5.jpg Sun

The result set has many rows with the same filename, which is sometimes a disadvantage. There is an easy way to show every picture with a comma-separated list of the tagged characters: the group_concat function together with a GROUP BY clause:

select filename, group_concat(name) from people_pictures join pictures on (people_pictures.picture_id = pictures.id) join people on (people_pictures.people_id=people.id) group by filename;

picture1.jpg Jack,Claire,Ben
picture2.jpg Ben,Jack,John
picture3.jpg Jack,Jin
picture5.jpg Sun,John

If we need to exclude some results based on a condition over aggregated values, we can use the HAVING clause. Let's choose only the pictures with at least three people tagged in them.

select filename from people_pictures join pictures on (people_pictures.picture_id = pictures.id) join people on (people_pictures.people_id=people.id) group by filename having count(name)>2;

picture1.jpg
picture2.jpg

We can also find all pictures with Jack.

select filename from people_pictures join people on (people_pictures.people_id = people.id) join pictures on (people_pictures.picture_id=pictures.id) where name = 'Jack';

picture1.jpg
picture2.jpg
picture3.jpg

Finally, let's find the most-tagged character.

select people.name from people_pictures join people on (people_pictures.people_id = people.id) group by people_id order by count(people_id) desc limit 1;

Jack

All the above examples work just fine on a small database. However, the loading time grows with the number of records. Next time I will show how to optimize such queries to keep the loading time reasonable.

Tuesday, September 14, 2010

MySQL part 1: JOINs in a nutshell.

Recently I decided to dig deeper into the topic of MySQL optimization. But to start working on this issue, I first needed to organize my knowledge in this field. That is how this short "memo" was created. Maybe it will be useful for somebody…

So, here we go.

First of all, we need a sample database. I love the Lost series, so I thought up a database for storing pictures of the main characters.

The information we want to store in this database is as follows:

  • Every picture has a file name and a date when it was taken.
  • Every picture may be associated with a type and with a place where it was taken.
  • Additionally, every character can be tagged in several pictures, and every picture can be tagged with several characters (many-to-many).

[Diagram: the Lost database schema (lost-basic)]

Here you can find the SQL script for this DB.

To warm up let’s start with something simple…

We want to choose the pictures and the types related to them (only pictures that have a type). We can do it in two ways:

using WHERE:
select filename, name from pictures, types where pictures.type_id=types.id;
or using JOIN:
select filename, name from pictures join types on (pictures.type_id = types.id);
In both cases the result set will be:
picture1.jpg funny
picture5.jpg funny
picture4.jpg landscape
picture2.jpg misc

In this case WHERE and JOIN work equally well; both forms behave like an inner join, returning only the rows that match in both tables.

LEFT JOIN

Imagine a situation where you need all the pictures, whether or not they have a type. In such a case you can use LEFT JOIN (which means that the query returns all rows from the left table, even if there are no matches in the right table).

select filename, name from pictures left join types on (pictures.type_id = types.id);
picture1.jpg funny
picture2.jpg misc
picture3.jpg
picture4.jpg landscape
picture5.jpg funny

RIGHT JOIN

We can also do the opposite: choose all types and join the pictures that belong to them:

select filename, name from pictures right join types on (pictures.type_id = types.id);

This time we used RIGHT JOIN (the query returns all rows from the right table, even if there are no matches in the left table).

picture1.jpg funny
picture5.jpg funny
picture4.jpg landscape
picture2.jpg misc
portrait

In most cases we can use either a LEFT or a RIGHT JOIN to get the same results (although using LEFT JOIN is generally recommended).

select filename, name from types left join pictures on (pictures.type_id = types.id);
picture1.jpg funny
picture5.jpg funny
picture4.jpg landscape
picture2.jpg misc
portrait

NATURAL JOIN

Another type of join is NATURAL JOIN. We can use it when we want to join two (or more) tables on columns with the same names; NATURAL JOIN needs no explicit join condition. (Be careful: if you habitually name every primary key column "id", MySQL may use that column to build the join.)

select filename, name from pictures natural join places;
picture1.jpg Los Angeles
picture2.jpg Sydney
picture4.jpg Sydney
picture3.jpg The Island

In this case both tables have a column named "place_id", and this column is used to create the relationship.
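If you want the same behavior without the risk of accidentally matching on an unintended column, JOIN ... USING lets you name the shared column explicitly:

```sql
-- Equivalent to the NATURAL JOIN above, but the join column is stated explicitly.
select filename, name
from pictures join places using (place_id);
```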

Finally, it is worth mentioning INNER JOIN and CROSS JOIN. In MySQL they are equivalent to each other and produce exactly the same result as a plain JOIN.
