1. Using mysql_* functions directly
He writes:
One common problem is older PHP code using the mysql_ functions to access the database directly. Listing 1 shows how to access the database directly.
In my experience that depends a lot on the definition of the problem: There is nothing wrong with the mysql_* functions in PHP per se, nor with using them. They are old, well tested and very fast (They also do not allow you to use some of the newer features of MySQL 5, so you may be forced to use mysqli_* instead).
If the emphasis of the problem is on speed, it is very worthwhile using mysql_* directly, and in some controlled instances switching to persistent connections. For example, the MySQL Benchmark Team was able to speed up the DELL DVD Shop Benchmark used in the
database comparison (
english version) by the german c't magazine by 46% by enabling mysql_pconnect() - the mysqli extension currently does not offer pconnect.
Database abstractions have been even slower in the context of this particular benchmark: Solutions to the problem have been submitted talking to MySQL using ADODB, ezSQL, PDO and PEAR::DB. One particular submitter, a computer science class of Ulm technical university, had a set of submissions that were able to use any of these abstraction layers. While their mysql_* direct solution did 444 operations per minute (opm), the fastest abstraction layer, ADODB, came in at 219 opm (For reference, the MySQL Benchmark Team solution came in at 3664 opm at 20% idle with a standard benchmark setup, and with more than 6000 opm after the benchmark load generator array has been upgraded to be able to saturate the test platform. Blatant plug: We are so cool that you can even buy us at
the MySQL Consulting page).
Similar lessons can be learned from Rasmus Lerdorf's talks
Getting Rich with PHP: PHP likes to be lightweight. Rasmus talk also shows neatly how you can be lightweight without losing internal structure and maintainability.
Database abstractions make some things easier - talking to databases of different vendors, for example. But, they come at a price, and they do not solve the problem of efficient and optimized data access at all. They are mostly database
access abstractions, not database
model abstractions or database
query generation abstrations.
If you have to have an application that needs to play nicely with different databases, a database access abstraction is not going to help you a lot - you still need different schema definitions for different vendors databases, if you want them to be optimized as well. Also, you most likely need different SQL for these different databases. So you're maybe better off if you just encapsulate your data access from an application point of view yourself and use the native functions, and native, optimized SQL code in the first place.
2. Not using auto_increment functionality
Jack writes:
Like most modern databases, MySQL has the ability to create auto-incrementing unique identifiers on a per-record basis. Despite that, we still see code that first runs a SELECT statement to find the maximum id, then adds one to that id, as well as a new record.
Here I concur with Jack: The
select max(id) stunt shown by him is an incredibly bad idea that is sometimes still seen in code. Hug auto_increment and use it, it will do your code good.
Sometimes you cannot use auto_increment, though, because for example the application logic calls for an insert id before the actual insert, or because you need a range of ids apiece instead of getting them one by one. At other times you may need counter objects that are independent of a single table.
In these cases it may be useful to have a set of stored procedures and functions in an admin database that can help out.
First, we create the admin database, if we do not already have one:
CODE:
create database if not exists admin charset latin1;
Then we create a table admin._seq, which we are going to use for our counters. Because we want to be able to increment these counters as independently of each other as possible, we are using the InnoDB table type for this. This will enable us to use row locking and we will see less contention for our counters.
CODE:
drop table if exists admin._seq;
CREATE TABLE admin._seq (
name varchar(80) NOT NULL,
value bigint unsigned NOT NULL default '0',
PRIMARY KEY (name)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
We need a stored procedure to create and increment counters in this table, because we want to use transactions with this and transactions are not available in stored functions.
Our stored procedure requires a name for the counter, an increment and delivers the id into an OUT parameter:
CODE:
drop procedure admin.nextval2;
set sql_mode = strict_all_tables;
delimiter //
create procedure admin.nextval2(in cntname varchar(80),
in cntstep bigint unsigned,
out cntvalue bigint unsigned)
sql security invoker
begin
declare _exception varchar(2);
declare continue handler for not found
set cntvalue = cntstep-1;
if cntstep <= 0
then
select "parameter 2 (cntstep) must be > 0" as error;
set _exception = 'xxxx'; -- ugly hack to raise an error
end if;
select value + cntstep into cntvalue
from admin._seq
where name = cntname for update;
insert into admin._seq ( name, value )
values ( cntname, cntvalue )
on duplicate key update value = cntvalue;
commit;
end;
//
delimiter ;
There are a few noteworthy things visible here. One is the use of SQL SECURITY INVOKER. By default, anything is SQL SECURITY DEFINER, which is the equivalent of making everything executeable in a system Set User ID (mostly SUID root, because stored procedures are often written by users with admin permissions).
Then there is the use of the _exception variable. This is a dummy variable for two characters which we are using to assign four characters to. This will raise an exception in MySQL - because there is no SIGNAL_ERROR() as of now, we need to use this ugly hack to create an exception.
To make this work, we need to define the proper sql_mode for the procedure. It is good practice to set sql_mode in front of every procedure and function definition anyway, because the current sql_mode at definition time becomes part of the prodecure definition and is being stored with the actual defintion.
Finally, we need to deal with the initial counter value, so we need to define a NOT FOUND handler to hide the SQL warning and we are using a SELECT FOR UPDATE and INSERT ON DUPLICATE KEY UPDATE pair to deal with counter creation.
The most common case is an increment of 1, so we define for convenience the following short procedure:
CODE:
drop procedure admin.nextval;
set sql_mode = strict_all_tables;
delimiter //
create procedure admin.nextval(in cntname varchar(80),
out cntvalue bigint unsigned)
sql security invoker
begin
call admin.nextval2(cntname, 1, cntvalue);
end;
//
delimiter ;
For completeness sake, we will define a function called curval():
CODE:
drop function admin.curval;
set sql_mode = strict_all_tables;
delimiter //
create function admin.curval(cntname varchar(80)) returns bigint unsigned
sql security invoker
reads sql data
begin
declare cntvalue bigint unsigned default 0;
declare continue handler for not found set cntvalue = 0;
select value into cntvalue from admin._seq where name = cntname;
return cntvalue;
end;
//
delimiter ;
We are now able to do kust what we wanted in the first place:
CODE:
root@localhost [(none)]> select admin.curval("cookie");
+------------------------+
| admin.curval("cookie") |
+------------------------+
| 0 |
+------------------------+
1 row in set (0.00 sec)
root@localhost [(none)]> call admin.nextval2("cookie", 10, @count);
Query OK, 0 rows affected (0.05 sec)
root@localhost [(none)]> select @count;
+--------+
| @count |
+--------+
| 9 |
+--------+
1 row in set (0.00 sec)
root@localhost [(none)]> call admin.nextval("cookie", @count);
Query OK, 0 rows affected (0.00 sec)
root@localhost [(none)]> select @count;
+--------+
| @count |
+--------+
| 10 |
+--------+
1 row in set (0.00 sec)
root@localhost [(none)]> select admin.curval("cookie");
+------------------------+
| admin.curval("cookie") |
+------------------------+
| 10 |
+------------------------+
1 row in set (0.00 sec)
3. Using multiple databases
Here Jack comes up with a case I have yet to see:
Once in a while, we see an application in which each table is in a separate database.
But to MySQL this makes no difference: You can have tables in different schema definitions on the same database server, or in a single schema with multiple tables, and the tables may be stored using different storage engines. To MySQL it is all the same - as long as you have the appropriate access permission, all is well.
What I have seen in the field, though, are cases where customers are creating a lot of tables, "a lot" being something like 40.000 or even 400.000 tables, either in one or multiple schema definitions. To MySQL, this is 40.000 (or 400.000) .frm-Files in one or multiple directories, one for each schema name. How well this performs is completely dependent on your file system. Old-style filesystems such as ext2 and ext3 did not handle large directories particularly well for a long time, and even now can have problems with such a setup.
Independently of any file system, MySQL will need to cache the .frm-File definitions for all these tables, and will need to keep open file handles for them. Both things should be rather unrelated, but are not: They are done together in MySQL with the table cache (this is going to change in 5.1). So for such a scenario, you would need to come up with a monstrous table_cache setting of 40.000 (or 400.000) in your my.cnf. And you'll most likely have to ask your operating system to enable you to open more file handles concurrently, raising for example the open_files_limit to some equally large value in the [safe_mysqld] section of your my.cnf.
But if you need that many tables, you're most likely doing something else wrong. Which leads us nicely to point 4.
4. Not using relations
What does Jack mean when we says "Not using relations"? He explains:
Relational databases aren't like programming languages. They don't have array types. Instead, they use relations among tables to create a one-to-many structure between objects, which has the same effect as an array. One problem I've seen with applications is when engineers attempt to use a database as though it were a programming language, creating arrays by using text strings with comma-separated identifiers.
What he is talking about without naming it is something called Normalisation, and here specifically the first normal form.
I think it is important to introduce the names everybody else uses for things in beginners texts, because they enable beginners to search for these words and find additional material on the subject. Like small children beginners tend to ask the same question over and over in different contexts - this is not because they are stupid or do not remember the first time they ask, but to get assurance and reinforcement as well as additional views and context dependent interpreation on the same subject. So stick to the standard vocabulary when explaining things to enable reinforcement through googling.
5. The n+1 pattern
Jack explains the n+1 pattern, which I usually call "Select in a Loop":
I can't tell you how many times we've seen large applications in which the code first retrieves a list of entities -- say, customers -- then comes back and retrieves them one by one to get the details for each entity. We call it the n+1 pattern because that's how many queries will be performed -- one query to retrieve the list of all the entities, then one query for each of the n entities. This isn't a problem when n=10, but what about when n=100 or n=1000? Then the inefficiency really kicks in.
Unfortunately, Jack then manages to explain the
JOIN without using the word join or pointing to the manual. See my take on reinforcement above.
My Book Tip
Want to learn more about MySQL? Well, we do have
a site, a
Manual and
these forum thingies.
Want to read a book on SQL in general? I firmly believe in reading first hand sources. For SQL that means giving all your money to Joe Celko. He wrote the standard and he
wrote a book on it. Luckily, Joe happens to be a good writer, so this is even fun to read. Also,
Joe has style, and so should you.
Tracked: Aug 11, 14:50