Database optimalisation for developers
Archive
| Increasing websi... | Mar 13, 2011 |
| Increasing websi... | Sep 14, 2010 |
| Increasing websi... | Aug 28, 2010 |
| Benchmarking fil... | Jul 16, 2010 |
| HTML5: Multilaye... | Jun 29, 2010 |
| JavaScript garba... | Jun 27, 2010 |
| Websocket server... | Jun 11, 2010 |
| Database optimal... | Apr 23, 2010 |
| Transfering data... | Apr 4, 2010 |
| Finding the high... | Jan 8, 2010 |
| Selecting on pre... | Nov 23, 2009 |
| Forcing foreach ... | Oct 11, 2009 |
| Creating an anym... | Oct 7, 2009 |
| Bitmask access l... | Aug 5, 2009 |
| Recursive templa... | Jul 27, 2009 |
Comments
| I've taken a look .. | Mar 30, 2012 |
| Another interestin.. | Feb 17, 2012 |
| I am also using Ap.. | Jul 21, 2011 |
| Yeah extension_dir.. | Jul 21, 2011 |
| You're sure the li.. | Jul 20, 2011 |
| I've downloaded se.. | Jul 20, 2011 |
| As someone recentl.. | Mar 29, 2011 |
| Thats actually qui.. | Jul 19, 2010 |
| Peculiar, it works.. | Jul 13, 2010 |
| Doesn't work with .. | Jul 12, 2010 |
| Don't tell anyone,.. | Jul 5, 2010 |
| Yeah. "Paint" does.. | Jul 5, 2010 |
| Woops, fixed that,.. | May 19, 2010 |
| It's catEgory, not.. | May 18, 2010 |
| Thanks, found it, .. | May 14, 2010 |
While it is of course great to have so much functionality available, an unfortunate side effect is that it becomes less clear how to use all these abilities in the most optimal way. For example, a query can be written in such a way that it will either take a fraction of a second to execute, or several minutes - yet yield the exact same result in both cases.
I will describe some easy tips and tricks to lower the stress on your database system - most of these are written with MySQL in mind, but some general tips apply to virtually any RDBMS currently available. As written from a developers point of view, these do not so much elaborate on administrating a database (which is what the allmighty DBA's are for) as using it efficiently.
1. Thou Shalt Not Use Too Many Queries
Say you have a table containing grouped data and you wish to know some general property about this data. For example, you have a table of news reports with a limited set of categories and you wish to the number of reports per category. One of way doing this is to obtain a list of all categories and then use a for loop to obtain the count for each category - something like this:
Code (php) (nieuw venster):
1 SELECT COUNT(*) FROM `news` WHERE `category` = [index]
With a dozen categories, this means a dozen queries in rapid succession - a sure way to stress your database. Using GROUP BY however you can retrieve the same data with a single query!
In most languages, firing off an SQL query is relatively cheap. You generally need not worry about adding an extra query, especially not if it is a simple one. But SQL queries within a for(each) loop are almost never necesary and a dozen excess queries will bring down performance. Avoid if you can, but remember 2:.
2. Thou Shalt Not Make Thy Queries Overly Complex
The other end of the spectrum of using too many queries is using too few - or to be more exact, make your queries do too much. This does not often apply and in fact means you are already using a fairly large subset of what your DBMS is capable of, but it can happen.
Most notorious are subqueries, or in-query calculations. Using structures like "WHERE POW(`a`, `b`) > `c`" for example means your DBMS has to calculate POW(a,b) for every row in the table to ensure it does not exceed c. This is a fairly cheap calculation, but imagine performing it a million times for each query! Sure, some caching mechanisms apply, but your results will come back a lot faster if you add a column with the data already calculated - then your DBMS can instead use an index to find the rows you need, which is much, much faster. This leads us to:
3. Thou Shalt Remember Thyne Indici
Need to search for text? Use a fulltext index. Indexes allow your database to quickly jump to the data it needs instead of having to examine every row in the table. Especially when your tables grow larger the benefit of indici is undeniable. Of course, there is a tradeoff - in memory or disk space - but that is almost always a very small price to pay.
Note that even an index can not save you when you attempt to do something like "WHERE `description` LIKE '%query%'". When searching through text, always try to use MATCH() AGAINST - it exists for a very good reason.
4. Thou Shalt Ask Only What Thy Need
Yes, we all know the temptation: use an asterisk to select every column in your dataset, instead of going through the tedious task of specifying each and every column you need. Heck, what if you need another column later - then you have to go back and add it. So why not do a "SELECT *"?
First of all, because it can lead to ambiguity problems. Most of your tables columns will have unique names, but some might not - for example, what if your primary key column is always called "id"? Some people stick to conventions like "tablename_id", but that does not prevent duplicate names between tables entirely either. If you do not specify your column names and later on add a JOIN to your query, the ID you have been selecting on could suddenly change to that of a completely different table - and suddenly your results will be a heap of garbage and you spend the next day finding out how it could have happened (or less if you used your source control well).
By specifying exactly what you need, you need not worry about duplicate names when adding a JOIN to your query: your database will do it for you! MySQL will throw a nice ambiguity error as soon as a name you specified is not unique - which tells you directly what to change!
The other good reason to always specify only what you need is a simple matter of resources: if you do a select on a table with a dozen fields, and maybe join it on a few other, similarly sized tables, your resultset can easily range over fifty, sixty columns - and chances are good you need far less than that. Retrieving five times as much data won't take five times as long, but it will increase memory consumption considerably - not just in your application, but in your database servers query cache as well. For these reasons, you should never do a global select unless you are really sure you need every column in the table and you will never expand your query.
5. Measure And Thou Shalt See
You did your homework, your queries are ticking along nicely, your indexes are in place (and have a high hitrate), so why is your application still sluggish?
The best way to find out, is to simply ask your database server! MySQL allows you to log slow queries (where you can determine yourself what a 'slow' query is, in terms of execution time), so with a bit of logging and analyzing you can see exactly what queries are causing the trouble. Alternatively, with the 'SHOW FULL PROCESSLIST' command you can see a slow executing query as it runs - though most queries will take less than a second to run and are therefore hard to catch using this method. Still, for anything taking more than a few seconds it is a quick way to determine the bottleneck.
Having found your slow query, the next step is of course determining why it is so slow:
6. Ask For An Explanation And Thou Shalt Receive
The EXPLAIN command yields detailed information on your query - in particular, when "key" is empty or "Extra" notes it is using filesort you should be wary - these are signs of a bad query.
If no key is used, you should check the "possible_keys" field - is the index you inteded to be used listed? If it is, why is it not being used? If not, how come your index is not applicable? These questions are complex, as are their answers - too much so in fact for me to answer here. Note though that MySQL offers syntax to help using keys, use them wisely.
When the "Extra" field notes that your query is using filesort, something different is happening: you have most likely made your query too complex, or are attempting to order / limit your resultset in a way that is not inherent in the table. Adding a column with cached results of calculations can sometimes help - for example, if you have a table with coordinates and want to know the distance from a central location to these coordinates, it is faster to calculate the distances once and store them in an extra column than to calculate them again and again for each query - of course at the expanse of diskspace. Sometimes however, when for example searching through text without match/against this will not help you. In these cases, search the net for common solutions - chances are you are not the first to encounter a particular problem.
7. Connectionpools Are Thyne Friends
A common misconception seems to arise every now and again that you should use mysql_close() at the end of each PHP script. While it is (very rarely) not a bad idea to use mysql_free_result to garbace-collect large amounts of data when you are executing a particularly long script, the same cannot be said for mysql_close. Opening a database connection takes time and effort. So does closing it. So why would you want to do that for no other purpose than to free a few kilobytes of memory in-between pageloads? In fact, if performance is an issue you will most likely have concurrent pageloads, meaning several concurrent database connections at all times anyway. I can really think of no reason why you would need to close the connection manually halfway through a script.
MySQL uses 150 as default max_connections limit these days - in other words, it expects to handle 150 connections simultaneously. But this is by no means an upper limit - any decent database-server is easily capable of handling several thousand connections with no ill effects. The few dozen opened by your script are by no means a problem. For this reason Apache will already keep a connection pool ready if configured to do this: whenever a script needs a connection it can use an existing one, speeding things up nicely. You can enforce this behavior with mysql_pconnect, but in my experience it is often better to trust your webserver and database-administrator: they will already have arranged for your connections to keep alive between pageviews.
In conclusion, every performance issue requires its own approach: measure what the bottleneck is, determine which part of the query causes this and adjust either your query or your database to accommodate. This is sometimes easy, but more often quite difficult - much has been written about it already, so do not forget to search for existing answers. Do not rule out external factors either: if your database has to swap on disk because your server runs out of memory you can optimize for eternity and still not achieve significantly better performance. This is especially true for most shared hosting accounts. Good fast hardware (in particular, hard disks with low access times, SSD's for preference), plenty of memory and a well configured database can sometimes do far more than a change in SQL can - not seldom at far lesser costs.
FragFrog out!