April 26Monitoring MySQL
These are the slides for my talk "Monitoring MySQL" at the MySQL Users Conference and Expo 2007 from yesterday.
I tried to reproduce my talk as a written text, but what I attach below is somewhat more and at the same time less than what I was actually saying yesterday. Goals and deliverables of monitoring Different kinds of monitoring do exist and should not be treated equally, because they have different purposes and goals. Because of this, they are of vastly different complexity, collect different measurements and have different HA requirements. System administration and operating usually want to have functional checks for incident detection. They need to answer the old question "Are we still online?". The measurements taken are usually dead simple and can be taken directly out of the SLA: What is being measured is some high level availability such as "Does the database answer a trivial query within the specified response time taken from the SLA?" A late answer is the same as the database being offline: We are in violation of the SLA and therefore we are technically offline - that is an incident and needs to be handled. The deliverable of this kind of monitoring is a ticket to the help desk, operating or incident management directly. The HA requirements are high, in the range of minutes of maximum allowed downtime. Monitoring for compliance is a similar, but different thing. It is the data warehouse portion of functional checks: The measurements taken are the same, but they are recorded to document SLA fulfillment after the fact. The deliverable is some report to IT management or a customer that is being generated weekly or monthly showing if we have been in compliance with the SLA or not. Consequently the HA requirements are pretty low: Nobody does really care if the SLA DWH is down for three days in a crisis, there are more pressing problems in such a situation. Unfortunately most monitoring software right now treats monitoring for functionality and compliance the same, because the data looks superficially identical. But functionality monitoring is OLTP: it is machinery to take test results and turn them into tickets about incidents. Once the ticket is closed, the data should be pushed out of this part of the monitoring system into the general direction of the DWH to keep the primary system small. In a crisis operating is blind without the primary monitoring system - in an ideal world that primary monitoring would fit onto a flash memory stick and can be reinstalled almost anywhere from that stick. Related to compliance monitoring is monitoring for capacity planning: This too is a data warehouse application, but it keeps moch more detailed measurements of all possible aspects and subsystem. Like compliance monitoring the deliverable is a weekly or monthly report to IT management or the customer, but unlike compliance checking the report for capacity planning is focusing on the future trying to predict future requirements in advance. The HA requirements match those of the compliance DWH. Developers have monitoring needs as well, but they are completely different: Usually they are vertical, concentrating on the aspects of a single query or a group of queries. The measurements taken have to be very detailed, outlining the query plan, showing wait times or time spent per function or code section. The deliverable is a report to a single developer, and the HA requirements should ideally be nonexistent, because this type of monitoring is hopefully running on some development machine outside of production. Finally, auditors also have monitoring requirements as well. For them it is important that the monitoring is inescapeable and that the recorded data cannot be changed after the fact. Auditors want to create accountability, so they are interested which natural person is responsible for the action taken. Ideally, it records high level events from an applications or business transactions point of view and associates them in some way with data changes. Often, proper auditing cannot be done in the database itself or alone, because of accountability requirements: While the application often has knowledge of the natural person operating it, the database usually has not - there is a single database user that is being used by the application, while the application has a high level user concept: Consequently proper auditing is often done above the database level. The deliverable again is a report to management, so the HA requirements are pretty low. Additionally, operating usually requires auditing must not affect regular operations. Metering what? OS level data sources At the operating system level we can check functionality of the database from the "inside" (while being logged in to the database server) and from the "outside" (connecting to the server from somewhere else and running queries). The most primitive internal checks is checking for the presence of a PID file. But a "test -f linux.pid" is not good enough: The PID file might be dead and a leftover from a crashed server. A better test is "kill -0 $(cat linux.pid)" as the database user, where sending a signal 0 to a process is a nonintrusive way to check for the presence of a process. The results of this test may be:
External tests are often a ping check to make sure the machine is on the network and a trivial query with the timeout set as specified in the SLA. The trivial query often is a "SHOW FULL PROCESSLIST" because that particular query does not place undue load onto the server and returns other useful information to the monitoring program such as number of processes, facilities to check for long running queries and the possibility to detect an excessive number of queries in "Locked" state. At the operating system level there are additional data sources available which can be checked. One of these things is the process size of mysqld in memory. Tools such as "ps" and "top" will report two numbers for this: VIRT, the virtual process size, and RSS, the resident set size. Simplifying a bit the VIRT size is the approximate target process size for that process, the memory the process has asked for so far, while the RSS size is the acutal amount of memory comitted to the process by the operating system so far. After a server start the internal caches of the server are cold so there is a huge difference between these two numbers. Over the course of several house the server will heat up and the numbers will move closer together. The general target VIRT size for the server process depends on the type of storage engine being used in the server. A pure MyISAM server will use the key_buffer to cache MYI files, but leaves the caching of MYD files to the file system buffer cache. So on a pure MyISAM server it is not adviseable that MySQL or any other program uses more than approximately 50% of the available memory - the rest of the memory is to be left "unused", that is, as file system buffer cache. The command "free -m" will report memory usage inside the operating system, for example CODE: kris@h3118:~> free -m In this machine a total amount of 1011M of memory is available. Of this, 59M are shown as "free", but actually there are 574M of memory that can potentially be given up by the operating system and given to programs: 59M truly unused memory + 71M of buffers + 443M of caches. Only 436M are currently actually used by programs.total used free shared buffers cached Mem: 1011 951 59 0 71 443 -/+ buffers/cache: 436 574 Swap: 2550 31 2518 While the kernel can give up buffer cache memory in a low memory situation it does not necessarily do so. In low memory situations the kernel has to make a choice: Shrink the buffer cache or swap out some process instead. The choice the kernel actually makes is controlled by the variable /proc/sys/vm/swappiness. The number is a percentage: Set to 100, the kernel will never shrink the cache when memory is tight, but will always swap a program instead. When set to 0, we are at the other extreme of the continuum: The kernel will then always try to shrink the buffer cache and avoid swapping programs at all cost. The default is 60. Note then even with a swappiness of 0 the kernel may swap. This can happen it the file system buffer cache is already minimally sized but there still is memory pressure. In database servers, we never want to swap: Database servers tend to organize structures in memory in hashes for low latency, no access time data retrieval. These structures are very good for in-memory random access, but behave close to pessimal when used with disks, so when the database server swaps, we have in memory hashes that end up on disks, creating severe threshing and extremely bad performance. Those of you that still remember the Commodore Amiga computer will know what I am talking about, as the directory structures of the Amiga Filesystem are an extremely bad choice for disks: They are such hashes. Recall the distinctive sound of an Amiga floppy drive processing a "dir" command! That's your database server swapping. In order to minimize swapping, we need to set swappiness to 0: CODE: kris@linux:~> grep swappi /etc/sysctl.conf vm.swappiness = 0 kris@linux:~> sysctl -p We also want to monitor general disk health and usage. With databases this usually means monitoring the number of seeks, as most databases are usually not limited by disk bandwidth (measured in MB/sec), but by disk seeks per second. A single server disk or a RAID-1 pair can give you up to 200 seeks/sec. A cheap laptop disk will perform at approximately half that rate. Commands such as "iostat -x 1 3" will report disk utilization: The "-x" option reports extended disk statistics, the "1" means we will get reports in 1-second intervals and the "3" means we want 3 reports before the command terminates. The first report is always averages since system start and can be disregarded if we want a spot meaurement of current activity. We are only interested into the last two measurements. In the output, we are looking the r/s and w/s rate, the actual number of read and write operatings going to the drive (as opposed to the rreq/s and wreq/s numbers, which are calls into the driver) and we are interested into the %util value. The final ingredient of a good database server is good network connectivity. That is especially true for cluster, which practially lives off the network and needs low latency and high reliability. Tools such as smokeping can be used to plot network latency and reliability. Server level data sources At the server level we have a number of data sources available that deliver raw performance measurements. These often need some postprocessing to actually become meaningful values. The internal data sources are
Derived performance metrics The primary performance metric for a mysql server is its QPS rating, the number of questions per second processed. The status counter "Questions" is being incremented for each statement seen by the server, even syntax errors or SET statements. So "Questions"/"Uptime" is not a particulary accurate metric, but by looking at the power of 10 one can see if the server is generally busy or not: Are we seeing 10s of questions per second, 100s of questions, or even 1000s or for cluster maybe even higher numbers? Each command will also increment a "com_"-Counter named after that command, so an INSERT will bump up "Com_insert" and so on. "Com_select" is special, because "select" statements caught by the query cache will not bump Com_select at all. Instead Qcache_hits moves up. Thus, the sum of Com_select, Qcache_hits, Com_insert, Com_update, Com_delete and Com_replace will give you a rough estimate of "useful" queries processed by the server, divide by "Uptime" to get a rate. Often you want to know the relation between read and write queries per second for a server. A server that is read-heavy is easy to scale: Add memory or add replication slaves. Writes are much harder to scale: There are a few tricks to delay, sort and batch writes, but in the end you need to end up writing to stable storage, so the only thing that going to help is more disks that can handle more writes per second. The r/w ratio is of course (Com_select + Qcache_hits)/(Com_insert+Com_update+Com_delete+Com_replace). By looking at the number of Com_commit and Com_rollback we can learn even more. If Com_commit is close to 0, the application does not use transactions (but is maybe running on autocommit, which often is a very bad thing for transactional engines in a write heavy environment). The ratio of Com_rollback to Com_commit is also interesting, because especially InnoDB is not very good at processing Rollbacks. The ratio between write statements and Com_commit is also interesting because it can help to estimate the average transaction size. An 1:1 ratio between Com_insert and Com_commit is a dead giveaway for a broken application, for example - good insert batch sizes are ideally in the range of a few thousand rows per commit. Finally, a high and fast Com_set counter usually indicates the use of the JDBC driver somewhere because of the way JDBC works. Other general performance metrics deal with the table cache and the thread cache: We can take multiple measurements of opened_tables and detect the rate of opened_tables/sec. If that is high, increase your table_cache and maybe you need to set open_files_limit to match it. The table cache will eventually become completely utilized: The degree of utilization would be table_cache_size (from variables) - open_tables (from status). The thread cache is almost symmetric to this: Look at the number of threads_created/sec to detect excessive thread creation (Are you using PHP?) and if necessary make the thread_cache larger. Linux is equipped to handle thread creation nicely, but still it is going to cost ressources and should be avoided. The thread cache utilization would be thread_cache_size (from variables) - threads_cached (from status), but unlike the table cache utilization this number is much more volatile and of limited use. Looking at the connections we want to see if we ever touched max_connections ("max_connections - max_used_connections") and how far we have max_connections utilized at the moment ("max_connections - threads_connected"). Fooling around with the query cache The query cache can catch queries before they even bump com_select. So the hit ratio is "qcache_hits * 100 / (qcache_hites + com_select)". If it is below 30%, you may be looking at a query cache that is actually costing you performance. How is your qcache_hits vs. qcache_inserts? More inserts than hits? Bad. Maybe you can delay write queries and batch them in minute intervals or longer? Maybe you can split of mutable columns from read-only parts of tables and so get better hit ratios? Look at the query cache size: Are you seeing many qcache_lowmem_prunes per second? If so, you cache may be too small. But beware: A very large query cache may take a lot of CPU to maintain. How long does "FLUSH QUERY CACHE" run if you clean up the cache manually? That's an estimate for the cost of a single lowmem prune. If you increase the cache size, of course you not only expect a lowe rate of lowmem prunes, but also a higher hit rate. If that does not happen, your query cache is not effective and pushing memory at it is a waste. Again: Can you delay and batch writes or split off mutable parts? Bad query cache performance often is not a configuration issue, but requires application changes. MyISAM The central cache used in MyISAM is the MyISAM key cache. The hit ratio is officially 100-(key_reads*100/key_read_requests) and should be 99.3% at least and be better 99.9% or so. If you are using no tool to calculate, just divide key_read_request/key_reads - the result should be a number between 300 and 1000. You key cache can be to large: It caches MYI files, so if it is larger than 'select sum(index_length) as il from information_schema.tables where schema_name = ... and table_type = "myisam"' you are wasting memory. Also, there is currently a completely useless 4G limit on the size of a single query cache, so the cache cannot be larger than that. Unfortunately, many older versions of MySQL do not enforce that limit properly, so you seem to be able to set the key cache for example to 6G, but you end up in fact with a wraparound and an effevtive cache of 2G and MyISAM index corruptions. The workaround is to have multiple key caches (see the manual) and CACHE INDEX statements - not very pretty. In high concurrency situations, you will get excessive table locks and lock contention on MyISAM tables. The ratio to watch is ( table_locks_waited * 100 / tables_locks_immediate). <1% is ok, at 1% you will to feel pain and at 3+% you are usually dying in locks or be already dead. This is distinctively nonlinear behaviour. The remedy is to shorten lock time by making queries faster, or to switch to a table type that does not have table locks, but for example row locks. Slow queries and table locks in MyISAM can lead to read-write interlocks which make some operations very slow. I had one customer once who had a number of slow search queries on his database - one SELECT running sometimes for about 4 seconds, read-locking T and T_PROP to perform the query. To import new data, one write has to be made to T and about 20 writes have to go to T_PROP to establish the 1:n relationship between T and T_PROP. Writes have a higher priority than reads in MyISAM by default, but locks already given cannot be broken by queries waiting. So the inserts into T and T_PROP are waiting in "Locked" state on the slow query to finish, while new selects (slow and fast ones) will wait on the termination of the insert and also end up in "Locked" state - the slow select will indirectly slow down all other selects, even fast ones, because of the one write being wedged inbetween. The writing application is of course synchronous, and will not generate the remaining 19 insert statements before the first insert finishes. So eventually the slow select will finish, unlocking the Insert statement. The insert is quick and will finish almost immediately after unlocking thereby releasing the waiting select queries. After all the delay there is a very high probability that among the multitude of select queries there is another slow one, immediately recreating the same situation again. Possible solutions are to speed up the searches, to wrap the 20 batch inserts to T and T_PROP into a LOCK TABLES to make them appear as just one big insert or to switch to InnoDB and MVCC: In InnoDB, reads and writes will never block each other (and SELECT ... FOR UPDATE is treated as a write). InnoDB Unlike MyISAM, InnoDB uses the InnoDB buffer pool to cache indexes and data. So while with MyISAM we have a target VIRT size of 50% for a standlone pure MyISAM server, with InnoDB we do have a target VIRT size of 80%. In fact, we are playing blackjack with memory: For pure InnoDB we want the server to be as large as possible, but once we swap we are out. For this to work at all, we absolutely must set vm.swappiness to 0, especially on 64 bit machines with large memories. Otherwise we will be swapped out on a 16G machine when touching the 10G limit with out process size even if there still is plenty of free memory. The InnoDB page cache usage can be seen by looking at Innodb_buffer_pool_pages_free*100/Innodb_buffer_pool_pages_total. The hit ratio is 100 - (Innodb_buffer_pool_reads*100/Innodb_buffer_pool_read_requests), and we can expect a slightly lower hit ratio than with MyISAM, in the range of 96% to 99% because of the slightly different nature of the cache. We should monitor the buffer pool for appropriate size by looking at innodb_buffer_pool_wait_free - that variable must not count up, and we should do the same with the redo log looking at Innodb_log_waits. That variable also should not move. Temp tables Whenever in EXPLAIN we see "using temporary", MySQL resolves a query by materializing the result set or an intermediate result as a temporary table. That table is being created as a memory table unless it is too large or cannot be created as a memory table for other reasons. A table is too large for memory if it is larger than either tmp_table_size or max_heap_table_size. You have to increase both limits as the smaller one will be the effective limit for your temporary memory tables. Regardless of size any temp table will go to disk when it contains any data type with TEXT or BLOB in the name. Often it is useful to remove the BLOB from the SELECT ... FROM part of the query and go for the ID of the blob only. In a second query or a subquery, we can then use SELECT BLOB FROM ... WHERE id IN (...) to fetch the blobs. The temp table ratio to watch is created_disk_tmp_tables*100/created_tmp_tables, and of course we want to have an eye on the general rate of created_tmp_tables. If you cannot avoid disk temp tables at all, al least point the variable tmpdir to some ramdisk. Slaves Monitor your slaves functionality by checking "SHOW SLAVE STATUS" output: Slave_io_running = yes and Slave_SQL_Running = Yes. Look at the lag, Seconds_behind_master. You can measure the movement of Read_master_log_pos per second and exec_master_log_pos per second to get a feel for the amount of data flowing into your slave and the execution rate. Check the binlog cache size: binlog_cache_disk_user*100/binlog_cache_use. Maybe you need to grow the binlog_cache_size a bit (cached are statements, and the cache size should reflect statement size). Slow queries Do you need to turn on the slow query log? Look at the Slow_queries counter (after setting long_query_time down to at least 2) and the number of slow queries per second. Count your evil full joins: select_full_join/sec and select_full_join/com_select are useful ratios. The toolbox For Nagios, have a look at NagiosExchange, and have a look at the plugins provided there. My attempt at a Nagios plugin is at my german language blog, and while the text is in german, the C is international. :-) Many Nagios plugins are written in Shell or Perl and are therefore large and slow. They are also often not fully compliant with the Nagios Plugin Development Guidelines, including my own attempt, but they in general work as advertised. Some people want to collect data by using SNMP. SNMP as a protocol is broken in many interesting ways, but that is the topic for another rant another times. It suffices to say that you cannot do monitoring or even management at the level of Peeks and Pokes, and that is what SNMP tries to do. Unfortunately, alternatives do not really exist or are in a way even worse (look at WBEM). So I have written an AgentX bridge into MySQL which you can find at my english language blog. Cacti and Munin are two RRD Tool based monitoring solutions. Both are very colorful and shiny. Cacti template creating is really, really painful, though, but fortunately ready-made template sets exist. I have yet to find a way to plot key_read_requests/key_reads or anything else dependent on two variables in Cacti, so if you know how to do it, please contact me. Munin is extremely strong on monitoring systems at the OS level, but the builtin MySQL support is a bit behind, not covering the more modern data available from more recent versions of MySQL. It is a very nice tool though, and the plugins should be easily improved - much easier than Cacti, I believe. MySQL Activity Report is a tool that plots only a very limited number of variables and does not monitor multiple hosts, but it has a set of rules giving general advice, not unlike the tool I am no longer allowed to call Merlin. On the text mode console mytop and innotop are indispensable, and they should be merged into a single MySQL text mode console. Innotop is under active development, while mytop development seems to have slowed down a bit. Finally, the single all-in-one one-stop shop it all solution for monitoring is of course the tool I am no longer allowed to call Merlin. It deploys a neat and slim little C-written agent on the same machine as your server which connects to your server and executes SHOW commands in regular intervals. It will connect back to a central data collection system using https and will drop measurements into that using some XML format. The central server stores all that information in a local MySQL database and uses that information to create nice suit-compatible graphics and run rules. The ruleset in Merlin really produces a very simple junior performance tuning consultant in a box, which ties nicely into the MySQL Knowledge base and into support. Trackbacks
Trackback specific URI for this entry
No Trackbacks
|
QuicksearchCommentssantosh about An InnoDB tutorial Mon, 21.07.2008 11:13 I have a problem in deleting t he child table row from my app lication using unique id. when I am using the command [...] Azundris about Nermalisation Sat, 05.07.2008 16:08 The point is that you have the option to not do, but then yo u'll be in a world of pain. Th e point is also to show [...] fdask about Nermalisation Fri, 04.07.2008 15:40 Nice little article, and I lik e how you've used cats in your examples :) I am a little unclear what the point h [...] Salmiakki about Salmiakki - the official MySQL Drink Sun, 08.06.2008 11:34 150g of liquorice are enough t o get the original Salmiakki t aste.... you can add some more , anyway delbertino about phpvikinger.org: Things that have no name Sun, 11.05.2008 06:34 In reply to "stuff with no nam es":very informative and succi nct. I am retired and need to learn to build a website [...] ashok about What is the difference between MySQL and Postgres? Wed, 30.04.2008 14:08 what is the difference between MySql and PostgreSql? fortunecookienotes about Fortune Cookie Wed, 09.04.2008 21:46 What a random fortune, who kno ws what it means. My favorite random fortune cookie note rea ds "you will make a good [...] Categories
Syndicate This BlogBlog AdministrationDisclaimerAll examples, code-snippets, advice, trivia, etc. are supplied in the hope that they are useful, but no warranty of any kind is given. Use of trademarks is not intended as a challenge to them.
ImpressumCopyright 2005-2007 bei den jeweiligen Autorinnen der Beiträge. Die Infrastruktur von MySQL-dump wird betrieben von:
Azundris, Immanuelkirchstraße 18, 710405 Berlin, Germany
|
