tag:blogger.com,1999:blog-314219542010-09-06T22:57:05.123-07:00mySQL DBA<a href="http://hotjobs.yahoo.com/resumes/dathanvp/public"><img src="http://www.linkedin.com/img/webpromo/btn_viewmy_160x25.gif" width="160" height="25" border="0" alt="View Dathan's profile on LinkedIn"></a>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.comBlogger125125tag:blogger.com,1999:blog-31421954.post-40654536690063753432010-09-03T15:42:00.000-07:002010-09-04T19:11:53.934-07:00Cassandra and Ganglia<a href="http://www.flickr.com/photos/dathan/4955551654/" title="cassandra_tpstats_row_read_stage_completed by dathan, on Flickr"><img src="http://farm5.static.flickr.com/4106/4955551654_58e8b73a44.jpg" width="500" height="320" alt="cassandra_tpstats_row_read_stage_completed" /></a><br /><br />I finally got some time to do some house cleaning. One of my nagging low-hanging fruit jobs was to stop using jconsole as my monitor. I created a ganglia script to graph what is above. The image illustrated above I am showing all the Cassandra servers and their total row read stages completed in the last hour as a gauge. In essence I am graphing the delta of the change between ganglia script runs.<br /><br />How I have it set up is:<br /><br />All data exposed by JMX to produce tpstats and cfstats is graphed via ganglia. The pattern for each graph is as follows<br /><br />cass_{stat_class}_{key}<br /><br />stat_class - tpc, tpp, tpa means complete, pending, active respectively<br />key - would be message deserialization for instance.<br /><br />For column family stats I graph the keyspace stats as well as the specific column family stats exposed by cfstats. For instance below:<br /><br /><a href="http://www.flickr.com/photos/dathan/4955539434/" title="Cassandra cfstats with ganglia by dathan, on Flickr"><img src="http://farm5.static.flickr.com/4105/4955539434_4fd19b1acd.jpg" width="382" height="500" alt="Cassandra cfstats with ganglia" /></a><br /><br />If you’re interested in the scripts I'll send it to you or put it up on code.google.com, its written in perl OOP perl and takes the same approach of packaging that maatkit tool kit for mySQL by Xarb and crew does (puts all the "classes" in the file as the application).<br /><br />GmetricDelegate is the parent package<br />GmetricCassandra extends GmetricDelegate and overloads getData as well as defines what is an absolute stats vrs a gauge.<br /><br />As you can see the pattern I also have<br />GmetricInnoDB<br />GmetricMySQL<br /><br />and so on.<br /><br />then on each server I run<br /><br />/usr/bin/perl -w /home/scripts/ganglia_gmetric.pl --module=GmetricCassandra<br /><br />this then talks to Ganglia through gmetric to report the stats.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-4065453669006375343?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com1tag:blogger.com,1999:blog-31421954.post-87603141695459996882010-08-18T15:28:00.000-07:002010-08-18T16:19:52.140-07:00Review of MySQL High Availability by Charles Bell, Mats Kindahl and Lars Thalmann<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://oreilly.com/catalog/9780596807290"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 180px; height: 236px;" src="http://covers.oreilly.com/images/9780596807290/cat.gif" border="0" alt="" /></a><br />The kind folks at O'Reilly sent me MySQL High Availability by Charles Bell, Mats Kindahl and Lars Thalmann . In summary, the book is awesome. Personally I didn't think that Replication was enough of a feature in mySQL to fill up 500+ pages about it, BUT, they did and not with fillers or info that is NOT actually important to you as the end designer. In <a href="http://mysqlha.blogspot.com/">Mark Callaghan</a>'s forward he states that this book "adequately" explains MySQL replication, this is a huge low-ball, I dare say if you did not look at the replication code (prior to row-based replication there was not a lot of it) your questions can be answered with this book. There are three parts to this book: Replication (roughly 50% of the book), Monitoring and Disaster Recovery, then finally High Availability Environments. Each part is well written and accurate. I have been using mySQL before Replication existed. When it was first released I have been using it and been to many talk about it, from the original author talks to Brian Aker’s talks to asking <a href="http://jeremy.zawodny.com/">Jeremy Zawodny</a> about it to giving my own talks about Scale-Out, replication, etc. <br />My experience with mySQL Replication is rather robust. I broke holes in it, found bugs with it, and even wrote a quick C client (when it was still statement based replication only on the 4.1 branch) that would act as a proxy to gather all the replication feeds of all boxes onto a single box (it was a proof of concept). The book goes into clear detail about the common approaches and use cases of replication scaling out the apps reads. It even talks about data sharding that I must say I am an expert at (done it for Flickr/RockYou, for friends etc). The examples are clear although I would not recommend deploying their examples verbatim. The reason you never want to mod based on the number of shards in your system, because if you add more shards you are going to have to move your data all around. Central dictionaries are perfect to control balance. They go into this, but not how to lock a user to migrate to a lightly loaded shard. The example that they have creates holes in the flow, race conditions as well as downtime to move data, this is a nitpick, the book is good.<br /><br />I do not want to give away the book, it is a good read and the quality is what you expect from O’Reilly. If you want to know about various replication techniques, replication shortcomings, building fail-over systems, and exploring other technologies that are comparable to replication, this is a great book for you. Oh and of note, they even go into exotic features that I rarely use in production and how that effects replication. Awesome.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-8760314169545999688?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com2tag:blogger.com,1999:blog-31421954.post-88805556441887052132010-07-08T12:58:00.000-07:002010-07-08T15:36:25.926-07:00Upgrading Cassandra 0.5.1 to 0.6.3Every month or so a node randomly dies<br /><pre><br />EQX root@cass01:/opt/cassandra/bin# ./nodeprobe -host localhost -port 8181 ring<br />Address Status Load Range Ring<br /> facebook_1301003235_1301003235 <br />10.129.28.22 Down 15.77 GB 9ZehBzpHHwnxiPJU |<--|<br />10.129.28.23 Up 7.59 GB facebook_100000471858343_1514390063 | |<br />10.129.28.14 Up 4.59 GB facebook_100000846936312 | |<br />10.129.28.20 Up 12.94 GB facebook_1301003235_1301003235 |-->|<br /></pre><br /><br />Trying to get info from the host, the reads timeout.<br />java.net.SocketTimeoutException: Read timed out<br /><br /><br />Doing an lsof -p on the java proc I see that it is holding open a bunch of sockets. So the node itself is hanging on something internal is my assumption.<br /><br />Looking at /var/log/cassandra/system.log I see that the last rotation happened Jun 8th over a month ago and no new log is being written to. THe issue is the node just died today. So this seems like a bug to me.<br /><br /><br />Now since Cassandra does not tell me what the problem is, I assume that there is a bug in this version and searching Cassandra <a href="https://issues.apache.org/jira/sr/jira.issueviews:searchrequest-printable/temp/SearchRequest.html?query=node+hung&summary=true&description=true&pid=12310865&fixfor=12314040&tempMax=1000">Jira bug database</a> I see that a lot of stuff is fixed as well as added. So might as well as upgrade.<br /><br />Before I upgrade I wanted to do research to see if anyone else has. To my surprise there doesn't seem to be any blog talking about upgrading from <a href="http://www.google.com/search?hl=en&q=cassandra+upgrade+from+0.5+to+0.6.3&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=">0.5 to 0.6.3</a><br /><br />I know its rather easy but there is some new stuff in 0.6.3 that is turned on by default: So let's see what changes in the conf<br /><br />diff /opt/cassandra/conf /opt/apache-cassandra-0.6.3/conf<br /><br />I see that in storage.xml there is some new XML attributes for the ColumnFamily tag such as RowsCached, new tags called <a href="http://wiki.apache.org/cassandra/HintedHandof">HintedHandoffEnabled</a>, Authenticator, DiskAccessMode, <a href="http://wiki.apache.org/cassandra/CassandraLimitations">RowWarningThresholdInMB</a>.<br />Additional to this I noticed that a lot of XML tags are missing. A rolling upgrade is just not possible and is mentioned in NEWS.txt<br /><br />Thus in my application I set this $GLOBALS['cfg']['disable_nosql_feature'] = 1; I have about 40 toggles to play with, a very helpful process to enable dynamically code with out breaking your site.<br /><br /><br />now time for an upgrade without the service running:<br /><br />Steps:<br /><OL><br /><LI> Shut down Cassandra: dsh -g cassandra "pkill java" # same thing as stop-server</LI><br /><LI> rpm -e cassandra-0.5.1 </LI><br /><LI> rpm -ivh cassandra-0.6.3.rpm </LI><br /><LI> /opt/cassandra/bin/cassandra </LI><br /></OL><br /><br />Done. Note what the hell is cassandra-0.6.3.rpm, it's an rpm I created that has my storage-conf.xml<br />log4j.properties<br />cassandra.in.sh<br /><br />After Upgrading:<br /><br /><pre><br />***************************************************************<br />WARNING: ./nodeprobe is obsolete, use ./nodetool instead<br />***************************************************************<br />Address Status Load Range Ring<br /> facebook_1301003235_1301003235 <br />10.129.28.22 Up 11.75 GB 9ZehBzpHHwnxiPJU |<--|<br />10.129.28.23 Up 3.04 GB facebook_100000471858343_1514390063 | |<br />10.129.28.14 Up 2.33 GB facebook_100000846936312 | |<br />10.129.28.20 Up 4.4 GB facebook_1301003235_1301003235 |-->|<br /></pre><br /><br /><br />Now what is left to do it change my ganglia scripts / nagios scripts to use nodetool instead of nodeprobe.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-8880555644188705213?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-16632331794749276112010-06-25T01:02:00.000-07:002010-06-25T02:19:02.477-07:00dsh and TABLE CHECKSUMSo running through some various tasks, I'm finally on the section of work where I can resurrect a script that finds inconsistent data between master-master pairs. <br /><br />Let's get a quick summary to find our problems to target the script at.<br /><br /><pre><br />dsh -ef 20 -w dbfacebook9b,dbfacebook9a -s ~/bin/execute_query.sh "FacebookShard \"CHECKSUM TABLE RollingUserLog\""<br />dbfacebook9b: *************************** 1. row ***************************<br />dbfacebook9b: Table: FacebookShard.RollingUserLog<br />dbfacebook9b: Checksum: 538386033<br />dbfacebook9a: *************************** 1. row ***************************<br />dbfacebook9a: Table: FacebookShard.RollingUserLog<br />dbfacebook9a: Checksum: 538386033<br /></pre><br /><br />not bad.<br /><br />CHECKSUM TABLE uses a ACCUM algorithm to determine the table checksum. This gives a good fast snapshot in a shared lock mode so writes do not stop while doing a checksum (this is in the context of INNODB). RollingUserLog gets nearly 700 writes per sec per box (No alerts).<br /><br />But what if there is a problem<br /><pre><br /># dsh -ef 20 -w dbfacebook38b,dbfacebook38a -s ~/bin/execute_query.sh "FacebookShard \"CHECKSUM TABLE SimulatedProblem\""<br />dbfacebook38b: *************************** 1. row ***************************<br />dbfacebook38b: Table: FacebookShard.SimulatedProblem<br />dbfacebook38b: Checksum: 660032421<br />dbfacebook38a: *************************** 1. row ***************************<br />dbfacebook38a: Table: FacebookShard.SimulatedProblem<br />dbfacebook38a: Checksum: 2533654621<br /></pre><br /><br />Well let's see if there is a row count mismatch<br /><br /><pre><br /># dsh -ef 20 -w dbfacebook38b,dbfacebook38a -s ~/bin/execute_query.sh "FacebookShard \"SELECT COUNT(*) FROM SimulatedProblem\""<br />dbfacebook38b: *************************** 1. row ***************************<br />dbfacebook38b: COUNT(*): 64358<br />dbfacebook38a: *************************** 1. row ***************************<br />dbfacebook38a: COUNT(*): 64358<br /></pre><br /><br /><br />Ah so the data is inconsistent, we are not missing rows but 1 or more rows have different values. Now time to find it.<br /><br />To resurrect my script now that I know what table is messed up:<br /><br />Algorithm is this:<br /><br />open connection to both servers - fork<br />compare the data by scanning the table via a index walk<br />crc32/md5/hash the rows<br />compare columns <br />if (dbfacebook38b.hash != dbfacebook38a.hash)<br />mark row and record primary key to track position<br /><br />print report<br /><br />but wait why polish up my script when I can use a formal one that does the job great!<br /><br />@see<br /><br />mk-table-checksum<br />mk-table-sync<br /><br /># fixes the issue<br /><pre><br />mk-table-sync h=dbfacebook38a,u=$USER,p=$PASS,D=FacebookShard,t=SimulatedProblem h=dbfacebook38b --execute --no-check-slave<br /></pre><br /><br />maatkit is awesome. Nearly every problem that I run into and built a script for has been formalized. I'm going to muster up some time and contribute to this toolkit like add my binary log rotate which makes sure that the slaves are caught up to the binary file that is being purged.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-1663233179474927611?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com1tag:blogger.com,1999:blog-31421954.post-51494795643009308052010-06-07T12:07:00.000-07:002010-06-07T12:47:39.401-07:00Review of High Availability MySQL Cookbook by Packt Publishing<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://www.packtpub.com/sites/default/files/imagecache/productview/9942_High%20Availability%20MySQL%20Cookbook.jpg"><img style="float:right; margin:0 0 10px 10px;cursor:pointer; cursor:hand;width: 125px; height: 152px;" src="http://www.packtpub.com/sites/default/files/imagecache/productview/9942_High%20Availability%20MySQL%20Cookbook.jpg" border="0" alt="" /></a><br />A few months ago, I reviewed MySQL Admin Cookbook. Today I am reviewing High <a href="https://www.packtpub.com/high-availability-mysql-cookbook/book">Availability MySQL Cookbook from Packt Publishing by Alex Davies</a>. Overall, I found the book to contain some good hidden Gems.<br /><br />The book is a mixture of MySQL Cluster (NDB), Replication schemes, some performance tuning, some minor kernel tweaking, and some more exotic approaches to common High Availability problems. Overall, I found this book very informative and a good read.<br />Now the specifics, the book starts out on NDB and stays focused on this fact for about 60% of the book. The next 20% is on mySQL replication then about 10% of the book is on tweaking kernel, mysql, network settings to get the most out of the system. The last 10% is a mixture of uses of exotic systems such as GFS, Conga, ISCSI and how to use these shared storage techs with mySQL.<br /><br />High Availability MySQL Cookbook, is a good read, and jogged my memory on NDB (since I do not use it on a day-to-day basis). If you are looking for a good reference on how to get an HA system up then this is a good book for you. If you are looking for why to use NDB over mySQL multi-master replication, this is not a book for you. The “why-to-use-this-over-that” is not the scope of the book. The drawbacks and concerns of what technology to use are not a focus of this book, and should not be. It is assumed that the reader knows which direction to go into for the most part or gives the reader enough information to set up and environment to see which way to go.<br /><br />In conclusion, I like this book. I think it is rather concise, and right to the point, which either gets you started into building HA systems or gives you a good reference for an existing HA environment.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-5149479564300930805?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-63225785097129352652010-04-29T13:43:00.001-07:002010-04-29T13:43:18.048-07:00Mastering the art of indexingCheck out this SlideShare Presentation: I'd have to say that this is the coolest presentation on Indexing and how it relates to INNODB. I have written on this subject many times in the pass but this presentation covers all the bases and does a great job at explaining WHY you should do certain statements over others.<div style="width:425px" id="__ss_3739976"><strong style="display:block;margin:12px 0 4px"><a href="http://www.slideshare.net/matsunobu/more-mastering-the-art-of-indexing" title="More mastering the art of indexing">More mastering the art of indexing</a></strong><object id="__sse3739976" width="425" height="355"><param name="movie" value="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=moremasteringtheartofindexing-100415172610-phpapp02&stripped_title=more-mastering-the-art-of-indexing" /><param name="allowFullScreen" value="true"/><param name="allowScriptAccess" value="always"/><embed name="__sse3739976" src="http://static.slidesharecdn.com/swf/ssplayer2.swf?doc=moremasteringtheartofindexing-100415172610-phpapp02&stripped_title=more-mastering-the-art-of-indexing" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="355"></embed></object><div style="padding:5px 0 12px">View more <a href="http://www.slideshare.net/">presentations</a> from <a href="http://www.slideshare.net/matsunobu">Yoshinori Matsunobu</a>.</div></div><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-6322578509712935265?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-80722982544879134252010-04-13T14:52:00.000-07:002010-04-15T17:47:13.432-07:00Review of MySQL Admin Cookbook from PACKT Publishing<a href="https://www.packtpub.com/mysql-admin-cookbook/book">PACKT Publishing</a> sent me titled "<a href="http://www.amazon.com/MySQL-Admin-Cookbook-Daniel-Schneller/dp/1847197965/ref=sr_1_1?ie=UTF8&s=books&qid=1271203041&sr=1-1">MySQL Admin Cookbook</a>" to review and I told them that I would be brutally honest about it. They said cool and well here, we go.<br /><br />Overall, the book is cool if you are starting out in MySQL administration and want to get a box up and running. If you are looking to scale MySQL or make your application faster this is not the book for you. If you are worried about consistency and getting the most out of your hardware-this is not the book for you. If you are trying to figure out what the best index combination is-again-this is not the book for you. If you want to know how to add users, or set up replication, or dump a CSV format text file of data then this is the book for you. <br /> <br />Some things that annoy me from this book is all of the GUI cut and paste screen shots. Explaining stuff with a GUI screen shot really sucks IMHO since by the time you read the book, the GUI changed. I personally stick with command line interfaces or write my own GUI layouts to administration actions since I know what the various ADMIN commands do. Let me stress again that GUI explanations really go out of date fast and is only pertinent for when the book is made. For instance if you ever used Eclipse, a common IDE for various languages (mainly Java), between Eclipse builds the GUI changes. The overall interface for the MySQL command line client has stayed the same since the very beginning. To be fair though the book does show some mySQL command line examples, like for handling NULLs but consistency is key to getting your ideas across.<br /><br />Another pet peeve of mine is the book has a tag line <span style="font-style:italic;">99 great recipes for mastering MySQL configuration and administration</span> yet I couldn't confirm 99 recipes since the book is not actually structured this way IMHO. It is structured in the format of "How to do it", "How it works", and "there's more..." for certain actions and there is just not enough meat for Mastering MySQL configurations – like what is a Star Replication Schema and how to do it? How do you rotate in new servers when in a circular MySQL config? Where is MySQL clustering? Why are file sorts so slow? How is MySQL using the disk subsystem with this config ... etc.<br /><br /> IN conclusion, would I recommend this book to readers? If you need a starting point to ask Google for some more complicated questions-this is a good start. For experienced administrators, no it is not for you.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-8072298254487913425?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com1tag:blogger.com,1999:blog-31421954.post-41737773744139427292010-03-26T10:45:00.000-07:002010-03-26T11:09:36.476-07:00Some kernel tweaks to aid Cassandra under a high concurrency environmentFor the past couple of weeks I have been trouble shooting some Cassandra issues where data would not make it to Cassandra.<br /><br /><a href="http://www.flickr.com/photos/dathan/4464575619/" title="Graph of various tracked Exceptions by dathan, on Flickr"><img src="http://farm5.static.flickr.com/4047/4464575619_c1f48240bd_b.jpg" width="1024" height="382" alt="Graph of various tracked Exceptions" /></a> <br /><br /><br />The image above graphs all the exceptions that are produced from Cassandra. The two big lines are<br /><br />Transport Exceptions (te) - meaning that Cassandra could not answer the request think of this as MAX Connection errors in mySQL.<br /><br /><br />Unavailable Exceptions (ue) - meaning that Cassandra could answer the request but the "storage engine" cannot do anything with it because its busy doing something like communicating with other nodes or maintenance like a node cleanup.<br /><br /><br />So how did I get the graph to drop to 0? After looking at the error logs, I saw that Cassandra was getting flooded with SYN Requests and the kernel thought that it was a SYN Flood and did this<br /><br /><i>possible SYN flooding on port 9160. Sending cookies.</i><br /><br /><br />To stop this the puppet profile was changed to have<br /><br />sysctl -w net.ipv4.tcp_max_syn_backlog=4096<br />sysctl -w net.ipv4.tcp_syncookies=0<br /><br /><br /><br />Next looking into the Cassandra log which I defined to exist in /var/log/cassandra/system.log<br /><br /><blockquote><br />WARN [TCP Selector Manager] 2010-03-26 02:46:31,619 TcpConnectionHandler.java (line 53) Exception was generated at : 03/26/2010 02:<br />Too many open files<br />java.io.IOException: Too many open files<br /></blockquote><br /><br />Then noticed that <br />ulimit -n == 1024<br /><br />thus I changed<br />/etc/security/limits.conf so that It's at a server setting by adding this:<br /><pre><br />* - nofile 8000<br /></pre><br /><br />Now my Transport Exceptions and Unavailable Exceptions are gone and data is being written to it consistently.<br /><br />There are many other ways of doing the same thing, I could have modified my init script or did some other stuff but I choose this way. Default Distros set kernel and limits fields too low: settings for desktop levels.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-4173777374413942729?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com4tag:blogger.com,1999:blog-31421954.post-71500759603892020432010-03-23T18:36:00.000-07:002010-03-23T20:32:54.831-07:00Cassandra is my NoSQL solution but..In the past few months, I have tested many NoSQL solutions. Redis, MongoDB, HBase yet Cassandra is the Column Store DB I picked because of its speed (on writes), reliability, built in feature set that makes it multi-datacenter aware. The one other personal reward for Cassandra is it is written in Java. I like reading and writing in Java more than C++ although it really does not matter for me personally in the end.<br /><br />Let us talk about the reason why I am introducing Cassandra into my infrastructure and some of its drawbacks I have noticed so far.<br /><br />Why it is being introduced:<br />We have a feature where we record every single click for 50 million Monthly Active Users (real-time) and storing this in mySQL is just waste of semi-good hardware for data that is only looked at for the past 24 hours. Over the course of some time (couple of months) more than 3 billion rows accumulated, which translated into a 3.5 TB distributed INNODB datafile. So purging/archiving this data just sucks.<br /><br />Now introducing user clicks into Cassandra was rather easy. I researched various sites, asked my Digg buddy and then figured out the rest. Within two days I was up and running with Cassandra and had a great understanding about Column Families and SuperColumns.<br /><br />Developing with a Cassandra Data layer:<br /><br />Now that the infrastructure is up, I needed to add a data model to /opt/cassandra/conf/storage.conf. The SQL that drove this functionality consisted of two main SQL statements.<br />Add click<br />Get a Range of Clicks<br /><br />For these operations, mySQL rarely takes 0.001 seconds (1 ms).<br /><br />Cassandra for writes is rather fast, but for reads, Getting Range Clicks<br />i.e. using <br />recv_get_slice<br /><br />it takes .02 seconds (20 ms).<br /><br />What does this mean? MySQL is a hell of a lot faster! Is it because of my CF design? No.<br />for instance, take this slug (keys separated by a delimiter to make a distinct key name).<br /><br />{$clicker}_{$pageowner} => [ $object_id_clicked0 => $value,$ts, ... $object_id_clicked =>N $value,$ts ]<br /><br /><br />The slug says that clicker A, clicked on page B and what is stored are columns of distinct clicks - remember this is a column store DB.<br /><br />So <br />{$clicker}_{$pageowner} == the KEY for the COLUMN $object_id_clicked with the value being a $value and the free extra value the $timestamp.<br /><br />Here is the php code<br /><code><br />$columnPath = $this->getColumnPath($objectid);<br /> $this->getCassandraConnect()->insert(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $columnPath, $this->today_ts, microtime(true), ConsistencyLevel::ZERO);<br /></code><br /><br /><br />Now I want a list of items clicked.<br /><code><br />$data = $this->getCassandraConnect()->get_slice(self::KEYSPACE, $this->getKey($clicker, $pageOwner), $this->getColumnParent(), $this->getSlicePredicate(), ConsistencyLevel::ONE);<br /></code><br /><br />This says give me the last N logged clicks that the clicker A made for page owner B. This is a hash lookup (Big-O(1)) but a sorted list of columns (O(nlogn)) and return the last N elements.<br /><br />Why is Cassandra sooo slow on reads. Is it because my memory config is not enough?<br />No. 7 GB of data is allocated for data that fits in memory ( for now ).<br /><br /><br />Is it because my disk is saturated?<br /><pre><br />avg-cpu: %user %nice %system %iowait %steal %idle<br /> 12.21 0.00 2.85 0.48 0.00 84.47<br /><br />Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util<br />sda 0.01 14.01 0.23 6.61 22.25 165.03 27.34 0.33 47.57 7.35 5.04<br />sda1 0.00 0.00 0.00 0.00 0.00 0.00 20.83 0.00 7.74 6.40 0.00<br />sda2 0.01 14.01 0.23 6.61 22.25 165.03 27.34 0.33 47.57 7.35 5.04<br />dm-0 0.00 0.00 0.24 20.63 22.25 165.03 8.97 0.72 34.27 2.41 5.04<br />dm-1 0.00 0.00 0.00 0.00 0.00 0.00 8.00 0.00 5.28 0.22 0.00<br /></pre><br /><br /><br />No its not.<br /><br />My <span style="font-weight:bold;">hunch</span> is the slowdown is a combo of the low thread read pool and in the inherent sorting and there doesn't seem to be a way to turn it off, so without digging in the code I couldn't say (I will be able to once I get an understanding of the code layout). <br /><br /><br />In conclusion, I like Cassandra, it’s very fast in writes, slow (for my taste but fast enough) in reads and what takes 2 lines of SQL takes 250 lines of PHP code interfacing with the Cassandra.thrift suite. I am going to use it still because it is good enough and I love the built in HA of it. Additionally the performance is improving very quickly.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-7150075960389202043?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com12tag:blogger.com,1999:blog-31421954.post-15944787537255750912010-03-10T15:56:00.000-08:002010-03-10T15:59:58.070-08:00Talking at the University of UtahGiving a talk at the University of Utah on everything from scaling, clustering, mysql, mysql internals, noSQL (Cassandra) to how to manage all this stuff. If you are there at University I'm bringing some Swag!<br /><br />Also I will upload the slides and put them here.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-1594478753725575091?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com2tag:blogger.com,1999:blog-31421954.post-33441198259735318852010-02-25T11:01:00.000-08:002010-02-25T11:05:37.077-08:00NoSQL explained to DBAs<a href="http://krow.livejournal.com/684898.html">Brian Aker</a>, a brilliant helpful duder, who I learn a lot from. Gives a great talk about what is NoSQL explained in a way for database guys. I warn you, there are some points in this video where you can't hear Brian due to the audience "participation" but you should get the content.<br /><br /><br /><object width="425" height="344"><param name="movie" value="http://www.youtube.com/v/LhnGarRsKnA&hl=en_US&fs=1&"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/LhnGarRsKnA&hl=en_US&fs=1&" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" width="425" height="344"></embed></object><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-3344119825973531885?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-31602154630069454642010-02-04T11:24:00.001-08:002010-02-04T12:12:50.607-08:00innodb_file_per_table, shrinking table spaces and the data dictionaryINNODB has some irritating gotchas that makes disk space management hard. In 2002ish INNODB, added innodb_file_per_table to get around allot of these issues, but it does not fix everything.<br /><br />If you are running innodb_file_per_table, you will notice in your database directory <br /><UL><br /><LI>db.opt - database characteristics file. </LI><br /><LI> tablename.frm - the table structure. </LI><br /><LI> tablename.ibd - the actual innodb table space file </LI><br /></UL><br /><br />Imagine that you have a table with 10 million rows and you delete say 5 million rows in multiple chunks around 400K chunks, because deletes are slow. Next, you notice that the table space file did not shrink. So what do you do? OPTIMIZE tablename, tada all the wasted space is reclaimed, but here is the PROBLEM the ibdata file grew!<br /><b><br />ibdata stores all of the UNDO LOGS thus GROWS due to the deletes and space is never reclaimed.<br /></b><br /><br />Ok, lets try coping the tablespace file to another directory and re-import the tablespace file after wiping the data dictionary (ibdata).<br /><br />For instance<br />/etc/init.d/mysql stop<br />cd /var/lib/mysql/DB<br />cp * /tmp/hold<br />rm /var/lib/mysql/ib*<br />cp /tmp/hold/* /var/lib/mysql/DB<br />/etc/init.d/mysql start // create the ibdata file<br /><br />ALTER TABLE tablename IMPORT TABLESPACE<br />ERROR 1146 (42S02): Table 'DB.tablename' does not exist<br /><br />Really, there is nothing that you can do to force innodb to purge the undo logs in the ibdata file without a full dump of the data to text and then reimport the data.<br /><br />So here are the steps to shrink all table spaces and the ibdata file<br /><br />mysqldump --all-databases (or use mk-parallel-dump)<br />stop mysql<br />rm -f /var/lib/mysql/ib* /var/lib/mysql/DB/*<br />start mysql<br />mysqlimport (or use mk-parallel-restore)<br /><br />Takes a bunch of time but there is no other recourse. If you know of another way please share :)<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-3160215463006945464?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com2tag:blogger.com,1999:blog-31421954.post-79045743053401138372010-01-08T16:21:00.000-08:002010-02-16T12:07:41.604-08:00Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHPI'll be giving a web presentation Jan. 19th 2010. If you would like to check it out please do!<br /><br /><br /><a href="http://www.mysql.com/news-and-events/web-seminars/display-475.html" alt="click me">Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP</a><br /><br /><br />I thought it went well. I will post the link to the slides here :)<br /><br /><a title="View Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP on Scribd" href="http://www.scribd.com/doc/26955045/Scaling-to-200K-Transactions-per-Second-with-Open-Source-MySQL-Java-curl-PHP" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">Scaling to 200K Transactions per Second with Open Source - MySQL, Java, curl, PHP</a> <object id="doc_571268112673426" name="doc_571268112673426" height="600" width="100%" type="application/x-shockwave-flash" data="http://d1.scribdassets.com/ScribdViewer.swf" style="outline:none;" > <param name="movie" value="http://d1.scribdassets.com/ScribdViewer.swf"> <param name="wmode" value="opaque"> <param name="bgcolor" value="#ffffff"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="FlashVars" value="document_id=26955045&access_key=key-2cb93mz5nrq8armj4zw0&page=1&viewMode=slideshow"> <embed id="doc_571268112673426" name="doc_571268112673426" src="http://d1.scribdassets.com/ScribdViewer.swf?document_id=26955045&access_key=key-2cb93mz5nrq8armj4zw0&page=1&viewMode=slideshow" type="application/x-shockwave-flash" allowscriptaccess="always" allowfullscreen="true" height="600" width="100%" wmode="opaque" bgcolor="#ffffff"></embed> </object><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-7904574305340113837?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com1tag:blogger.com,1999:blog-31421954.post-47096199621419716092009-11-09T16:58:00.000-08:002009-11-10T12:43:24.549-08:00Asynchronous Queries verses Synchronous QueriesIn a procedural language without the use of threads (or Inter Process Communication via forks), to execute I/O requests they are done one after another. Synchronous Queries produce at best a Big-O of N such that N is an element of I/O communication (queries) and N equals the number of queries needed to achieve the requested dataset.<br />With IPC or threads we can speed up common O(N) problems to reduce the N with parallelism, its still functionally a O(N) yet from a single instance point of view N is much less because threads (IPC) takes that Serial computing component and executes the code in parallel. To better explain what I am talking about lets look at some PHP code:<br /><br /><pre><br />foreach($friends as $friend){<br /> $data[] = getMySQLData(“SELECT * FROM AccountData WHERE userid = $friend);<br />}<br /></pre><br /><br />The Primary key for the AccountData table is userid. Assuming that you have 5000 friends, the query has to be executed 5000 times.<br />We can reduce the O(N) and change it to a O(nlogn) (Binary Tree - doesn't take into account other factors) by switching the query to<br /><br /><pre><br />$data = getMYSQLData(“SELECT * FROM AccountData WHERE userid IN (….)”);<br /></pre><br /><br />We just sped up the retrieval of the data significantly, yet we just introduced a bottle neck on the datalayer. Our architecture requires that the data is located in a single location.<br /><br />What if AccountData’s data is spread across many servers federated by userid? This means that userid belongs to a server, so the server contains a shard of the AccountData’s Data.<br /><br />Now we are back to a O(N) where each query needs to be executed on the corresponding shard. The logical next step is to group queries per shard and run across them all. For instance<br /><br /><pre><br />$multiShardIDs = $genericShard->getMultipleShardIDs($objIds);<br />foreach ($multiShardIDs as $shardID => $shardUserIDs) {<br /> if (stripos($orgQuery, " WHERE ") !== false){<br /> $query = $orgQuery." AND {$column} IN (".implode(',', $shardUserIDs).") ";<br /> } <br /> else{<br /> $query = $orgQuery." WHERE {$column} IN (".implode(',', $shardUserIDs).") "; <br /> }<br /><br /> $shard_to_sql[$shardID] = $query;<br /><br /> .... more stuff ....<br /><br /></pre><br /><br />Yet this is still a O(N) its just that N is smaller. Each query is still executed serially. <br />Let’s look at some stats of synchronous queries of SELECT 1; This query is executed across 35 shards and the timings are from PHP point of view.<br /><br /><table border=1><tr><th>Field</th><th>End Value</th><th>Start Value</th><th>Delta</th></tr><tr><td>ru_oublock</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_inblock</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_msgsnd</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_msgrcv</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_maxrss</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_ixrss</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_idrss</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_minflt</td><td>9872</td><td>9865</td><td> 7</td></tr><tr><td>ru_majflt</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_nsignals</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_nvcsw</td><td>11344</td><td>11114</td><td> 230</td></tr><tr><td>ru_nivcsw</td><td>977</td><td>968</td><td> 9</td></tr><tr><td>ru_nswap</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_utime.tv_usec</td><td>865054</td><td>849053</td><td> 16001</td></tr><tr><td>ru_utime.tv_sec</td><td>16</td><td>16</td><td> 0</td></tr><tr><td>ru_stime.tv_usec</td><td>556097</td><td>552097</td><td> 4000</td></tr><tr><td>ru_stime.tv_sec</td><td>1</td><td>1</td><td> 0</td></tr><tr><td colspan=3>Total Execution Time</td><td>0.18323707580566</td></tr></table><br /><br /><br /><br />As you can see, to execute this from PHP it took 100 ms, 100s pages reclaimed and 200s voluntary context switches to query 35 servers.<br /><br />Now let’s look at Asynchronous execution of SELECT 1; // the query generation is from PHP yet the execution is performed on a server that executes the query in parallel<br /><table border=1><tr><th>Field</th><th>End Value</th><th>Start Value</th><th>Delta</th></tr><tr><td>ru_oublock</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_inblock</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_msgsnd</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_msgrcv</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_maxrss</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_ixrss</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_idrss</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_minflt</td><td>9131</td><td>9121</td><td> 10</td></tr><tr><td>ru_majflt</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_nsignals</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_nvcsw</td><td>3891</td><td>3889</td><td> 2</td></tr><tr><td>ru_nivcsw</td><td>290</td><td>290</td><td> 0</td></tr><tr><td>ru_nswap</td><td>0</td><td>0</td><td> 0</td></tr><tr><td>ru_utime.tv_usec</td><td>596287</td><td>596287</td><td> 0</td></tr><tr><td>ru_utime.tv_sec</td><td>4</td><td>4</td><td> 0</td></tr><tr><td>ru_stime.tv_usec</td><td>460028</td><td>460028</td><td> 0</td></tr><tr><td>ru_stime.tv_sec</td><td>0</td><td>0</td><td> 0</td></tr><tr><td colspan=3>Total Execution Time</td><td>0.019363880157471</td></tr></table><br /><br /><br /><br />As you can see from the table above executing the query asynchronously produced results with less context switching, less pages reclaimed and almost 10 times execution improvement over the synchronous query counterpart.<br />How is the asynchronous query executed? Lets take a look at the figure below.<br /><br /><a href="http://www.flickr.com/photos/dathan/4090681457/" title="Async by dathan, on Flickr"><img src="http://farm3.static.flickr.com/2731/4090681457_150aeb59fe.jpg" alt="Async" height="361" width="500" /></a><br /><br /><br />So A user comes through the firewall / load balancer with a HTTP Request to the www pool that runs PHP. PHP now makes a CURL request to the Async Shard Servers (through a LB same LB different PORT). The HTTP Request to the Async Shard Server contains the SQL we wish to execute. The Async Shard Servers has a thread per shard and executes the request in parallel. The results are merged and sent to the calling CURL process via JSON. The returned JSON is then converted into a PHP object. This is a typical three-tier environment.<br /><br />When having to query multiple servers using an Asynchronous Tier is dramatically faster; in fact its as fast as the slowest server. This is the main sticking point of why asynchronous queries are faster then synchronous queries (in this context) since the total execution time for serial queries is the SUM of all the query execution.<br /><br />The current version of the server is used for Friend Query execution across the datalayer. Its been solid for a few months now, and I'm currently getting permission to release it as an Open Source Product. The features this server contains:<br /><UL><br /><LI> Lightweight </LI><br /><LI> CPU bounded </LI><br /><LI> Scales Linearly </LI><br /><LI> A Timer Thread to keep the database config up to date in memory and fetching the config from PHP so if PHP changes connections to the shards so does Java </LI><br /><LI> Uses Java-6 Executor Service </LI><br /><LI> Merges the result set prior to sending it to the calling process </LI><br /><LI> Communicates via JSON </LI><br /><LI> Uses MySQL Connector/J </LI><br /><LI> Supports a high concurrency </LI><br /><LI> Optimized thread usage </LI><br /></UL><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-4709619962141971609?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com4tag:blogger.com,1999:blog-31421954.post-32308762268337906322009-10-09T15:34:00.001-07:002009-10-12T11:23:11.861-07:00Steps I take before upgrading mysqlI am not a fan of upgrading mySQL unless I need to. I am of the mind if it is not broke don't fix it, but when I do upgrade I follow these general steps.
<br />
<br />If I have run into a mySQL bug, I look to see if that bug is fixed by searching the mySQL bug database.
<br />
<br />If I've notice a performance bottleneck, I look to see if the performance bottleneck has been fixed by searching the same database.
<br />
<br />I will NOT upgrade to the latest and greatest version of mySQL (5.4) I stay within my branch (5.0).
<br />
<br />These are my three general motivations that drive my upgrade decisions. Anytime I upgrade I also make a list of things that might affect my environment for the stuff I use.
<br />
<br /><OL>Here are my steps:
<br /><LI> Check the <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-x.html">change log </a>
<br /></LI><LI> Ignore all the NDB changes... I don't use it and that's the majority of fixes. This is also, why I do not use it.
<br /></LI><LI> List the changes that will affect the production environment
<br /></LI><LI> Deploy the version that I picked on a few servers running my original config
<br /></LI><LI> Do data corruption tests (make sure my checksum scripts return the same data)
<br /></LI><LI> Verify that the problem I'm trying to fix is fixed
<br /></LI><LI> Deploy to more boxes
<br /></LI><LI> Let the new server bake for a period of no less than a week
<br /></LI><LI> Deploy everyplace
<br />
<br /></OL>
<br />
<br />So now, I'm upgrading from 5.0.56 to 5.0.86. What I'm trying to fix is mysql memory overhead at high levels of ram.
<br />
<br />For instance, I have a slew of 48GB boxes. I set the bufferpool to 40GB; the OS uses 1 GB of memory (roughly) leaving an overhead of 7GB for the system cache and various spikes of sort buffers. Over a period, I see that mySQL will consume and hold onto 47GB of memory for an unknown reason even with some tight my.cnf settings. (I'm certain they are tight since I know what each buffer does). Therefore, testing some later versions of mySQL we found that these later versions do not grow past the settings defined yet performs the same.
<br />
<br />Next, since I decided that upgrading is a good solution, now it’s time to list all the changes that <i>fixes</i> things.
<br />
<br /><UL>
<br /><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-58.html">5.0.58</a> - <a href="http://bugs.mysql.com/32149">INNODB performance fix</a>
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-60.html">5.0.60</a> - various problems that I should be affected by but havn't noticed so it’s fair to assume that said problems were introduced after my build.
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-62.html">5.0.62</a> - nothing major noticed the sp releases that's why I wait.
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-64.html">5.0.64 </a>- nothing major
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-66.html">5.0.66</a> - security fixes and fixes to fix the bugs introduced from this build.
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-67.html">5.0.67</a> - two INNODB performance fixes and crash bug fixes.
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-68.html">5.0.68</a> - changes show status and fixes an innodb crash bug.
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-70.html">5.0.70</a> - fix another INNODB crash bug and security fixes </a>
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-72.html">5.0.72</a> - more general bug fixes
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-74.html">5.0.74</a> - more stuff I don't care about
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-75.html">5.0.75</a> - stuff given to Enterprise users now in community
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-76.html">5.8.76</a> - more bug fixes that I do not need
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-78.html">5.0.78</a> - more bug fixes I do not care about (run MS Access on windows not mySQL)
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-80.html">5.0.80</a> - problem with error messages for concurrency limits that caused an assert failure
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/e/news5-0-82sp1.html">5.0.82</a> - Fixes to fix fixes for this build.
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-83.html">5.0.83</a> more minor fixes that I don't seem to have a problem from
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-84.html">5.0.84</a>- more bug fixes for INNODB and latches
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-85.html">5.0.85</a> - looks like windows fixes
<br /></LI><LI> <a href="http://dev.mysql.com/doc/refman/5.0/en/news-5-0-85.html">5.0.86</a> - fixes that I'm not having problems with
<br /></UL>
<br />
<br />Therefore, overall, upgrading should give me a boost in performance. My own internal testing sees some tighter memory usage, even though this is not fixed explicitly, the product has matured overall so I can account for the reduction in memory to that.
<br />
<br /><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-3230876226833790632?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com3tag:blogger.com,1999:blog-31421954.post-26245068981940848382009-09-09T11:30:00.001-07:002009-09-09T11:59:17.169-07:00Nagios Event Handlers - Love themWhat is Nagios? <a href="http://www.nagios.org/">Nagios</a> IMHO is the best Open Source monitoring system out there. It supports hosts checks, a level to determine on a host level if a box is considered "up". It supports service check, a level to determine if a particular service such as mySQL is up. It has features to log all events to a flat file or to a DB. It can notify you when a service is in a warning state, error state or unknown state.<br /><br />For the purpose of this article, I am going to talk about handling events such as a clearing up swap.<br /><br />First, let us look at some configuration of Nagios. We are going to define a command, then service acting on that command. Let us assume that the nagios install is in /usr/local/nagios.<br /><br />Therefore, in /usr/local/nagios/ a few configuration files are key:<br />- /usr/local/nagios/etc/objects/commands.cfg - the command file where the checks are defined<br />- /usr/local/nagios/etc/hosts/*/hosts.cfg - the services file where the checks are defined for execution based on other directives in this file.<br /><br /><br />A command:<br /><pre><br /># 'check_local_swap' command definition<br />define command{<br /> command_name check_local_swap<br /> command_line $USER1$/check_swap -w $ARG1$ -c $ARG2$<br />}<br /></pre><br /><br />This says that check_local_swap executes check_swap with a warning threshold of $ARG1 and a critical threshold or $ARG2<br /><br /><br />Next when defining a service for a host<br /><pre><br />define service{<br /> use generic-service; Name of service template to use<br /> host_name dbfacebook34b ; hostname<br /> service_description SYS:Swap ; what shows up in alerts<br /> is_volatile 0 <br /> check_period 24x7 ; threshold when to check (all the time)<br /> max_check_attempts 4 ; threshold to check before marking state<br /> event_handler handle-swap ; handle an event (another command)<br /> normal_check_interval 5 ; in seconds<br /> retry_check_interval 1 ; only try once before reporting the state<br /> contact_groups itops ; contact group to send notifications to<br /> notification_options w,u,c,r ; need to look this up for all defs<br /> notification_interval 600 ; retry sending notifs every 8 mins<br /> notification_period 24x7 ; keep sending them<br /> check_command check_nrpe!check_local_swap!80%!55% ; execute the event handler and warn like hell<br />}<br /></pre><br /><br /><br />Lots of goodies as you can see. Let us look at the event handler<br /><pre><br />define command{<br /> command_name handle-swap<br /> command_line /home/scripts/handle_swap.pl<br />}<br /></pre><br /><br />This means execute this script whenever any event for swap occurs (I decided to make this simple and not put a threshold on this).<br /><br /><br />What does handle_swap.pl do - well it’s a perl script that looks at free memory and if only a few 100K of swap is in use, swapoff -a; swapon -a;<br /><br />In this case, it is a bit safe to do this. Why do this? Why not just turn of swap. I have talked in depth about this subject-but for a minor recap. Linux needs swap else, kswapd will freak out. Swap in DB's is bad so I clean it up automatically since O_DIRECT on my SAN is not an option.<br /><br />Why not just run a cron job? Nagios keeps a log, I like to review what is happening from a central location, and nagios is freaking COOL.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-2624506898194084838?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com4tag:blogger.com,1999:blog-31421954.post-81306306330727620982009-08-21T11:02:00.000-07:002009-08-21T11:31:45.684-07:00Back From Vacation and Man do I feel great.For the first time ever I took a two-week vacation. How can a person who has 100s of database servers, 100s of web servers, and a system that supports 100K tps across 20 TB of data is able to take a vacation? Easy, I have a great team that is very competent in managing the platform by following our cookbook routines and guidelines for new application interaction with the databases.<br /><br />Where did I go? I went to NYC-stayed in Tribeca and only ate Pizza, Hotdogs, White Castle and Hala Food from vendors that are nearly on every street corner. I also lost 6 pounds! After three months of half-ass dieting to prepare for my Pizza diet, who would have known that just going for it was the key to success! How did I do it? Well, first I walked to Little Italy, bought 2-4 slices, ate them, and then walked around the city. That is it. As a Database Dude, I do not do much walking except for when I am too tired to run on the treadmill for 20 mins. I was averaging around 10 miles a day for a 12-hour period walking around NYC taking pictures, enjoying the sites, eating, drinking, and then repeat. I even walked from Canal Street all the way to the Staten Island Ferry just because. Man it was great. <br />After NYC, I went to Puerto Rico for the second time. My family and I had a freaking awesome time. We went to the only US National Park that is a Rain Forest. Went to Vieques to the Bio Luminous Bay, where the single cell organisms light up when kinetic energy is applied to them. Went to Culebra, to Flamenco Beach – which is off the hook: much better than any other beach I have been to. Drove around the entire Main Puerto Rican Island stopping off in Rincon and eating every 3-4 hours.<br />I am fully rested and feel like a 20 year old again. So, expect a lot more posts. I think the order will be:<br /><br />Migrate 20 TB of data into a new Format without downtime.<br />How to make Friend Queries work with a database.<br />Throughput increase from Asynchronous Queries.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-8130630633072762098?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com4tag:blogger.com,1999:blog-31421954.post-22489961217501368412009-06-29T13:29:00.000-07:002009-06-29T13:30:55.725-07:00mysql UC 2009 TalkScribe is a bit buggy with displaying this presentation:<br /><br /><a title="View Scaling a Widget Company on Scribd" href="http://www.scribd.com/doc/16934181/Scaling-a-Widget-Company" style="margin: 12px auto 6px auto; font-family: Helvetica,Arial,Sans-serif; font-style: normal; font-variant: normal; font-weight: normal; font-size: 14px; line-height: normal; font-size-adjust: none; font-stretch: normal; -x-system-font: none; display: block; text-decoration: underline;">Scaling a Widget Company</a> <object codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=9,0,0,0" id="doc_643907886416336" name="doc_643907886416336" classid="clsid:d27cdb6e-ae6d-11cf-96b8-444553540000" align="middle" height="500" width="100%" > <param name="movie" value="http://d.scribd.com/ScribdViewer.swf?document_id=16934181&access_key=key-20juy0gcpq7hvlqze92y&page=1&version=1&viewMode="> <param name="quality" value="high"> <param name="play" value="true"> <param name="loop" value="true"> <param name="scale" value="showall"> <param name="wmode" value="opaque"> <param name="devicefont" value="false"> <param name="bgcolor" value="#ffffff"> <param name="menu" value="true"> <param name="allowFullScreen" value="true"> <param name="allowScriptAccess" value="always"> <param name="salign" value=""> <embed src="http://d.scribd.com/ScribdViewer.swf?document_id=16934181&access_key=key-20juy0gcpq7hvlqze92y&page=1&version=1&viewMode=" quality="high" pluginspage="http://www.macromedia.com/go/getflashplayer" play="true" loop="true" scale="showall" wmode="opaque" devicefont="false" bgcolor="#ffffff" name="doc_643907886416336_object" menu="true" allowfullscreen="true" allowscriptaccess="always" salign="" type="application/x-shockwave-flash" align="middle" height="500" width="100%"></embed> </object><div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-2248996121750136841?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com0tag:blogger.com,1999:blog-31421954.post-81846071629440412052009-06-11T16:06:00.000-07:002009-06-11T16:42:12.439-07:00PHP mysql connect and using APC to recoverSo, when you connect on the fly to a database your subject to a variety of issues, like when the db is not available and when the db does not have a route.<br /><br />One of the main reasons why a dev may want to connect on the fly is because they have too many front ends to hold a persistent connection on the backends. Since mySQL does not use libevent, holding open threads to mySQL is much more costly. Threads == Memory.<br /><br />But, that's here nor there. The main purpose of this post is to talk about how to recover from failed connections that block apache threads.<br /><br />Common Failures:<br />No route to Host<br />Flapping NIC<br />Locked Tables<br />Recovering from a Crash<br />more of the same.<br /><br /><br />My Environment:<br />I have a bunch of webservers (200+) that all have 300 possible threads (60000 possible connections to a single DB) behind a load balancer that uses the LB least connections protocol to distribute load across all 200+ webservers.<br /><br />Since PHP is used in a stateless mode (no guarentee that the same user will hit the same server), I have to have some way of telling all other apache procs for said box that a server is down. I use stateless on the fly connections, so each apache proc will test the connection.<br /><br />Here in lies my most common problem. If a box in the backend dies, all apache threads will block for a predefined time-out.<br /><br />In /etc/php.ini (it could be in /etc/php.d/mysql.ini) I set <br /><br />mysql.connect_timeout = 5 // the default is 60<br /><br /><br />In my common_db class when connecting to a database, the connect routine returns a database handle object<br /><pre><br />if (PEAR::isError($dbh) && $delta_to_connect >= 5){<br /> <br /> #<br /> # mark ip as dead for 15 min<br /> #<br /> <br /> apc_store($ip, array('DEAD'), 900);<br /> return false;<br />}<br /><br />return $dbh;<br /><br /></pre><br /><br /><br />Now PRIOR to calling the database connect code I check to see if the IP is up.<br /><pre><br />$status = apc_fetch($ip);<br />if ($status[0] === 'DEAD'){<br /> return false;<br />}<br /><br />... do connect ...<br /></pre><br /><br />But, there could be a variety of issues that can call false positives, like network flaps, someone blocking the db for some time etc. So, I'll allow one request .1% of the time bypass the status check to try again.<br /><br /><pre><br />if ($status[0] === 'DEAD' && (mt_rand(0, 1000) != 1)){<br /> return false;<br />}<br /></pre><br /><br /><br />But, if your app can't connect to the DB aren't you down anyway?<br /><br />Yes, although I don't have to restart all the httpds.<br />Also each DB has a redundant pair to when returning false, the app code will try the other set of servers.<br /><br />I have a variety of methods to deal with these scenarios but this is the quickest to implement.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-8184607162944041205?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com2tag:blogger.com,1999:blog-31421954.post-47657005557988240022009-05-19T12:26:00.000-07:002009-05-19T12:56:34.369-07:00Multi Direction Sorts and avoiding a file sortThere are two PRIMARY directions to sort data in SQL: Ascending (ASC) and Descending DESC.<br />When these two sort definitions are put together in a single statement a filesort is produced.<br /><br />Why do we want to avoid filesorts?<br /><br />Filesorts are bad. 1st they tickle a thread based buffer called sort_buffer_size. Additionally filesorts reads the data twice, unless max_length_for_sort_data limit is reached and as a result the Filesort runs slower to reduce disk I/O. If you want filesorts to run faster at the expense of the disk increase the default max_length_for_sort_data. You can read the filesort algorithm <a href="http://dev.mysql.com/doc/refman/5.0/en/order-by-optimization.html">here.</a><br /><br />So, here is an example<br /><pre><br /> CREATE TABLE `ABCD` (<br /> `A` int(10) unsigned NOT NULL default '0',<br /> `B` int(10) unsigned NOT NULL default '0',<br /> `C` int(10) unsigned NOT NULL default '0',<br /> `D` int(10) unsigned NOT NULL default '0',<br /> PRIMARY KEY (`a`,`b`,`c`,`d`)<br />) ENGINE=InnoDB DEFAULT CHARSET=latin1<br /></pre><br /><br /><pre><br />mysql> explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC, d ASC\G <br /><br />*************************** 1. row ***************************<br /> id: 1<br /> select_type: SIMPLE<br /> table: ABCD<br /> type: ref<br />possible_keys: PRIMARY<br /> key: PRIMARY<br /> key_len: 8<br /> ref: const,const<br /> rows: 2<br /> <span style="font-weight:bold;">Extra: Using where; Using index; Using filesort</span><br />1 row in set (0.00 sec)<br /></pre><br /><br /><br />Notice the filesort? So how does one get around this filesort?<br /><br />Well<br /><br />Let's define some roles for columns C and D. C is the parent while D is the child.<br /><br /><LI/>We want all the latest parents (C)<br /><LI/>We want all the oldest children (D)<br /><br />We require pagination of all the PARENTS (show 10 parents per page) so Queries like this is PRODUCED<br /><pre><br />SELECT * FROM ABCD WHERE A=? AND B=? ORDER BY C DESC<br />explain SELECT * FROM ABCD WHERE a=1 AND b=1 ORDER BY c DESC LIMIT 10\G <br />*************************** 1. row ***************************<br /> id: 1<br /> select_type: SIMPLE<br /> table: ABCD<br /> type: ref<br />possible_keys: PRIMARY<br /> key: PRIMARY<br /> key_len: 8<br /> ref: const,const<br /> rows: 2<br /> <span style="font-weight:bold;">Extra: Using where; Using index</span><br />1 row in set (0.00 sec)<br /><br /></pre><br /><br /><br />Now<br /><br />FOREACH($C_parent as $i => $c_id) {<br /><br />$C_parent[$i] = SELECT SQL_CALC_FOUND_ROWS * FROM ABCD WHERE A=? AND B=? AND C=$c_id ORDER BY D ASC LIMIT 1;<br /><br /><br />}<br /><br />So, we changed 1 query into 11 queries (10 parents per page) to make the page load happen faster, by getting rid of the filesort. <br /><br />What 11 queries is faster then 1? Yes, for this case it is. The reason is because filesorts are SLOOOOW, they chew up a lot of limited resources and they should be avoided. I've see filesorts take close to 50-60% of the query time.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-4765700555798824002?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com2tag:blogger.com,1999:blog-31421954.post-59514354694357007612009-04-23T12:23:00.001-07:002009-04-23T13:12:43.728-07:00Restoring a backup of a Circular Replication configSo say you have two mySQL servers called A and B. A and B are in a circular replication ring meaning<br /><br />A replicates from B<br />B replicates from A<br /><br />In addition, log-slave-updates is on, so if one server of either A or B goes down you can recover all the data from either side.<br /><br />Therefore, here are some steps.<br />Take a snapshot of a server in the ring, with your favorite backup tool (mine is ibbackup)<br /><br />When either server A or B dies, restore from said backup. Here are the steps. Let us assume server A goes down, and backups are done from server B.<br /><br />restore the backup of server B to server A<br /><br />On server-A turn off log-slave-updates (you will see why on the next line)<br />Add <a href="http://dev.mysql.com/doc/refman/5.0/en/replication-options-slave.html#option_mysqld_replicate-same-server-id">replicate-same-server-id</a> on server A to my.cnf<br /><br />CHANGE MASTER TO to the position and file of the binary log that is reported from your tool on server-A.<br /><br />IF the binary log did not get corrupted on server-B your cool, else flip the binary log to the next binary log from server-A on server-B.<br /><br />Let server-A catch up<br />Stop server-A<br />Add log-slave-updates on server-A<br />remove replicate-same-server-id on server A<br />Start server-A<br /><br />done.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-5951435469435700761?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com4tag:blogger.com,1999:blog-31421954.post-91560534475417529872009-04-08T13:11:00.001-07:002009-04-08T14:13:33.384-07:00Recover from No-Route to Host or Storage Engine Plugin CrashesThere are a finite set of cases where mySQL clients will hang on a connection-for a small period of time (seconds) or indefinitely. Most notably no-route to host causes a timeout to occur which in most clients are on the order of seconds to recovery.<br /><br />In a web environment if a database is connected to on the fly, each connection made should take milliseconds-when the net is healthy. However, when a database server crashes a timeout for each connection takes seconds and there are cases where connections can just hang (recovery of INNODB as an example). For instance have you ever had a ssh session hang and a kill (SIGHUP..) does not work? This happens for mySQL client connections too. <br /><br />I want the front ends to recover gracefully from a no-route to host, or more notably a hanged connection condition. I want to avoid that damn timeout all together on stateless connections. Why? Because if you use up all your worker httpd threads requests Fail-things crash, bosses get mad, everyone wakes up and you have a meeting about it later. So what are ways around this?<br /><br />Use a Load balancer<br /><dd>This is cool and all, but is expensive since two or three are needed for HA purposes. In addition, this increases the complexity of managing servers, and most LBs are optimized for HTTPD traffic, not raw TCP traffic.<br /></dd><br />Use a memcache layer to keep DB state<br /><br /><dd>This is cool as well, but the state server, which collects the state of your servers and updates memcache, is now a single point of failure-and a lot of time is spent to make this resilient. Not to mention all clients across Programming languages (bash scripts, Java, PHP, Python, etc) need to have access to this memcache layer and logically handle the connection the same. This is doable, but there is an easier way.<br /></dd><br />Use IRON DNS:<br /><dd>IRON DNS is a term I use to build an HA Resolver. IF a box fails a health check, a nagios event handler can tell IRON DNS to set the domain names IP address to 127.0.0.1 or a routable interface that produces a Connection Refused (111 instead of 110).<br /></dd><br /><br />For an internal network, all database entries should be able to fit in memory. All resolves should take less then 2ms, and updates to DNS entries can happen in just a few seconds across your entire farm. The only drawback is if the DNS server fails to respond your site is boned. Making DNS Failure proof is rather easy-which is another post entirely.<br /><br />For my solution, I am using DNS. This allows me to recover servers in a shard on different boxes without having to change code. This also allows my environment to recover from blocked I/O events that can spike load on the front-end that make requests slow.<br /><UL><br />Here are some conditions where connections take to long:<br /><LI>No Route to Host<br /></LI><LI>INNODB recovery<br /></LI><LI>Disk Fails (disk fills up)<br /></LI><LI>Switch Fails / Flaps<br /></LI><LI>Plug-in storage engine crashes, yet mySQL is up<br /></LI><LI>DNS Resolve fails<br /></UL><br /><br />If you have a better method, or want to advise me on flaws I should watch out for, drop a comment.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-9156053447541752987?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com2tag:blogger.com,1999:blog-31421954.post-89740828376383236302009-03-31T11:05:00.000-07:002009-03-31T14:26:49.502-07:00What do you think about adding ZLIB to memcache storageMemcache is a fantastic Hash table-very fast and one of the great successes of Brad Fitzpatrick-who in my opinion has done more for the open social movement as an individual then anyone else. I use memcache quite extensively, now I am thinking about adding ZLIB native to compress the value of each key-much like how INNODB does with the Barracuda file format. The theory is with a CPU hit, we can store more data per memcache instance. I've talked to the Northscale guys and they love the idea. What do you think?<br /><br /><br />Update: Well what do you know<br /><br />http://us3.php.net/manual/en/function.memcache-setcompressthreshold.php<br /><br />for PHP for instance compresses the data on the client side.<br /><br />There still might be some value compressing the data on the server-side, but now I'm not as motivated.<br /><br />What might be a good alternative is to compress keys into 8 byte longs in memcached automatically instead of the actual string that can be huge. To give some more detail,<br />8 byte longs is a 64-bit int. A string can easily be converted into a big int by bit manipulation - and the address space is huge so key conflict is effectively removed.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-8974082837638323630?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com9tag:blogger.com,1999:blog-31421954.post-11553696176555403682009-03-23T10:38:00.000-07:002009-03-23T13:22:29.357-07:00Cloud computing without spending moneyCloud computing is the big thing now days, weather you are an app developer using EC2 or the Google APP engine, or a new company trying to build your own cloud product. If you are hosting, or using a platform it costs money. I hate to spend money, especially money that is for my company. If I find an idle server, I use it to 100% utilization (prior to the saturation point).<br /><br />I needed to build a new application that periodically crawls a website to update various lists. Building a crawler is expensive, especially from scratch. First, you have to define the amount of lag that is allowed from the crawl copy and the real copy. Of course the Project Manager does not want any lag, all events must be caught and near real-time without overloading the source of the data-but I am not hating, it is a challenge. Next, what technology to use, what language to write the app in-what considerations are left to be defined? How does one crawl Gigs, Tera, and amount of data in a guarantee period? On top of that, how much additional hardware is this going to cost. In addition, to be a cloud it needs to have an API so app developers can set, and get consistent data in an expected period. This is a lot of freaking requirements.<br /><br />Therefore, to solve this issue, I know that mySQL will store the data, but getting the data is the hard part. This is what is going to cost money, lots of it. I looked around common architectures and found that nothing would do what I wanted to do in a cost effective manner. So, I designed my own using Seti @ Home as the basis for the design. <br /><br />Get to the point already Dathan:<br /><br />I have turned ever user who views my applications into a collector, using idle bandwidth without knowing who is collecting the data. My user base on spare cycles will fetch a feed of my choosing, and sends that data to my servers without any personal information. Instead of using an Amazon or Google service, I have turned my user base into a cloud to service their needs.<br /><br />It is rather awesome-I must say. I am able to service the needs of more than 60 million users at the cost of development time, and NO NEW HARDWARE. The cloud does not have to be a service provider-it can be the end user as long as the end user is not impacted by the requests. BTW the team that I manage is freaking awesome-they built my vision with trial and error and a hand waved spec.<br /><br />Currently the system scales as long as there is enough end users. If I lose all my users then well I am boned, but to support the feeds all I need is 100K nodes at the current rate. With 60 million end nodes, I am cool.<br /><br />Imagine if Google with Adsense used this install base to tell Google if the data has changed for an arbitrary web address. All it needs is a few people to hit the same url, inform Google that the web address in question has a different checksum, then at that point Google’s crawlers can go fetch it, instead of constantly crawling data that doesn't change. Google would be able to reduce overall server cost significantly, if it just knew what data has changed instead of guessing what data has changed.<br /><br />By next years Velocity conferance I hope to have a full disclosure on what technology my team used, how my team get around cross domain issues, and how to compute checksums to validate the data.<br /><br />PS - I designed this, with my team we made it much better and one person implemented it and owns the product from this point on.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-1155369617655540368?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com9tag:blogger.com,1999:blog-31421954.post-9790084817221961722009-03-12T13:58:00.000-07:002009-03-16T10:03:23.168-07:00Walking an INNODB table FastWalking a table means, traversing each row, commonly used in building queues, fixing data, or dumping a table. I've recently ran into a problem-caused by an assumption, where walking a table was taking way to long using the method<br /><br /><pre><br />$pos = 0;<br />do {<br /><br /> $result = SELECT col FROM TABLE LIMIT $pos, 1000;<br /> $pos += 1000;<br />} while ($result);<br /></pre><br />The assumption was since INNODB uses a cluster index, this would traverse the table using the PRIMARY key. This is not the case, <b>its not a problem in INNODB</b> but a bad assumption, that I fell victim to. A table scan to each $pos occurs producing a Big-O of N^2. So, when the query:<br /><br />SELECT col FROM TABLE LIMIT 1000000, 1000 is executed mySQL will scan all the rows up to row position 1001000 and for each subsequent iteration.<br /><br />This is SLOOOOW. IMHO since the table is sorted by the primary key, mySQL should optimize this case - but it does not and will not. So, to walk an INNODB table fast, and keep liner time or a Big-O of N an alternative is<br /><pre><br />$last_id = 0<br />do {<br /><br /> $result = SELECT col FROM TABLE USE INDEX(PRIMARY) WHERE pkey_part > $last_id LIMIT 1000<br /> $last_id = $result[count($result) - 1]->pkey_part<br /><br />}while($result);<br /></pre><br /><br />This dumps a table very fast, almost as fast as doing a count(*) on the PRIMARY KEY.<br /><br />Another method is to <br /><br />SELECT col INTO OUTFILE "/dir/file.ids" FROM TABLE;<br /><br />but the data is local to the database - thus the need for the application to grab data. Another draw back of this method is that the dump produces more disk IO then walking a table off of a key, slowing down access to this table.<br /><br /><br />In conclusion, even if the storage engine keeps the table order consistent like INNODB does, do not assume that LIMIT 100000, 1000 is equivalent to a file seek of position 100000, without telling the Optimizer to use an index.<div class="blogger-post-footer"><img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/31421954-979008481722196172?l=mysqldba.blogspot.com' alt='' /></div>Dathan Vance Pattishallhttp://www.blogger.com/profile/00356367514107959723noreply@blogger.com12