Friday, March 10, 2006
Serving Images From A Database: Comments
"You never want to feed from a database what can be fed as static content."
Never say never! I guess TEXT fields should be the next to go. But why stop there? Why not have every read access to the database supply its information from a file instead? I've benchmarked my application with images in BLOBs and as separate files. The difference was single-digit percentages. I'm willing to pay that for the ease of backup and the other niceties of using a single storage mechanism. That's not to say that single-digit performance gains are never worth pursuing, or that there aren't other reasons for storing images in files instead of in BLOBs. I'm not even arguing against most images being in files. But a dogmatic ban on images in the database does not impress me at all!

QUOTE: I've benchmarked my application with images in BLOBs and as separate files. The difference was single-digit percentages. I'm willing to pay that for the ease of backup and other niceties of using a single storage mechanism.

One interesting question in benchmarking is the performance difference, and you say you have seen only a small difference between the two variants. The other interesting question in benchmarking is "What is the benchmark maxing out on?". A benchmark usually means ramping up the load on a system and measuring the number of results as well as the latency seen in generating them. Usually you get a typical saturation curve, such as those shown in my Linuxtag talk. When the system is saturated, you have not only established a performance plateau, you can also tell why the plateau is there: which resource is limiting performance in this particular case. Knowing the reason is important, because in some cases the resource that is maxed out is not the one you are interested in for performance metering.
For example, if in your case you see only a few percent performance difference between the two setups, it may be because both times the system maxes out on something unrelated, such as the performance limit of the load generator, the network interface, the external internet connection, or something else that is not really part of the "network buffer - webserver - database server - disk" pipeline, which is what you really want to meter here.
"The difference was single-digit percentages."
I'm willing to bet you didn't do this while load-testing. For a single request you are quite correct: the performance difference is negligible. But try serving 100 users each grabbing images from a database, and see if you're still seeing single-digit performance differences. I'm willing to bet that performance degrades almost exponentially with the number of concurrent requests. In the end, the filesystem is always the best container for information you don't plan on changing a lot. Leave the database to hold DATA.
"Moral: You never want to feed from a database what can be fed as static content."
Yep, so you have a background script which runs every 2 minutes and pulls new images out of the database and writes them to disk.
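That background script could be sketched in a few lines of Python. Table and column names here are invented for illustration, and SQLite stands in for MySQL so the sketch is self-contained:

```python
import os
import sqlite3

def sync_images(conn, dest_dir):
    """Pull images out of the database and write any that are missing
    to disk as static files; run this from cron every couple of minutes."""
    os.makedirs(dest_dir, exist_ok=True)
    for name, blob in conn.execute("SELECT name, data FROM images"):
        path = os.path.join(dest_dir, name)
        if not os.path.exists(path):   # only materialize new images
            with open(path, "wb") as f:
                f.write(blob)

# Demo with an in-memory database standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE images (name TEXT, data BLOB)")
conn.execute("INSERT INTO images VALUES (?, ?)", ("a.jpg", b"\xff\xd8demo"))
sync_images(conn, "/tmp/static-images")
```

After the sync, the web server serves the files from disk like any other static content.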
so how do you recommend serving up 36,000,000 images, taking up 30G of space, across more than ten web servers, when NFS doesn't cut it? the sheer number of files is enough to kill any OS.
harddisks are cheap nowadays.
even taking into account some overhead for syncing the new files to all servers, it is still lower than the DB overhead for each request. you could even use some kind of smart caching:

1. write to the DB.
2. on request, check if there is a local file. (if you are a lighttpd user, check out mod_cml and power-magnet.)
3. on a local cache miss, request the file from the DB, write it to disk, and serve the file. on the next round trip you serve the static file.
4. in case you are an apache user, mod_rewrite's RewriteCond ! -f should help you.

darix
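The mod_rewrite fallback in the last step can be sketched in Apache configuration like this. The `/cache-fill.php` name is hypothetical; it stands for whatever script fetches the image from the database, writes it to disk, and serves it:

```apache
RewriteEngine On
# If the requested image does not yet exist as a file on disk...
RewriteCond %{REQUEST_FILENAME} !-f
# ...hand the request to a script that fills the cache from the database.
RewriteRule ^/?images/(.*)$ /cache-fill.php?path=$1 [L,QSA]
```

Subsequent requests find the file on disk, the RewriteCond no longer matches, and Apache serves it statically.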
darix has hit it right: using the database not as a delivery mechanism, but as a distribution mechanism, is quite simple and elegant.
In another life I was working with a large freemail provider over here in Germany. This outfit had quite a few image servers, and in their case rsync was being used as a distribution mechanism for static files, and a specially configured webserver as a delivery mechanism for them. Using dynamically generated content instead has a very noticeable performance impact, because of the interprocess communication overhead outlined above, and because the memory usage per slot is much higher in the dynamic case, limiting the degree of available parallelism. Databases have their place, but (compared to filesystems) handling large binary objects is not one of their strengths; and that is even before you start thinking about handling objects larger than max_allowed_packet inside your database.
I appreciate your ideas, but you haven't read what I've written: "the sheer number of files is enough to kill any os."
I'm serious about that. We're talking 36 million images: that's 36 million files (no thumbnails, they're all reduced size) plus the directories to store them in. We have caching tuned very well, and it still kills the OS. Physically we CANNOT store these images on disk as individual files. That's what we've been doing, and the performance is horrible. In a MyISAM database, each table is one file, and thus MUCH easier to handle. In an InnoDB database, all InnoDB tables are encompassed in one file (or you can split them out). So performance will be MUCH improved. We have a unique situation, to be sure; photos are CONSTANTLY being looked at. These are personal-ad images and so are being viewed at a rate that is much higher than, say, a photo site's. I agree that in most situations your solution works. I can tell you from having done it that way that we're reaching the limit on that. So as the first poster said, "never say never".
What filesystem are you doing this on? And did you create hashed directories to store your files in? (so that you would end up with stuff like "/images/s/st/sto/storage_container.jpg", which will vastly improve performance when trying to read files in a directory)
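The hashed-directory layout described above can be sketched like this (the three prefix levels and the base path are just the example from the comment):

```python
import os

def hashed_path(base, filename):
    """Build a prefix-hashed path, e.g.
    storage_container.jpg -> base/s/st/sto/storage_container.jpg,
    so that no single directory holds millions of entries."""
    name = filename.lower()
    parts = [name[:i] for i in (1, 2, 3)]  # "s", "st", "sto"
    return os.path.join(base, *parts, filename)

print(hashed_path("/images", "storage_container.jpg"))
# /images/s/st/sto/storage_container.jpg
```

Directory lookups then only ever scan a small fraction of the total file population.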
The problem really is not your operating system, but the efficiency with which the file system implements "finding a file in the file system" on such high numbers. Some file systems (e.g., ReiserFS, though I wouldn't recommend using that) have been specifically optimized for "having millions of files in one directory"; if your needs are that specific, you really want to look into such things.
We have indeed looked at those kinds of things, and they're not appropriate for our environment. Hence the reason that MySQL is appropriate, though I realize and have said that it's only appropriate in a very narrow set of circumstances, such as our own.
I like how you say it's enough to kill any OS, but it seems like you just say that without backing it up (i.e., what OSes have you even _tried_ this on?).
Ever heard of ZFS? ZFS wouldn't break a sweat under that kind of setup I'm more than willing to bet. So please, don't go spreading FUD, do your homework next time and use ZFS if you've seriously got that many photos.
I'm curious. What kind of set ups do the big guys use to stream video and photos? I'm setting up a travel website using php and of course I'm encountering the same problems many here have alluded to. I know what the low end is (because I live there and I don't like it!) but what if I wanted to scale up? I assume you-tube and folks with cash have considerably better options.
In essence, what should I be shooting for? A "Flash" (à la the now-defunct Macromedia) server operation? Anything in MySQL land that is competitive? Please don't say Microsoft anything, since I will break out in hives at the sound of Mr. Softy!! Thanks!
This is what the X-Sendfile header supported by lighttpd does. Your php can decide what content to view, then pass back an instruction to lighttpd to serve that content direct from disk using sendfile.
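A minimal sketch of that pattern as a Python WSGI app. The resolved path is made up for illustration, and lighttpd must be configured to honor the header for the backend (e.g. allow-x-send-file in the FastCGI settings):

```python
def app(environ, start_response):
    # Application-level access checks and content selection happen here.
    image = "/var/www/images/photo_123.jpg"  # hypothetical resolved path

    # Instead of streaming the bytes ourselves, hand the file back to the
    # web server: on seeing X-Sendfile it serves the file with sendfile().
    start_response("200 OK", [
        ("Content-Type", "image/jpeg"),
        ("X-Sendfile", image),
    ])
    return [b""]  # the body is supplied by the web server, not the app
```

The dynamic process stays busy only for the decision, not for pushing image bytes through the FastCGI pipe.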
It would be great if lighty had special treatment for relative URL redirects. Let's say the dynamic request handler returns a redirect to a relative, thus "local", URL pointing at static content. Instead of sending that redirect back to the browser, lighty could serve the redirect target itself, just as if the browser had sent a new request for that URL in response to the redirect. By "redirect" I mean the 3xx family of HTTP response codes.
can you describe your setup a bit better? oO mod_Cml should handle the request locally without a round trip to the browser oO
darix
I'm not familiar with mod_cml but I'm afraid a cache could not help in my situation.
I want the FastCGI script to see _all_ requests and decide what static content should be served, since there is no static mapping between any information sent by the browser and the static content. Let's consider the current (not so good) scenario:

* suppose some URL is handled by some script that gets the request from the browser
* this script decides which static content matches the request, based on some criteria, and sends a redirect to the browser
* given such a response, the browser sends another request for that static resource
* the HTTP server handles the request for the static content

In a better scenario there would be no phase where the browser receives a redirect and sends another request. That could be handled by the HTTP server internally.
Whoops...
I thought it was about mod_cml of lighttpd, and http://www.lighttpd.net/documentation/cml.html says the following:

QUOTE: CML (Cache Meta Language) wants to solve several problems:
* dynamic content needs caching to perform
* checking whether the content is dirty inside the application is usually more expensive than sending out the cached data
* a dynamic page is usually fragmented, and the fragments have different lifetimes
* the different fragments can be cached independently
mod_cml is a way to control the caching, but it doesn't do the caching on its own.
see the complete documentation of it. :)
Alternatively, it could be something similar to Apache's mod_rewrite map which is handled by external programm:
RewriteMap mapname prg:/path/to/request-mapper
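The external mapper named in a `prg:` RewriteMap is simply a long-running program that reads one lookup key per line on stdin and answers with one value per line on stdout. A minimal sketch in Python; the mapping rule itself (key prefixes into a hashed directory tree) is made up for illustration:

```python
#!/usr/bin/env python3
import sys

def map_request(key):
    """Hypothetical rule: map a request key into a hashed directory tree."""
    return "/images/%s/%s/%s.jpg" % (key[:1], key[:2], key)

if __name__ == "__main__":
    # Apache feeds one key per line and expects exactly one reply per line.
    for line in sys.stdin:
        sys.stdout.write(map_request(line.strip()) + "\n")
        sys.stdout.flush()  # replies must be unbuffered, or Apache blocks
```

Because the program stays resident, the per-request cost is one pipe round trip rather than a process fork.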
I am confused about why you are comparing:
1) using Apache to serve images stored in the database, and 2) using lighttpd to serve images stored as files. It seems to me that your message would be more effective if you would stick to one server and use that server on both sides of the comparison (i.e., use the server as the control in the experiment). As it stands, you are making the second part of the comparison appear shorter and simpler than it probably would be if it also used Apache.
If you're using Rails, this seems like a pretty solid, and simple, solution:
http://scottraymond.net/articles/2005/07/05/caching-images-in-rails Rails will automatically create a cached version of the image after it's first requested - just as it would a dynamic page - if you set up your controllers correctly. Pretty cool.
I can't believe this is a discussion! It is just plain (computer) common sense. The file system IS a database. It may not be a relational database, but it is one nonetheless. 2 million pictures, 26 million pictures, 30GB, 100GB, 1TB: it does not matter. MySQL has to read and write the same amount of information to the disk. MySQL is simply overhead; at the very least it has to perform the same amount of I/O. If you are not getting better performance going directly to the file system, then you are NOT using the file system correctly.
Someone snottily asks: why not write text out to the file system? Because text can be queried and BLOBs can't. Is that not obvious? That is why relational databases are used: to retrieve information.

And why all this talk about NFS? Use a cluster! They are cheap and easy to set up. You have to put some thought into it. Take an MD5 hash of the filename or of the file's data. MOD that number by the number of computers in your cluster and write the file to that computer. For example, you get an MD5 of 123456789. You MOD it by 10, because you have ten computers in the cluster, getting 9. Store the file on computer number 9, and store it in a directory path like /1/2/3/4/5/6/7/8/9/My Picture.jpg. Then put all the metadata in MySQL. Someone wants the picture: http://img.home.com/1/2/3/4/5/6/7/8/9/My Picture.jpg. The web server goes directly to its local harddrive. Well, it should really be a LUN over fibre channel in a SAN cluster with redundant failover and WAN replication. Or just use rsync. There is so much stuff out there for files; I will leave researching that as an exercise for the reader.

BTW, you can MOD it again and spread the files out over several local harddrives. As harddrives fill, you can easily move the data and increase the number of drives. Oh, and fibre channel really is spelled 'fibre' and not fiber.
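The hash-and-MOD scheme above can be sketched like this (cluster size, directory depth, and names are illustrative):

```python
import hashlib
import os

def shard_for(filename, num_servers=10):
    """Pick a cluster node: MD5 of the file name, MOD the server count."""
    digest = int(hashlib.md5(filename.encode("utf-8")).hexdigest(), 16)
    return digest % num_servers

def hashed_dir(filename, levels=9):
    """Spread files over a directory tree keyed on the hash's leading digits."""
    digits = str(int(hashlib.md5(filename.encode("utf-8")).hexdigest(), 16))[:levels]
    return os.path.join("/", *digits, filename)

name = "My Picture.jpg"
print("server:", shard_for(name), "path:", hashed_dir(name))
```

Because MD5 distributes keys uniformly, each of the ten machines ends up with roughly a tenth of the files, and the path can be recomputed from the name alone.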
Yes, the issue is that while the filesystem *is* a database, it's not the *same* database. Which means a file may be changed on the filesystem and the metadata may not be updated. There's no way to enforce a transaction between a filesystem and a database.
The most common case is if the entry is deleted in the database but the file remains, or the file is deleted but the entry remains. There is simply no way to ensure data integrity if you store files on a filesystem. If data integrity is not important, that's OK, but make sure it's a conscious decision.
Newer Unix systems, including Linux, in fact have APIs that allow applications such as databases to monitor a directory, or even a whole directory hierarchy, for file changes.
In Linux, the standard way to do this is fam.

CODE: kris@linux:~> rpm -qi fam
...
Description : Fam is a file alteration monitoring service. With it, you can receive signals when files are created or changed.

The Examples section of the FAM manual has an example titled "Example: A client monitors a directory containing two files. Later, another program creates a third file.", which is the situation that most closely matches your scenario.
Have you considered using something like Amazon S3 for image storage? At $0.15/GB-month, 30G would be a whopping $4.50 a month plus transfer costs, which is pretty affordable.
Disclaimer: All examples, code snippets, advice, trivia, etc. are supplied in the hope that they are useful, but no warranty of any kind is given. Use of trademarks is not intended as a challenge to them.

Impressum: Copyright 2005-2012 by the respective authors of the posts. The infrastructure of MySQL-dump is operated by:
Azundris, Immanuelkirchstraße 18, 10405 Berlin, Germany
In Serving Images from a Database, Sheeri responded to me: I appreciate your ideas, but you haven't read what I've written: "the sheer number of files is enough to kill any os." I'm serious about that. we're talking 36 million images -- that's 36 mill
Tracked: Mar 14, 12:32
A monstrously dangerous security hole at StudiVZ is currently being chased through the German blogosphere as this week's scandal. What happened? StudiVZ runs one webserver for dynamic applications and a second webserver for static data, for exam
Tracked: Nov 22, 08:48