From an interview with Steve Souders published on ☞ O’Reilly Radar:
The first thing I recommend website owners do is get a handle, get an idea of the overall page load time. And then the second thing I say is break that into two parts: the back-end and the front-end. And if you find that like most websites, your back-end time is less than 10 or 20 percent, then you’re correct in focusing on these front-end best practices. If your back-end time is 30 to 50 percent or more, then you should really start on looking at your back-end architecture.
And in case you are running a media company, you should also check the part of the interview focused on the degradation of performance due to non-optimized ad servers.
After writing about MonetDB and InfoBright, the MySQL Performance guys are now taking a look at ☞ Tokyo Tyrant, the network interface of the NOSQL Tokyo Cabinet solution.
Their observations span 3 articles (☞ part 1, ☞ part 2 and ☞ part 3) and cover subjects like data durability (the D in ACID), read and write performance.
And if I mentioned Tokyo Cabinet, I should point you to a presentation given by Ilya Grigorik (of PostRank): ☞ Lean & Mean Tokyo Cabinet Recipes
Then if you have another 57minutes, you can watch the O’Reilly Webcast: Tokyo Cabinet in One Hour:
Protocol buffers (or ☞ protobuf) are a way of encoding structured data in an efficient yet extensible format. The project have been released by Google who is using it internally for RPC protocols and file formats. Facebook has released a while back ☞ Thrift: which serves the same purposes. And people spent time trying to figure which one outperforms the other.
InfoQ has published an article about ☞ BERT a new format now powers GitHub’s backend which is different from the above two.
And there is also ☞ BSON the format introduced by MongoDB, which is a binary-encoded serialization of JSON-like documents.
If I started the day with a thought about NOSQL, then maybe this video will give me even more ideas. Slides from the presentation are embedded below.
This is just a quick thought I had this morning: KV (key-value) storage solutions are excelling at item-based read/write throughput, but suck at everything that involves range queries. The column-based storage solutions might probably not have the same read/write throughput, but have a better chance at offering range queries.
I’ll probably have to check this by looking at some of the solutions included in the NOSQL reference.
Meanwhile, what do you think? Are there any other upfront ‘advantages’?
A standard Google server appears to have about 16G RAM and 2T of disk. If we assume Google has 500k servers (which seems like a low-end estimate given they used 25.5k machine years of computation in Sept 2009 just on MapReduce jobs), that means they can hold roughly 8 petabytes of data in memory and, after x3 replication, roughly 333 petabytes on disk. For comparison, a large web crawl with history, the Internet Archive, is about 2 petabytes and “the entire [written] works of humankind, from the beginning of recorded history, in all languages” has been estimated at 50 petabytes, so it looks like Google easily can hold an entire copy of the web in memory, all the world’s written information on disk, and still have plenty of room for logs and other data sets.
— Greg Linden in ☞ Advice from Google on large distributed systems
The above deduction is based on a presentation given by Google Fellow, Jeff Dean: ☞ Designs, Lessons and Advice for Building Large Distributed Systems (pdf) (embedded below for reading)
An article from the Digg dev team about how they have introduced Cassandra to their environment and how they solved one of their problems.
It looks like lately I’ve been gather quite a few links about NoSQL storage solutions, relational databases optimizations and future, so I thought I should just share them with everyone.
Sounds a bit like CouchDB.
Riak combines a decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.
— Riak
The MySQL Performance guys are taking a look at Redis ☞, a key-value database which supports data structures.
I think Redis can be great piece of architecture for number of applications. You can use it as the database or as cache (it supports data expiration too)
A blog post about lessons learnt while building a data analysis platform using Hadoop.
In recent months I’ve led my team at Visible Technologies through creating a new analytics platform based on Hadoop. This article lists a few lessons from our ordeal, which hopefully should help if you venture into similar territory. Here are my observations:
- Big data is BIG
- This is systems software, not an application (for now)
- Learn the source, engage the community, contribute feedback
- Scalable doesn’t imply cheap or easy. Just cheaper and easier.
- It’s so much easier with smart, experienced people.
— Bradford
Point 4 above, scalable doesn’t imply cheap or easy. Just cheaper and easier may always look like a surprise to some.
This entry demonstrates why normalized data is simpler for write/update operations, while the NoSQL approaches are usually simpler for reads.
Joining doesn’t scale to millions of concurrent users, and it’s rumored that some companies ban joins completely. NoSQL databases, which include key/value stores and document databases, drop the notion of normalized data.
Another post about normalization in relational databases and the implications of not using it in the NoSQL.
There are a number of things to keep in mind once you choose to denormalize your data including
- Denormalization means data redundancy which translates to significantly increased storage costs. […]
- Fixing data inconsistency is now the job of the application. […]
A discussion about how Reddit is handling database joins ☞.
For a long time all of these groups of users introduced clusters for two main reasons: ensuring availability and raising performance. Spreading processing across a cluster of smaller commodity machines was a good solution to both requirements and explains the enormous popularity of MySQL Replication as well as many less-than-successful attempts to implement multi-master clustering. However the state of the art has evolved in a big way in the last couple of years.
The article talks about a different than data-partitioning approach for parallel databases.
Kickfire, along with other companies that perform database operations directly in the hardware using FPGA technology, like Netezza and XtremeData, need to take a different approach to parallelism.
For those of you that are interested/passionate and that are closer to implementations, Daniel Abadi’s article talks about 3 different approaches for building hybrid column-store/row-store database systems: PAX, Fractured mirrors and Fine-grained hybrids and how these are used by commercial products like Oracle 11g, Vertica, and VectorWise:
Oracle, Vertica, and VectorWise have announced hybrid systems using one of these schemes (I have no inside knowledge about any of these implementations, and only know what’s been published publicly in all cases). It appears that Oracle (see Kevin Closson’s helpful responses to my questions in a comment thread to one of his blog posts) and VectorWise use the first approach, PAX. Vertica uses fine-grained hybrids (approach 3), though they probably could use their row-oriented storage scheme to implement fractured mirrors (approach 2) as well, if they so desire. Given that two out of the three authors of the fractured mirrors paper have been reunited at Microsoft, I would not be surprised if Microsoft were to eventually implement a fractured mirrors hybrid scheme.
If these topics are close to your heart, I’d encourage you to also check the other posts on NoSQL, storage and datastorage
Web crawling on-demand:
80legs is a web crawling and processing service that puts the power of 50,000 computers to use at a fraction of the cost.
The costs of crawling are expressed in terms of pages (basically URIs) and CPU time. But there’s no mention of the correlation between the number of pages and CPU-time.
A paper from Google analyzing possible CDN optimizations and an internal system they’ve built to help their administrators:
First, WhyHigh measures client latencies across all nodes in the CDN and correlates measurements to identify the prefixes affected by inflated latencies. Second, since clients in several thousand prefixes have poor latencies, WhyHigh prioritizes problems based on the impact that solving them would have, e.g., by identifying either an AS path common to several inflated prefixes or a CDN node where path inflation is widespread. Finally, WhyHigh diagnoses the causes for inflated latencies using active measurements such as traceroutes and pings, in combination with datasets such as BGP paths and flow records. Typical causes discovered include lack of peering, routing misconfigurations, and side-effects of traffic engineering. We have used WhyHigh to diagnose several instances of inflated latencies, and our efforts over the course of a year have significantly helped improve the performance offered to clients by Google’s CDN.