If I started the day with a thought about NOSQL, then maybe this video will give me even more ideas. Slides from the presentation are embedded below.
This is just a quick thought I had this morning: KV (key-value) storage solutions excel at item-based read/write throughput, but suck at everything that involves range queries. Column-based storage solutions probably don’t offer the same read/write throughput, but have a better chance at offering range queries.
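A minimal sketch of why that trade-off shows up, assuming the difference comes down to whether keys are kept in sorted order. Both classes and all their names are hypothetical and do not model any particular product:

```python
import bisect


class HashStore:
    """Plain key-value store: fast point reads/writes, no key ordering."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def range(self, lo, hi):
        # No ordering to exploit, so every key has to be inspected.
        return sorted((k, v) for k, v in self._data.items() if lo <= k < hi)


class SortedStore:
    """Keeps keys sorted, so a range query is two binary searches and a slice."""

    def __init__(self):
        self._keys = []
        self._data = {}

    def put(self, key, value):
        if key not in self._data:
            # Keeping the order costs extra work on every new key.
            bisect.insort(self._keys, key)
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def range(self, lo, hi):
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_left(self._keys, hi)
        return [(k, self._data[k]) for k in self._keys[start:end]]


hash_store, sorted_store = HashStore(), SortedStore()
for key in ["user:3", "user:1", "order:7", "user:2"]:
    hash_store.put(key, key.upper())
    sorted_store.put(key, key.upper())

# ";" sorts right after ":", so this range covers every "user:" key.
users_from_hash = hash_store.range("user:", "user;")
users_from_sorted = sorted_store.range("user:", "user;")
```

Both stores return the same answer; the difference is that the hash-based one scans all keys to get it, while the sorted one pays a little on each insert instead.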
I’ll probably have to check this by looking at some of the solutions included in the NOSQL reference.
Meanwhile, what do you think? Are there any other upfront ‘advantages’?
You can request a beta invite to try out MongoDB-as-a-Service.
An article from the Digg dev team about how they have introduced Cassandra to their environment and how they solved one of their problems.
As we all know, traditional relational databases implement the ACID principles: atomicity, consistency, isolation, durability. Then came Brewer’s CAP (consistency, availability and partition tolerance) conjecture, which paved the way for BASE systems (basically available, soft state, eventually consistent).
Here are a couple of great resources about BASE:
Thanks Debasish Ghosh ☞ for the links. You should also check the NoSQL, Relational and Storage in General entry.
It looks like lately I’ve been gathering quite a few links about NoSQL storage solutions, relational database optimizations and their future, so I thought I should just share them with everyone.
Sounds a bit like CouchDB.
Riak combines a decentralized key-value store, a flexible map/reduce engine, and a friendly HTTP/JSON query interface to provide a database ideally suited for Web applications.
— Riak
The MySQL Performance guys are taking a look at Redis ☞, a key-value database which supports data structures.
I think Redis can be great piece of architecture for number of applications. You can use it as the database or as cache (it supports data expiration too)
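To make the "cache with data expiration" idea concrete, here is a small in-process sketch. It is not Redis and does not use its API; the `ExpiringCache` class and its methods are hypothetical, and expired entries are only dropped lazily when they are read:

```python
import time


class ExpiringCache:
    """Toy cache where each entry may carry a time-to-live in seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the behaviour can be tested
        self._items = {}      # key -> (value, deadline or None)

    def set(self, key, value, ttl=None):
        deadline = None if ttl is None else self._clock() + ttl
        self._items[key] = (value, deadline)

    def get(self, key, default=None):
        entry = self._items.get(key)
        if entry is None:
            return default
        value, deadline = entry
        if deadline is not None and self._clock() >= deadline:
            del self._items[key]  # lazy eviction on read
            return default
        return value
```

An entry stored with `ttl=None` behaves like a plain database record, while one stored with a TTL disappears after the deadline, which is what lets the same store double as a cache.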
A blog post about lessons learnt while building a data analysis platform using Hadoop.
In recent months I’ve led my team at Visible Technologies through creating a new analytics platform based on Hadoop. This article lists a few lessons from our ordeal, which hopefully should help if you venture into similar territory. Here are my observations:
- Big data is BIG
- This is systems software, not an application (for now)
- Learn the source, engage the community, contribute feedback
- Scalable doesn’t imply cheap or easy. Just cheaper and easier.
- It’s so much easier with smart, experienced people.
— Bradford
Point 4 above, “scalable doesn’t imply cheap or easy. Just cheaper and easier”, may still come as a surprise to some.
This entry demonstrates why normalized data is simpler for write/update operations, while the NoSQL approaches are usually simpler for reads.
Joining doesn’t scale to millions of concurrent users, and it’s rumored that some companies ban joins completely. NoSQL databases, which include key/value stores and document databases, drop the notion of normalized data.
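The read/write asymmetry can be sketched with plain dictionaries. This is only an illustration with made-up users and comments, not code from the linked entry:

```python
# Normalized: each fact is stored once, reads have to join.
users = {1: {"name": "alice"}, 2: {"name": "bob"}}
comments = [
    {"id": 10, "user_id": 1, "text": "first"},
    {"id": 11, "user_id": 2, "text": "second"},
    {"id": 12, "user_id": 1, "text": "third"},
]


def read_normalized(comment_id):
    """Look up the comment, then join against users for the author name."""
    comment = next(c for c in comments if c["id"] == comment_id)
    return {"text": comment["text"], "author": users[comment["user_id"]]["name"]}


def rename_normalized(user_id, new_name):
    """A rename touches exactly one record."""
    users[user_id]["name"] = new_name
    return 1


# Denormalized: the author name is copied into every comment document.
docs = {
    c["id"]: {
        "text": c["text"],
        "user_id": c["user_id"],
        "author": users[c["user_id"]]["name"],
    }
    for c in comments
}


def read_denormalized(comment_id):
    """A single lookup, no join."""
    doc = docs[comment_id]
    return {"text": doc["text"], "author": doc["author"]}


def rename_denormalized(user_id, new_name):
    """A rename has to find and fix every copy of the name."""
    touched = 0
    for doc in docs.values():
        if doc["user_id"] == user_id:
            doc["author"] = new_name
            touched += 1
    return touched
```

Reading a comment is one lookup in the denormalized form, but renaming a user means rewriting every document that carries a copy of the name.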
Another post about normalization in relational databases and the implications of not using it in NoSQL systems.
There are a number of things to keep in mind once you choose to denormalize your data including
- Denormalization means data redundancy which translates to significantly increased storage costs. […]
- Fixing data inconsistency is now the job of the application. […]
A discussion about how Reddit is handling database joins ☞.
For a long time all of these groups of users introduced clusters for two main reasons: ensuring availability and raising performance. Spreading processing across a cluster of smaller commodity machines was a good solution to both requirements and explains the enormous popularity of MySQL Replication as well as many less-than-successful attempts to implement multi-master clustering. However the state of the art has evolved in a big way in the last couple of years.
The article talks about an approach to parallel databases other than data partitioning.
Kickfire, along with other companies that perform database operations directly in the hardware using FPGA technology, like Netezza and XtremeData, need to take a different approach to parallelism.
For those of you that are interested/passionate and that are closer to implementations, Daniel Abadi’s article talks about 3 different approaches for building hybrid column-store/row-store database systems: PAX, Fractured mirrors and Fine-grained hybrids and how these are used by commercial products like Oracle 11g, Vertica, and VectorWise:
Oracle, Vertica, and VectorWise have announced hybrid systems using one of these schemes (I have no inside knowledge about any of these implementations, and only know what’s been published publicly in all cases). It appears that Oracle (see Kevin Closson’s helpful responses to my questions in a comment thread to one of his blog posts) and VectorWise use the first approach, PAX. Vertica uses fine-grained hybrids (approach 3), though they probably could use their row-oriented storage scheme to implement fractured mirrors (approach 2) as well, if they so desire. Given that two out of the three authors of the fractured mirrors paper have been reunited at Microsoft, I would not be surprised if Microsoft were to eventually implement a fractured mirrors hybrid scheme.
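As a rough mental model of the three schemes, the toy functions below arrange the same four rows in different ways. This is a simplified sketch based on how the schemes are generally described, not on any vendor’s implementation; the sample table, the function names and the page size are all hypothetical:

```python
rows = [(1, "a", 10.0), (2, "b", 20.0), (3, "c", 30.0), (4, "d", 40.0)]


def row_layout(rows):
    """Row store: all values of one row sit next to each other."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column store: all values of one column sit next to each other."""
    return [value for column in zip(*rows) for value in column]


def pax_layout(rows, rows_per_page):
    """PAX: rows are split into pages, and inside each page the values
    are grouped column by column."""
    pages = []
    for start in range(0, len(rows), rows_per_page):
        page_rows = rows[start:start + rows_per_page]
        pages.append([list(column) for column in zip(*page_rows)])
    return pages


def fractured_mirrors(rows):
    """Fractured mirrors: keep one row-oriented and one column-oriented copy."""
    return {"row_copy": row_layout(rows), "column_copy": column_layout(rows)}


def fine_grained_hybrid(rows, column_groups):
    """Fine-grained hybrid: chosen groups of columns are stored together,
    each group laid out row by row."""
    return [[tuple(row[i] for i in group) for row in rows] for group in column_groups]


pax_pages = pax_layout(rows, rows_per_page=2)
hybrid = fine_grained_hybrid(rows, column_groups=[(0, 1), (2,)])
```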
If these topics are close to your heart, I’d encourage you to also check the other posts on NoSQL, storage and datastorage.
Update: MongoDB 1.0.0 is production ready and available for download ☞.
MongoDB is getting closer to the 1.0 release. From mongodb:
MongoDB 0.9.10 has been released. This release fixes a few minor bugs in 0.9.9 in preparation for 1.0. Please give it a try and let us know if there are any issues.
Downloads: http://www.mongodb.org/display/DOCS/Downloads
Jira change log: http://jira.mongodb.org/browse/SERVER/fixforversion/10035
Git change log: http://mongo-db.appspot.com/changelog/mongo/0.9.10
There is a lot of innovation happening now in the alternative data storage space. People working on these projects have started a NOSQL community already and those not being part of it yet are trying to come up with schema-less approaches on top of relational databases.
There are a few things I am concerned about though. It looks like each of these solutions is inventing its own API and using its own protocols (be it memcached(-like), protobuffers, thrift, absolutely custom, etc.). I am not sure what the adoption status of these solutions is right now, but I believe that over time these inconsistencies will become extremely expensive. It is probably still early, but I really wish the NOSQL guys would start talking sooner rather than later about common APIs and protocols. (n.b. I am aware that it is almost impossible to expose the whole functionality of these systems through a common API, but I’m pretty sure it will be possible to find the common points.)
I also think that anyone looking into this field will have quite a hard time figuring out what the best option is. I know that the NOSQL people are doing their best to add documentation and provide valuable help on their user groups, but there seems to be an almost complete lack of information on recommended usage scenarios. There might also be a misconception about what commodity hardware means to different people.
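To illustrate what the "common points" of such an API might look like, here is a purely hypothetical sketch: a three-method interface plus an in-memory stand-in. None of these names come from an existing project or standard, and a real common API would need far more (batching, errors, consistency options):

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """Hypothetical lowest common denominator of a key-value API."""

    def get(self, key: str) -> Optional[bytes]: ...

    def put(self, key: str, value: bytes) -> None: ...

    def delete(self, key: str) -> None: ...


class InMemoryStore:
    """Stand-in backend; a real adapter would wrap a client library instead."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


def copy_key(src: KeyValueStore, dst: KeyValueStore, key: str) -> bool:
    """Code written against the shared interface works with any backend."""
    value = src.get(key)
    if value is None:
        return False
    dst.put(key, value)
    return True
```

The point of the sketch is `copy_key`: application code that only depends on the shared interface would not need to change when the backing store does.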
I usually don’t trust (micro or not) benchmarks, but I have to agree that VPork is an interesting and possibly very useful initiative:
With the wide range of distributed, non relational databases out there it is hard to know which one to choose. One part of the puzzle is of course performance. Personally I’m interested in low response times.
Here is my short TODO list on how to make things better:
Is there anything else you’d add to this list?
It looks like performance degrades as the size of the values and the number of keys grow:
If my data size is 2K then I get awesome performance.
It starts getting worse at about 5M Keys. […] So looks like as the overall database size increases beyond to what can fit in the cache the performance starts degrading.
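The effect described in the quote, performance dropping once the data no longer fits in the cache, can be reproduced with a small simulation. This is a generic sketch and not a model of the system being benchmarked: it assumes an LRU cache and uniformly random reads, and `hit_ratio` with its parameters is a hypothetical helper:

```python
import random
from collections import OrderedDict


def hit_ratio(n_keys, cache_capacity, n_reads=20_000, seed=0):
    """Fraction of uniformly random reads served from an LRU cache."""
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(n_reads):
        key = rng.randrange(n_keys)
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_capacity:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / n_reads


fits_in_cache = hit_ratio(n_keys=500, cache_capacity=1_000)
exceeds_cache = hit_ratio(n_keys=10_000, cache_capacity=1_000)
```

While the key set fits in the cache almost every read is a hit; once the key set is ten times larger than the cache, most reads fall through to the slower storage underneath.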
SQL NULL is peculiar in a number of ways, and the general excuse for this is that there is a need to represent “missing information” — which may be true. But there are lots of ways to represent missing information, as I pointed out in a previous post, and SQL’s approach to missing information is, well, “unique”.
"