Monday, December 27, 2010

Solar Power - More panels and Nissan Leaf

Solar City will triple our solar power output in February, and we just ordered a Leaf, which they say is deliverable in the April-July timeframe.

We are adding 6.5KW of net delivered AC capacity (gross is about 7.5KW DC) to the garage roof. This time we are leasing it, and there are several options for how much downpayment vs. recurring payment to make.

There was a power cut last night for about two hours, and the automatic generator that runs off our propane tank kicked in after about 20 seconds. When the power came back, there wasn't a glitch, everything kept running. That was the first real test. We have a Generac 17KW unit that powers most of the house. With hindsight, it would have been better to get the 20KW generator that can power a whole 200Amp distribution panel rather than routing individual circuits. The extra cost would be offset by a simpler installation. We hope to replace the propane furnace with a heat pump at some point, then the generator will be the only thing running on propane, and since our solar system is grid-tied, it needs to work well to keep an all electric house going.

We were originally looking a geothermal heat pumps, but the extra cost of digging wells to get a more efficient heat pump is not cost effective versus adding extra solar panels to make up the difference. In climates where it is colder and there is less solar irradiance available, it makes more sense to use geothermal.

During 2010 we generated about 60% of the electricity we consumed, and also greatly reduced our propane consumption by switching appliances from propane to electric. In 2011 we should be almost 100% solar powered electric, and reduce our gasoline usage as well.

Wednesday, November 17, 2010

NoSQL Netflix Use Case Comparison for Translattice

[There is some discussion of this posting with comments by Michael at Slashdot]
Michael Lyle @mplyle CTO of Translattice kindly provided a set of answers that I have interspersed with the questions below. Translattice isn't technically a NoSQL system, but it isn't a conventional database either. It's a distributed relational SQL database that supports eventual consistency, as Michael puts it:
These answers are for the Translattice Application Platform (TAP)'s database component. Unlike other stores that have answered this question set, TAP contains a relational database that scales out over identical nodes. TAP further allows applications written to the J2EE platform to scale out across the same collection of nodes.
The original set of questions are posted here. Each respondent will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
In the Translattice Application Platform, data in relational tables is transparently sharded behind the scenes by the data store. These shards are stored redundantly across the nodes. Reads are satisfied with the most local copy of data available on the network, unless that resource is currently overloaded in which case the system may fall back to reads from more distant locations.

When it comes to writes, applications have the choice on the durability and isolation levels for changes. Each transaction may be made in a fully synchronous, serializable isolation level, or may be made in a locked eventually consistent mode that provides ACID serializable semantics except that durability may be sacrificed if an availability zone fails. A further asynchronous mode allows potentially conflicting changes to be made and allows a user-provided reconciliation function to decide which change "wins". A final commit requires a majority of nodes storing a shard to be available; in the case of the fully synchronous mode this would delay or prevent the return of success if a critical subset of the cluster fails.

Policy mechanisms in the system allow administrators to specify how physical and cloud database instances correspond to administratively-relevant zones. An administrator can choose to require, for instance, that each piece of information is replicated to at least three database nodes across a total of two availability zones. An administrator may also use these mechanisms to require that particular tables or portions of tables must or must not be stored in a given zone (for instance, to meet compliance or security requirements). Within the constraints set by policy, the system tracks usage patterns and places information in the most efficient locations.

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Assuming that the SQL transaction in question is running in the fully synchronous or eventually consistent locked mode, writes will only be allowed in one of the two zones. Reads will continue in both zones, but will only be able to satisfy requests for which at least one replica of the requested data exists in the local zone (policy can be specified to ensure that this is always the case). In the eventually consistent mode, multiple partitioned portions of the system can accept writes and reconcile later. Essentially, any of the above desired modes can be used on a transaction-by-transaction basis depending on application and performance requirements.
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Because fully relational primitives are provided, there can easily be one row in the database per favorite. Read-modify-write of the whole list is not required, and the only practical limits are application-defined.

Any SQL queries are supported against the store, and are transformed by the query planner into an efficient plan to execute the query across the distributed system. Of course, how efficient a query is to execute will depend on the structure of the data and the indexes that an administrator has created. We think this allows for considerable flexibility and business agility as the exact access methods that will be used on the data do not need to be fully determined in advance.
Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?
Network activity is protected by cryptographic hash authentication, which provides integrity verification as a side benefit. Distributed transactions also take place through a global consensus protocol that uses hashes to ensure that checkpoints are in a consistent state (this is also how the system maintains transactional integrity and consistency when changes cross many shards). Significant portions of the on-disk data are also presently protected by checksums and allow the database to "fail" a disk if corrupt data is read.
Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?
A good portion of this relational database's consistency model is implemented through a distributed multi-version concurrency control (MVCC) system. Tuples that are in-use are preserved as the database autovacuum process will not remove tuples until it is guaranteed that no one could be looking at them anymore. This allows a consistent version of the tables as of a point in time to be viewed from within a transaction; so BEGIN TRANSACTION; SELECT ... [or COPY FROM, to backup] ; COMMIT; works. We provide mechanisms to allow database dumps to occur via this type of mechanism.
In the future we are likely to use this mechanism to allow quick snapshots of the entire database and rollbacks to previous snapshot versions (as well as to allow the use of snapshots to stage development versions of application code without affecting production state).

Tuesday, November 09, 2010

NoSQL Netflix Use Case Comparison for Riak

Justin Sheehy @justinsheehy of Basho kindly provided a set of answers that I have interspersed with the questions below.

The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?

There are two possibilities with Riak. The first would be to spread a single Riak cluster across all three zones, for example one node in each of three zones. In this case, a single replica of each item would exist in each zone. Whether or not a response needed to wait on cross-zone traffic to complete would depend on the consistency level in the individual request. The second option would require Riak EnterpriseDS, and involves placing a complete cluster in each zone and configuring them to perform inter-cluster replication. This has multiple advantages. Every request would be satisfied entirely locally, and would be independent of latency or availability characteristics across zone boundaries. Another benefit is that (unlike either the first scenario or some other solutions that spread clusters and quorums over a long haul) read requests would not generate any cross-zone traffic at all. For an application with a high percentage of reads, this can make a large difference.

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair

As write-availability is a central goal achieved in Riak, the fourth option will be the observed behavior. This is the case regardless of the strategy chosen for Question 1. In the first strategy, local nodes other than the canonical homes for given data will accept the writes instead, using the hinted-handoff technique. In the second strategy, the local cluster will accept the write, those changes will be replayed across the replication link when the zones are reconnected. In all cases, vector clocks provide a clean way of resolving most inconsistency, and various reconciliation models are available to the user for those cases which cannot be syntactically resolved.


For more information on vector clocks in Riak, see:


http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/

and

http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/


Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?

Riak will use vector clocks to recognize causality in race conditions. In the case of two overlapping writes to the same value, Riak will retain both unless explicitly requested to simply overwrite with the last value received. If one client changes A to B and another changes A to C, then (unless told to overwrite) Riak will return both B and C to the client. When that client then modifies the object again, the single descendant "D" that they created will be the new value. For applications such as sets which are mostly added to and rarely deleted from, the application code to perform this reconciliation is trivial and in some cases is simply a set union operation. This would look a bit like this in terms of vector clock ancestry:

http://dl.dropbox.com/u/751099/ndiag1.png




Riak allows values to be of any arbitrary content type, but if the content is in JSON then a JavaScript map/reduce request can be used to query by attribute/value.

Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

Many layers of Riak perform consistency checking, including CRC checking in the persistence engine and object equality in the distributed state machines handling requests. In most cases where corruption can be detected in a given replica of some item, that replica will immediately but asynchronously be fixed via read-repair.

Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

There are two approaches to back up Riak systems: per-node or whole-cluster. Backing up per-node is the easiest option for many people, and is quite simple. Due to bitcask (the default storage engine) performing writes in an append-only fashion and never re-opening any file for writing once closed, Riak nodes can easily be backed up via the filesystem backup method of your choice. Simply replacing the content of the data directory will reset a node's stored content to what it held at the time. Alternately, a command line backup command is available which will write out a backup of all data on the cluster. This is fairly network and disk intensive and requires somewhere to put a whole-cluster backup, but is very useful for prototyping situations which are not holding enormous amounts of data.


Monday, November 01, 2010

Are we ready for spotcloud yet?

Launched today by Enomaly (@ruv) Spotcloud is a "Cloud Capacity Clearinghouse and Marketplace". There was a lot of discussion on twitter about whether this is really new, and previous attempts to do something similar.

My background in this is that I was working at Sun in 2003/2004 when we were thinking about a marketplace for public grid computing capacity, I was chief architect for Shahin Khan's High Performance Technical Computing group at the time, and we "owned" Grid for Sun. We were both RIFd in the summer of 2004, but some of our projects stayed alive, and @ruv mentioned some of these ideas from Sun surfacing in 2005.

I moved to eBay, and one idea that I tried to get eBay interested in at the time was building a marketplace for compute capacity. The problem was that eBay is a retail product focused company, and had no product managers looking at digitally delivered products. I couldn't find a marketplace manager who understood what I was proposing and thought it might be worth working on. In practice, it was too early, but Amazon had the vision to build a cloud at this time, and eBay could have done the same if it wanted to create a market, rather than make existing markets more efficient.

In 2006 (while I was working at eBay Research Labs) I wrote a blog post about a maturity model for innovation. The key point is:

"the evolution of a marketplace goes from competing on the basis of technology, to competing on service, to competing as a utility, to competing for free. In each step of the evolution, competitors shake out over time and a dominant brand emerges.

To use this as a maturity model, take a market and figure out whether the primary competition is on the basis of technology, service, utility or search"
Today the cloud marketplace is somewhere between the service and utility phases. Each individual cloud has their own specific services and service interfaces, and they have not turned into a standard commodity yet, so we do not have the basis for competition purely on the basis of a Utility (i.e. on service quality - uptime, not on service features).

From this point of view, it is still too early for Spotcloud to take off. Cloud's problem is not "finding generic capacity at low cost" (the cloud utility search problem), the cloud marketplace is still evolving it's differentiated service interfaces towards a common set of functionality and standards. Spotcloud is starting out based on Enomaly's interfaces, and say they will add others, while the market leader is Amazon, who have already implemented their own spot pricing model.

One thing I did learn at eBay, is how hard it is to manage marketplaces. One unfortunate measure of success is that it attracts people whose aim is to make money by manipulating the market rather than contributing to it. There are a lot of non-intuitive details that you have to get right for a marketplace to scale and be robust enough to build and maintain trust, while also having very low "friction" so that it attracts and retains buyers and sellers.

So one way to tell that the marketplace for cloud capacity is viable is when you see eBay entering that marketplace :-)



Sunday, October 31, 2010

NoSQL Netflix Use Case Comparison for MongoDB

Roger Bodamer @rogerb from 10gen.com kindly provided a set of answers for MongoDB that I have interspersed with the questions below. The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons. If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?

Let's assume for discussion purposes that we are using MongoDB across three availability zones in a region. We would have a replica set member in each of the three zones. One member will be elected primary at a given point in time.


All writes will be sent to the primary, and then propagate to secondaries from there. Thus, writes are often inter-zone. However availability zones are fairly low latency (I assume the context here is EC2).


Reads can be either to the primary, if immediate/strong consistency semantics are desired, or to the local zone member, if eventually consistent read semantics are acceptable.


Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Let's assume again we are using three zones - we could use two but three is more interesting. To be primary in a replica set, the primary must be visible to a majority of the members of the set: in this case, two thirds of the members, or two thirds of the zones. If one zone is partitioned from the other two, what will happen is: a member in the 2 zone side of the partition will become primary, if not already. It will be available for reads and writes.

The minority partition will not service writes. Eventually consistent reads are still possible in the minority partition.

Once the partition heals, the servers automatically reconcile.

http://www.mongodb.org/display/DOCS/Replica+Set+Design+Concepts
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions?

MongoDB supports atomic operations on single documents via both its $ operators ($set, $inc) and also by compare-and-swap operations. In MongoDB one could model the list as a document per favorite, or, put all the favorites in a single BSON object. In both cases atomic operations free of race conditions are possible. http://www.mongodb.org/display/DOCS/Atomic+Operations


This is why mongodb elects a node primary: to facilitate these atomic operations for use cases where these semantics are required.


If multiple attribute/values are supported for a key, can an additional value be written directly without reading first?
Yes.
What limits exist on the size of the value or number of attribute/values?

A single BSON document must be under the limit -- currently that limit is 8MB. If larger than this, one should consider modeling as multiple documents during schema design.

and are queries by attribute/value supported?

Yes. For performance, MongoDB supports secondary (composite) indices.


Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

The general assumption is that the storage system is reliable. Thus, one would normally use a RAID with mirroring, or a service like EBS which has intrinsic mirroring.


However, the BSON format has a reasonable amount of structure to it. It is highly probable, although not certain, that a corrupt object would be detected and an error reported. This could then be correct with a database repair operation.


Note: the above assumes an actual storage system fault. Another case of interest is simply a hard crash of the server. MongoDB 1.6 requires a --repair after this. MongoDB v1.8 (pending) is crash-safe in its storage engine via journaling.


Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup?

The most used method is to have a replica which is used for backups only; perhaps an inexpensive server or VM. This node can be taken offline at any time and any backup strategy used. Once re-enabled, it will catch back up. http://www.mongodb.org/display/DOCS/Backups


With something like EBS, quick snapshotting is possible using the fsync-and-lock command.


For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

One can stop the server(s), restore the old data file images, and restart.


MongoDB supports a slaveDelay option which allows one to force a replica to stay a certain number of hours behind realtime. This is a good way to maintain a rolling backup in case of someone "fat-fingering" a database operation.

Friday, October 29, 2010

NoSQL Netflix Use Case Comparison for Cassandra

Jonathan Ellis @spyced of Riptano kindly provided a set of answers that I have interspersed with the questions below.

The original set of questions are posted here. Each NoSQL contender will get their own blog post with answers, when there are enough to be interesting, I will write some summary comparisons.

If you have answers or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.

Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.
The most natural way to model per-user favorites in Cassandra is to have one row per user, keyed by the userid, whose column names are movie IDs. The combination of allowing dynamic column creation within a row and allowing very large rows (up to 2 billion columns in 0.7) means that you can treat a row as a list or map, which is a natural fit here. Performance will be excellent since columns can be added or modified without needing to read the row first. (This is one reason why thinking of Cassandra as a key/value store, even before we added secondary indexes, was not really correct.)

The best introduction to Cassandra data modeling is Max Grinev's series on
basics, translating SQL concepts, and idempotence.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?
Briefly, both reads and writes have a ConsistencyLevel parameter controlling how many replicas across how many zones must reply for the request to succeed. Routing is aware of current response times as well as network topology, so given an appropriate ConsistencyLevel, reads can be routed around temporarily slow nodes.

On writes, the coordinator node (the one the client sent the request to) will send the write to all replicas; as soon as enough success messages come back to satisfy the desired consistency level, the coordinator will report success to the client.

For more on consistency levels, see
Ben Black's excellent presentation.
Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Cassandra has no 'master copy' for any piece of data; all copies are equal. The other behaviors are supported by different ConsistencyLevel values for reads (R) and writes (W):

R=QUORUM, W=QUORUM: One zone decides that it is still working for reads and writes, and the other zone decides it is offline
R=ONE, W=ALL: Both zones continue to satisfy reads, but refuse writes
R=ONE, W=ONE: Both zones continue to accept writes, and reconcile any inconsistencies when the partition heals

I would also note that reconciliation is timestamp-based at the column level, meaning that updates to different columns within a row will never conflict, but when writes have been allowed in two partitions to the same column, the highest timestamp will win. (This is another way Cassandra differs from key/value stores, which need more complex logic called vector clocks to be able to merge updates to different logical components of a value.)
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?
Cassandra's ColumnFamily model generally obviates the need for a read before a write, e.g., as above using movie IDs as column names. (If you wanted to allow duplicates in the list for some reason, you would generally use a UUID as the column name on insert instead of the movie ID.)

The maximum value size is 2GB although in practice we recommend using 8MB as a more practical maximum. Splitting a larger blob up across multiple columns is straightforward given the dynamic ColumnFamily design. The maximum row size is 2 billion columns. Queries by attribute value are supported with secondary indexes in 0.7.
Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?
Cassandra handles repairing corruption the same way it does other data inconsistencies, with read repair and anti-entropy repair.
Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?
Because Cassandra's data files are immutable once written, creating a point-in-time snapshot is as simple as hard-linking the current set of sstables on the filesystem. Performance impact is negligible since hard links are so lightweight. Rolling back simply consists of moving a set of snapshotted files into the live data directory. The snapshot is as consistent as your ConsistencyLevel makes it: any write visible to readers at a given ConsistencyLevel before the snapshot will be readable from the snapshot after restore. The only scalability problem with snapshot management is that past a few TB, it becomes impractical to try to manage snapshots centrally; most companies leave them distributed across the nodes that created them.

Wednesday, October 27, 2010

Comparing NoSQL Availability Models

let's risk feeding the CAP trolls, and try to get some insight into the differences between the many NoSQL contenders. I have circulated an earlier version of this to a few people and got at least one good response. If you have answers, or would like to suggest additional questions, comment here, tweet me @adrianco or blog it yourself.


Use Case Scenario for Comparison Across NoSQL Contenders
While each NoSQL contender has different strengths and will be used for different things, we need a basis for comparison across them, so that we understand the differences in behavior. Here is a sample scenario that I am publishing to put to each vendor to get their answers and will post the results here. The example is non-trivial and is based on a simplified Netflix related scenario that is applicable to any web service that reliably collects data from users via an API. I assume that is running on AWS and use that terminology, but the concepts are generic.

Use Case
A TV based device calls the API to add a movie to its favorites list (like the Netflix instant queue, but I have simplified the concept here), then reads back the entire list to ensure it is showing the current state. The API does not use cookies, and the load balancer (Amazon Elastic Load Balancer) is round robin, so the second request goes to a different API server, that happens to be in a different Amazon Availability Zone, and needs to respond with the modified list.

Favorites Storage
Favorites store is implemented using a NoSQL mechanism that persistently stores a single key=user value=movielist record on writes, and returns the movielist on reads.

Question 1: Availability Zones
When an API reads and writes to a queue store using the NoSQL mechanism, is the traffic routing Availability Zone aware? Are reads satisfied locally, or spread over all zones, is the initial write local or spread over the zones, is the write replication zone aware so data is replicated to more than one zone?

Question 2: Partitioned Behavior with Two Zones
If the connection between two zones fails, and a partition occurs so that external traffic coming into and staying within a zone continues to work, but traffic between zones is lost, what happens? In particular, which of these outcomes does the NoSQL service support?
  • one zone decides that it is still working for reads and writes but half the size, and the other zone decide it is offline
  • both zones continue to satisfy reads, but refuse writes until repaired
  • data that has a master copy in the good zone supports read and write, slave copies stop for both read and write
  • both zones continue to accept writes, and attempt to reconcile any inconsistency on repair
Question 3: Appending a movie to the favorites list
If an update is performed by read-modify-write of the entire list, what mechanisms can be used to avoid race conditions? If multiple attribute/values are supported for a key, can an additional value be written directly without reading first? What limits exist on the size of the value or number of attribute/values, and are queries by attribute/value supported?

Question 4: Handling Silent Data Corruption
When the storage or network subsystem corrupts data without raising an error, does the NoSQL service detect and correct this? When is it detected and corrected, on write, on read or asynchronously?

Question 5: Backup and Restore
Without stopping incoming requests, how can a point in time backup of the entire dataset be performed? What is the performance and availability impact during the backup? For cases such as roll-back after a buggy application code push, how is a known good version of the dataset restored, how is it made consistent, and what is the performance and availability impact during the restore? Are there any scalability limits on the backed up dataset size, what's the biggest you have seen?

Sunday, October 10, 2010

Netflix in the Cloud

I'm presenting this talk on Thursday at the Cloud Computing Meetup, and again on Nov 3rd at QConSF. So far I have posted a "teaser" summary on slideshare. After QCon I will post the full slide deck [update: combined deck from both talks posted here].

The meetup is the "beta test" of the presentation. It's in Mountain View at "Hacker Dojo", and at the time of writing 437 people have signed up to attend. If everyone turns up it's going to be crazy and over-flowing trying to park and get in, so get there early.... I will focus the meetup talk more on the operational aspects of the cloud architecture, and migration techniques.

At QConSF, the presentation is in the "Architectures you've always wondered about" track, and I will spend more time talking about the software architecture.

Why give these talks? We aren't trying to sell cloud to CIOs, it's all about hiring, we are talking at engineer focused events in the Bay Area. Netflix is growing fast, pathfinding new technologies in the public cloud to support a very agile business model, and is trying to attract the best and brightest people. We use LinkedIn a lot when we search for positions, so feel free to connect to me if you think this could be interesting, or follow @adrianco on Twitter.

Wednesday, October 06, 2010

Time for some climate action on 10/10/10

organized by 350.org there are events all over the world. The significance of 350 is that it is the safe level of CO2 in the atmosphere. We are already at 390ppm, which is why the climate is changing and extreme weather events are becoming common. The level is going up by 2ppm a year at the moment and the changes needed to reverse this haven't started, so its going to go much higher and have increasingly worse effects.

Even if all you do is buy and start reading a copy of James Hansen's book Storms of my Grandchildren (you can get it instantly from the Kindle store) it will help.

Here is a useful discussion of "The Value of Coherence in Science" which is how you can tell what makes sense from what must be wrong. I'm a Physics graduate, and I was taught that science works by making predictions and testing them, and by eliminating incoherent propositions. i.e. if an argument contradicts itself it must be false. The denialist counter-arguments fail this test. To see why the denialist arguments are getting any air time at all, read Merchants of Doubt and Climate Cover Up.

For an example of incoherent and shoddy denialist work John Mashey has systematically deconstructed the Wegman Report which was presented to Congress as an impartial study when it was nothing of the sort.

Sunday, October 03, 2010

More solar power

We just signed up for almost 8KW on the garage roof. It's justified based on replacing our propane furnace with a heat pump that also gives us aIr conditioning, and also running an electric car (e.g. Nissan Leaf). We can use our existing monitoring system, but unlike last time, when we bought the system, we plan to lease it from Solar City this time. There Is a four month lead time, so we can pick the exact terms of the lease next February before it is installed. There are several options, zero down with a monthly charge that increases slightly each year, or options to pay various amounts up front with a fixed monthly payment. The monthly payments are all less than a typical electricity bill.

The goal for next year is to only use propane to run the backup generator (usually for a few days a year), and to make a dent in our petrol usage for the daily commute. We are on the Nissan Leaf list, but not in the first wave of owners. We should be able to get a test drive soon, I will report on any progress as it happens.

If you are interested in electric cars, look for Robert Llewellyn's Fully Charged video show on YouTube. He has been test driving everything and he's good fun to watch. You may recognize him as the actor who played Kryton in Red Dwarf.

Sunday, September 26, 2010

Solar Power - The Year in Review

For our full story on solar click here.

We installed Solar panels in August 2009, and they were turned on during September. We have now had a whole year of power which is shown below. The time-of-use metering means that we get paid much more for the power we generate in the afternoon than the power we use overnight, so the "zero point" for billing is different than that for power consumption. PG&E recently sent us our annual bill, which was about $500. Added to the monthly bills for basic service this comes to about $700 for the whole year. Our previous electricity bill was about $2000 a year. Since we changed our hot water, clothes dryer and range from propane to electric, we saved over $1000 in propane cost as well. That puts payback at around ten years at todays prices, given the likelihood of increased propane and electricity prices over time, actual payback would be earlier than that.

The panels have got a lot of dust on them (the roof is too high to get at easily to clean them), so aren't running at peak efficiency, the best day we saw was around 28KWh, with a lot of days of 26-27 KWh through the summer. The totals for each month are shown in the screenshots below.






Since we have built a new garage, we now have a lot more roof area. So we are now looking to add another 4KW of panels, and to swap out our propane furnace for a heat pump that will give us heating and cooling (yay - its hot today...) sometime next spring. Then the only use for propane will be the emergency generator. It should also make the garage a bit cooler in the summer by shading half of the roof with panels.

Thursday, August 26, 2010

Netflix for iPhone in the cloud and HTML5

I posted on this subject a while ago, and generated a lot of confused comments, who read things into my post that weren't there. Today I hope it's a bit clearer what I was talking about, because we released the Netflix iPhone app, which is based on HTML5 for its user interface elements, back ends into the Amazon cloud, and uses conventional, non-HTML5 video playback and DRM. The playback mechanism is the same as the iPad, but to support the scale of expected usage we needed more capacity (it's currently the top free app in the app store as I write this) and we got that by rebuilding our API tier and personalized movie choosing backend to run on the Amazon cloud. The user interface runs in a webkit based browser window in the app, just like on iPad, but the entire UI is built using Javascript with advanced CSS and HTML5 animations to get it to feel like a native iPhone app.

Returning to the theme of my last post, Netflix is hiring engineers to work on cloud tools, platforms and performance, and advanced user interfaces. I think we are breaking new ground in both areas and its an exciting place to be. We have very high standards and are looking for the best people in the industry to come and help us...

Wednesday, August 18, 2010

Eventual Consistency of Cloud?

Lori MacVittie wrote that eventually cloud standards will converge around the private cloud standards (RT @swardley). I disagree, and there are several examples I can think of that point to the opposite conclusion, that public cloud will be the standard, and AWS will continue to dominate. The original article is here.

The first point of disagreement is the claim that public clouds aren't getting the input they need from customers to mature their services, because real enterprise customers aren't running in the public cloud. Well, here at Netflix, we are giving Amazon exactly that kind of input. We use every feature they have, we are driving them hard and Amazon is taking the input and improving their product rapidly in ways that benefit all the users of AWS.

My main disagreement is the claim that lots of individual IT departments will converge on a single standard that will win out over public cloud standards. I find this highly implausible, there are a host of vendors feeding technology to enterprise clouds, and they will all do the usual vendor thing of looking for ways to lock in the customer, even if they base on the same standards their implementations will be different. Here's an example, the "private" Enterprise Unix variants Solaris, AIX, HP-UX, IRIX, OSF/1 etc. are all based on the same Unix standards, however the "public" alternatives are Linux and BSD, and Linux has won the mind share in this space. I see Linux as an analogy for the public cloud in the sense that there is a very low barrier to adoption. For Linux, you can download it to run on any old computer for nothing, learn to use it, then build very low cost solutions out of it. For AWS, for a few dollars on the existing Amazon account you use to buy books etc, you can explore all the features and learn to build very powerful systems in a few hours. This has produced a large population of very productive engineers, who know how to use AWS, Linux and other open source tools to solve problems rapidly at low cost using the same tools. In contrast, if every enterprise cloud moves ahead by solving their problems independently they will produce a variety of architectures, each optimized to their own problem, and with their own tooling, and a very small number of people who know how to run each variant. You will also find that every company also uses the public cloud to get stuff done more quickly and cheaply than the IT department, so that will become the common standard.

Part of the thinking behind Netflix' move to the cloud is that large public cloud providers like Amazon will have far more engineers working on making their infrastructure robust, scalable, well automated and secure than any individual enterprise could afford. By using AWS we are leveraging a huge investment made by Amazon, and paying a small amount for it on a month by month basis. We also get to efficiently allocate resources, for example how much does it cost to provision a large cage of equipment in a new datacenter and how long does it take from deciding to do it, to having it running reliably in production? Let's say $10M and many months. Instead we could spend $10M on licensing more movies and TV shows to stream, and grow incrementally in the cloud. In a few months time we have more customers than if we spent the money up front on buying compute capacity and we just keep re-provisioning new instances in the cloud, so we never end up with a datacenter full of inappropriately sized or obsolete equipment. At present, Netflix' growth is accelerating, so it is difficult to guess in advance how much capacity to invest in, but we have already flat-lined our datacenter capacity, and have all incremental capacity happening on AWS. For example, we can just fire up a few thousand instances for a week to encode all the new movies we just bought the rights to, then stop paying for them until another big deal closes. Likewise on The Oscars Awards night, there is a big spike in web traffic, and we can grow on the day and shrink afterwards as needed without planning it and buying hardware a long time in advance.

While the other public and private cloud vendors are competing to come up with standards, we are finding that resumes from the kind of engineers we want to hire already reference their experience with AWS as a de-facto cloud standard. It's also easier to attract the best people if they will learn transferable skills and work on the very latest technologies.

That might sound like a lock-in, but a well designed architecture is layered, and the actual AWS dependencies are very localized in our code base. The bet on the end game is that in coming years, other cloud vendors produce large scale AWS compatible offerings (full featured, not just EC2 and S3), and a very large scale multi-vendor low cost public cloud market is created. Then even a large and fast growing enterprise like Netflix will be an insignificant and ever smaller proportion of the cloud. By definition, you can't be an insignificant proportion of your own private cloud....

Monday, August 16, 2010

Reducing TCP retransmit timeout?

Cloud networks are lossy and low latency, reducing TCP_RTO_MIN and TCP_DELACK_MIN looks like a good idea, but it looks as if this needs a linux kernel recompile. Anyone else looked at this?
Here is a relevant paper “Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication“
http://www.cs.cmu.edu/~vrv/papers/sigcomm147-vasudevan.pdf

Friday, August 06, 2010

Open letter to my Sun friends at Oracle

I recently heard about Illumos via a tweet from Alec Muffett, and responded with my own tweet "I predict that #illumos will be just as irrelevant as Solaris has been for the last few years. Legacy." - personally I haven't logged into a Solaris or SPARC machine for about four years now. There are none at Netflix.

I have also been talking to a few friends who stayed at Sun and are now at Oracle, and there is a common thread that I decided to put out there in this blog post.

This week I presented at a local Computer Measurement Group meeting, talking about how easy it is to use the Amazon cloud to run Hadoop jobs to process terabytes of data for a few bucks [slideshare]. I followed a talk on optimizing your Mainframe software licensing costs by tweaking workload manager limits. There are still a lot of people working away on IBM Mainframes, but it's not where interesting new business models go to take over the world.

The way I see the Oracle/Sun merger is that Oracle wanted to compete more directly with IBM, and they will invest in the bits of Sun that help them do that. Oracle has a very strong focus on high margin sales, so they will most likely succeed in making good money with help from Solaris and SPARC to compete with AIX, z/OS and P-series, selling to late-adopter industries like Banking, Insurance etc. Just look where the Mainframes are still being used. Sun could never focus on just the profitable business on its own, because it had a long history of leading edge innovation that is disruptive and low margin. However, what was innovative once is now a legacy technology base of Solaris and SPARC, and it's not even a topic of discussion in the leading edge of disruptive innovators, who are running on x64 in the cloud on Linux and a free open source stack. There is no prospect of revenue for Oracle in this space, so they are right to ignore it.

That is what I meant when I tweeted that Illumos is as irrelevant as Solaris, and it is legacy computing. I don't mean Solaris will go away, I'm sure it will be the basis of a profitable business for a long time, but the interesting things are happening elsewhere, specifically in public cloud and "infrastructure as code".

You might point to Joyent, who use Solaris, and now have Bryan Cantrill on board, but they are a tiny bit-player in cloud computing and Amazon are running away with the cloud market, and creating a set of de-facto standard APIs that make it hard to differentiate and compete. You might point to enterprise or private clouds, but as @scottsanchez tweeted: "Define: Private Cloud ... 1/2 the features of a public cloud, for 4x the cost", that's not where the interesting things are happening.

So to my Sun friends at Oracle, if you want to work for a profitable company and build up your retirement fund Oracle is an excellent place to be. However, there are a lot of people who joined Sun when it was re-defining the computer industry, changing the rules, disrupting the competition. If you want some of that you need to re-tool your skill set a bit and look for stepping stones that can take you there.

When Sun shut down our HPC team in 2004 I deliberately left the Enterprise Computing market, I didn't want to work for a company that sold technology to other companies, I wanted to sell web services to end consumers, and I had contacts at eBay who took me on. In 2007 I joined Netflix, and it's the best place I've ever worked, but I needed that time at eBay to orient myself to a consumer driven business model and re-tool my skill set, I couldn't have joined Netflix directly.

There are two slideshare presentations on the Netflix web site, one is on the company culture, the other on the business model. It is expected that anyone who is looking for a job has read and inwardly digested them both (its basically an interview fail if you haven't). These aren't aspirational puff pieces written by HR, along with everyone else in Netflix management (literally, at a series of large offsites), I was part of the discussion that helped our CEO Reed Hastings write and edit them both.

What can you do to "escape"? The tools are right there, you don't need to invest significant money, you just need to carve out some spare time to use them. Everything is either free open source, or available for a few cents or dollars on the Amazon cloud. The best two things you can have on your resume are hands on experience with the Amazon Web Services tool set, and links to open source projects that you have contributed to. There isn't much demand for C or C++ programmers, but ObjectiveC is an obvious next step, it's quite fun to code in and you can develop user interfaces for iPhone/iPad in a few lines of code, that back-end into cloud services. Java code (for app servers like Tomcat) on Android phones, Ruby-on-Rails, and Python are the core languages that are being used to build innovative new businesses nowadays. If you are into data or algorithms, then you need to figure out how to use Hadoop, which as I describe in one of my slideshare decks is trivially available from Amazon. You can even get an HPC cluster on a 10Gbit ethernet interconnect from Amazon now. There is hadoop based open source algorithm project called Mahout that is always looking for contributors.

To find the jobs themselves, spend time on LinkedIn. I use it to link to anyone I think might be interesting to hire or work with. Your connections have value since it is always good to hire people that know other good people. Keep your own listing current and join groups that you find interesting, like Java Architecture or Cloud Computing, and Sun Alumni. At this point LinkedIn is the main tool used by recruiters and managers to find people.

Good luck, and keep in touch (you can find me on LinkedIn or twitter @adrianco :-)