We just put up a new blog post about the asynchronous checkpointing work we released in 0.5.3. It goes into some detail about the problem of recovering a distributed system to a consistent global state, some ways you might do this incorrectly, and why the checkpointing algorithm we settled on is a great fit for streaming systems.