Friday, June 24, 2011

Vector Clocks

Vector clocks are a way of versioning data so that there is a record of which actors in your system have been involved in that data's history.

Confused?  Let's consider a 'real world' analogy to the problem vector clocks help to solve.  You have four rugby fans: a Leinster fan, a Munster fan, a Connacht fan and an Ulster fan.  Together they are going to try to determine who they think should be wearing the number 6 jersey for Ireland in the 2011 Rugby World Cup.  Now:
  • The fans are all able to communicate their selection for the jersey with each other
  • They can send their selection to everyone or just one other fan
  • They are able to change their mind
  • They are able to communicate concurrently. This means the Munster fan can send a text to Leinster fan while at the same time the Connacht fan can send a text to the Ulster fan.  
  • Each fan is able to give an indication of the version he got before he sends his selection on. This may or may not have influenced their own selection!
Sean O'Brien
So it starts off. The Leinster fan is using Twitter and has the other three lads following him.  He tweets: Sean O'Brien.  The Munster fan thinks about this for a while.  He can't deal with it.  He's a proud Munster fan.  The pressure gets to him and he sends a text message to the Connacht fan saying that the man who should be wearing number 6 is Denis Leamy.  It's just a text to the Connacht fan; the Ulster and Leinster fans are unaware of this communication.  The Connacht fan decides to amuse himself and agrees.  He texts back Denis Leamy to the Munster lad.  The Ulster lad then sends a text to the Connacht lad which reads: Stephen Ferris.  The Connacht fan can see the potential tension brewing, especially between the Munster and Leinster lads, so he decides to be diplomatic and changes his selection to Stephen Ferris.

The Leinster lad then looks for the other lads' answers.  Unfortunately he's lost the Ulster lad's phone number so he can only communicate with the Munster and Connacht lads.  He sees the Munster lad has chosen Leamy and the Connacht lad has chosen Ferris.  He can see there's clearly been a contradiction.  But he can also see that the Connacht lad has received communication from both the Munster lad and the Ulster lad, and that the Connacht lad has changed his mind to agree with the Ulster lad.  He sees the consensus and decides to go with Ferris (O'Brien can always play 7 or 8). Eventually the Munster fan sees he's completely overruled and goes with Ferris.

So how are fans communicating not only their selections but indicating what they knew before their selections? They are using vector clocks!

You see, these lads aren't just rugby fans; they are nerds.

The lads agree a communication protocol between themselves. Not only will they text their choice for the number 6 jersey, they will also include a vector of four numbers.  The first number will correspond to the Leinster version, the second the Ulster version, the third the Connacht version and the fourth the Munster version. Why are they doing this?  So they can indicate what they knew about other people's selections before they made their own.

Ok, I appreciate that's confusing.  But rugby itself can be confusing. So be patient.  Let's go back to the example and examine the full messages they sent.

The Leinster fan starts - he sends out his tweet to all other fans: Sean O'Brien, {1, 0, 0, 0}.  This means Sean O'Brien is the first version from the Leinster fan (indicated by the value 1 as the first element in the vector).  It also indicates the Leinster fan had no record of any other version from any other fan before sending this out (indicated by the value 0 as the other three elements).  Then the Munster fan sends Denis Leamy {1, 0, 0, 1} to the Connacht fan. This indicates two things to the Connacht fan:
  • The Munster fan got Sean O'Brien {1, 0, 0, 0} from the Leinster fan. The Connacht fan can be sure of this because the first element in the vector, which corresponds to the Leinster version, is 1.
  • The Munster fan has updated his selection to Denis Leamy. This is indicated by the last element in the vector having the value 1.  The last element corresponds to the Munster version.
Then, the Connacht fan sends Denis Leamy {1, 0, 1, 1} back to the Munster fan.  The Munster fan ascertains from this message that:
  • The Connacht fan got the original Leinster choice  - since the first digit is one
  • The Connacht fan got the Munster fans 1st selection  - since the last digit is one
  • And that it's bad news for poor old John Muldoon, because the Connacht fan has made his own selection and agrees with the Munster fan.  This is ascertainable because the message is Denis Leamy and the third digit is 1.
Then, the Ulster fan sends a text with his selection to just the Connacht fan: Stephen Ferris {1, 1, 0, 0}. This indicates:
  • The Ulster fan got the original Leinster choice - since first digit is one.
  • The Ulster fan has indicated his preference for Stephen Ferris - from the message itself and because the second digit is one.
  • The Ulster fan knows nothing of the Munster selection or Connacht's selection - since last two digits are 0.
The Connacht fan sees things are getting messy now.  He knows:
  • The Leinster fan chose O'Brien
  • The Munster fan chose Leamy
  • The Ulster fan chose Ferris
He thinks about it and decides to change his mind to Ferris.  So he sends Ulster: Ferris {1, 1, 2, 1}
With this message, the Connacht fan is saying:
  • I saw the Leinster version 1. You know that's O'Brien.
  • I also saw the Munster fan's version 1.  You don't know that's Leamy.
  • My initial selection was something other than Ferris. You don't know what that was either.
  • My current selection is Ferris.
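
To make the fans' protocol concrete, here is a minimal Java sketch of the bookkeeping each fan does: bump your own slot whenever you make or change a selection, and take the element-wise maximum of any clock you receive. The class and method names are mine, purely for illustration, and the slots follow the article's order {Leinster, Ulster, Connacht, Munster}.

// Minimal sketch of the fans' protocol. Each fan owns one slot in the vector.
public class VectorClock {
    private final int[] versions;   // slot order: {Leinster, Ulster, Connacht, Munster}

    public VectorClock(int actors) {
        this.versions = new int[actors];
    }

    // Called by a fan when he makes (or changes) his selection: bump his own slot.
    public void increment(int actor) {
        versions[actor]++;
    }

    // Called by a fan when he receives a selection: remember the highest version
    // he has seen from every fan (element-wise maximum).
    public void merge(VectorClock received) {
        for (int i = 0; i < versions.length; i++) {
            versions[i] = Math.max(versions[i], received.versions[i]);
        }
    }

    @Override
    public String toString() {
        return java.util.Arrays.toString(versions);
    }
}

Following these two rules, the Connacht fan's clock goes: merge the Munster text {1, 0, 0, 1}, increment his own slot for his Leamy pick, giving {1, 0, 1, 1}; then merge the Ulster text {1, 1, 0, 0} and increment again for his change of mind, giving {1, 1, 2, 1} - exactly the vectors in the story above.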

So overall, it is possible now to see that:
  • Leinster's selection is O'Brien; 
  • Ulster and Connacht say Ferris and Munster says Leamy
and very importantly
  • Everyone saw Leinster's selection
  • Connacht saw Munster's and Ulster's selections
  • Ulster saw Connacht's second selection
This information is very important as it makes it easier to resolve the conflict. It is not the responsibility of vector clocks to resolve the conflict; that is someone else's responsibility.  But vector clocks make it easier for someone else to resolve the conflict because they indicate what each actor knows or doesn't know about the other actors' selections. Are you listening Declan Kidney?

Declan Kidney - does he use Vector clocks?

What does this mean in architectural terms?
The rugby fan analogy serves as a nice intro to the use of vector clocks in distributed architectures.  In a simple, conventional architecture you often have just one relational database.  It's running on a powerful box with lots of CPU power and disk space. The database is not distributed and it is not replicated.  You can add a version column to each domain entity's table; this is usually a numeric value which indicates the version of the domain entity.  At any time a client can check the version it has in memory against the version in the database to ensure it does not have stale data.  All easy. All cool.

Now, sometimes you have replication.  Usually one database is the master and one (or more) is the slave.  If the master goes offline, the slave takes over. When the master comes back online, a resync happens. Again, all straightforward.  No need for vector clocks.

In a high end architecture you are going to be under pressure to scale.  You'll be under pressure to ditch the relational database, as they don't scale well, and use a NoSQL architecture instead. In this architecture, to provide availability, the data is replicated and it is distributed.  There is usually no dedicated master because that makes it harder to scale.  Instead there is just a ring of database nodes (also just called servers).  Scaling involves adding more nodes with the intent of distributing the work out evenly - you see, there really is no master.  The nodes can communicate with each other in a peer to peer fashion.  They need to be able to indicate what version of the data they have relative to all other nodes.  To do this they use vector clocks.

There's no master and data can still go out of sync

So suppose your data is distributed over four nodes and, to guarantee high availability, each piece of data is replicated across the nodes (it doesn't have to be replicated on every node but let's keep things simple). When the data gets updated, unless you block until every single node is up to date as part of the one transaction, your nodes can go out of sync.  That's slow, and even if you do decide to do it, what happens if a node goes down or a new node joins the group?  You are still going to have conflicts.

Ok, so say you try to add a version column to your tables to represent entity versions. This will only get you so far.  It will only tell you what version the entity is in the scope of each node. It's quite possible for two nodes to both have version 4, but because they could both have gone down and come back up at different times, their data may not match!  So it's easy to see why version columns which only hold a single version number will not work.

Cue Vector Clocks!  

The vector clock tells you the version information not just for a single actor in your system but for a range of actors in your system.

Reverting back to our rugby fan analogy, let's say there are four servers.
  • A request comes into your system from the Leinster fan which makes the selection: Sean O'Brien. The Leinster fan is identified as the first actor in the system so the vector clock is {1, 0, 0, 0}. All nodes get this version.
  • A request then comes in from the Munster fan.  The Munster fan is identified as the last actor in the vector clock.  Only the 4th database node gets this update, because all the other servers are down.  The vector clock is {1, 0, 0, 1}.
  • The third server comes back online and 4th node tells the 3rd node about Denis Leamy and that the vector clock is {1, 0, 0, 1}. It does this by using something like the Gossip protocol (text messages where one user contacts just one other user corresponds to the gossip protocol in our rugby fan analogy).  The 3rd Node updates to Denis Leamy and informs the 4th Node that it has done so.  Again it can do all this by Gossip protocol. It sends the message Denis Leamy {1, 0, 1, 1} back to the 4th Node.
  • Now the Ulster fan sends a request in and suggests Ferris.  The 1st node is still offline and it's a bad day for the sys admin team because the 3rd and 4th node have gone offline again.  However, the second node can process it because it has come back online.  It updates to Stephen Ferris and updates its Vector clock to {1, 1, 0, 0}. Remember the second node got no updates from the 3rd and 4th node.  It would have got them eventually but it got the request from the Ulster fan first.
  • Again the 3rd node comes back online and again, through the Gossip protocol, it gets the update for Ferris.  It sees the contradiction and decides to resolve it by updating to Ferris.  It tells the 2nd node {1, 1, 2, 1}.  This shows the 2nd node that the 3rd node had seen versions from itself and from the 4th node that the 2nd node did not know about.  The 2nd node is cool with this because it also has Ferris, so there is no conflict.
  • The 1st node comes back online.  Through the Gossip protocol it eventually sees that the 3rd node is at Ferris {1, 1, 2, 1} and the 4th node is at Leamy {1, 0, 0, 1}. It's a conflict.  But it can see that the 3rd node is more up to date: every element in the 3rd node's vector clock is equal to or greater than the corresponding element in the 4th node's vector clock, so it updates to Ferris.  The 4th node eventually realises the same and does the same.
Choosing the Actor

You'll notice in my example, I have four fans (corresponding to four clients) and four servers.  You'll also notice I had the Ulster fan hitting the 2nd Node, the Connacht fan hitting the 3rd node and the Munster fan hitting the 4th node.   
In reality:
  • Requests could be hitting any node - usually many more than one, unless the sys admins are having a bad day and are turning off machines to annoy you.
  • The data was being replicated on every node.  In reality you wouldn't usually do this; it would be overkill.
  • And in the real world, you'd usually have way more clients than servers.  
Who were the actors? 
In the above example, it may have seemed like we chose the servers as the actors in our vector clocks, but it was actually the clients.  Recall, the Leinster fan initially hit all four servers but the vector clock was just updated to {1, 0, 0, 0}, not {1, 1, 1, 1}. Let's consider the consequences of choosing the servers or the clients to be the actors for our vector clocks.

The server is the actor
Because there are fewer servers in a system, your vector clocks will be smaller. If you've four servers your vector clocks will look like {S1, S2, S3, S4} instead of {C1, C2, C3, C4, C5, C6, C7, C8, C9, ..., C99999}. This means comparison between vector clocks will be quicker.  However it also means you can lose updates.  For example, suppose two clients are mapped to the same server and make different updates.  The first update will be lost.

The client is the actor

In this case, vector clocks are much longer and comparisons take longer, but all updates are tracked. This means there can be a higher degree of certainty that conflicts get properly resolved.  To stop the vector clocks getting unsustainably long, they can be culled periodically: very old versions can be discarded.

Even when the client is the actor, the vector clock is still stored with the data in the server (database) nodes.

When are vector clocks used?
The primary use is in distributed systems. They are not responsible for resolving conflicts; they are a versioning system that helps someone else resolve them.  Sometimes it can be easy to resolve conflicts. For example, when there are two conflicting vector clocks, if every element in one vector clock is equal to or greater than the corresponding element in the other, then it is easy to ascertain that it is the more up to date one.  When that can't be done, the conflict has to be resolved by some other logic.
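
In code, that comparison rule is only a few lines. Here is a small Java sketch (the class and method names are my own, for illustration): one clock "descends from" another when every element is equal to or greater than the corresponding element, and two clocks are in conflict when neither descends from the other.

// Sketch of the comparison rule described above.
public class VectorClockComparison {

    // True when clock a has seen everything clock b has seen: a simply wins.
    public static boolean descendsFrom(int[] a, int[] b) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] < b[i]) {
                return false;
            }
        }
        return true;
    }

    // Neither clock descends from the other: the versions are concurrent and
    // the conflict has to be resolved by some other logic.
    public static boolean inConflict(int[] a, int[] b) {
        return !descendsFrom(a, b) && !descendsFrom(b, a);
    }
}

In the node walkthrough above, {1, 1, 2, 1} descends from {1, 0, 0, 1}, so Ferris wins automatically, whereas {1, 1, 0, 0} and {1, 0, 0, 1} are concurrent and need other logic.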

Vector clocks are used in systems such as Amazon's Dynamo.

References:
1. Basho's blog http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/
2. Basho's blog http://blog.basho.com/2010/01/29/why-vector-clocks-are-easy/
3. Amazon's Dynamo - http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
4. Rutgers http://www.cs.rutgers.edu/~pxk/rutgers/notes/clocks/index.html

Isolation levels

Working hard at the Till
Ok so let's say your music shop, 'Breako's Beats', has a very simple database to capture all the CDs for sale in the shop.  The database is just one simple table where each row represents a different music CD such as: 'Rank 1 - Airwave' (this song is playing on iTunes as I am writing this and it's catchy, so hence the reference!)  Now, let's say there is a column in the table to indicate how many copies of each CD are available in the shop.  Now this shop is big; there are multiple tills and people can also purchase the CDs from the shop over the web.  In addition, there is someone responsible for adding new rows as new CDs come into the Breako's Beats store. This means your database will have multiple users all looking at and changing the same table, and potentially the same rows in the table, at the same time. Yikes...

This means potential concurrency issues.  Now what sort of problems could we expect if we just built this system and never thought about concurrency?

Problem number 1: Dirty Reads

Trent Reznor - how many copies left?
A dirty read is when a transaction reads data from another transaction that has been written but not yet committed. This is problematic if the other transaction never commits the data it has changed.  Suppose a customer asked you to check how many copies of Trent Reznor's 'In Motion' were left and you said 3. But this was because the inventory user was in the middle of a transaction which set this value from 0 to 3.  The inventory user then decided not to commit his transaction, for whatever reason.  So it's really 0 copies left but you saw 3.  Unless you check again you are going to end up thinking it's 3.


Problem number 2: Non-repeatable reads

Eric Saade is Popular!
This is when a transaction reads the same rows of data more than once and gets different results for those rows. In this case the different answers come from data that has been committed by other transactions. So you could be checking the number of copies you have of Eric Saade's 'Popular' and get back different answers in the same transaction.  Why? Because this time the inventory guy added more copies, updated the row corresponding to Eric Saade's 'Popular' and committed his transaction while you were still in the middle of your own transaction. As you can see, even though this may be dangerous it isn't as dangerous as a dirty read, because at least the change you saw came from a committed transaction, not a momentary blip.

The only enhancement you could make here is to block the other user from making the changes until you're finished.  This would mean you'd lock the rows you're reading.  Because this would slow things down, whether this approach makes sense really depends on what your system needs.

Problem number 3: Phantom reads

Andy Weatherall before the smoking ban!
This is similar to non-repeatable reads. This time someone rings your shop and asks you what you have from Andy Weatherall.  The first time you check you get two rows back.  You decide to check again in the same transaction, but before you do, the inventory guy gets a shipment with more of Andy's stuff.  He adds it to the system and commits his transaction.  When you re-read you get back the same two rows as before, plus the new additions.
This problem is similar to non-repeatable reads, but this time rows have been added rather than changed. The first time there were only two rows for Andy; the subsequent time there were five.

Again, this is probably not the riskiest thing in the world - at least you're being told the truth! The solution would again involve blocking the second transaction until you have finished your transaction completely.  As was the case with non-repeatable reads, it really depends on your system requirements (and how probable these problems are) whether it is actually worth going to the hassle of locking rows.

Isolation levels

In databases, "isolation level" is a property which defines how much concurrent transactions can see or can't see of each other. Isolation is the 'I' in ubiquitous ACID database acronym coined by
Andreas Reuter and Theo Harder in 1983.  Isolation level is a configurable property in most databases.  So what can it be set to?  Well there are usually four settings, namely:

Serializable, Repeatable Reads, Read Committed, Read uncommitted

The table below shows the differences.

Isolation level    | Dirty reads | Non-repeatable reads | Phantoms
Read Uncommitted   | Y           | Y                    | Y
Read Committed     | N           | Y                    | Y
Repeatable Read    | N           | N                    | Y
Serializable       | N           | N                    | N

Y  means that the problem can happen, N means it can't

My recommendation is to set it to Read Committed.  The reason is that Serializable can be overprotective and will impact performance because it incurs a lot of locking.  Repeatable Read only protects you when you re-read the same data within a transaction, something that is uncommon in most systems because too much re-reading usually results in bad performance.  You also need to consider whether your transactions are generally long or short.  If you don't have long running transactions it's less likely you are going to have collisions, and hence you may not have to be so strict about how isolated the transactions need to be.

Dirty reads can be really problematic.  They are not only misleading but produce unusual problems that are difficult to diagnose. This is because the incorrect data you got was never even persisted.

Therefore I would advise setting at least Read Committed to eliminate dirty reads.  Only go for a stronger setting if your system needs it and if problems such as phantom reads are probable.
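
In code, the isolation level is usually set per connection. Here is a minimal JDBC sketch; the connection URL, user and password are placeholders, not a real driver.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class IsolationDemo {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details - swap in your own database and driver.
        try (Connection con = DriverManager.getConnection(
                "jdbc:yourdb://localhost/breakos_beats", "user", "password")) {
            // Eliminates dirty reads; still allows non-repeatable and phantom reads.
            con.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            con.setAutoCommit(false);
            // ... run your queries and updates here ...
            con.commit();
        }
    }
}

Most databases also let you set a default level globally, so the per-connection call is only needed when you want to deviate from that default.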

As stated, more protection means more locking, and more locking means the slower your system will go as users get blocked.  Details of the locking are in the table below.

Isolation level    | Write Lock | Read Lock | Range Lock
Read Uncommitted   | N          | N         | N
Read Committed     | Y          | N         | N
Repeatable Read    | Y          | Y         | N
Serializable       | Y          | Y         | Y

Y means locking is incurred. N means it is not.

Is that all I need to do?

You wish! Isolation levels deal only with concurrent transactions. It is much more likely you'll have concurrent usage of the same data even when you don't have concurrent transactions on that data.  For example:
  • User 1 reads in the data at the beginning of his session
  • User 2 reads in the same data at the beginning of his session. User 2's session is independent of User 1's session.
  • User 2 then makes some changes and commits change. His transaction is finished.
  • User 1's data has become stale before he even starts a transaction.  He begins his transaction after User 2 has committed, but his copy of the data is out of date.  Surely he should have to respect the most up to date version of the data before making changes!
These scenarios are much more likely to happen in your system and warrant serious attention. They are dealt with by using locking policies such as optimistic locking and pessimistic locking, which involve how you architect your application layer, not just your database. A sketch of the optimistic flavour follows below.
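
Optimistic locking is commonly implemented with the same kind of version column mentioned earlier: include the version you originally read in the WHERE clause of the update and treat zero updated rows as a sign that your data went stale. This is a sketch with made-up table and column names, not a prescription.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class OptimisticLockingDemo {

    // Returns true if our update won; false if the row changed underneath us.
    // Table and column names are illustrative only.
    static boolean updateStock(Connection con, String isbn,
                               int newStock, int versionWeRead) throws SQLException {
        String sql = "UPDATE cd SET copies_in_stock = ?, version = version + 1 "
                   + "WHERE isbn = ? AND version = ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setInt(1, newStock);
            ps.setString(2, isbn);
            ps.setInt(3, versionWeRead);
            // Zero rows updated means another session committed first: our
            // in-memory copy is stale, so re-read and retry, or report a conflict.
            return ps.executeUpdate() == 1;
        }
    }
}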




Monday, June 20, 2011

Consistent Hashing

Consistent Hashing is a clever algorithm that is used in high volume caching architectures where scaling and availability are important. It is used in many high end web architectures, for example Amazon's Dynamo.  Let me try and explain it!


Firstly let's consider the problem.

Let's say your website sells books (sorry Amazon but you're a brilliant example). Every book has an author, a price, details such as the number of pages, and an ISBN which acts as a primary key uniquely identifying each book. To improve the performance of your system you decide to cache the books.  You split the cache over four servers.  You have to decide which book to put on which server.  You want to do this using a deterministic function so you can be sure where things are.  You also want to do this at low computational cost (otherwise what's the point in caching?).  So you hash the book's ISBN and then mod the result by the number of servers, which in our case is 4.  Let's call this number the book's hash key.

So let's say your books are:
  1. Toward the Light, A.C. Grayling (ISBN=0747592993)
  2. Aftershock, Philippe Legrain (ISBN=1408702231)
  3. The Outsider, Albert Camus (ISBN=9780141182506)
  4. The History of Western Philosophy, Bertrand Russell (ISBN=0415325056)
  5. The life you can save, Peter Singer (ISBN=0330454587)
... etc

After hashing the ISBN and moding the result by 4, let's say the resulting hash keys are:
  1. Hash(Toward the Light) % 4 = 2. Hashkey 2 means this book will be cached by Server 2.
  2. Hash(Aftershock) % 4 = 1. Hashkey 1 means this book will be cached by Server 1.
  3. Hash(The Outsider) % 4 = 4. Hashkey 4 means this book will be cached by Server 4.
  4. Hash(The History of Western Philosophy) % 4 = 1. Hashkey 1 means this book will be cached by Server 1.
  5. Hash(The Life you can save) % 4 = 3. Hashkey 3 means this book will be cached by Server 3.
Oh wow, doesn't everything look great? Anytime we have a book's ISBN we can work out its hash key and know what server it's on!  Isn't that just so amazing!  Well, no.  Your website has become so cool, more and more people are using it.  Reading has become so cool there are more books you need to cache.  The only thing that hasn't become so cool is your system.  Things are slowing down and you need to scale.  Vertical scaling will only get you so far; you need to scale horizontally.

Ok, so you go out and you buy another 2 servers thinking this will solve your problem.  You now have six servers.  This is where you think the pain will end, but alas it won't.  Because you now have 6 servers, your algorithm changes: instead of moding by 4 you mod by 6.  What does this mean?  When you look for a book, moding by 6 means you'll end up with a different hash key for it and hence a different server to look for it on. It won't be there and you'll have incurred a database read to bring it back into the cache.  It's not just one book; it will be the case for the majority of your books.  Why? Because the only time a book will be on the correct server, and not need to be re-read from the database, is when hash(isbn) % 4 = hash(isbn) % 6.  Mathematically this will be the minority of your books.
So, your attempt at scaling has put a burden on the majority of your cache to restructure itself, resulting in massive database re-reads.  This can bring your system down.  Customers won't be happy with you, sunshine!
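
If you want to convince yourself of that, here is a quick throwaway Java sketch that counts how many books would change server when you go from moding by 4 to moding by 6. The random UUID is just a stand-in for hashing an ISBN; any reasonably uniform hash would do.

import java.util.UUID;

public class NaiveHashingDemo {
    public static void main(String[] args) {
        int moved = 0;
        int total = 100_000;
        for (int i = 0; i < total; i++) {
            // Stand-in for hashing an ISBN; mask keeps the value non-negative.
            int hash = UUID.randomUUID().toString().hashCode() & 0x7fffffff;
            int serverBefore = hash % 4;   // four cache servers
            int serverAfter  = hash % 6;   // after adding two more
            if (serverBefore != serverAfter) {
                moved++;
            }
        }
        System.out.printf("%d of %d books changed server (%.0f%%)%n",
                moved, total, 100.0 * moved / total);
    }
}

Run it and you'll typically see roughly two out of every three books change server - each one a cache miss and a database re-read.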

We need a solution!

The solution is to come up with a system with the characteristic that when you add more servers, only a small minority of books move to new servers, meaning a minimum number of database reads.  Let's go for it!

Consistent Hashing explained

Consistent hashing is an approach where the books get the same hash key irrespective of the number of servers - unlike our previous algorithm, which mod'ed by the number of servers.  It doesn't matter if there is one server, 5 servers or 5 million servers: the books always, always, always get the same hash key.

So how exactly do we generate consistent hash values for the books?

Simple.  We use a similar approach to our initial approach except we stop moding by the number of servers. Instead we mod by something else that is constant and independent of the number of servers.

Ok, so let's hash the ISBN as before and then mod by 100.  So if you have 1,000 books, you end up with a distribution of hash keys for the books between 0 and 100, irrespective of the number of servers.  All good. All we need now is a way to figure out, deterministically and at low computational cost, which books reside on which servers.  Otherwise, again, what would be the point in caching?

So here's the ultra funky part... You take something unique and constant for each server (for example its IP address) and you pass that through the exact same algorithm. This means you also end up with a hash key (in this case a number between 0 and 100) for each server.

Let's say:
  1. Server 1 gets: 12
  2. Server 2 gets: 37
  3. Server 3 gets: 54
  4. Server 4 gets: 87
Now we assign each server to be responsible for caching the books with hash keys between its own hash key and that of the next neighbour (next in the upward direction).

This means:
  1. Server 1 stores all the books with hash key between 12 and 37
  2. Server 2 stores all the books with hash key between 37 and 54
  3. Server 3 stores all the books with hash key between 54 and 87
  4. Server 4 stores all the books with hash key between 87 and 100 and 0 and 12.
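If it helps to see that lookup in code, here is a minimal Java sketch of the ring just described. The class, method and variable names are mine, purely for illustration: a sorted map holds the servers at their hash key positions, and a book belongs to the server at the largest position at or below its own hash key, wrapping around the ring - the same convention as the list above.

import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of the ring described above.
public class BookCacheRing {

    private final TreeMap<Integer, String> ring = new TreeMap<>();

    // Stand-in hash: like the article, we mod by a constant (100) that is
    // independent of the number of servers.
    private int hashKey(String value) {
        return (value.hashCode() & 0x7fffffff) % 100;
    }

    public void addServer(String serverAddress) {
        ring.put(hashKey(serverAddress), serverAddress);
    }

    public void removeServer(String serverAddress) {
        ring.remove(hashKey(serverAddress));
    }

    // The owner is the server at the largest position <= the book's hash key,
    // wrapping round to the highest-placed server for keys below the lowest one.
    public String serverFor(String isbn) {
        int key = hashKey(isbn);
        Map.Entry<Integer, String> owner = ring.floorEntry(key);
        return (owner != null) ? owner.getValue() : ring.lastEntry().getValue();
    }
}

Adding or removing a server is just one entry in the map, which is exactly why only the books in the neighbouring range are affected.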
If you are still with me... great. Because now we are going to scale.  We are going to add two more servers.  Let's say server 5 is added and gets the hash key 20, and server 6 is added and gets the hash key 70.
This means:
  1. Server 1 will now only store books with hash keys between 12 and 20
  2. Server 5 will store the books with hash keys between 20 and 37.
  3. Server 3 will now only store books with hash keys between 54 and 70.
  4. Server 6 will store books with hash keys between 70 and 87.
Server 2 and Server 4 are completely unaffected.

Ok so this means:
  1. All books still get the same hash key. Their hash keys are consistent.
  2. Books with hash keys between 20 and 37 and between 70 and 87 are now sought from new servers.
  3. The first time they are sought they won't be there, so they will be re-read from the database and then cached on the respective servers. This is ok as long as it's only for a small number of books.
  4. There is a small initial impact on the system but it's manageable.
Now, you're probably saying:
"I get all this but I'd like to see some better distribution. When you added two servers, only two servers got their load lessoned. Could you share the benefits please?"

Of course. To do that, we allocate each server a number of small ranges rather than just one large range.  So, instead of server 2 getting one large range between 37 and 54, it gets a number of small ranges: for example 5 - 8, 12 - 17, 24 - 30, 43 - 49, 58 - 61, 71 - 74, 88 - 91.  The same goes for all servers.  The small ranges are all randomly spread, meaning that one server won't just have one adjacent neighbour but a collection of different neighbours, one for each of its small ranges.  When a new server is added it will also get a number of ranges, and a number of different neighbours, which means the benefit of adding it will be distributed more evenly.
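
A common way to implement that spread (often called virtual nodes) is simply to hash each server several times and place it on the ring at each of the resulting points. As a sketch, the addServer method from the ring class above might become something like this; again, the names are illustrative.

// Variation on the BookCacheRing sketch above: place each server on the ring
// at several points ("virtual nodes") so that its load, and the burden when it
// fails, is spread across many different neighbours instead of just one.
public void addServer(String serverAddress, int points) {
    for (int i = 0; i < points; i++) {
        // Hashing the address plus a suffix gives several independent positions.
        int position = ((serverAddress + "#" + i).hashCode() & 0x7fffffff) % 100;
        ring.put(position, serverAddress);
    }
}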

Isn't that so cool!

Consistent hashing's benefits aren't just limited to scaling. They are also brilliant for availability. Let's say server 2 goes offline.  What happens is the complete opposite of what happens when a new server is added: each one of server 2's segments becomes the responsibility of the server responsible for the preceding segment. Again, if servers are getting a fair distribution of the ranges they are responsible for, it means that when a server fails the burden will be evenly distributed amongst the remaining servers.

Again the point has to be emphasised: the books never have to be rehashed. Their hashes are consistent.
References
  1. http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
  2. http://www.tomkleinpeter.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
  3. http://michaelnielsen.org/blog/consistent-hashing/

Say Hello in Spring 2.5 / Spring 3.0

Ok you want to get set up with Spring and do a simple 'Hello World'.
Let's do it.

Step one: Sort out your IDE

There are some choices here. You can just add some Spring libs to your path, you can download some Spring plugins for Eclipse, or you can do what I recommend, which is to download SpringSource Tools. This is an Eclipse based IDE developed especially for Spring development. You get the benefits of Eclipse, a guarantee the Spring stuff will work (ever have pain with plugins not working or clashing? :-)) and you'll also get the developer edition of vFabric tc Server, which we won't need for this example but is still useful to have for future use.

Step two: Create a Spring Project

Select File / New / Spring Template Project. Select Simple Spring Utility Project.
Provide project name e.g. HelloWorld. Provide top level package e.g.

com.dublintech.springtutorial1


Select Finish. In the Project Explorer view you should see your project. Note the 's' just above the project folder icon. This indicates it's a Spring project.

Step three: Ignore (or get rid of) the sample code

You'll note that the project template comes with example source files and test files (at the time of writing these are: Service.java, ExampleService.java, ExampleConfigurationTests.java and ExampleServiceTests.java). Ignore or delete all of these.

Step four: Write your own code.

Let's define an interface: GreetingService
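
The original post showed the listings as screenshots, so here is a minimal reconstruction of what the interface could look like (the method name sayHello is my assumption):

package com.dublintech.springtutorial1;

// Plain interface - note there is no dependency on any Spring class.
public interface GreetingService {
    String sayHello();
}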

An implementation that will give us the proverbial HelloWorld:
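
A possible reconstruction of the 'Hello World' implementation:

package com.dublintech.springtutorial1;

// Returns the proverbial greeting. Still a plain POJO - no Spring imports.
public class HelloWorldService implements GreetingService {
    public String sayHello() {
        return "Hello World";
    }
}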

And another implementation of the same interface. But this time a little bit more personal!
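
And a possible reconstruction of the more personal implementation, with a name property for Spring to set:

package com.dublintech.springtutorial1;

// The name is injected by Spring via the bean definition ("Alex" in this example).
public class PersonalGreetingService implements GreetingService {
    private String name;

    public void setName(String name) {
        this.name = name;
    }

    public String sayHello() {
        return "Hello " + name;
    }
}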


Keep things simple and just put everything in the package you specified when creating the project. Notice that we have no dependencies on any Spring libraries in our interface or in either of our classes.

Step five: Add some spring configuration.

Go to file app-context.xml. It comes with the template.

Notice that the default bean that comes with the template is configured and given a Bean id.



Now it's time to make your classes Spring beans. Add bean definitions for the two implementations to app-context.xml.

Ok, no need to sweat. All we're doing here is
* giving the HelloWorldService the bean id helloWorldService
* giving the PersonalGreetingService the bean id personalGreetingService


In addition
* personalGreetingService will have its name property set to Alex.


Step Six: Write some tests so that you see your beans in action.

With respect to this test (a reconstructed listing follows after these points):
  • I have left out the imports to keep the page space to a min. Hopefully they are obvious enough
  • The @ContextConfiguration is used to hook a spring application context into the test. When no location is specified the default one is derived from the name of the class. In this case it will be: /resources/com/dublintech/springtutorial1/GreetingTests-context.xml. We'll see more about this file in the next step.
  • @RunWith(SpringJUnit4ClassRunner.class) means that the GreetingTests can run JUnit 4.4 tests *and* reap the benefits of the Spring framework such as dependency injection and loading of application contexts.
  • @Autowired is Spring shorthand for autowiring beans. In this case the first autowiring wires the bean that has the id helloWorldService into your test. The second autowiring wires the bean that has the id personalGreetingService into your test. This is where smart naming helps you and typos will wreck your head!!!

Now onto the tests themselves. The first test exercises the helloWorldService bean, and the second tests the personalGreetingService bean - checking... you guessed it... that Alex will be wired in by the Spring framework into the PersonalGreetingService bean and 'Hello Alex' will be returned.
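
Here is a reconstruction of what the test class might have looked like, based on the points above (the method names and expected strings are my assumptions):

package com.dublintech.springtutorial1;

import static org.junit.Assert.assertEquals;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration   // defaults to GreetingTests-context.xml in the same package
public class GreetingTests {

    @Autowired
    private GreetingService helloWorldService;       // wired by bean id

    @Autowired
    private GreetingService personalGreetingService; // wired by bean id

    @Test
    public void testHelloWorldService() {
        assertEquals("Hello World", helloWorldService.sayHello());
    }

    @Test
    public void testPersonalGreetingService() {
        assertEquals("Hello Alex", personalGreetingService.sayHello());
    }
}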

Step seven: Your tests need to know how to load the ApplicationContext

Create a file named: GreetingTests-context.xml. Place it in the package
com.dublintech.springtutorial1 (under src/test/resources).

It will look like:

This is pretty simple. All it does is tell your tests what Spring configuration / ApplicationContext to use.

Step Eight: Run your tests


Go back to your tests, right click and select Run As > JUnit Test.
That's it.

Comment:
This was a pretty simple example but there are a few points worth noting.

1. Note how easy it is to tell your tests what Spring configuration / ApplicationContext to use.
This means you can use different configurations, which wire your beans in different ways, and test
different things without ever having to recompile any code.

2. Spring isn't as lightweight as you might think. In this example, your IDE will use Maven
to ensure that all the jars your project needs are in place. Expand the Maven Dependencies in
your project explorer view and you'll see what jars your project is using.  Even one of Spring's simplest features still requires a lot of jars.

References
1. http://static.springsource.org/spring/docs/2.5.x/reference/testing.html

Friday, June 17, 2011

Are you still fond of your ANT?

Ok you haven't caught the Maven bug and you're still using ANT. Here's two very simple Ant tips for you...

Tip 1: echoproperties

You're trying to debug things and you'd like to know details of the Ant version and the JVM you're using. You'd also like to know details of the properties you have defined. You want all this quickly. Well, look no further than Ant's built in task: echoproperties.

Wrap it in a target of your own so you can call it from the command line, and that's it. I usually wrap it in a target called debug.

So I'd have something like this:


A sample output would be:




Tip 2: Those friggin' classpaths.

You're not a Java developer unless you've spent some hours in your life banging your head off the wall because your classpath was incorrect. This isn't just a beginner's problem; it can catch even seasoned pros out, especially when a class or set of classes is available from multiple jars. For example, you can easily use whatever
StAX implementation you want, but if you forget to put the one you want on your classpath, your runtime will use the reference implementation from the JDK. So double check your classpath before you go around boasting of the performance improvements you're getting with Woodstox!

Right, so you want a target that will echo your classpaths so you can see them as ANT sees them. Well, in ANT you can't just echo paths. But you can make property representations of paths and then echo those properties.

So let's say you set up your properties, directories and classpaths as something like


All you do is create a target which defines properties that represent these paths, and then you echo those properties.


Sample output:


Notice the way I have defined the properties locally to the target. This is because they are only relevant to the target echo.classpath. There is no need to make them global. Always try to encapsulate - even in Ant.

References:

1. http://ant.apache.org/manual/Tasks/echoproperties.html

Friday, June 10, 2011

A very good example of a Cloud Computing product from Dublin's JLizard

Logentries is an interesting product from Dublin start up JLizard. This product is a brilliant example of SaaS. Users run software hosted in the cloud; they don’t worry about complex set up; they use the software via a thin client; they pay only for what they agree to use. Now, since JLizard have a Dublin connection (it is a spin out company from the Performance Engineering Lab at UCD) we gotta fill you in!

Help me understand my logs?

We all know log files can get verbose very quickly, making it difficult to spot patterns and identify what’s important. The Logentries product solves this problem by allowing you to define tags for your logfile to identify parts of it (for example, the exceptions). It also allows you to generate a pictorial view of your logfile which highlights your tags. Not only is this a very good way to provide a summary of a logfile, it makes it much easier to spot patterns.

In addition the Logentries product allows searching and filtering to make it easier to identify the important parts of the logfile quickly. It also makes it easy to check on-line resources. For example, let’s say you get a DB2 exception with a DB2 error code. Just highlight it, and immediately you can check what Google and Google Code can tell you about that obscure error code. All clever stuff. The UI is also very user friendly. A demo is available here.

So what’s the relevance of the Cloud to all this?

Where Logentries gets really interesting is that it is architected in the Cloud as a SaaS. So what difference does this make to a tool that can make an ugly logfile look pretty? Well, quite a lot really.

Suppose you have customers in disparate places all running your software and generating logfiles on their systems. When problems inevitably happen, you will need to see the logs. This requires co-ordination and some ftp’ing, which all means time. Logentries solves this problem. It provides instant access by using an agent which is deployed in the system, listening to what is being logged. Effectively, the agent is like a smart Log4j appender - listening to what is being logged but unobtrusive to the system which does the logging. The logged information is sent to the Cloud securely in real time. This means you can view it instantly.

But if my customers are already in the cloud (or I can access their VPN) what difference does this all make?

The point here is that the customer does not have to move to the cloud. Many organisations are reluctant to move their architectures to a public cloud, or simply don’t have the need - virtualisation suffices. The Logentries agent means the load balancer, the app server, the byte code and the database data stay where they are, because the agent handles the communication – securely. All that ends up in the cloud is the logfile.

Any more?

Of course. The Logentries product is a very good example of the usefulness of the elasticity provided by a Cloud architecture. If more logs are generated than you anticipated and you need more disk space and upload bandwidth, it’s no problem because the Cloud means if you need more resources you can get them – quickly.

Don’t forget it’s a SaaS!

Logentries is a brilliant example of a SaaS. You pay to use the software, it resides in the cloud, set up and ready to go. You only pay for what you use. It is a brilliant example of the type of products we can expect as the computer industry moves to generation Cloud.

References:

1. http://twitter.com/logentries

2. https://logentries.com/

3. https://logentries.com/blog/

4. http://www.linkedin.com/company/jlizard%27s-logentries

5. http://www.siliconrepublic.com/start-ups/item/21146-jlizard-secures-50-000-inv

Saturday, June 4, 2011

Cloudspeak

Let's have a look at some of the technology lexicon associated with Cloud Architectures.

BigTable
Cloud database offered by Google. It is non-relational and highly scalable.

Cloud computing
There are five principles of cloud computing
1. Resources are pooled
2. Machines are virtualised (achieves maximum utilisation)
3. Elasticity. Users can scale up or down very easily.
4. Virtual machines can be created or deleted automatically
5. Billing is by resource usage rather than by a flat fee

CloudBursting
Cloudbursting concerns hybrid architectures where a classical enterprise architecture can make use of a cloud on demand. This means that part of the architecture can be behind a private firewall and be kept away from the cloud completely but the elastic benefits of the cloud are still possible should periodic or unexpected traffic occur. In the cloud bursting model, the load balancer is not in the cloud. The load balancer decides when to use the cloud based on demand and traffic.

Commodity computers
Commodity computers are cheaper computers used in architectures which do not require the hardware to be highly reliable. This is usually possible when the software has a high degree of failover incorporated. Google's MapReduce framework uses commodity computers and then reassigns tasks if any of the commodity computers fail and do not finish allocated tasks.

Data centre
The physical home which stores all computational resources
Facebook has various data centres in the US see:
http://www.datacenterknowledge.com/the-facebook-data-center-faq/

Infrastructure as a Service (IaaS)
This is the lowest level of service available from a Cloud. In this case, the Cloud provider simply provides virtual machine images with an operating system. Amazon's EC2 is an example of IaaS.

Hypervisor (also called virtual machine manager)
A thin layer of software that allocates hardware resources dynamically and transparently to virtual machines. The term hypervisor was coined as an evolution of the term "supervisor," the software that provided control on earlier hardware.

Platform as a service (PaaS)
Allows users to create their own application using the Cloud provider's platform and tools. This allows rapid development but also means there is a risk of vendor lock-in.

Example: Google's AppEngine, Microsoft's Azure, force.com

SimpleDB
Cloud database offered by Amazon. It is non-relational and highly scalable.

Sharding
Sharding is based on the "shared-nothing" principle. There are no dependencies between different portions of data. To achieve this usually involves denormalising data so that dependent data is stored together. The result is parallel processing of independent data is possible and hence higher concurrency is possible.

Data can also then be partitioned very easily. This is usually done horizontally - splitting up rows into separate partitions. Each individual partition is referred to as a shard or database shard. Partitioning data means the total number of rows in each table is reduced. This reduces index size, which generally improves search performance.

Software as a service (SaaS)
Allows users to run existing online applications.

Example: Salesforce.com, pixlr.com, jaycut.com

Private Cloud
In this model there are no subscribing customers; the computing resources are controlled by a single organisation. But the resources are still pooled and shared, and machines are still virtualised. The difference between the private cloud and standard virtualisation is that in the private cloud model, virtual machine creation and deletion can be automated and achieved very quickly, so the elastic scaling characteristics associated with cloud computing - which can't be achieved with standard virtualisation - are possible.

References:
1. Google AppEngine http://en.wikipedia.org/wiki/Appengine
2. Google's map reduce http://en.wikipedia.org/wiki/Mapreduce
3. The Cloud at your Service, Jothy Rosenberg and Arthur Mateos (Manning).