23 Jul 2008

feedPlanet Apache

Justin Mason: Links for 2008-07-23

Why Spam Can’t Be Stopped â€" Emailappenders And Others Sell Bogus Lists Marketing company buys list of addresses, 85% of the 100k addresses bounce, marketer gets booted by ISP for spamming, marketer issues complaining press release. Let's say it again: opt-in permission can't be sold, and address list vendors are spammers

23 Jul 2008 11:05pm GMT

Tom White: Pluggable Hadoop

Update: This quote from Tim O'Reilly in his OSCON keynote today sums up the changes I describe below: "Do less and then create extensibility mechanisms." (via Matt Raible)

I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because as more people start using Hadoop it needs to be more malleable (people want to plug in their own implementations of things), and the way to do that in software is through modularity.

Some examples:

Job scheduling

The current scheduler is a simple FIFO scheduler which is adequate for small clusters with a few cooperating users. On larger clusters the best advice has been to use HOD (Hadoop On Demand), but that has its own problems with inefficient cluster utilization. This situation led to a number of proposals to make the scheduler pluggable (HADOOP-2510, HADOOP-3412, HADOOP-3444). Already there is a fair scheduler implementation (like the Completely Fair Scheduler in Linux) from Facebook.

HDFS block placement

Today the algorithm for placing a file's blocks across datanodes in the cluster is hardcoded into HDFS, and while it has evolved, it is clear that a one-size-fits-all approach is not necessarily the best approach. Hence the new proposal to support pluggable block placement algorithms.

Instrumentation

Finding out what is happening in a distributed system is a hard problem. Today, Hadoop has a metrics API (for gathering statistics from the main components of Hadoop), but there is interest in adding other logging systems, such as X-Trace, via a new instrumentation API.

Serialization

The ability to use pluggable serialization frameworks in MapReduce appeared in Hadoop 0.17.0, but has received renewed interest due to the talk around Apache Thrift and Google Protocol Buffers.

Component lifecycle

There is work being done to add a lifecyle interface to Hadoop components. One of the goals is to make it easier to subclass components, so they can be customized.

Remove dependency cycles

This is really just good engineering practice, but the existence of dependencies makes it harder to understand, modify and extend code. Bill de hÓra did a great analysis of Hadoop's code structure (and its deficiencies), which has lead to some work to enforce module dependencies and remove the cycles.

23 Jul 2008 9:18pm GMT

Rajith Attapattu: 5 reasons why Distributed Systems are hard to program

Here are 5 reasons why I found distributed system are hard to program. This is not some sort of thorough analysis, but merely my observations in dealing with such systems. For completeness, here is the definition of "Distributed System" I used.
A distributed system contains of more than one process that runs as a single system. These processes can be on the same computer or multiple computers that are on a local area network or geographically distributed over a wide area network.

Without any further do here are the reasons in no particular order.

1. Difficulty in identifying and dealing with failures.
When communicating between processes failures can happen at many levels. Dealing with them is not trivial. Of course you rely on frameworks based on technologies like RMI, CORBA, COM, SOAP, AMQP, REST(is an architectural style not a standard) etc to handle these. But the fact remains that you still need to clearly think about these cases and handle these situations properly.

For example if we consider a simple interaction between two processes on different computers, the following failures can happen.

Sometimes the framework you use, is unable to/may not report all these error cases. Sometimes when the error is reported, it may not contain enough information to figure out at which level the error occurred.
Did it reach the remote computer? if so how far up the stack did it go?. If the receiving process got the request or message did the error occur before or after the request/message was processed?
In some cases where idempotency is built into the the receiving application or the framework/protocol (ex a message client that detects duplicate messages, or doing an HTTP GET) a simple retry maybe ok. In some cases Idempotency and retrying maybe expensive or difficult to implement. In such cases careful thought needs to be given on how these different errors are identified and handled.

2. Achieving consistency in data across processes.
One of the hardest problems in programming distributed systems is achieving a consistent view of data across the processes. When one processes updates some data, you need to replicate them across the other processes, so if any other process decides to operate on the same set of data, then it is doing so on the most current copy.
Lets look at two examples.

Assume a global banking application for ABC bank. A customer goes to a branch in New York, US and deposits money to an account. A few moments later his relative in London, UK does a withdraw on that account. Due to latency there is obviously a time lag before the process in London, UK sees the updated amount in the account.

In an online trading system, a user in NY places an item for sale. The transaction is updated on the closest data center which is in Boston. A few moments later another user in LA is searching for the exact same item and is served off a data center in Phoenix. The user in LA may or may not see the item due to the latency involved in replicating the data across

For example 1 strong consistency is required, while for example 2, you could get away with weak consistency, for example by setting an SLA that says data is valid within a 5 min time window.
This is not an easy problem to solve and this area itself is a subject on its own. Wener Vogels wrote a nice peice on this called Eventually Consistent which is worth reading.
Of course there are specialized frameworks/libraries that can handle this for you. But still there is no escape for you and you pretty much need to have an understanding of the pros and cons of various approaches, failure modes etc.

3. Heterogeneous nature of the components involved in the system.
A distributed system may contain components written in a variety of languages deployed across machines with different architectures and operating systems. Needless to say that this poses certain challenges (especially integration, interoperability issues) when implementing the system. A whole range of standards/technologies were presented to solve these issues, including but not limited to CORBA, SOAP, AMQP, REST (is an architectural style not a standard) and RPC based frameworks like ICE, Thrift, Etch etc. Anyone who has worked with these technologies knows that neither of these are trivial to use nor provide a complete solution in every situation.

If anybody has read the recent posts by Steve Vinoski and the discussions around it would realize the issues/challenges surrounding RPC. The following paper discuss the impedance mismatch problems when working with IDL based systems. The issues with type systems and data formats are not limited to RPC only. When using a message oriented approach like SOAP (doc lit style) or AMQP you will end up tunneling data thats not supported by the protocol as a string or a sequence of bytes. When using REST you would need to represent your resource in a format the requesting application understands/supports, which maybe quite different from the native format.

Again not an easy issue to deal with no matter what technology or framework is used. As an architect/developer you need to understand these issues and deal with them accordingly.

4. Testing a distributed system is quite difficult.
This is arguably one of the hardest aspects of developing a distributed system. Verification of the behavior and impact of your code in the system is not easy.
There are many aspects that needs to be tested, and doing so before every checkin is not a fun task at all. Running some of these tests before every checkin is not practical. But its a good idea to run them nightly and some tests during the weekend. Here are some of the areas that needs to be tested (I plan to write another blog entry elaborating on the testing aspects).

Most often than not developers cut corners in their testing as running these tests are tedious and time consuming. Also these tests need to be run regularly to catch issues in a timely manner and the best way to tackle this issue is to automate as much testing as possible. There many options with continuous build systems like cruisecontrol or using a plain old cron job.
Functionality testing, TCK compliance, certain types of integration and interoperability tests can be run periodically.
In most organizations test machines are just lying around doing nothing during the night (unless around the clock testing is done with development centers in different time zones.). Instead of wasting computing cycles, you could automate test suites to run during the night. More time consuming integration and interoperability tests, performance, stress and soak testing can be done nightly, while more longer duration soak testing can be scheduled to run during the weekends.

While testing is a tough issue for any type of system, distributed systems have a lot more failure points which adds to the complexity.
Getting these tests right to cover these failure points and executing them needs a lot of careful thought and planning.

5. The technologies involved in distributed systems are not easy to understand .
Distributed system are not easy to understand. Neither are the myriad of technologies used in developing these systems.
Most folks find it difficult to grasp the concepts behind these technologies. If you look into the discussions and misconceptions surrounding REST you can understand what I am trying to get at. CORBA was not an easy spec to understand, so is WS-* or AMQP. While it is true that you don't need to understand everything to develop using them, you still need at least a reasonable understanding to figure how to tackle some of the above mentioned issues. Frameworks based on these technologies are touted as the cure for these problems. Sure they could help, but it still does not shift the burden away from you.
To compound the issue all sorts of vendors keep touting their technology/framework as the next silver bullet. No matter what vendor you use, at the end of the day you are still responsible for getting it right. And it is not an easy task. You need to face the reality that distributed systems are hard and that you cannot hide every complexity behind some framework.

23 Jul 2008 9:14pm GMT