23 Jul 2008
Planet Apache
Justin Mason: Links for 2008-07-23
Why Spam Can’t Be Stopped – Emailappenders And Others Sell Bogus Lists: Marketing company buys list of addresses, 85% of the 100k addresses bounce, marketer gets booted by ISP for spamming, marketer issues complaining press release. Let's say it again: opt-in permission can't be sold, and address list vendors are spammers
23 Jul 2008 11:05pm GMT
Tom White: Pluggable Hadoop
Update: This quote from Tim O'Reilly in his OSCON keynote today sums up the changes I describe below: "Do less and then create extensibility mechanisms." (via Matt Raible)
I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because as more people start using Hadoop it needs to be more malleable (people want to plug in their own implementations of things), and the way to do that in software is through modularity.
Some examples:
Job scheduling
The current scheduler is a simple FIFO scheduler which is adequate for small clusters with a few cooperating users. On larger clusters the best advice has been to use HOD (Hadoop On Demand), but that has its own problems with inefficient cluster utilization. This situation led to a number of proposals to make the scheduler pluggable (HADOOP-2510, HADOOP-3412, HADOOP-3444). Already there is a fair scheduler implementation (like the Completely Fair Scheduler in Linux) from Facebook.
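To make the idea of a pluggable scheduler concrete, here is a minimal sketch in Python. It is not Hadoop's API (Hadoop is written in Java, and the proposals above define their own interfaces); the `Scheduler`, `FifoScheduler` and `FairScheduler` names, and the "serve the least-served user first" rule, are assumptions made purely for illustration.

```python
from abc import ABC, abstractmethod
from collections import defaultdict, deque


class Scheduler(ABC):
    """Hypothetical plug-in point: the cluster asks it which job runs next."""

    @abstractmethod
    def submit(self, user, job):
        ...

    @abstractmethod
    def next_job(self):
        ...


class FifoScheduler(Scheduler):
    """Runs jobs strictly in submission order."""

    def __init__(self):
        self._queue = deque()

    def submit(self, user, job):
        self._queue.append((user, job))

    def next_job(self):
        return self._queue.popleft() if self._queue else None


class FairScheduler(Scheduler):
    """Illustrative fairness rule: serve the waiting user served least so far."""

    def __init__(self):
        self._queues = defaultdict(deque)
        self._served = defaultdict(int)

    def submit(self, user, job):
        self._queues[user].append(job)

    def next_job(self):
        waiting = [user for user, queue in self._queues.items() if queue]
        if not waiting:
            return None
        user = min(waiting, key=lambda u: self._served[u])
        self._served[user] += 1
        return user, self._queues[user].popleft()


def drain(scheduler, submissions):
    """Submit everything, then record the order in which jobs are handed out."""
    for user, job in submissions:
        scheduler.submit(user, job)
    order = []
    while (item := scheduler.next_job()) is not None:
        order.append(item)
    return order


submissions = [("alice", "a1"), ("alice", "a2"), ("alice", "a3"), ("bob", "b1")]
fifo_order = drain(FifoScheduler(), submissions)
fair_order = drain(FairScheduler(), submissions)
```

With FIFO, bob's single job waits behind all three of alice's; with the fair variant it runs second. The calling code (`drain`) does not change, which is the point of making the scheduler a plug-in.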
HDFS block placement
Today the algorithm for placing a file's blocks across datanodes in the cluster is hardcoded into HDFS, and while it has evolved, it is clear that a one-size-fits-all approach is not necessarily the best approach. Hence the new proposal to support pluggable block placement algorithms.
Instrumentation
Finding out what is happening in a distributed system is a hard problem. Today, Hadoop has a metrics API (for gathering statistics from the main components of Hadoop), but there is interest in adding other logging systems, such as X-Trace, via a new instrumentation API.
Serialization
The ability to use pluggable serialization frameworks in MapReduce appeared in Hadoop 0.17.0, but has received renewed interest due to the talk around Apache Thrift and Google Protocol Buffers.
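As a rough illustration of what "pluggable serialization" means, the sketch below picks a serializer per type from an ordered list of plug-ins. It is written in Python with the standard `json` and `pickle` modules and does not reproduce Hadoop's actual Java interfaces; the class names `Serialization` and `SerializationFactory` and the `accepts` method are assumptions for this example.

```python
import json
import pickle


class Serialization:
    """Hypothetical plug-in interface: says which types it handles."""

    def accepts(self, cls):
        raise NotImplementedError

    def serialize(self, obj):
        raise NotImplementedError

    def deserialize(self, data):
        raise NotImplementedError


class JsonSerialization(Serialization):
    """Handles plain dicts and lists as UTF-8 encoded JSON."""

    def accepts(self, cls):
        return issubclass(cls, (dict, list))

    def serialize(self, obj):
        return json.dumps(obj, sort_keys=True).encode("utf-8")

    def deserialize(self, data):
        return json.loads(data.decode("utf-8"))


class PickleSerialization(Serialization):
    """Catch-all fallback for any other Python object."""

    def accepts(self, cls):
        return True

    def serialize(self, obj):
        return pickle.dumps(obj)

    def deserialize(self, data):
        return pickle.loads(data)


class SerializationFactory:
    """Returns the first registered plug-in that accepts the given type."""

    def __init__(self, serializations):
        self._serializations = list(serializations)

    def for_class(self, cls):
        for serialization in self._serializations:
            if serialization.accepts(cls):
                return serialization
        raise LookupError(f"no serialization registered for {cls.__name__}")


factory = SerializationFactory([JsonSerialization(), PickleSerialization()])
record = {"word": "hadoop", "count": 3}
chosen = factory.for_class(type(record))
round_tripped = chosen.deserialize(chosen.serialize(record))
```

Supporting another framework would then mean adding one more class to the list passed to the factory, without touching the code that reads and writes records.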
Component lifecycle
There is work being done to add a lifecycle interface to Hadoop components. One of the goals is to make it easier to subclass components, so they can be customized.
Remove dependency cycles
This is really just good engineering practice, but the existence of dependency cycles makes it harder to understand, modify and extend code. Bill de hÓra did a great analysis of Hadoop's code structure (and its deficiencies), which has led to some work to enforce module dependencies and remove the cycles.
23 Jul 2008 9:18pm GMT
Rajith Attapattu: 5 reasons why Distributed Systems are hard to program
Here are 5 reasons why I have found distributed systems hard to program. This is not some sort of thorough analysis, merely my observations from dealing with such systems. For completeness, here is the definition of "Distributed System" I used.
A distributed system consists of more than one process running as a single system. These processes can be on the same computer, or on multiple computers that are on a local area network or geographically distributed over a wide area network.
Without further ado, here are the reasons in no particular order.
1. Difficulty in identifying and dealing with failures.
When communicating between processes, failures can happen at many levels. Dealing with them is not trivial. Of course you rely on frameworks based on technologies like RMI, CORBA, COM, SOAP, AMQP, REST (which is an architectural style, not a standard) etc. to handle these. But the fact remains that you still need to think clearly about these cases and handle these situations properly.
For example if we consider a simple interaction between two processes on different computers, the following failures can happen.
- Failures that occur within the process that initiates the communication (sending the message or invoking the RPC call).
- Failures between the time the process hands over the request to the OS and the OS writing it to the network.
- Network failures between the time it takes to transmit the packets from one computer to the other.
- Failures between the time the OS on the receiving end receives the packets and the time it hands them over to the recipient process.
- Failures that occur when the recipient process tries to process the request/message.
Sometimes the framework you use is unable to report, or simply does not report, all these error cases. Sometimes when the error is reported, it may not contain enough information to figure out at which level the error occurred.
Did it reach the remote computer? If so, how far up the stack did it go? If the receiving process got the request or message, did the error occur before or after the request/message was processed?
In some cases where idempotency is built into the receiving application or the framework/protocol (e.g. a message client that detects duplicate messages, or doing an HTTP GET), a simple retry may be OK. In other cases idempotency and retrying may be expensive or difficult to implement. In such cases careful thought needs to be given to how these different errors are identified and handled.
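The duplicate-detection case can be sketched in a few lines of Python. Everything here is hypothetical and simplified for illustration: `Receiver` remembers message IDs it has already applied, `LossyLink` simulates a network that delivers the message but loses the acknowledgement, and `send_with_retry` is a naive retry loop with no backoff.

```python
class Receiver:
    """Hypothetical receiver that makes deposits idempotent via message IDs."""

    def __init__(self):
        self.balance = 0
        self._seen = set()

    def handle(self, message_id, amount):
        if message_id in self._seen:
            return "duplicate"  # already applied, do not apply again
        self._seen.add(message_id)
        self.balance += amount
        return "applied"


class LossyLink:
    """Simulated link: the message arrives, but the first acks are lost."""

    def __init__(self, receiver, acks_to_drop):
        self._receiver = receiver
        self._acks_to_drop = acks_to_drop
        self.deliveries = 0

    def send(self, message_id, amount):
        result = self._receiver.handle(message_id, amount)
        self.deliveries += 1
        if self._acks_to_drop > 0:
            self._acks_to_drop -= 1
            # The sender cannot tell whether the message was processed.
            raise TimeoutError("no acknowledgement received")
        return result


def send_with_retry(link, message_id, amount, max_attempts=3):
    """Retry on timeout; safe only because the receiver detects duplicates."""
    for attempt in range(1, max_attempts + 1):
        try:
            return link.send(message_id, amount), attempt
        except TimeoutError:
            if attempt == max_attempts:
                raise


receiver = Receiver()
link = LossyLink(receiver, acks_to_drop=1)
result, attempts = send_with_retry(link, "deposit-42", 100)
```

The sender needed two attempts and the message was delivered twice, yet the balance was only changed once. Without the `_seen` check, the same retry would have deposited the money twice, which is exactly the "before or after the request was processed?" ambiguity described above.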
2. Achieving consistency in data across processes.
One of the hardest problems in programming distributed systems is achieving a consistent view of data across the processes. When one process updates some data, you need to replicate it across the other processes, so that if any other process decides to operate on the same data, it does so on the most current copy.
Let's look at two examples.
Assume a global banking application for ABC bank. A customer goes to a branch in New York, US and deposits money to an account. A few moments later his relative in London, UK does a withdraw on that account. Due to latency there is obviously a time lag before the process in London, UK sees the updated amount in the account.
In an online trading system, a user in NY places an item for sale. The transaction is updated on the closest data center, which is in Boston. A few moments later another user in LA searches for the exact same item and is served off a data center in Phoenix. The user in LA may or may not see the item, due to the latency involved in replicating the data across the data centers.
Example 1 requires strong consistency, while in example 2 you could get away with weak consistency, for instance by setting an SLA that says data is valid within a 5 minute time window.
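One way to picture the difference is as a read policy. The Python sketch below is an assumption-laden toy, not how any particular product works: `Replica`, `read` and the `max_staleness` parameter are hypothetical names, and timestamps are plain numbers in seconds. A read with no staleness allowance always goes to the primary copy (the bank case); a read with a 5 minute allowance may be served from a local replica (the item listing case).

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical local copy of a value and when it was last synchronized."""
    value: int
    last_synced: float  # seconds, on the same clock as `now`


def read(primary_value, replica, now, max_staleness=None):
    """Return (value, source).

    max_staleness=None models strong consistency: always read the primary.
    Otherwise the replica is used while its age is within the allowed window.
    """
    if max_staleness is not None and now - replica.last_synced <= max_staleness:
        return replica.value, "replica"
    return primary_value, "primary"


FIVE_MINUTES = 5 * 60
replica = Replica(value=100, last_synced=1000.0)
primary_value = 150  # the primary has already seen a newer update

# Bank balance: no staleness tolerated, so pay the latency to reach the primary.
bank_read = read(primary_value, replica, now=1060.0)

# Item listing: a copy that is 60 seconds old is within the 5 minute SLA.
listing_read = read(primary_value, replica, now=1060.0, max_staleness=FIVE_MINUTES)

# Same listing read once the replica is 400 seconds old: outside the SLA.
expired_read = read(primary_value, replica, now=1400.0, max_staleness=FIVE_MINUTES)
```

The listing read returns the older value 100 and that is acceptable under the SLA; the bank read must return 150 even though it costs a round trip to wherever the primary lives.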
This is not an easy problem to solve, and this area is a subject on its own. Werner Vogels wrote a nice piece on this called Eventually Consistent, which is worth reading.
Of course there are specialized frameworks/libraries that can handle this for you. But there is still no escape: you need an understanding of the pros and cons of the various approaches, their failure modes etc.
3. Heterogeneous nature of the components involved in the system.
A distributed system may contain components written in a variety of languages, deployed across machines with different architectures and operating systems. Needless to say, this poses certain challenges (especially integration and interoperability issues) when implementing the system. A whole range of standards/technologies has been proposed to solve these issues, including but not limited to CORBA, SOAP, AMQP, REST (which is an architectural style, not a standard) and RPC based frameworks like ICE, Thrift, Etch etc. Anyone who has worked with these technologies knows that none of them is trivial to use, and none provides a complete solution in every situation.
Anybody who has read the recent posts by Steve Vinoski and the discussions around them will realize the issues/challenges surrounding RPC. The following paper discusses the impedance mismatch problems when working with IDL based systems. The issues with type systems and data formats are not limited to RPC. When using a message oriented approach like SOAP (doc lit style) or AMQP, you will end up tunneling data that's not supported by the protocol as a string or a sequence of bytes. When using REST, you need to represent your resource in a format the requesting application understands/supports, which may be quite different from the native format.
Again, this is not an easy issue to deal with, no matter what technology or framework is used. As an architect/developer you need to understand these issues and deal with them accordingly.
4. Testing a distributed system is quite difficult.
This is arguably one of the hardest aspects of developing a distributed system. Verification of the behavior and impact of your code in the system is not easy.
There are many aspects that need to be tested, and doing so before every checkin is not a fun task at all. Running some of these tests before every checkin is not practical, but it's a good idea to run them nightly, and some tests during the weekend. Here are some of the areas that need to be tested (I plan to write another blog entry elaborating on the testing aspects).
- Functionality testing (can be covered with well written unit testing)
- Integration testing - you need to test the distributed system as a whole with all the components involved
- Interoperability testing - this is crucial when heterogeneous components (different languages, OS) are involved, and is quite different from integration testing
- TCK compliance - If your system is based on standards/specifications, then you need to ensure that you haven't broken anything w.r.t compliance
- Performance testing - to ensure that your changes haven't accidentally caused a degradation in performance
- Stress testing - to ensure that your checkin hasn't accidentally caused any stability issues, e.g. an increased chance of deadlocks when the load increases
- Soak testing - to ensure that your checkin hasn't caused any longevity issues, e.g. a memory leak that only manifests after a couple of hours or days
More often than not, developers cut corners in their testing because running these tests is tedious and time consuming. These tests also need to be run regularly to catch issues in a timely manner, and the best way to tackle this is to automate as much testing as possible. There are many options, from continuous build systems like CruiseControl to a plain old cron job.
Functionality testing, TCK compliance, certain types of integration and interoperability tests can be run periodically.
In most organizations test machines are just lying around doing nothing during the night (unless around the clock testing is done with development centers in different time zones). Instead of wasting computing cycles, you could automate test suites to run during the night. More time consuming integration and interoperability tests, and performance, stress and soak testing, can be done nightly, while longer duration soak testing can be scheduled to run during the weekends.
While testing is a tough issue for any type of system, distributed systems have a lot more failure points, which adds to the complexity.
Getting these tests right to cover these failure points and executing them needs a lot of careful thought and planning.
5. The technologies involved in distributed systems are not easy to understand.
Distributed systems are not easy to understand. Neither are the myriad technologies used in developing them.
Most folks find it difficult to grasp the concepts behind these technologies. If you look into the discussions and misconceptions surrounding REST you can understand what I am trying to get at. CORBA was not an easy spec to understand, and neither is WS-* or AMQP. While it is true that you don't need to understand everything to develop using them, you still need at least a reasonable understanding to figure out how to tackle some of the above mentioned issues. Frameworks based on these technologies are touted as the cure for these problems. Sure, they can help, but they do not shift the burden away from you.
To compound the issue all sorts of vendors keep touting their technology/framework as the next silver bullet. No matter what vendor you use, at the end of the day you are still responsible for getting it right. And it is not an easy task. You need to face the reality that distributed systems are hard and that you cannot hide every complexity behind some framework.
23 Jul 2008 9:14pm GMT
Diwaker Gupta: Introducing uBoggle!
As I mentioned recently, I have been dabbling with Google App Engine for fun with some of my friends. I think we are now at a stage where we could use some more players! :)
Ladies and Gentlemen, I'm proud to present uBoggle - the best game of its kind out there! If you like games like Scrabble, you'll love uBoggle :)
We have some amazing features - such as word highlight, word meanings and board rotation - and addictive game play, so do give it a try. You can play without logging in, but if you log in, you will get access to additional features such as game history. Many more features are on the horizon, so check back often.
If you like the game, please leave a comment here. Thanks!
23 Jul 2008 5:53pm GMT
Ruwan Linton: Open Source Web Services with Ruby
Ruby is an interesting dynamic programming language. It is often considered a scripting language, but I don't think it is fair to categorize it as just that; it should be considered a fully fledged programming language.
Proving that, WSF/Ruby adds the capability of doing Web Services with Ruby. WSF/Ruby is released under the open source Apache Software License 2.0, and the WSF/Ruby team at WSO2 has just released version 1.1.0 today (23rd July 2008). WSF/Ruby enables you to consume and provide Web Services with Ruby, both with REST and with the power of the WS-* stack, including WS-Reliable Messaging, WS-Security, WS-Addressing and MTOM attachments.
Key features of the WSF/Ruby framework include:
- Client API to consume Web services
- Service API to provide Web services
- Attachments with MTOM
- WS-Addressing
- WS-Security
- WS-Reliable Messaging
- WSDL mode support for both client and server side
- REST Support
and many more... Try it yourself; WSF/Ruby brings the power of Ruby and Web Services into one space :-)
WSO2 provides the support for Web Services in a number of languages including Java, C, Perl, Ruby and so on... You may have a look at the developer portal for more information.
23 Jul 2008 4:22pm GMT
Rodent of Unusual Size (Ken Coar): I'm starting to remember why I don't like Westin hotels
I'm staying at the Westin hotel in Portland, Oregon for OSCON this year. When I made the reservation something tickled the back of my mind about the chain, but I couldn't remember what it was. Now that I'm here, I'm starting to remember..
Right now the thing that's popping my corn is how they handle network access. There's free wifi in the lounge area, allegedly, but the rooms only have hardline access. Which is not free; it's US$12.95 per 24 hours. And it needs to be renewed every day. And it's charged per IPA assigned. Since I habitually use two laptops, that means they want to (and did, the first night) charge me US$25.90 just to get online from my room for a day.
In this day and age, there are so many aspects of that which are patently ridiculous. For instance:
- Charging for network access? Wake up, Starwood! Many of your competitors have realised that this is a commodity, not a luxury, and free - or at least cheap - net access is a selling point.
- Having to renew every day? Admittedly I've only been to a few hotels that had a 'for the duration of your stay' option, but, by methyl cellulose, more should!
- Charging per IPA? Come on! I'm one guest, you're already making me pay for access.. give me all the access I need!
On the [scant] positive side, they don't futz around with filtering SSH or SSL or SMTP, which is a welcome rarity.
But that's not enough for me to want to ever stay at a Westin again. And Chris DiBona tells me most of the Starwood hotels are the same..
23 Jul 2008 3:44pm GMT
Ben Laurie: Getting At Public Data
The government has quietly launched two quite fascinating initiatives. I have no idea why there wasn't more fanfare. I was even at OpenTech, where one was announced, and I didn't know!
Firstly, Show Us A Better Way
Ever been frustrated that you can't find out something that ought to be easy to find? Ever been baffled by league tables or 'performance indicators'? Do you think that better use of public information could improve health, education, justice or society at large?
The UK Government wants to hear your ideas for new products that could improve the way public information is communicated.
And 20 grand for the best ideas, too.
Secondly, The Public Sector Unlocking Service (Beta). I love that they put "Beta" in there. Tell them about crown copyright data some bureaucrat is hoarding, and they'll read them the riot act. Awesome.
23 Jul 2008 1:46pm GMT
Henning Schmiedehausen: WADA rocks!
http://www.sport1.de/de/apps/news/news-meldung/news_2303983.html
If you can't read German: It seems that WADA (the world anti doping agency) and pharma giant Roche have an agreement so that Roche puts a "secret molecule" (probably some kind of tracer) into their EPO product, which can be tracked by anti-doping tests.
Now that is cool. Let's see how many athletes suddenly become injured or sick right before the Olympics….
23 Jul 2008 12:32pm GMT
Yoav Shapira: Happy birthday HubSpot!
Yesterday almost everyone at HubSpot went out to celebrate the second birthday of our internet marketing company. Alyssa and Ellie did a great job organizing the event, as they always do, and I had a blast. I think everyone had a lot of fun.
We went out to Kings for some food, drinking, and bowling. Apparently we have some ringers who are better bowlers than expected. I did just OK, but then again I bowl roughly once every 3 years, so my expectations are low. Team spirit was evident throughout the event, which was cool. And the new people are blending in really nicely.
We posted a bunch of pictures on Flickr if you're curious.
Happy birthday HubSpot!
23 Jul 2008 11:35am GMT
Ben Laurie: The Register on Security
So, The Register has a story on Mozilla doing security metrics. Which is cool.
But what tickles me is that The Register thinks I should download an Excel file to read more about the project. Yeah, right.
23 Jul 2008 3:54am GMT
22 Jul 2008
Ted Husted: Ajax Experience 2008 - More Ted, 'nuff said
I'll be giving three -- count 'em, three -- presentations at the Ajax Experience at the end of September. Two talks are Struts-related reprises from last year, and the third talk, new this year, dives into popular tools for testing Ajax applications.
Ajax Testing Tool Review
Not long ago, testing Ajax components meant play-testing a page by hand. Today, there are a growing number of tools we can use to simplify and automate Ajax testing.
In this session we will cover when to test, what to test and how to test Ajax components. You will learn how to create automatic tests with various tools, including YUI Test, OpenQA Selenium and TIBCO Test Automation Kit, and how to use Ajax testing tools with IDEs and Continuous Integration systems.
In this session, you will learn:
- When, where and how to test Ajax components;
- How to create automatic tests with various tools;
- How to use Ajax testing tools with IDEs and Continuous Integration systems.
Struts on Ajax: Retrofitting Struts with Ajax Taglibs
Struts is Java's most popular web framework. Ajax is the web's hottest user interface. What happens when we put Struts on Ajax?
In this session, we stir some Ajax wizardry into a conventional Struts application, without all the sweat and bother of writing our own JavaScript. Struts 1 and Struts 2 both support Ajax taglibs that look and feel just like ordinary JSP tags. If it's just a little bit of Ajax that you want, these tags will get you around the learning curve in record time.
During the session, we will cover
- Using the Java Web Parts taglib with Struts 1
- Using the Ajax YUI plugin with Struts 2
Who should attend: Struts developers who would like to utilize Ajax with existing applications, and Ajax developers who would like to utilize Struts as a backend.
To get the most from this session, some familiarity with Struts or a similar framework is helpful.
To register, visit the Ajax Experience site.
Ajax on Struts: Coding an Ajax Application with Struts 2
Ajax is the web's hottest user interface. Struts is Java's most popular web framework. What happens when we put Ajax on Struts?
In this session, we look at writing a new Struts 2 application from square one, using the Yahoo User Interface (YUI) Library on the front end, and Struts 2 on the backend. YUI provides the glitz and the glamour, and Struts 2 provides the dreary business logic, input validation, and text formatting.
During the session, we will cover
- How to integrate an Ajax UI with Struts 2
- Basics of the Yahoo User Interface (YUI) Library
- Business services Struts can provide to an Ajax UI
Who should attend: Ajax developers who would like to utilize Struts as a back-end, and Struts developers who would like to utilize Ajax as a front-end.
To get the most from this session, some familiarity with an Ajax library, like YUI or Dojo, is helpful.
Visit the Ajax Experience site to register.
22 Jul 2008 11:37pm GMT
Justin Mason: Links for 2008-07-22
ZSFA - I Want The Mutt Of Feed Readers Zed recommends Newsbeuter. must take a look
We Want A Dead Simple Web Tablet For $200. Help Us Build It. having worked on a project to do just this, believe me, this is doomed. DOOMED
Science Clouds 'compute cycles in the cloud for scientific communities .. allows you to provision customized compute nodes .. that you have full control over using a leasing model based on the Amazon's EC2 service.' Wonder if they'd like to give SA some time ;)
22 Jul 2008 11:05pm GMT
Adrian Sutton: Content In The Mobile World
I had two of our keen young developers (Dylan and Suneth) email me overnight to ask my CTO-ish opinion of trends in the mobile space and how they might apply to Ephox. It's a very good question - with the advent of BlackBerrys first and now even more so with the iPhone, mobile internet is finally moving from "the future" to "the now", even if it's not evenly distributed yet. Of course, Ephox is squarely placed in the enterprise content creation business, so no matter how popular the mobile world becomes we're very unlikely to bring out a mobile phone game or a tip calculator. So here's my take on where the mobile world is with regard to enterprise content creation.
Content Creation vs Content Consumption
Firstly, it's important to realize that there are two quite distinct areas to content - creation and consumption. There is a huge amount of content consumption on mobile devices - on the go access to email, websites, notifications, twitter etc are probably the most common uses for mobile internet. However, nearly all of this is just content consumption. Most people read their email but don't reply until they get back to their desk and have a full keyboard. People receive notifications on their phone and then take action via their computer. When people do respond to these things, it's generally a very short note because of the limitations of the input mechanism. After all, even with a physical keyboard, BlackBerrys are still a very slow way to write long emails.
What this means for content creation is that the input tools are generally extremely simple - usually if not always just plain text and maybe a photo or video from the onboard camera, but it's rare to find formatting functions etc. For a company that creates editors like Ephox, it's not looking like a particularly lucrative market.
Other Content Types
One area that is picking up on phones is the creation of non-textual types of content. After all, if you take away the full size keyboard and replace it with video and audio capabilities it's pretty obvious that text isn't going to be the most popular medium. Again though, the features required are actually pretty minimal - when you're on the go, you really just want to quickly grab the photo and move on or record your audio or video and either publish it immediately or upload it somewhere so you can edit it later on your full PC. The physical device constraints simply make it too hard to edit the content on your phone directly so it makes far more sense to use a full PC for that, or just not bother.
So Are We Done?
If it's the physical constraints of portable devices that are dictating their usage, does that mean that software has done all it can? Definitely not. There are two key aspects of the mobile content puzzle that to me seem largely unsolved, finding the content you need and annotating it. Plus as I mentioned in my previous post, synchronizing content.
Finding the right content is usually a hard problem on full PCs, but with the physical constraints of mobile devices it's even harder. Search obviously plays a big part in this, but so do notification systems. Having your phone tell you that you have important information waiting for you, or even just interesting information for when you have time, is a huge knowledge sharing opportunity. That's why reading your email on the go is so popular - it delivers generally useful information straight to you, so you can use your travel time to stay on top of it and ready your thoughts before you get back to the office to type an email. There's a lot more information being created throughout the enterprise that you probably should be made aware of though, and it's not all suited to email. New sales leads, updates to support cases, updates to intranets, wikis and blogs etc. would all be useful to have delivered to you, either with a notification to get your attention or to just sit there for when you have time to look at your phone and find out what's new. I expect RSS and Atom to play a huge part in this, but I wouldn't be surprised if content specific or area specific applications come about as well.
The other aspect is annotating content. Quite often you have a few brief ideas you want to jot down on the go and then flesh out later, or perhaps you just want to proofread existing content. There are actually very few existing tools that allow you to do this. You can read content, you can often write new content or reply, but annotating existing content is quite rare. What I want to be able to do is read an email and add little notes to myself on it - preferably attached to specific points in the email, but even just a generic notes field would do. For PDFs, RSS entries and web pages that could be even more useful, as it would allow you to capture your thoughts on the spot so you don't forget them.
Summing Up
There's a huge potential for innovation in content in the mobile space but it's probably not just porting more and more of the desktop applications to mobile devices. The key is to take advantage of the "on the go" nature of mobile devices without forgetting their inherent limitations and inefficiencies. Combining mobile platforms and the desktop is the key to creating genuinely useful applications.
22 Jul 2008 4:44pm GMT
Carlos Sanchez: Saturday in Madrid, Coruña later
This Friday I leave for Madrid, I want to meet some people there so I'll spend some days around, until leaving to Coruña sometime next week. If you are around drop an email or leave a comment 
22 Jul 2008 4:00pm GMT
Henri Yandell: Random thought from Foundations summit
A random thought I had while attending the Floss Foundations summit was:
"What would an Affero Permissive License look like, and would it be GPL compatible?"
Namely, how would an AfPL differ from Badgeware or the BSD advertising clause? I'm thinking it would be something like adding this clause to the BSD:
"You must prominently reproduce to all users interacting with source or binary forms remotely through a computer network (if your version supports such interaction) the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided to the user."
Anyway - random useless thought of the day. Are we going to see the BSD advertising clause make a comeback?
"3. All advertising materials mentioning features or use of this software must display the following acknowledgement: This product includes software developed by the University of California, Berkeley and its contributors. "
22 Jul 2008 3:07pm GMT
Adrian Sutton: Mobile Fail Point No 1
I've quickly come to realize that the mobile world has a huge dependency on synchronization technology to make things work smoothly. You can read your email on the phone and reply from your laptop. Read RSS items should be synced, and just about everything else on your phone should be synced with somewhere else.
The problem is that generally synchronization support is lousy. NetNewsWire is too slow syncing feeds, Mail.app doesn't seem to notice if a message changes from unread to read and the WordPress iPhone app doesn't seem to download drafts that you created in the browser interface.
Sync is the killer requirement that goes unsaid on mobile devices. You can spend as long as you like polishing the UI, but if your synchronization isn't seamless your app will be a chore to use. If you get it right, users won't notice at all.
22 Jul 2008 3:05pm GMT
