Archive for February, 2010

Distribution Protocols Specifications

I recently looked around for the different distribution protocols that I mentioned in my previous post.

Here is what I found :
– The Bittorrent protocol specification,
– The Erlang’s Distribution protocol,
– The BOINC System Architecture,

And I also found out some interesting initiatives, that seems to be pretty inactive at the moment, sadly : Gittorrent and its protocol : GTP
An interesting mix 🙂

Apart from that, after reading a lot of research papers on distribution. I have selected a few that seem to be related to my interests.

One of the problems, in the way I see it, is that when you have one CPU, you can easily say that instruction happened “after” that one. And that becomes natural when you write a line of code after another. Now when you have multiple CPU running multiple processes in different places, with delays in transmissions, it become crucial to be able to determine if an instruction actually happened “before” or “after” ( or “at the same time” ) another instruction.

The most widely used algorithm to do that might be what is called Vector Clocks.
One problem though, is that vector clocks aren’t really dynamic, as you must know from the start how many concurrent processes are running, and you might be quite limited on the number of concurrent process you can have, as it will, with a naive implementation, change the size of your clock.
Second issue is that you need, most of the time, if you care about the size of the data you are sending over the wire, to have a central point to establish and increment that clock. Otherwise you need either :
– to duplicate your clock to actually end up with a Matrix Clock that can be in some context prohibitive because of its size…
– to send all your messages to all processes, so that the clock doesn’t become inconsistent between 2 nodes, and make sure they actually arrive…
This is not a simple issue to work around if you want to use your distribution redundancy as a safety net, for a process to replace another one if something fails for example, or for a “tunnel”  to be setup when a connection in your net falls down.

All this is a little annoying as if you want to optimise your transfer over the network, while using the benefits of the distribution to write safe applications, without any single failure point.

An interesting alternative to vector clock has been proposed : Interval Tree Clocks
It deals with the dynamic problem, and I planned first to try to implement something similar to organize data over my distributed network.

I am not sure I understand all of it already, but from what I can catch so far, I think the size of data you need to send to the different nodes might change with a naive implementation at least… I am not so sure of how to handle that especially if I want to use UDP and not TCP, as I do not really need the total ordering TCP provides, but I will likely enjoy the UDP performance…
The idea to have an ProcessID that is used when you increment the clock seems to be the way to go to handle dynamic creation / failures / termination of processes ; and the way the number of concurrent processes is “compressed” is pretty effective and interesting. However I am not sure if it will actually work in an architecture without “central point” or total broadcast, or some node might wait forever for a message who is not coming…
I now think that I need something closer to the network, at a lower level. Something that can deal with network intricacies (latency, losses … ) if needed. Or maybe I will need to subdivide my feature set, and implement them in separate components… we’ll see.

When you have a distributed, network hungry, application, the best way to reduce messages transmitted from one place to the other is to actually filter the messages you are going to send. Therefore there is in many protocols a “publish/subscribe” system to actually determine what needs to be send to who. And that “selective messaging” while improving network usage, doesnt play nice with ordering algorithms…

What I want to try to get, is a completely distributed architecture, without the need for a central “reference” point at any moment, and with possible “selective sending”. While keeping the partial ordering in a dynamic context that is provided by ITC. I think this might raise a lot of interests in people towards distributed architecture, because of the safety you might gain ( no central failure point ) and the optimisation on the network that appears when you have many processes communicating the same information around…

A lot of other improvements have already been made around vector clocks, more or less publicly, with more or less success, and what I need to likely to be a “mix of those”. Yet I ll have to prove somehow that it s going to work, and that it will not impact the network in a bad way…
The search is far from over…

Distributed Computing, what is it used for ?

It s been quite a bit since I last blogged here, and I ve been busy looking for a job, fixing my little game engine, and writing prototypes erlang for the network side of it. Since I plan to implement quickly some type of distribution, it is useful to have a look at existing distributed applications and protocols, and how they are used and loved ( or not ) by people out there…

But I am talking about total, complete, distribution, meaning peer-to-peer distribution. No central servers, ie. no central point of failure, or slow-downs.

So first lets make a quick list of things, that might be worth looking into :
bittorent, which has been around for quite a while. ( the tracker can be the server, however it is just required temporarily… )
BOINC, which is right now probably the most used distributed platform ( do they need servers ? )
erlang_OTP, which apparently can be used to write and run completely distributed applications.
Git and Bazaar, Distributed CVS : only data is distributed, not the actual computation effort. However the distribution is complete here. Server or not, everything still works…
– still need to have a look at Mozart

Mmmm after a quick look, completely distributed applications arent really wide spread… After all the problem is still an open one and there are still lots of research being done.

Famous applications, that may look distributed, such as Twitter for example, actually arent using anything more than the usual client-server architecture, and therefore require a server… However some might be distributable among different servers, without bottleneck effect : Can ejabberd be completely distributed for example ?
Any other ideas ?

In the following posts I ll have a look at how these applications works to help me write prototypes of distributed, server-less, code. My first application is likely to be a chat-like, probably similar to twitter, as it can use XMPP, and has pretty simple commands. Let’s see what I can come up with 😉

Distributed Programming Language -> What is it exactly ?

There are already quite a few distributed programming languages out there. The most popular being probably Erlang, although it s not a “network transparent” or “distribution transparent” language. On the other hand Mozart ( which I didnt try yet, but probably will very soon ), seems to have some mechanics for transparent distribution, but with the cost of lot of semantics to be learned. And, well to be honest I never heard about it so far, although I am sure there must be clever systems in it… Why is that ?

The difficulty in finding a proper distributed language for your application comes from the fact, that you can write distributed application as soon as you have any “socket related” feature in your language. However the time you need to write such an application in a language that hasnt been designed for it, is humongous. In addition, in the real world, the necessity or distributed languages is arguable right now, because companies dont see much applications for it just yet. Without the wheel nobody thought about cars I bet 😉

The main question I want to discuss here is : “What does a distributed programming language need in its very core to be useful, widely used and accepted by usual developers ??”

=> To be used it needs to be useful, yet simple, so most developers used to imperative programming can grasp it quickly, and do something useful with it.
In traditional local programming, the execution is done on a CPU, one instruction after another.
In concurrent / distributed programming, the execution is done on multiple CPU, many instructions at the same time ( maybe? ) and we do need a way to harness that. It can be an extremely complex “concurrency graph” of relations between different instructions set in different locations. One of the way to harness it with a very fine-grained control system could be based on “Causality-tracking” algorithm, that are already in use by TeleCom Companies, for mobile phones related applications.

In traditional ( imperative and functional ) programming, we are writing instructions in a structured way, that we can represent as a decision tree. Problems arises when you have to mix the “causality graph” of relations between different instructions or events in the different location, and that decision tree.
If I try to write ( in C++ like syntax) instructions like :

var a = 5; if ( condition )
//I am asking remote processB to increment my variable "a"
{ remote_execute ( incr(a), processB); }

//I expect a to be incremented -> MISTAKE !!!
{ getfrom(a, processB ); }

I am making an obvious mistake, because in the “else” block, my variable a is not going to be incremented, and even more, it might not be known by processB at all. Even if it is a programmer error, we do need a way to tell him early on it s an error, or better, make it impossible for someone to write such a code. And this might be the most obvious error that can happen. If you have written multithreaded applications in the past, you know how troublesome it can quickly become.

The main thing behind compiled languages, is that they try to warn the user during compilation of all the possible errors. And they cant prevent most concurrency errors, because these are very dependent on other factors ( execution environment, ordering of instructions, etc. ) that the compiler cannot have knowledge of. The causality tree of a concurrent program is unveiling during the execution, there is no static representation of it that we can code. And even if we could, it would be a big hassle to the programmer who would probably not even want to think about it, since it s not related to his problem, as long as performance is satisfactory.

To conclude, to be useful, a distributed language needs to :
– remain simple and efficient, maybe simplifying the different ways to have concurrency( threads process, etc. )  and making it more transparent.
– harness concurrency with transparent causality tracking, ( simplicity : transparent, efficiency : allowing concurrent events – partial order )
– find simple and intuitive ways to mix that “dynamic” causality tree with the “static” decision tree. This is a very tricky part. Some system have concurrency controling imperative algorithms( BOINC ), some have imperative control over lightweight concurrency ( multithreads )
– be based on “evaluation”, just like script engines, because we already have a running system, and we can dynamically check for concurrency errors using it. It s also a good sandbox for concurrency algorithm tests.

=> To be widely accepted it needs to do *what we are already doing* but better, faster, stronger, even if we are not already doing distributed applications…
The most widespread “development” and “application writing” activity might be website development nowadays. It make sense in a way because it s related to distribution, people interaction, and just like with distributed languages, few years before the invention of the web, all the nice web applications we use on a daily basis right now would have been very hard to imagine.
To be able to compete with traditional imperative local languages, it should have decent performance on local CPU execution. Probably with implicit threading when it is detected possible, since nowadays most new machines have multiples CPU. It should also be quite familiar and intuitive to get used to it quickly. We probably need to keep and use most of the traditional control instructions ( if/else, for, while, etc. )

A good way for people to be able to use a quite low-level language for what they want is to allow “modules” or “plugins” to add high level functionality to it. the core language should however provide interface to most devices.

The road to an accepted distributed language seems still pretty long, but it s definitively an interesting one indeed…

Does someone already a knows a Distributed Language that fits all his needs ? Does anyone has other / better ideas for distributed language features ?
I wonder when the first truly transparent distributed language is going to arrive and be widely used…