Reposted from: http://www.chessbrain.net/hdcp.html
Harnessing Distributed Computing Power
using Open Source Tools
Carlos Justiniano
ChessBrain.NET
NordU 2004 USENIX Conference
Copenhagen, Denmark
February 1st 2004
ABSTRACT
Distributed Computing projects are successfully utilizing the idle processing power of millions of computers. This paper examines how Open Source tools can be used to harness the vast computing potential of Internet connected machines.
Because distributed computing projects depend on individuals who generously volunteer the use of their machines, the human and machine dynamics that make distributed computing projects possible will be closely examined.
The second half of this paper explores technical issues and examines how open source tools and technologies are helping to pave the way for new distributed computing projects.
1. PART ONE - DISTRIBUTED COMPUTING: HUMAN AND MACHINE DYNAMICS
1.0 INTRODUCTION TO DISTRIBUTED COMPUTING
Distributed Computing (DC) can be defined as the use of computers and their resources from geographically dispersed locations. By definition, the word distributed implies: dispersed, partitioned, and perhaps loosely connected. The term distributed computing has become synonymous with Internet connected machines that work together to accomplish one or more goals. The ability to use the resources of remote machines amounts to the technological equivalent of sharing. In fact, the earliest systems were referred to as timesharing systems.
The basic principles behind distributed computing are by no means new. From the moment there were two computers present side by side, there must have been a desire to connect them. Indeed, the desire to connect research centers and computing resources eventually led to the creation of the Internet.
1.0.1 Cluster Computing
As machines became more powerful, researchers began exploring ways to connect collections (clusters) of smaller machines to build less expensive substitutes for the larger, more costly systems.
In 1994, NASA researchers Thomas Sterling and Don Becker connected 16 computers to create a single cluster. They named their new cluster Beowulf, and it quickly became a success. Sterling and Becker demonstrated the use of commodity-off-the-shelf (COTS) computers as a means of aggregating computing resources and effectively solving problems that typically required larger, more dedicated systems.
Beowulf clusters are now commonplace, with large clusters being built using thousands of machines. Assembling clusters is no longer the sole purview of computer scientists: using the freely available ClusterKnoppix, a modified distribution of Knoppix Linux that includes OpenMosix, students can quickly connect a group of machines to form a cluster. One common way of programming such a cluster is sketched below.
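A widely used approach to programming a Beowulf-style cluster is a message passing library such as MPI. The following is a minimal sketch, not taken from the paper: the range-summing task, the constant N, and the node counts are illustrative assumptions, and MPI is simply one common open source option.

    /* sum.c: a hypothetical work-splitting example for an MPI cluster.
       Each node sums its own slice of 0..N-1; node 0 collects the total. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        const long N = 100000000L;              /* total work; arbitrary */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this node's id */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of nodes */

        /* Each node takes an equal slice of the range. */
        long slice = N / size;
        long start = (long)rank * slice;
        long end   = (rank == size - 1) ? N : start + slice;

        long local = 0;
        for (long i = start; i < end; i++)      /* this node's share */
            local += i;

        /* Combine the partial sums on node 0. */
        long total = 0;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum(0..%ld) = %ld\n", N - 1, total);

        MPI_Finalize();
        return 0;
    }

Compiled with mpicc and launched with, for example, mpirun -np 16 ./sum, the same binary runs on every node; each node computes only its slice, and the partial results are combined at the end.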
While speaking at a Mac OS X conference, technology publisher Tim O'Reilly told his audience that they were all Linux users. He went on to explain that the search engine Google is powered by the world's largest Linux cluster [3]. In fact, the cluster consists of more than ten thousand servers, according to a Google press release [4].
1.0.2 Peer-to-Peer Computing
Although distributed computing predates the Internet, it continues to touch nearly every aspect of our computing lives. Beyond web browsing and email, instant messaging and P2P file sharing increasingly rank high in the lives of hundreds of millions of people.
Consider the acronym P2P: it means Peer to Peer. In a P2P system, peers connect to one another rather than to a central location. The concept seems new to some, because the Internet they experience most is a system called the World Wide Web. In contrast, P2P systems rightfully bypass websites; the machines communicate directly with one another. The point is noteworthy because it underscores a core philosophy, one which gave birth to, and continues to guide, the development of many of the Internet systems we now take for granted. The Internet was not conceived of as a system involving centralized control; rather, early pioneers went out of their way to ensure the Internet would exist and flourish as a decentralized system.
Peer-to-Peer systems are allowing machines to connect with one another to share resources. The resources might involve file sharing and remote storage, or the sharing of processing power. Increasingly, researchers are exploring the potential for the sharing of processing resources. Studies show that most computers are vastly underutilized. This squandering of resources doesn't just involve home computers; very few businesses operate on 24-hour shifts, leaving vast numbers of machines sitting idle for long periods of time. If one also considers the millions of smart cash registers at retail outlets and markets, the underutilization of computing power appears to be everywhere.
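Volunteer computing clients typically exploit this idle time by computing only when the machine is otherwise unoccupied. As a rough, Linux-specific sketch (my illustration, not any particular project's client; the 0.5 cutoff and the use of the one-minute load average alone are arbitrary assumptions), a client might consult /proc/loadavg before starting work:

    #include <stdio.h>

    /* Return 1 if the one-minute load average suggests the machine
       is mostly idle, 0 otherwise. Linux-specific. */
    static int machine_is_idle(double threshold)
    {
        FILE *f = fopen("/proc/loadavg", "r");
        double load1;
        if (f == NULL)
            return 0;                /* if unsure, stay out of the way */
        if (fscanf(f, "%lf", &load1) != 1)
            load1 = threshold;       /* parse failure: treat as busy */
        fclose(f);
        return load1 < threshold;
    }

    int main(void)
    {
        if (machine_is_idle(0.5))    /* 0.5 is an arbitrary cutoff */
            puts("machine looks idle: safe to start crunching");
        else
            puts("machine busy: backing off");
        return 0;
    }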
1.1. HARNESSING THE VAST POTENTIAL OF DISTRIBUTED MACHINES
It’s important to point out that the primary lure of harnessing remote computing power is largely one of economics. After all, if limited financial resources were not of concern, what researcher wouldn’t want a massive cluster of dedicated machines? Reality being what it is… dedicated resources tend to be in high demand and researchers find themselves writing proposals pleading for precious time on large compute clusters. It’s easy to envision a researcher (who potentially holds the code that might unlock a medical cure) being denied access to centralized and dedicated computing resources.
Because the lack of computing resources may come from political, economic, and social barriers, researchers are turning to the Internet as a viable way to continue their computationally intensive work. What is equally interesting is that ordinary citizens (many of whom pay the taxes which fund major institutions) are providing the machines which support distributed computing based projects!
1.1.1 Distributed Computing Applied to Computation
Distributed computing offers researchers the potential of solving complex problems using many dispersed machines. The result is faster computation at potentially lower cost when compared to the use of dedicated resources. The term Distributed Computation has been used to describe the use of distributed computing for the sake of raw computation rather than, say, remote file sharing, storage, or information retrieval.
This paper will focus mainly on distributed computation; however, many of the concepts and tools apply to other distributed computing projects.
1.1.2 Distributed computation: a Community Approach to Problem Solving
Distributed Computation offers researchers an opportunity to distribute the task of solving complex problems onto hundreds, and in many cases thousands, of Internet connected machines. Although the network is itself distributed, the researchers and end-user participants form a loosely bound partnership. The resulting partnership is not unlike a team or community. People band together to create and connect the resources required for the achievement of a common goal. A fascinating aspect of this continues to be humanity's willingness to transcend cultural barriers.
1.1.3 Successful Projects
The earliest known DC projects used email to distribute tasks and results in the late 1980s. At the time, most people considered massively distributed projects to be largely idealistic. The notion that ordinary people would pay for and allow their computers to participate in research projects seemed foolhardy at best.
Participants in a distributed computing project bear the cost of equipment, electricity, and any necessary upgrades. The fact that millions of people now participate in distributed computing projects is a testament to an evolving technology landscape.
Distributed Computing entered the public spotlight on October 24th 1997, when the New York Times published an article entitled "Cracked Code Reveals Security Limits" reporting on the distributed.net project's achievement in deciphering the RSA 56-bit encryption challenge. Distributed.net proved the viability of using massively distributed machines over a public network when they found the encryption key that unlocked RSA's encoded message: "The unknown message is: It's time to move to a longer key length".
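The underlying mechanics of such a search are simple to sketch. A coordinator only has to hand out disjoint blocks of the 56-bit keyspace, and each client exhaustively tests the keys in its block. The code below is my own illustration of that partitioning idea, not distributed.net's actual protocol; the block size of 2^28 keys per work unit is an arbitrary assumption.

    #include <stdint.h>
    #include <stdio.h>

    #define KEYSPACE_BITS 56
    #define BLOCK_BITS    28   /* 2^28 keys per work unit: an assumption */

    typedef struct {
        uint64_t first_key;    /* inclusive */
        uint64_t last_key;     /* inclusive */
    } work_unit;

    /* The coordinator hands out the nth block; a counter is all
       the state it needs to keep the blocks disjoint. */
    static work_unit next_work_unit(uint64_t n)
    {
        work_unit wu;
        wu.first_key = n << BLOCK_BITS;
        wu.last_key  = wu.first_key + ((1ULL << BLOCK_BITS) - 1);
        return wu;
    }

    int main(void)
    {
        uint64_t total = 1ULL << (KEYSPACE_BITS - BLOCK_BITS);
        work_unit wu = next_work_unit(0);
        printf("%llu work units; unit 0 covers keys %llu..%llu\n",
               (unsigned long long)total,
               (unsigned long long)wu.first_key,
               (unsigned long long)wu.last_key);
        return 0;
    }

A client that finds the key reports it back; a client that exhausts its block simply requests the next one, which is why adding volunteers scales the search almost linearly.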
The following year, a group of researchers at the University of California, Berkeley launched the SETI@home project. While SETI@home hasn't found any signs of alien life, it has demonstrated a staggering use of computational power. SETI@home researcher David Anderson compared the computing power available to the project against the ASCI White supercomputer, which was built by IBM for the U.S. Department of Energy. The machine cost US $110 million, weighed 106 tons, and offered a peak performance of 12.3 TFLOPS. Anderson went on to state that SETI@home was faster than ASCI White at less than 1% of the cost. The reduction in expenditure was due to the fact that not only were the computations distributed… but so were the costs! [1]
The AspenLeaf DC website (http://www.aspenleaf.com/distributed/) currently lists over 50 active projects, ranging from Analytical Spectroscopy to ZetaGrid, working in areas such as Art, Cryptography, Life Sciences, Mathematics, and Game research. Not surprisingly, the category of mathematics features the largest number of DC projects.