Level: Introductory
Cameron Laird (claird@phaseit.net), Vice president, Phaseit, Inc.
01 Oct 2002

Science and engineering laboratories have long depended on proprietary products for daily data analysis chores. Now, many labs are turning to open source products and development languages for specific technical benefits the conventional products don't give them.
Scientists and engineers conventionally use proprietary products such as The MathWorks' MATLAB, Wolfram Research's Mathematica, or SAS Institute's SAS/IML (see Resources for links to the companies' products mentioned in this article) to collect, process, and report research data. Although it's tempting from a software standpoint to lump these products into the larger category of "business intelligence" or "analytics," each is carefully crafted for the specialized science laboratory market -- and each enjoys unusually high levels of customer satisfaction.

These applications have been successful, both financially and technically. MathWorks, for example, is still aggressively hiring programmers, and Mathematica rolls out interesting enhancements with every release of its symbolic calculator, including "document-centered interfaces" and its "algorithm knowledge base."

These applications are running into a new generation of competitors, though. In particular, reliance on open source software at many sites is growing steadily. This article explains why, and what the consequences are for your own work. It also concludes that open source and proprietary products might end up as teammates, rather than opponents.

Figure 1. A 3-D surface rendered by Scilab, an open source tool for numerical computations

Open source advantages
The main reasons scientists and their in-house developers are switching to open source software are:
- No-fee licensing
- Ease of license fee management
- Better large-scale programmability
- Easier integration
- Better performance
- Development convenience
- Intellectual propriety
- Better support
"Free" products have an obvious appeal -- but perhaps this is less important than it would seem. Quite a few users say they can bear the cost of products such as IML and MATLAB; the software works well, and they're happy to pay license fees, which represent only a tiny fraction of the value they provide. Some users even claim to be indifferent to software costs, because they pass the amounts through on their research grants. What fee-free licensing does give, though, is flexibility. A researcher might visit a colleague's laboratory and want to put a job together quickly on the host's hardware. When software is open source-based, it's never more than a download away. Even documentation of proprietary products is sometimes restricted; in regard to Research Systems Software's Interactive Data Language (IDL), Terry Hancock, CTO of Anansi Spaceworks says, "I have trouble getting a copy of the full API." With open source, however, there's no need to arrange for a purchase order, seek authorization, or otherwise engage whatever financial bureaucracies would need to be involved. Users appreciate that. Some administrators will tell you that their costs to configure and maintain clumsy proprietary license managers are higher than the license fees themselves. No-charge software also has a modest edge with students. Students certainly report that they're more likely to work on their own with software that doesn't cost them anything. On the other hand, vendors generally offer academic discounts, and it's hard to substantiate the notion that purchase price truly does deter the next wave of prospective users. Figure 2. Flow around an airfoil, rendered using Yorick

Programmability
Commercial products offer enormous value in their documented ability to do common data manipulations right out of the box. They can easily store data arriving from instruments, make it available for browsing, sort it, graph it, and wrap it up into a report. The recipes for such operations are easy to learn and allow working scientists responsible for their own data processing to concentrate on science, not computing techniques.

As computing objects, though, the same products have several blemishes. Their base languages imperfectly support newer, more powerfully expressive idioms, including object-oriented and functional programming. As apt as the tools are for quickly pulling together an impressive calculation, they become clumsy when used by large teams working on long-lasting projects. Code reusability is difficult enough that cutting and pasting dominates the workflow in many laboratories.

Teamwork is always a challenge, of course. Can open source do any better? In many cases, yes. For instance, according to Konrad Hinsen, a senior researcher with the Centre de Biophysique Moleculaire in Orleans, MATLAB "has nothing for structuring code beyond the function level (no modules, for example), and no definable data structures at all. It is impossible to build abstractions... Matlab is the interactive equivalent of Fortran: everything is a matrix." Contrast this lack of data structures with the built-in object-orientation and module packaging of open source development languages such as Python or Ruby.

William Kleb is computational methods development leader at NASA Langley Research Center. Reusability is particularly important for him. An aeronautical engineer who specializes in "hot" (above Mach 5) flight, he deliberately began searching a few years ago for a better way to use computers. Kleb says, "We were tired of continually scrapping code because it had become too fragile. We also came to realize that our team was clueless about working as a team on a single piece of software. We turned to the software engineering/development community, looking for best practices."

The result? "At this point," says Kleb, "we are using Ruby to create custom tools to support some of our XP [eXtreme Programming] practices like automated acceptance testing and unit testing. We have also managed to write a Fortran 95 mouth for Ruby's awesome documentation tool, Rdoc, to provide automated API documentation. We also use Ruby for Fortran code generation, for conditional compilation, and as glue to combine various code elements into multidisciplinary combinations. We are evolving toward a goal of wrapping nearly all the Fortran bits with Ruby."

These distinctions are important. Programmability matters, even to scientists who do not have a background in software. Everyone who develops with engineering toolkits will be abstracting; the only question is how well the toolkit supports this operation. Moreover, there are commercial products, including Mathematica, that boast well-designed languages. In general, though, it's the open source languages, including such esoteric ones as Sather (see Resources), R, and J, which have emphasized the abstract expressive power essential for long-term maintenance and reuse.

Figure 3. The DomainFinder application, a program for determining and characterizing dynamical domains in proteins
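To make the point about abstraction concrete, here is a minimal Python sketch of the kind of definable data structure Hinsen says a matrix-only language cannot express. The class names, field names, and values (`Measurement`, `Run`, `temperature_k`, and so on) are hypothetical, invented purely for illustration; they do not come from any project mentioned in this article.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Measurement:
    """One instrument reading (hypothetical fields, for illustration only)."""
    sample_id: str
    temperature_k: float
    pressure_pa: float


class Run:
    """A named collection of measurements that carries its own behavior."""

    def __init__(self, name):
        self.name = name
        self.measurements = []

    def add(self, measurement):
        """Append one Measurement to this run."""
        self.measurements.append(measurement)

    def mean_temperature(self):
        """Average temperature over all measurements in the run."""
        return mean(m.temperature_k for m in self.measurements)


# Made-up values, used only to exercise the classes above.
run = Run("demo")
run.add(Measurement("a", 300.0, 101325.0))
run.add(Measurement("b", 310.0, 101300.0))
mean_t = run.mean_temperature()
print(mean_t)
```

Because the fields are named and the behavior lives with the data, code like this could be placed in a module and imported by colleagues rather than cut and pasted, which is the reuse problem described above.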

Within open source, at least a couple of distinct approaches to scientific problems are visible. Projects such as Octave (see Resources) aim to substitute for MATLAB rather narrowly. They perform scientific data analysis, and that's all they do. Several no-fee packages are even more specialized: Image Reduction and Analysis Facility (IRAF) is widely used in astronomy (see Resources), for example, and disciplines such as anthropology, high-energy physics, and genomics all have analogous applications. These tend to have valuable libraries of canned routines, but often a "terrible base language," as Joe Harrington of Cornell University characterizes IRAF.

Other projects start with general-purpose languages, including Perl, Java, and Python, and build "vertical" toolkits based on them. Among these, SciPy is particularly ambitious (see Resources), with a growing record of concrete accomplishment. SciPy is a Python-based project that aims to reproduce all of MATLAB's functionality, better its performance, and ease its integration with other software, all while remaining entirely free of charge and at least as easy to use as MATLAB.

Among the sixty attendees at the last physical meeting of the SciPy interest group were representatives from Caltech, the National Biomedical Computational Resource, the national laboratories at Lawrence Livermore and elsewhere, Lockheed Martin, the Baylor College of Medicine, the Space Telescope Science Institute, and the Stanford Linear Accelerator Center. Full general-purpose programming power integrated with a rich collection of specialized modules is the ideal at which these contributors aim.
Easy interfaces
Programmability isn't only about development in the larger sense. It also affects the other end of the scale, the level of individual statements or commands in an application. One of the vulnerabilities of commercial software is that it's been resistant to integration with other systems. While all the commercial products document interfaces to "foreign functions," including network data feeds, instrumentation device drivers, and so on, the interfaces are hard to use.

Open source languages, in contrast, are famously cheerful about gluing themselves together with other pieces. Programmers have written thousands of extensions for such general-purpose languages as Python, Tcl, and Perl. This is crucial for many organizations: regardless of how much they like the cosmetics or libraries of commercial products, they simply can't afford the high cost of tying specialized external systems into hermetic software packages. Developers have found they're better off with more easily integrated open source toolkits.

A final issue related to programmability is performance; many researchers will tell you that proprietary software, when scripted, performs poorly. While specific operations are nicely optimized, general-purpose language features often are slow. Even though "glue" or "scripting" languages such as Perl also have a long-standing reputation for being slow, their superior algorithmic expressiveness and the speed optimizations of recent releases combine to create powerful advantages.
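As a rough illustration of the "glue" role described above, the following Python sketch parses a small text feed and summarizes it using only the standard library. The feed format, the column names (`time_s`, `voltage_v`), and the readings are all assumptions made up for this example; a real instrument or network feed would need its own parsing.

```python
import csv
import io
from statistics import mean

# Hypothetical instrument output: a header row followed by readings.
RAW_FEED = """time_s,voltage_v
0.0,1.20
0.5,1.25
1.0,1.31
"""


def read_feed(text):
    """Parse comma-separated readings into a list of (time, voltage) floats."""
    reader = csv.DictReader(io.StringIO(text))
    return [(float(row["time_s"]), float(row["voltage_v"])) for row in reader]


readings = read_feed(RAW_FEED)
average_voltage = mean(voltage for _, voltage in readings)
print(len(readings), round(average_voltage, 3))
```

In a sketch like this, the `io.StringIO` wrapper stands in for an open file or socket, so the same parsing function could in principle be pointed at a different source without changing the analysis code.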
Easier to debug
All the arguments for open source so far have rested on secondary characteristics of open source software. For some developers, openness itself is a paramount feature. Tom Silva is an experienced consultant who works with the Space Shuttle and other aeronautic projects. He simply but persuasively observes, "If I have the source, I can find the problems. I can add features that make my life easier. I can trace the code to figure out how it really works, as opposed to how it's documented."

Related to this is a growing concern about the scientific propriety of reliance on commercial software. What does it mean that a certain result was achieved through use of a proprietary library? How does an academician document a discovery or conclusion when it rests on a black box whose details are trade secrets held by a third party? Open source doesn't suffer from these liabilities, of course. In principle, everything about an open source program can be specified precisely. It doesn't harbor any secrets.

Superior support is the final reason some researchers favor open source. While commercial companies employ dedicated support personnel, the "community-based" online forums available for several open source technologies constitute powerful competition.
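Silva's remark about tracing code "to figure out how it really works" can be tried in a few lines. This sketch uses Python's standard `inspect` module to print the source of `statistics.median`. That function is chosen here only as a convenient example, and the sketch assumes a standard Python installation in which the library's source files are present.

```python
import inspect
import statistics

# Retrieve the actual implementation of a library routine as text,
# so it can be read rather than inferred from the documentation.
source = inspect.getsource(statistics.median)
print(source)
```

The same call works on any function or class whose source file is available, so a result can be traced to the exact code that produced it.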
Summary
Harrington is "eager for the day when we can dump the overpriced and underfeatured languages we use" and switch to open source alternatives, and so are a growing number of researchers. But it's more than that. While proprietary products aimed at scientists and engineers provide a great deal of value, they do so at high cost -- not just the purchase price, but also the costs of inflexibility, stubbornly slow performance, and difficult development. Many users are already making the switch. The result: they're improving their work right now.

Within this broad trend, there are several interesting local changes. Drama pervades the bioinformatics marketplace, where venture capital-financed products with high price tags are common, as are well-accepted and well-regarded no-cost packages built on Perl, and to a lesser extent, other languages such as Java, Tcl, and Python. In other disciplines, the fastest progress seems to be made by those who combine new and old technologies. Kleb's group wraps up legacy Fortran in Ruby-based packages. Several groups are using Tcl to improve the interfaces of such commercial packages as MATLAB.

Scientific and engineering teams saddled with uncomfortable software needn't wait on product vendors to solve their problems. Many projects are rapidly making all sorts of solutions freely available. Even without those, open source developers with the experience to replicate elements of proprietary value are multiplying. Some of the best coding currently being done marries free and commercial parts to achieve results beyond the capacity of either in isolation.
Resources
- Participate in the discussion forum.
- Learn about MATLAB at The MathWorks Web site.
- Mathematica from Wolfram Research is a particular favorite of mathematicians and more analytic scientists, but it's a bit heavy for many biologists and chemists focused mainly on reporting laboratory data.
- Visit the SAS Web site to learn more about SAS/IML (Interactive Matrix Language) software.
- Read about IDL (Interactive Data Language) at the Research Systems Software Web site.
- The SciPy Toolkit aims to replace MATLAB.
- GPL-ed Octave attempts to provide much of the same functionality as MATLAB.
- IRAF is widely used by astronomers.
- Sather aims to combine the best of C, Eiffel, Lisp, and other better-known languages.
- Scilab is a free MATLAB competitor that uses sophisticated algorithms to analyze data and produce lovely graphics.
- The Molecular Modelling Toolkit (MMTK) is based on the Numeric extension to Python.
- Read Cameron's article on how JPL scientists are using open source software and strict development practices to produce truly mission critical applications (developerWorks, August 2002).
- IBM researchers are involved in a number of scientific and technological disciplines, including chemistry, computer science, electrical engineering, materials science, math, and physics. Learn more about it at the IBM Research site.
- IBM Life Sciences addresses IT needs specific to biotechnology, pharmaceuticals, genomics, proteomics, and healthcare.
- Find the Linux resource you're looking for in the developerWorks Linux zone.
About the author
Cameron is a full-time consultant for Phaseit, Inc. He writes and speaks frequently on open source and other technical topics. You can contact Cameron at claird@phaseit.net.