Interesting People mailing list archives

from TidBITS#186/26-Jul-93


From: David Farber <farber@central.cis.upenn.edu>
Date: Tue, 27 Jul 1993 11:18:22 -0500



Software Acceleration
---------------------
  by Roy K. McDonald, Connectix -- connectix@applelink.apple.com
     Presented at the Sumeria Technologies & Issues Conference

  Hardware gets faster every year. We've all come to expect it. And,
  a huge amount of work is going on right now to ensure that next
  year the same thing will happen.

  Software gets more features. And unfortunately, all too often, the
  presumption that fast hardware will take up the slack has meant
  that inelegant software design needlessly eats up performance
  advances. The irony is that software improvements are often far
  more dramatic in their impact than hardware improvements. Hardware
  is the tortoise, advancing relentlessly in tens of percents per
  year; software is the hare - on occasion it leaps orders of
  magnitude.

  This article reviews what has been done in software acceleration
  on the Mac, highlighting how much more could be done right now. I
  aim to persuade you to think about Mac performance as a hybrid of
  hardware and software acceleration and perhaps shift your
  priorities a little in favor of pushing the envelope on code
  rather than silicon.


Decade of Macintosh Hardware Advances
  Let's start by seeing what can be done with hardware. How has
  Macintosh hardware improved in performance over the past 10 years?

  The original 128K Mac had an effective speed of roughly 1/2 MIP.
  Today's Quadra 950 provides about 8 MIPs. Of course, the Quadra
  950 is relatively expensive, so on a real $/MIP basis, the growth
  is only eight-fold, equivalent to a yearly average improvement of
  26 percent.

  SCSI, NuBus, and AppleTalk speeds have changed less. SCSI may be
  about twice as fast as it originally was. The new Cyclone NuBus
  standard will give a four times performance boost. AppleTalk is
  basically unchanged. And although EtherTalk has brought a high-
  speed network standard whose raw bandwidth is roughly twenty times
  what we had in 1984, actual throughput is only about a factor of
  five better.

  Typical RAM installation has grown from 128K to the current
  average of 6 MB, a 50 times growth, or about 50 percent per year.
  Access speeds of main storage have only improved about a factor of
  two (although caching has mitigated this otherwise fatal
  limitation).

  Common hard drives seek about five times faster on average and
  hold ten times as much data as they did when drives first shipped
  for the Mac Plus. The average transfer rate hasn't improved by
  much more than a factor of two.

  Overall, we might imagine a "Speedometer" increase of as much as a
  factor of 20 over the past decade (with perhaps much more than
  that for floating-point operations).

  That's not to say that hardware can't make occasional big leaps,
  too. RISC processors will provide a roughly three times
  performance jump on one-third the die size, for an overall price-
  performance step of ten times in what will probably be a two to
  three year transition period. DSP can also accelerate certain
  processes by an order of magnitude.

  But, taken all together, typical jobs on a constant-priced Mac
  have gotten roughly 25 percent faster every year, solely because
  of technical advances in hardware and increased performance for
  the price. This means hardware performance doubles roughly every
  three years, a rate likely to continue for the foreseeable future.


Software Advances
  While hardware advances are relentless and pervasive, software
  improvements are often more specific in their impact. The
  performance results, however, can be dramatic.

  For a familiar example, consider the case of 'Find File' running
  under System 6 versus System 7. For fun, we recently took a Mac
  Plus running System 7 and raced it against a Mac IIci using System
  6. The System 7 software was running on hardware five years older
  than the System 6 version. Still, Find File went slightly faster
  on the Plus, because Find File is roughly ten times faster in its
  current form.

  Unfortunately, it often takes a long time for well-known software
  techniques to enter the commercial sector. For instance, it was
  many years after the introduction of the first spreadsheet
  (VisiCalc) before sparse and virtual array techniques were used.
  If you wanted a 50 by 1,000 cell spreadsheet, you had to have
  50,000 cells worth of RAM (say, 800K), even if most cells were
  empty.

  Sparse techniques would have let you use only as much memory as
  the occupied cells required, and virtual techniques would have
  used disk space as well, at the cost of slower calculation. But
  the marketing war focused on porting to new platforms and adding
  new features, not on saving RAM. A few engineer-years could have
  saved users tens of millions of dollars worth of RAM.
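
  To illustrate the sparse idea, here is a minimal sketch in C (with
  invented names; no shipping spreadsheet is this simple). Only the
  occupied cells are stored, and an empty cell simply reads back as
  zero, so an almost-empty 50 by 1,000 sheet costs a handful of
  cells' worth of storage instead of hundreds of kilobytes:

  /* Sparse cell storage: keep only the occupied cells. */
  #include <stdio.h>
  #include <stdlib.h>

  typedef struct {
      int row, col;
      double value;
  } Cell;

  typedef struct {
      Cell *cells;                /* only the occupied cells */
      int   count, capacity;
  } SparseSheet;

  static void sheet_set(SparseSheet *s, int row, int col, double v)
  {
      int i;
      for (i = 0; i < s->count; i++)      /* overwrite if present */
          if (s->cells[i].row == row && s->cells[i].col == col) {
              s->cells[i].value = v;
              return;
          }
      if (s->count == s->capacity) {      /* grow the cell pool */
          s->capacity = s->capacity ? s->capacity * 2 : 16;
          s->cells = realloc(s->cells, s->capacity * sizeof(Cell));
      }
      s->cells[s->count].row = row;
      s->cells[s->count].col = col;
      s->cells[s->count].value = v;
      s->count++;
  }

  static double sheet_get(const SparseSheet *s, int row, int col)
  {
      int i;
      for (i = 0; i < s->count; i++)
          if (s->cells[i].row == row && s->cells[i].col == col)
              return s->cells[i].value;
      return 0.0;                         /* empty cells read as 0 */
  }

  int main(void)
  {
      SparseSheet s = { NULL, 0, 0 };
      sheet_set(&s, 49, 999, 3.14);       /* far corner of the sheet */
      printf("cell(49,999) = %g, cells stored = %d\n",
             sheet_get(&s, 49, 999), s.count);
      free(s.cells);
      return 0;
  }

  (A real implementation would hash or sort the cell list rather
  than scan it linearly, but the memory argument is the same.)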

  Many new technologies which seem to arrive because of hardware
  advances are in fact largely enabled by software breakthroughs. We
  did a rough analysis of the increased performance in a variety of
  frontier technologies over the past five years and tried to assess
  what fraction of speed improvements came from software as opposed
  to hardware. We concluded that the software components for the
  various technologies were:

* Voice recognition         80%
* Handwriting recognition   80%
* Dynamic 3D graphics       60%
* Compression               50%

  In all cases, some hardware improvement (e.g., DSP) was necessary
  to make the technologies practical, but better software, and
  particularly better algorithms, was the most important enabling
  technology.


Components of Speed
  Where does the speed come from? You can break the software design
  process into three components: algorithms, implementation, and
  compilation.

  The largest range of performance difference comes from algorithm
  selection. This may also be the area of poorest performance in the
  industry today. Factors of 10 and 100 losses in performance are
  common. Why is this?

  Consider the basic Order theory of algorithms. Every computer
  algorithm can be classed by Order. For example, an Order N
  algorithm takes twice as long when you run it on twice as much
  data. An Order N-squared algorithm takes four times as long. Lots
  of computational problems are easy to code as N-squared
  algorithms, but can be rewritten with difficulty to scale as
  NlogN.
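
  To make the difference concrete, here is a small sketch in C
  (hypothetical code) of the same job done both ways: checking
  whether any two record keys collide. The nested-loop version is
  Order N-squared; sorting first with the standard library's qsort
  (typically NlogN) and then scanning neighbors is Order NlogN:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Order N-squared: compare every pair of keys. */
  static int has_dup_slow(const long *keys, int n)
  {
      int i, j;
      for (i = 0; i < n; i++)
          for (j = i + 1; j < n; j++)
              if (keys[i] == keys[j])
                  return 1;
      return 0;
  }

  static int cmp_long(const void *a, const void *b)
  {
      long x = *(const long *)a, y = *(const long *)b;
      return (x > y) - (x < y);
  }

  /* Order NlogN: sort, then duplicates sit next to each other. */
  static int has_dup_fast(long *keys, int n)
  {
      int i;
      qsort(keys, n, sizeof(long), cmp_long);
      for (i = 1; i < n; i++)
          if (keys[i] == keys[i - 1])
              return 1;
      return 0;
  }

  int main(void)
  {
      long keys[] = { 42, 7, 19, 86, 7, 3 };
      int n = sizeof(keys) / sizeof(keys[0]);
      long copy[sizeof(keys) / sizeof(keys[0])];
      memcpy(copy, keys, sizeof(keys));
      printf("slow: %d  fast: %d\n",
             has_dup_slow(keys, n), has_dup_fast(copy, n));
      return 0;
  }

  Both give the same answer on six records; the difference only
  shows up as N grows, which is exactly when shipping code gets into
  trouble.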

  A famous example was the introduction of the Fast Fourier
  Transform in the mid-60's, an NlogN algorithm that replaced the
  previous N-squared algorithm.

  A 1,024-point transform could thus be performed about 100 times
  faster by this new software method (roughly 1,024 x 10 operations
  instead of 1,024 x 1,024). So this advance was comparable in speed
  to over 20 years of general-purpose hardware improvement. And, it
  was accomplished through a software change which, once developed,
  had no marginal cost over the prior solution.

  Unfortunately, plenty of commercial software ships every day
  containing inefficient algorithms. Sorting records in a database
  is a familiar example where NlogN algorithms can be used but
  aren't always. Scale your data from 10 to 100 records, pixels, or
  whatever, and an N-squared algorithm takes 100 times longer to
  run, when it only needs to take about twenty times longer.

  It's easy to see why this happens. From the technical perspective,
  debugging and benchmarking are often done on limited data sets
  that don't reveal how badly the code will bog down in real-world
  applications. And the real world constantly increases data set
  size, often at an exponential rate. Screen diagonal and pixel
  resolution are two common parameters that quadruple the data set
  size when they double.

  Over in marketing, they know that software is not as rigorously
  benchmarked for speed as hardware, because comparisons are often
  more difficult to apply. So feature lists and time-to-market
  become disproportionately important factors.

  Good algorithms are not enough. Implementation counts as well. For
  example, suppose you need code for looking up records in a
  database. A naive algorithm for this is Order N - twice as many
  records means twice as long a search.

  The usual, more efficient way is to index the records in a binary
  tree. Then you need only log2(N) index lookups to find a record's
  location; finding a single record in a 1,000-record database takes
  about 10 lookups.

  But, if each of these lookups involves a separate hard drive
  access, the implementation is poor, even though the algorithm is
  optimal. A better (and more typical) implementation would bring
  some or all of the directory information into RAM at the time of
  the first disk hit and cache it there for the next nine lookups.
  Whether or not you use an optimized algorithm, if the
  implementation is three times slower than necessary, the overall
  performance suffers by the same ratio.
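
  For concreteness, here is a rough sketch in C of the same
  Order-logN lookup implemented two ways (hypothetical code; the
  "disk" is simulated by an array plus an access counter). Both do
  about ten probes per search; the poor version pays a disk access
  for every probe, while the better one reads the whole index into
  RAM on the first call and searches there from then on:

  #include <stdio.h>

  #define NREC   1000
  #define BLOCK  100                /* index entries per disk block */

  static long disk_index[NREC];     /* stands in for the on-disk index */
  static long disk_hits;            /* simulated seek-and-read count */

  static void read_block(int block, long *dest)  /* one disk access */
  {
      int i;
      disk_hits++;
      for (i = 0; i < BLOCK; i++)
          dest[i] = disk_index[block * BLOCK + i];
  }

  /* Poor implementation: every probe of the search hits the disk. */
  static int find_via_disk(long key)
  {
      long buf[BLOCK];
      int lo = 0, hi = NREC - 1;
      while (lo <= hi) {
          int mid = (lo + hi) / 2;
          read_block(mid / BLOCK, buf);
          if (buf[mid % BLOCK] == key) return mid;
          if (buf[mid % BLOCK] < key) lo = mid + 1; else hi = mid - 1;
      }
      return -1;
  }

  /* Better implementation: cache the whole index after one pass. */
  static long ram_index[NREC];

  static int find_via_cache(long key)
  {
      static int loaded = 0;
      int lo = 0, hi = NREC - 1;
      if (!loaded) {                /* first call: ten block reads */
          int b;
          for (b = 0; b < NREC / BLOCK; b++)
              read_block(b, ram_index + b * BLOCK);
          loaded = 1;
      }
      while (lo <= hi) {
          int mid = (lo + hi) / 2;
          if (ram_index[mid] == key) return mid;
          if (ram_index[mid] < key) lo = mid + 1; else hi = mid - 1;
      }
      return -1;
  }

  int main(void)
  {
      int i;
      for (i = 0; i < NREC; i++)
          disk_index[i] = i * 2;    /* a sorted set of keys */

      disk_hits = 0;
      for (i = 0; i < 100; i++) find_via_disk(i * 20);
      printf("disk hits, naive implementation:  %ld\n", disk_hits);

      disk_hits = 0;
      for (i = 0; i < 100; i++) find_via_cache(i * 20);
      printf("disk hits, cached implementation: %ld\n", disk_hits);
      return 0;
  }

  The algorithm is identical in both routines; only the handling of
  the disk differs, and that difference is worth about two orders of
  magnitude in disk traffic for a hundred lookups.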

  Good implementation is often a matter of deep familiarity with the
  target hardware platform, a familiarity which is increasingly
  difficult to achieve as technology life cycles grow ever shorter.

  Also, the code we write is not the code the system runs. Between
  the two stands a compiler.

  Within the Mac world one can find a range of commercial C
  compilers that vary by as much as 30 percent or more in ultimate
  compiled code performance. To do better than that, one must write
  in assembler, and here the variations are even greater. To put it
  bluntly, it's not hard to do a lot better than MPW.

  Looking beyond the Mac, we must face the fact that much more
  effort has gone into optimizing 80x86 compilers than 680x0
  products. As Windows has gained market share, more and more cross-
  platform benchmarks are being published of essentially identical
  source code compiled for Windows versus the Mac and run on
  similarly powered CPUs. The Windows products tend to run faster
  because the
  compilers are, by and large, a little bit better. The most
  striking example I've seen was a recent PC Magazine benchmark of
  WordPerfect where the Windows advantage was substantial. This is
  not because of a superior operating system, but because of the
  availability of a better optimized compiler.

  With the move from CISC to RISC architecture, and especially with
  the move to superscalar pipelines, ever more burden is placed upon
  the compiler. If sloppy compilers can be written for CISC
  machines, time-to-market pressures could produce RISC compilers
  that give away even more performance.

  The trend in the software industry today runs counter to this
  theme. We are all sacrificing performance in
  favor of time-to-market. Object Oriented Programming is the
  epitome of this trade-off. Now, there's nothing wrong with OOP,
  and it's great that we'll all soon be writing Newton applications
  by dragging and dropping resources from the object pool.

  But OOP is an obvious formula for inefficient code. Witness the
  feel of the Finder in System 6 vs. System 7. For many
  applications, I'll guess that early products will be sketched in
  OOP and that later, more mature products or versions will be coded
  at lower levels.

  Lately we've been thinking about starting a development house that
  specializes in knocking off popular OOP-based products with C or
  assembler-based me-too versions. We'd be second to market but we'd
  win the benchmark wars every time.


System Software
  System software is particularly important because of its pervasive
  impact on performance. Well-written, native-mode system calls are
  critical to good performance for a wide range of software
  products, and can to some extent overcome limitations imposed by
  inefficient compilers. If most of the computer's time is spent in
  highly-optimized system calls, the inefficiencies of the calling
  program can easily be overlooked.

  On the downside, many advances in system software have undermined
  performance. Windowing systems and multitasking both advance
  overall productivity, but add overhead which slows routine
  operation. The user gets new functionality, but it doesn't come
  for free, and it affects all applications.

  Moreover, advances often improve performance in ways that are
  difficult to define quantitatively. Both virtual memory and RAM
  disk technology can significantly enhance Mac productivity, but
  it's hard to benchmark their contributions. For example, Connectix
  end-user studies of Virtual and MAXIMA customers indicate that
  either product can increase total work output per session by 5-20
  percent, but results vary widely according to the type of work
  performed and the system configuration.

  An area of particular interest to Connectix is the use of
  advanced, dynamic disk caching techniques, utilizing all of the
  often "wasted" RAM on computers to avoid unnecessary disk access.
  The benefits of this are two-fold:

  First, disk accesses are usually a hundred to a thousand times
  slower than RAM accesses, so tremendous speed improvements can be
  achieved. Preliminary benchmarks on our Velocity caching product
  show an overall work throughput increase of about 25 percent.
  That's not bad for a low-cost software extension considering what
  it costs to accomplish the same boost in hardware.

  Second, caching has become increasingly important because of
  portable computing. PowerBook users will enjoy considerable
  battery life extension through the elimination of unneeded disk
  spin-ups, which typically account for 10 percent of power use in a
  battery-powered PowerBook session. Many PowerBook users also
  complain that their PowerBooks seem sluggish compared to
  comparable desktop systems - mainly, it appears, because of the
  random annoying delays of drive spin up.

  The key to a successful caching strategy involves maximizing the
  available cache size and filling it with the data most likely to
  be called for next by the CPU. Velocity incorporates unique
  advances in both of these areas, which I look forward to
  discussing in the future.
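
  To make the caching idea concrete, here is a minimal sketch in C
  of a least-recently-used block cache (hypothetical code, far
  simpler than anything Velocity actually does): a handful of RAM
  slots hold recently used disk blocks, and the slot that has gone
  longest without being touched is recycled whenever a new block is
  needed.

  #include <stdio.h>

  #define SLOTS 4                    /* cache capacity, in blocks */

  typedef struct {
      long block;                    /* which disk block is cached */
      long last_used;                /* recency stamp for LRU */
      int  valid;
  } Slot;

  static Slot cache[SLOTS];
  static long clock_tick;
  static long disk_reads;

  /* Touch a block, going to the "disk" only on a cache miss. */
  static void access_block(long block)
  {
      int i, victim = 0;
      clock_tick++;
      for (i = 0; i < SLOTS; i++)
          if (cache[i].valid && cache[i].block == block) {
              cache[i].last_used = clock_tick;    /* cache hit */
              return;
          }
      for (i = 1; i < SLOTS; i++)    /* miss: pick the LRU victim */
          if (!cache[i].valid ||
              (cache[victim].valid &&
               cache[i].last_used < cache[victim].last_used))
              victim = i;
      disk_reads++;                  /* the real disk access */
      cache[victim].block = block;
      cache[victim].last_used = clock_tick;
      cache[victim].valid = 1;
  }

  int main(void)
  {
      /* A toy access pattern with plenty of re-use. */
      long pattern[] = { 1, 2, 3, 1, 2, 4, 1, 2, 3, 1, 2, 4 };
      int i, n = sizeof(pattern) / sizeof(pattern[0]);
      for (i = 0; i < n; i++)
          access_block(pattern[i]);
      printf("%d block accesses, %ld disk reads\n", n, disk_reads);
      return 0;
  }

  With four slots and that access pattern, twelve block requests
  cost only four disk reads; the rest are served from RAM at RAM
  speed.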


Input/Output
  One of the most productive areas for software acceleration is in
  the I/O domain, both internal to the system, and over a network.
  After all, processing has three major steps - you get the
  information, then you process it, then you spit out the results.
  Two thirds I/O, one third processing.

  Consider the following thought experiment: Watch a typical user
  for an hour. She opens files, launches applications, enters
  alphanumeric data, spell checks, calculates, sends email, closes
  windows. Now, double the processor speed. Maybe she'll save 5
  minutes out of the hour. Instead, suppose you double the I/O
  speeds - SCSI, ADB, AppleTalk, and NuBus. How much does she save
  then? Our testing indicates it's also about five minutes, and it's
  certainly within a factor of two of that either way for most
  sessions.
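
  To put rough numbers on that thought experiment: doubling the
  processor halves whatever time is spent waiting on computation, so
  a five-minute saving implies that something like ten minutes of
  the hour were compute-bound. The same five-minute saving from
  doubling the I/O channels implies another ten minutes or so spent
  waiting on SCSI, ADB, AppleTalk, and NuBus. The remaining forty
  minutes are dominated by the user herself, and neither kind of
  speedup touches them.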

  Moreover, a lot of the time saved will occur during periods when
  the user would be especially annoyed at delays. Most people are
  prepared to watch their clock spin a few seconds when calculating,
  but have less patience when saving or opening a document. The
  system just doesn't seem to be working as hard then.

  Hardware I/O speeds are generally not improving quite as fast as
  raw computation speeds. But a lot can be done in software here.
  Many I/O bottlenecks give 10 to 1 or even 100 to 1 speed delays.
  Even though they are relevant to system operation only a small
  fraction of the time, say 10 percent, addressing these bottlenecks
  can have a big impact. If you want a graphic example of this,
  compare benchmark data of third-party 25 versus 33 MHz accelerator
  boards. With a 33 percent higher clock speed, you often see
  benchmarks only 10 or 20 percent better, because I/O is setting
  the pace.


Networks
  Enormous increases in network bandwidth are becoming available
  because of the introduction of new technologies, particularly
  optical transmission. The underlying structure of network data
  transmission on the Mac is starting to be strained by these
  capabilities.

  I recently spoke with a vendor who successfully developed an
  attractive low-cost, high-performance FDDI card with about ten
  times the effective speed of today's Ethernet systems. It failed
  as a product, however, because network throughput was bottlenecked
  at both ends of the link by packet creation and decoding time.
  This seems like an area ripe for new software
  paradigms.


Video
  There has been little improvement in the software that drives Mac
  video over the years. This reflects the fact that the Mac started
  with an excellent foundation, the original version of QuickDraw.
  Subsequent versions have improved screen draw times by about a
  factor of two, and big improvements in the future seem unlikely.


User/System
  Finally, there is one bandwidth limitation which dominates all
  others in importance, one link in the I/O chain responsible for 99
  percent of the wasted clock cycles in every Macintosh. This, of
  course, is the interface between the user and the system. Far
  outweighing compiler and implementation effects, and even swamping
  the effect of new algorithms, is how efficiently a user can
  communicate her wishes to the machine, and how in turn the machine
  can let the user understand and appreciate the results and
  implications of those wishes. The user interface metaphor is at
  once the ultimate bandwidth limitation and the single most
  important way to improve the total performance of the user-system
  combination.

  The Mac established its special position in the industry by virtue
  of its unique ability to address this one issue. Essentially, the
  key technology that enabled it to do so was software. But more
  remains to be done, and the pace of improvement in the last five
  years has not been particularly impressive. For all the two
  thousand engineer years that went into its development, is the Mac
  a lot easier to use under System 7 than it was before? I don't
  believe so, and I hope we're in for some paradigm shifting
  breakthroughs here. Personal computing could use such a shot in
  the arm today.


Conclusion
  Time-to-market and feature list forces are driving software
  developers to work in ever higher-level programming languages and
  to pay less and less attention to the efficiency of the underlying
  code. Because hardware speed has increased over the years, they
  have been able to get away with this for some time.

  But considering how much effort goes into pushing the speed
  envelope of the hardware, it seems like users would be well served
  if more emphasis were placed on software acceleration. In
  everything from mainstream applications to system software, users
  do care about speed, and software will often be the best price-
  performance technology for delivering it.

