This morning I was reading John West’s article about Intel’s acquisition of RapidMind. It’s the latest example of the High Performance Computing (HPC) industry recognizing the need to make the use of accelerators, many/multi-core processors, and cluster parallelism easier, or, to be specific with regard to the InsideHPC article, to make it easier to design software for them.
I have always viewed the need to customize your software for specific accelerators as, in most cases, a dead-end approach, be it GPU, FPGA or anything else. Granted, there is a set of exception cases where developers and end-users are prepared to go down that route, fully aware of the costs. But to really reach the larger audience you need to make it much easier and essentially hide the complexity. History is littered with the remains of accelerator companies that never really solved that problem and could only take advantage of a limited window of opportunity.
I compare this with the times when I had to do assembly programming and count cycles to squeeze out the last ounce of performance needed in the embedded realtime systems I was working on while at Ericsson. In our case it made sense to write those time-critical pieces at that low level, but for the most part we used a high-level language with built-in constructs for our most used and most critical functions (realtime signaling and communication over a high-speed network designed for telecom and defense-related applications). Only a very few developers had to deal with the complexity of assembly and really know what hardware was underneath. This approach greatly enhanced productivity when designing the actual applications, and the performance was “good enough” that we came out ahead every time.
I see many similarities between that and where the HPC industry has been with the use of accelerators and many/multi-core in parallel systems. It’s been a journey from having only low-level or hardware-specific tools available for the truly dedicated to where we now have several approaches to upleveling it, to a point where the application developer can have essentially one source code and let the “system” take care of translating it so that it takes maximum (or close enough) advantage of the hardware it runs on.
Steve Wallach of Convex and Data General fame, now at Convey Computer, has said it very well: “The architecture which is simpler to program will win.”
Apart from Intel/RapidMind, take a look at what Nvidia is doing with CUDA, OpenCL and integration with PGI compilers; what Convey Computer is doing with their HC-1 system; and, for that matter, what Apple and Microsoft are doing to promote common APIs (OpenCL and DirectX Compute, respectively).
We’re at an inflection point where the use of various types of accelerators is now easy enough for developers, and we’re getting to a point where it’s also easy to deploy. Essentially, this provides “stealth acceleration” that “just works” almost regardless of what hardware you have. It throws the door wide open for heterogeneous clusters with Grid/Cloud-level software that takes the pain out of scheduling for optimum time to results.
If I compare this with my earlier example of what we were doing for realtime networked applications, the next step would be a high-level language that allows the developer to stay close to the application code and not worry about things like how to use MPI for best performance and scaling in a cluster. Sun’s Fortress language seemed to be addressing this in a way similar to what Java did for its space. However, with the Oracle acquisition of Sun, you have to wonder: will Fortress survive? I’m hoping it will, as an open source project.