Unless you’re in a newly constructed data center, I would argue that compute density isn’t the problem you should focus on. You won’t even have the power and cooling density to fully utilize the densest systems out there.
[There are definitely exceptions to this, such as when the maximum distance for your networking essentially defines the radius of the area where your equipment has to fit.]
But for most HPC users, this isn’t the case. You’re not pushing the physical limits of electrical signaling, but your power and cooling are limited. If you’re in a data center built some years ago and you’re ready to upgrade to the next generation of hardware, then you can already get more performance out of every rack unit than was the case when the data center was built. In other words, you probably have floor space to spare when you move to newer hardware.
So why would you pay extra for higher density?
My take on it is that unless you’re in the very high end of HPC or have some other very special reason to do it, you shouldn’t. Density is not the problem to focus on. Results per watt is.
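To make "results per watt" concrete, here is a minimal sketch of comparing two systems on that metric instead of on density. The two systems and every number in it are hypothetical, purely for illustration:

```python
# Hypothetical systems: a dense blade and a plain 2U box.
# All figures are made up; plug in your own benchmark results and power draw.
systems = {
    "dense blade": {"results_per_hour": 100.0, "watts": 4_000},
    "generic 2U":  {"results_per_hour":  90.0, "watts": 3_000},
}

results_per_watt = {
    name: s["results_per_hour"] / s["watts"] for name, s in systems.items()
}

best = max(results_per_watt, key=results_per_watt.get)
print(f"best on results/watt: {best}")
```

With these made-up numbers the less dense box wins, which is exactly the point: the densest system is not automatically the most efficient one.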
If you follow that train of thought, and assume that you indeed have data center space to spare (or at least don’t need to reduce it), the first step is to look at more generic servers, which may or may not have more space in each box, and distribute them more sparsely across the space you have, in one rack or multiple racks. Remember that you (usually) get more work done per box than with the last generation of hardware you installed.
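The "distribute them more sparsely" step is really just arithmetic against your per-rack power and cooling budget. A back-of-the-envelope sketch, with all numbers assumed rather than taken from any real facility:

```python
# Assumed figures, for illustration only.
RACK_POWER_BUDGET_W = 8_000   # usable power/cooling per rack
SERVER_POWER_W = 750          # draw of one new-generation server
SERVERS_NEEDED = 40           # boxes required for the target workload

# How many servers the power budget allows per rack,
# and how many racks that spreads the fleet across.
servers_per_rack = RACK_POWER_BUDGET_W // SERVER_POWER_W
racks_needed = -(-SERVERS_NEEDED // servers_per_rack)  # ceiling division

print(f"{servers_per_rack} servers per rack, {racks_needed} racks total")
```

Note that the rack here holds far fewer servers than it has rack units for; the power budget, not the floor space, is the binding constraint.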
Now, this may not meet your overall performance requirements. If so, it’s time to look at accelerators like GPUs or FPGAs and replace or complement your x86 servers with them. Depending on factors like your applications, whether you have access to source code, and whether you have the skills to deal directly with FPGAs, you’ll end up at your own spot in this range of solutions. Nvidia, for example, has been working on this for a long time and has a nice set of applications ready to take advantage of its Tesla GPUs, as well as good development tools that make it easy to use them with an application or to develop for them. Or, if you do have the skills to work with FPGAs directly, and the volume and budget to support it, you could create a very specific accelerator for your needs.
The important thing is that by deploying accelerators like this you can address your overall performance requirements and still solve for “results per watt”.
At this point you have a so-called “nice” problem to contend with. This is where you need to decide whether you want to get maximum performance out of the power/space/money budgets you have to work with, or whether you’re OK meeting a certain performance level and instead minimizing the number of boxes you need to get there. I.e. do you exceed your performance target within your money/power/space budgets, or do you give something back from your budgets?
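That trade-off can be sketched in a few lines. Option A fills the power budget and takes all the performance it can get; option B buys only enough nodes to hit the target and hands power back. Every figure below is assumed, not measured:

```python
import math

# Assumed figures, for illustration only.
POWER_BUDGET_W = 20_000   # total power available
NODE_POWER_W = 2_000      # draw per accelerated node
NODE_PERF = 1.5           # results/hour per node (arbitrary unit)
PERF_TARGET = 12.0        # required results/hour

# Option A: spend the whole power budget, take the extra performance.
nodes_max = POWER_BUDGET_W // NODE_POWER_W
perf_max = nodes_max * NODE_PERF

# Option B: buy only enough nodes to hit the target, return the rest.
nodes_min = math.ceil(PERF_TARGET / NODE_PERF)
power_returned = POWER_BUDGET_W - nodes_min * NODE_POWER_W

print(f"A: {nodes_max} nodes, {perf_max} results/hour")
print(f"B: {nodes_min} nodes, {power_returned} W handed back")
```

Neither answer is wrong; which one you pick depends on whether surplus performance or a reclaimed budget is worth more to your organization.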