Colleagues,
This is meant to be a summary of one coding group's experience with the three aspects of Fortran programming named above, and to elicit other opinions, experiences, and insights.
Background: Our group provides commercially available software for the building design and construction industries. Our most extensive code is, essentially, an elaborate radiative transfer analysis of a building. Input is user-produced CAD along with supporting data that describes the building's equipment. The essential computational tasks involve computational geometry, radiative transfer analysis, setting up and solving systems of equations, and so on -- typical tasks for most large-scale engineering analysis systems. Over the past 10 years we have generated, and now modify/maintain/update, about 250,000 lines of code.
Vectorization: This has proved to be (for our work, at least) the most important and efficacious optimization technique -- by far. The analysis of even a modest-sized project involves tens of millions of dot-products, cross-products, and geometric bounds checks. Good practice of a decade ago had data arranged so that, say, the x,y,z Cartesian coordinates of a vertex were contiguous in memory -- x:y:z -- which is best for computing an individual dot- or cross-product. Now it is best to arrange arrays so that all the x coordinates are contiguous -- x1:x2:x3: . . . :xn -- and similarly for y and z; or at least to maintain a duplicate data set with the coordinates arranged so. A large set of dot-products can then be processed in vectorizable form:
DotProd(1:N) = CoorA(1:N,1)*CoorB(1:N,1)+CoorA(1:N,2)*CoorB(1:N,2)+CoorA(1:N,3)*CoorB(1:N,3)
where CoorA(1:N,1) holds the x-coordinates of all N surfaces, and so on. We have found the speed-up to be larger than that expected from the SIMD width alone (4, in our case); evidently memory is (much) better used/accessed this way. In general, we have found this to be (much) faster even when some of the dot-products produced are not used or are inappropriate. That is, it's better to throw away some of the vectorized results than to go to the trouble of not computing them. We have found the speed-up even greater in the cross-product-intensive parts of our code.
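For concreteness, a minimal sketch of the declarations the snippet above implies: since Fortran stores arrays in column-major order, declaring the coordinate arrays as (N,3) makes each coordinate component one contiguous block.

    real, allocatable :: CoorA(:,:), CoorB(:,:), DotProd(:)
    allocate( CoorA(N,3), CoorB(N,3), DotProd(N) )
    ! Column-major storage: CoorA(1:N,1) -- the x-coordinates of all
    ! N surfaces -- occupies one contiguous block of memory, as do
    ! the y column CoorA(1:N,2) and the z column CoorA(1:N,3).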
We have found that axis-aligned bounding-box checking is another important opportunity for vectorization. In the following, InOut is a vector of integers:
InOut(1:N) = merge( 1, 0, ( Coor(1:N,1) < BoxMaxX ) .and. &
                          ( Coor(1:N,2) < BoxMaxY ) .and. &
                          ( Coor(1:N,3) < BoxMaxZ ) )
The check against the bounding box minimum coordinates can be (often is) concatenated onto the max check. In general (that is, statistically) we find this to be considerably faster than an explicit, early-out loop that checks x, then y, then z. Obviously, if any of the checks fail, the value of InOut for that surface will be zero. In this regard, we have been looking for an efficient way to pack the zeros out of a long vector -- without success so far. The intrinsic PACK routine is hopelessly slow. We also wonder (we've made no investigation yet) whether such results are better stored in vectors of smaller element byte length: 2-byte or 1-byte integers. If the results are later operated on repeatedly, and SIMD is used, then instead of 4-at-a-time, the testing/evaluating/checking can be done 8-at-a-time, or in even larger clumps.
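For reference, the baseline we are trying to beat is the obvious sequential compaction loop, sketched below (PackNonzero and Idx are just illustrative names):

    subroutine PackNonzero(InOut, N, Idx, M)
       ! Gather the indices of the nonzero entries of InOut into Idx.
       integer, intent(in)  :: N
       integer, intent(in)  :: InOut(N)
       integer, intent(out) :: Idx(N)
       integer, intent(out) :: M          ! number of surviving entries
       integer :: i
       M = 0
       do i = 1, N
          if (InOut(i) /= 0) then
             M = M + 1
             Idx(M) = i
          end if
       end do
    end subroutine PackNonzero

The serial dependence on M is presumably what makes a fast vectorized pack so elusive.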
All this is obvious. But what is important (to us, at least) is that in general, in practice (statistically, for most projects), the speed-up is significant and worth the significant and widespread changes in code required. This is an important consideration for those dealing with valuable legacy code. And to some extent it requires a different type of thought (maybe even different algorithms) when generating new code. We imagine that as the SIMD registers get larger, these effects will become even more pronounced.
Parallelization: We have found that in general, and for our code, parallelization by threading is essentially useless. (Our team jokes that parallelization/OpenMP isn't a false promise, it's a cruel hoax.) To be sure, there is lots of evidence that there are many cases where sharing work among multiple threads is very efficacious. But we find, almost always, that the overhead involved completely swamps whatever gain there might be. Some of this is due to the nature of what we are computing. There are very, very few places in our analysis where the work to be done is "tight"; that is, expressible or accomplishable with just a few operations and so just a few lines of code -- as when one multiplies matrices, or manipulates 10^8 pixels in an image. In general, the work to be done in our code is elaborate, and so the work necessary to establish threads is also elaborate. If, for example, we have 10^4 surfaces, then we have 10^8 occlusion analyses to do (can one surface "see" another?). There might be 10^3 potential blocking surfaces to check, with each check requiring a relatively elaborate analysis. By the time we back out of the nested loops far enough to keep the overhead/setup time from being prohibitive, it proves better (by far) to use the coarray Fortran paradigm, as sketched below. We are particularly interested in others' experience (and advice!) in this regard.
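Schematically, the loop nest looks something like this (a simplified sketch; CheckOcclusion stands in for the elaborate per-pair analysis):

    do i = 1, NSurf                  ! ~10^4 surfaces
       do j = 1, NSurf               ! ~10^8 (i,j) occlusion analyses
          do k = 1, NBlockers        ! ~10^3 potential blockers per pair
             call CheckOcclusion(i, j, k)   ! elaborate; not "tight"
          end do
       end do
    end do

Threading the inner loops drowns in overhead; by the time the outer loop is the unit of work, handing whole chunks of the i-range to separate images is the better fit.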
Having written that, I should add that there are some (very few) times when threading is efficacious: as in matrix multiplication. By the way, if you are interested in a crystal-clear, practical, detailed exposition of how such a task can be handled, we suggest you view the series of videos that Jim Dempsey (a frequent and important contributor to this forum) has produced. You can find the link at his web site.
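For completeness, the textbook form of that kind of "tight" loop -- the kind where OpenMP does pay off -- is the familiar one below (a sketch only; the videos cover far more sophisticated blockings):

    ! C = A*B for NxN matrices; C assumed zeroed beforehand.
    ! The j-k-i ordering keeps the inner loop contiguous (column-major),
    ! and the outer j loop gives each thread independent columns of C.
    !$omp parallel do private(i, k)
    do j = 1, N
       do k = 1, N
          do i = 1, N
             C(i,j) = C(i,j) + A(i,k)*B(k,j)
          end do
       end do
    end do
    !$omp end parallel do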
In general, we have found that evaluations of various optimization techniques that use matrix multiplication are not useful, because they are NOT indicative of what is required for scientific/engineering work that involves repeated use of an elaborate or lengthy process. I don't mean to sound silly, but we no longer pay attention to claims (or evaluations) that involve matrix multiplication. The problem is, in many ways, trivial and not sufficiently indicative. The difficult and expensive work is setting up the matrices or system of equations, not multiplying the matrices or solving the system.
Coarray Fortran: We have had considerable success with this. Very considerable. Our approach does not focus on the shared data between images (the coarrays), but rather on the opportunity to have multiple instances of (very nearly) identical code working on pieces of very large problems. We note the following. The most difficult part of making effective use of multiple images is predicting the work load. We have had to spend considerable time developing quick, effective ways to predict work and so generate more-or-less even workloads for each image -- in our case, for example, simple functions involving surface area, orientation, square of separating distance, and so on. This turns out to be important (and non-trivial), since it doesn't help to have 1 or 2 of the images doing all the heavy lifting. In this regard, we have found it useful to have a non-coarray Fortran program do an initial analysis and determine the workload, and then have it launch a coarray Fortran program that establishes multiple images and performs the work.
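A minimal sketch of the kind of balancing the launcher does -- greedy assignment of each surface to the currently least-loaded image (EstimateCost stands in for our quick work predictors):

    subroutine BalanceLoad(NSurf, NImages, Owner)
       integer, intent(in)  :: NSurf, NImages
       integer, intent(out) :: Owner(NSurf)  ! image assigned to each surface
       real, external :: EstimateCost        ! quick work predictor (illustrative)
       real    :: Load(NImages)
       integer :: i, lightest
       Load = 0.0
       do i = 1, NSurf
          lightest = minloc(Load, dim=1)     ! least-loaded image so far
          Owner(i) = lightest
          Load(lightest) = Load(lightest) + EstimateCost(i)
       end do
    end subroutine BalanceLoad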
As Steve Lionel has mentioned several times, the implementation of coarrays in the Intel Fortran compiler is a work in progress, and aspects of it will improve over time. For the present, we find communication between images using coarrays directly to be too slow; communication using files is faster. (We were surprised, too.) This may change. Currently, we limit coarray-based communication between images to the start and end of the work to be done, and each image writes its result to a file. The "launcher" Fortran program (having waited for all images to finish) then gathers the results into a single, neat package.
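Schematically, the end of each image's work looks something like this (a sketch; MyResults is a placeholder for whatever the image produces):

    character(len=32) :: fname
    integer :: u
    ! One file per image, named by image index.
    write(fname, '(a,i0,a)') 'results_', this_image(), '.dat'
    open(newunit=u, file=trim(fname), form='unformatted', access='stream')
    write(u) MyResults
    close(u)

The launcher waits for all images to finish, then reads results_1.dat, results_2.dat, . . . and assembles the final package.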
We have found it important to limit the number of images to the number of physical cores present on the host machine. Using the virtual cores in addition to the physical ones generally slows the overall process. And so setting the appropriate number-of-images environment variable (FOR_COARRAY_NUM_IMAGES, in the Intel implementation, if I have it right) is very important, since we have found the slowing effect can be considerable. Several months ago, Steve provided a routine, callable from Fortran, that returns this information about a host.
We strongly suspect that Coarray Fortran will be our team's most significant investment in optimizing our engineering code in the future.
Perhaps I should apologize for such a long post, but it is a very interesting subject and we are interested in others' experiences and findings.
David