On 1/21/2015 9:05 AM, Bill Long wrote:
> On Jan 21, 2015, at 1:15 AM, Ian D Chivers wrote:
>
>> Jane and I are writing some new examples
>> for the next edition of the book
>> and have been benchmarking an OpenMP
>> summation example and comparing the timing
>> against the SUM intrinsic.
>>
>> Are we likely to see parallel
>> versions of any of the intrinsics?
> It depends on the implementation.
>
> But, in general, routines that are BLAS 1 replacements will not be threaded (though very likely vectorized). Examples include SUM and DOT_PRODUCT. These tend to be memory bandwidth constrained and would not benefit from threading. On the other hand, BLAS 3 routines are compute bound, so a threaded version of MATMUL would be entirely reasonable.
>
> Some other cases are probably dependent on the processor architecture. For example, a standalone math function evaluation (ERFC, for example) that had a very large array argument could benefit from automatic threading as long as the threading overhead is lightweight. So, a processor with many hardware threads on the chip (Intel Xeon Phi, for example) could be a good target for threaded library routines.
>
> Typically, even for cases that default to threaded, there will be a non-threaded version that would be used within an OpenMP region.
>
> Cheers,
> Bill
>
MATMUL already has facilities in several compilers for automatic library
function calls. In Cray compilers, this goes back decades. gfortran has
the -fexternal-blas option to turn MATMUL into a BLAS ?gemm call above a
size threshold, and ifort has the opt-matmul options to call into the
MKL library (skipping some of the overhead of ?gemm). As Bill hinted,
these options may need to be avoided in OpenMP regions (and their
behavior depends on specific thread pinning options under MPI).
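
As a rough illustration of the MATMUL case, here is a minimal sketch (my
own test program, not taken from any compiler documentation) that times
the intrinsic on square matrices. The matrix order n = 500 and the
program name are arbitrary assumptions.

```fortran
! Minimal sketch: time the MATMUL intrinsic on square matrices.
! The matrix order n = 500 is an arbitrary choice for illustration.
program matmul_sketch
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  integer, parameter :: n = 500
  real(dp), allocatable :: a(:,:), b(:,:), c(:,:)
  integer(int64) :: t0, t1, rate

  allocate (a(n,n), b(n,n), c(n,n))
  call random_number(a)
  call random_number(b)

  call system_clock(t0, rate)
  c = matmul(a, b)          ! candidate for replacement by a library call
  call system_clock(t1)

  ! Print an element of c so the compiler cannot discard the product.
  print '(a,f10.4,a,es12.4)', 'matmul seconds: ', &
        real(t1 - t0, dp) / real(rate, dp), '   c(1,1) = ', c(1,1)
end program matmul_sketch
```

With gfortran, building this once as usual and once with -fexternal-blas
(linking whatever BLAS library is installed) should show whether the
library call helps at this size; the threshold can be adjusted with
-fblas-matmul-limit=n.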

DO CONCURRENT is a frequent topic for auto-parallel optimization,
although the goal of using it to make such optimizations more portable
remains elusive.
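
For readers who have not used the construct, a minimal sketch of a
DO CONCURRENT loop follows. The array size and names are assumptions
made for illustration; whether the loop ends up threaded, vectorized,
or neither depends entirely on the compiler and its options.

```fortran
! Minimal sketch: an element-wise update written with DO CONCURRENT.
program do_concurrent_sketch
  implicit none
  integer, parameter :: n = 1000000
  real, allocatable :: x(:), y(:)
  real :: a
  integer :: i

  allocate (x(n), y(n))
  call random_number(x)
  call random_number(y)
  a = 2.0

  ! Iterations are independent, so the compiler may run them in any order.
  do concurrent (i = 1:n)
     y(i) = y(i) + a * x(i)
  end do

  print '(a,es12.4)', 'y(n) = ', y(n)
end program do_concurrent_sketch
```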

Vendor-optimized BLAS 1 libraries may include automatic threading of
?dot and ?sum beyond a size threshold on the order of 10000 elements.
You would not see this inside an OpenMP region unless you set OMP_NESTED.
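
The comparison in the original question can be sketched roughly as
below. This is not Ian and Jane's example, only a minimal version under
my own assumptions: the array length, variable names, and timing via
SYSTEM_CLOCK are all arbitrary choices.

```fortran
! Minimal sketch: OpenMP reduction versus the SUM intrinsic.
! The array length n is an arbitrary choice for illustration.
program sum_sketch
  use, intrinsic :: iso_fortran_env, only: dp => real64, int64
  implicit none
  integer, parameter :: n = 10000000
  real(dp), allocatable :: x(:)
  real(dp) :: s_omp, s_intr
  integer(int64) :: t0, t1, t2, rate
  integer :: i

  allocate (x(n))
  call random_number(x)

  call system_clock(t0, rate)
  s_omp = 0.0_dp
  !$omp parallel do reduction(+:s_omp)
  do i = 1, n
     s_omp = s_omp + x(i)
  end do
  !$omp end parallel do
  call system_clock(t1)

  s_intr = sum(x)
  call system_clock(t2)

  print '(a,f10.4,a,f10.4)', 'OpenMP loop s: ', real(t1-t0,dp)/real(rate,dp), &
        '   SUM intrinsic s: ', real(t2-t1,dp)/real(rate,dp)
  ! The two results may differ in the last digits because the
  ! summation order differs.
  print '(a,es10.2)', 'relative difference: ', abs(s_omp - s_intr)/abs(s_intr)
end program sum_sketch
```

Compile with the compiler's OpenMP option (-fopenmp for gfortran);
without it the directive is treated as a comment and the loop runs
serially. Since the operation is memory bandwidth bound, as Bill notes,
the threaded loop may show little or no gain over SUM.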
--
Tim Prince