Writing multithreaded code is hard. You’d like to utilize all of the machine’s processing power, keep the code simple, and avoid data races at the same time.
Let’s see how C++17 can make writing parallel code a bit easier.
Intro
With C++11/14 we’ve finally got threading into the standard library. You can now create std::thread
and not just depend on third party libraries or a system API. What’s more, there’s also async processing with futures.
For example, in 2014 I wrote about using async tasks in this article: Tasks with std::future and std::async.
Multithreading is a significant aspect of modern C++. In the committee, there’s a separate “SG1, Concurrency” group that works on bringing more features to the standard.
What’s on the way?
- Coroutines,
- Atomic smart pointers,
- Transactional memory,
- Barriers,
- Task blocks,
- Parallelism,
- Compute,
- Executors,
- Heterogeneous programming models support,
- maybe something more?
And why do we want to bring all of those features?
There’s a famous talk from Sean Parent about better concurrency. It was a keynote at C++Now 2012; here’s a more recent version from code::dive 2016.
Do you know how much processing power of a typical desktop machine we can utilize using only the core version of C++/Standard Library?
50%,
100%?
10%?
Sean explained in his talk that we can usually access only around 0.25% of a typical machine’s computing power with single-threaded C++ code, and maybe a few percent more when you add threading from C++11/14.
So where’s the rest of the power?
GPU and Vectorization (SIMD) from CPU.
| GPU power | CPU vectorization | CPU threading | Single thread |
|---|---|---|---|
| 75% | 20% | 4% | 0.25% |
Of course, some third party APIs allow you to access GPU/vectorization: for example, we have CUDA, OpenCL, OpenGL, vectorized libraries, etc. There’s even a chance that your compiler will try to auto-vectorize some of the code. Still, we’d like to have such support directly from the Standard Library. That way common code can be used on many platforms.
With C++11/14 we got a lot of low-level features. But it’s still tough to use them effectively. What we need is an abstraction. Ideally, code should be auto-threaded/parallelized, of course with some guidance from a programmer.
C++17 moves us a bit forward and allows us to use more computing power: it unlocks auto-vectorization/auto-parallelization for algorithms in the Standard Library.
Of course, not everything can be made parallel or multithreaded; Amdahl’s law sets a limit. So always using 100% (110% with CPU boost :)) of the machine’s power is only a theoretical case. Still, it’s better to strive for it than to write everything single-threaded.
The Series
This post is the seventh in the series about C++17 features.
The plan for the series
- Fixes and deprecation
- Language clarification
- Templates
- Attributes
- Simplification
- Library changes - Filesystem
- Library changes - Parallel STL (today)
- Library changes - Utils (soon + 1)
- Wrap up, Bonus
Documents & Links
Just to recall:
First of all, if you want to dig into the standard on your own, you can read the latest draft here:
N4659, 2017-03-21, Draft, Standard for Programming Language C++ - from isocpp.org.
Also, you can grab my list of concise descriptions of all of the C++17 features. It’s a one-page reference card.
Links:
- Compiler support: C++ compiler support
- The official paper with changes: P0636r0: Changes between C++14 and C++17 DIS
- There’s also a talk from Bryce Lelbach: C++Now 2017: C++17 Features
- My master C++17 features post: C++17 Features
- Jason Turner: C++ Weekly channel, where he covered most (or even all!) of C++17 features.
OK, let’s discuss the parallel algorithms!
Overview
I’ve already mentioned the reasoning why we want to have so many ‘tools’ for multithreading/computing in the Standard.
The TS paper describing what was merged into the Standard: P0024R2
The new feature looks surprisingly simple from a user point of view. You just have a new parameter that can be passed to most of the std algorithms: this new parameter is the execution policy.
std::algorithm_name(policy,/* normal args... */);
I’ll go into detail later, but the general idea is that you call an algorithm and specify how it can be executed: in parallel, maybe vectorized, or just serially.
That hint is necessary because the compiler cannot deduce everything from the code (at least not yet :)). We, as the authors of the code, know whether there are any side effects, possible race conditions, deadlocks, or whether there’s no sense in running it in parallel (e.g., if you have a small collection of items).
Current implementation
I hope this article will be soon updated, but for now, I have bad news.
Unfortunately, as of today, none of the major compilers support the feature.
However, you can play with the following implementations/APIs:
- Codeplay: http://github.com/KhronosGroup/SyclParallelSTL
- HPX: http://stellar-group.github.io/hpx/docs/html/hpx/manual/parallel.html
- You can have a look at Rainer’s article: C++17: New Parallel Algorithms where he used HPX for code samples.
- Parallel STL - https://parallelstl.codeplex.com/
- Intel - https://software.intel.com/en-us/get-started-with-pstl
- n3554 - proposal implementation (inited by Nvidia) https://github.com/n3554/n3554
- Thibaut Lutz: http://github.com/t-lutz/ParallelSTL
Execution policies
The execution policy parameter will tell the algorithm how it should be executed. We have the following options:
sequenced_policy
- an execution policy type used as a unique type to disambiguate parallel algorithm overloading and require that a parallel algorithm’s execution may not be parallelized,
- the corresponding global object is std::execution::seq.
parallel_policy
- an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized,
- the corresponding global object is std::execution::par.
parallel_unsequenced_policy
- an execution policy type used as a unique type to disambiguate parallel algorithm overloading and indicate that a parallel algorithm’s execution may be parallelized and vectorized,
- the corresponding global object is std::execution::par_unseq.
Note that those are unique types, with their corresponding global objects. It’s not just an enum.
Sequential execution seems obvious, but what’s the difference between par and par_unseq?
I like the example from Bryce Adelstein Lelbach’s talk:
If we have code like:
double mul(double x, double y) {
    return x * y;
}

// x, y, z are equally sized sequences, e.g. std::vector<double>
std::transform(
    x.begin(), x.end(), // "Left" input sequence
    y.begin(),          // "Right" input sequence
    z.begin(),          // Output sequence
    mul);
With sequential execution, the following instructions are executed for each element:
load x[i]
load y[i]
mul
store into z[i]
With the par policy, the whole mul() for the i-th element will be executed on one thread; the operations won’t be interleaved. But different values of i can be processed on different threads.
With par_unseq, each operation of mul() can be on a different thread, interleaved. In practice, it can be vectorized like:
load x[i... i+3]
load y[i...i+3]
mul // four elements at once
store into z[i...i+3]
Plus, each of such vectorized invocation might happen on a different thread.
With par_unseq, function invocations might be interleaved, so vectorization-unsafe code is not allowed: no mutexes, no memory allocation, etc. More on that here: @cppreference.
Also, the current approach allows nonstandard policies, so compiler/library vendors can provide their own extensions.
Let’s now see what algorithms were updated to handle the new policy parameter.
Algorithm update
Most of the algorithms from the Standard Library (those that operate on containers/ranges) can handle an execution policy.
What have we here?
- adjacent_difference, adjacent_find
- all_of, any_of, none_of
- copy
- count
- equal
- fill
- find
- generate
- includes
- inner_product
- inplace_merge, merge
- is_heap, is_partitioned, is_sorted
- lexicographical_compare
- min_element, minmax_element
- mismatch
- move
- nth_element
- partial_sort, partial_sort_copy
- partition
- remove + variations
- replace + variations
- reverse / rotate
- search
- set_difference / set_intersection / set_union / set_symmetric_difference
- sort
- stable_partition
- swap_ranges
- transform
- unique
The full list can be found here: @cppreference.
A simple example:
std::vector<int> v = genLargeVector();
// standard sequential sort
std::sort(v.begin(), v.end());
// explicitly sequential sort
std::sort(std::execution::seq, v.begin(), v.end());
// permitting parallel execution
std::sort(std::execution::par, v.begin(), v.end());
// permitting vectorization as well
std::sort(std::execution::par_unseq, v.begin(), v.end());
New algorithms
A few existing algorithms weren’t ‘prepared’ for parallelism; instead, we have new, similar versions:
- for_each - similar to std::for_each, except it returns void,
- for_each_n - applies a function object to the first n elements of a sequence,
- reduce - similar to std::accumulate, except with out-of-order execution,
- exclusive_scan - similar to std::partial_sum, excludes the i-th input element from the i-th sum,
- inclusive_scan - similar to std::partial_sum, includes the i-th input element in the i-th sum,
- transform_reduce - applies a functor, then reduces out of order,
- transform_exclusive_scan - applies a functor, then calculates the exclusive scan,
- transform_inclusive_scan - applies a functor, then calculates the inclusive scan.
For example, we can use for_each (or the new for_each_n) with an execution policy, assuming we don’t need the return value of the original std::for_each.
Also, there’s an interesting case with reduce. This new algorithm provides a parallel version of accumulate. But it’s important to know the difference.
Accumulate returns the sum of all the elements in a range (or a result of a binary operation that can be something other than just a sum).
std::vector<int> v{1,2,3,4,5,6,7,8,9,10};
int sum = std::accumulate(v.begin(), v.end(),/*init*/0);
std::accumulate is sequential only; in contrast, a parallel reduce will try to compute the final sum using a tree approach (sum sub-ranges, then merge the results: divide and conquer). Such a method can invoke the binary operation/sum in a nondeterministic order. Thus, if binary_op is not associative or not commutative, the behavior is also nondeterministic.
For example, we’ll get the same results from accumulate and reduce for a vector of integers (when doing a sum), but we might get slightly different results for a vector of floats or doubles. That’s because floating-point operations are not associative.
Summary
Is that the end for today?
Multithreading/concurrency/parallelism are huge topics to discover and understand. I hope to return with more examples (possibly with a working implementation in common compilers!). For now, I’ve described only the tip of the iceberg :)
From this post, I’d like you to remember that concurrency/parallelism is one of the key areas in the C++ standard and a lot of work is being done to bring more features.
With C++17, we get a lot of algorithms that can be executed in a parallel/vectorized way. That’s amazing, as it’s a solid abstraction layer that makes writing parallel apps much easier. Something similar could possibly be achieved with C++11/14 or third-party APIs, but now it’s all in the standard.
- Do you use any other parallel libraries? CUDA? SYCL? Intel TBB? Something else?
- Do you try to make your code multithreaded, or do you write most of your code single-threaded?
Below I’ve also gathered a few valuable resources/articles/talks so that you can learn more.
Resources
The original paper for the spec: P0024R2
The initial TS paper: PDF: A Parallel Algorithms Library | N3554
ModernesCpp articles about parallel STL:
- C++17: New Parallel Algorithms of the Standard Template Library
- Parallel Algorithm of the Standard Template Library - ModernesCpp.com
Bryce Adelstein Lelbach’s talk about parallel algorithms, which contains a lot of examples for the map-reduce (transform_reduce) algorithm:
And the Sean Parent talk about better concurrency in C++