



# Using High-Level C++ for HEP Data Processing on Accelerators

Attila Krasznahorkay

#### **KKIO 2021**

XXII KKIO Software Engineering Conference Online from Krakow 21-22 September 2021



#### Outline



- Data processing software at the LHC
  - And why it needs to worry about accelerators
- An overview of current accelerators, and their programming languages
  - With some information on how we are using / planning to use these features
  - Putting special emphasis on memory management techniques with modern C++
- An insight in the kind of software R&D happening in HEP and in ATLAS at the moment

#### ATLAS And Its Offline Software



- ATLAS is one of the general-purpose experiments at the <u>Large Hadron</u>
   Collider
  - Built/operated by the largest collaboration for any physics experiment ever
- The software (<u>atlas/athena</u>, <u>atlassoftwaredocs</u>) written for processing its data is equally large
  - ~4 million lines of C++ and ~2 million lines of Python



#### Data Reconstruction in ATLAS





#### Data Reconstruction in ATLAS





#### Data Reconstruction in ATLAS





# Why Accelerators?

## (High Performance) Computing in 2021



- Computing has been getting more and more complicated in the last decades
  - A modern CPU has a very complicated design, mainly to make sure that (our!) imperfect programs would execute fast on it
- Complexity shows up both "inside of single computers", but also in the structure of computing clusters
  - A modern computing cluster has different nodes connected to each other in a non-trivial network
- All the added complexity is there to achieve the highest possible theoretical throughput "for certain calculations" on these machines





## (High Performance) Computing in 2021





## **NVIDIA**



- Supercomputers all use accelerators
- Which come in many shapes and sizes
  - NVidia GPUs are the most readily available in general, used/will be in <u>Summit</u>, <u>Perlmutter</u>, <u>LEONARDO</u> and <u>MeluXina</u>
  - AMD GPUs are not used too widely in comparison, but will be in <u>Frontier</u>, <u>El Capitan</u> and <u>LUMI</u>
  - Intel GPUs are used even less at the moment,
     but will get center stage in <u>Aurora</u>
  - FPGAs are getting more and more attention, and if anything, they are even more tricky to write (good) code for
- Beside HPCs, commercial cloud providers also offer an increasingly heterogeneous infrastructure

#### Why HEP/ATLAS Cares About Accelerators





- As described in
   CERN-LHCC-2020-015, being able to process the data collected in LHC
   Run 4 (and beyond) in ATLAS
   requires major software developments
  - In order to fit into our "CPU budget", we need to consider new approaches in our data processing
- One of these areas is to look at non-CPU resources

## Multiprocessing, Multithreading



- "Simple" applications are almost always single threaded
  - This is what you get by default out of most programming languages. A single execution thread performing tasks one by one.
- Luckily many tasks in HEP are embarrassingly parallel
  - We can just start N instances of the application, all doing different things.
- Usually (at least in HEP) when memory usage becomes an issue, the application needs to become multi-threaded
  - Where a single process executes calculations on multiple threads in parallel.





#### **HEP Software**



- Most (but not absolutely all) HEP software is written in C++ these days
  - We even agreed on a single platform (<u>Threading Building Blocks</u>) for our multithreading
- LHC experiments, mostly driven by their (our...) memory hungry applications, are all migrating to multithreaded workflows by now
  - ATLAS will use a multithreaded framework for triggering and reconstructing its data during LHC Run-3
  - However smaller HEP/NP experiments are still happily using multiprocessing to parallelise their data processing
- It is in this context that we are looking towards upgrading our software to use non-x86 computing as well

## What Accelerators?

#### **GPGPUs**



- General Purpose GPUs (GPGPUs) are the "most common" accelerators
- They can achieve very high theoretical
   FLOPs because they have a lot of units for performing floating point calculations
- But unlike CPUs, these cores are not independent of each other
  - Control units exist for large groups of computing cores, forcing the cores to all do the same thing at any given time
  - Memory caching is implemented in a much simpler way for these computing cores than for CPUs
- Coming even close to the theoretical limits of accelerators is only possible with purpose designed algorithms



#### FPGAs / ASICs



```
vectors and fill two with random values.
                  vec_a(kSize), vec_b(kSize), vec_r(kSize);
                  i < kSize; i++) {
   vec_b[i] = rand();
 // Select either:
       the FPGA emulator device (CPU emulation of the FPGA)
       the FPGA device (a real FPGA)
#if defined(FPGA_EMULATOR)
 ext::intel::fpga_emulator_selector device_selector;
 ext::intel::fpga_selector device_selector;
#endif
 try {
   // Create a gueue bound to the chosen device.
   // If the device is unavailable, a SYCL runtime exception is thrown.
   queue q(device_selector, dpc_common::exception_handler);
   // Print out the device information.
   std::cout << "Running on device: "
             << q.get_device().get_info<info::device::name>() << "\n"
```

- Will become important as well, but at the moment are a bit less important with "generic algorithms"
  - They are normally suited better for well-defined/understood data processing steps. For instance decoding data coming from the detector.
- The software projects to know about with these are <u>Intel's oneAPI</u> and various High Level Synthesis (HLS) implementations

#### The Future of CPUs/GPUs (?)





#### Skylake + FPGA on Purley



- Power for FPGA is drawn from socket & requires modified Purley platform specs
- Platform Modifications include Stackup, Clock, Power Delivery, Debug, Power up/down sequence, Misc IO pins (see BOM cost section)

| Cores                                                                                 | Up to 28C with Intel® HT Technology                                                                                                                    |                                                                       |
|---------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------|
| FPGA                                                                                  | Altera® Arria 10 GX 1150                                                                                                                               |                                                                       |
| Socket TDP                                                                            | Shared socket TDP<br>Up to 165W SKL & Up to 90W FPGA                                                                                                   |                                                                       |
| Socket                                                                                | Socket P                                                                                                                                               |                                                                       |
| Scalability                                                                           | Up to 25 - with SKL-SP or SKL + FPGA SKUs                                                                                                              |                                                                       |
| РСН                                                                                   | Lewisburg: DMI3 – 4 lanes; 14xUSB2 ports<br>Up to: 10xUSB3; 14xSATA3, 20xPCIe*3 New: Innovatio<br>Engine, 4x10GbE ports, Intel® QuickAssist Technology |                                                                       |
|                                                                                       | For CPU                                                                                                                                                | For FPGA                                                              |
| Memory                                                                                | 6 channels DDR4<br>RDIMM, LRDIMM,                                                                                                                      | Low latency access to<br>system memory via UPI &<br>PCIe Interconnect |
|                                                                                       | 2666 1DPC,<br>2133, 2400 2DPC                                                                                                                          |                                                                       |
| Intel® UPI                                                                            | 2 channels<br>(10.4, 9.6 GT/s)                                                                                                                         | 1 channel<br>(9.6 GT/s)                                               |
| PCIe*                                                                                 | PCIe* 3.0<br>(8.0, 5.0, 2.5 GT/s)                                                                                                                      | PCIe* 3.0<br>(8.0, 5.0, 2.5 GT/s)                                     |
|                                                                                       | 32 lanes per CPU<br>Bifurcation support<br>x16, x8, x4                                                                                                 | 16 lanes per FPGA<br>Bifurcation support:<br>x8                       |
| High Speed<br>Serial Interface<br>(Different board<br>design based on<br>HSSI config) | N/A                                                                                                                                                    | 2xPCle 3.0 x8                                                         |
|                                                                                       |                                                                                                                                                        | Direct Ethernet<br>(4x10 GbE, 2x40 GbE,<br>10x10 GbE, 2x25 GbE)       |

- Is quite uncertain...
  - These days even the future of x86 seems
     to be in some jeopardy
- Heterogeneous seems to be the key
  - Some CPUs already have different cores, meant for different tasks
  - CPU+GPU combinations will likely become more and more popular in HPCs
    - Making it possible to manage the memory of applications more easily
  - GPUs are not even the only game in town
    - "FPGA inserts" may become a part of future high-performance CPUs/GPUs...

# (Current) Programming Languages

#### C++...?



- Just as with "CPU languages", there is no single language for writing accelerator code with
  - But while HEP settled on C++ for CPUs, at this point the whole community just can't settle on a single language for accelerators yet
- However most of these languages are at least C/C++ based
  - But unfortunately each of them have (slightly) different capabilities



- Multiple projects exist / are actively developed for hiding this complexity from the programmers (<u>Kokkos</u>, <u>Alpaka</u>, <u>Thrust</u>, <u>Parallel STL</u>, etc.)
- Eventually the goal is to make heterogeneous programming part of the ISO C++ standard
  - I will try to show the most interesting/important fronts on which this is happening

#### **CUDA**



 NVidia/CUDA is the most established player in this game



- As such they have the most support in existing applications, the best documentation, etc.
- Originally designed as a C language/library
  - Over the years getting more and more C++ support
  - By now supporting even some C++17 features in "device code", including some "light amount" of virtualisation
- Practically only supported on NVidia hardware

```
void cudaMultiplyKernel( int n, float* array, float multiplier ) {
 const int index = blockIdx.x * blockDim.x + threadIdx.x;
 if(index >= n) {
 array[ index ] *= multiplier;
void cudaMultiply( std::vector< float >& array, float multiplier ) {
 int nCudaDevices = 0:
  CUDA CHECK( cudaGetDeviceCount( &nCudaDevices ) ):
 if( nCudaDevices == 0 ) {
 float* deviceArray = nullptr;
 CUDA CHECK( cudaMalloc( &deviceArray, sizeof( float ) * array.size() ) );
 CUDA CHECK( cudaMemcpy( deviceArray, array.data(),
                         sizeof( float ) * array.size(),
                         cudaMemcpyHostToDevice ) );
  // Run the kernel.
  static const int blockSize = 256;
  const int numBlocks = ( array.size() + blockSize - 1 ) / blockSize;
 cudaMultiplyKernel<<< numBlocks, blockSize >>>( array.size(),
                                                  deviceArray,
                                                  multiplier );
 CUDA CHECK( cudaDeviceSynchronize() );
 // Copy the array back to the host's memory.
  CUDA CHECK( cudaMemcpy( array.data(), deviceArray,
                         sizeof( float ) * array.size(),
                         cudaMemcpyDeviceToHost ) );
 CUDA_CHECK( cudaFree( deviceArray ) );
```

#### ROCm / HIP



```
namespace {
    // Linear transformation kernel
    _global_
    void hipLinearTransform( std::size_t size, float* data, float a, float b ) {

        // Get the current index.
        const std::size_t index = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
        if( index >= size ) {
            return;
        }

        // Perform the linear transformation.
        data[ index ] = a * data[ index ] + b;
        return;
    }
} // private namespace
```

```
void linearTransform( int deviceId, std::vector< float >& data, float a,
                     float b ) {
 HIP CHECK( hipSetDevice( deviceId ) ):
  float* deviceData = nullptr:
  const std::size t dataSize = data.size() * sizeof( float );
  HIP CHECK( hipMalloc( &deviceData, dataSize ) ):
  HIP CHECK( hipMemcpy( deviceData, data.data(), dataSize,
                        hipMemcpyHostToDevice ) );
  static constexpr int blockSize = 256:
  const int numBlocks = ( data.size() + blockSize - 1 ) / blockSize;
  static constexpr int sharedMem = 0;
  static constexpr hipStream t stream = nullptr:
  hipLaunchKernelGGL( hipLinearTransform, numBlocks, blockSize, sharedMem,
                      stream, data.size(), deviceData, a, b):
  HIP CHECK( hipGetLastError() );
 HIP CHECK( hipDeviceSynchronize() );
 HIP_CHECK( hipMemcpy( data.data(), deviceData, dataSize,
                        hipMemcpyDeviceToHost ));
 HIP CHECK( hipFree( deviceData ) );
```

- Is basically a copy-paste of CUDA
  - The concepts are all the same
  - CUDA functions exist in 99% in HIP, with a slightly different name
- Support/documentation is far inferior to that of CUDA
- Code written in HIP is relatively easy to compile for both AMD and NVidia backends





#### oneAPI / SYCL



- Intel's answer to the programming language question
- Unlike CUDA, does not require an extension to the C++ language
  - Which means that it's possible to provide support for SYCL code using "a library" with any compiler
    - As long as GPU support is not required
- Very strong design-wise, built on top of the latest C++ capabilities
- Technically it's possible to compile SYCL code for Intel (CPU, GPU, FPGA), NVidia and AMD backends
  - However the AMD backend's support is a bit flakier than the others

```
// Create a vector array that would be manipulated.
                                 std::vector< float > dummyArray;
                                 static const std::size t ARRAY SIZE = 10000;
                                 dummyArray.reserve( ARRAY SIZE );
                                 static const float ARRAY ELEMENT = 3.141592f;
                                 for( std::size_t i = 0; i < ARRAY_SIZE; ++i ) {
                                    dummyArray.push back( ARRAY ELEMENT );
oneAPI
                                 // Set up a SYCL buffer on top of this STL object.
                                 cl::sycl::buffer< cl::sycl::cl float > buffer( dummyArray.begin(),
                                                                                dummvArray.end() );
                                 // Set up the SYCL queue.
                                 cl::sycl::queue queue( m deviceSelector );
                                 cl::sycl::range< 1 > workItems( buffer.get count() );
                           #ifndef TRISYCL CL SYCL HPP
                                 // Let the user know what device the calculation is running on.
                                 const cl::sycl::device& device = queue.get device();
                                 ATH MSG DEBUG( "Using device "
                                                << device.get info< cl::sycl::info::device::name >()
                                                << device.get info< cl::sycl::info::device::version >()
                           #endif // not TRISYCL CL SYCL HPP
                                 // Multiply these values using SYCL.
                                 static const float MULTIPLIER = 1.23f:
                                 queue.submit( [&]( cl::sycl::handler& handler ) {
                                          buffer.get access< cl::sycl::access::mode::read write >( handler );
                                       handler.parallel for< class SYCLMultiply >( workItems,
                                          [=]( cl::sycl::id< 1 > id ) {
                                             acc[ id ] *= MULTIPLIER;
```

#### C++ / Host Code



- One of the first idea from everybody who starts working on this type of code is to make it possible to run the exact same code on accelerators and on the host
  - And for a good number of calculations this can be a good idea, especially for making certain parts of debugging a little easier
- However many algorithms in HEP do not factorise well like this
  - Any "combinatorial" code usually has to be implemented with a different logic for CPUs (where you want to minimise FLOPs with conditionals) and GPUs (where you want to minimise conditionals, while not caring about FLOPs all that much)
  - Because of this, even when using <u>oneAPI/SYCL</u>, we still implement separate algorithms for CPUs and GPUs for most things

# Memory Management (in C++)

## **Memory Management**



- Modern CPUs have a very complicated memory management system
  - Which we can in most cases avoid knowing about
- GPUs have a complicated system of their own
  - However this we can not avoid knowing more about to use GPUs efficiently
  - Most importantly, caching is much less automated than on modern CPUs
- In some cases however you can get away with not knowing everything
  - For a performance penalty...



#### Memory (De-)Allocation



```
const std::vector< float >& input = ...;
float *a = nullptr, *b = nullptr;
CUDA CHECK( cudaMallocManaged( &a,
              input.size() * sizeof( float ) ) );
CUDA CHECK( cudaMallocManaged( &b,
              input.size() * sizeof( float ) ) );
for( std::size t i = 0; i < input.size(); ++i ) {</pre>
   a[ i ] = input[ i ];
linearTransform<<< 1, input.size() >>>( a, b, ... );
CUDA CHECK( cudaGetLastError() );
CUDA_CHECK( cudaDeviceSynchornize() );
std::vector< float > output;
output.reserve( input.size() );
for( std::size_t i = 0; i < input.size(); ++i ) {</pre>
   output.push back( b[ i ] );
CUDA CHECK( cudaFree( a ) );
CUDA CHECK( cudaFree( b ) );
```

- CUDA started by providing C-style memory allocation/deallocation functions
  - Eventually every other language followed this design as well
- Allows for a precise management of the memory resources
- But it is in stark contrast with modern
   C++ design guidelines
  - Modern C++ code should not even have <u>new/delete</u> statements in it, let alone <u>malloc(...)/free(...)</u>

## C++17 Dynamic Memory Management



- STL-friendly "adapter code" has been developed for a long time for this, using custom "container allocators"
- One important development came from NVidia, with <u>Thrust</u>
  - This was generalised to be part of <u>C++17</u>
     as the "memory resource infrastructure"
- Which is something that we have been very actively using in the <u>VecMem project</u>

#### Memory resources

| memory_resource(C++17)              | an abstract interface for classes that encapsulate memory resources (class)                                                                                |
|-------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| new_delete_resource(C++17)          | returns a static program-wide std::pmr::memory_resource that uses the global operator new and operator delete to allocate and deallocate memory (function) |
| null_memory_resource(C++17)         | returns a static $\mathtt{std}: \mathtt{pmr}: \mathtt{:memory}\_\mathtt{resource}$ that performs no allocation $(function)$                                |
| get_default_resource(C++17)         | <pre>gets the default std::pmr::memory_resource (function)</pre>                                                                                           |
| set_default_resource(C++17)         | <pre>sets the default std::pmr::memory_resource (function)</pre>                                                                                           |
| pool_options(C++17)                 | a set of constructor options for pool resources (class)                                                                                                    |
| synchronized_pool_resource(C++17)   | a thread-safe std::pmr::memory_resource for managing allocations in pools of different block sizes (class)                                                 |
| unsynchronized_pool_resource(C++17) | a thread-unsafe std::pmr::memory_resource for managing allocations in pools of different block sizes (class)                                               |
| monotonic_buffer_resource(C++17)    | a special-purpose std::pmr::memory_resource that releases the allocated memory only when the resource is destroyed (class)                                 |

#### VecMem



```
acts-project / vecmem Public
                    vecmem::sycl::copy copy(&m_queue);
P main - P 3 branches ⊙ 3 tags
                                                                vecmem::svcl::device memory resource device resource(&m queue);
                                                                vecmem::data::jagged_vector_buffer<int> output_data_device(
krasznaa Merge pull request #102 from acts-project/WindowsDPCPP-main-202...
                                                                    {0, 0, 0, 0, 0, 0}, {10, 10, 10, 10, 10}, device_resource, &m_mem)
                                                                auto input data = vecmem::get data(m vec);
                                                                 n_queue.submit([&input_data, &output_data_device](cl::sycl::handler& h) {
                                                                     n.parallel_for<class FilterKernel>(
                                                                        cl::sycl::range<2>(input_data.m_size, 5),
                                                                        [input = vecmem::get_data(input_data),
                                                                            if (id[1] >= input.m_ptr[id[0]].size()) {
CMakeLists.txt

□ LICENSE

README.md
                                                                            vecmem::jagged_device_vector<int> outputvec(output);
                                                                 vecmem::jagged_vector<int> output(&m_mem);
                                                                 copy(output data device, output);
```

XPECT\_EQ(output.size(),

static\_cast<vecmem::jagged\_vector<int>::size\_type>(6));

- As part of a larger effort in the <u>Acts</u> <u>community</u>, we are developing a library that could help with using containers of "simple" data in heterogeneous code
  - It provides a set of classes for use in host and device code, for simplifying common container access patterns
- Dedicated presentations about this project will be shown at future conferences/workshops

## **Atomic Memory Operations**



- Many multi-threaded / GPU algorithms make use of atomic variables
  - o GPU hardware allows for atomic updates to any variable in "global memory". Which is unfortunately not possible to express with the current C++ <u>std::atomic</u> interface.
  - Projects like <u>VecMem</u>, and (very importantly!) <u>Kokkos</u>, had to work around this using their own atomic types.
- One important new feature in C++20 is <u>std::atomic\_ref</u>, pushed into the standard by the Kokkos developers
  - o It provides an interface that is finally appropriate for "device code" as well
  - Future versions of CUDA/HIP/SYCL shall be able to understand this type in "device code", making code sharing between different platforms even easier

## Offloaded Code Execution

#### Formalism



- CUDA, HIP and SYCL each have their own formalism for executing a "function" on many parallel threads
  - They all need to allow a detailed specification of how to launch the function on the hardware
- Since the concept is quite the same in all cases, a number of projects were written to create uniform interfaces on top of them
  - But while this can be very useful in some situations, having to launch a GPU kernel in slightly different ways in the different languages is rarely the difficult part in porting some code...

```
dim3 numBlocks(...), blockSize(...);
uint32_t sharedMem = ...;
hipStream_t stream = ...;
hipLaunchKernelGGL( mySuperKernel, numBlocks,
    blockSize, sharedMem, stream, ...);
```

#### C++17 Parallel STL Algorithms



```
Simple examples
Here are a few simple examples to get a feel for how the C++ Parallel Algorithms work
From the early days of C++, sorting items stored in an appropriate container has been relatively easy using a single call
such as the following:
 std::sort(employees.begin(), employees.end(),
             CompareByLastName());
Assuming that the comparison class CompareByLastName is thread-safe, which is true for most comparison
functions, then parallelizing this sort is simple with C++ Parallel Algorithms, Include <execution> and add an
execution policy to the function call:
 std::sort(std::execution::par.
             employees.begin(), employees.end(),
            CompareByLastName());
Calculating the sum of all the elements in a container is also simple with the std::accumulate algorithm. Prior to
C++17, transforming the data in some way white taking the sum was somewhat awkward. For example, to compute the
average age of your employees, you might write the following code example:
     std::accumulate(employees.begin(), employees.end(), 0,
                      [](int sum, const Employee& emp){
                          return sum + emp.age();
     / employees.size():
The std::transform reduce algorithm introduced in C++17 makes it simple to parallelize this code. It also results
in cleaner code by separating the reduction operation, in this case std::plus, from the transformation operation, in
this case emp.age
 int ave age =
     std::transform reduce(std::execution::par unseq,
                            employees.begin(), employees.end(),
                            θ, std::plus<int>(),
                            [](const Employee& emp)
                                return emp.age();
     / employees.size();
```

Example:

Use the C++ Standard Execution Policies

```
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <vector>

int main()
{
    std::vector<int> data( 1000 );
    std::fill(oneapi::dpl::execution::par_unseq, data.begin(), data.end(), 42);
    return 0;
}
```

- Purely numeric calculations can even be expressed without writing any accelerator code directly
  - If your calculation can be expressed purely through STL algorithms, it is likely that it can be executed on an accelerator as well
- It very much relies on compiler support
  - Even more, while the <u>NVidia HPC SDK</u> allows you to run "more-or-less-standard" C++17 code on your GPU, <u>Intel oneAPI</u> requires you to use some Intel specific includes...
- Still, it is one of the most platform independent ways of writing accelerated code at the moment

#### C++23(?) Executors



```
executor auto ex = ...;
execute(ex, []{
          cout << "Hello, executors!\n"; });</pre>
```

- <u>P0443R14</u> proposes a unified interface for launching tasks on "some backend"
  - With a formalism a little reminiscent of SYCL
- The goal is of course to introduce a formalism that could describe CPU and accelerator multi-threading using a single interface
  - Allowing hardware makers to process code (with their own compilers, at least initially) that could look practically the same for all types of accelerators

#### Code Sharing



- Until the "device code launch" formalism is standardized, we can still organise our code in clever ways
  - As much code as possible should be delegated into "standard" functions, which kernels can call on to perform some task/calculation
  - This mainly requires a unified handling of memory in my opinion, which can already be done in clever ways
- We are currently experimenting with exactly how far we can take this, in acts-project/traccc

```
DEVICE FUNCTION
float calculateSomething( const vecmem::device_vector<const float>& vec,
                          std::size t index );
__global
void cudaKernel( vecmem::vector view<const float> vec view, ... ) {
   const std::size t i = blockIdx.x * blockDim.x + threadIdx.x;
   vecmem::device vector<const float> vec( vec view );
  float foo = calculateSomething( vec, i );
global
void hipKernel( vecmem::vector view<const float> vec view, ... ) {
   const std::size_t i = hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;
   vecmem::device vector<const float> vec( vec view );
  float foo = calculateSomething( vec, i );
class SyclKernel {
public:
  SyclKernel( vecmem vector view<const float> vec view )
      : m_vec_view( vec_view ), ... {}
  void operator()( sycl::id<1> id ) {
      vecmem::device vector<const float> vec( m vec view );
      float foo = calculateSomething( vec, id );
private:
   vecmem::vector_view<const float> m_vec_view;
```

# Developments in ATLAS

## Heterogeneous Computing and Accelerators Forum



#### Mandate for the Heterogeneous Computing and Accelerators Forum (Updated on 14.1.2021)

#### Mandate:

The future of computing hardware is uncertain, but one global trend is towards heterogeneous resources and more specifically towards "accelerators": specialized (non-CPU) hardware that enhances performance for certain computations. One of the most obvious examples is the Graphics Processing Unit (GPU), which is adept at highly parallel, low-accuracy computations. Other popular examples include FPGAs and TPUs.

Within ATLAS, discussion and overall planning of work on heterogeneous resources should be within the Heterogeneous Computing and Accelerators Forum (HCAF) which includes efforts from both offline software and TDAQ. The conveners of the forum should maintain a list of high-level milestones towards the adoption of the technologies targeted by development within ATLAS.

The forum should meet at least once a month.

#### Reporting and Liaisons:

The HCAF conveners report to the ATLAS Computing Coordinator and the TDAQ Project, TDAQ Upgrade Project, and Upgrade Project Leaders. They may appoint liaisons or contacts as needed. They should ensure ATLAS is represented in collaborative forums focused on accelerators, like the HSF accelerators forum.

#### Term of Office:

The HCAF conveners are appointed by the ATLAS Computing Coordinator and TDAQ Upgrade Project Leader with a renewable one year term normally starting October 1st. At least two conveners are appointed. Between them, responsibilities are split; however, knowledge should be shared such that they can represent each other in case one is unavailable.

- To organise/oversee the developments in this area, the Heterogeneous Computing and Accelerators Forum (HCAF) was formed
  - Built on top of the previous separate groups overseeing the offline and TDAQ efforts in this area
- It is in this group that we try to organise all of these types of developments...

## **Current Studies/Developments**



- R&D is happening in many areas of the ATLAS offline software
  - (Charged) track reconstruction
  - Calorimetry
  - Core Software
- Probably the most "public" development at the moment is happening in <u>acts-project/traccc</u>
  - Where we intend to demonstrate a "realistic setup" for performing charged track reconstruction on accelerators



## Summary



- After a calm period of homogeneous x86 computing, HEP will once again have to use a wide variety of heterogeneous hardware for a while
  - I believe there is a periodicity to this. Current accelerator technologies will inevitably become more homogeneous after a while.
- C++ will stay the "main" programming language of HEP for a long time to come
  - If things are done correctly, it shall event allow us to efficiently program all the emerging hardware variants by itself
- C++2X (C++3X?) will not have all the capabilities that the LHC experiments require by the start of HL-LHC
  - We need to make sure in the next few years that we choose a programming method that will be as close to the eventual C++ standard as possible
- There is a lot of work to be done! If you're interested, ATLAS is certainly welcoming enthusiastic software developers!



http://home.cern

# Backup

#### Previous Studies (1)



- 2012: ID Trigger prototype
  - (ATL-DAQ-PROC-2012-006)
    - Complete Level2 ID Trigger on GPU (ByteStream to tracks)
    - GPU (Tesla C2050) gave x12 speedup\* c.f. 1 CPU core
- 2015: Trigger GPU Demonstrator (ATL-COM-DAQ-2019-059)
  - Athena integration using client-server technology (APE)
  - Calo topo-clustering & cluster splitting: x3.6 speedup\* on Kepler K80 GPU
  - Pix & SCT clustering + ID seed-maker: x28 speed-up\* on Pascal GTX1080 GPU
  - Overall trigger server throughput x1.4 throughput with GPU c.f. Cpu-only
- 2019: GPU ID pattern-matching prototype (ATL-COM-DAQ-2019-173)
  - o FTK-like pattern matching on GPU





## Previous Studies (2)







#### 2020: GPU trigger algorithm integration in AthenaMT

- AthenaMT integration using acceleration service
- ID seed-maker algorithm implemented on GPU
- Calorimeter reconstruction under development

#### Acts

- Seed finding implemented using both CUDA and SYCL
  - https://github.com/acts-project/acts/tree/master/Plugins/Cuda
  - https://github.com/acts-project/acts/tree/master/Plugins/Sycl
- Kalman filter demonstrator

#### FCS: Parametrized Calorimeter Simulation

- First developed in CUDA, but then used as a software portability testbed
- ATL-COM-SOFT-2020-069
- oneMKL cuRAND Support Development (<u>GitHub Code</u>)
- Studies with GNNs for tracking (<u>presentation</u>)