For the past two years, I worked at Intel Corporation as a compiler engineer. My major project during that time was designing and implementing the x86 hardware intrinsics for .NET Core. This project started with the design proposal
Motivation
Computer science has been the fastest-developing field of the last 50 years and has made many fantasies come true. But why? The root reason is that our computer hardware and software have kept getting faster at an incredible rate (a.k.a., Moore’s Law
Intrinsics are special functions that you cannot implement yourself in the programming language you are using. Hardware intrinsic functions are special functions that the compiler converts directly into a single hardware instruction (or a few), so they expose the underlying instruction set architecture without abstraction overhead. Intrinsic functions integrate cleanly with other language features because they are just “functions”. For example, intrinsics operate on variables rather than on registers, which assembly languages expose but higher-level languages are not aware of. Hardware intrinsics have been a feature of native languages (e.g., C/C++) for a long time. Although intrinsic functions significantly improve the productivity of SIMD (or other hardware-dependent) programming, certain inherent drawbacks of native languages (e.g., manual memory management) still make programming difficult. Managed runtimes such as .NET Core are designed to improve programmer productivity and security by providing higher abstraction layers, type safety, and automatic memory management. This new feature, hardware intrinsics in .NET Core 3.0, combines the advantages of SIMD programming and managed languages (C#).
New Namespaces and Classes
As mentioned above, hardware intrinsics will be available as a formal, built-in feature in .NET Core 3.0, which exposes new namespaces, SIMD types, and classes representing different Instruction Set Architectures (ISAs). The top-level namespaces are:
- System.Runtime.Intrinsics: contains SIMD types that abstract the underlying SIMD registers. Vector128<T> and Vector256<T>, where T can be instantiated with any C# numeric type, correspond to the XMM and YMM registers on Intel ISAs, respectively. This namespace also contains platform-agnostic convenience functions that provide common vector operations, e.g., initializing a vector with specified elements.
- System.Runtime.Intrinsics.X86: contains classes representing different Intel ISAs, spanning SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, LZCNT, POPCNT, BMI1, BMI2, PCLMULQDQ, and AES. For example, class Avx has many static methods, each of which maps to an AVX instruction. You can let the compiler generate vaddps ymm, ymm, ymm by calling Avx.Add(vector1, vector2), where vector1 and vector2 are instances of Vector256<float>.

In particular, each class contains intrinsic methods that operate over scalar or SIMD data and has a boolean property called IsSupported, which developers can use to check the underlying hardware support. So, a hardware-accelerated algorithm in .NET Core 3.0 usually has a top-level structure like the one below:
if (Avx2.IsSupported)
{
    // The AVX/AVX2 optimized implementation for Intel Haswell or newer CPUs
}
else if (Sse41.IsSupported)
{
    // The SSE optimized implementation for older x86 CPUs
}
else if (Arm.Arm64.Simd.IsSupported)
{
    // The NEON optimized implementation for ARM64 CPUs
}
else
{
    // Scalar or software-fallback implementation
}
In this example, you may be curious about the Arm.Arm64 path. Yes, the .NET Core hardware intrinsic system also has an ARM counterpart (under the namespace System.Runtime.Intrinsics.Arm.Arm64), originally designed and implemented by QCOM engineers. However, the progress of the ARM side is quite different from x86 in .NET Core 3.0, and its availability depends on the release of the ARM64 version of .NET Core. At the moment, I am not sure about the status of ARM64 support in .NET Core 3.0, so please watch for Microsoft’s official announcement if you want this feature on ARM.
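To make the IsSupported pattern concrete, here is a minimal sketch that uses a scalar (non-SIMD) intrinsic with a software fallback. The class and method names (BitCounter, CountBits) are made up for illustration; Popcnt.PopCount and Popcnt.IsSupported are the only intrinsic APIs used.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class BitCounter
{
    // Hypothetical helper: counts the set bits in value,
    // using the POPCNT instruction when the CPU supports it.
    public static uint CountBits(uint value)
    {
        if (Popcnt.IsSupported)
        {
            // Hardware path: maps to the popcnt instruction.
            return Popcnt.PopCount(value);
        }

        // Software fallback: clear the lowest set bit until none remain.
        uint count = 0;
        while (value != 0)
        {
            value &= value - 1;
            count++;
        }
        return count;
    }

    public static void Main()
    {
        Console.WriteLine(CountBits(0b1011_0000u)); // prints 3
    }
}
```

Both branches return the same result, so callers never need to know which path ran. That is the property the top-level structure above is meant to guarantee.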
SIMD Programming in .NET Core
Although the hardware intrinsic system is not only about SIMD, the SIMD intrinsics are the most exciting part. So, I would like to give a simple SIMD example to demonstrate how to use hardware intrinsics in your C# programs. I will keep it as simple as possible. If you are interested in deeper knowledge about SIMD, I will dive into it in the next blog post with lovely C# code.
Let’s create a console application with dotnet new and copy the code below into the Program.cs file. You do not need to install any NuGet package because hardware intrinsics are an official feature of the .NET Core 3.0 core library. So, please make sure you have the .NET Core SDK 3.0 installed (until .NET Core 3.0 is formally released, I recommend using the daily build).
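using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Adds the first n elements of a and b, 8 floats per iteration.
// Demo only: n is assumed to be a multiple of Vector256<float>.Count (8)
// and no larger than either array; otherwise the last iteration would
// read and write past the end of the arrays.
static unsafe float[] SimdAdd(float[] a, float[] b, int n)
{
    float[] result = new float[n];
    // Pin the arrays so the GC cannot move them while raw pointers are in use.
    fixed (float* ptr_a = a, ptr_b = b, ptr_res = result)
    {
        for (int i = 0; i < n; i += Vector256<float>.Count)
        {
            Vector256<float> v1 = Avx.LoadVector256(ptr_a + i);
            Vector256<float> v2 = Avx.LoadVector256(ptr_b + i);
            Vector256<float> res = Avx.Add(v1, v2);
            Avx.Store(ptr_res + i, res);
        }
    }
    return result;
}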
This function adds two arrays and returns the sum in a new array. It is not production code; it is simplified for demonstration only. First, we need two using directives. As mentioned above, System.Runtime.Intrinsics provides Vector256<float> in this program, and System.Runtime.Intrinsics.X86 provides the Avx intrinsics (e.g., Avx.LoadVector256(ptr_a + i)). Second, the function SimdAdd has to be defined with the unsafe keyword because Avx.LoadVector256 and Avx.Store operate over pointers to read the input data and write the computation result back to memory. Compiling unsafe code also requires enabling it in the project, e.g., by setting AllowUnsafeBlocks to true in the .csproj file. Overall, there are two kinds of hardware intrinsics under the System.Runtime.Intrinsics.X86 namespace:
- Computing intrinsics: this is the major group of intrinsic APIs. Usually, they take parameters of SIMD types (Vector128<T>, Vector256<T>, etc.) and/or scalar numeric types (int, float, ushort, etc.) and return a computed result. Avx.Add is a typical example of this group.
- Memory-access intrinsics: SIMD computing intrinsics accept input data that is already in vector variables. However, data is usually organized in memory or files in its own format and types rather than as Vector128<T> and Vector256<T>. So, we need memory-access intrinsics to move data between memory and vector variables. The most common memory-access intrinsics are Sse.LoadVector128, Sse2.LoadVector128, Avx.LoadVector256, and Sse.Store, Sse2.Store, Avx.Store.
The main functionality of SimdAdd is fulfilled by Avx.Add, which takes two vectors of float (each vector contains 8 float numbers), adds them 8 lanes at a time (256-bit / 32-bit == 8), and puts the sum vector in a new Vector256<float> variable (res).
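SimdAdd as written only works when n is a multiple of 8. One common way to lift that restriction is to process full vectors first and finish the remaining elements with a scalar loop. The following is a sketch of that idea, not the author's code: the names SafeSimdDemo and SimdAddAnyLength are hypothetical, and the project is assumed to allow unsafe code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class SafeSimdDemo
{
    // Hypothetical variant of SimdAdd that accepts arrays of any length.
    public static unsafe float[] SimdAddAnyLength(float[] a, float[] b)
    {
        int n = Math.Min(a.Length, b.Length);
        float[] result = new float[n];
        int i = 0;

        if (Avx.IsSupported)
        {
            int step = Vector256<float>.Count; // 8 floats per 256-bit vector
            int lastBlock = n - (n % step);    // end of the last full vector

            fixed (float* pa = a, pb = b, pr = result)
            {
                for (; i < lastBlock; i += step)
                {
                    Vector256<float> v1 = Avx.LoadVector256(pa + i);
                    Vector256<float> v2 = Avx.LoadVector256(pb + i);
                    Avx.Store(pr + i, Avx.Add(v1, v2));
                }
            }
        }

        // Scalar loop: handles the tail, or everything when AVX is unavailable.
        for (; i < n; i++)
        {
            result[i] = a[i] + b[i];
        }
        return result;
    }

    public static void Main()
    {
        // 19 elements: two full vectors plus a 3-element tail.
        float[] a = new float[19];
        float[] b = new float[19];
        for (int k = 0; k < a.Length; k++) { a[k] = k; b[k] = 2 * k; }

        float[] sum = SimdAddAnyLength(a, b);
        Console.WriteLine(sum[18]); // 18 + 36, prints 54
    }
}
```

Because the IsSupported check lives inside the function, this variant also runs on machines without AVX; it simply falls through to the scalar loop.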
Finally, prepare two input arrays and call SimdAdd from another function (e.g., Main) to see the result:
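// Two 256-element inputs; the length must be a multiple of 8 for SimdAdd.
float[] a = new float[256];
float[] b = new float[256];
float[] sum = null;
if (Avx.IsSupported)
{
    sum = SimdAdd(a, b, 256);
}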
Note: always check the hardware capability (via IsSupported) before calling any platform-specific intrinsic. Executing a hardware intrinsic on hardware that does not support it throws System.PlatformNotSupportedException.
> dotnet run
Unhandled Exception: System.PlatformNotSupportedException: Operation is not supported on this platform.
at System.Runtime.Intrinsics.X86.Avx.LoadVector256(Single* address)
at IntrinsicDemo.IntrinsicDemo.SimdAdd(Single[] a, Single[] b, Int32 n) in /Users/fiigii/workspace/test/IntrinsicDemo/Program.cs:line 30
at IntrinsicDemo.IntrinsicDemo.Main(String[] args) in /Users/fiigii/workspace/test/IntrinsicDemo/Program.cs:line 17
You may wonder how I got a CPU old enough to lack AVX instructions in order to show the message above. Actually, while developing this feature in the JIT compiler, we considered that hardware-specific programs are difficult to test on all hardware. So, we provide several environment variables that save developers the money they would otherwise spend on old hardware for testing :). For example, you can set COMPlus_EnableAVX=0 to disable AVX (and newer ISAs that depend on AVX) in your .NET Core process. Then, the code path for older CPUs can be tested on new machines. .NET Core 3.0 has one such environment variable for each x86 ISA (e.g., COMPlus_EnableSSE41, COMPlus_EnableAVX2, COMPlus_EnableFMA, etc.).
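A quick way to see the effect of these variables is to print the IsSupported flags. The small program below is an illustrative sketch (the class and method names IsaReport and Describe are made up). Run it once normally and once with COMPlus_EnableAVX=0 set in the environment; going by the description above, AVX and the ISAs that depend on it should then report False.

```csharp
using System;
using System.Runtime.Intrinsics.X86;

static class IsaReport
{
    // Hypothetical helper: one line per ISA with its IsSupported flag.
    public static string Describe()
    {
        return string.Join(Environment.NewLine, new[]
        {
            $"SSE4.1: {Sse41.IsSupported}",
            $"AVX:    {Avx.IsSupported}",
            $"AVX2:   {Avx2.IsSupported}",
            $"FMA:    {Fma.IsSupported}",
        });
    }

    public static void Main()
    {
        Console.WriteLine(Describe());
    }
}
```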
Additionally, you may find the SimdAdd code too verbose, since every intrinsic call has a leading ISA name (Avx.Add). Fortunately, this verbosity can be avoided with C# using static:
using static System.Runtime.Intrinsics.X86.Avx;
...
Vector256<float> v1 = LoadVector256(ptr_a + i);
Vector256<float> v2 = LoadVector256(ptr_b + i);
Vector256<float> res = Add(v1, v2);
Store(ptr_res + i, res);
We intentionally designed every intrinsic API to work with using static without conflicts, even if the program mixes intrinsics from different ISAs.
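As a sketch of what mixing ISAs looks like in practice, the program below imports both Sse and Avx with using static and lets overload resolution pick the intrinsic from the vector type. The class name UsingStaticDemo is hypothetical. The example uses Vector128.Create and Vector256.Create instead of pointers, so it needs no unsafe code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using static System.Runtime.Intrinsics.X86.Sse;
using static System.Runtime.Intrinsics.X86.Avx;

static class UsingStaticDemo
{
    public static void Main()
    {
        // IsSupported exists on every ISA class, so it is written qualified here.
        if (Sse.IsSupported)
        {
            Vector128<float> x = Vector128.Create(1f, 2f, 3f, 4f);
            Vector128<float> y = Vector128.Create(10f); // all lanes = 10
            Vector128<float> s = Add(x, y);             // resolves to Sse.Add
            Console.WriteLine(s.GetElement(3));         // 4 + 10
        }

        if (Avx.IsSupported)
        {
            Vector256<float> x = Vector256.Create(1f);  // all lanes = 1
            Vector256<float> y = Vector256.Create(2f);  // all lanes = 2
            Vector256<float> s = Add(x, y);             // resolves to Avx.Add
            Console.WriteLine(s.GetElement(0));         // 1 + 2
        }
    }
}
```

The two Add calls never collide because their parameter types differ (Vector128<float> versus Vector256<float>), which is what makes the unqualified style practical.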
Further Studying
The hardware intrinsic system in .NET Core 3.0 is not just a bunch of new APIs; it also opens the door to low-level programming and high-performance computing for .NET Core programmers. So, I plan to write a series of blog posts introducing the intrinsic APIs, SIMD programming in C#, and JIT compiler optimizations. If you do not have the patience to wait for the next article, no worries: here are some materials for further study:
- API documentation: There is no formal documentation for hardware intrinsics before the .NET Core 3.0 release. But everything in .NET Core is open source, so you can look at the source code of these APIs, where each API has comments naming the corresponding C/C++ intrinsic and x86 assembly instruction. You can then reuse the C/C++ intrinsic API documentation provided by Intel, which is pretty straightforward for developers with C/C++ experience.
- Performance analysis: Improving performance is the main purpose of using hardware intrinsics. BenchmarkDotNet is an open-source benchmarking library for .NET applications, and it is easy to use. Meanwhile, I really like Intel VTune, which provides more sophisticated runtime hardware information and is better for optimization, although understanding VTune output requires a bit of hardware knowledge.
- More hardware intrinsic examples: the .NET Core community has leveraged intrinsics to optimize the library code in the CoreFX and CoreCLR repos, and watching related issues and PRs on GitHub is a good way to learn. For example, see Ben Adams’s PRs that vectorize IndexOf. There are also some individual projects that use hardware intrinsics heavily: SimdJsonSharp is a C# port of the SimdJson algorithm that accelerates JSON parsing using AVX2 instructions, and PacketTracer is an SoA-vectorized ray tracer that is used to investigate .NET Core SIMD code generation quality.