I will be trying to locate assembly code that uses SIMD vectorization.
I chose ffmpeg because I am familiar with video editing, which forced me to learn about codecs, encoding, decoding, containers and so on. A long time ago I needed to edit a video in Sony Vegas Pro 10, but I was not able to. The file was 3gp, or maybe something else, shot on an early mobile (not smartphone!) camera. I was able to play it, but not to edit it, because in video editing you need both an encoder and a decoder. It is even better to have a big pack of all the codecs, so you never have problems with any type of video. So I got the K-Lite Codec Pack and never had problems again. The pack included ffdshow (with a pesky tray icon that scared me, as I thought I had got a virus). ffdshow uses ffmpeg, among other things. That's how I became familiar with ffmpeg.
It is no surprise that video software wants to be efficient, so it uses a bunch of SIMD optimizations, especially when it has to run a lot of operations over video data, as in decoding.
To find x86 SIMD vector instructions, we need a list of instructions. There is a wiki page for this.
Nice article on the topic.
About modern tools
Here is the ffmpeg GitHub page. Did you know that if you press dot '.' while on a GitHub page, it opens a full VS Code editor in the browser? You have to be logged in.
Even better, you can now use CTRL+SHIFT+F to search across all files.
Let's now try to search for something vector related for x86:
VBROADCASTSS Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of a XMM or YMM vector register. (AVX)
3 matches in libavcodec/x86/celt_pvq_search.asm
; Merge parallel maximums final round (1 vs 1)
shufps xm0, xm3, xm3, q1111 ; m0 = m3[1] = p[1]
cmpss xm0, xm3, 5 ; m0 = !(m0 >= m3) = !( p[1] >= p[0] )
pshufd xm2, xm1, q1111
PBLENDVB xm1, xm2, xm0
movd dword r4d, xm1 ; zero extends to the rest of r4q
VBROADCASTSS m3, [tmpX + r4]
%{1}ps m7, m3 ; Sxy += X[max_idx]
VBROADCASTSS m5, [tmpY + r4]
%{1}ps m6, m5 ; Syy += Y[max_idx]
; We have to update a single element in Y[i]
; However writing 4 bytes and then doing 16 byte load in the inner loop
; could cause a stall due to breaking write forwarding.
VPBROADCASTD m1, xm1
pcmpeqd m1, m1, m4 ; exactly 1 element matches max_idx and this finds it
and r4d, ~(mmsize-1) ; align address down, so the value pointed by max_idx is inside a mmsize load
movaps m5, [tmpY + r4] ; m5 = Y[y3...ym...y0]
andps m1, mm_const_float_1 ; m1 = [ 0...1.0...0]
%{1}ps m5, m1 ; m5 = Y[y3...ym...y0] +/- [0...1.0...0]
movaps [tmpY + r4], m5 ; Y[max_idx] +-= 1.0;
%endmacro
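CELT is the Constrained-Energy Lapped Transform (audio) codec. Judging by the inline comments, VBROADCASTSS here copies the single value X[max_idx] (and then Y[max_idx]) into every lane of a vector register, so that it can be folded into the running sums Sxy and Syy with one vector instruction.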
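To make the listing above easier to follow, here is a rough scalar model in C of two things it does: the broadcast, and the trick of updating one element of Y by adding a one-hot vector to a whole aligned block. This is only a sketch under my own assumptions: the names broadcast_ss and add_one_at are hypothetical, I assume four float lanes (a 128-bit register), I index by element rather than by byte offset as the assembly does, and I assume m4 holds per-lane index values, which the excerpt does not show. It also models only the "add" flavour of the %{1}ps macro parameter.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 4 /* assumption: one 128-bit register = four 32-bit floats */

/* Model of VBROADCASTSS: copy one float from memory into every lane. */
static void broadcast_ss(float dst[LANES], const float *src)
{
    for (int i = 0; i < LANES; i++)
        dst[i] = *src;
}

/* Model of the pcmpeqd / andps / addps sequence: add 1.0 to y[max_idx]
 * by adding a one-hot vector to the whole aligned block around it. */
static void add_one_at(float *y, int max_idx)
{
    assert(max_idx >= 0);
    int base = max_idx & ~(LANES - 1); /* align down, like "and r4d, ~(mmsize-1)" */

    for (int lane = 0; lane < LANES; lane++) {
        /* pcmpeqd: all-ones where the lane index matches, zero elsewhere */
        uint32_t mask = (base + lane == max_idx) ? 0xFFFFFFFFu : 0u;

        /* andps: keep the bit pattern of 1.0f only in the matching lane */
        float one = 1.0f, addend;
        uint32_t bits;
        memcpy(&bits, &one, sizeof bits);
        bits &= mask;
        memcpy(&addend, &bits, sizeof addend);

        y[base + lane] += addend; /* addps on the full block */
    }
}
```

Calling add_one_at(y, 6) on a zeroed array of eight floats reads and writes y[4] through y[7], but only y[6] actually changes. As the comment in the assembly says, working on the full block avoids a narrow 4-byte store followed by a wide 16-byte load of the same memory.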
ARM
FMUL Vd.<T>, Vn.<T>, Vm.<T> Floating-point multiply (vector), where <T> is 2S, 4S or 2D.
33 results
example: libavcodec/aarch64/aacpsdsp_neon.S
function ff_ps_add_squares_neon, export=1
1: ld1 {v0.4S,v1.4S}, [x1], #32
fmul v0.4S, v0.4S, v0.4S
fmul v1.4S, v1.4S, v1.4S
faddp v2.4S, v0.4S, v1.4S
ld1 {v3.4S}, [x0]
fadd v3.4S, v3.4S, v2.4S
st1 {v3.4S}, [x0], #16
subs w2, w2, #4
b.gt 1b
ret
endfunc
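I assume dsp stands for digital signal processing. There are also files such as h264dsp_neon.S, which are probably responsible for the digital signal processing in the H.264 video codec.
Judging by the name and the code, the add squares function squares its input and accumulates the result. The two fmul instructions multiply each register by itself, four floats at a time, faddp sums neighbouring pairs of those squares, and fadd adds the sums to the values already stored at x0. Multiplying a value by itself is simply their fast way to calculate squares.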
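As a sanity check on my reading of the NEON loop, here is a scalar C sketch of what I believe it computes. The function name add_squares_ref and the parameter names are my own. I am assuming that x0 is the destination, x1 is the source laid out as pairs of floats, and w2 is the number of outputs. This is not ffmpeg's actual C fallback.

```c
#include <assert.h>

/* Scalar model of the loop: dst[i] += src[2i]^2 + src[2i+1]^2.
 * Assumption: src holds n pairs of floats, i.e. 2*n values in total. */
static void add_squares_ref(float *dst, const float *src, int n)
{
    /* The NEON loop produces four outputs per iteration ("subs w2, w2, #4"),
     * so this model only accepts a positive multiple of four. */
    assert(n > 0 && n % 4 == 0);

    for (int i = 0; i < n; i++) {
        float a = src[2 * i];     /* even element of the pair */
        float b = src[2 * i + 1]; /* odd element of the pair  */
        dst[i] += a * a + b * b;  /* fmul + faddp + fadd      */
    }
}
```

With src = {1, 2, 3, 4, 5, 6, 7, 8} and dst starting at {1, 1, 1, 1}, the first output becomes 1 + 1*1 + 2*2 = 6. The NEON version does the same work, but handles eight inputs and four outputs per loop iteration.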
Conclusion
ffmpeg supports a lot of architectures with efficient optimizations. The GitHub repository is very active, and a lot of hero developers contribute. Some of the inline comments are quite amusing:
; TODO:
; carry over registers from smaller transforms to save on ~8 loads/stores
; check if vinsertf could be faster than verpm2f128 for duplication
; even faster FFT8 (current one is very #instructions optimized)
; replace some xors with blends + addsubs?
; replace some shuffles with vblends?
; avx512 split-radix
ffmpeg relies on different optimizations, including broad use of SIMD. It was not hard at all to find quite a lot of different implementations. It makes sense to use hardware acceleration for video processing, because video data is very repetitive.