Loran L
Loran L's Blog



Assignment2: Finding SIMD in FFMPEG Source



Loran L
·Dec 12, 2021·

I will be trying to locate assembly code that uses SIMD vectorization.

I chose ffmpeg because I am familiar with video editing. I had to learn about codecs, encoding, decoding, containers and so on. A long time ago I needed to edit a video in Sony Vegas Pro 10, but I was not able to (it was 3gp, or maybe something else, shot on an early mobile phone camera, not a smartphone). I was able to play it, but not to edit it, because in video editing you need both an encoder and a decoder. It is even better to have a big pack of all the codecs, so you never have problems with any type of video. So I got the K-Lite Codec Pack and never had problems again. This pack included ffdshow (with a pesky tray icon that scared me, as I thought I had got a virus). ffdshow uses ffmpeg, among other things. That's how I became familiar with ffmpeg.

It is no surprise that video software wants to be efficient, so it uses a bunch of SIMD optimizations, especially when it has to do a lot of operations on video data, such as decoding.

To find x86 SIMD vector instructions, we first need a list of them. There is a wiki page for this.

Nice article on the topic.

About modern tools

Here is the ffmpeg GitHub page. Did you know that if you press dot '.' while on a GitHub page, it opens a full VS Code editor in the browser? You have to be logged in.


Even better, now you can use CTRL+SHIFT+F to search across all files.
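The search itself is simple enough to sketch in C. This is only an illustration of what a whole-word, case-insensitive search for a mnemonic does, not how the editor implements it; the function name `count_matching_lines` and its helpers are made up for this example.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: characters that can be part of an identifier, so that
 * a search for "fmul" does not also match "fmulx" or "my_fmul". */
static int is_ident_char(int c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Hypothetical helper: case-insensitive comparison of the first n chars. */
static int equal_nocase(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    return 1;
}

/* Counts the lines of `text` that contain `mnemonic` as a whole word. */
size_t count_matching_lines(const char *text, const char *mnemonic)
{
    size_t len = strlen(mnemonic), count = 0;
    const char *line = text;

    if (len == 0)
        return 0;

    while (*line) {
        const char *end = strchr(line, '\n');
        int found = 0;

        if (!end)
            end = line + strlen(line);

        /* Try every start position where the mnemonic still fits. */
        for (const char *p = line; !found && p + len <= end; p++) {
            int left_ok  = (p == line) || !is_ident_char(p[-1]);
            int right_ok = (p + len == end) || !is_ident_char(p[len]);
            if (left_ok && right_ok && equal_nocase(p, mnemonic, len))
                found = 1;
        }
        count += found;
        line = *end ? end + 1 : end;   /* step over the newline, if any */
    }
    return count;
}
```

Called with the contents of one source file and a mnemonic such as "VBROADCASTSS", it returns how many lines mention that instruction, which is roughly the number the editor reports per file.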

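VBROADCASTSS: copy a 32-bit single-precision operand to all elements of an XMM or YMM vector register (AVX). The wiki lists it together with VBROADCASTSD and VBROADCASTF128, which do the same for 64-bit and 128-bit operands.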

3 matches in libavcodec/x86/celt_pvq_search.asm

; Merge parallel maximums final round (1 vs 1)
    shufps        xm0, xm3, xm3, q1111  ; m0 = m3[1] = p[1]
    cmpss         xm0, xm3, 5           ; m0 = !(m0 >= m3) = !( p[1] >= p[0] )

    pshufd        xm2, xm1, q1111
    PBLENDVB      xm1, xm2, xm0

    movd    dword r4d, xm1          ; zero extends to the rest of r4q

    VBROADCASTSS   m3, [tmpX + r4]
    %{1}ps         m7, m3           ; Sxy += X[max_idx]

    VBROADCASTSS   m5, [tmpY + r4]
    %{1}ps         m6, m5           ; Syy += Y[max_idx]

    ; We have to update a single element in Y[i]
    ; However writing 4 bytes and then doing 16 byte load in the inner loop
    ; could cause a stall due to breaking write forwarding.
    VPBROADCASTD   m1, xm1
    pcmpeqd        m1, m1, m4           ; exactly 1 element matches max_idx and this finds it

    and           r4d, ~(mmsize-1)      ; align address down, so the value pointed by max_idx is inside a mmsize load
    movaps         m5, [tmpY + r4]      ; m5 = Y[y3...ym...y0]
    andps          m1, mm_const_float_1 ; m1 =  [ 0...1.0...0]
    %{1}ps         m5, m1               ; m5 = Y[y3...ym...y0] +/- [0...1.0...0]
    movaps [tmpY + r4], m5              ; Y[max_idx] +-= 1.0;

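CELT is the Constrained-Energy Lapped Transform audio codec. Going by the inline comments, VBROADCASTSS is used here to copy the single value X[max_idx] (and then Y[max_idx]) into every lane of a vector register, so it can be added to or subtracted from the running sums Sxy and Syy.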
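To make the excerpt easier to follow, here is a minimal C sketch that emulates its two ideas with plain arrays: broadcasting one float into every lane, and updating a single element by adding a one-hot mask to the whole aligned block. It assumes four lanes (the 128-bit case), only the "add" variant of the macro, an element index instead of a byte offset, and a Y array whose length is a multiple of four. The names `broadcast` and `pvq_update_sketch` are made up for illustration and do not come from ffmpeg.

```c
#include <stddef.h>

#define LANES 4  /* assumption: one 128-bit register = four 32-bit floats */

/* Emulates VBROADCASTSS: copy one float into every lane. */
static void broadcast(float lanes[LANES], float value)
{
    for (size_t i = 0; i < LANES; i++)
        lanes[i] = value;
}

/* Hypothetical scalar model of the excerpt (add variant only). */
void pvq_update_sketch(float sxy[LANES], float syy[LANES],
                       const float *x, float *y, size_t max_idx)
{
    float bx[LANES], by[LANES], mask[LANES];

    broadcast(bx, x[max_idx]);            /* VBROADCASTSS m3, [tmpX + r4] */
    broadcast(by, y[max_idx]);            /* VBROADCASTSS m5, [tmpY + r4] */
    for (size_t i = 0; i < LANES; i++) {
        sxy[i] += bx[i];                  /* Sxy += X[max_idx] */
        syy[i] += by[i];                  /* Syy += Y[max_idx] */
    }

    /* Align the index down so that max_idx falls inside one full block. */
    size_t block = max_idx & ~(size_t)(LANES - 1);

    /* One-hot mask: 1.0 in the lane that holds max_idx, 0.0 elsewhere. */
    for (size_t i = 0; i < LANES; i++)
        mask[i] = (block + i == max_idx) ? 1.0f : 0.0f;

    /* Rewrite the whole block instead of storing a single float. */
    for (size_t i = 0; i < LANES; i++)
        y[block + i] += mask[i];
}
```

Rewriting the whole block looks wasteful in C, but as the comment in the assembly explains, a 4-byte store followed by a 16-byte load of the same memory could stall, so the vector code prefers a full-width load, add and store.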


AArch64 (NEON) instruction set

FMUL Vd.<T>, Vn.<T>, Vm.<T>: Floating-point multiply (vector), where <T> is 2S, 4S or 2D.

33 results

example: libavcodec/aarch64/aacpsdsp_neon.S

function ff_ps_add_squares_neon, export=1
1:      ld1         {v0.4S,v1.4S}, [x1], #32
        fmul        v0.4S, v0.4S, v0.4S
        fmul        v1.4S, v1.4S, v1.4S
        faddp       v2.4S, v0.4S, v1.4S
        ld1         {v3.4S}, [x0]
        fadd        v3.4S, v3.4S, v2.4S
        st1         {v3.4S}, [x0], #16
        subs        w2, w2, #4
        b.gt        1b

I assume dsp stands for digital signal processing. There are also files such as h264dsp_neon.S, which are probably responsible for the digital signal processing of the H.264 video codec.

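Judging by the name, the add squares function adds up squared values. Reading the code, each loop iteration loads eight floats from the source, squares them with fmul (each register is multiplied by itself), adds neighbouring pairs with faddp, and adds the four results to four floats in the destination. The vector floating-point multiply is how they calculate four squares at once.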
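Here is a hedged C sketch of what the loop above appears to compute, written from reading the assembly rather than taken from ffmpeg. Treating each pair of source floats as one two-component sample is my assumption, and both function names are made up. The second function mirrors the NEON loop step by step and, like it, expects n to be a positive multiple of 4.

```c
#include <stddef.h>

/* Plain scalar version: add the sum of squares of each pair to dst[i]. */
void add_squares_sketch(float *dst, const float (*src)[2], size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += src[i][0] * src[i][0] + src[i][1] * src[i][1];
}

/* Lane-by-lane model of the NEON loop: 8 inputs and 4 outputs per pass. */
void add_squares_by_four(float *dst, const float *src, int n)
{
    do {
        float v0[4], v1[4], v2[4];

        for (int i = 0; i < 4; i++) {
            v0[i] = src[i] * src[i];          /* ld1 + fmul v0, v0, v0 */
            v1[i] = src[4 + i] * src[4 + i];  /* ld1 + fmul v1, v1, v1 */
        }

        /* faddp: add adjacent pairs, v0 pairs first, then v1 pairs. */
        v2[0] = v0[0] + v0[1];
        v2[1] = v0[2] + v0[3];
        v2[2] = v1[0] + v1[1];
        v2[3] = v1[2] + v1[3];

        for (int i = 0; i < 4; i++)
            dst[i] += v2[i];                  /* ld1, fadd, st1 on dst */

        src += 8;                             /* post-increment by 32 bytes */
        dst += 4;                             /* post-increment by 16 bytes */
        n   -= 4;                             /* subs w2, w2, #4 */
    } while (n > 0);                          /* b.gt 1b */
}
```

Both functions give the same result for the same input; the second one just makes it visible that one pass through the loop handles four outputs at once.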


ffmpeg supports a lot of architectures with efficient optimizations. The GitHub repository is very active, and a lot of hero developers contribute. Some of the inline comments are quite amusing:

;       carry over registers from smaller transforms to save on ~8 loads/stores
;       check if vinsertf could be faster than verpm2f128 for duplication
;       even faster FFT8 (current one is very #instructions optimized)
;       replace some xors with blends + addsubs?
;       replace some shuffles with vblends?
;       avx512 split-radix

ffmpeg relies on different optimizations, including broad use of SIMD. It was not hard at all to find quite a lot of different implementations. It makes sense to use SIMD instructions for audio and video processing, because the same operations are repeated over large amounts of data.
