The Optimizations in Erlang/OTP 27

April 23, 2024 · by Björn Gustavsson

This post explores the new optimizations for record updates as well as some of the other improvements. It also gives a brief historic overview of recent optimizations leading up to Erlang/OTP 27.

A brief history of recent optimizations #

The modern history of optimizations for Erlang begins in January 2018. We had realized that we had reached the limit of the optimizations that were possible working on BEAM code in the Erlang compiler.

Erlang/OTP 22 introduced a new SSA-based intermediate representation in the compiler. Read the full story in SSA History.
Erlang/OTP 24 introduced the JIT (Just In Time compiler), which improved performance by emitting native code for BEAM instructions at load-time.
Erlang/OTP 25 introduced type-based optimization in the JIT, which allowed the Erlang compiler to pass type information to the JIT to help it emit better native code. While that improved the native code emitted by the JIT, limitations in both the compiler and the JIT prevented the JIT to take full advantage of the type information.
Erlang/OTP 26 improved the type-based optimizations. The most noticeable performance improvements were matching and construction of binaries using the bit syntax. Those improvements, combined with changes to the base64 module itself, made encoding to Base64 about 4 times as fast and decoding from Base64 more than 3 times as fast.

What to expect of the JIT in Erlang/OTP 27 #

The major compiler and JIT improvement in Erlang/OTP 27 is optimization of record operations, but there are also many smaller optimizations that make the code smaller and/or faster.

Please try this at home! #

While this blog post will show many examples of generated code, I have attempted to explain the optimizations in English as well. Feel free to skip the code examples.

On the other hand, if you want more code examples…

To examine the native code for loaded modules, start the runtime system like this:

erl +JDdump true

The native code for all modules that are loaded will be dumped to files with the extension .asm.

To examine the BEAM code for a module, use the -S option when compiling. For example:

erlc -S base64.erl

A simple record optimization #

To get started, let’s look at a simple record optimization that was not done in Erlang/OTP 26 and earlier. Suppose we have this module:

-record(foo, {a,b,c,d,e}).

update(N) ->
    R0 = #foo{},
    R1 = R0#foo{a=N},
    R2 = R1#foo{b=2},
    R2#foo{c=3}.

Here is BEAM code for the record operations:

    {update_record,{atom,reuse},
                   6,
                   {literal,{foo,undefined,undefined,undefined,undefined,
                                 undefined}},
                   {x,0},
                   {list,[2,{x,0}]}}.
    {update_record,{atom,copy},6,{x,0},{x,0},{list,[3,{integer,2}]}}.
    {update_record,{atom,copy},6,{x,0},{x,0},{list,[4,{integer,3}]}}.

That is, all three record update operations have been retained as separate update_record instructions. Each operation creates a new record by copying the unchanged parts of the record and filling in the new values in the correct position.

The compiler in Erlang/OTP 27 will essentially rewrite update/1 to:

update(N) ->
    #foo{a=N,b=2,c=3}.

which will produce the following BEAM code for the record creation:

    {put_tuple2,{x,0},
                {list,[{atom,foo},
                       {x,0},
                       {integer,2},
                       {integer,3},
                       {atom,undefined},
                       {atom,undefined}]}}.

Those optimizations were implemented in the following pull requests:

Updating records in place #

To explore the more sophisticated record optimization introduced in Erlang/OTP 27, consider this example:

-module(count1).
-export([count/1]).

-record(s, {atoms=0,other=0}).

count(L) ->
    count(L, #s{}).

count([X|Xs], #s{atoms=C}=S) when is_atom(X) ->
    count(Xs, S#s{atoms=C+1});
count([_|Xs], #s{other=C}=S) ->
    count(Xs, S#s{other=C+1});
count([], S) ->
    S.

count(List) counts the number of atoms and the number of other terms in the given list. For example:

1> -record(s, {atoms=0,other=0}).
ok
2> count1:count([a,b,c,1,2,3,4,5]).
#s{atoms = 3,other = 5}

Here follows the BEAM code emitted for count/2:

    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,0}}.
    {test,is_atom,{f,5},[{x,2}]}.
    {get_tuple_element,{x,1},1,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 => {t_atom,[s]},
                                  2 => {t_integer,{0,'+inf'}},
                                  3 => {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[2,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
    {call_only,2,{f,4}}. % count/2
  {label,5}.
    {get_tuple_element,{x,1},2,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.
    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 => {t_atom,[s]},
                                  2 => {t_integer,{0,'+inf'}},
                                  3 => {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[3,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.
    {call_only,2,{f,4}}. % count/2
  {label,6}.
    {test,is_nil,{f,3},[{x,0}]}.
    {move,{x,1},{x,0}}.
    return.

The first two instructions test whether the first argument in {x,0} is a non-empty list and if so extracts the first element of the list:

    {test,is_nonempty_list,{f,6},[{x,0}]}.
    {get_list,{x,0},{x,2},{x,0}}.

The next instruction tests whether the first element is an atom. If not, a jump is made to the code for the second clause.

    {test,is_atom,{f,5},[{x,2}]}.

Next the counter for the number of atoms seen is fetched from the record and incremented by one:

    {get_tuple_element,{x,1},2,{x,2}}.
    {gc_bif,'+',{f,0},3,[{tr,{x,2},{t_integer,{0,'+inf'}}},{integer,1}],{x,2}}.

Next follows allocation of heap space and the updating of the record:

    {test_heap,4,3}.
    {update_record,{atom,inplace},
                   3,
                   {tr,{x,1},
                       {t_tuple,3,true,
                                #{1 => {t_atom,[s]},
                                  2 => {t_integer,{0,'+inf'}},
                                  3 => {t_integer,{0,'+inf'}}}}},
                   {x,1},
                   {list,[3,{tr,{x,2},{t_integer,{1,'+inf'}}}]}}.

The test_heap instruction ensures that there is sufficient room on the heap for copying the record (4 words).

The update_record instruction was introduced in Erlang/OTP 26. Its first operand is an atom that is a hint from the compiler to help the JIT emit better code. In Erlang/OTP 26 the hints reuse and copy are used. For more about those hints, see Updating records in OTP 26.

In Erlang/OTP 27, there is a new hint called inplace. The compiler emits that hint when it has determined that nowhere in the runtime system is there another reference to the tuple except for the reference used for the update_record instruction. In other words, from the compiler’s point of view, if the runtime system were to directly update the existing record without first copying it, the observable behavior of the program would not change. As soon will be seen, from the runtime system’s point of view, directly updating the record is not always safe.

This new optimization was implemented by Frej Drejhammar. It builds on and extends the compiler passes added in Erlang/OTP 26 for appending to a binary.

Now let’s see what the JIT will do when a record_update instruction has an inplace hint. Here is the complete native code for the instruction:

# update_record_in_place_IsdI
    mov rax, qword ptr [rbx+8]
    mov rcx, qword ptr [rbx+16]
    test cl, 1
    short je L38           ; Update directly if small integer.

    ; The new value is a bignum.
    ; Test whether the tuple is in the safe part of the heap.

    mov rdi, [r13+480]     ; Get the high water mark
    cmp rax, r15           ; Compare tuple pointer to heap top
    short jae L39          ; Jump and copy if above
    cmp rax, rdi           ; Compare tuple pointer to high water
    short jae L38          ; Jump and overwrite if above high water

    ; The tuple is not in the safe part of the heap.
    ; Fall through to the copy code.

L39:                       ; Copy the current record
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    lea rax, qword ptr [r15+2] ; Set up tagged pointer to copy
    add r15, 32            ; Advance heap top past the copy

L38:
    mov rdi, rcx           ; Get new value for atoms field
    mov qword ptr [rax+22], rdi
    mov qword ptr [rbx+8], rax

(Lines starting with # are comments emitted by the JIT, while the text that follows ; is a comment added by me for clarification.)

The BEAM loader renames an update_record instruction with an inplace hint to update_record_in_place.

The first two instructions load the tuple to be update into CPU register rax and the new counter value (C + 1) into rcx.

    mov rax, qword ptr [rbx+8]
    mov rcx, qword ptr [rbx+16]

The next two instructions test whether the new counter value is a small integer that fits into a word. The test has been simplified to a more efficient test that is only safe when the value is known to be an integer. If it is a small integer, it is always safe to jump to the code that updates the existing tuple:

    test cl, 1
    short je L38           ; Update directly if small integer.

If it is not a small integer, it must be a bignum, that is a signed integer that does not fit in 60 bits and therefore have to be stored on the heap with rcx containing a tagged pointer to the bignum on the heap.

If rcx is a pointer to a term on the heap, it is not always safe to directly updating the existing tuple. That is because of the way the Erlang generational garbage collector works. Each Erlang process has two heaps for keeping Erlang terms: the young heap and the old heap. Terms on the young heap are allowed to reference terms on the old heap, but not vice versa. That means that if the tuple to be updated resides on the old heap, it is not safe to update one of its elements so that it will reference a term on the young heap.

Therefore, the JIT needs to emit code to ensure that the pointer to the tuple resides in the “safe part” of the young heap:

    mov rdi, [r13+480]     ; Get the high water mark
    cmp rax, r15           ; Compare tuple pointer to heap top
    short jae L39          ; Jump and copy if above
    cmp rax, rdi           ; Compare tuple pointer to high water
    short jae L38          ; Jump and overwrite if above high water

The safe part of the heap is between the high water mark and the heap top. If the tuple is below the high water mark, if it is still alive, it will be copied to the old heap in the next garbage collection.

If the tuple is in the safe part, the copy code is skipped by jumping to the code that stores the new value into the existing tuple.

If not, the next part will copy the existing record to the heap.

L39:                       ; Copy the current record
    vmovups ymm0, [rax-2]
    vmovups [r15], ymm0
    lea rax, qword ptr [r15+2] ; Set up tagged pointer to copy
    add r15, 32            ; Advance heap top past the copy

The copying is done using AVX instructions.

Next follows the code that writes the new value into the tuple:

L38:
    mov rdi, rcx           ; Get new value for atoms field
    mov qword ptr [rax+22], rdi
    mov qword ptr [rbx+8], rax

If all the new values being written into the existing record are known never to be tagged pointers, the native instructions can be simplified. Consider this module:

-module(whatever).
-export([main/1]).

-record(bar, {bool,pid}).

main(Bool) when is_boolean(Bool) ->
    flip_state(#bar{bool=Bool,pid=self()}).

flip_state(R) ->
    R#bar{bool=not R#bar.bool}.

The update_record instruction looks like this:

    {update_record,{atom,inplace},
                   3,
                   {tr,{x,0},
                       {t_tuple,3,true,
                                #{1 => {t_atom,[bar]},
                                  2 => {t_atom,[false,true]},
                                  3 => pid}}},
                   {x,0},
                   {list,[2,{tr,{x,1},{t_atom,[false,true]}}]}}.

Based on the type for the new value, {t_atom,[false,true]}, the JIT is able to generate much shorter code than for the previous example:

# update_record_in_place_IsdI
    mov rax, qword ptr [rbx]
# skipped copy fallback because all new values are safe
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+14], rdi
    mov qword ptr [rbx], rax

References to literals (such as [1,2,3]) are also safe, because literals are stored in a special literal area, and the garbage collector handles them specially. Consider this code:

-record(state, {op, data}).

update_state(R0, Op0, Data) ->
    R = R0#state{data=Data},
    case Op0 of
        add -> R#state{op=fun erlang:'+'/2};
        sub -> R#state{op=fun erlang:'-'/2}
    end.

Both of the record updates in the case can be done in place. Here is the BEAM code for the record update in the first clause:

    {update_record,{atom,inplace},
                   3,
                   {tr,{x,0},{t_tuple,3,true,#{1 => {t_atom,[state]}}}},
                   {x,0},
                   {list,[2,{literal,fun erlang:'+'/2}]}}.

Since the value to be written is a literal, the JIT emits simpler code without the copy fallback:

# update_record_in_place_IsdI
    mov rax, qword ptr [rbx]
# skipped copy fallback because all new values are safe
    long mov rdi, 9223372036854775807  ; Placeholder for address to fun
    mov qword ptr [rax+14], rdi
    mov qword ptr [rbx], rax

The large integer 9223372036854775807 is a placeholder that will be patched later when the address of the literal fun will be known.

Here is the pull request for updating tuples in place:

#8090: Destructive tuple update

Optimizing by generating less garbage #

When updating a record in place, omitting the copying of the existing record should be a clear win, except perhaps for very small records.

What is less clear is the effect on garbage collection. Updating a tuple in place is an example of optimizing by generating less garbage. By creating less garbage, the expectation is that garbage collections should occur less often, which should improve the performance of the program.

Because of the highly variable execution time for doing a garbage collection, it is notoriously difficult to benchmark optimizations that reduce the amount of garbage created. Often the outcomes of benchmarks do not apply to performing the same tasks in a real application.

My own anecdotal evidence suggests that in most cases there are no measurable performance wins by producing less garbage.

I also remember when an optimization that reduced the size of an Erlang term resulted in a benchmark being consistently slower. It took the author of that optimization several days of investigation to confirm that the slowdown in the benchmark was not the fault of his optimization, but by creating less garbage, garbage collection happened at a later time when it happened to be much more expensive.

On average we expect that this optimization should improve performance, especially for large records.

Optimization of funs #

The internal representation of funs in the runtime system has changed in Erlang/OTP 27, making possible several new optimizations.

As an example, consider this function:

madd(A, C) ->
    fun(B) -> A * B + C end.

In Erlang/OTP 26, the native code for creating the fun looks like so:

# i_make_fun3_FStt
L38:
    long mov rsi, 9223372036854775807 ; Placeholder for dispatch table
    mov edx, 1
    mov ecx, 2
    mov qword ptr [r13+80], r15
    mov rbp, rsp
    lea rsp, qword ptr [rbx-128]
    vzeroupper
    mov rdi, r13
    call 4337160320       ; Call helper function in runtime system
    mov rsp, rbp
    mov r15, qword ptr [r13+80]
# Move fun environment
    mov rdi, qword ptr [rbx]
    mov qword ptr [rax+40], rdi
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+48], rdi
# Create boxed ptr
    or al, 2
    mov qword ptr [rbx], rax

The large integer 9223372036854775807 is a placeholder for a value that will be filled in later.

Most of the work of actually creating the fun object is done by calling a helper function (the call 4337160320 instruction) in the runtime system.

In Erlang/OTP 27, the part of fun that resides on the heap of the calling process has been simplified so that it is now smaller than in Erlang/OTP 26, and most importantly does not contain anything that is too tricky to initialize in inline code.

The code for creating the fun is not only shorter, but it also doesn’t need to call any function in the runtime system:

# i_make_fun3_FStt
L38:
    long mov rax, 9223372036854775807 ; Placeholder for dispatch table
# Create fun thing
    mov qword ptr [r15], 196884
    mov qword ptr [r15+8], rax
# Move fun environment
# (moving two items)
    vmovups xmm0, xmmword ptr [rbx]
    vmovups xmmword ptr [r15+16], xmm0
L39:
    long mov rdi, 9223372036854775807 ; Placeholder for fun reference
    mov qword ptr [r15+32], rdi
# Create boxed ptr
    lea rax, qword ptr [r15+2]
    add r15, 40
    mov qword ptr [rbx], rax

The difference from Erlang/OTP 26 is that the parts of the fun that is only needed when loading and unloading code are no longer stored on the heap. Instead those parts are stored in the literal pool area belonging to the loaded code for the module, and are shared by all instances of the same fun.

The part of the fun that resides on the process heap is two words smaller compared to Erlang/OTP 26.

The creation of the fun environment has also been optimized. In Erlang/OTP 26, four instructions were needed:

# Move fun environment
    mov rdi, qword ptr [rbx]
    mov qword ptr [rax+40], rdi
    mov rdi, qword ptr [rbx+8]
    mov qword ptr [rax+48], rdi

In Erlang/OTP 27, using AVX instructions both variables (A and C) can be moved using only two instructions:

# Move fun environment
# (moving two items)
    vmovups xmm0, xmmword ptr [rbx]
    vmovups xmmword ptr [r15+16], xmm0

Another optimization made possible by the changed fun representation is testing for a fun having a specific arity (the number of expected arguments when calling it). For example:

ensure_fun_0(F) when is_function(F, 0) -> ok.

Here is the native code emitted by the JIT in Erlang/OTP 26:

# is_function2_fss
    mov rdi, qword ptr [rbx]   ; Fetch `F` from {x,0}.

    rex test dil, 1            ; Test whether the term is a tagged pointer...
    short jne label_3          ; ... otherwise fail.

    mov eax, dword ptr [rdi-2] ; Pick up the header word.
    cmp eax, 212               ; Test whether it is a fun...
    short jne label_3          ; ... otherwise fail.

    cmp byte ptr [rdi+22], 0   ; Test whether the arity is 0...
    short jne label_3          ; ... otherwise fail.

In Erlang/OTP 27, the arity for the fun (the number of expected arguments) is stored in the header word of the fun term, which means that the test for a fun can be combined with the test for its arity:

# is_function2_fss
    mov rdi, qword ptr [rbx]   ; Fetch `F` from {x,0}.

    rex test dil, 1            ; Test whether the term is a tagged pointer...
    short jne label_3          ; ... otherwise fail.

    cmp word ptr [rdi-2], 20   ; Test whether this is a fun with arity 0...
    short jne label_3          ; ... otherwise fail.

All external funs are now literals stored outside all process heaps. As an example, consider the following functions:

my_fun() ->
    fun ?MODULE:some_function/0.

mfa(M, F, A) ->
    fun M:F/A.

In Erlang/OTP 26, the external fun returned by my_fun/0 would not occupy any room on the heap of the calling process, while the dynamic external fun returned by mfa/3 would need 5 words on the heap of the calling process.

In Erlang/OTP 27, neither of the funs will require any room on the heap of the calling process.

Those optimizations were implemented in the following pull requests:

Integer arithmetic improvements #

In the end of June last year, we released the OTP 26.0.2 patch for Erlang/OTP 26 that made binary_to_integer/1 faster.

To find out how much faster, run this benchmark:

bench() ->
    Size = 1_262_000,
    String = binary:copy(<<"9">>, Size),
    {Time, _Val} = timer:tc(erlang, binary_to_integer, [String]),
    io:format("Size: ~p, seconds: ~p\n", [Size, Time / 1_000_000]).

It measures the time to convert a binary holding 1,262,000 digits to an integer.

Running an unpatched Erlang/OTP 26 on my Intel-based iMac from 2017, the benchmark finishes in about 10 seconds.

The same benchmark run using Erlang/OTP 26.0.2 finishes in about 0.4 seconds.

The speed-up was achieved by three separate optimizations:

binary_to_integer/1 was implemented as a BIF in C using a naive algorithm that didn’t scale well. It was replaced with a divide-and-conquer algorithm implemented in Erlang. (Implementing the new algorithm as a BIF wasn’t faster than the Erlang version.)
The runtime system’s function for doing multiplication of large integers was modified to use the Karatsuba algorithm, which is a divide-and-conquer multiplication algorithm invented in the 1960s.
Some of the low-level helper functions for arithmetic with large integers (bignums) were modified to take advantage of a 128-bit integer data type on 64-bit CPUs when supported by the C compiler.

Those improvements were implemented in the following pull request:

#7426: Optimize binary_to_integer/1 and friends

In Erlang/OTP 27, some additional improvement of integer arithemetic were implemented. That reduced the execution time for the binary_to_integer/1 benchmark to about 0.3 seconds.

Those improvements are found in the following pull request:

#7553: Optimize integer arithmetic

Those arithmetic enchancements improve the running times for the pidigits benchmark:

Version		Seconds
26.0		`7.635`
26.2.1		`2.959`
27.0		`2.782`

(Run on my M1 MacBook Pro.)

Numerous miscellaneous enhancements #

Many enhancements have been made to the code generation for many instructions, as well as a few to the Erlang compiler. Here follows a single example to show one of the improvements to the =:= operator:

ensure_empty_map(Map) when Map =:= #{} ->
    ok.

Here is the BEAM code for the =:= operator as used in this example:

    {test,is_eq_exact,{f,1},[{x,0},{literal,#{}}]}.

Here is the native code for Erlang/OTP 26:

# is_eq_exact_fss
L45:
    long mov rsi, 9223372036854775807
    mov rdi, qword ptr [rbx]
    cmp rdi, rsi
    short je L44                  ; Succeeded if the same term.

    rex test dil, 1
    short jne label_1             ; Fail quickly if not a tagged pointer.

    ; Call the general runtime function for comparing two terms.
    mov rbp, rsp
    lea rsp, qword ptr [rbx-128]
    vzeroupper
    call 4549723200
    mov rsp, rbp

    test eax, eax
    short je label_1               ; Fail if unequal.
L44:

The code begins with a few tests to quickly succeed or fail, but in practice those are unlikely to trigger for this example, which means that the call to the general routine in the runtime system for comparing two terms will almost always be called.

In Erlang/OTP 27, the JIT emits special code for testing whether a term is an empty map:

# is_eq_exact_fss
# optimized equality test with empty map
    mov rdi, qword ptr [rbx]
    rex test dil, 1
    short jne label_1              ; Fail if not a tagged pointer.

    cmp dword ptr [rdi-2], 300
    short jne label_1              ; Fail if not a map.

    cmp dword ptr [rdi+6], 0
    short jne label_1              ; Fail if size is not zero.

Here follows the main pull requests for miscellaneous enhancements in Erlang/OTP 27: