ZhixiongXu's Blog: Saturated Q15 and Q31 arithmetic

Saturated Q15 and Q31 arithmetic

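A 32-bit signed value can be treated as having a binary point immediately after its sign bit. This is equivalent to dividing its signed integer value by 2³¹, so that it represents numbers from –1 to (1 – 2^–31). A 32-bit value used to represent a fractional number in this fashion is known as a Q31 number. In the same way, a 16-bit signed value with a binary point immediately after its sign bit has its integer value divided by 2¹⁵, represents numbers from –1 to (1 – 2^–15), and is known as a Q15 number.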
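As a rough illustration of this interpretation, the following C sketch converts raw Q31 and Q15 bit patterns to real numbers. The type and function names (`q31_t`, `q15_t`, `q31_to_double`, `q15_to_double`) are made up for this example and are not taken from any particular library.

```c
#include <stdint.h>

/* Hypothetical type names for this sketch only. */
typedef int32_t q31_t;   /* 32-bit value, binary point after the sign bit */
typedef int16_t q15_t;   /* 16-bit value, binary point after the sign bit */

/* A Q31 value is its signed integer value divided by 2^31. */
static double q31_to_double(q31_t x)
{
    return (double)x / 2147483648.0;   /* 2^31 */
}

/* A Q15 value is its signed integer value divided by 2^15. */
static double q15_to_double(q15_t x)
{
    return (double)x / 32768.0;        /* 2^15 */
}
```

Under these assumptions, the most negative integer maps to exactly –1, and the most positive integer maps to just under +1 (1 – 2^–31 for Q31, 1 – 2^–15 for Q15).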

Saturated additions, subtractions, and doublings can be performed on Q31 numbers using the same instructions as are used for saturated integer arithmetic, since every operand and result is simply scaled by the same factor of 2^–31.
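A portable C model of these operations might look like the sketch below. It is not the instruction sequence itself: it widens to 64 bits and clamps, which is one common way to express saturation in C. The helper names (`sat32`, `q31_add_sat`, `q31_sub_sat`, `q31_double_sat`) are hypothetical.

```c
#include <stdint.h>

/* Clamp a 64-bit intermediate result to the signed 32-bit range. */
static int32_t sat32(int64_t v)
{
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}

/* Saturated addition: identical for plain integers and for Q31 values. */
static int32_t q31_add_sat(int32_t a, int32_t b)
{
    return sat32((int64_t)a + (int64_t)b);
}

/* Saturated subtraction. */
static int32_t q31_sub_sat(int32_t a, int32_t b)
{
    return sat32((int64_t)a - (int64_t)b);
}

/* Saturated doubling, expressed as adding a value to itself. */
static int32_t q31_double_sat(int32_t a)
{
    return q31_add_sat(a, a);
}
```

For example, adding the Q31 value 0x40000000 (0.5) to itself would give +1, which is not representable, so the model returns 0x7FFFFFFF instead.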

If two Q15 numbers are multiplied together as integers, the resulting integer needs to be scaled by a factor of 2^–15 × 2^–15 == 2^–30 to give the value it represents. For example, multiplying the Q15 number 0x8000 (representing –1) by itself using a signed integer multiplication instruction yields the value 0x40000000, which is 2³⁰ times the desired result of +1.
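The scaling can be checked with a small C sketch. The function name `q15_mul_raw` is hypothetical, and `INT16_MIN` is used for the bit pattern 0x8000 because converting the constant 0x8000 to `int16_t` is implementation-defined in C.

```c
#include <stdint.h>

/* Multiply two Q15 values as plain signed integers.
   The 32-bit result carries a scale factor of 2^-30, not 2^-31. */
static int32_t q15_mul_raw(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

With this definition, `q15_mul_raw(INT16_MIN, INT16_MIN)` gives 0x40000000, and multiplying 0x4000 (0.5) by itself gives 0x10000000, which is 0.25 × 2³⁰.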

This means that the result of the integer multiplication instruction is not quite in Q31 form. To get it into Q31 form, it must be doubled, so that the required scaling factor becomes 2^–31. Furthermore, it is possible that the doubling will cause integer overflow, so the result should in fact be doubled with saturation. In particular, the result 0x40000000 from the multiplication of 0x8000 by itself should be doubled with saturation to produce 0x7FFFFFFF (the closest possible Q31 number to the correct mathematical result of –1 × –1 == +1). If it were doubled without saturation, it would instead produce 0x80000000, which is the Q31 representation of –1.
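The difference between the two kinds of doubling can be sketched in C as follows. The function names are made up for this example. The wrapping version works in unsigned arithmetic and returns the raw 32-bit pattern, because signed overflow is undefined behaviour in C.

```c
#include <stdint.h>

/* Doubling with wraparound. Returns the raw 32-bit pattern that a
   non-saturating doubling would leave in a register. */
static uint32_t double_wrap_bits(int32_t x)
{
    return (uint32_t)((uint32_t)x * 2u);
}

/* Doubling with saturation to the signed 32-bit range. */
static int32_t double_sat(int32_t x)
{
    int64_t d = 2 * (int64_t)x;
    if (d > INT32_MAX) return INT32_MAX;
    if (d < INT32_MIN) return INT32_MIN;
    return (int32_t)d;
}
```

Applied to 0x40000000, the wrapping version yields the pattern 0x80000000 (–1 in Q31), while the saturating version yields 0x7FFFFFFF.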

To implement a saturated Q15 × Q15 → Q31 multiplication, therefore, an integer multiply instruction should be followed by a saturated integer doubling. The latter can be performed by a QADD instruction adding the multiply result to itself.
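Putting the two steps together, a portable C model of the sequence could look like this. It is only a sketch: `qadd_model` is a hypothetical stand-in for the arithmetic result of QADD (it does not model any status flags), and `q15_mul_to_q31` is a name invented for this example.

```c
#include <stdint.h>

/* Portable model of a saturating signed 32-bit addition. */
static int32_t qadd_model(int32_t a, int32_t b)
{
    int64_t sum = (int64_t)a + (int64_t)b;
    if (sum > INT32_MAX) return INT32_MAX;
    if (sum < INT32_MIN) return INT32_MIN;
    return (int32_t)sum;
}

/* Saturated Q15 x Q15 -> Q31 multiplication:
   step 1: integer multiply (result scaled by 2^-30),
   step 2: saturated doubling by adding the product to itself. */
static int32_t q15_mul_to_q31(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    return qadd_model(product, product);
}
```

The only input pair that actually saturates is –1 × –1, where the model returns 0x7FFFFFFF. Every other pair of Q15 inputs has a product whose doubled value fits in 32 bits.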

Saturday, January 21, 2023
