2.5 Finite word-length effects

FIR filters can be realized in hardware or in software. Regardless of which realization is used, a problem known as the finite word-length effect exists in either case. One of the objectives when designing filters is to lessen the finite word-length effects as much as possible, so that the initial requirements (filter specifications) are still satisfied.

In a software filter implementation, it is possible to use either fixed-point or floating-point arithmetic. Both number representations have advantages as well as disadvantages.

The fixed-point representation is used for saving coefficients and samples in memory. The most commonly used fixed-point format reserves one bit for the sign of a number (0 denotes a positive and 1 a negative number), while the remaining bits hold its value. This format is mostly used to represent numbers in the range -1 to +1. Numbers represented in the fixed-point format are uniformly quantized with the quantization step 1/2^(N-1), where N is the total number of bits in the word. As one bit is the sign bit, there are N-1 bits available for quantizing the value. The maximum error that may occur during quantization is half the quantization step, that is 1/2^N. Accuracy therefore increases as the number of bits increases. Table 2-5-1 shows the quantization steps and the maximum errors caused by quantization in the fixed-point representation.

NUMBER OF BITS	RANGE OF NUMBERS	QUANTIZATION STEP	MAX. QUANTIZATION ERROR	NUMBER OF EXACT DECIMAL PLACES
4	(-1, +1)	0.125	0.0625	1
8	(-1, +1)	0.0078125	0.00390625	2
16	(-1, +1)	3.0517578125*10^-5	1.52587890625*10^-5	4
32	(-1, +1)	4.6566128730774*10^-10	2.3283064365387*10^-10	9
64	(-1, +1)	1.0842021724855*10^-19	5.4210108624275*10^-20	19

Table 2-5-1. Quantization of numbers represented in the fixed-point format
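
The entries of Table 2-5-1 follow directly from the word length. The short Python sketch below recomputes them, assuming a word with one sign bit and the range -1 to +1 as described above. The function name fixed_point_resolution is only illustrative.

```python
def fixed_point_resolution(n_bits):
    """Return (quantization step, maximum rounding error) for a
    fixed-point word of n_bits with one sign bit and range -1 to +1."""
    step = 2.0 ** -(n_bits - 1)   # n_bits - 1 bits are left for the value
    max_error = step / 2          # rounding is off by at most half a step
    return step, max_error


for n in (4, 8, 16, 32, 64):
    step, max_error = fixed_point_resolution(n)
    print(f"{n:2d} bits: step = {step:.13e}, max error = {max_error:.13e}")
```

Each additional bit halves both the step and the maximum error.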

The advantage of this representation is that quantization errors tend to average out to 0, which means that errors do not accumulate in operations performed upon fixed-point numbers. One of its disadvantages is lower accuracy in coefficient representation. The difference between the actual sampled value and the quantized value, i.e. the quantization error, gets smaller as the quantization step decreases, so with a sufficiently small step the effects of the quantization error become negligible.

The floating-point arithmetic saves values with better accuracy because of the dynamic range it provides. Floating-point representations cover a much wider range of numbers and allow an appropriate number of digits to be saved faithfully. A value normally consists of three parts. The first part is, as in the fixed-point format, one bit known as the sign bit. The second part is the mantissa M, which is the fractional part of the number, and the third part is the exponent E, which can be either positive or negative. A number in the floating-point format looks as follows:

$$x = (-1)^{S} \cdot M \cdot 2^{E}$$

where S is the sign bit, M is the mantissa and E is the exponent.

As seen, the sign bit together with the mantissa forms a fixed-point number. The third part, i.e. the exponent, gives the floating-point representation its dynamic range, which enables both extremely large and extremely small numbers to be saved with appropriate accuracy. Such numbers could not be represented in the fixed-point format. Table 2-5-2 below provides the basic information on the floating-point representation for two different word lengths.

NUMBER OF BITS	MANTISSA SIZE (BITS)	EXPONENT SIZE (BITS)	RANGE	NUMBER OF EXACT DECIMAL DIGITS
16	7	8	2.3x10^-38 .. 3.4x10^38	2
32	23	8	1.4x10^-45 .. 3.4x10^38	6-7

Table 2-5-2. Quantization of numbers represented in the floating-point format

It is not possible to state a single quantization step for the floating-point representation as the step depends on the exponent. The exponent is chosen so that the quantization step is as small as possible for the value being saved. In this number representation, special attention should be paid to the number of digits that are saved with no error.

The floating-point arithmetic is suitable for coefficient representation. The errors made in this case are considerably smaller than those made in the fixed-point arithmetic. Among the disadvantages of this representation are a complex implementation and errors that do not tend to average out to 0. The problem is most obvious when an operation is performed upon two values of which one is much smaller than the other.
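
The loss that occurs when one operand is much smaller than the other can be reproduced in a few lines. This is only a sketch: it assumes the 32-bit format of Table 2-5-2 and emulates it with the standard struct module, and the helper to_float32 is an illustrative name.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


big = to_float32(1.0)
small = to_float32(1e-8)          # well below the spacing of 32-bit floats near 1.0

total = to_float32(big + small)   # the sum is rounded back to 32 bits
print(total == big)               # True: the small operand is lost entirely
```

Near 1.0 the spacing between adjacent 32-bit values is 2^-23 (about 1.2*10^-7), so anything smaller than half of that disappears when added to 1.0.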

Example

FIR filter coefficients:

{0.151365, 0.400000, 0.151365}

Coefficients need to be represented as 16-bit numbers in the fixed-point and floating-point formats. If we suppose that numbers range between -1 and +1, the quantization step used in this example amounts to 1 / 2^16 = 0.0000152587890625. After quantization, the filter coefficients have the following values:

{0.1513671875, 0.399993896484375, 0.1513671875}

Quantization errors are:

{-0.0000021875, 0.000006103515625, -0.0000021875}

If the filter coefficients are represented in the floating-point format, it is not possible to state a single quantization step. In this case, the coefficients have the following values:

{0.151364997029305, 0.400000005960464, 0.151364997029305}

Quantization errors produced while representing the coefficients in the floating-point format (the values listed above correspond to the 32-bit format rather than a 16-bit one) are:

{0.000000002970695, -0.000000005960464, 0.000000002970695}

As seen, the coefficient error is smaller in the floating-point representation.
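
The numbers in this example can be reproduced with a short Python sketch. It assumes rounding to the nearest multiple of 1 / 2^16 for the fixed-point case, and it assumes that the floating-point values are 32-bit roundings, emulated here with the standard struct module. The helper name to_float32 is illustrative.

```python
import struct

coefficients = [0.151365, 0.400000, 0.151365]

# Fixed point: round to the nearest multiple of the step used in the example.
step = 1 / 2 ** 16
fixed = [round(c / step) * step for c in coefficients]
fixed_errors = [c - q for c, q in zip(coefficients, fixed)]


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


# Floating point: the step depends on the magnitude of each coefficient.
floating = [to_float32(c) for c in coefficients]
floating_errors = [c - q for c, q in zip(coefficients, floating)]

print(fixed, fixed_errors)
print(floating, floating_errors)
```

The errors are computed as the original value minus the quantized value, which matches the signs listed in the example.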

Floating-point arithmetic can also be expressed in terms of fixed-point arithmetic. For this reason, the fixed-point arithmetic is more often implemented in digital signal processors.

The finite word-length effect shows up as a deviation of the FIR filter characteristic from the designed one. If the resulting characteristic still meets the filter specifications, the finite word-length effects are negligible.

As a result of greater error in coefficients representation, the finite word-length effects are more prominent in fixed-point arithmetic.

These effects are more prominent for IIR filters than for FIR filters because of the feedback in IIR structures. In addition, coefficient quantization can cause IIR filters to become unstable, whereas it cannot affect FIR filters that way.

FIR filters keep their linear phase characteristic after quantization. The reason for this is the fact that the coefficients of a FIR filter with linear phase characteristic are symmetric, which means that the corresponding pairs of coefficients will be quantized to the same value. It results in the impulse response symmetry remaining unchanged.

After all mentioned, it is easy to notice that finite word length, used for representing coefficients and samples being processed, causes some problems such as:

1. Coefficient quantization errors;

2. Sample quantization errors (quantization noise); and

3. Overflow errors.

2.5.1 Coefficient Quantization

Coefficient quantization changes the transfer function of an FIR filter. The positions of the FIR filter zeros change, whereas the positions of its poles remain unchanged because they are all located at z = 0 and quantization has no effect on them. The conclusion is that quantization of FIR filter coefficients cannot cause a filter to become unstable, as is the case with IIR filters.

Even though there is no danger of FIR filter destabilization, the transfer function may deviate to such an extent that it no longer meets the specifications, which means that the resulting filter is not suitable for the intended application.

The FIR filter quantization errors cause the stopband attenuation to become lower. If it drops below the limit defined by the specifications, the resulting filter is useless.

Transfer function changes caused by FIR filter coefficient quantization are more pronounced for high-order filters. The reason for this is that the spacing between the zeros of the transfer function gets smaller as the filter order increases, so even slight changes in zero positions affect the FIR filter frequency response.
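
The effect of coefficient quantization on the frequency response can be checked numerically. The sketch below reuses the three coefficients from the earlier example and an assumed 8-bit word. It only illustrates the mechanism on a very short filter, not the high-order case discussed above. The function names are illustrative.

```python
import cmath
import math


def frequency_response(b, omega):
    """H(e^{j*omega}) of an FIR filter with coefficients b."""
    return sum(bk * cmath.exp(-1j * omega * k) for k, bk in enumerate(b))


def quantize(b, n_bits):
    """Round each coefficient to an n_bits fixed-point word (one sign bit)."""
    step = 2.0 ** -(n_bits - 1)
    return [round(bk / step) * step for bk in b]


b = [0.151365, 0.400000, 0.151365]      # coefficients from the earlier example
bq = quantize(b, 8)

# Worst-case deviation of the response over a grid of frequencies.
grid = [math.pi * i / 256 for i in range(257)]
deviation = max(abs(frequency_response(b, w) - frequency_response(bq, w))
                for w in grid)

# The deviation can never exceed the sum of the coefficient error magnitudes.
bound = sum(abs(x - y) for x, y in zip(b, bq))
print(bq, deviation, bound)
```

Symmetric coefficients are quantized to identical values, so the quantized filter keeps its linear phase. The response deviation is bounded by the sum of the individual coefficient errors, which is why a longer filter with the same word length can deviate more.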

2.5.2 Samples Quantization

Another problem caused by the finite word length is sample quantization performed at the multiplier’s output (after filtering). The process of filtering can be represented as a sum of products of filter coefficients and signal samples appearing at the filter input. Figure 2-5-1 shows a block diagram of input signal filtering together with quantization of the result.

Figure 2-5-1. Signal samples filtering

Multiplication of two numbers, each N bits in length, gives a product which is 2N bits in length. The extra N bits are usually not kept, so the product has to be truncated or rounded off to N bits, producing truncation or round-off errors. The latter is preferable in practice because in this case the mean value of the quantization error (quantization noise) is equal to 0.

In most cases, hardware used for FIR filter realization is designed so that after each individual multiplication, a partial sum is accumulated in a register which is 2N bits in length. Only when the filtering is complete is the result quantized to N bits. Quantization noise is therefore introduced only once, which drastically reduces it.
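
The difference between quantizing every product and quantizing once at the end can be sketched with integer arithmetic. The example below assumes a 16-bit word with 15 fractional bits and picks coefficient and sample values from this chapter purely for illustration. The helper names are hypothetical.

```python
FRAC_BITS = 15                      # 16-bit word, one sign bit
SCALE = 1 << FRAC_BITS


def to_fixed(x):
    """Convert a value in the range -1 to +1 to a scaled integer."""
    return round(x * SCALE)


def round_product(value):
    """Round a double-width value (30 fractional bits) back to 15 bits."""
    return (value + (1 << (FRAC_BITS - 1))) >> FRAC_BITS


coefficients = [to_fixed(c) for c in (0.151365, 0.4, 0.151365)]
samples = [to_fixed(x) for x in (0.9, 0.7, 0.1)]

# (a) Quantize after every multiplication, then add the N-bit products.
per_product = sum(round_product(c * x) for c, x in zip(coefficients, samples))

# (b) Accumulate full 2N-bit products and quantize only once at the end.
accumulator = sum(c * x for c, x in zip(coefficients, samples))
once = round_product(accumulator)

exact = accumulator / SCALE          # exact sum, in units of one quantization step
print(abs(per_product - exact), abs(once - exact))
```

With a single rounding the error is at most half a quantization step, whereas rounding each of the three products can add up to three times that.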

Quantization noise depends on the number of bits N. The quantization noise is reduced as the number of bits used for sample and coefficient representation increases.

Both the filter realization and the position of the poles affect the quantization noise power. As all FIR filter poles are located at z = 0, the effect of the filter realization on the quantization noise is almost negligible.

2.5.3 Overflow

Overflow happens when some intermediate results exceed the range of numbers that can be represented by the given word-length. For the fixed-point arithmetic, coefficients and samples values are represented in the range -1 to +1. In spite of the fact that both FIR filter input and output samples are in the given range, there is a possibility that an overflow occurs at some point when the results of multiplications are added together. In other words, an intermediate result is greater than 1 or less than -1.

Example:

Assume that input samples need to be filtered using a second-order filter.

Such filter has three coefficients. These are: {0.7, 0.8, 0.7}.

Input samples are: { ..., 0.9, 0.7, 0.1, ...}

The steps of the filtering process, shown in Table 2-5-3 below, make it easy to see how an overflow occurs in the second step. The final sum is also greater than 1.

FILTER COEFFICIENTS	INPUT SAMPLE	INTERMEDIATE RESULT
0.7	0.9	0.63
0.8	0.7	0.63 + 0.56 = 1.19
0.7	0.1	1.19 + 0.07 = 1.26

Table 2-5-3. Overflow

As the range of values defined by the fixed-point representation is between -1 and +1, the results of the filtering process will be as shown in Table 2-5-4.

FILTER COEFFICIENTS	INPUT SAMPLE	INTERMEDIATE RESULT
0.7	0.9	0.63
0.8	0.7	0.63 + 0.56 - 2 = -0.81
0.7	0.1	-0.81 + 0.07 = -0.74

Table 2-5-4. Overflow effects

As mentioned, an overflow occurs in the second step. Instead of desired value +1.19, the result is an undesirable negative value -0.81. This difference of -2 between these two values is explained in Figure 2-5-2 below.

Figure 2-5-2. Signal samples filtering
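
The wraparound can be modelled in a few lines. This is a simplified floating-point sketch, assuming two's-complement behaviour in which any value leaving the range -1 to +1 re-enters it from the opposite end. The function name wrap is illustrative.

```python
def wrap(x):
    """Two's-complement style wraparound into the range [-1, +1)."""
    return (x + 1.0) % 2.0 - 1.0


coefficients = [0.7, 0.8, 0.7]
samples = [0.9, 0.7, 0.1]

acc = 0.0
partial_sums = []
for b, x in zip(coefficients, samples):
    acc = wrap(acc + b * x)       # every partial sum is forced back into range
    partial_sums.append(acc)

print([round(v, 2) for v in partial_sums])   # [0.63, -0.81, -0.74]
```

The second partial sum wraps from 1.19 to -0.81, and adding the last product (+0.07) gives -0.74, exactly 2 below the desired 1.26.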

However, if some intermediate result exceeds the range of representation, it does not necessarily corrupt the final result. If the absolute value of the true final result is less than 1, the final result is still obtained correctly. In other words, as long as the final result fits within the word length, overflow of partial results does not matter. This situation is illustrated in the following example.

Example:

The second-order filter has three coefficients. These are: {0.7, 0.8, 0.7}

Input samples are: { ..., 0.9, 0.7, -0.5, ...}

The desired intermediate results are given in the table 2-5-5.

FILTER COEFFICIENTS	INPUT SAMPLE	INTERMEDIATE RESULT
0.7	0.9	0.63
0.8	0.7	0.63 + 0.56 = 1.19
0.7	-0.5	1.19 - 0.35 = 0.84

Table 2-5-5. Desired intermediate results

As seen, some intermediate results exceed the given range and two overflows occur. Refer to the table 2-5-6 below.

FILTER COEFFICIENTS	INPUT SAMPLE	INTERMEDIATE RESULT
0.7	0.9	0.63
0.8	0.7	0.63 + 0.56 - 2 = -0.81
0.7	-0.5	-0.81 - 0.35 + 2 = 0.84

Table 2-5-6. Obtained intermediate results

So, in spite of the fact that two overflows have occurred, the final result remained unchanged. The reason for this is the nature of these two overflows. The first one decremented the result by 2, whereas the second one incremented it by 2. This way, the overflow effect is cancelled. The first overflow is called a positive overflow, whereas the latter is called a negative overflow.

Note:

If the number of positive overflows is equal to the number of negative overflows, the final result will not be changed, i.e. the overflow effect is cancelled.
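
The same cancellation can be observed with an emulated 16-bit accumulator. The sketch below assumes two's-complement addition with 15 fractional bits and uses the three products from the example (0.63, 0.56 and -0.35). The name wrap16 is illustrative.

```python
def wrap16(value):
    """Keep the low 16 bits and reinterpret them as a signed integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


SCALE = 1 << 15                                   # 15 fractional bits
products = [round(p * SCALE) for p in (0.63, 0.56, -0.35)]

acc = 0
for p in products:
    acc = wrap16(acc + p)         # 16-bit accumulator, no guard bits
    print(round(acc / SCALE, 2))  # 0.63, then -0.81, then 0.84

true_sum = sum(products)          # computed without any word-length limit
print(acc == true_sum)            # True: the two overflows cancel
```

Because two's-complement addition is modular, the accumulator ends at the correct value whenever the true final sum fits in the word, regardless of what happened to the partial sums.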

Overflow causes rapid oscillations in the output signal, which in turn causes high-frequency components to appear in the output spectrum. There are several ways to lessen the overflow effects. The two most commonly used are scaling and saturation.

It is possible to scale FIR filter coefficients to avoid overflow. A necessary and sufficient condition required for FIR filter coefficients in this case is given in the following expression:

$$\sum_{k=0}^{N-1} |b_k| \le 1$$

where:

b_k are the FIR filter coefficients; and
N is the number of filter coefficients.
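
Scaling can be sketched as follows, assuming the condition in question is the usual bound that the sum of the coefficient magnitudes must not exceed 1. The function name scale_to_avoid_overflow is hypothetical.

```python
def scale_to_avoid_overflow(b):
    """Scale coefficients so that the sum of their magnitudes is at most 1."""
    l1_norm = sum(abs(bk) for bk in b)
    if l1_norm <= 1.0:
        return list(b), 1.0       # already safe, nothing to do
    gain = 1.0 / l1_norm
    return [bk * gain for bk in b], gain


b = [0.7, 0.8, 0.7]               # coefficients used in the overflow examples
scaled, gain = scale_to_avoid_overflow(b)
print(scaled, gain)
```

For these coefficients the sum of magnitudes is 2.2, so every coefficient is multiplied by 1/2.2. The filter output is attenuated by the same factor, which is the price paid for ruling out overflow.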

If, for any reason, it is not possible to apply scaling, then the overflow effects can be lessened to some extent via saturation. Figure 2-5-3 illustrates the saturation characteristic.

Figure 2-5-3. Saturation characteristic

When the saturation characteristic is used to prevent an overflow, the intermediate result doesn’t change its sign. For this reason, the oscillations in the output signal are not so rapid and undesirable high-frequency components are attenuated.

Let’s see what happens if we apply the saturation characteristic to the previous example:

Example

Again, input samples need to be filtered using a second-order filter.

Such filter has three coefficients. These are: {0.7, 0.8, 0.7}

Input samples are: { ..., 0.9, 0.7, 0.1, ...}

The desired intermediate results are shown in Table 2-5-7 below.

FILTER COEFFICIENTS	INPUT SAMPLE	INTERMEDIATE RESULT
0.7	0.9	0.63
0.8	0.7	0.63 + 0.56 = 1.19
0.7	0.1	1.19 + 0.07 = 1.26

Table 2-5-7. Desirable intermediate results

As the range of values, defined by the fixed-point presentation, is between -1 and +1, and the saturation characteristic is used as well, the intermediate results are as shown in the table 2-5-8.

FILTER COEFFICIENTS	INPUT SAMPLE	INTERMEDIATE RESULT
0.7	0.9	0.63
0.8	0.7	0.63 + 0.56 = 1.19, saturated to 1
0.7	0.1	1 + 0.07 = 1.07, saturated to 1

Table 2-5-8. Intermediate results and saturation characteristic

The resulting sum is not correct, but the error is far smaller than when there is no saturation:

Without saturation: Δ = 1.26 - (-0.74) = 2.00 (the wrapped sum is 1.26 - 2 = -0.74)

With saturation: Δ = 1.26 - 1 = 0.26

As seen from the example above, the saturation characteristic lessens an overflow effect and attenuates undesirable components in the output spectrum.
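
A saturating accumulator for this example can be sketched as below. This is a simplified floating-point model in which every partial sum is clipped to the range -1 to +1. The function name saturate is illustrative.

```python
def saturate(x, low=-1.0, high=1.0):
    """Clip x to the representable range instead of letting it wrap."""
    return max(low, min(high, x))


coefficients = [0.7, 0.8, 0.7]
samples = [0.9, 0.7, 0.1]

acc = 0.0
partial_sums = []
for b, x in zip(coefficients, samples):
    acc = saturate(acc + b * x)   # clip after every addition
    partial_sums.append(acc)

print([round(v, 2) for v in partial_sums])   # [0.63, 1.0, 1.0]
```

The result stays at the edge of the range with the correct sign, so the error relative to the desired 1.26 is 0.26.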