PeterFarrett.com

The MPEG Audio Standard

Dr. Peter W. Farrett

Introduction

With the advent of high quality audio technology, such as the compact disc and digital audio tape, audio systems are expected to deliver excellent fidelity. A general problem for computers and special purpose devices that require high quality audio is that the storage requirements for large amounts of sampled (PCM) audio data often exceed storage capabilities. (E.g., one minute of stereo CD audio requires approximately 10.5 MB of disk storage.) A solution to this problem is an audio compression algorithm, which reduces the amount of audio data while preserving high audio quality.
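
The storage figure quoted above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes stereo CD audio (44.1 kHz, 16 bits per sample, 2 channels); the function name `pcm_bytes` is illustrative only.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Return the number of bytes needed for uncompressed PCM audio."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds

# One minute of stereo CD audio (assumed: 44.1 kHz, 16-bit, 2 channels).
cd_minute = pcm_bytes(44_100, 16, 2, 60)
print(cd_minute)               # 10584000 bytes
print(cd_minute / 1_000_000)   # 10.584 (decimal megabytes)
```

The result, roughly 10.6 MB, is in line with the approximate figure given in the text.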

MPEG Audio is a high quality, frequency-domain audio compression algorithm. It is based on subband coding techniques and a psycho-acoustic model of human perception. The algorithm is organized into three layers. Layer I is the most fundamental layer; it comprises the basic filter transforms as well as the encoder/decoder. Layer II primarily reduces the bit rate of layer I and offers better quality. Layer III primarily handles error protection, with still better quality than layer II, and performs Huffman coding and table lookup. Note that layers II and III require more complex computations to achieve the improved quality.

Overview of the Algorithm

MPEG Audio Technical Features

The algorithm utilizes subband coding with 32 subbands. Subbands are sampled at Fs/32 with bandwidths of Fs/64, where the sampling frequency Fs is 48 kHz, 44.1 kHz, or 32 kHz. The algorithm divides the input audio signal (16-20 bit PCM) into 32 subbands. A block companding technique is applied to each subband. The samples of 3 successive frames (blocks) in each subband are quantized with a bit allocation based on the maximum signal level and the minimum masking threshold within the subband. The bit allocation varies from frame to frame (dynamic bit allocation) and is itself derived from a psycho-acoustic model. For transmission, the bit-allocation information is coded into the multiplexed signal together with the coded subband samples. The decoder is less complex than the encoder since only inverse filtering has to be performed. Data rates and bandwidths are as follows: 192 kb/s @ 20 kHz (studio quality), 128 kb/s @ 20 kHz (high quality), 96 kb/s @ 20 kHz (high quality), 64 kb/s @ 12-13.5 kHz (intermediate quality), and 32 kb/s @ 6 kHz (toll quality).
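
As a quick check of the subband figures above, the following sketch computes the per-subband sampling rate (Fs/32) and bandwidth (Fs/64) for the three supported sampling frequencies. The name `subband_parameters` is illustrative and not taken from the standard.

```python
NUM_SUBBANDS = 32

def subband_parameters(fs_hz):
    """Return (subband sampling rate, subband bandwidth) in Hz for a given Fs."""
    subband_rate = fs_hz / NUM_SUBBANDS          # Fs/32
    bandwidth = fs_hz / (2 * NUM_SUBBANDS)       # Fs/64: Nyquist range split 32 ways
    return subband_rate, bandwidth

for fs in (48_000, 44_100, 32_000):
    rate, bw = subband_parameters(fs)
    print(f"Fs = {fs} Hz: subband rate = {rate} Hz, bandwidth = {bw} Hz")
```

At Fs = 48 kHz, for instance, each subband covers 750 Hz and is sampled at 1500 Hz.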

MPEG Audio Requirements

The main requirements for the algorithm are:

Sampling rates: the algorithm must function at sampling rates of 32 kHz, 44.1 kHz, and 48 kHz. Also, the output sampling/playback rate should match the input sampling rate.
Input resolution: 16-bit (uniform) PCM samples.
Bit rate: the codec (coder/decoder pair) must support at least the bit rates 64, 96, 128, and 192 kb/s. A stereo coder should use twice the bit rate of a monophonic coder.
Ancillary data: the algorithm must allow for small deviations from the specified data rates to accommodate application-specific ancillary data without a drastic loss of quality.
Delay: the total encoding, transmission, and decoding delay must be less than 80 ms at a bit rate of 2*128 kb/s and a sampling rate of 48 kHz.
Joint stereo: a coder in joint stereo mode must utilize a lower bit rate than a coder in stereo mode. (Joint stereo coding in all layers is optional.)

Codec Tasks

Encoder: The encoder utilizes an optimized filter bank, which separates the audio signal into 32 subband signals with constant bandwidths. Each subband signal is then grouped into frames of 12 successive samples, which corresponds to a duration of 8 ms at Fs = 48 kHz. Once framed, the (subband) signal is quantized to 6 bits; this yields 96 dB of dynamic range. (This is partially due to the subband coding technique handling heavily overlapping blocks of the signal.) To compensate for the uncertainty of the spectrum estimation performed by the filter bank, an FFT is performed in parallel with the filtering of the audio signal (i.e., spectrum estimation by FFT is done so that a psycho-acoustic model can obtain the masking thresholds). After the masking thresholds and the (dynamic) bit allocation are determined via psycho-acoustic modelling, a data reduction scheme is applied. (E.g., a raw input signal of 705 kb/s is compressed to 128, 96, or 64 kb/s.)
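
The frame duration and the raw input rate mentioned in the encoder description follow directly from the framing parameters. The sketch below assumes 12 samples in each of the 32 subbands per frame and a monophonic 16-bit, 44.1 kHz input for the 705 kb/s figure; the function names are illustrative.

```python
SUBBANDS = 32
SAMPLES_PER_SUBBAND = 12   # samples per subband in one frame

def frame_duration_ms(fs_hz):
    """Duration of one frame (32 subbands x 12 samples) in milliseconds."""
    return 1000 * SUBBANDS * SAMPLES_PER_SUBBAND / fs_hz

def pcm_bit_rate_kbps(fs_hz, bits_per_sample=16, channels=1):
    """Bit rate of uncompressed PCM in kb/s."""
    return fs_hz * bits_per_sample * channels / 1000

print(frame_duration_ms(48_000))   # 8.0
print(pcm_bit_rate_kbps(44_100))   # 705.6
```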

Decoder: The decoder reconstructs the side information (any scale factor and quantization information), and then separates 12 successive samples from each subband of the multiplexed signal. The process further converts the data into the form required by the inverse filter bank, which produces 16-bit linear samples. This (inverse) filter bank allows the audio signal to be reconstructed with a bandwidth of up to 24 kHz at Fs = 48 kHz.

Critical Parts of the Algorithm

The MPEG audio standard is based on subband coding (32 bands) and psycho-acoustic modeling. As mentioned earlier, the algorithm consists of three layers: layer I, which yields FM-/FM audio quality with the least complexity (i.e., it is the least MIPS-intensive); layer II, which yields FM+/CD- audio quality with moderate complexity; and layer III, which yields CD audio quality with the most complexity. The algorithm functions at various sampling rates (48 kHz, 44.1 kHz, 32 kHz) and bit rates (48, 56, 64, 96, 128, and 192 kb/s), which yield compression ratios of 8:1 to 16:1 (8:1 for high quality music applications; 16:1 for good quality voice applications). These are all run-time options.

The encoding and decoding processes comprise various modules, which perform the analysis and synthesis tasks. Salient features of the algorithm are described below.

Filter Bank System

The filter bank system is an analysis procedure that decomposes the (uncompressed) input signal, sampled at a selected frequency (Fs), into 32 subbands (i.e., each sampled at Fs/32). This partitions the spectrum into equally spaced subbands. Subband analysis proceeds as follows:

Build an input vector V of N sample values, where N = 512 audio samples for layer I and N = 1024 for layers II and III.
Update the sample history with a drop/add operation, which drops the 64 oldest samples of V and adds (concatenates) 32 newer samples to vector U.
Calculate a final vector U of 32 newer samples with the matrix coefficients, which outputs 32 reconstructed samples.
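
To illustrate what a 32-band filter bank of this kind computes, here is a minimal direct-form sketch of a cosine-modulated analysis bank. It is not the standard's optimized procedure: the prototype low-pass filter is a stand-in (a Hann-windowed sinc of assumed length 512) rather than the standard's window table, the history is assumed to advance by 32 samples per step, and all names are hypothetical.

```python
import math

NUM_BANDS = 32
TAPS = 512  # assumed prototype filter length

def prototype_lowpass():
    """Stand-in prototype: Hann-windowed sinc with cutoff pi/64 rad/sample."""
    centre = (TAPS - 1) / 2
    h = []
    for n in range(TAPS):
        t = (n - centre) / (2 * NUM_BANDS)   # never zero, centre is a half-integer
        sinc = math.sin(math.pi * t) / (math.pi * t)
        hann = 0.5 - 0.5 * math.cos(2 * math.pi * (n + 0.5) / TAPS)
        h.append(sinc * hann / (2 * NUM_BANDS))
    return h

def band_filters(h):
    """Modulate the prototype so band k is centred at (2k+1)*pi/64 rad/sample."""
    return [[h[n] * math.cos((2 * k + 1) * (n - 16) * math.pi / 64)
             for n in range(TAPS)]
            for k in range(NUM_BANDS)]

def analyze(samples, filters):
    """Return one list of 32 subband samples per block of 32 input samples."""
    history = [0.0] * TAPS   # newest sample first
    out = []
    for start in range(0, len(samples) - NUM_BANDS + 1, NUM_BANDS):
        block = samples[start:start + NUM_BANDS]
        # Drop/add step (assumed): add 32 new samples, drop the 32 oldest.
        history = block[::-1] + history[:-NUM_BANDS]
        out.append([sum(c * x for c, x in zip(f, history)) for f in filters])
    return out

# A tone at the centre of band 5 should put most of its energy in that band.
filters = band_filters(prototype_lowpass())
k = 5
omega = (2 * k + 1) * math.pi / 64
tone = [math.cos(omega * n) for n in range(NUM_BANDS * 24)]
blocks = analyze(tone, filters)
energy = [sum(b[i] ** 2 for b in blocks[16:]) for i in range(NUM_BANDS)]
print(energy.index(max(energy)))   # index of the strongest subband
```

Each call to `analyze` emits 32 subband samples for every 32 input samples, so the total sample rate is preserved while the spectrum is split into equally spaced bands.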

Psychoacoustic Model

Psychoacoustic modeling is performed with either a low complexity model or a more complex one. By utilizing an additional FFT and masking techniques, either model yields the necessary SMR (signal-to-mask ratio) for each subband. This is an important step since the bit rate is constrained by discarding unwanted bits, which produces the necessary bit allocation. To accomplish this, the algorithm derives the tonal (sinusoidal) and non-tonal (noise-like) components from the FFT spectrum. (Psychoacoustic research shows that hearing has better frequency resolution in lower frequency regions than in higher ones¹. The psychoacoustic models exploit this phenomenon by decomposing the frequency resolution (df) into band-limited spectral lines that are tonal or non-tonal.) This is accomplished by extracting the tonal and non-tonal components and calculating their masking thresholds in order to obtain frequencies, critical band rates, and Bark values. Once this is done, a signal-to-mask function computes each subband's SMR value.
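
The way SMR values can drive bit allocation may be sketched as follows. This is a simplified illustration, not the standard's procedure: it assumes the SMR is the signal level minus the masking threshold in dB, approximates quantizer SNR as 6.02 dB per bit (the standard uses tabulated values), and repeatedly gives one more bit to the subband with the lowest mask-to-noise ratio. All names and the example levels are hypothetical.

```python
def signal_to_mask_ratios(signal_db, mask_db):
    """SMR per subband: signal level minus masking threshold, in dB."""
    return [s - m for s, m in zip(signal_db, mask_db)]

def allocate_bits(smr_db, total_bits, samples_per_band=12,
                  max_bits=15, db_per_bit=6.02):
    """Greedy allocation: raise the resolution of the worst subband first."""
    bits = [0] * len(smr_db)
    remaining = total_bits
    while remaining >= samples_per_band:
        # Mask-to-noise ratio = approximate SNR of current allocation - SMR.
        mnr = [b * db_per_bit - s if b < max_bits else float("inf")
               for b, s in zip(bits, smr_db)]
        worst = min(range(len(bits)), key=lambda i: mnr[i])
        if mnr[worst] == float("inf"):
            break                      # every subband is at its maximum
        bits[worst] += 1               # one more bit for each of its samples
        remaining -= samples_per_band
    return bits

smr = signal_to_mask_ratios([60, 40, 20, 10], [30, 35, 25, 30])
print(smr)                       # [30, 5, -5, -20]
print(allocate_bits(smr, 72))    # [5, 1, 0, 0]
```

Subbands whose signal lies below the masking threshold (negative SMR) receive bits last, which is where the data reduction comes from.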

Quantization

A linear quantizer with a symmetric representation is used to quantize subband samples. Subband (input) samples are first normalized by dividing their value by a scale factor, which gives the value X to be quantized. The symmetric representation is used in order to prevent values from being quantized to different amplitude intervals. Quantization is done by calculating A*X+B (where A and B are quantization coefficients), and then taking the N most significant bits (where N represents the number of bits to encode). The resulting N bits are then transformed into a set of coded symbols.
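
The idea of a symmetric linear quantizer can be illustrated with a generic mid-tread quantizer. This sketch is not the standard's bit-exact A*X+B procedure: it assumes 2^N - 1 levels so that zero is exactly representable, and the function names are hypothetical.

```python
def quantize(sample, scale_factor, n_bits):
    """Normalize by the scale factor and map to a signed integer code."""
    half = (2 ** n_bits - 2) // 2            # levels on each side of zero
    normalized = max(-1.0, min(1.0, sample / scale_factor))
    return round(normalized * half)

def dequantize(code, scale_factor, n_bits):
    """Map an integer code back to an approximate sample value."""
    half = (2 ** n_bits - 2) // 2
    return code / half * scale_factor

code = quantize(0.3, 0.5, 4)          # 15 levels: codes -7 .. 7
print(code)                           # 4
print(dequantize(code, 0.5, 4))       # approximately 0.2857
print(quantize(-0.3, 0.5, 4))         # -4, mirror image of the positive case
```

Because the levels are placed symmetrically around zero, a sample and its negative always map to codes of equal magnitude.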

Frame Packing

This module assembles the bit stream from the output data (coded symbols) of the previous module, and adds error correction and post-processing if necessary.

Frame Unpacking

This module unpacks the encoded (sample) blocks of the compressed audio bit stream in order to recover the coded information. Error detection is performed if error checking was applied in the encoder.

Reconstruction

In this module, the data elements are decoded. That is, the waveform is reconstructed from the quantized version of the set of mapped samples.

Inverse Mapping

This module produces the digital audio output. An inverse filter-bank system transforms the samples back into (uniform) PCM format.

Compression and Performance

The algorithm is asymmetric; the encoder-to-decoder complexity ratio is approximately 2:1. The compression ratio is approximately 7:1 at Fs = 44.1 kHz and 96 kb/s (5:1 at 44.1 kHz and 128 kb/s), approximately 8:1 at Fs = 32 kHz and 64 kb/s (16:1 at 32 kHz and 32 kb/s), and approximately 4:1 at Fs = 48 kHz and 192 kb/s. (Compression is calculated as Rcomp = Fs*Rbit/Kbs, where Fs is the sampling rate in kHz, Rbit is the number of bits per sample, and Kbs is the bit rate in kb/s; e.g., 44*16/64 = 11:1.)
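
The compression formula above can be written as a one-line function. This is a minimal sketch assuming 16-bit monophonic input; the name `compression_ratio` is illustrative.

```python
def compression_ratio(fs_khz, bits_per_sample, bit_rate_kbps):
    """Rcomp = Fs * Rbit / Kbs (Fs in kHz, bit rate in kb/s)."""
    return fs_khz * bits_per_sample / bit_rate_kbps

print(compression_ratio(44, 16, 64))   # 11.0, the worked example from the text

# The other operating points quoted in this section.
for fs_khz, kbps in ((44.1, 96), (44.1, 128), (32, 64), (32, 32), (48, 192)):
    ratio = compression_ratio(fs_khz, 16, kbps)
    print(f"{fs_khz} kHz at {kbps} kb/s: about {ratio:.1f}:1")
```

The computed values agree with the rounded ratios quoted above.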

The algorithm is also floating-point oriented. However, the decoding process needs less computational power than the coding process, and the decoder can be implemented using only one fixed-point signal processor. Several companies have tested layers I and II of the CD. They report that 2 DSPs were used for encoding (2 DSPs 32C) and 1 DSP for decoding (1 Motorola DSP 56000). Another company reports similar findings for layer II (2 DSPs 32C for encoding and 1 DSP 32C for decoding). It is difficult to determine the MIPS requirements because the specs are very terse. One MPEG audio expert suggested that the relative complexity of layers I, II, and III is on the order of 1.0:1.25:1.7, respectively. In fact, this could be higher for layers II and III since 1024-sample FFTs are required for psycho-acoustic modelling as opposed to 512 samples for layer I. Little information is available about the DSP memory requirements; however, since the algorithm has better time resolution (layers I and II), it more than likely requires less DSP memory.

Applications

MPEG audio applications include: multimedia audio/visual software and hardware manufacturing for combined moving picture and audio synchronization; production environments for tapeless studios; storage media for consumer and professional electronics; and various music formats. Two of the most popular applications are Sony Corporation's MiniDisc and MP3. The MiniDisc, a compact disc technology, records and plays back over 1 hr. of digital audio on a 2.5-inch optical disc. The MP3 music format is based on layer III of the algorithm.

Conclusion

The MPEG algorithm achieves high quality audio compression with good-to-excellent fidelity. MPEG audio is an ISO standard. Various industries have taken advantage of this technology by implementing the audio standard (e.g., multimedia platforms, consumer electronics, music). These implementations support video synchronization capabilities and stand-alone audio applications, and continue to influence the direction of MPEG audio. (E.g., the proposed MPEG-7 audio appears to be based on layer III with an MP3 implementation.)

References

1. Zwicker, E. and Fastl, H. Psychoacoustics: Facts and Models, second edition. Springer-Verlag, 1999.