—> To Continue with Chapter 5

Introduction to Spectral Manipulation

There are two different approaches to manipulating the frequency content of sounds: filtering, and a combination of spectral analysis and resynthesis. Filtering techniques, at least classically (before the FFT came into common use among computer musicians), attempted to achieve spectral changes by designing time domain operations. More recently, a great deal of work in filter design has taken place directly in the spectral domain.

Spectral techniques allow us to represent and manipulate signals directly in the frequency domain, often a much more intuitive and user-friendly way to work with sound. Fourier analysis (especially the FFT) is the key to many current spectral manipulation techniques.

The Phase Vocoder
Soundfiles: unmodified speech; speech made twice as long with a phase vocoder; speech made half as long with a phase vocoder; speech transposed up an octave; speech transposed down an octave.
The phase vocoder. By analysis/resynthesis of sounds using their Fourier representation, we can change the pitch of a sound without changing its length, and vice versa. These operations are called pitch shifting and time stretching, respectively.
Perhaps the most commonly used implementation of Fourier analysis in computer music is a technique called the phase vocoder. What is called the phase vocoder actually comprises a number of techniques for taking a time domain signal, representing it as a series of amplitudes, phases, and frequencies, and providing a way to manipulate this information and return it to the time domain. Remember, Fourier analysis is the process of turning the list of samples of our music function into a list of Fourier coefficients. These Fourier coefficients are complex numbers, so each has an amplitude and a phase, and each corresponds to a frequency.
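As a minimal sketch of what those coefficients look like, the following Python computes a plain discrete Fourier transform (the slow textbook form, not the FFT) of a short test signal and reads off the amplitude, phase, and frequency of each coefficient. The function name `dft`, the 8 kHz sampling rate, and the 1 kHz test tone are illustrative assumptions, not part of any particular phase vocoder.

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: list of samples -> complex coefficients."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

sample_rate = 8000   # assumed sampling rate in Hz
n = 64               # number of samples analyzed
tone_hz = 1000       # assumed test tone; it falls exactly on one bin

samples = [math.cos(2 * math.pi * tone_hz * t / sample_rate) for t in range(n)]
coeffs = dft(samples)

# For a real signal only bins 0..n/2 are needed; the rest mirror them.
bin_hz = sample_rate / n   # frequency spacing between neighboring coefficients
half = coeffs[: n // 2 + 1]
# Scaling by 2/n recovers a sinusoid's amplitude
# (the 0 Hz and Nyquist bins would need 1/n instead).
amplitudes = [abs(c) * 2 / n for c in half]
phases = [cmath.phase(c) for c in half]

peak = max(range(len(amplitudes)), key=lambda k: amplitudes[k])
print(f"strongest bin: {peak} ({peak * bin_hz:.0f} Hz), "
      f"amplitude {amplitudes[peak]:.3f}")
```

Each coefficient's magnitude gives an amplitude, its angle gives a phase, and its position in the list gives a frequency; these are the three quantities the phase vocoder manipulates.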

Two of the most important ways that musicians have used this technique are to change the length of a sound without changing its pitch and, conversely, to change its pitch without affecting its length.

Why should this even be difficult? Well, consider trying this in the time domain: play back, say, a 33 1/3 RPM record at 45 RPM. What happens? You play the record faster, the needle moves through the grooves at a higher rate, and the sound is higher pitched (often called the "chipmunk" effect, possibly after the famous 1960s novelty records featuring Alvin and his friends; did you think they were real chipmunks?). The sound is also shorter: in this case, pitch is directly tied to playback speed, and therefore to length, because both are controlled by the same mechanism. A creative and virtuosic use of this technique is scratching, as practiced by hip-hop, rap, and dance DJs.
Soundfile .x

An example of time domain pitch shifting/speed changing. In this case, pitch and time transformations are linked: the faster the sound is played, the higher the pitch becomes, as heard in the soundfile on the left. In the soundfile on the right, the opposite effect is heard: the slower file sounds lower in pitch.

The Pitch/Speed Relationship in the Digital World


Now think of doing this with a digital signal. To play it back faster, you might raise the sampling rate, reading through the samples for playback more quickly. Remember that we sometimes use "sampling rate" to mean the rate at which we stored (sampled) the sounds, but it can also refer to the internal clock that the computer uses when playing back a sound or performing other calculations on it. We can vary that rate, for example by playing back a sound sampled at 22.05 kHz at 44.1 kHz. With more samples read per second, the sound gets shorter. And since every frequency in the sound scales with the playback rate, the sound also changes pitch: doubling the rate halves the length and raises the pitch by an octave.
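The arithmetic behind that example can be sketched in a few lines of Python. The function name `varispeed` and the 440 Hz test tone are assumptions made for illustration; the sketch only computes what a listener would hear and does no audio processing.

```python
import math

def varispeed(num_samples, recorded_rate, playback_rate, frequency_hz):
    """Return (length in seconds, heard frequency in Hz) when samples recorded
    at recorded_rate are played back at playback_rate with no other processing."""
    ratio = playback_rate / recorded_rate
    length_seconds = num_samples / playback_rate
    heard_frequency = frequency_hz * ratio
    return length_seconds, heard_frequency

# One second of a 440 Hz tone sampled at 22.05 kHz ...
num_samples = 22050
original = varispeed(num_samples, 22050, 22050, 440.0)
# ... and the same samples played back at 44.1 kHz.
doubled = varispeed(num_samples, 22050, 44100, 440.0)

# Pitch change expressed in octaves (each octave doubles the frequency).
octaves = math.log2(doubled[1] / original[1])

print(original)  # (1.0, 440.0)
print(doubled)   # (0.5, 880.0)
print(octaves)   # 1.0
```

Length and pitch both depend on the same ratio, which is why neither can be changed on its own by varying the playback rate.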

Even with the basic pitch/speed problem, this kind of sonic manipulation has always attracted creative experiment. What if we could realize an idea that composer Steve Reich proposed in 1967, and that was long thought to be a kind of impossible dream for electronic music: slowing down a sound without changing its pitch (and vice versa)?
Soundfile .x

A SoundHack varispeed of some standard speech. Note how the speed of the speech changes over time.

Varispeed is a general term for "fooling around" with the sampling rate of a soundfile.

Figure .x Steve Reich's Slow Motion Sound, from his book Writings About Music. In 1967, Reich considered this to be a "conceptual" score, since the technology wasn't commonly available to do it (except for a very few tape recorders with a "rotating playback head").

Here are Steve Reich's notes for Slow Motion Sound (1967).

    "Slow Motion Sound (1967) has remained a concept on paper because it was technologically impossible to realize. The basic idea was to take a tape loop, probably of speech, and ever so gradually slow it down to enormous length without lowering its pitch. In effect it would have been like the true synchronous sound track to a film loop gradually presented in slower and slower motion.

    The roots of this idea date from 1963 when I first became interested in experimental films, and began looking at film as an analog to tape. Extreme slow motion seemed particularly interesting since it allowed one to see minute details that were normally impossible to observe. The real moving image was left intact with only its tempo slowed down.

    Experiments with rotating head tape recorders, digital analysis and synthesis of speech, and vocoders all proved unable to produce the gradual yet enormous elongation, to factors of 64 or more times original length, together with high fidelity speech reproduction, which were both necessary for musical results.

    The possibility of a live performer trying to speak incredibly slowly did not interest me since it would be impossible, in that way, to produce the same results as normal speech, recorded, and then slowed down.

    I was able to experiment with a tape loop of a little African girl learning English by rote from an African lady schoolteacher in Ghana. The teacher said, "My shoes are new", and the little girl repeated, "My shoes are new." The musical interest lay in the speech melody which was very clearly [the pitches] e', c#', a, b [' means the higher octave of the pitch] both when the teacher spoke and when the little girl responded. Since African languages are generally tonal, and learning the correct speech melody is as necessary for understanding as is the correct word, this was simply carried over into the teaching of English. The loop was slowed down on a vocoder at the Massachusetts Institute of Technology to approximately 10 times its original length, in gradual steps, without lowering its pitch. Though the quality of speech reproduction on the vocoder was extremely poor it was still possible to hear how "My", instead of merely being a simple pitch, is in reality a complex glissando slowly rising from about c# up to e', then dissolving into the noise band of "sh", to emerge gradually into the c#' of "oe", back into the noise of "s", and so on.

    Slowing down the motion in musical terms is augmentation; the lengthening of duration of notes previously played in shorter note values. Though Slow Motion Sound was never completed as a tape piece, the idea of augmentation finally realized itself in my music as Four Organs of 1970, and then again in Music for Mallet Instruments, Voices and Organ of 1973.

    Though by now I have lost my taste for working with complex technology I believe a genuinely interesting tape piece could still be made from a loop of speech, gradually slowed down further and further, while its pitch and timbre remain constant."

Using The Phase Vocoder

Using the phase vocoder, we can realize Steve Reich’s piece, and a great many others. The phase vocoder allows us independent control over the time and the pitch of a sound.

How does this work? Actually, in two different ways.

To change the speed, or length, of a sound without changing its pitch, we need to know something about what is called windowing. Remember that when doing an FFT on a sound, we use what are called frames (time-delimited segments of sound). Over each frame we impose a window: an amplitude envelope that allows us to crossfade one frame into another, avoiding problems that occur at the boundaries between frames.

What are these problems? Well, remember that when we take an FFT of some portion of the sound, that FFT, by definition, assumes that we're analyzing a periodic, infinitely repeating signal. Otherwise, it wouldn't be Fourier analyzable. But if we just chop up the sound into FFT frames, the points at which we do the chopping will be hard-edged, and we'll in effect be assuming that our periodic signal has nasty edges on both ends (which will typically show up as strong high frequencies). So to get around this, we attenuate the beginning and end of each frame with a window, smoothing out the assumed periodic signal. Typically, these windows overlap by a certain amount (1/8, 1/4, or 1/2 of a frame), creating even smoother transitions between one FFT frame and the next.

Figure .x Why do we window FFT frames? The image on the left shows the waveform that our FFT would analyze without windowing—notice the sharp edges where the frame begins and ends. The image in the middle is our window. The image on the right shows the windowed waveform. By imposing a smoothing window on the time domain signal, and doing an FFT of the windowed signal, we de-emphasize the high frequency artifacts created by these sharp vertical drops at the beginning and end of the frame.


Thanks to: Jarno Seppänen <Jarno.Seppanen@nokia.com> Nokia Research Center, Tampere, Finland for these images

Figure .x After we window a signal for the FFT, we overlap those windowed signals so that the original signal is reconstructed without the sharp edges.
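The reconstruction described above can be checked numerically. The sketch below assumes a Hann window and a 1/2 overlap, one common choice among many, and uses a deliberately tiny frame size so the numbers are easy to inspect; the helper name `hann` is hypothetical. It overlap-adds shifted copies of the window and shows that, away from the very start and end, they sum to a constant, so the windowed frames add back up to the original signal.

```python
import math

def hann(n):
    """Periodic Hann window of length n (rises from 0, peaks at 1, falls back)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]

frame_size = 8            # tiny frame, for illustration only
hop = frame_size // 2     # 1/2 overlap: each frame starts half a frame later
window = hann(frame_size)

# Overlap-add several copies of the window, each shifted by one hop.
num_frames = 5
total = [0.0] * (hop * (num_frames - 1) + frame_size)
for f in range(num_frames):
    for i, w in enumerate(window):
        total[f * hop + i] += w

# Skip the first and last half-frame, where only one window contributes.
interior = total[hop:-hop]
print([round(x, 6) for x in interior])
```

Every value in `interior` comes out as 1.0: wherever one window is fading out, the next is fading in by exactly the complementary amount.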

The Length of Overlap

By changing the amount of overlap when we resynthesize the signal, we can change the speed of the sound without affecting its frequency content. The FFT information in each frame remains the same; the frames are simply spaced further apart (to lengthen the sound) or closer together (to shorten it) than they were in the analysis. That’s how the phase vocoder typically changes the length of a sound.
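Here is a minimal sketch of that idea in Python with NumPy, assuming a Hann window, a frame size of 1024 samples, and a synthesis hop of 256 samples. It follows one common textbook formulation of the phase vocoder and is not necessarily how any particular program implements it. The name `time_stretch` and all parameter values are illustrative. Frames are read from the input at one spacing and written to the output at another, and the phase of each bin is adjusted so that successive output frames line up.

```python
import numpy as np

def time_stretch(signal, stretch, frame_size=1024, synthesis_hop=256):
    """Phase vocoder time stretch: stretch > 1 lengthens, stretch < 1 shortens.
    Assumes the analysis hop (synthesis_hop / stretch) is at least one sample."""
    analysis_hop = synthesis_hop / stretch
    window = np.hanning(frame_size + 1)[:-1]          # periodic Hann window
    # Center frequency of each bin, in radians per sample.
    bin_freqs = 2 * np.pi * np.arange(frame_size // 2 + 1) / frame_size

    num_frames = int((len(signal) - frame_size) / analysis_hop) + 1
    output = np.zeros((num_frames - 1) * synthesis_hop + frame_size)
    norm = np.zeros_like(output)

    prev_phase = synth_phase = None
    prev_start = 0
    for f in range(num_frames):
        start = int(round(f * analysis_hop))
        spectrum = np.fft.rfft(signal[start:start + frame_size] * window)
        magnitude, phase = np.abs(spectrum), np.angle(spectrum)

        if f == 0:
            synth_phase = phase.copy()
        else:
            # How far each bin's phase moved beyond what its center frequency
            # predicts, wrapped to [-pi, pi), gives the bin's actual frequency.
            hop_in = start - prev_start
            deviation = phase - prev_phase - bin_freqs * hop_in
            deviation = (deviation + np.pi) % (2 * np.pi) - np.pi
            true_freq = bin_freqs + deviation / hop_in
            # Advance the output phase at that frequency over the synthesis hop.
            synth_phase = synth_phase + true_freq * synthesis_hop
        prev_phase, prev_start = phase, start

        frame = np.fft.irfft(magnitude * np.exp(1j * synth_phase), n=frame_size)
        pos = f * synthesis_hop
        output[pos:pos + frame_size] += frame * window
        norm[pos:pos + frame_size] += window ** 2

    # Divide out the summed squared windows to restore the original level.
    return output / np.maximum(norm, 1e-8)

# Usage: stretch one second of a 440 Hz tone to roughly twice its length.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)
stretched = time_stretch(tone, stretch=2.0)
print(len(tone), len(stretched))
```

The output is slightly shorter than exactly twice the input because this sketch processes only whole frames and does not pad the ends of the signal.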

What about changing the pitch? Well, it’s easy to see that with an FFT we get a set of amplitudes that correspond to a given set of frequencies. And it’s clear that if, for example, we have very strong amplitudes at 100 Hz, 200 Hz, 300 Hz, 400 Hz, and so on, we will perceive a strong pitch at 100 Hz. What if we just take the amplitudes at all frequencies and move them "up" (or down) to frequencies twice as high (or half as high)? What we’ve done is recreate the same frequency/amplitude relationships starting at a higher (or lower) frequency: changing the perceived pitch without changing the length.

Figure .x The figure above shows two columns of FFT bins. These bins evenly divide the frequency range from 0 Hz up to the Nyquist frequency. In other words, if we were sampling at 10 kHz, and we had 100 FFT bins (both these numbers are rather silly, but they're arithmetically simple), our Nyquist frequency would be 5 kHz, and the bin width would be 50 Hz.

Each of these frequency bins has its own amplitude, which is the strength or energy of the spectrum at that frequency (or more precisely, the average energy in that frequency range). To implement a pitch shift in a phase vocoder, the amplitudes in the left column are shifted up to higher frequencies in the right column. In the figure, this operation moves one spectral component from its bin in the left column to a higher-frequency bin in the right column.
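The bin arithmetic and the shifting operation can be sketched as follows, using the same deliberately simple numbers (10 kHz sampling rate, 100 bins) and the harmonic spectrum from the earlier example. The function names `bin_width_hz` and `shift_bins` and the amplitude values are hypothetical. The sketch moves amplitudes only; a working phase vocoder would also have to adjust the phase of each bin.

```python
def bin_width_hz(sample_rate, num_bins):
    """Width of each bin when num_bins evenly divide 0 Hz up to Nyquist."""
    nyquist = sample_rate / 2
    return nyquist / num_bins

def shift_bins(amplitudes, ratio):
    """Move the amplitude in bin k to bin round(k * ratio).
    Amplitudes pushed past the top bin are discarded."""
    shifted = [0.0] * len(amplitudes)
    for k, amp in enumerate(amplitudes):
        target = round(k * ratio)
        if target < len(shifted):
            shifted[target] += amp
    return shifted

width = bin_width_hz(10_000, 100)   # 50.0 Hz per bin

# A harmonic spectrum with energy at 100, 200, 300, and 400 Hz.
amplitudes = [0.0] * 100
for freq_hz, amp in [(100, 1.0), (200, 0.5), (300, 0.33), (400, 0.25)]:
    amplitudes[int(freq_hz / width)] = amp

up_an_octave = shift_bins(amplitudes, 2.0)
peaks = [k * width for k, a in enumerate(up_an_octave) if a > 0]
print(width)   # 50.0
print(peaks)   # [200.0, 400.0, 600.0, 800.0]
```

The shifted spectrum keeps the same pattern of amplitudes, but the harmonics now sit at multiples of 200 Hz, so the perceived pitch is an octave higher.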

Use of the Phase Vocoder

This technique actually works just fine, though for radical pitch/time deformations we get some problems (usually called "phasiness"). These techniques work better for nice, slowly changing, harmonic sounds, and for simpler pitch/time relationships (integer multiples). Still, the phase vocoder works well enough, in general, for it to be a widely used technique in both the commercial and artistic sound worlds.
Soundfile .x

Larry Polansky's one minute piece, Study: Anna, the long and the short of it.

All the sounds are created using phase vocoder pitch and time shifts of a recording of a very short cry (with introductory inhale) of the composer's daughter, age 6 months.



