Sound and Digitization

Peter Dordal, Loyola University CS Department

Mathematics of Sound

Now we will learn about sine waves.
Sine waves
A sine wave is described by
    y = A sin(2𝜋ft)
where t is the time (eg in seconds) and f is the frequency in cycles/sec

One way to think of this is to imagine a frequency of 1000 Hz. That means we should complete one full cycle in 1/1000 sec, which means that when t=1/1000, we have 2𝜋ft = 2𝜋, and at sin(2𝜋) this is indeed the completion of one full cycle.

Sometimes we denote by ω the "angular frequency" (radians per second) 2𝜋f, so that y = A sin(ωt).

Using cos() instead of sin(), or combining cos() and sin() waves of the same frequency, can always be reduced to A sin(2𝜋ft + 𝛿). The quantity 𝛿 here represents the phase shift.
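As a quick numeric check of this reduction (a sketch; the values B and C are arbitrary choices, not from the notes): B sin(x) + C cos(x) equals A sin(x + 𝛿) with A = sqrt(B² + C²) and 𝛿 = atan2(C, B).

```python
import math

# Check that a sum of sine and cosine waves of the same frequency
# reduces to a single amplitude-and-phase-shifted sine wave.
B, C = 3.0, 4.0
A = math.hypot(B, C)       # combined amplitude
delta = math.atan2(C, B)   # phase shift

for i in range(100):
    x = i * 0.1
    combined = B * math.sin(x) + C * math.cos(x)
    shifted = A * math.sin(x + delta)
    assert abs(combined - shifted) < 1e-9
```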

Electromagnetic signals are often sine waves. They can be transmitted via guided media (eg wire or fiber) or unguided media (eg radio).
Guided transmission media can be point-to-point vs multiple-access (multipoint)

When we write A sin(2𝜋ft) we are working in the so-called time domain: the horizontal axis represents time.
A signal is periodic if, for some period P, s(t) = s(t+P).

For any periodic signal, we can express it as the sum of sine waves of periods P, P/2, P/3, P/4, ..... We also need cosine waves, or else use sin(2𝜋ft+𝛿) (the 𝛿 here represents a horizontal "phase" shift).

This is a deep fact, called Fourier Analysis; there are a number of online visualizations of it.

This means any periodic signal can be expressed in the frequency domain: as a list of amplitudes (and phases), one for each of the frequencies 1/P, 2/P, 3/P, ....
Frequency-domain concepts
Fact: physical transmission media act on sine waves in natural ways, due to things like "capacitance". This action differs at different frequencies (eg high-frequency attenuation or distortion).
Physical vibrations (such as in our ears) generally involve natural combinations of sine waves.
Discussion of benefits of Fourier analysis and the frequency-domain view: what if distortion is linear, but frequency-dependent?
A bandpass filter passes only those frequencies in a specified band (there are also lowpass and highpass filters). These filters can be digital or analog. Conceptually, to do bandpass filtering we convert the signal to the frequency domain, zero out the amplitudes of the frequencies outside the band, and convert back.


We'll look at this in more detail later, but the simplest way to digitize sound is to sample it at regular intervals. Assume we have a waveform y = s(t) and a sampling rate R (eg R = 8,000 samples/sec for voice, or R = 44,000 for CD-quality music). The sampling interval is then 1/R sec. We sample at times t = 0, 1/R, 2/R, ..., i/R, ....; we measure s(i/R), round it off to the nearest available discrete value, and put that in the array component A[i]. The array will be quite large: three minutes of music will involve 3×60×44,000 ≃ 8 million samples.
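Here is a minimal sketch of the sampling step (the names R, s, and A are illustrative, matching the text above, not any particular library):

```python
import math

R = 8000      # sampling rate, samples/sec
f = 440       # tone frequency, in Hz

def s(t):     # the "analog" waveform y = s(t)
    return math.sin(2 * math.pi * f * t)

duration = 0.010               # sample 10 ms of sound
N = int(duration * R)          # number of samples
A = [s(i / R) for i in range(N)]   # sample at t = 0, 1/R, 2/R, ...

# 10 ms at 8000 samples/sec gives 80 samples
assert len(A) == 80
```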

There are two sources of error: the sampling rate, and the fact that each individual A[i] is rounded off slightly to the nearest integer; this latter is called quantizing error.

This strategy is traditionally known as PCM (Pulse Code Modulation) encoding.

For wav files, we can specify the sampling rate; in the telephony world R=8000 is standard and for CD sound R=44000.

For telephony, we generally use 8-bit values for the A[i]'s, from -127 to +127, appropriately scaled. Suppose s(t) is the signal level, on a -1≤s(t)≤1 scale, and let t_i = i/R be the i-th sample time. If we are using linear 8-bit encoding, then A[i], the sample at t_i, is 127*s(t_i), rounded off to the nearest integer. Because of the limited range here, nonlinear encoding is actually used. Better sound quality is obtained with 16-bit encoding, where linear encoding is the norm.
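Linear 8-bit quantization, as described above, can be sketched as follows (quantize8 is a made-up helper name):

```python
import math

R = 8000
f = 440

def s(t):
    return math.sin(2 * math.pi * f * t)

def quantize8(x):
    # clamp to [-1, 1], then scale by 127 and round to the nearest integer
    x = max(-1.0, min(1.0, x))
    return round(127 * x)

# 10 ms of quantized samples
A = [quantize8(s(i / R)) for i in range(R // 100)]

assert all(-127 <= a <= 127 for a in A)
```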

One issue with sound is the "ceiling" on the signal s(t): if we require -1≤s(t)≤1 for all t, then 1.0 represents the absolute maximum volume. We have to scale down the intensity level to conform to this. Suppose the maximum peaks in a music file are 100 times the average level. Then the average level of s(t) would be 0.01. Using linear 8-bit encoding would mean that most of the A[i] were either -1, 0 or 1, and the sound quality would be close to unrecognizable. Using 16-bit linear encoding would mean that most of the A[i] were less than 300 = 0.01×30,000 in absolute value (2^15 ≃ 30,000). This provides a reasonable, if not quite ideal, range of values for limiting the quantizing error.

WAV file format

Here is a layout of a .wav file. The actual sound values are signed numbers.

                 | "RIFF" (4)  |        RIFF chunk : first 12 bytes  of the file
                 |  pkglen(4)  |
                 | "WAVE" (4)  |
                 |             |
                 | "fmt_" (4)  |        FORMAT : 24 bytes
                 |             |
                 |             |
                 |  0x10  (4)  |        for PCM
                 |             |
                 |  0x01  (2)  |        for linear quantization
                 | 1 or 2 (2)  |        Number of channels: 1 for mono, 2 for stereo
                 |             |
                 |  rate  (4)  |        sample rate, in Hertz, eg 8000  or 44,000
                 |             |
                 |             |
                 |  B/sec (4)  |        Bytes per second, = rate * (Bytes/sample)
                 |             |
                 | B/samp (2)  |        Bytes per sample (eg 2 for 16-bit mono)
                 | b/samp (2)  |        bits per individual single-channel sample
                 | "data" (4)  |
                 |  len   (4)  |        length of data, in bytes; SOMETIMES WRONG!!

My code reads *.wav files into arrays, and writes arrays back out as .wav files.
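This can also be done with Python's standard-library wave module; here is a sketch (the filename tone.wav is just an example) that writes one second of a 440 Hz tone as 16-bit mono PCM and reads the header fields back:

```python
import math, struct, wave

R = 8000
samples = [int(30000 * math.sin(2 * math.pi * 440 * i / R)) for i in range(R)]

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)      # mono
    w.setsampwidth(2)      # bytes per single-channel sample (16-bit)
    w.setframerate(R)      # sample rate, in Hz
    w.writeframes(struct.pack("<%dh" % len(samples), *samples))

with wave.open("tone.wav", "rb") as w:
    assert w.getnchannels() == 1 and w.getsampwidth() == 2
    assert w.getframerate() == 8000 and w.getnframes() == 8000
```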

Some 16-bit digitized waves:
      440 Hz A: text wav
    ~440 Hz A: text wav (the text file rounds the samples-per-cycle to 18; 8000/440 is really 18.18)
      294 Hz D: text wav
    ~294 Hz D: text wav (the text file rounds the samples-per-cycle to 27, from 27.2)

Demo Program set 1: Oscillator and Mixer

Let's do this with SoX:

To create a sine wave, we can do this:

    sox -n A4.wav synth 3 sine 440
    sox -n sqr.wav synth 3 square 440
    sox -n saw.wav synth 3 saw 440

We can look at these with audacity.

Now let's make a chord: D4 = 293.66 Hz

    sox -n D4.wav synth 3 sine 293.66

Once we have A4.wav and D4.wav, let's mix them:

    sox --combine merge A4.wav D4.wav AD.wav; play AD.wav

How about CEGC? Frequencies: 262, 330, 392, 523.
    sox -n C.wav synth 3 sine 262
    sox -n E.wav synth 3 sine 330
    sox -n G.wav synth 3 sine 392
    sox -n C5.wav synth 3 sine 523
    sox --combine merge C.wav E.wav G.wav C5.wav CEGC.wav

The end result is a 4-channel sound file. If we want only one channel, we can use

    sox CEGC.wav CEGC1.wav remix -m 1-4

This sounds just a bit louder.

How about beat frequencies:

    sox -n A441.wav synth 3 sine 441

    sox --combine merge A4.wav A441.wav beat.wav remix 1-2
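The beats can be seen directly from the sum-to-product identity (a sketch, not tied to the sox files above): sin(2𝜋·440t) + sin(2𝜋·441t) = 2 cos(2𝜋·0.5t) sin(2𝜋·440.5t), a 440.5 Hz tone whose amplitude envelope pulses once per second.

```python
import math

# Verify numerically that mixing 440 Hz and 441 Hz tones produces a
# 440.5 Hz tone with a slowly pulsing (0.5 Hz cosine) envelope.
for i in range(1000):
    t = i / 8000.0
    mixed = math.sin(2*math.pi*440*t) + math.sin(2*math.pi*441*t)
    envelope = 2 * math.cos(2*math.pi*0.5*t) * math.sin(2*math.pi*440.5*t)
    assert abs(mixed - envelope) < 1e-9
```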

We can get information about sound files by using the soxi command or by adding stat to the end of a sox command.

Audacity demo

The audacity package lets you view waveforms (and also do some sound editing).


I've created a simple spreadsheet, fourier.xls, that can be used to create a graph of the result of the first 10 terms of any Fourier series.

Time domain and Frequency domain

You can take any signal, periodic or not, and apply the Fourier transform to get a frequency-domain representation. This can be messy, but is not particularly deep. Or helpful: having a signal composed of complex sine waves of continuously varying frequency can be intractable.

Most voice signals are not periodic at larger scales: people say different words.

What is useful, though, is the idea that over small-but-not-too-small intervals, say 10ms, voice does indeed look like a set of periodic waves. At this point, the frequency-domain view becomes very useful, because it is just a set of discrete frequencies.

In other words, "sound is made up of vibrations".

G.729 voice encoding (compresses 160-byte 10ms sample into 20 bytes): Step 1 is, for each sample, to try to match the "best fit" pure frequency.

Maybe even more importantly, most digital data signals similarly look "almost periodic", often over much larger intervals than voice. So the same principle applies: translation to the frequency domain is actually useful.

When we start looking at the details of audio highpass, lowpass, and bandpass filters (eg for audio), we'll see that they involve averaging over a time interval. If that time interval is too large, the signal may no longer look very periodic and we may get misleading or distorted results. If the time interval is too small, the filter may not be "sharp" enough.

Digital Signal Processing

(was: Transformations on waveforms)

How can we treat sound as data?

For many purposes, we just encode the sound and leave it at that until it reaches the receiving end. But sometimes we want to "tweak" the sound; the classic example is to run it through a highpass, lowpass, or bandpass filter, to eliminate frequencies that are too low, too high, or not in the middle. Another example is to try to filter out some of the noise (this works best if the noise is not purely random).

How do we get Fourier series?

We will assume that the waveform data has been digitized and is available as an array of values A[n].

One way to think of the Fourier series is as a set of averaging operations: the sequence A[n] is "averaged against" a sine (or cosine) wave sin(2𝜋Fn), where F is the "frequency" in units of the sampling rate. (More specifically, the frequency in Hz is f = FR, where R is the sampling rate; alternatively, F = f/R. For a 400-Hz tone sampled at R = 8000, F = 1/20, meaning that 20 samples make up one full sine wave.)

The idea behind "averaging against" is that we're taking a weighted average of the A[n], eg 1/4 A[1] + 2/3 A[2] + 1/12 A[3]. However, with sin(n) involved, some of the weights can be negative.

Averaging two waveforms, held in two arrays A and B, amounts to computing the following sum:
    double sum = 0.0;
    for (int i = 0; i < whatever; i++) sum += A[i]*B[i];

"whatever" might be A.length, or min(A.length, B.length) (using max would run off the end of the shorter array). We might also "shift" one of the arrays, so the sum is

    sum += A[i]*B[shift+i]

In some cases it is natural to reverse the direction in B:

    sum += A[i]*B[shift−i]

If we're averaging against a sine wave sin(2𝜋fn), we might have something like this:

    for (n=0; n<len; n++) sum += A[n]*sin(2*Pi*f*n);
    avg = sum / len;

If you're familiar with calculus, this more-or-less corresponds to ∫ A(x) sin(2𝜋fx) dx

Digitized sound makes these sums easy to do, though; no calculus is needed. Sometimes it's unclear what you divide by (eg len, above; sometimes it's actually the time in seconds), but if we divide by the wrong thing we're usually just off by a constant.

Example 1: two sine waves of different frequencies average to zero; phase does not matter.

Two of the same frequency (and phase) average to a nonzero constant: the average of sin²(x) over a full period is 1/2.

Assume the signal g(x) is periodic, is composed of sines, and has base angular frequency 1 (that is, period 2𝜋). We know g(x) must have the following form; we just need to find A1, A2, etc:

    g(x) = A1 sin(x) + A2 sin(2x) + A3 sin(3x) + ...

To do this, we start by averaging both sides with sin(x):

Average(g(x), sin(x)) = A1 * average(sin(x), sin(x)) + A2 * average(sin(2x), sin(x)) + A3 * average(sin(3x), sin(x)) + ...
                            = A1 * 1/2

From this we can readily calculate A1, and by averaging with sin(2x) get A2, etc.
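The recovery step above can be checked numerically; this is a sketch with made-up amplitudes (the names true_coeffs, g, recover are just for illustration). Averaging g(x) against sin(kx) over one period gives A_k/2, so dividing by 1/2 recovers A_k:

```python
import math

N = 20000                         # sample points over one period
true_coeffs = [0.7, -0.3, 1.2]    # A1, A2, A3 (arbitrary test values)

def g(x):
    return sum(a * math.sin((k + 1) * x) for k, a in enumerate(true_coeffs))

def recover(k):
    # average g(x)*sin(kx) over one full period, then divide by 1/2
    total = sum(g(2*math.pi*i/N) * math.sin(k * 2*math.pi*i/N) for i in range(N))
    return (total / N) / 0.5

for k in (1, 2, 3):
    assert abs(recover(k) - true_coeffs[k-1]) < 1e-6
```

The cross terms vanish because sines of different frequencies average to zero, which is exactly the orthogonality fact above.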

In the above averages, we calculate one fixed sum (eg for (i=0; i<max; i++) sum += g(i)*sin(i)). There is another kind of averaging, where we repeatedly "shift" one of the two functions involved, to get a new sequence / waveform / function; this is called a moving average. Things don't necessarily have to be periodic here.

Here's a simple example, defining a new sequence B[n] from existing sequence A[n]. If A[n] is defined for n from 0 to 1000, inclusive, then B[n] will be defined for n from 1 to 999.

    B[n] = (A[n-1] + A[n] + A[n+1])/3

We can think of this as a weighted average of A[] with (1/3, 1/3, 1/3). The point here is that we get a new sequence, B[n], instead of a single value like A1 above. However, the actual calculation is the same: find the sum of a bunch of A[n]*f(n) products.
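The three-point moving average above can be sketched in a couple of lines (the arrays A and B here are small made-up examples):

```python
# B[n] is the average of A[n-1], A[n], A[n+1];
# B is defined for 1 <= n <= len(A)-2, so it is two entries shorter.
A = [0, 3, 6, 3, 0, -3, -6, -3, 0]

B = [(A[n-1] + A[n] + A[n+1]) / 3 for n in range(1, len(A) - 1)]

assert len(B) == len(A) - 2
assert B[0] == (0 + 3 + 6) / 3
```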

Any linear transformation of the input signal can be expressed this way, though some may need lots and lots of coefficients and finding the coefficients might in some cases be difficult.

However, there are some shortcuts to the find-the-coefficients problem. If we know what we want in the frequency domain, we can apply some of the math above to find the corresponding coefficients for a corresponding moving average in the time domain.  For example, suppose we want to create a "lowpass filter", to cut off the higher frequency signals. This is an easy point-by-point thing to express, in the frequency domain:
(1, 1, 1, ...., 1, 0, 0, 0, ...)

Apply this to a sequence of frequency amplitudes (a1, a2, a3, a4, a5, a6, a7, a8) (representing respective amplitudes, say, of sin(x), sin(2x), ..., sin(8x)). Assume the cutoff is at position 4, and multiply corresponding values. We get
    (1*a1, 1*a2, 1*a3, 1*a4, 0*a5, 0*a6, 0*a7, 0*a8)
    = (a1, a2, a3, a4, 0, 0, 0, 0)

Point-by-point multiplication in the frequency domain corresponds to a specific moving average in the time domain, and we can calculate that in a relatively straightforward way (although it does involve some calculus).

In this particular case of a frequency-domain cutoff, above, it turns out that the time-domain moving average we need is the "sinc" function (with a "c", pronounced "sink"):

sinc(x) = sin(x)/x    (sinc(0) = 1)
Moving averages of this sort are sometimes called convolution. The function you are averaging against (eg sinc(x)) is sometimes called a "kernel", not to be confused with the OS use of that term. The new signal after filtering is array B[], where

    B[n] = sum( A[n+t]*sinc(2𝜋f*t) ) / 2𝜋f,    summed over -M ≤ t ≤ M

This is defined for M ≤ n ≤ A.length − 1 − M.
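Here is a sketch of sinc-kernel lowpass filtering (the normalization here is chosen so a constant signal passes through unchanged; as noted earlier, other normalizations differ only by a constant factor). F is the cutoff as a fraction of the sampling rate; the function names are illustrative:

```python
import math

def sinc(x):
    return 1.0 if x == 0 else math.sin(x) / x

def lowpass(A, F, M):
    # convolve A against a truncated sinc kernel (2M+1 taps)
    kernel = [2 * F * sinc(2 * math.pi * F * t) for t in range(-M, M + 1)]
    out = []
    for n in range(M, len(A) - M):
        out.append(sum(A[n + t] * kernel[t + M] for t in range(-M, M + 1)))
    return out

R, M = 8000, 200
low  = [math.sin(2*math.pi*200*i/R)  for i in range(4000)]   # 200 Hz tone
high = [math.sin(2*math.pi*3000*i/R) for i in range(4000)]   # 3000 Hz tone

# cutoff at 1000 Hz -> F = 1000/8000 = 0.125
passed  = lowpass(low,  0.125, M)
blocked = lowpass(high, 0.125, M)

assert max(abs(x) for x in passed)  > 0.9    # 200 Hz survives
assert max(abs(x) for x in blocked) < 0.1    # 3000 Hz is attenuated
```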

There are other common kernels as well. Almost all modification of sound files uses this kind of "averaging against a kernel". (Echo cancellation is a harder problem that can't easily be done this way.)

Associativity gives us a bandpass kernel: applying a lowpass filter L and then a highpass filter H is equivalent to convolving L and H together (L*H) and then applying that: (Signal*L)*H = Signal*(L*H)

Note that sinc() extends to infinity in both directions (it is nonzero for arbitrarily large |x|). In practical terms, we have to truncate, generally at a point where sinc(x)=0 (that's what M was, above), and we should also taper off as we approach ±M. The other practical problem is that convolution with sinc(x) involves some knowledge of the "future" sound, that is, to get B[n] we need to know A up to A[n+M]. If n represents the time, then we need to know values of A[i] M units into the future. One solution is to just accept that B is delayed by M units.

Demo: sincfilter.ods: This is a spreadsheet where we average sinc(f1x) (the lowpass filter with cutoff f1) against cos(f2x). We expect that if f2<f1 then there is little change, but if f2>f1 then the result is close to zero.

In the spreadsheet, we average just for x>0. Symmetric averaging about the origin with sin(x) is always exactly 0; cos(x) is the "worst case". Any sine wave in the real world has a phase, referring to its relative starting position d: Asin(fx + d). This can also be expressed as Bsin(fx) + C cos(fx) (it's a trig identities thing) and, as noted, all we have to worry about is the cos(fx) part.

To use the spreadsheet, you supply a filter frequency (f1 above) (it should be relatively small, compared to 1000), and a second, "input", frequency (f2). If the second frequency is higher than the filter frequency, the reduction will be substantial; otherwise, the reduction factor will be close to 1 (no change). The waveforms tabulated are of the form sin(2𝜋fx), so the wavelength is 1/f. Tabulation goes from 0 to 1 by steps of .001, so if f is 20 a wavelength of 1/20 is 50 "ticks" out of 1000, and there are 20 full cycles in the sum. To compare performance, we also calculate the attenuation after just 500 ticks. The ideal sinc filter would extend to infinity. Attenuation is normalized, so 100% means no attenuation. We use a cosine wave for the input frequency, as that is the "worst case"; strictly speaking, the attenuation really should be the average attenuation as the phase of the second, input signal varies from that of sin(x) to that of cos(x).

Note that frequencies can be fractional, eg f1 = 20 and f2 = 20.5.

Demo: play cantdo.wav subject to various highpass and lowpass filters.

Demo: look at the spectrograms for some .wav files

Some terminology

spectrum: the range of frequencies present, from fmin to fmax
bandwidth: the width of that range, = fmax - fmin
effective bandwidth: the frequencies you actually need
DC component: the zero-frequency (constant) term

major point: in Frequency-division multiplexing, each channel is assigned a very specific "band-width"

Note: standard electronic circuits can extract "bands", hence the practical importance of this notion

Note that this is a different use of "band-width" from "data rate"

There is some relation, however: the narrower the analog band-width, the less room there is for multiple frequencies and modulation, and the less data capacity.

discussion of benefits of Fourier analysis:
    what if distortion is linear, but frequency-dependent?
    what if we want to understand bandpass filtering?

Example on data rate ("bandwidth" in the digital sense) versus bandwidth

Suppose we want to transmit a square wave, alternating from -1 to +1. We have a band-width of 4 MHz.

Case 1: we use sin(2𝜋ft) + (1/3)sin(2𝜋(3f)t) + (1/5)sin(2𝜋(5f)t) (that is, three terms of the Fourier series), where f = 1 MHz.
Look at this with fourier.xls: does it look squarish?
The frequencies are 1 MHz, 3 MHz and 5 MHz. The band-width is 5-1 = 4 MHz.
The data rate is 2 Mbps (sort of; we are sending 1 M 1-bits and 1 M 0-bits per second)
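The three-term sum from Case 1 can also be checked numerically (a sketch; partial_square is just an illustrative name, and f is scaled to 1 cycle per unit time). Away from the jumps, the partial sum already has the same sign as the ideal square wave:

```python
import math

def partial_square(t, f=1.0):
    # three terms of the Fourier series of a square wave
    return (math.sin(2*math.pi*f*t)
            + (1/3) * math.sin(2*math.pi*3*f*t)
            + (1/5) * math.sin(2*math.pi*5*f*t))

for i in range(1, 1000):
    t = i / 1000.0
    # skip the neighborhoods of the jumps at t = 0, 0.5, 1.0
    if 0.05 < t < 0.45 or 0.55 < t < 0.95:
        square = 1.0 if t < 0.5 else -1.0
        assert partial_square(t) * square > 0
```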

Note dependence on notion of what waveform is "good enough"

Case 2: If we double all the frequencies to 2 MHz, 6 MHz, 10 MHz, we get band-width 8 MHz, data rate 4 Mbps. This is the same as above with f = 2 MHz.

Case 3: We decide in the second case that we do not need the 10 MHz component, due to a more accurate receiver. The base frequency is f = 2 MHz, the frequencies are 2 MHz and 6 MHz, the band-width is now 6 MHz - 2 MHz = 4 MHz, and the data rate is still 4 Mbps.

Look at this with fourier.xls.

Note that we're really not carrying data in a meaningful sense; we can't send an arbitrary sequence of 0's and 1's this way. However, that's done mostly to simplify things.

Note also the implicit dependence on bandwidth of the fact that we're decomposing into sinusoidal waves.

Voice transmission: frequency band-width ~ 3-4kHz  (eg 300 Hz to 3300 Hz)
64Kbps encoding (8 bits sampled 8000 times a second)

Modems have the reverse job: given the band-width of ~3 kHz, they have to send at 56 kbps!

Note: we're looking at DISCRETE frequency spectrum (periodic signals). CONTINUOUS frequency spectrum also makes mathematical sense, but is kind of technical

Note that frequency-domain notion depends on fundamental theorem of Fourier analysis that every periodic function can be expressed as sum of sines & cosines (all with period an integral multiple of original)

Bandwidth of voice: <=4 kHz

quite different meaning from digitization: 64kbps

But if we wanted to encode the digitized voice back into a voice bandwidth, we'd have to encode 16 bits per cycle (Hertz), which is a little tricky.

Amplitude modulation & band-width
Note that AM modulation (ALL modulation, in fact) requires a "band-width"; ie range of frequencies. This will be very important for cellular.

AM: amplitude = [1 + data(t)]*sin(2𝜋ft)
f is the high "carrier" frequency, eg 100,000 Hz

If data(t) = sin(2𝜋gt), g a much lower frequency (eg 1000)
Then sin(2𝜋ft)*sin(2𝜋gt) = 0.5 cos(2𝜋(f-g)t) - 0.5 cos(2𝜋(f+g)t)

band of frequencies: (f-g) to (f+g)
band-width: 2g
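The product-to-sum identity above is easy to verify numerically (a sketch, using the example frequencies f = 100,000 and g = 1000):

```python
import math

f, g = 100000.0, 1000.0   # carrier and data frequencies

# sin(2*pi*f*t)*sin(2*pi*g*t) = 0.5*cos(2*pi*(f-g)*t) - 0.5*cos(2*pi*(f+g)*t),
# so the product signal occupies the band from f-g to f+g: width 2g.
for i in range(1000):
    t = i * 1e-6
    product = math.sin(2*math.pi*f*t) * math.sin(2*math.pi*g*t)
    sidebands = 0.5*math.cos(2*math.pi*(f-g)*t) - 0.5*math.cos(2*math.pi*(f+g)*t)
    assert abs(product - sidebands) < 1e-6
```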