Sound and Digitization
Peter Dordal, Loyola University CS Department
Mathematics of Sound
Now we will learn about sine waves.
A sine wave is described by
y = A sin(2𝜋ft)
where t is the time (eg in seconds) and f is the frequency in cycles/sec
One way to think of this is to imagine a frequency of 1000 Hz. That means we
should complete one full cycle in 1/1000 sec: when t=1/1000, we have
2𝜋ft = 2𝜋, and sin(2𝜋) = sin(0) indeed marks the completion of one full
cycle.
Sometimes we denote by ω the "angular frequency" (radians per second) 2𝜋f,
so that y = A sin(ωt).
Using cos() instead of sin(), or combining cos() and sin() waves of the same
frequency, can always be reduced to A sin(2𝜋ft + 𝛿). The quantity 𝛿 here
represents the phase shift.
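A quick numerical check of this reduction (a Python sketch; the values of B and C are arbitrary): with A = sqrt(B²+C²) and 𝛿 = atan2(C, B), we get B sin(x) + C cos(x) = A sin(x + 𝛿).

```python
import math

# Check that a sine plus a cosine of the same frequency is a single
# phase-shifted sine: B*sin(x) + C*cos(x) = A*sin(x + delta),
# where A = sqrt(B^2 + C^2) and delta = atan2(C, B).
B, C = 3.0, 4.0
A = math.hypot(B, C)          # amplitude
delta = math.atan2(C, B)      # phase shift

for i in range(100):
    x = i * 0.1
    lhs = B * math.sin(x) + C * math.cos(x)
    rhs = A * math.sin(x + delta)
    assert abs(lhs - rhs) < 1e-9
```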
Electromagnetic signals are often sine waves. They can be transmitted via
- guided media (wire/fiber)
- unguided ("wireless")
Guided transmission media can be point-to-point v multiple-access.
When we write A sin(2𝜋ft) we are working in the so-called time
domain: the horizontal axis represents time
A signal is periodic if, for some period P, s(t) = s(t+P).
- typical sine waves
- square waves, nonperiodic signals
- discrete v continuous
- periodic signals
- wavelength (v frequency)
For any periodic signal, we can express it as the sum of sine waves of
periods P, P/2, P/3, P/4, ..... We do also need cosine waves, or else use
sin(2𝜋ft+𝛿) (the 𝛿 here represents a horizontal "phase" shift).
This is a deep fact, called Fourier Analysis.
This means any periodic signal can be expressed in the
- time domain as a function s(t)
- frequency domain as the sum of the sine frequencies
- any periodic signal is a superposition of sine waves
- Actually, any signal is, but you have more control of the frequencies
with periodic signals
- construction of a square wave with sines: fourier.xls
- discrete v continuous
- deep question: why sin?
- if we use sin/cos as basis, a square wave has infinite bandwidth
- if we use square waves as basis, each individual one has zero bandwidth
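The square-wave construction above can be sketched numerically (a Python stand-in for fourier.xls): the partial sums of the standard series (4/𝜋)[sin(2𝜋ft) + (1/3)sin(2𝜋·3ft) + (1/5)sin(2𝜋·5ft) + ...] approach ±1 away from the jumps as more terms are added.

```python
import math

# Partial sums of the Fourier series for a square wave of frequency f:
# only odd harmonics appear, with amplitudes 1, 1/3, 1/5, ...
def square_approx(t, f, nterms):
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * (2 * k + 1) * f * t) / (2 * k + 1)
        for k in range(nterms))

f = 1.0
# At t = 0.25 (middle of the +1 half-cycle), more terms hug +1 more closely.
v10 = square_approx(0.25, f, 10)
v100 = square_approx(0.25, f, 100)
assert abs(v100 - 1.0) < abs(v10 - 1.0)
```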
Fact: physical transmission media have natural action on sine
waves, due to things like "capacitance". This action is different at
different frequencies (eg high-frequency attenuation or distortion).
Physical vibrations (such as in our ears) generally involve natural
combinations of sine waves.
Discussion of benefits of Fourier analysis and the frequency-domain view:
what if distortion is linear, but:
- higher-frequency sine waves transmit faster/slower
- higher-frequency sine waves get damped more
A bandpass filter passes only those frequencies in a specified band (there
are also lowpass and highpass filters). These filters can be digital or
analog. Conceptually, to do bandpass filtering we
- separate out the individual sine-wave frequencies
- throw away the out-of-band frequencies
- recombine what's left
We'll look at this in more detail later, but the simplest way to digitize
sound is to use sampling at regular intervals. Assume we
have a waveform y = s(t), and a sampling rate R (eg R = 8000 samples/sec).
We measure the waveform R times a second (R=8,000 for voice, R=44,000 for
CD-quality music). The sampling interval is then 1/R sec. Given a waveform
y=s(t), we sample at times t=0, t=1/R, t=2/R, ..., t=i/R as the sample
points. We sample s(i/R), round it off to the nearest available discrete
value, and put that in the array component A[i]. The array will be quite
large: three minutes of music will involve 3×60×44,000 ≃ 8 million samples.
There are two sources of error: the sampling rate, and the fact that each
individual A[i] is rounded off slightly to the nearest integer; this latter
is called quantizing error.
This strategy is traditionally known as PCM (Pulse Code Modulation).
For wav files, we can specify the sampling rate; in the telephony world
R=8000 is standard and for CD sound R=44000.
For telephony, we generally use 8-bit values for the A[n]'s, from +127 to
-127 appropriately scaled. Suppose s(t) is the signal level, on a -1≤s(t)≤1
scale, and let ti = i/R be the sample time. If we are using linear
8-bit encoding, then A[i], the sample at ti, is 127*s(ti),
rounded off to the nearest 8-bit integer. Because of the limited range here,
nonlinear encoding is actually used. Better sound quality
is obtained with 16-bit encoding, where linear encoding is the norm.
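A concrete Python sketch of sampling with 8-bit linear encoding (the nonlinear telephone encoding is not modeled here; the test tone and duration are arbitrary choices):

```python
import math

# 8-bit *linear* PCM sampling. Assumed scale: -1 <= s(t) <= 1 maps to -127..+127.
R = 8000          # sampling rate, samples/sec
f = 440.0         # test-tone frequency, Hz

def sample_8bit(s, R, duration):
    n = int(R * duration)
    # A[i] = s(i/R), scaled by 127 and rounded: the rounding is the
    # quantizing error; the choice of R is the sampling-rate error
    return [round(127 * s(i / R)) for i in range(n)]

A = sample_8bit(lambda t: math.sin(2 * math.pi * f * t), R, 0.01)
assert len(A) == 80                      # 8000 samples/sec * 0.01 sec
assert all(-127 <= a <= 127 for a in A)
# quantizing error is at most half a step: 0.5/127 on the -1..1 scale
err = abs(A[3] / 127 - math.sin(2 * math.pi * f * 3 / R))
assert err <= 0.5 / 127
```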
One issue with sound is the "ceiling" on the signal s(t): if we require
-1≤s(t)≤1 for all t, then 1.0 represents the absolute maximum volume. We
have to scale down the intensity level to conform to this. Suppose the
maximum peaks in a music file are 100 times the average level.
Then the average level of s(t) would be 0.01. Using linear 8-bit encoding
would mean that most of the A[i] were either -1, 0 or 1, and the sound
quality would be close to unrecognizable. Using 16-bit linear encoding would
mean that most of the A[i] were less than 300 = 0.01×30,000 in absolute
value (2^15 ≃ 30,000). This provides a reasonable, if not quite
ideal, range of values for limiting the sampling error.
WAV file format
Here is a layout of a .wav file. The actual sound values are signed
integers.

RIFF chunk: the first 12 bytes of the file
    "RIFF" (4)
    file length (4)
    "WAVE" (4)
FORMAT chunk: 24 bytes
    "fmt_" (4)
    16 (4): format-chunk length, for PCM
    1 (2): for linear encoding
    1 or 2 (2): number of channels, 1 for mono, 2 for stereo
    sample rate (4): in Hertz, eg 8000 or 44,000
    bytes per second (4): = rate * (bytes/sample)
    B/samp (2): bytes per sample (eg 2 for 16-bit mono)
    b/samp (2): bits per individual sample
DATA chunk
    "data" (4)
    len (4): length of data, in bytes; SOMETIMES WRONG!!
My file WavReader.java reads *.wav files into arrays, and writes arrays out
into .wav files.
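WavReader.java itself isn't reproduced here, but the header layout above can be checked with a short Python sketch: write a tiny PCM file with the stdlib wave module, then unpack the fields by hand. (Note the format chunk is actually tagged "fmt " with a trailing space; the table writes it "fmt_".)

```python
import io
import struct
import wave

# Generate a tiny 16-bit mono PCM file in memory, then parse its header
# according to the layout above. (Python stands in for WavReader.java.)
buf = io.BytesIO()
w = wave.open(buf, 'wb')
w.setnchannels(1)                   # mono
w.setsampwidth(2)                   # 2 bytes per sample
w.setframerate(8000)                # telephony rate
w.writeframes(b'\x00\x00' * 100)    # 100 silent samples = 200 data bytes
w.close()

hdr = buf.getvalue()
# RIFF chunk: first 12 bytes
assert hdr[0:4] == b'RIFF' and hdr[8:12] == b'WAVE'
# FORMAT chunk
assert hdr[12:16] == b'fmt '
(fmt_len,) = struct.unpack('<I', hdr[16:20])          # 16, for PCM
encoding, channels, rate, bps, bytes_per_sample, bits = \
    struct.unpack('<HHIIHH', hdr[20:36])
assert (fmt_len, encoding, channels) == (16, 1, 1)    # PCM, linear, mono
assert rate == 8000 and bits == 16
assert bps == rate * bytes_per_sample                 # bytes per second
# DATA chunk: "data", then the length field (the one that is SOMETIMES WRONG)
assert hdr[36:40] == b'data'
assert struct.unpack('<I', hdr[40:44])[0] == 200
```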
Some 16-bit digitized waves:
- 440 Hz A: text
- ~440 Hz A: text, wav (text file has the period rounded off to 18 samples;
  8000/440 is really 18.18)
- 294 Hz D: text
- ~294 Hz D: text, wav (text file has the period rounded off to 27, from
  27.2)
Demo Program set 1: Oscillator and Mixer
- WavReader library (uses objects for reading, but writeWav() is static)
- osc_skel.java, mixer_skel.java
Let's do this with SoX. To create a sine wave, we can do this:
sox -n A4.wav synth 3 sine 440
sox -n sqr.wav synth 3 square 440
sox -n saw.wav synth 3 saw 440
We can look at these with audacity.
Now let's make a chord: D4 = 293.66 Hz
Once we have A4.wav and D4.wav, let's mix them:
sox --combine merge A4.wav D4.wav AD.wav; play AD.wav
How about CEGC? Frequencies: 262, 330, 392, 523.
sox -n C.wav synth 3 sine 262
sox --combine merge C.wav E.wav G.wav C5.wav CEGC.wav
The end result is a 4-channel sound file. If we want only one channel, we
can remix:
sox CEGC.wav CEGC1.wav remix -m 1-4
This sounds just a bit louder.
How about beat frequencies:
sox -n A441.wav synth 3 sine 441
sox --combine merge A4.wav A441.wav beat.wav remix 1-2
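The beat is easy to see in the trig identity sin(a) + sin(b) = 2 cos((a−b)/2) sin((a+b)/2): mixing 440 Hz and 441 Hz gives a 440.5 Hz tone whose amplitude swells and fades once per second. A Python sketch (rather than the sox demo) checks this:

```python
import math

# 440 Hz + 441 Hz: the sum equals a 440.5 Hz tone inside a slow envelope.
def mixed(t):
    return math.sin(2 * math.pi * 440 * t) + math.sin(2 * math.pi * 441 * t)

def envelope(t):
    # 2*cos(2*pi*((441-440)/2)*t): the 1-per-second beat
    return abs(2 * math.cos(math.pi * t))

for i in range(1000):
    t = i / 8000
    ident = 2 * math.cos(math.pi * t) * math.sin(2 * math.pi * 440.5 * t)
    assert abs(mixed(t) - ident) < 1e-9          # the identity holds
    assert abs(mixed(t)) <= envelope(t) + 1e-9   # sum stays inside envelope

# halfway through the beat period the two tones cancel
assert abs(mixed(0.5)) < 1e-6
```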
We can get information about sound files by using the soxi
command or by adding stat to the end of a sox command.
The audacity package lets you view waveforms (and also do some sound editing):
- 440.wav, 440-442beat.wav
- note the various filtering options
I've created a simple spreadsheet, fourier.xls,
that can be used to create a graph of the result of the first 10 terms of
any Fourier series.
- square wave
- triangle wave
Time domain and Frequency domain
You can take any signal, periodic or not, and apply the Fourier transform to
get a frequency-domain representation. This can be messy, and is not
particularly deep, or even helpful: a signal composed of sine waves of
continuously varying frequency can be intractable to work with.
Most voice signals are not periodic at larger scales: people say different
things over time.
What is useful, though, is the
idea that over small-but-not-too-small intervals, say 10ms, voice does
indeed look like a set of periodic waves. At this point, the
frequency-domain view becomes very useful, because the signal is just a set
of discrete frequencies. In other words, "sound is made up of vibrations".
G.729 voice encoding (compresses 160-byte 10ms sample into 20 bytes): Step 1
is, for each sample, to try to match the "best fit" pure frequency.
Maybe even more importantly, most digital data signals similarly look
"almost periodic", often over much larger intervals than voice. So the same
principle applies: translation to the frequency domain is actually useful.
When we start looking at the details of audio highpass, lowpass, and
bandpass filters (eg for audio), we'll see that they involve averaging over
a time interval. If that time interval is too large, the signal may no
longer look very periodic and we may get misleading or distorted results. If
the time interval is too small, the filter may not be "sharp" enough.
Digital Signal Processing
(was: Transformations on waveforms)
How can we treat sound as data?
For many purposes, we just encode the sound and leave it at that until it
reaches the receiving end. But sometimes we want to "tweak" the sound; the
classic example is to run it through a highpass, lowpass, or bandpass
filter, to eliminate frequencies that are too low, too high, or not in the
middle. Another example is to try to filter out some of the noise (this
works best if the noise is not purely random).
How do we get Fourier series?
We will assume that the waveform data has been digitized and is available as
an array of values A[n].
One way to think of the Fourier series is as a set of averaging
operations: the sequence A[n] is "averaged against" a sine (or
cosine) wave sin(2𝜋Fn), where F is the "frequency" in units of the sampling
rate. (More specifically, the frequency in Hz is f = FR, where R is the
sampling rate. Alternatively, the sampling frequency in terms of the
frequency in Hertz is F = f/R. For a 400-Hz tone, F = 1/20, meaning that 20
samples make up one full sine wave.)
The idea behind "averaging against" is that we're taking a weighted
average of the A[n], eg 1/4 A[0] + 2/3 A[1] + 1/12 A[2]. However,
with sin(n) involved, some of the weights can be negative.
Averaging two waveforms, in two arrays A and B, amounts to computing the sum
of products:
double sum = 0.0
for (i = 0; i < whatever; i++) sum += A[i]*B[i]
"whatever" might be A.length, or min(A.length, B.length). We might also
"shift" one of the arrays, so the sum is
sum += A[i]*B[shift+i]
In some cases it is natural to reverse the direction in B:
sum += A[i]*B[shift−i]
If we're averaging against a sine wave, with frequency 2𝜋f, we might have
something like this:
for (n=0; n<len; n++) sum+=A[n]*sin(2*Pi*f*n);
avg = sum / len
If you're familiar with calculus, this more-or-less corresponds to
∫ A(x) sin(2𝜋fx) dx.
Digitized sound makes these sums easy to do, though; no calculus is needed.
Sometimes it's unclear what you divide by (eg len, above; sometimes it's
actually the time in seconds), but if we divide by the wrong thing we're
usually just off by a constant.
Example 1: two sine waves of different frequency average to zero; phase does
not matter.
Example 2: two sine waves of the same frequency average to 1/2, the average
of sin²(x) over a full period. This is a constant.
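A quick numerical check of both examples (a Python sketch; the frequencies and the number of samples N are arbitrary choices):

```python
import math

# Average A[n]*B[n] where A and B are sine waves sampled over whole cycles.
def average(f1, f2, phase=0.0, N=8000):
    return sum(math.sin(2 * math.pi * f1 * n / N + phase) *
               math.sin(2 * math.pi * f2 * n / N)
               for n in range(N)) / N

# different frequencies average to ~0, whatever the phase
assert abs(average(5, 7)) < 1e-6
assert abs(average(5, 7, phase=1.23)) < 1e-6
# same frequency averages to ~1/2 (the average of sin^2)
assert abs(average(5, 5) - 0.5) < 1e-6
```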
Assume the signal g(x) is periodic, and is composed of sines, and has base
frequency 1 (or 1/2𝜋). We know g(x) must have the following form; we just
need to find A1, A2, etc:
g(x) = A1 sin(x) + A2 sin(2x) + A3 sin(3x) + ...
To do this, we start by averaging both sides with sin(x):
Average(g(x), sin(x)) = A1 * average(sin(x), sin(x)) + A2 * average(sin(2x),
sin(x)) + A3 * average(sin(3x), sin(x)) + ...
= A1 * 1/2
From this we can readily calculate A1; averaging with sin(2x) gives A2, and
so on.
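Here is a Python sketch of the whole recipe: synthesize a g(x) with known coefficients (the values 1, 0.5, 0.25 are made up for the test), then recover each Ak by averaging against sin(kx) and dividing by 1/2.

```python
import math

N = 10000   # samples over one period 0..2*pi

def g(x):
    # a signal with known coefficients A1=1, A2=0.5, A3=0.25
    return math.sin(x) + 0.5 * math.sin(2 * x) + 0.25 * math.sin(3 * x)

def coeff(k):
    # average g against sin(kx); the average is Ak * 1/2, so multiply by 2
    avg = sum(g(2 * math.pi * n / N) * math.sin(k * 2 * math.pi * n / N)
              for n in range(N)) / N
    return 2 * avg

assert abs(coeff(1) - 1.0) < 1e-6
assert abs(coeff(2) - 0.5) < 1e-6
assert abs(coeff(3) - 0.25) < 1e-6
```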
In the above averages, we calculate one fixed sum (eg for (i=0;i<max;i++)
sum+=g(i)*sin(i)). There is another kind of averaging, where we repeatedly
"shift" one of the two functions involved, to get a new sequence
/ waveform /function. This is also called moving
averages. Things don't necessarily have to be periodic here.
Here's a simple example, defining a new sequence B[n] from existing sequence
A[n]. If A[n] is defined for n from 0 to 1000, inclusive, then B[n] will be
defined for n from 1 to 999.
B[n] = (A[n-1] + A[n] + A[n+1])/3
We can think of this as a weighted average of A with (1/3, 1/3, 1/3). The
point here is that we get a new sequence,
B[n], instead of a single value like A1 above. However, the actual
calculation is the same: find the sum of a bunch of A[n]*f(n) products.
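A sketch of this (1/3, 1/3, 1/3) moving average:

```python
# B[n] = (A[n-1] + A[n] + A[n+1]) / 3: a new, smoothed sequence,
# defined only where all three neighbors exist.
def smooth(A):
    return [(A[n - 1] + A[n] + A[n + 1]) / 3 for n in range(1, len(A) - 1)]

A = [0, 0, 9, 0, 0, 6, 0]
B = smooth(A)
# each spike of height h spreads out into three values of height h/3
assert B == [3.0, 3.0, 3.0, 2.0, 2.0]
```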
Any linear transformation of the
input signal can be expressed this way, though some may need lots and lots
of coefficients, and finding the coefficients might in some cases be
difficult.
However, there are some shortcuts to the find-the-coefficients problem. If
we know what we want in the frequency
domain, we can apply some of the math above to find the corresponding
coefficients for a corresponding moving average in the time
domain. For example, suppose we want to create a "lowpass filter", to
cut off the higher frequency signals. This is an easy point-by-point thing
to express, in the frequency domain:
(1, 1, 1, ...., 1, 0, 0, 0, ...)
Apply this to a sequence of frequency amplitudes (a1, a2, a3, a4, a5, a6,
a7, a8) (representing respective amplitudes, say, of sin(x), sin(2x), ...,
sin(8x)). Assume the cutoff is at position 4, and multiply corresponding
values. We get
(1*a1, 1*a2, 1*a3, 1*a4, 0*a5, 0*a6, 0*a7, 0*a8)
= (a1, a2, a3, a4, 0, 0, 0, 0)
Point-by-point multiplication in the
frequency domain corresponds to a specific moving average in the time
domain, and we can calculate that in a relatively straightforward
way (although it does involve some calculus).
In this particular case of a frequency-domain cutoff, above, it turns out
that the time-domain moving average we need is the "sinc" function (with a
"c", pronounced "sink"):
sinc(x) = sin(x)/x (sinc(0) = 1)
Moving averages of this sort are sometimes called convolution.
The function you are averaging against (eg sinc(x)) is sometimes called a
"kernel", not to be confused with the OS use of that term. The new signal
after filtering is array B, where
B[n] = sum over −M ≤ t ≤ M of A[n+t]*sinc(2𝜋f*t), suitably normalized
(divide by the sum of the sinc weights).
This is defined for M ≤ n ≤ A.length − 1 − M.
Other common kernels:
- Identity ("delta" function)
- windowed sin(x): gives estimate of local frequency
- high-pass (Identity - sinc)
- bandpass: combination of lowpass and highpass
Almost all modification of sound files uses this kind of "averaging against
a kernel". (Echo cancellation is a harder problem that can't easily be done
this way.)
Associativity: bandpass kernel: applying a lowpass filter L and then a
highpass filter H is equivalent to convolving L and H together (L*H) and
then applying that: (Signal*L)*H = Signal*(L*H)
Note that sinc() goes to infinity in both directions. In practical terms, we
have to truncate, generally at a point where sinc(x)=0 (that's what M was,
above), and we should also taper off as we approach ±M. The other practical
problem is that convolution with sinc(x) involves some knowledge of the
"future" sound, that is, to get B[n] we need to know A up to A[n+M]. If n
represents the time, then we need to know values of A[i] M units into the
future. One solution is to just accept that B is delayed by M units.
This is a spreadsheet where we average sinc(f1x) (the lowpass
filter with cutoff f1) against cos(f2x). We expect that if f2<f1
then there is little change, but if f2>f1 then the
result is close to zero.
In the spreadsheet, we average just for x>0. Symmetric averaging about
the origin with sin(x) is always exactly 0; cos(x) is the "worst case". Any
sine wave in the real world has a phase,
referring to its relative starting position d: Asin(fx + d). This can also
be expressed as Bsin(fx) + C cos(fx) (it's a trig identities thing) and, as
noted, all we have to worry about is the cos(fx) part.
To use the spreadsheet, you supply a filter frequency (f1 above)
(it should be relatively small, compared to 1000), and a second, "input",
frequency (f2). If the second frequency is higher than the filter
frequency, the reduction will be substantial; otherwise, the reduction
factor will be close to 1 (no change). The waveforms tabulated are of the
form sin(2𝜋fx), so the wavelength is 1/f. Tabulation goes from 0 to 1 by
steps of .001, so if f is 20 a wavelength of 1/20 is 50 "ticks" out of 1000,
and there are 20 full cycles in the sum. To compare performance, we also
calculate the attenuation after just 500 ticks. The ideal sinc filter would
extend to infinity. Attenuation is normalized, so 100% means no attenuation.
We use a cosine wave for the input frequency, as that is the "worst case";
strictly speaking, the attenuation really should be the average
attenuation as the phase of the second, input signal varies from that of
sin(x) to that of cos(x).
Note that frequencies can be fractional, eg f1 = 20 and f2 slightly above it.
Demo: play cantdo.wav subject to various highpass and lowpass filters.
Demo: look at the spectrograms for
some .wav files
spectrum: range of frequencies from fmin to fmax
bandwidth: width, = fmax − fmin
effective bandwidth: frequencies you actually need
major point: in Frequency-division multiplexing, each channel is assigned a
very specific "band-width"
Note: standard electronic circuits can extract "bands", hence the practical
importance of this notion
Note that this is a different use of
"band-width" from "data rate"
There is some relation, however: the narrower the analog band-width, the
less room there is for multiple frequencies and modulation, and the less
data we can send per unit time.
discussion of benefits of Fourier analysis:
what if distortion is linear, but:
what if we want to understand bandpass filtering
- higher-frequency sine waves transmit faster/slower
- higher-frequency sine waves get damped more
Example on data rate ("bandwidth" in the digital sense) versus bandwidth
Suppose we want to transmit a square wave, alternating from -1 to +1. We
have a band-width of 4 Mhz.
Case 1: we use sin(2𝜋ft) + (1/3)sin(2𝜋(3f)t) + (1/5)sin(2𝜋(5f)t) (that
is, three terms of the Fourier series), where f = 1 Mhz.
Look at this with fourier.xls: does it look squarish?
The frequencies are 1 Mhz, 3 Mhz and 5 Mhz. The band-width is 5-1 = 4 Mhz.
The data rate is 2 Mbps (sort of; we are sending 1 M 1-bits and 1 M 0-bits
each second).
Note dependence on notion of what waveform is "good enough"
Case 2: If we double all the frequencies to 2 MHz, 6 MHz, 10 MHz, we get
band-width 8MHz, data rate 4Mbps. This is the same as above with f=2 Mhz.
Case 3: We decide in the second case that we do not
need the 10 MHz component, due to a more accurate receiver. The base
frequency is f = 2MHz, frequencies 2MHz, 6MHz, band-width is now 6MHz - 2MHz
= 4MHz, data rate: 4Mbps
Look at this with fourier.xls.
Note that we're really not carrying data in a meaningful sense; we can't
send an arbitrary sequence of 0's and 1's this way. However, that's done
mostly to simplify things.
Note also the implicit dependence on bandwidth of the fact that we're
decomposing into sinusoidal waves.
Voice transmission: frequency band-width
~ 3-4kHz (eg 300 Hz to 3300 Hz)
64Kbps encoding (8 bits sampled 8000 times a second)
Modems have the reverse job: given the band-width of ~3 kHz, they have to
send at 56 kbps!
Note: we're looking at a DISCRETE frequency spectrum (periodic signals). A
CONTINUOUS frequency spectrum also makes mathematical sense, but is kind of
messy.
Note that frequency-domain notion depends on fundamental theorem of Fourier
analysis that every periodic function can be expressed as sum of sines &
cosines (all with period an integral multiple of original)
Bandwidth of voice: <=4 kHz
quite different meaning from digitization: 64kbps
But if we wanted to encode the digitized voice back into a voice bandwidth,
we'd have to encode 16 bits per cycle (Hertz), which is a little tricky.
Amplitude modulation & band-width
Note that AM modulation (ALL
modulation, in fact) requires a "band-width"; ie range of frequencies. This
will be very important for cellular.
AM: amplitude = [1+data(t)]*sin(2𝜋ft)
f is "carrier" high frequency; eg 100,000
If data(t) = sin(2𝜋gt), g a much lower frequency (eg 1000)
Then sin(2𝜋ft)*sin(2𝜋gt) = 0.5 cos(2𝜋(f-g)t) - 0.5 cos(2𝜋(f+g)t)
band of frequencies: (f-g) to (f+g)
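A numeric check of this product-to-sum identity, with the carrier f and data frequency g as in the example:

```python
import math

# sin(2*pi*f*t) * sin(2*pi*g*t) = 0.5*cos(2*pi*(f-g)*t) - 0.5*cos(2*pi*(f+g)*t)
# so the product occupies the two "sideband" frequencies f-g and f+g.
f, g = 100000.0, 1000.0
for i in range(1000):
    t = i / 1e6
    lhs = math.sin(2 * math.pi * f * t) * math.sin(2 * math.pi * g * t)
    rhs = (0.5 * math.cos(2 * math.pi * (f - g) * t)
           - 0.5 * math.cos(2 * math.pi * (f + g) * t))
    assert abs(lhs - rhs) < 1e-6
```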