If you’re anything like me, you’ve always been fascinated by Shazam. How can a program listen to a random portion of a random song, through various microphones in various environments, and come up with the right song? For a while I chalked it up to black magic and intangible analysis, but some work with audio analysis eventually helped me make the jump to a 10,000-foot understanding.
The first thing to understand is the Fourier Transform. Essentially it takes a signal (in the case of Shazam, a microphone signal) and determines the frequencies (pitches) that make up that signal. A bass guitar will have much lower frequencies than a piccolo. The Fourier Transform can yield important information about all of the pieces that make up what we actually hear. (Hint: apart from pure electronic tones such as a sine wave, every tone is made up of many different frequencies, and that mix is what gives a sound its timbre.)
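To make that a bit more concrete, here is a minimal sketch using NumPy. The "instrument" is made up for illustration: a 110 Hz fundamental with a quieter 440 Hz overtone, sampled at an assumed 8000 samples per second. The FFT should pull those two frequencies back out of the combined signal.

```python
import numpy as np

sample_rate = 8000  # samples per second (an assumption for this example)
duration = 1.0      # seconds of audio
t = np.arange(int(sample_rate * duration)) / sample_rate

# A made-up tone: 110 Hz fundamental plus a quieter 440 Hz overtone.
signal = 1.0 * np.sin(2 * np.pi * 110 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

# Magnitude of each frequency component, and the frequency of each FFT bin.
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Pick the two strongest bins and report them in ascending order.
top_two_bins = np.argsort(spectrum)[-2:]
strongest = sorted(int(round(f)) for f in freqs[top_two_bins])
print("Strongest frequencies (Hz):", strongest)
```

The two frequencies that went in are the two that stand out in the spectrum. A real instrument would show many more overtones, and their relative strengths are what make up its timbre.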
The Fast Fourier Transform (FFT) is what is typically used in the digital world, and your cell phone lives in the digital world. I’ll leave out how fascinating the FFT matrix actually is, and just note that in order to run an FFT, enough audio has to be recorded first. (If you want to find a 20 Hz signal, it helps to have more than 1/20 of a second of data.)
Now that we have some background on FFTs, I can get into more detail about Shazam. Sure, one can take a Fourier Transform of the whole of “In The Garden Of Eden” by I. Ron Butterfly, but that doesn’t tell you much unless you’re looking for patterns that span the entire song. The powerful information comes out when you look much closer. Since the human ear can only hear sounds as low as about 20 Hz, it makes sense to start around there, and maybe use a window long enough to handle half that frequency. That would be 10 Hz, which means 100 ms of data.
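The arithmetic behind that 100 ms figure can be checked with a short sketch. The helper name and the 8000 samples-per-second rate are my own assumptions for illustration. The idea is that a window of T seconds fits exactly one full cycle of a 1/T Hz signal, and the FFT bins for that window end up spaced 1/T Hz apart.

```python
import numpy as np

def min_window_seconds(lowest_hz):
    """Shortest window that still contains one full cycle of lowest_hz."""
    return 1.0 / lowest_hz

sample_rate = 8000  # samples per second (an assumption for this example)

window_seconds = min_window_seconds(10)          # 10 Hz -> 0.1 s
window_samples = int(round(window_seconds * sample_rate))

# The FFT of a window this long has bins spaced about 1 / window_seconds apart.
freqs = np.fft.rfftfreq(window_samples, d=1 / sample_rate)
bin_spacing = freqs[1] - freqs[0]

print(f"window: {window_seconds * 1000:.0f} ms, {window_samples} samples")
print(f"bin spacing: {bin_spacing:.1f} Hz")
```

So a longer window gives finer frequency detail, at the cost of blurring together whatever happened during that window.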
Taking the frequency data of the first 100 ms doesn’t yield very interesting results on its own either. You might be able to get an idea of what the first few notes are. But you can then take another window from, say, 1 ms to 101 ms, and another one after that. Keep sliding this “windowed” FFT across the entire 17 minutes of In-A-Gadda-Da-Vida, and you have all the data you need to analyze and catalog the song in a Shazam database!
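Here is a rough sketch of that sliding window in NumPy. The function name, the Hann taper applied to each window, and the test signal are all my own choices for illustration, not anything Shazam is known to use. The example hops 10 ms at a time rather than 1 ms just to keep the output small.

```python
import numpy as np

def windowed_fft(signal, sample_rate, window_seconds=0.1, hop_seconds=0.001):
    """Slide a window across the signal and take an FFT at every step.

    Returns (times, freqs, frames), where frames[i, j] is the magnitude of
    frequency freqs[j] in the window that starts at times[i].
    """
    window_len = int(round(window_seconds * sample_rate))
    hop_len = max(1, int(round(hop_seconds * sample_rate)))
    taper = np.hanning(window_len)  # softens the window edges (an assumption)
    starts = range(0, len(signal) - window_len + 1, hop_len)
    frames = np.array(
        [np.abs(np.fft.rfft(signal[s:s + window_len] * taper)) for s in starts]
    )
    freqs = np.fft.rfftfreq(window_len, d=1 / sample_rate)
    times = np.array(starts) / sample_rate
    return times, freqs, frames

# Made-up test signal: 200 Hz for the first half second, then 600 Hz.
sample_rate = 4000
t = np.arange(sample_rate) / sample_rate
signal = np.where(t < 0.5, np.sin(2 * np.pi * 200 * t), np.sin(2 * np.pi * 600 * t))

times, freqs, frames = windowed_fft(signal, sample_rate, hop_seconds=0.01)
first_note = int(round(freqs[np.argmax(frames[0])]))
last_note = int(round(freqs[np.argmax(frames[-1])]))
print(frames.shape, first_note, last_note)
```

Each row of `frames` is one snapshot of the frequency content, so the whole array is essentially a spectrogram: frequency versus time, with the magnitude as intensity.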
This technique shows up in several corners of arts and engineering: Aphex Twin did it in 1999, music visualization software has been doing it for decades, and there is a pit of Wikipedia pages one can dig through for weeks, if so inclined.
I should mention this, though I feel it isn’t necessary for understanding. Back to how humans hear: Mel-Frequency Cepstral Coefficients (MFCCs) are used to essentially rescale the frequencies from an FFT so they better match human hearing sensitivity. They help in practice, but you don’t need them to understand the fundamentals.
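For the curious, the "mel" part is just a rescaling of the frequency axis. The sketch below uses one commonly quoted formula for the mel scale; other variants exist, and a full MFCC computation involves further steps (filter banks, a logarithm, and a cosine transform) that I’m not showing here.

```python
import math

def hz_to_mel(hz):
    """Convert a frequency in Hz to mels (one common convention)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Equal steps in Hz shrink on the mel scale as the frequency goes up,
# which mirrors how our ears are less sensitive to differences up there.
for hz in (100, 1000, 2000, 4000, 8000):
    print(f"{hz:>5} Hz -> {hz_to_mel(hz):7.1f} mel")
```

The jump from 1000 Hz to 2000 Hz covers far more mels than the same-sized jump from 7000 Hz to 8000 Hz, so low and mid frequencies get more of the attention.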
At this point, I should cite the post that sparked my interest in this topic: https://blog.francoismaillet.com/epic-celebration/. The author discusses (with images) the idea of using the MFCC to pull out specific sounds, and shows the relationship between a sound and its MFCC graph.
With a song, different things will show up in different ways. A singer’s voice, a cymbal crash, keyboard notes… Each of these sequences is unique to a song. Well, almost unique. But together they essentially make up a fingerprint of the song.
I’ll leave out optimization (peak finding algorithms), but suffice it to say specific “points” can be found which have the “Maximum Intensity for a Frequency at a Time.” Let’s just assume we can determine those with ease.
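If you want a feel for what "finding those points" could look like, here is a deliberately naive sketch: a cell in the time-by-frequency grid counts as a peak if it is louder than a threshold and louder than all eight of its neighbours. The function name, the threshold, and the toy grid are made up for illustration, and real systems use something more refined.

```python
import numpy as np

def find_peaks_2d(spectrogram, min_magnitude):
    """Return (time_index, freq_index) pairs that are strict local maxima."""
    grid = np.asarray(spectrogram, dtype=float)
    # Pad with -inf so cells on the border can still be compared to neighbours.
    padded = np.pad(grid, 1, mode="constant", constant_values=-np.inf)
    rows, cols = padded.shape
    center = padded[1:-1, 1:-1]
    is_peak = center >= min_magnitude
    for dt in (-1, 0, 1):
        for df in (-1, 0, 1):
            if dt == 0 and df == 0:
                continue
            neighbour = padded[1 + dt:rows - 1 + dt, 1 + df:cols - 1 + df]
            is_peak &= center > neighbour
    return [(int(t), int(f)) for t, f in np.argwhere(is_peak)]

# Toy grid: 5 time steps by 6 frequency bins, with two loud spots.
toy = np.zeros((5, 6))
toy[1, 2] = 5.0
toy[3, 4] = 3.0
toy[3, 3] = 1.0  # quieter neighbour, should not count as a peak

peaks = find_peaks_2d(toy, min_magnitude=2.0)
print(peaks)
```

The output is a short list of (time, frequency) points, which is a far smaller thing to store than the full grid of magnitudes.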
So now we have a bunch of points of frequency vs. time. Using further ingenuity, these points can be cataloged in a database. That’ll possibly be worth another post in its own right. But essentially, a few points (the differences in time and frequency between artifacts) are all that is needed for a Shazam match.
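As a rough sketch of how such a catalog could work (this is my own toy version with made-up names and data, not Shazam’s actual implementation): pair each peak with the next few peaks, use the two frequencies and the time gap between them as a lookup key, and let matching keys vote for a song. Because only time differences go into the key, it doesn’t matter where in the song the clip starts.

```python
from collections import Counter, defaultdict

def fingerprints(peaks, fan_out=3):
    """Pair each (time, freq) peak with the next few peaks.

    Yields ((freq1, freq2, time_gap), time1) for every pair.
    """
    peaks = sorted(peaks)
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            yield (f1, f2, t2 - t1), t1

def build_index(songs):
    """Map every fingerprint key to the songs and times where it occurs."""
    index = defaultdict(list)
    for name, peaks in songs.items():
        for key, t in fingerprints(peaks):
            index[key].append((name, t))
    return index

def best_match(index, clip_peaks):
    """Vote on (song, time offset); a real match agrees on a single offset."""
    votes = Counter()
    for key, clip_t in fingerprints(clip_peaks):
        for name, song_t in index.get(key, []):
            votes[(name, song_t - clip_t)] += 1
    if not votes:
        return None
    (name, _offset), _count = votes.most_common(1)[0]
    return name

# Made-up peaks as (time_index, freq_index) pairs.
songs = {
    "song_a": [(0, 10), (2, 40), (5, 25), (7, 60), (9, 15)],
    "song_b": [(0, 12), (3, 33), (4, 50), (8, 20), (9, 44)],
}
index = build_index(songs)

# A clip taken from the middle of song_a, so its clock starts 2 steps late.
clip = [(0, 40), (3, 25), (5, 60), (7, 15)]
print(best_match(index, clip))
```

Voting on the offset is what keeps coincidental key matches from winning: a true match produces many pairs that all agree on the same time shift, while accidental matches scatter across different shifts.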
Since I didn’t have any suitable graphs, this might have been a dry read. I’ll possibly fix this in the future. There is an article that I likely read several months or years ago that goes into more detail, which can be found here. Shazam has been around for many years, and it wasn’t until I started down the path of some audio analysis that I was able to unravel some of the mysteries of a powerful tool that touches upon many areas of Engineering and Computer Science.