How does Google Now know it's listening to music?


Question

Once you launch the Google Now app, it listens to determine whether it is hearing speech/noise or a song.


What is the logical process that the Google Now app uses to identify noise vs music? I'm assuming it reads the frequency/tone of the song it is picking up through the microphone, but if not, what is the actual process?






Answer

I don't know the specifics of Google's system, but two technologies are relevant to this problem, and it may use both together (depending on circumstances). Either way, the first thing it does is to apply a Fourier transform to short, successive windows of what it's hearing. This gives it the spectrum of the sound, telling it how much of each frequency is present, and (because it has a sequence of windows) how that changes over time. Once it has this, it can discard the very high- and low-pitched bands that don't help to identify the sound.



The output of this is a collection of "buckets", where each bucket represents a frequency band for a particular fraction of a second, and the value tells you how much sound there is in that bucket. This is where there are two possibilities.
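The analysis step described so far can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the frame size, hop length, sample rate, and test tone are arbitrary choices of mine, not values from Google's system, and the slow textbook DFT stands in for the FFT a real app would use.

```python
import cmath
import math

def spectrogram(signal, frame_size=64, hop=32):
    """Return one list of per-frequency magnitudes for each time frame."""
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = signal[start:start + frame_size]
        # A Hann window reduces leakage between neighbouring frequency bins.
        windowed = [
            x * 0.5 * (1 - math.cos(2 * math.pi * n / (frame_size - 1)))
            for n, x in enumerate(frame)
        ]
        # Naive O(n^2) DFT; for real input only bins 0..frame_size/2 are distinct.
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(
                windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            mags.append(abs(acc))
        frames.append(mags)
    return frames

def band_limit(frames, low_bin, high_bin):
    """Drop the very low and very high bins that do not help identification."""
    return [frame[low_bin:high_bin] for frame in frames]

# Illustrative input (assumption): a 1000 Hz tone sampled at 8000 Hz.
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]

buckets = spectrogram(tone)            # buckets[time][frequency] = magnitude
useful = band_limit(buckets, 2, 30)    # cut-off bins chosen arbitrarily

# Each bin is 8000 / 64 = 125 Hz wide, so the tone peaks in bin 8.
peak_bin = max(range(len(buckets[0])), key=buckets[0].__getitem__)
```

Each inner list is one column of "buckets": a fraction of a second of audio, split into frequency bands.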



Bayesian classifier



The simplest possibility is to give these buckets to some kind of pre-trained Bayesian classifier. This is a class of AI algorithms that can take some collection of features (the frequency/time buckets) and classify the situation into one of a set of groups you define: in this case, "speech" or "music" or "something else". Although it takes a lot of training data to teach the algorithm how to make this classification, the parameters that result (the pre-trained algorithm) are very compact, so it's practical to include these data in the app, allowing off-line classification. I'd expect that it uses this kind of algorithm on its own if you're using Google's off-line speech recognition feature.
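To make the idea concrete, here is a minimal sketch of one member of that family, a Gaussian naive Bayes classifier. Everything specific in it is an assumption for illustration: the two features (share of energy in a low band and in a high band), the tiny made-up training set, and the class names. It is not Google's model.

```python
import math
from collections import defaultdict

class GaussianNaiveBayes:
    """Toy classifier: each feature is modelled as an independent Gaussian per class."""

    def fit(self, samples, labels):
        grouped = defaultdict(list)
        for x, y in zip(samples, labels):
            grouped[y].append(x)
        total = len(samples)
        # The "pre-trained" parameters: a prior, means and variances per class.
        self.params = {}
        for label, rows in grouped.items():
            columns = list(zip(*rows))
            means = [sum(c) / len(c) for c in columns]
            variances = [
                sum((v - m) ** 2 for v in c) / len(c) + 1e-6  # avoid zero variance
                for c, m in zip(columns, means)
            ]
            self.params[label] = (math.log(len(rows) / total), means, variances)
        return self

    def predict(self, x):
        def log_posterior(label):
            log_prior, means, variances = self.params[label]
            return log_prior + sum(
                -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
                for v, m, var in zip(x, means, variances)
            )
        return max(self.params, key=log_posterior)

# Made-up training data (assumption): [low-band energy share, high-band energy share].
training = [
    ([0.8, 0.2], "speech"), ([0.7, 0.3], "speech"), ([0.9, 0.1], "speech"),
    ([0.4, 0.6], "music"), ([0.5, 0.5], "music"), ([0.3, 0.7], "music"),
]
model = GaussianNaiveBayes().fit(
    [features for features, _ in training],
    [label for _, label in training],
)
guess = model.predict([0.75, 0.25])
```

Note how little has to be stored after training: one prior, plus a mean and a variance per feature, for each class. That is why shipping the trained parameters inside the app is practical.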



Don't forget that a use of this feature might be to recognise an "OK Google" query in a pub with music in the background, so the classification isn't just as simple as "speech" or "music": it has to decide which is louder or in the foreground.



Music recognition



The other possibility is to try to find a matching piece of music straight away. Music recognition involves a big database of all the music you want to recognise. Each piece of music has already been run through the frequency analysis, so it's indexed by the same frequency/time buckets that the app has recorded. The app just has to go to the cloud to ask the database, "Do you have any music that has these buckets?"



The time part of the buckets is only calculated relative to nearby buckets, so it will find the piece of music regardless of what part of it you're listening to. Also, the information in the buckets is reduced with a hash function (a function that turns some numbers into a smaller number). Typically each hash is built from only a few prominent buckets, so enough hashes still match when some frequencies are lost (because of a bad sound system or other noise in the background).
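A rough sketch of how such a lookup could work is below. It follows the general shape of landmark-style audio fingerprinting; I am not claiming this is what Google does. The peak lists, track names, `fan_out` value, and the use of a truncated SHA-1 as the hash are all assumptions chosen for illustration.

```python
import hashlib
from collections import Counter, defaultdict

def fingerprints(peaks, fan_out=3):
    """peaks: (time_index, freq_bin) pairs sorted by time.

    Yields (hash, anchor_time). Each hash covers two frequencies and the
    time gap between them, so it does not depend on absolute position.
    """
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            key = f"{f1}|{f2}|{t2 - t1}".encode()
            yield hashlib.sha1(key).hexdigest()[:10], t1

def build_index(tracks):
    """Map every hash to the tracks (and positions) where it occurs."""
    index = defaultdict(list)
    for name, peaks in tracks.items():
        for h, t in fingerprints(peaks):
            index[h].append((name, t))
    return index

def best_match(index, query_peaks):
    """Vote for (track, time offset); a real match piles up on one offset."""
    votes = Counter()
    for h, t_query in fingerprints(query_peaks):
        for name, t_track in index.get(h, []):
            votes[(name, t_track - t_query)] += 1
    if not votes:
        return None
    (name, _offset), _count = votes.most_common(1)[0]
    return name

# Made-up "database" of loudest buckets per track (assumption).
tracks = {
    "song_a": [(0, 10), (1, 20), (2, 15), (3, 30), (4, 12), (5, 25)],
    "song_b": [(0, 11), (1, 21), (2, 14), (3, 31), (4, 13), (5, 26)],
}
index = build_index(tracks)

# The listener joins song_a part-way through, so its clock starts at 0.
excerpt = [(0, 15), (1, 30), (2, 12), (3, 25)]
# The same excerpt with one peak lost to noise.
noisy_excerpt = [(0, 15), (2, 12), (3, 25)]

result = best_match(index, excerpt)
noisy_result = best_match(index, noisy_excerpt)
```

Because each hash only records the gap between two buckets, the excerpt matches even though its timestamps start at zero, and the noisy excerpt still matches because the surviving pairs produce hashes that are in the index.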



This is a really long-winded thing to do just to tell speech from music, and it requires querying Google's servers to do it, so it wouldn't normally be a good option. It's only a good option because, if it is music, the app will need to do this anyway to recognise it.



Summary



Given the advantages and disadvantages of the two algorithms, I'd expect that the app includes both. For off-line use, it would just use the simple classifier. For on-line use, it might well use both in parallel: using the classifier to give a quick result if the input is very likely to be speech, while it waits for the music database in the cloud to try to recognise the particular piece of music (or say no).
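The combined strategy could be wired up roughly as follows. This is purely a sketch of the control flow I'm guessing at: `classify_locally`, `query_music_database`, the confidence threshold, and the return strings are hypothetical names and values, with stubs standing in for the real classifier and the cloud service.

```python
from concurrent.futures import ThreadPoolExecutor

def classify_locally(buckets):
    """Hypothetical stand-in for the on-device classifier: (label, confidence)."""
    return ("speech", 0.95)

def query_music_database(buckets):
    """Hypothetical stand-in for the cloud lookup: a track name, or None."""
    return None

def identify(buckets, online, classifier=classify_locally,
             lookup=query_music_database, speech_threshold=0.9):
    # Off-line: the compact local classifier is the only option.
    if not online:
        return classifier(buckets)[0]

    # On-line: start the slow cloud lookup, then classify while it runs.
    pool = ThreadPoolExecutor(max_workers=1)
    pending = pool.submit(lookup, buckets)
    try:
        label, confidence = classifier(buckets)
        if label == "speech" and confidence >= speech_threshold:
            # Confident it is speech: answer now without waiting for the cloud.
            return "speech"
        track = pending.result()
        return f"music: {track}" if track else label
    finally:
        # Do not block on an unfinished lookup.
        pool.shutdown(wait=False)

offline_answer = identify([], online=False)
online_answer = identify(
    [], online=True,
    classifier=lambda b: ("music", 0.8),
    lookup=lambda b: "Some Track",
)
```

The point of running the two in parallel is that a confident "speech" verdict does not have to pay for the network round trip, while anything that might be music gets the database's answer.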

