I don't know the specifics of Google's system, but there are two technologies relevant to solving this problem, and it may use both together (depending on the circumstances). Either way, the first thing it does is Fourier transform short samples of what it's hearing. This gives it the spectrum of the sound, telling it how much of each frequency is present and, because it has a sequence of samples, how that changes over time. Once it has this, it can filter out some very high- and low-pitched bands that don't help to identify the sound.
The output of this is a collection of "buckets", where each bucket represents a frequency band for a particular fraction of a second, and the value tells you how much sound there is in that bucket. This is where there are two possibilities.
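To make that concrete, here is a minimal sketch in Python of that kind of frequency/time bucketing. The frame length and the band limits are illustrative guesses on my part, not Google's actual parameters.

```python
import numpy as np

def spectrogram_buckets(samples, sample_rate, frame_ms=32, low_hz=300, high_hz=4000):
    """Split audio into short frames, Fourier transform each one, and keep only
    the frequency bands that matter. Returns a 2-D array where each row is one
    time slice and each column is one frequency bucket."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)

    # Window each frame to reduce spectral leakage, then take its spectrum.
    window = np.hanning(frame_len)
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1))

    # Drop the very low and very high frequency bins that don't help.
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sample_rate)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    return spectrum[:, keep]
```

Each entry of the returned array is one "bucket": how much sound there was in one frequency band during one fraction of a second.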
Bayesian classifier
The simplest possibility is to feed these buckets to some kind of pre-trained Bayesian classifier. This is a class of machine-learning algorithms that take a collection of features (here, the frequency/time buckets) and classify the situation into one of a set of groups you define: in this case "speech", "music" or "something else". Although it takes a lot of training data to teach the algorithm how to make this classification, the parameters that result (the pre-trained model) are very compact, so it's practical to ship them inside the app, allowing off-line classification. I'd expect that it uses this kind of algorithm on its own if you're using Google's off-line speech recognition feature.
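As a rough illustration of the idea (not Google's actual pipeline), here is how a naive Bayes classifier from scikit-learn could be trained offline on summarised buckets; the feature summary and the training files are hypothetical, and the real classifier is almost certainly more sophisticated.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: each row summarises the buckets of one labelled
# audio clip (for example, the mean and spread of each frequency band).
X_train = np.load("bucket_features.npy")   # shape: (n_clips, n_features)
y_train = np.load("labels.npy")            # "speech", "music" or "other"

# Train once, offline, on lots of labelled clips...
classifier = GaussianNB()
classifier.fit(X_train, y_train)

# ...then the learned parameters (just per-class means and variances here)
# are small enough to ship inside the app for offline classification.
def classify(buckets):
    features = np.concatenate([buckets.mean(axis=0), buckets.std(axis=0)])
    return classifier.predict([features])[0]
```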
Don't forget that one use of this feature might be to recognise an "OK Google" query in a pub with music in the background, so the classification isn't as simple as "speech" or "music": it has to decide which is louder or in the foreground.
Music recognition
The other possibility is to try to find a matching piece of music straight away. Music recognition involves a big database of all the music you want to recognise. Each piece of music has already been run through the frequency analysis, so it's indexed by the same frequency/time buckets that the app has recorded. The app just has to go to the cloud to ask the database, "Do you have any music that has these buckets?"
The time part of each bucket is stored only relative to nearby buckets, so it will find the piece of music regardless of which part of it you're listening to. Also, the information in the buckets is reduced with a hash function (a function that turns some numbers into a smaller number), so the match is resilient to losing some frequencies (because of a bad sound system or other noise in the background).
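Here is a toy version of that fingerprint-and-lookup idea. Real systems choose spectral peaks and build their hashes far more carefully, and `database` below just stands in for the cloud-side index, but it shows how relative timing plus hashing makes the match robust.

```python
import numpy as np
from collections import Counter

def fingerprints(buckets, fan_out=5):
    """Turn a bucket matrix into compact hashes. Each hash combines the
    strongest frequency in one frame, the strongest frequency in a nearby
    later frame, and the time gap between them -- so it depends only on
    relative timing, not on where in the song you started listening."""
    peaks = [(t, int(np.argmax(frame))) for t, frame in enumerate(buckets)]
    hashes = set()
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            # A real system would use a stable, purpose-built hash here.
            hashes.add(hash((f1, f2, t2 - t1)))
    return hashes

def best_match(sample_hashes, database):
    """database maps each hash to the tracks it was seen in; the track that
    collects the most matching hashes wins."""
    votes = Counter()
    for h in sample_hashes:
        for track in database.get(h, []):
            votes[track] += 1
    return votes.most_common(1)[0] if votes else None
```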
This is a really long-winded way just to tell speech from music, and it requires querying Google's servers, so it wouldn't normally be a good option. It's only a good option because, if it is music, the app will need to do this anyway to recognise it.
Summary
Given the advantages and disadvantages of the two algorithms, I'd expect that the app includes both. For off-line use, it would just use the simple classifier. For on-line use, it might well use both in parallel: the classifier gives a quick result if the input is very likely to be speech, while the app waits for the music database in the cloud to try to recognise the particular piece of music (or report that there's no match).
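A sketch of how that combination might be orchestrated, assuming a hypothetical on-device `classifier` that returns a label with a confidence, and a `cloud_lookup` function that queries the music database:

```python
from concurrent.futures import ThreadPoolExecutor

def identify(buckets, classifier, cloud_lookup, speech_threshold=0.9):
    """Run the local classifier immediately and the cloud music lookup in the
    background; answer early if the clip is very likely to be speech."""
    with ThreadPoolExecutor() as pool:
        music_future = pool.submit(cloud_lookup, buckets)   # slow, network-bound

        label, confidence = classifier(buckets)             # fast, on-device
        if label == "speech" and confidence > speech_threshold:
            music_future.cancel()        # best-effort: skip the cloud answer
            return "speech", None

        match = music_future.result()    # otherwise wait for the cloud
        return ("music", match) if match else (label, None)
```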