Speech recognition is a complex task that involves several machine learning and signal processing procedures. Most speech recognition software in current use is based on a structure called a Hidden Markov Model (HMM).
Imagine that you have a friend in the USA and the two of you are playing the following game. Your friend travels a lot between Los Angeles and Seattle. He will send you weather records from whichever of these two cities he is currently in, and you have to decide which city it is.
You are smart, so you have built Markov models for this purpose, based on historical data.
First, let us focus on the concept of a Markov model. Two such models are presented below: the first describes how the weather changes in Los Angeles, the second how it changes in Seattle.
Now, given the weather records from the past 10 days, presented below, decide which city these records come from.
Sunny, Sunny, Sunny, Sunny, Rainy, Rainy, Sunny, Sunny, Sunny, Sunny
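One way to make this decision is to compute how likely the record is under each city's Markov model and pick the larger value. The sketch below does this for the ten days above. The transition probabilities in `LA` and `SEATTLE` are made-up placeholders, not the values from the two models presented in this article, and the probability of the first day is assumed to be the same in both cities, so it is left out of the comparison.

```python
import math

# ASSUMED transition probabilities P(tomorrow | today).
# These numbers are illustrative placeholders only.
LA = {
    "Sunny": {"Sunny": 0.9, "Rainy": 0.1},
    "Rainy": {"Sunny": 0.5, "Rainy": 0.5},
}
SEATTLE = {
    "Sunny": {"Sunny": 0.6, "Rainy": 0.4},
    "Rainy": {"Sunny": 0.3, "Rainy": 0.7},
}


def log_likelihood(records, transitions):
    """Sum the log-probabilities of every day-to-day transition.

    The first day's own probability is skipped (assumed equal for
    both cities, so it does not affect the comparison).
    """
    return sum(
        math.log(transitions[today][tomorrow])
        for today, tomorrow in zip(records, records[1:])
    )


# The ten-day weather record from the text.
records = ["Sunny"] * 4 + ["Rainy"] * 2 + ["Sunny"] * 4

scores = {
    "Los Angeles": log_likelihood(records, LA),
    "Seattle": log_likelihood(records, SEATTLE),
}
best_city = max(scores, key=scores.get)
print(scores, best_city)
```

With these assumed numbers the Los Angeles model scores higher, mainly because the record contains six Sunny-to-Sunny transitions. Different transition probabilities could change the answer.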
Now imagine that you do not have the weather records from the past 10 days. Because you kept winning, your friend has decided to make the task harder: instead of weather logs, he will send you the item he takes with him to school each day. This item is either sunglasses or an umbrella.
You are smart again, so you build a hidden Markov model for this purpose. The weather is now a hidden state that you never see directly; you see only an observation (the item), whose probability depends on that state.
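To score a sequence of items against a city, the probability has to be summed over every possible hidden weather sequence. The forward algorithm does this efficiently. The sketch below is a minimal version under stated assumptions: all probabilities (`INITIAL`, `EMISSIONS`, `LA`, `SEATTLE`) and the example sequence `items` are hypothetical, and both cities are assumed to share the same link between weather and item.

```python
# ASSUMED parameters; every number here is an illustrative placeholder.
INITIAL = {"Sunny": 0.5, "Rainy": 0.5}  # P(weather on day 1)
EMISSIONS = {  # P(item | weather), assumed identical in both cities
    "Sunny": {"Sunglasses": 0.9, "Umbrella": 0.1},
    "Rainy": {"Sunglasses": 0.2, "Umbrella": 0.8},
}
LA = {  # P(weather tomorrow | weather today)
    "Sunny": {"Sunny": 0.9, "Rainy": 0.1},
    "Rainy": {"Sunny": 0.5, "Rainy": 0.5},
}
SEATTLE = {
    "Sunny": {"Sunny": 0.6, "Rainy": 0.4},
    "Rainy": {"Sunny": 0.3, "Rainy": 0.7},
}


def forward_likelihood(observations, initial, transitions, emissions):
    """Return P(observations) under the HMM using the forward algorithm."""
    # alpha[s] = P(observations so far, hidden state today = s)
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in initial}
    for obs in observations[1:]:
        alpha = {
            s: emissions[s][obs]
            * sum(alpha[prev] * transitions[prev][s] for prev in alpha)
            for s in initial
        }
    # Sum over the hidden state on the final day.
    return sum(alpha.values())


# Hypothetical sequence of items reported by the friend.
items = ["Sunglasses", "Sunglasses", "Umbrella", "Umbrella", "Sunglasses"]

la_likelihood = forward_likelihood(items, INITIAL, LA, EMISSIONS)
seattle_likelihood = forward_likelihood(items, INITIAL, SEATTLE, EMISSIONS)
guess = "Los Angeles" if la_likelihood > seattle_likelihood else "Seattle"
print(la_likelihood, seattle_likelihood, guess)
```

For long sequences the products become very small, so a practical version would work with log-probabilities or rescale `alpha` at each step.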
In speech recognition, the observation is the sound wave (or rather its mathematical representation), while the states are phonemes. The HMM for a word is presented below.