Recognition of Marathi Numerals using MFCC and DTW Features

Abstract. Numeral recognition is one of the most important problems in pattern
recognition. It has numerous applications, such as reading postal zip codes, passport
numbers and employee codes, bank cheque processing, and video gaming. To the
best of our knowledge, little work has been done for the Marathi language compared
with other Indian and non-Indian languages. This paper discusses a novel method
for the recognition of isolated Marathi numerals. It introduces a Marathi
database and an isolated numeral recognition system that uses Mel-Frequency
Cepstral Coefficients (MFCC) for feature extraction and Dynamic Time Warping
(DTW) for feature matching. The accuracy on pre-recorded samples is higher than
on real-time test samples, and the accuracy on speaker-dependent samples is
higher than on speaker-independent samples. A second method, the Hidden Markov
Model (HMM), which models the words statistically, is also presented.
Experimentally, recognition accuracy is higher for HMM than for DTW, but the
training process in DTW is much simpler and faster. Recognition of numerals
also takes longer with HMM than with DTW, because HMM has to go through many
states, iterations, and more mathematical modelling, so DTW is preferred for
real-time applications.
Keywords: Hidden Markov Model (HMM), Mel-Frequency Cepstral Coefficient
(MFCC), Dynamic Time Warping (DTW).

1. Introduction

Speech recognition systems are used in many fields of daily life. Owing to
rapid worldwide advances in this field, many systems and devices now accept
voice input [3]. Speech synthesis and speech recognition together form a
speech interface. A speech synthesizer transforms text into speech, so it can
read out textual content from the screen. A speech recognizer identifies the
spoken words and transforms them into text. Such software needs to be
available for Indian languages.
Speech recognition on a computer involves several steps, each with its own
issues. The steps needed to make computers perform speech recognition are:
voice recording, word boundary detection, feature extraction, and
recognition using knowledge models [1, 2].

2. Problem Definition

The aim of the paper is to build a speech recognition tool for the Marathi
language: an isolated-word speech recognizer that uses Mel-Frequency
Cepstral Coefficients (MFCC) for feature extraction and Dynamic Time Warping
(DTW) for feature matching, that is, for comparing the test patterns.

3. Marathi Numeral Recognition using MFCC and DTW Features

MFCC and DTW are popular cepstrum-based techniques for comparing patterns to
find their similarity. MATLAB is used to implement the MFCC and DTW features.


The MFCCs are used for feature extraction. The efficiency of this phase is
important because it affects the behaviour of the next phase [4]. In MFCC
feature extraction, the magnitude spectrum of each windowed speech frame is
filtered by a triangular Mel filter bank of twenty filters. From the twenty
Mel-scaled log filter bank outputs, an MFCC feature vector consisting of thirteen
MFCCs and the corresponding delta and acceleration coefficients (thirty-nine
coefficients in total) is extracted for every frame.
MFCCs are widely used because of their low computational complexity and high
performance for ASR in clean, matched conditions. Their performance degrades
sharply in the presence of noise, and the degradation grows as the
signal-to-noise ratio (SNR) falls. MFCC features are chosen here because they
mimic the perception of the human ear.
The complete MFCC procedure is shown in Fig. 3.1. It consists of seven
computational steps, each of which is described briefly below.

Fig. 3.1. MFCC Block Diagram [4].

Step 1: Pre-emphasis
This step boosts the energy of the signal at higher frequencies. Every speech
signal is passed through a first-order FIR filter that emphasizes higher
frequencies. The filter equation used is
$$y[n] = x[n] - 0.95\,x[n-1] \qquad (1)$$
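As an illustration of Eq. (1), the filter can be sketched in a few lines of Python. The paper's implementation is in MATLAB, so the function name `pre_emphasis` and the choice to pass the first sample through unchanged are assumptions made for this sketch only.

```python
def pre_emphasis(signal, alpha=0.95):
    """Apply y[n] = x[n] - alpha * x[n-1] to a list of samples.

    The first sample has no predecessor, so it is passed through
    unchanged (an assumption of this sketch).
    """
    if not signal:
        return []
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (zero-frequency) input is strongly attenuated ...
flat = pre_emphasis([1.0, 1.0, 1.0, 1.0])
# ... while a rapidly alternating (high-frequency) input is amplified.
fast = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

After the first sample, the constant input shrinks to about 0.05 while the alternating input grows to a magnitude of 1.95, which is the high-frequency emphasis the step is meant to provide.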
Step 2: Framing
Every speech signal is split into frames of 36 ms (milliseconds), a period
over which most spectral characteristics remain constant, with 50% overlap
between adjacent frames.
Step 3: Windowing
To eliminate edge effects, every frame is multiplied by a Hamming window, which
works better than other windows. The Hamming window is given by
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \qquad (2)$$
where N is the number of samples in a frame.
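Framing and windowing can be sketched together as follows. The 36 ms frame length and 50% overlap come from the text; the 16 kHz sampling rate is an assumption (the paper does not state it), and the helper names `frame_signal` and `hamming` are hypothetical.

```python
import math


def frame_signal(signal, frame_len, hop):
    """Split a signal into overlapping frames.

    Trailing samples that do not fill a whole frame are dropped.
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames


def hamming(n_samples):
    """Hamming window w(n) = 0.54 - 0.46 cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


sample_rate = 16000                            # assumed, not given in the paper
frame_len = int(round(0.036 * sample_rate))    # 36 ms -> 576 samples
hop = frame_len // 2                           # 50% overlap -> 288 samples

signal = [1.0] * 1600                          # 0.1 s of dummy audio
frames = frame_signal(signal, frame_len, hop)
window = hamming(frame_len)
windowed = [[s * w for s, w in zip(frame, window)] for frame in frames]
```

The window tapers each frame towards 0.08 at both edges, which reduces the discontinuities that would otherwise leak into the spectrum computed in the next step.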
Step 4: Fast Fourier Transform (FFT)
The FFT is used to obtain the log-magnitude spectrum from which the MFCCs are
estimated. We used a 1024-point FFT to obtain higher frequency resolution.
Step 5: Mel Filter Bank Processing
Twenty triangular Mel filters are designed with 50% overlap. The spectrum
within each filter is summed to give one coefficient per filter, and we
take the first thirteen coefficients as our features. Frequencies are
converted to the Mel scale using the conversion formula
$$F(\mathrm{Mel}) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (3)$$
We use only 13 MFCC coefficients because this gives higher recognition
accuracy than other numbers of coefficients.
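The Mel conversion of Eq. (3) and the placement of the twenty filters can be illustrated with a short Python sketch. The 0 to 8000 Hz range and the function names are assumptions for illustration; the paper does not state the band edges it used.

```python
import math


def hz_to_mel(f):
    """Eq. (3): convert a frequency in Hz to the Mel scale."""
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    """Inverse of Eq. (3)."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Centre frequencies (Hz) of filters spaced evenly on the Mel scale.

    With triangular filters that each run from the previous centre to the
    next one, even Mel spacing gives 50% overlap between neighbours.
    """
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (n_filters + 1)
    return [mel_to_hz(m_low + step * (k + 1)) for k in range(n_filters)]


# Assumed band edges: 0 Hz to 8000 Hz.
centers = mel_center_frequencies(20, 0.0, 8000.0)
```

The resulting centre frequencies are closely spaced at low frequencies and spread further apart at high frequencies, mirroring the resolution of human hearing.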
Step 6: Discrete Cosine Transform (DCT)
The DCT of the log Mel spectrum is taken for decorrelation and energy
compaction; the resulting coefficients are the MFCCs. The set of coefficients
for a frame is called an MFCC acoustic vector. Every input speech signal is
thus converted into a sequence of MFCC acoustic vectors, from which the
reference templates are obtained.
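A minimal sketch of this step, assuming an unnormalized DCT-II (the paper does not specify the DCT variant or scaling) and a hypothetical list `mel_energies` holding the twenty filter bank outputs of one frame:

```python
import math


def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II: c_k = sum_m v_m * cos(pi * k * (m + 0.5) / M)."""
    size = len(values)
    return [
        sum(v * math.cos(math.pi * k * (m + 0.5) / size)
            for m, v in enumerate(values))
        for k in range(n_coeffs)
    ]


# Dummy filter bank energies for one frame (20 filters, all equal to e).
mel_energies = [math.e] * 20
log_energies = [math.log(e) for e in mel_energies]

# Keep the first 13 coefficients as the MFCC acoustic vector.
mfcc = dct_ii(log_energies, 13)
```

For this flat dummy spectrum all the energy lands in coefficient 0 and the remaining twelve coefficients are numerically zero, which is the energy compaction property the step relies on.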
Step 7: Delta Energy and Delta Spectrum
The features related to the change in cepstral features over time are
represented by thirteen delta features (12 cepstral features and one energy
feature) and 13 double-delta, or acceleration, features. Each of the 13 delta
features gives the change between frames in the corresponding static feature,
while each of the 13 double-delta features gives the change between frames in
the corresponding delta feature. In this way, a feature vector of 39 MFCC
features in total is computed for each frame. The Mel filter bank created is
shown in Fig. 3.2.
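The construction of the 39-dimensional vector can be sketched as below. The paper does not give its delta formula, so this sketch assumes a simple symmetric difference between neighbouring frames with the edge frames repeated; the function names are hypothetical.

```python
def delta(frames):
    """Frame-to-frame change: d[t] = (x[t+1] - x[t-1]) / 2, edges repeated."""
    count = len(frames)
    out = []
    for t in range(count):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, count - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def add_deltas(static):
    """Append delta and double-delta features to each static vector."""
    d1 = delta(static)   # delta features
    d2 = delta(d1)       # double-delta (acceleration) features
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Five dummy frames of 13 static features that grow linearly over time.
static = [[float(t)] * 13 for t in range(5)]
full = add_deltas(static)   # 13 static + 13 delta + 13 double-delta = 39
```

Each output vector holds the static features first, then the deltas, then the double-deltas.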

Fig. 3.2: Mel scale filter bank [2].

The MFCC coefficient extraction process is:
1. Pre-emphasize the speech signal, split it into frames, apply the window,
and then use the FFT to obtain the frequency information.
2. Pass the signal through the set of triangular Mel-frequency filters to
match human hearing and its varying sensitivity across the speech
spectrum.
3. Take the logarithm of the Mel filter outputs to obtain the
logarithmic spectrum.
4. Apply the discrete cosine transform to the result to obtain the MFCCs.

Mel-frequency warping
According to psychophysical studies, human perception of the frequency content
of sounds follows a subjectively defined nonlinear scale called the Mel scale.
The speech signal consists of tones with different frequencies. For each tone
with an actual frequency measured in Hz, a subjective pitch is measured on the
'Mel' scale. The Mel-frequency scale has linear spacing below 1000 Hz and
logarithmic spacing above 1000 Hz [3].

4. Features Matching (DTW)
4.1 Overview
In this form of speech recognition, the test data are converted to
templates. Recognition then consists of matching the incoming speech
against the stored templates. The template with the lowest distance measure
from the input pattern is the recognized word. The best match (lowest
distance measure) is found by dynamic programming. This is called a Dynamic
Time Warping (DTW) word recognizer [5].
To understand the concept of DTW, two notions are needed:
• Features: the information in every signal has to be represented in some fashion.
• Distances: some form of metric has to be used to obtain a match path.
Since the feature vectors may have multiple elements, a means of
calculating the local distance is needed. The distance between two
feature vectors is measured by the Euclidean distance metric. The Euclidean
distance between two points P = (p1, p2, …, pn) and Q = (q1, q2, …, qn) is
$$d(P, Q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \qquad (4)$$

4.2 DTW Algorithm

Speech is a time-dependent process. Utterances of the same word will
have different durations, and utterances of the same word with the same duration
will still differ in the middle, because different parts of the word are spoken
at different rates. To obtain a global distance between two speech patterns
(represented as sequences of vectors), a time alignment must be performed.
The DTW algorithm is based on dynamic programming. It measures the
similarity between two time series that may vary in time or speed.
It is also used to find the optimal alignment between two time
series when one may be "warped" non-linearly by stretching or shrinking
it along its time axis. The warping can then be used to find the
corresponding regions of the two time series or to determine their
similarity. Fig. 4.1 shows an example of how one time series is 'warped'
to another [6].
In Fig. 4.1, every vertical line joins a point in one time series to its
corresponding point in the other. The two series have similar values on the
y-axis but have been drawn apart so that the vertical lines between
them can be seen more easily. If the two time series in Fig. 4.1 were
identical, all of the lines would be straight vertical lines, as no warping
would be needed to 'line up' the two series. The warp path distance is a
measure of the difference between the two time series after they have been
warped together, computed as the sum of the distances between every pair of
points joined by the vertical lines in Fig. 4.1. Hence, two time series that
are identical except for localized stretching of the time axis have a DTW
distance of zero. The aim of DTW is to compare two dynamic patterns and
measure their similarity by finding the minimum distance between them.

Fig. 4.1 Warping between two time series [6].

Suppose we have two time series Q and C, of length n and m respectively:
$$Q = q_1, q_2, \ldots, q_i, \ldots, q_n \qquad (5)$$
$$C = c_1, c_2, \ldots, c_j, \ldots, c_m \qquad (6)$$
To align the two sequences using DTW, an n-by-m matrix is constructed whose
(i, j)-th element holds the distance d(q_i, c_j) between the two points q_i
and c_j. The local distance between the values of the two sequences is
measured using the squared Euclidean distance:
$$d(q_i, c_j) = (q_i - c_j)^2 \qquad (7)$$
Every matrix element (i, j) corresponds to the alignment between the points
q_i and c_j. The accumulated distance is then given by
$$D(i, j) = \min\left[D(i-1, j-1),\; D(i-1, j),\; D(i, j-1)\right] + d(i, j) \qquad (8)$$
This is shown in Fig. 4.2, where the horizontal axis gives the time of the test
input signal and the vertical axis gives the time sequence of the reference
template. The path shown gives the minimum distance between the input and
template signals. The shaded region shows the search region for the mapping
from input time to template time.
Using dynamic programming, the minimum-distance path can be found in
polynomial time P(t):
$$P(t) = O(N^2 V) \qquad (9)$$
where N is the length of the sequence and V is the number of
templates to be considered.
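The recurrence in Eq. (8) translates almost directly into code. The following Python sketch works on scalar time series with the local distance of Eq. (7); for MFCC vectors the local distance would be computed between whole feature vectors instead. The function name `dtw_distance` is illustrative, and the extra row and column of infinities are an implementation convenience for handling the matrix borders.

```python
def dtw_distance(q, c):
    """Accumulated DTW distance between two scalar sequences, per Eq. (8)."""
    n, m = len(q), len(c)
    inf = float("inf")
    # D has one extra row and column so that D[0][0] can seed the recursion.
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2          # local distance, Eq. (7)
            D[i][j] = d + min(D[i - 1][j - 1],      # diagonal step
                              D[i - 1][j],          # step in i only
                              D[i][j - 1])          # step in j only
    return D[n][m]


# The second series is the first with one sample repeated (a local stretch).
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# Two series that differ by a constant offset of 1 at both samples.
offset = dtw_distance([0, 0], [1, 1])
```

The stretched pair has a distance of zero, in line with the earlier remark that localized stretching of the time axis does not contribute to the DTW distance. Filling the table takes time proportional to n times m per template, which is where the N² factor in Eq. (9) comes from.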

Fig. 4.2 Example of dynamic time warping (DTW) [6].

In theory, the major optimizations of the DTW algorithm come from
observations on the nature of good paths through the grid. These are given by
Sakoe and Chiba and can be summarized as follows.
Monotonic condition: The path never turns back on itself; the indexes i and j
either stay the same or increase, and never decrease.
Continuity condition: The path advances one step at a time; i and j can
each increase by at most 1 on every step along the path.
Boundary condition: The path begins at the bottom left and ends at the top right.
Adjustment window condition: A good path is unlikely to wander far from
the diagonal. The distance that the path is allowed to wander is the window
length r.
Slope constraint condition: The path should not be too steep or too shallow.
This prevents very short sequences from matching very long ones. The condition
is given as a ratio n/m, where m is the number of steps in the x direction and
n is the number in the y direction.
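The adjustment window condition can be illustrated by restricting the DTW table to cells within r steps of the diagonal. This is a sketch under stated assumptions: it implements only the window constraint (not the slope constraint), it widens r to at least |n - m| so that the end point stays reachable, and the name `dtw_banded` is hypothetical.

```python
def dtw_banded(q, c, r):
    """DTW distance with the path confined to |i - j| <= r."""
    n, m = len(q), len(c)
    r = max(r, abs(n - m))          # the end cell (n, m) must lie in the band
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Only visit cells inside the adjustment window around the diagonal.
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


q = [0, 0, 0, 0, 1]
c = [0, 1, 1, 1, 1]
wide = dtw_banded(q, c, 4)     # window covers the whole grid
narrow = dtw_banded(q, c, 1)   # path must stay close to the diagonal
```

With the wide window the two series align perfectly, but only through a path that strays far from the diagonal; the narrow window forbids that path and yields a positive distance. The monotonic, continuity, and boundary conditions are already built into the three-way minimum and the choice of start and end cells.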
Finding the best path that matches two time-series signals is a central
task for many researchers because of its importance in these applications.
DTW is one of the main techniques for this task, especially in speech
recognition systems, where it handles different speaking speeds. DTW is a
cost-minimization matching method in which a test signal is stretched
or compressed to fit a reference template. It is a typical template-based
approach to speech recognition: it stretches and compresses different parts of
the utterance to obtain the alignment that gives the best possible
frame-by-frame match between template and utterance. The template with the
nearest match is chosen as the recognized word [8].

Fig. 4.3 Template Matching Issues: Dynamic Time Warping [7].

DTW is mainly used in small-scale embedded speech recognition
systems such as those in cell phones. The reason is the simplicity of
implementing DTW in hardware, which makes it suitable for
different mobile devices. In addition, the training procedure in DTW is very
simple and fast compared with HMM and ANN. DTW has been used in various
areas, such as speech recognition, data mining, and movement recognition [5].
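Putting the pieces of this section together, a template-based recognizer can be sketched as follows. It uses the Euclidean distance of Eq. (4) as the local distance between feature vectors and picks the label of the template with the smallest DTW distance. The two-dimensional toy "features" and the names `dtw` and `recognize` are purely illustrative; a real system would use the 39-dimensional MFCC vectors.

```python
import math


def euclidean(p, q):
    """Eq. (4): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def dtw(seq_a, seq_b):
    """DTW distance between two sequences of feature vectors."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]


def recognize(test, templates):
    """Return (label, distance) of the template closest to the test sequence.

    `templates` maps each word label to a list of reference sequences.
    """
    best_label, best_dist = None, float("inf")
    for label, references in templates.items():
        for ref in references:
            dist = dtw(test, ref)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist


# Toy templates: made-up 2-D feature sequences for two numerals.
templates = {
    "ek": [[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]],
    "don": [[[2.0, 2.0], [1.0, 1.0], [0.0, 0.0]]],
}
# A slower rendition of the "ek" pattern (first and last frames repeated).
test = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [2.0, 2.0]]
label, dist = recognize(test, templates)
```

"Training" here amounts to storing reference sequences, which is why DTW training is so much simpler than fitting a statistical model; the cost moves to recognition time, which grows with the number of stored templates.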

5. Hidden Markov Model Training and Recognition

Hidden Markov Models (HMMs) are widely used statistical tools in
recognition systems. Their applications range from isolated speech recognition
to very large vocabulary, unconstrained continuous speech recognition and
speaker identification. Consequently, most current speech recognition systems
are based on HMMs [9].
An HMM is a statistical model in which the system is modelled as a Markov
process, which can be represented as a state machine with unknown
parameters. The main task is to determine the unknown parameters from the
observable ones; once determined, the parameters are used for further
analysis. HMMs are commonly used in pattern recognition applications such as
speech, signature, handwriting, and gesture recognition, and in
bioinformatics and genomics.
The hidden Markov model is given by λ = (π, A, B), where
π = initial state distribution vector.
A = state transition probability matrix.
B = continuous observation probability density function matrix.
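To make the role of the three parameters concrete, the sketch below evaluates the likelihood of an observation sequence under a model λ = (π, A, B) using the standard forward algorithm. It simplifies the setting described above: B is a discrete emission table rather than a continuous density, and all numbers are made up for illustration. An isolated-word recognizer would train one such model per numeral and choose the model with the highest likelihood.

```python
def forward_likelihood(pi, A, B, obs):
    """P(obs | lambda) for a discrete HMM, computed by the forward algorithm.

    pi[i]   : probability of starting in state i
    A[i][j] : probability of moving from state i to state j
    B[i][o] : probability of emitting symbol o in state i (discrete here)
    """
    n_states = len(pi)
    # Initialisation: start in state i and emit the first symbol.
    alpha = [pi[i] * B[i][obs[0]] for i in range(n_states)]
    # Induction: move to state j and emit the next symbol.
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n_states)) * B[j][o]
            for j in range(n_states)
        ]
    # Termination: sum over all final states.
    return sum(alpha)


# A made-up two-state, left-to-right model with two output symbols.
pi = [1.0, 0.0]
A = [[0.5, 0.5],
     [0.0, 1.0]]
B = [[0.9, 0.1],
     [0.2, 0.8]]

likelihood = forward_likelihood(pi, A, B, [0, 1])
```

Each recognition requires this pass for every word model, and training requires many iterations of parameter re-estimation, which is consistent with the paper's observation that HMM is slower than DTW.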

Fig. 5.1 Overview of HMM speech recognition process [9].

6. Results and Discussions

We recorded a Marathi database of the numerals zero to nine, with the
intention of implementing a numeral-based password system and other such
applications in everyday life. Twenty samples of each word were recorded from
different people, and each sample was normalized by dividing by its
maximum value. The samples were then processed using Dynamic Time Warping.
Of the 20 samples recorded, 16 were used to train the DTW and the remaining
4 were used for testing.
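The normalization and the 16/4 split described above can be sketched as follows. The text says only that samples were divided by their maximum value; this sketch assumes the maximum absolute value is meant, so that the result lies in [-1, 1]. Both function names are hypothetical.

```python
def peak_normalize(samples):
    """Scale a recording so that its largest absolute sample value is 1."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)      # silent recording: nothing to scale
    return [s / peak for s in samples]


def split_train_test(recordings, n_train=16):
    """Use the first n_train recordings as templates, the rest for testing."""
    return recordings[:n_train], recordings[n_train:]


normalized = peak_normalize([0.5, -2.0, 1.0])

# Twenty dummy recordings of one numeral, each a short list of samples.
recordings = [[0.1 * k, -0.2 * k] for k in range(1, 21)]
train, test = split_train_test([peak_normalize(r) for r in recordings])
```

Normalizing each recording removes differences in loudness between speakers and sessions before the features are compared.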
In this project, speech recognition software was developed using the MFCC
and DTW algorithms. Reference files were created for the different
pre-recorded speech signals. When a microphone input signal is applied, its
MFCC coefficients are compared with the MFCC coefficients of the pre-recorded
speech using the DTW algorithm. The DTW output scores identify the recorded
speech signal nearest to the input, and the final output is displayed on the
MATLAB output screen. The software displays the correct numeral when the
microphone signal is compared with the pre-recorded and online samples.
The extracted features for some of the recorded Marathi numerals zero to nine
are shown in the figures below.

Fig. 6.1 Mel Frequency Cepstrum Coefficients of shunya. Fig. 6.2 Mel Frequency Cepstrum Coefficients of pach.

6.1 Graphical User Interface (GUI) of the system

We created a GUI for the numeral recognition system. The DTW 0-9 Digit
Recognizer has command buttons such as record, open, play, and recognize, and
it displays the opened wave file.
In the DTW digit recognizer, the open button reads a pre-recorded numeral and
the record button records a numeral spoken online by the speaker. The
pre-recorded or online numeral can be played back and then recognized using
DTW for feature matching. The recognizer matches templates by the minimum
warping distance between the numerals; the template with the closest match is
chosen as the recognized numeral and shown on the GUI display.

Fig. 6.3 GUI of DTW Digit Recognizer. Fig. 6.4 GUI of opened wave file.

Fig. 6.5 GUI of pattern matching of shunya. Fig. 6.6 GUI of recognized numeral shunya.

Fig. 6.7 GUI of pattern matching of ek. Fig. 6.8 GUI of recognized numeral ek.

Fig. 6.9 GUI of pattern matching of saha. Fig. 6.10 GUI of recognized numeral saha.

Fig. 6.11 GUI of pattern matching of nau. Fig. 6.12 GUI of recognized numeral nau.
6.2 Testing and Results

6.2.1 Testing with pre-recorded samples

Of the 20 samples recorded for each word, 16 were used for training.
We tested the program's accuracy with the 4 unused samples. A total of 20
samples were tested (4 samples each for the 5 words) and the program gave the
right result for all 20. Thus, we obtained 100% accuracy with pre-recorded
samples.

6.2.2 Real-time testing

For real-time testing, we recorded a sample with the microphone and ran the
program directly on it. A total of 30 samples were tested, of
which 24 gave the right result. This gives an accuracy of 80% with
real-time samples.

6.2.3 Results
• Case 1: Speaker independent (20 templates per digit: 10 male, 10 female)
The system was tested on 100 samples of each word, spoken by
50 different speakers with 2 samples of each digit per speaker.
The results are given in Table 6.1.

Table 6.1 Accuracy of the Speaker Independent Test Results.

DIGIT 0 1 2 3 4 5 6 7 8 9
% ACCURACY 87 88 82 78 79 84 85 81 78 87

• Case 2: Speaker dependent (one template per digit)
The system was tested on 10 samples of each word spoken by a
single speaker. The results are given in Table 6.2.

Table 6.2 Accuracy of the Speaker Dependent Test Results.

DIGIT 0 1 2 3 4 5 6 7 8 9
% ACCURACY 90 91 84 90 87 88 92 84 86 92

The accuracy on pre-recorded samples is higher than on real-time test
samples, and the accuracy on speaker-dependent samples is higher than on
speaker-independent samples.

Table 6.3 Confusion Matrix of the MFCC and DTW Recognition.

ek don teen char pach saha sat aath nau shunya Avg. %
ek 1 1 1 4 1 1 1 1 1 0 80
don 2 2 2 2 3 2 2 2 2 2 90
teen 3 3 3 3 9 3 3 2 2 2 80
char 4 4 5 4 4 4 4 6 4 4 80
pach 5 5 5 5 5 5 5 5 5 3 90
Saha 6 6 6 6 1 6 6 4 6 6 80
Sat 7 7 8 7 7 7 7 7 7 7 90
Aath 2 8 8 8 8 7 8 8 8 8 80
nau 9 9 4 9 9 5 9 9 9 9 80
shunya 0 0 0 0 5 0 0 2 0 0 80

Table 6.4 Confusion Matrix of the MFCC and HMM Recognition.

ek don teen char pach saha sat aath nau shunya Avg. %
ek 1 1 1 1 1 1 1 3 1 1 90
don 2 2 2 2 2 2 2 2 2 5 90
teen 3 3 3 3 3 3 1 3 3 3 90
char 4 3 4 4 4 4 4 4 8 4 80
pach 5 5 5 5 5 5 5 5 5 5 100
Saha 6 6 6 8 6 6 6 6 6 6 90
Sat 7 7 7 7 7 7 7 5 7 7 90
Aath 8 8 8 8 8 8 8 8 5 8 90
nau 9 9 9 9 9 7 9 9 9 9 90
shunya 0 0 0 7 0 0 0 5 0 0 80

Table 6.5 Comparison of Digit Recognition Accuracy Test Results (%).

Numeral DTW Accuracy HMM Accuracy
ek 80 90
don 90 90
teen 80 90
char 80 80
pach 90 100
Saha 80 90
Sat 90 90
Aath 80 90
nau 80 90
shunya 80 80
Average % 83% 89%
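The averages in the last row of Table 6.5 and the real-time accuracy reported in Section 6.2.2 can be checked with a few lines of Python. The numbers are copied from the paper; only the variable names are introduced here.

```python
# Per-numeral accuracy (%) from Table 6.5: (DTW, HMM).
accuracy = {
    "ek": (80, 90), "don": (90, 90), "teen": (80, 90), "char": (80, 80),
    "pach": (90, 100), "saha": (80, 90), "sat": (90, 90), "aath": (80, 90),
    "nau": (80, 90), "shunya": (80, 80),
}

dtw_mean = sum(d for d, _ in accuracy.values()) / len(accuracy)
hmm_mean = sum(h for _, h in accuracy.values()) / len(accuracy)

# Real-time test from Section 6.2.2: 24 correct out of 30 samples.
realtime_accuracy = 100.0 * 24 / 30
```

The computed means are 83% for DTW and 89% for HMM, matching the table, and the real-time accuracy works out to 80%.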

Experimentally, recognition accuracy is better for HMM than for DTW, but
the training procedure in DTW is much simpler and faster than in HMM.

Fig. 6.13 Recognition accuracy of the DTW and HMM.

Recognition of numerals takes longer with HMM than with DTW, because HMM has
to go through many states, iterations, and more mathematical modelling, so
DTW is preferred over HMM for real-time applications.

7. Conclusions and Future Scopes

7.1 Conclusions

• Despite the advances of recent decades, automatic speech recognition
(ASR) is still a challenging and difficult task [1].
• Mel Frequency Cepstral Coefficients (MFCCs), a non-parametric method for
modelling the human auditory perception system, are used for feature
extraction. The nonlinear sequence alignment known as Dynamic Time
Warping (DTW) is used for feature matching. Since voice signals tend to
have different temporal rates, the alignment is important for better
performance.
• This paper proposes that higher recognition rates can be achieved using
MFCC features with DTW, which is useful for numeral speech utterances of
varying duration.
• MFCC analysis provides a better recognition rate than LPC because it
operates on a logarithmic scale that resembles the human auditory system,
whereas LPC has uniform resolution over the frequency plane. This is
followed by pattern recognition. Since voice signals tend to have
different temporal rates, DTW is one of the methods that provide
non-linear alignment between two voice signals.
• Another method, HMM, which models the words statistically, is also
presented. Experimentally, recognition accuracy is better for HMM than
for DTW, but the training procedure in DTW is much simpler and faster
than in HMM.
• Recognition of numerals takes longer with HMM than with DTW, because HMM
has to go through many states, iterations, and more mathematical
modelling, so DTW is preferred over HMM for real-time applications.
• DTW is a cost-minimization matching technique in which a test signal
is stretched or compressed according to a reference template.
• The accuracy on pre-recorded samples is higher than on real-time test
samples, and the accuracy on speaker-dependent samples is higher than on
speaker-independent samples.

7.2 Future Scopes

• One key area for future work is large-vocabulary recognition and
improving the robustness of speech recognition performance [2].
• Another key area of research focuses on an opportunity rather than a
problem. It attempts to take advantage of the fact that in many
applications a large quantity of speech data is available, up to
millions of hours. It is too expensive to have humans transcribe such
large quantities of speech, so the research focus is on developing new
machine learning methods that can effectively use large quantities
of unlabeled data.
• A better understanding of human capabilities could be used to improve
machine recognition performance.
• Future work could address online speech summarization. The
majority of speech summarization research has focused on extracting the
most informative dialogue acts from recorded, archived data.
• Future work could also aim at reducing the time required for
recognition of numerals using HMM.


References

[1] G. Saon, M. Picheny, "Recent advances in conversational speech recognition using
convolutional and recurrent neural networks", IBM Journal of Research and Development,
Vol. 61, No. 4/5, 2017.
[2] Douglas O'Shaughnessy, "Automatic speech recognition", CHILEAN Conference on Electrical,
Electronics Engineering, Information and Communication Technologies (CHILECON), 2015,
DOI: 10.1109/Chilecon.2015.7400411.
[3] L. R. Rabiner, R. W. Schafer, "Digital Processing of Speech Signals", Low Price Edition, 2007.
[4] M. A. Anusuya, S. K. Katti, "Speech Recognition by Machine: A Review", International
Journal of Computer Science and Information Security (IJCSIS), Vol. 6, No. 3, 2009.
[5] Bharti W. Gawali, Santosh Gaikwad, Pravin Yannawar, Suresh C. Mehrotra, "Marathi
Isolated Word Recognition System using MFCC and DTW Features", Proc. of Int. Conf. on
Advances in Computer Science, 2010.
[6] Lindasalwa Muda, Mumtaj Begam, I. Elamvazuthi, "Voice Recognition Algorithms using
Mel Frequency Cepstral Coefficient (MFCC) and Dynamic Time Warping (DTW)
Techniques", Journal of Computing, Vol. 2, Issue 3, March 2010, ISSN 2151-9617.
[7] C. Vimala, V. Radha, "Speaker Independent Isolated Speech Recognition System for
Tamil Language using HMM", International Conference on Communication Technology and
System Design, 2011.
[8] Anjali Bala, Abhijeet Kumar, Nidhika Birla, "Voice Command Recognition System Based On
MFCC and DTW", International Journal of Engineering Science and Technology, Vol. 2 (12),
2010, 7335-7342.
[9] Hui Jiang, Xinwei Li, Chaojun Liu, "Large Margin Hidden Markov Models for Speech
Recognition", IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 5,
September 2006.