# Mathematical Perspectives on Neural Networks

The next two theorems give sufficient conditions for an estimator to be CFC; the proofs are omitted.

Theorem 17.A1. If q ∈ ℙ, a > 0, b ≤ 0, c ≤ 0, and if a + b + c > 0, then

where θ₀ is defined by . In other words, under these conditions θ̂(a, b, c) is not only CFC but also consistent in the usual sense.

We say that q ∈ ℙ+ if there is a θ ∈ Θ such that pθ defines the correct channels, that is, if both pθ(x|y) ≡ q(x|y) and pθ(y|x) ≡ q(y|x). Obviously, q ∈ ℙ implies q ∈ ℙ+; the converse, however, is false.

Theorem 17.A2. If q ∈ ℙ+, a > 0, b ≤ 0, c < 0, and if a + b + c = 0

In other words, θ̂(a, b, c) is CFC under these conditions also.

Combining these theorems, we see that if q ∈ ℙ and if a + b + c ≥ 0 with b ≤ 0, c < 0, then θ̂(a, b, c) is CFC. Thus, the MLE, the CMLE, and θ̂(2, -1, -1) are CFC, but the result does not cover the MMIE, since 1 - 1 - 1 < 0. Indeed, in the example of the previous section the MMIE criterion

is an unbounded function of θ for which the corresponding estimator θ̂(1, -1, -1) fails to be CFC. All examples of this type known to us, however, involve an unspecified language model; that is, we have no such examples for the CML version of the MMI criterion. In large-vocabulary natural-language speech recognition, the CMLE is used (not the MMIE). The static portion of the speaker-independent language model is first trained (by a combination of maximum likelihood and smoothing) on hundreds of millions of words of text and is fixed thereafter. The static portion of the speaker-dependent acoustic channel model is then trained by either MLE or CMLE on utterances based on approximately a thousand words.
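The coefficient conditions of Theorems 17.A1 and 17.A2 can be checked mechanically for any triple (a, b, c). The following Python sketch is an illustration only, not part of the original development. The function name `covering_theorem` is hypothetical. The triples (2, -1, -1) and (1, -1, -1) are taken from the text, whereas (1, 0, 0) for the MLE and (1, 0, -1) for the CMLE are assumptions made here, since the defining criterion is not restated in this appendix. The check covers only the conditions on a, b, c; the requirement on q (q ∈ ℙ or q ∈ ℙ+) has to be verified separately.

```python
from typing import Optional


def covering_theorem(a: float, b: float, c: float) -> Optional[str]:
    """Return the theorem whose coefficient conditions (a, b, c) satisfy, if any.

    Only the conditions on a, b, c are checked. The condition on q
    (q in P for 17.A1, q in P+ for 17.A2) is not checked here.
    """
    # Both theorems require a > 0, b <= 0, c <= 0.
    if not (a > 0 and b <= 0 and c <= 0):
        return None
    total = a + b + c
    if total > 0:
        return "17.A1"
    # Theorem 17.A2 needs a + b + c = 0 together with the strict inequality c < 0.
    # Exact comparison is intended for integer (or exactly representable) inputs.
    if total == 0 and c < 0:
        return "17.A2"
    return None


# (2, -1, -1) and (1, -1, -1) appear in the text.
# ASSUMPTION: the MLE is (1, 0, 0) and the CMLE is (1, 0, -1).
ESTIMATORS = {
    "MLE (assumed triple)": (1, 0, 0),
    "CMLE (assumed triple)": (1, 0, -1),
    "theta_hat(2, -1, -1)": (2, -1, -1),
    "MMIE": (1, -1, -1),
}

for name, abc in ESTIMATORS.items():
    print(f"{name}: {abc} -> {covering_theorem(*abc)}")
```

Under these assumed triples, the MMIE is the only one of the four that neither theorem covers, in agreement with the remark that 1 - 1 - 1 < 0.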


