Text information extraction using a cluster-based hidden Markov model

doi:10.1201/9781315375120-75

ABSTRACT

To improve the accuracy of text information extraction, this paper proposes a method that uses the fuzzy c-means clustering algorithm to determine the initial parameters of a hidden Markov model. First, the documents are represented as vectors and grouped according to their similarity using fuzzy c-means, a “soft” clustering algorithm. Because the initial cluster centers are uncertain, a bottom-up hierarchical clustering step is used to determine the number of clusters for fuzzy c-means. The resulting categories are then treated as the hidden states of the hidden Markov model, from which the initial parameters of the model are obtained. With these initial parameters, the Baum–Welch algorithm is used to train the model on samples and the Viterbi algorithm is used for decoding. A comparison of the $F_{\beta=1}$ values for each extraction domain shows that the method effectively improves the accuracy of the extraction results.
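The following is a minimal sketch of how fuzzy c-means memberships might be turned into initial hidden Markov model parameters. It is not the implementation used in the paper: the function names (`memberships`, `fuzzy_c_means`, `initial_hmm_params`), the fuzzifier `m = 2`, the fixed iteration count, the hand-picked starting centers, and the use of soft transition counts between consecutive items are all assumptions made for illustration. The bottom-up hierarchical step that selects the number of clusters is not shown.

```python
import math


def memberships(points, centers, m=2.0, eps=1e-9):
    """Standard FCM membership update: u[i][k] is proportional to d(x_i, c_k)^(-2/(m-1))."""
    u = []
    for x in points:
        # Clamp distances so a point sitting exactly on a center does not divide by zero.
        d = [max(math.dist(x, c), eps) for c in centers]
        w = [dk ** (-2.0 / (m - 1.0)) for dk in d]
        total = sum(w)
        u.append([wk / total for wk in w])
    return u


def fuzzy_c_means(points, centers, m=2.0, iters=50):
    """Alternate membership and center updates for a fixed number of iterations."""
    dim = len(points[0])
    for _ in range(iters):
        u = memberships(points, centers, m)
        new_centers = []
        for k in range(len(centers)):
            # Each center is the mean of all points weighted by u^m.
            w = [row[k] ** m for row in u]
            total = sum(w)
            new_centers.append(tuple(
                sum(w[i] * points[i][j] for i in range(len(points))) / total
                for j in range(dim)
            ))
        centers = new_centers
    return centers, memberships(points, centers, m)


def initial_hmm_params(u):
    """Assumed initialization: clusters act as hidden states.

    pi is the membership row of the first item in the sequence;
    A is built from soft transition counts between consecutive items.
    """
    n_states = len(u[0])
    pi = list(u[0])
    counts = [[0.0] * n_states for _ in range(n_states)]
    for prev, nxt in zip(u, u[1:]):
        for j in range(n_states):
            for k in range(n_states):
                counts[j][k] += prev[j] * nxt[k]
    A = [[c / sum(row) for c in row] for row in counts]
    return pi, A


# Toy 2-D "document vectors" in reading order (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
centers, u = fuzzy_c_means(points, [(0.0, 0.0), (5.0, 5.0)])
pi, A = initial_hmm_params(u)
```

In this sketch `pi` and `A` would serve only as starting values; Baum–Welch re-estimation would then refine them on the training samples, and Viterbi decoding would assign a state to each item.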