JOURNAL OF COMPUTERS (JCP)
ISSN : 1796-203X
Volume : 3    Issue : 1    Date : January 2008

Analysis and Improved Recognition of Protein Names Using Transductive SVM
Masaki Murata, Tomohiro Mitsumori, and Kouichi Doi
Page(s): 51-62

Abstract
We first analyzed protein names using various dictionaries and databases and identified five
problems: the treatment of special characters, the treatment of homonyms, cases where one
protein-name string is a substring of a different protein-name string, cases where the same
protein exists in different organisms, and the treatment of modifiers. We confirmed that a
machine-learning approach to protein-name recognition can address these problems. Indeed,
machine-learning methods have recently been used in research on protein-name recognition.
A classifier trained in a specific domain, however, can overfit and become so inflexible that it
can be used only in that domain. We therefore developed a new corpus on breast cancer and
investigated the flexibility of classifiers trained on either the GENIA corpus [1] or the
breast-cancer corpus. We
used a transductive support vector machine (SVM) to avoid overfitting and evaluated the effect of
transductive learning. In our experiments, the transductive SVM prevented overfitting and
yielded higher accuracy than the conventional SVM. Under the “Sub” criterion defined in this
paper, the transductive SVM raised the F-score from 70.46 to 79.64 and from 70.63 to 74.61 in
our two experiments.
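
To put the reported figures in context, the following minimal sketch computes the absolute F-score gains quoted above. It assumes the F-score is the standard balanced F-measure (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name f_score and the variable names are hypothetical, and the “Sub” matching criterion itself is defined in the paper and is not reproduced here.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# F-scores (in percent) quoted in the abstract, as (before, after) pairs
# for the two experiments; "after" is the transductive SVM.
reported = [(70.46, 79.64), (70.63, 74.61)]

# Absolute gain in F-score points for each experiment.
gains = [round(after - before, 2) for before, after in reported]

for (before, after), gain in zip(reported, gains):
    print(f"{before:.2f} -> {after:.2f}: +{gain:.2f} points")
```

Under these assumptions, the gains are about 9.2 and 4.0 F-score points for the two experiments, respectively.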

Index Terms
overfitting, protein name recognition, biomedical literature, SVM, transductive SVM, different domain