In recent years, statistical or machine-learning methods for natural language processing have become increasingly prevalent. The most successful methods have generally used supervised training: a human annotator marks examples (for example, part-of-speech tag sequences, parse tree structures, or named entities in text), which are then used to train a model that recovers similar structures on test data.
Unfortunately, manual labeling of data can be laborious and may simply not be feasible in some domains. In this talk I will discuss methods that reduce the need for supervision through the use of unlabeled examples.
I will discuss an approach to building a proper-name classifier. The task is to learn a function from an input string (a proper name) to its type, which we will assume to be one of the categories Person, Organization, or Location. For example, a good classifier would identify "Mrs. Frank" as a person, "Steptoe & Johnson" as an organization, and "Honduras" as a location.
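To make the input/output contract concrete, here is a minimal Python sketch; the function name, label strings, and type alias are illustrative, not anything fixed by the talk:

    from typing import Literal

    Category = Literal["PERSON", "ORGANIZATION", "LOCATION"]

    def classify(name: str) -> Category:
        """Hypothetical learned classifier. A good one would map:
            "Mrs. Frank"        -> "PERSON"
            "Steptoe & Johnson" -> "ORGANIZATION"
            "Honduras"          -> "LOCATION"
        """
        raise NotImplementedError  # the function to be learned from data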
At first glance, the problem seems quite complex: a large number of rules is needed to cover the domain, suggesting that a large number of labeled examples is required to train an accurate classifier. But we will show that the use of unlabeled data can drastically reduce the need for supervision. Given around 90,000 unlabeled examples, the methods described classify names with over 91% accuracy. The only supervision is in the form of 7 seed rules (namely, that New York, California and U.S. are locations; that any name containing Mr. is a person; that any name containing Incorporated is an organization; and that I.B.M. and Microsoft are organizations).
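For concreteness, those seven seed rules can be written down directly. The encoding below (full-string match vs. contains-token features) is an illustrative assumption that follows the wording of the rules, not a representation specified in the talk:

    # The 7 seed rules, encoded as feature -> label.
    SEED_RULES = {
        ("full-string", "New York"):     "LOCATION",
        ("full-string", "California"):   "LOCATION",
        ("full-string", "U.S."):         "LOCATION",
        ("contains",    "Mr."):          "PERSON",
        ("contains",    "Incorporated"): "ORGANIZATION",
        ("full-string", "I.B.M."):       "ORGANIZATION",
        ("full-string", "Microsoft"):    "ORGANIZATION",
    }

Rules this sparse leave nearly all of the 90,000 examples unlabeled at the start; the point of the algorithms below is to grow the rule set from the unlabeled data itself.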
I will present two algorithms. The first method uses an algorithm similar to that of Yarowsky 95, with modifications motivated by Blum and Mitchell 98. The second extends ideas from boosting algorithms, originally designed for supervised learning tasks, to the framework suggested by Blum and Mitchell 98.
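The abstract leaves the algorithms' details to the talk, but the first belongs to the self-training family, so a rough sketch may help fix ideas. Everything here is a simplification under stated assumptions: the feature extractor mirrors the seed-rule encoding above, candidate rules are scored by simple label purity, and, most importantly, the two-view alternation motivated by Blum and Mitchell 98 (e.g., between spelling and context features) is omitted:

    def features(name):
        # Hypothetical feature extractor matching the seed-rule encoding:
        # the full string plus a "contains" feature for each token.
        return [("full-string", name)] + [("contains", t) for t in name.split()]

    def bootstrap(unlabeled, seed_rules, rounds=5, keep_per_round=10):
        """Yarowsky-style bootstrapping, heavily simplified."""
        rules = dict(seed_rules)  # feature -> label; starts at the seeds
        for _ in range(rounds):
            counts = {}  # feature -> {label: co-occurrence count}
            for name in unlabeled:
                feats = features(name)
                labels = {rules[f] for f in feats if f in rules}
                if len(labels) != 1:
                    continue  # skip uncovered or conflicting examples
                label = labels.pop()
                for f in feats:
                    counts.setdefault(f, {}).setdefault(label, 0)
                    counts[f][label] += 1
            # Promote the most label-pure features not yet used as rules.
            def purity(f):
                return max(counts[f].values()) / sum(counts[f].values())
            new = sorted((f for f in counts if f not in rules),
                         key=purity, reverse=True)
            for f in new[:keep_per_round]:
                rules[f] = max(counts[f], key=counts[f].get)
        return rules

Given a list of name strings, bootstrap(names, SEED_RULES) would return an enlarged rule set that a classifier like the one sketched earlier could consult. The second, boosting-based algorithm would replace the promotion step with weighted updates in the style of AdaBoost, but a faithful sketch would need details the abstract does not give.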
This is joint work with Yoram Singer.