Vol: 51(65) No: 3 / September 2006

Supervised Term Cluster Creation for Document Clustering

Kristof Csorba
Department of Automation and Applied Informatics, Budapest University of Technology and Economics, 1111 Budapest, Goldmann Gy. Tér 3., Hungary, phone: +36 1 463-2870, e-mail: kristof@aut.bme.hu, web: http://www.aut.bme.hu/

Istvan Vajk
Department of Automation and Applied Informatics, Budapest University of Technology and Economics, 1111 Budapest, Goldmann Gy. Tér 3., Hungary, e-mail: vajk@aut.bme.hu

Keywords: term cluster creation, supervised learning, document clustering, confidence.

Abstract

This paper presents a new technique for supervised term cluster creation for document topic identification. The technique prioritizes avoiding misclassifications over selecting every document that belongs to the target topic. A document classification system operating this way may be useful in applications where not every document needs to be classified, but the allowed rate of misclassifications is strictly limited. The system tends to discard ambiguous documents by measuring the confidence of the topic assignment, which allows very high precision in the classification results.
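
The idea stated in the abstract, discarding ambiguous documents when the confidence of the topic assignment is low, can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not define the confidence measure, so the one used here (the margin between the best and second-best topic scores, normalised by the total score) is an assumption, and the term clusters, the function names `topic_scores`, `assign_topic` and `precision_and_coverage`, and the threshold value are all hypothetical.

```python
# Hypothetical term clusters: one set of indicative terms per topic.
# In a supervised setting these would be learned from labelled documents.
TERM_CLUSTERS = {
    "hockey": {"puck", "goalie", "rink", "ice"},
    "space": {"orbit", "launch", "rocket", "satellite"},
}


def topic_scores(tokens, clusters=TERM_CLUSTERS):
    """Count how many tokens of a document fall into each topic's term cluster."""
    return {
        topic: sum(1 for token in tokens if token in terms)
        for topic, terms in clusters.items()
    }


def assign_topic(tokens, threshold=0.5, clusters=TERM_CLUSTERS):
    """Return (topic, confidence); topic is None if the document is discarded.

    Assumed confidence measure: (best score - second-best score) / total score.
    """
    scores = topic_scores(tokens, clusters)
    total = sum(scores.values())
    if total == 0:
        # No cluster term occurs in the document: nothing to base a decision on.
        return None, 0.0
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_topic, best_score = ranked[0]
    second_score = ranked[1][1] if len(ranked) > 1 else 0
    confidence = (best_score - second_score) / total
    if confidence < threshold:
        # Ambiguous document: discard it rather than risk a misclassification.
        return None, confidence
    return best_topic, confidence


def precision_and_coverage(predictions, labels):
    """Precision over the assigned documents, and the fraction of documents assigned."""
    assigned = [(p, y) for p, y in zip(predictions, labels) if p is not None]
    coverage = len(assigned) / len(labels) if labels else 0.0
    if not assigned:
        return 0.0, coverage
    precision = sum(1 for p, y in assigned if p == y) / len(assigned)
    return precision, coverage
```

Raising `threshold` discards more documents, which lowers coverage but removes the assignments most likely to be wrong. This is the trade-off the abstract describes: a strictly limited misclassification rate in exchange for leaving some documents unclassified.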