UUM Electronic Theses and Dissertation

Document clustering based on inverse document frequency measure

Wan Faridah Hanum, Wan Yaacob (2005) Document clustering based on inverse document frequency measure. Masters thesis, Universiti Utara Malaysia.

Abstract

Automatic classification techniques can provide the necessary organization of information by arranging retrieved data into groups of documents with common subjects. Recently, document clustering has been put forth as an alternative method of organizing retrieval results. It has been proposed for navigating and browsing document collections and for discovering hidden similarities and key concepts. It also summarizes a large number of documents using the key or common attributes of each cluster, and it can be used to categorize document databases. This paper describes several narrative clustering techniques, such as the Porter algorithm, the Gusfield algorithm, similarity based on document hierarchy, and Inverse Document Frequency (IDF), which intersect the documents in a cluster to determine the set of words (or phrases) shared by all the documents in the cluster. This study proposes document clustering based on IDF, which assumes that the importance of a keyword in calculating similarity measures is inversely proportional to the total number of documents that contain it. IDF is easy to understand, has a geometric interpretation, uses term weighting that has been shown to help clustering, allows partial matching, and returns ranked documents. An important finding of this study comes from 30 document cases tested with the IDF algorithm, whose results are divided into three categories: correct cluster, incorrect cluster, and unknown cluster.
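
The IDF idea described in the abstract can be illustrated with a minimal sketch. This record does not give the thesis's exact formula, so the code below assumes the common logarithmic form idf(t) = log(N / df(t)) and uses cosine similarity as the geometric interpretation; the function names and the three sample documents are hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Map each term to log(N / df), where df is the number of documents containing it.

    Assumption: the logarithmic IDF variant; the thesis may use a different form.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf_vector(tokens, idf):
    """Weight raw term counts by IDF, giving a sparse vector as a dict."""
    counts = Counter(tokens)
    return {term: tf * idf.get(term, 0.0) for term, tf in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy collection, pre-tokenized by whitespace.
docs = [
    "document clustering groups similar documents".split(),
    "clustering documents by shared keywords".split(),
    "stock market prices fell sharply".split(),
]
idf = idf_weights(docs)
vectors = [tfidf_vector(d, idf) for d in docs]

sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
```

In this sketch a term found in two of the three documents gets a lower weight than a term found in only one, and documents that share no terms have similarity zero, which is the partial-matching and ranking behaviour the abstract attributes to IDF.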

Item Type: Thesis (Masters)
Supervisor: Yusoff, Nooraini
Item ID: 1367
Uncontrolled Keywords: Data Retrieval, Document Organizations, Clustering Techniques, Porter Algorithm, Gusfield Algorithm, Inverse Document Frequency (IDF)
Subjects: H Social Sciences > HF Commerce > HF5001-6182 Business
Divisions: Faculty and School System > Faculty of Information Technology
Date Deposited: 09 Feb 2010 07:33
Last Modified: 12 Nov 2019 02:13
URI: https://etd.uum.edu.my/id/eprint/1367
