Comparison of Distance Metrics for Hierarchical Data in Medical Databases
2014-07
Proceedings of the 2014 World Congress on Computational Intelligence (WCCI 2014)
Distance metrics are widely used in many research areas and
applications, such as bioinformatics and data mining. Some
metrics, such as pq-gram and edit distance, are designed
specifically for data with a hierarchical structure. Others, such
as the geometric and Hamming metrics, are used for
non-hierarchical data. We have applied
these metrics to The Health Improvement Network (THIN)
database, which contains some hierarchical data. For the first
group of metrics, the THIN data must be converted into a tree-like
structure; for the second group, the data are converted into a
frequency table or matrix. Distances are then computed and
normalised for all metrics. For this particular data set, our
research question is: which of these metrics is useful for THIN
data? This paper compares the metrics, particularly the pq-gram
metric, in finding similarities between patients’ data. It
also investigates the similar patients that have the same close
distances, as well as the metrics' suitability for clustering the
whole patient population. Our results show that the two groups of
metrics perform differently, as they represent different structures
of the data. Nevertheless, all the metrics could capture some
similarity between patients' data and discriminate sufficiently
well when clustering the patient population with the k-means
clustering algorithm.
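
As a rough illustration of the second (non-hierarchical) group of metrics, the sketch below builds a frequency table from toy patient records and computes normalised geometric (Euclidean) and Hamming distances. The patient identifiers, the codes, the presence/absence form of the Hamming distance, and normalisation by the largest pairwise distance are all assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: each patient maps to the codes recorded for them.
patients = {
    "p1": ["A1", "A1", "B2"],
    "p2": ["A1", "B2", "B2"],
    "p3": ["C3"],
}

def frequency_table(records):
    """Return (sorted codes, {patient: count vector}) for the given records."""
    codes = sorted({c for recs in records.values() for c in recs})
    table = {pid: [recs.count(c) for c in codes] for pid, recs in records.items()}
    return codes, table

def euclidean(u, v):
    """Geometric distance between two count vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hamming(u, v):
    """Number of codes present for one patient but absent for the other (assumed form)."""
    return sum((a > 0) != (b > 0) for a, b in zip(u, v))

def normalised_distances(table, metric):
    """Pairwise distances divided by the largest distance (assumed normalisation)."""
    raw = {(p, q): metric(table[p], table[q])
           for p, q in combinations(sorted(table), 2)}
    largest = max(raw.values()) or 1  # avoid dividing by zero if all distances are 0
    return {pair: d / largest for pair, d in raw.items()}

codes, table = frequency_table(patients)
geo = normalised_distances(table, euclidean)
ham = normalised_distances(table, hamming)
```

The resulting dictionaries map each patient pair to a distance in [0, 1], so values from different metrics can be compared on a common scale or passed on to a clustering step.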