Description
While pdist works great, it handles NAs differently from dist.
When calculating the distance between two vectors, dist not only ignores NAs but also scales the result up to the full length of the vector.
The dist help file describes this behaviour:
Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA.
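For the Euclidean case, the scaling described in the help text can be written out as below. The symbols are introduced here for illustration only and do not come from the help file: $p$ is the total number of columns and $S$ is the set of columns where both values are present.

$$
d(x, y) = \sqrt{\frac{p}{|S|} \sum_{k \in S} (x_k - y_k)^2}
$$

A distance computed over the shared columns alone therefore needs to be multiplied by $\sqrt{p/|S|}$ to match dist. With 5 columns of which 3 are shared, that factor is $\sqrt{5/3}$.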
This example shows the difference in behaviour:
v1 <- c(1,2,3,4,NA)
v2 <- c(5,NA,4,3,5)
pdist(v1,v2)@dist
[1] 4.24264
dist(rbind(v1,v2))
         v1
v2 5.477226
# scaling by sqrt(5/3)
pdist(v1,v2)@dist * sqrt(length(v1)/(length(v1) - sum(is.na(v2) | is.na(v1))))
[1] 5.477225
I wrote a function to apply the scaling, but it is much slower than dist:
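library(pdist)

pdist.w.scale <- function(X, Y)
{
  # treat a plain vector as a single row, as pdist(v1, v2) does above
  if (!is.matrix(X))
    X <- matrix(X, nrow = 1)
  if (!is.matrix(Y))
    Y <- matrix(Y, nrow = 1)
  # reshape so that rows follow X and columns follow Y
  distances <- matrix(pdist(X, Y)@dist, nrow = nrow(X), byrow = TRUE)
  # count, for each pair of rows, the columns where either value is NA
  na.count <- ncol(X) - (!is.na(X)) %*% t(!is.na(Y))
  # scale up to the full number of columns
  distances * sqrt(ncol(X) / (ncol(X) - na.count))
}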
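As a point of comparison, here is a minimal base R sketch of the same scaling that does not call pdist at all. The name cross.dist.scaled is hypothetical and is not part of pdist or of base R. The sketch assumes Euclidean distance, treats a plain vector as a single row, and handles NA only (not the Inf cases mentioned in the dist help text). It relies on the expansion (x - y)^2 = x^2 - 2xy + y^2, which can lose some precision when values are large and close together.

```r
# Hypothetical helper, not part of pdist: Euclidean distances between the
# rows of X and the rows of Y, with dist()-style scaling for NAs.
cross.dist.scaled <- function(X, Y) {
  if (!is.matrix(X)) X <- matrix(X, nrow = 1)
  if (!is.matrix(Y)) Y <- matrix(Y, nrow = 1)
  stopifnot(ncol(X) == ncol(Y))

  okX <- !is.na(X)
  okY <- !is.na(Y)
  # replace NAs by 0 so they contribute nothing to the products below
  X0 <- X; X0[!okX] <- 0
  Y0 <- Y; Y0[!okY] <- 0

  # number of columns present in both rows, for every pair (i, j)
  n.used <- okX %*% t(okY)
  # sum over shared columns of (x - y)^2, expanded as x^2 - 2xy + y^2
  ss <- (X0^2) %*% t(okY) - 2 * X0 %*% t(Y0) + okX %*% t(Y0^2)
  ss <- pmax(ss, 0)  # guard against tiny negative rounding errors

  d <- sqrt(ss * ncol(X) / n.used)
  d[n.used == 0] <- NA  # no shared columns: distance is undefined
  d
}

v1 <- c(1, 2, 3, 4, NA)
v2 <- c(5, NA, 4, 3, 5)
cross.dist.scaled(v1, v2)  # should agree with dist(rbind(v1, v2))
```

The result has one row per row of X and one column per row of Y. Because everything is expressed as matrix products, there is no per-row loop, which is where most of the time in a sapply-based version tends to go.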
It would be great if the scaling feature (by default or as an option) were incorporated into pdist.