NA handling difference with dist #4

@fcasarramona

While pdist works great, it handles NAs differently from dist.

When calculating the distance between two vectors, dist not only ignores NAs but also scales the distance up to the full length of the vector.

The dist help file describes this behaviour:
Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA.
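
To make the scaling rule concrete, here is a minimal base-R sketch for the Euclidean case only. The helper name `scaled_euclid` is hypothetical (it is not part of pdist or stats); it just follows the help text quoted above: drop the incomplete pairs, scale the sum of squares by the total number of columns over the number used, then take the square root.

```r
# Hypothetical helper: Euclidean distance with dist()-style NA scaling.
scaled_euclid <- function(a, b) {
  ok <- !(is.na(a) | is.na(b))      # columns observed in both vectors
  if (!any(ok)) return(NA_real_)    # every pair excluded: the result is NA
  ss <- sum((a[ok] - b[ok])^2)      # sum of squares over the usable columns
  sqrt(ss * length(a) / sum(ok))    # scale the sum up, then take the root
}

a <- c(0, 3, NA, 1)
b <- c(4, 0, 2, NA)
scaled_euclid(a, b)
as.numeric(dist(rbind(a, b)))       # should agree with the line above
```

Because the sum is scaled before the square root is taken, the distance itself grows by the square root of the ratio of total columns to columns used.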

This example shows the difference in behaviour:

v1 <- c(1,2,3,4,NA)
v2 <- c(5,NA,4,3,5)
pdist(v1,v2)@dist
 [1] 4.24264
dist(rbind(v1,v2))
          v1
 v2 5.477226
# scaling by sqrt(5/3)
pdist(v1,v2)@dist * sqrt(length(v1)/(length(v1) - sum(is.na(v2) | is.na(v1))))
 [1] 5.477225

I wrote a function to compute the scaling, but it is way slower than dist:

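pdist.w.scale <- function(X, Y)
{
  if (!is.matrix(X))
    X <- as.matrix(X)
  if (!is.matrix(Y))
    Y <- as.matrix(Y)
  # distances as an nrow(X) x nrow(Y) matrix
  distances <- as.matrix(pdist(X, Y))
  # for each pair of rows, count the columns where either value is NA
  # (sweep compares row i of X against every row of Y column by column)
  na.count <- vapply(seq_len(nrow(X)), function(i) {
    rowSums(sweep(is.na(Y), 2, is.na(X[i, ]), "|"))
  }, numeric(nrow(Y)))
  na.count <- matrix(na.count, nrow = nrow(X), byrow = TRUE)
  # scale up to the full number of columns
  distances * sqrt(ncol(X) / (ncol(X) - na.count))
}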
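
Much of the slowdown probably comes from looping over the rows of X in R. As a rough sketch of an alternative, the number of usable columns for every pair of rows can be obtained with a single matrix product of the "observed" indicator matrices. The helper name `na_scale_factor` is hypothetical, and the sketch assumes the unscaled distances are arranged with rows for X and columns for Y, as `as.matrix()` on a pdist result appears to do.

```r
# Hypothetical helper (not part of pdist): dist()-style scale factors
# for every pair (row of X, row of Y), computed without an R-level loop.
na_scale_factor <- function(X, Y) {
  if (!is.matrix(X)) X <- matrix(X, nrow = 1)  # treat a vector as one row
  if (!is.matrix(Y)) Y <- matrix(Y, nrow = 1)
  stopifnot(ncol(X) == ncol(Y))
  obs_X <- 1 - is.na(X)                  # 1 where observed, 0 where NA
  obs_Y <- 1 - is.na(Y)
  n_used <- tcrossprod(obs_X, obs_Y)     # [i, j] = columns observed in both rows
  n_used[n_used == 0] <- NA              # no usable columns: factor is NA
  sqrt(ncol(X) / n_used)
}

v1 <- c(1, 2, 3, 4, NA)
v2 <- c(5, NA, 4, 3, 5)
na_scale_factor(v1, v2)                  # 1 x 1 matrix holding sqrt(5/3)
```

Multiplying this matrix elementwise with an unscaled, NA-ignoring Euclidean distance matrix of the same shape should reproduce the scaling that dist applies.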

It would be great if the scaling feature (by default or as an option) were incorporated into pdist.
