NA handling difference with `dist`

While `pdist` works great, it handles NAs differently from `dist`.

When calculating the distance between two vectors, `dist` not only ignores NAs but also scales the result up in proportion to the number of columns actually used, whereas `pdist` returns the unscaled distance.

The `dist` help file describes this behaviour:
_Missing values are allowed, and are excluded from all computations involving the rows within which they occur. Further, when Inf values are involved, all pairs of values are excluded when their contribution to the distance gave NaN or NA. If some columns are excluded in calculating a Euclidean, Manhattan, Canberra or Minkowski distance, the sum is scaled up proportionally to the number of columns used. If all pairs are excluded when calculating a particular distance, the value is NA._

Here is an example of the difference in behaviour:

```
v1 <- c(1,2,3,4,NA)
v2 <- c(5,NA,4,3,5)
pdist(v1,v2)@dist
 [1] 4.24264
dist(rbind(v1,v2))
          v1
 v2 5.477226
# scaling by sqrt(5/3)
pdist(v1,v2)@dist * sqrt(length(v1)/(length(v1) - sum(is.na(v2) | is.na(v1))))
 [1] 5.477225
```
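
The `sqrt(5/3)` factor in the example follows from the help text quoted above: the sum is scaled by (total columns / columns used) before the square root is taken. A minimal base-R sketch of that rule, using made-up vectors `a` and `b` and illustrative variable names, and checking both the Euclidean and the Manhattan case against `dist`:

```r
# Illustrative vectors (not from the issue): 4 columns, 2 usable in both
a <- c(2, NA, 6, 1)
b <- c(5, 4, NA, 3)

used <- !is.na(a) & !is.na(b)        # columns where neither value is NA
scale.up <- length(a) / sum(used)    # total columns / columns used

# Euclidean: the sum of squares is scaled, then the square root is taken
euclid <- sqrt(sum((a[used] - b[used])^2) * scale.up)
# Manhattan: the sum of absolute differences is scaled directly
manhat <- sum(abs(a[used] - b[used])) * scale.up

# Both should agree with dist
all.equal(euclid, as.numeric(dist(rbind(a, b))))
all.equal(manhat, as.numeric(dist(rbind(a, b), method = "manhattan")))
```

So for Euclidean distances the factor applied to the unscaled result is the square root of the column ratio, while for Manhattan it is the ratio itself.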

I wrote a function that applies the scaling, but it is much slower than `dist`:

```
pdist.w.scale <- function(X,Y)
{
  if (!is.matrix(X)) 
    X = as.matrix(X)
  if (!is.matrix(Y)) 
    Y = as.matrix(Y)
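  # assumes pdist stores @dist row-wise (X[1,] against every row of Y first),
  # so filling by column gives one column per row of X
  distances <- matrix(pdist(X,Y)@dist, ncol=nrow(X))
  #count NAs: columns where X[i,] or the given row of Y is NA
  #(t(Y) so that X[i,] is recycled along the columns of Y, not down its rows)
  na.count <- sapply(1:nrow(X),function(i){colSums(is.na(t(Y)) | is.na(X[i,]))})
  #scaling to number of cols
  distances * sqrt(ncol(X)/(ncol(X) - na.count))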
}
```
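
As a possible alternative to the per-row `sapply` loop, the count of jointly observed columns for every pair of rows can be obtained with a single matrix product. The sketch below is base R only and does not call `pdist`; the helper name `cross.dist.scaled` is hypothetical, it covers the Euclidean case only, and it has not been benchmarked against `dist` or `pdist`:

```r
# Hypothetical helper: dist-style scaled Euclidean distances between
# every row of X and every row of Y, skipping NAs pairwise.
cross.dist.scaled <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  stopifnot(ncol(X) == ncol(Y))
  obs.X <- ifelse(is.na(X), 0, 1)   # 1 where a value is observed
  obs.Y <- ifelse(is.na(Y), 0, 1)
  X0 <- ifelse(is.na(X), 0, X)      # NAs set to 0 so they drop out of the sums
  Y0 <- ifelse(is.na(Y), 0, Y)
  # n.used[i, j]: columns observed in both X[i, ] and Y[j, ]
  n.used <- tcrossprod(obs.X, obs.Y)
  # sum of (x - y)^2 over those columns, expanded as x^2 - 2xy + y^2
  sq <- tcrossprod(X0^2, obs.Y) - 2 * tcrossprod(X0, Y0) + tcrossprod(obs.X, Y0^2)
  sq <- pmax(sq, 0)                 # guard against tiny negative rounding errors
  out <- sqrt(sq * ncol(X) / n.used)
  out[n.used == 0] <- NA            # no usable columns: NA, as dist reports
  out
}

# The vectors from the example above, passed as one-row matrices
v1 <- c(1, 2, 3, 4, NA)
v2 <- c(5, NA, 4, 3, 5)
cross.dist.scaled(rbind(v1), rbind(v2))   # should match dist(rbind(v1, v2))
```

The result has one row per row of `X` and one column per row of `Y`. The same `n.used` matrix could also be used on its own to scale distances that `pdist` has already computed.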

It would be great if the scaling were incorporated into `pdist`, either by default or as an option.

