performance - In R, how to efficiently modify a column in a big data frame based on rows in a small data frame
I have a big data frame called big_set:

                                  hash is_in_small_set
    1 6694662834f3d2942ec4c6af20ab0520              NA
    2 265e53ecdb68d360890f9aa2d99c1ebe              NA
    3 0b7cd1f468c88de7c8bf822a77d4dc4d              NA
    # only the first 3 rows are printed
and a small data frame called small_set:

                                  hash result
    1 703a4f40f24afe5baadb03412514048f      b
    2 d0cabfc660bf334524e758ef5c9774a4
    3 265e53ecdb68d360890f9aa2d99c1ebe      c
I need to fill the column big_set$is_in_small_set like this:

                                  hash is_in_small_set
    1 6694662834f3d2942ec4c6af20ab0520           FALSE
    2 265e53ecdb68d360890f9aa2d99c1ebe            TRUE
    3 0b7cd1f468c88de7c8bf822a77d4dc4d           FALSE
I have a working solution with 2 nested for-loops, but unfortunately it is too slow for my purposes, where nrow(big_set) is around 10k and nrow(small_set) is around 100.
    # Build a random hex string of the given length (stands in for a hash).
    getrandstring <- function(len = 32) {
      paste(sample(c(0:9, c('a', 'b', 'c', 'd', 'e', 'f')), len, replace = TRUE), collapse = '')
    }

    myfun <- function(big_sz) {
      big_set <- data.frame(hash = replicate(big_sz, getrandstring()))
      big_set$is_in_small_set <- NA
      # small_set holds a random tenth of the hashes from big_set
      small_sz <- big_sz / 10
      small_set <- data.frame(hash = sample(big_set$hash, small_sz, replace = FALSE),
                              result = sample(c("a", "b", "c"), small_sz, replace = TRUE))
      big_rows <- seq(1, big_sz)
      small_rows <- seq(1, small_sz)
      # Compare every row of big_set against every row of small_set
      for (row_index_big in big_rows) {
        for (row_index_small in small_rows) {
          if (big_set[row_index_big, ]$hash == small_set[row_index_small, ]$hash) {
            big_set[row_index_big, ]$is_in_small_set = TRUE
            break
          } else {
            big_set[row_index_big, ]$is_in_small_set = FALSE
          }
        }
      }
    }

    system.time(myfun(10))
    system.time(myfun(50))
    system.time(myfun(75))
    system.time(myfun(100))
    system.time(myfun(200))
    system.time(myfun(300))
The elapsed times, in the order of the calls above:

    call         user  system  elapsed
    myfun(10)    0.01    0.00     0.01
    myfun(50)    0.13    0.00     0.13
    myfun(75)    0.25    0.01     0.27
    myfun(100)   0.51    0.00     0.52
    myfun(200)   2.74    0.00     2.75
    myfun(300)   7.65    0.00     7.64
I have no idea how to "vectorize" this code in order to speed it up.
As Ananda mentioned in the comments, the typical approach (in base R) is to use the %in% function, i.e.:
    big_set$is_in_small_set <- big_set$hash %in% small_set$hash

or

    big_set <- transform(big_set, is_in_small_set = hash %in% small_set$hash)
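That should speed up the code significantly, because it replaces the row-by-row data frame indexing of the nested loops with a single vectorized lookup.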
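To check the %in% approach against the expected output, here is a minimal sketch that rebuilds the three example rows from the question and then repeats the lookup at the sizes mentioned (10k rows against 100). The `result` column is left out because the membership test does not use it, and the names `big` and `small` as well as the fixed seed are only assumptions made for this illustration.

```r
# Example rows copied from the question; the result column is omitted
# because the membership test only needs the hashes.
big_set <- data.frame(
  hash = c("6694662834f3d2942ec4c6af20ab0520",
           "265e53ecdb68d360890f9aa2d99c1ebe",
           "0b7cd1f468c88de7c8bf822a77d4dc4d"),
  stringsAsFactors = FALSE
)
small_set <- data.frame(
  hash = c("703a4f40f24afe5baadb03412514048f",
           "d0cabfc660bf334524e758ef5c9774a4",
           "265e53ecdb68d360890f9aa2d99c1ebe"),
  stringsAsFactors = FALSE
)

# One vectorized call instead of two nested loops.
big_set$is_in_small_set <- big_set$hash %in% small_set$hash
print(big_set$is_in_small_set)
# FALSE TRUE FALSE

# Same idea at the sizes from the question (hypothetical random data).
set.seed(1)
big <- data.frame(
  hash = replicate(10000, paste(sample(c(0:9, letters[1:6]), 32, replace = TRUE),
                                collapse = "")),
  stringsAsFactors = FALSE
)
small <- data.frame(hash = sample(big$hash, 100), stringsAsFactors = FALSE)

# Time only the lookup itself, not the data generation.
print(system.time(big$is_in_small_set <- big$hash %in% small$hash))
```

Most of the run time here goes into generating the random hashes. The timed %in% line is the only part that corresponds to the nested loops in the question.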