performance - In R, how to modify a column in a big dataframe based on rows in a small dataframe in an efficient way -

- March 15, 2013

I have a big data frame called big_set:

                                  hash is_in_small_set
    1 6694662834f3d2942ec4c6af20ab0520              NA
    2 265e53ecdb68d360890f9aa2d99c1ebe              NA
    3 0b7cd1f468c88de7c8bf822a77d4dc4d              NA
    # I have printed only the first 3 rows

and a small data frame called small_set:

                                  hash result
    1 703a4f40f24afe5baadb03412514048f      b
    2 d0cabfc660bf334524e758ef5c9774a4      a
    3 265e53ecdb68d360890f9aa2d99c1ebe      c

I need to fill the column big_set$is_in_small_set so that it says whether each hash also appears in small_set:

                                  hash is_in_small_set
    1 6694662834f3d2942ec4c6af20ab0520           FALSE
    2 265e53ecdb68d360890f9aa2d99c1ebe            TRUE
    3 0b7cd1f468c88de7c8bf822a77d4dc4d           FALSE

I have a working solution with 2 nested for-loops, but unfortunately it is too slow for my purposes, where nrow(big_set) is on the order of 10k and nrow(small_set) is on the order of 100.

    # Build a random hexadecimal string of the given length
    getrandstring <- function(len=32) return(paste(sample(c(0:9,c('a','b','c','d','e','f')),len,replace=TRUE),collapse=''))

    myfun <- function(big_sz) {
        big_set <- data.frame(hash=replicate(big_sz,getrandstring()))
        big_set$is_in_small_set <- NA

        # small_set holds a tenth of the hashes, drawn from big_set
        small_sz <- big_sz/10
        small_set <- data.frame(hash=sample(big_set$hash,small_sz,replace=FALSE),result=sample(c("a","b","c"),small_sz,replace=TRUE))

        big_rows <- seq(1,big_sz)
        small_rows <- seq(1,small_sz)

        # Compare every row of big_set with every row of small_set
        for (row_index_big in big_rows) {
            for (row_index_small in small_rows) {
                if (big_set[row_index_big,]$hash == small_set[row_index_small,]$hash) {
                    big_set[row_index_big,]$is_in_small_set = TRUE
                    break
                } else {
                    big_set[row_index_big,]$is_in_small_set = FALSE
                }
            }
        }
    }

    system.time(myfun(10))
    system.time(myfun(50))
    system.time(myfun(75))
    system.time(myfun(100))
    system.time(myfun(200))
    system.time(myfun(300))

The elapsed times, in the same order as the calls above:

       user  system elapsed
       0.01    0.00    0.01
       user  system elapsed
       0.13    0.00    0.13
       user  system elapsed
       0.25    0.01    0.27
       user  system elapsed
       0.51    0.00    0.52
       user  system elapsed
       2.74    0.00    2.75
       user  system elapsed
       7.65    0.00    7.64

I have no idea how to "vectorize" this code in order to speed it up.

As mentioned by Ananda in the comments, the typical approach (in base R) is to use the %in% operator, i.e.:

    big_set$is_in_small_set <- big_set$hash %in% small_set$hash

or, equivalently:

    big_set <- transform(big_set, is_in_small_set = hash %in% small_set$hash)

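That should speed up your code significantly, because the whole column is computed in a single vectorized call instead of a row-by-row comparison.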
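As a rough check, the following sketch builds data the same way the question does and confirms that %in% yields the same flags as a nested loop. The helper name get_rand_string, the sizes (1000 and 100), the fixed seed, and stringsAsFactors = FALSE are assumptions made for illustration; they are not part of the original post.

```r
# Assumption: fixed seed, only so that repeated runs behave the same
set.seed(42)

# Hypothetical helper mirroring the question's random hex-string generator
get_rand_string <- function(len = 32) {
  paste(sample(c(0:9, letters[1:6]), len, replace = TRUE), collapse = "")
}

# Assumed sizes for illustration: 1000 rows in big_set, 100 in small_set
big_set <- data.frame(hash = replicate(1000, get_rand_string()),
                      stringsAsFactors = FALSE)
small_set <- data.frame(hash = sample(big_set$hash, 100, replace = FALSE),
                        result = sample(c("a", "b", "c"), 100, replace = TRUE),
                        stringsAsFactors = FALSE)

# Reference result: a nested loop in the spirit of the question's code
loop_flags <- logical(nrow(big_set))  # starts as all FALSE
for (i in seq_len(nrow(big_set))) {
  for (j in seq_len(nrow(small_set))) {
    if (big_set$hash[i] == small_set$hash[j]) {
      loop_flags[i] <- TRUE
      break
    }
  }
}

# Vectorized version from the answer: one call fills the whole column
big_set$is_in_small_set <- big_set$hash %in% small_set$hash

# Both approaches should agree on every row
stopifnot(identical(loop_flags, big_set$is_in_small_set))
```

Wrapping each variant in system.time() on your own data is the simplest way to see how large the difference is for your sizes.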
