python - Unreliable performance downloading files from S3 with boto and multiprocessing.Pool


I want to download thousands of files from S3. To speed up the process I tried out Python's multiprocessing.Pool, but the performance is unreliable. Sometimes it works and it's faster than the single-core version, but often some files take several seconds, so that the multiprocessing run takes longer than the single-process one. A few times I even get an ssl.SSLError: The read operation timed out.

What could be the reason for that?

    from time import time
    from boto.s3.connection import S3Connection
    from boto.s3.key import Key
    from multiprocessing import Pool
    import pickle

    access_key = 'xxx'
    secret_key = 'xxx'
    bucket_name = 'xxx'

    path_list = pickle.load(open('filelist.pickle', 'rb'))
    conn = S3Connection(access_key, secret_key)
    bucket = conn.get_bucket(bucket_name)


    def read_file_from_s3(path):
        starttime = time()
        k = Key(bucket)
        k.key = path
        content = k.get_contents_as_string()
        # Print the elapsed time for this file in milliseconds.
        print(int((time() - starttime) * 1000))
        return content


    # Create the pool after read_file_from_s3 is defined so that forked
    # workers can look the function up by name.
    pool = Pool(32)
    results = pool.map(read_file_from_s3, path_list)
    # or results = map(read_file_from_s3, path_list) for a single-process comparison
    pool.close()
    pool.join()

[Update] I ended up adding timeouts with retry (imap + .next(timeout)) to my multiprocessing code, but only because I did not want to change too much at the moment. If you want to do it right, use Jan-Philip's approach using gevent.

"What could be the reason for that?"

Not enough detail. One reason could be that your private Internet connection is starved by too many concurrent connections. But since you did not specify in which environment you execute this piece of code, this is pure speculation.

What is not speculation, however, is that your approach to this problem is inefficient. multiprocessing is meant for solving CPU-bound problems. Retrieving data via multiple TCP connections at once is not a CPU-bound problem. Spawning one process per TCP connection is a waste of resources.

The reason this seems slow is that, in your case, each process spends a lot of time waiting for system calls to return. The operating system, in turn, spends a lot of time waiting for the networking module to do what it was told, and the networking component spends a lot of time waiting for packets to arrive on the wire.
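To make the "mostly waiting" point concrete, here is a small sketch that compares wall-clock time with CPU time for a blocking call. The helper names measure and waiting_task are hypothetical, and time.sleep stands in for a blocking socket read, which is an assumption made so the sketch runs without a network.

```python
import time


def measure(task):
    """Return (wall_seconds, cpu_seconds) spent while running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def waiting_task():
    # Stand-in for blocking on a network read: the process is idle here.
    time.sleep(0.2)


if __name__ == "__main__":
    wall, cpu = measure(waiting_task)
    print("wall: %.3f s, cpu: %.3f s" % (wall, cpu))
```

If the CPU time is a small fraction of the wall-clock time, the work is dominated by waiting, and adding more CPU cores does not address the bottleneck.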

You do not need multiple processes to make your computer spend less time waiting. You do not even need multiple threads. You can pull data from many TCP connections within a single OS-level thread, using cooperative scheduling. In Python, this is often done using greenlets. A higher-level module that makes use of greenlets is gevent.

The web is full of gevent-based examples for firing off many HTTP requests concurrently. Given a proper Internet connection, a single OS-level thread can deal with hundreds, thousands, or tens of thousands of concurrent connections. At these orders of magnitude, the problem becomes either I/O-bound or CPU-bound, depending on the exact purpose of your application. That is, either the network connection, the CPU-memory bus, or a single CPU core limits your application.
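The single-thread idea can be sketched without any third-party package. Note that this uses the standard library's asyncio rather than gevent: the syntax differs, but it is the same cooperative-scheduling principle. fake_download is a hypothetical stand-in for a network read (it only sleeps), so the numbers say nothing about real S3 throughput.

```python
import asyncio
import threading
import time


async def fake_download(path, delay=0.05):
    """Hypothetical stand-in for one network read; awaiting yields control."""
    await asyncio.sleep(delay)
    return path, threading.get_ident()


async def download_all(paths):
    # Start all "downloads" at once and wait for every one to finish.
    return await asyncio.gather(*(fake_download(p) for p in paths))


def run(count=1000):
    """Return (number of results, number of distinct threads, elapsed seconds)."""
    paths = ["file-%d" % i for i in range(count)]
    start = time.perf_counter()
    results = asyncio.run(download_all(paths))
    elapsed = time.perf_counter() - start
    thread_ids = {tid for _, tid in results}
    return len(results), len(thread_ids), elapsed


if __name__ == "__main__":
    n, threads, elapsed = run()
    print("%d downloads on %d thread(s) in %.2f s" % (n, threads, elapsed))
```

All simulated downloads wait at the same time, so the total run time stays far below the sum of the individual delays, and every result reports the same thread ID.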

Regarding errors like ssl.SSLError: The read operation timed out: in the world of networking, you have to expect such things to happen from time to time and decide (depending on the details of your application) how you want to deal with them. Often, a simple retry is a good solution.
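A simple retry could look roughly like the sketch below. with_retries and flaky_download are hypothetical names, the attempt count and delay are arbitrary assumptions, and flaky_download merely simulates a download that times out twice before succeeding.

```python
import socket
import ssl
import time


def with_retries(func, attempts=3, delay=0.1):
    """Wrap func so that timeout-like errors trigger another attempt."""
    def wrapper(*args, **kwargs):
        for attempt in range(1, attempts + 1):
            try:
                return func(*args, **kwargs)
            except (ssl.SSLError, socket.timeout):
                if attempt == attempts:
                    raise  # out of attempts: let the caller decide
                time.sleep(delay * attempt)  # wait a little longer each time
    return wrapper


calls = {"count": 0}


def flaky_download(path):
    """Hypothetical download that times out on its first two calls."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ssl.SSLError("The read operation timed out")
    return "content of " + path


download = with_retries(flaky_download, attempts=3, delay=0.01)

if __name__ == "__main__":
    print(download("some/key.txt"))
```

Whether retrying is appropriate, and how many attempts to allow, depends on the application; the final error is re-raised here so the caller can still decide.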

