hashtable - hash table suffix tree explanation -


i asking here because couldn't find answer looking elsewhere , don't know else ask this. hope can reply without saying question irrelevant forum. have biology background , using bioinformatics. need understand in lay language hash tables , suffix trees. simple, don't o(n) concepts , stuff, think both kind of same: way store string data? understand better differences. enormously other people me. lot in field now!

thanks in advance.

ok, lets use bioinformatics illustrate differences.

let's have several dna sequences pretty long. if want store these sequences in datastructure.

if want use hashtable

a hashtable useful way store bunch of objects search datastructure see if contain particular object.

one bioinformatics usecase can solve hashtable de-duping large sequence set. let's have huge dataset of next-gen sequenced data , want de-duplicate before assemble. can use hashtable store unique sequences. before inserting sequences hashtable, can first check see if exists in hashtable , if skip read. if not yet in hashtable add it. when done elements in hash unique sequences.

hashtables array of linkedlists. each cell in array call "bin". when insert or search in hashtable, have first know bin in. way determine bin use hash algorithm.

  1. we have come hash algorithm. convert our sequence number. requirement of equation same sequence must evaluate same number. it's ok if different sequences evaluate same number (which called hash collision) since there infinite number of possible sequences , have limited range of possible number values in our hash.

a simple hash algorithm assign value each base =1 g =2 c = 3 t =4 (assume no ambiguities) can sum bases in our sequence. mean sequences same number of as, cs gs , ts have same hash value. if wanted, have more complicated algorithm takes position account same number have have same sequence in same order.

  1. once have our hash algorithm. can make hash table binning sequences hash values. more bins have in our table, fewer hash values per bin. hashtables implemented array of linkedlists. fast lookup because see if sequence in our hashtable or add new sequence our hash table, compute hash value sequence see bin in, have @ values inside bin. can ignore rest of bins.

suffix tree

a suffix tree different datastructure graph each node (in case) residue in our sequence. edges in graph point next node etc. example if our sequence acgt path in graph a->c->g->t->$. if had sequence actt path a->c->t->t->$.

we can combine consecutive nodes if there 1 path in previous example since both sequence start ac paths ac->g->t->$and ac->t->t->$.

in bioinformatics useful substring matching (like finding repetitive regions or primer binding sites etc) since can see there subpaths in our graph match our motif.

hope helps


Comments

Popular posts from this blog

java - Oracle EBS .ClassNotFoundException: oracle.apps.fnd.formsClient.FormsLauncher.class ERROR -

c# - how to use buttonedit in devexpress gridcontrol -

How do you convert a timestamp into a datetime in python with the correct timezone? -