|
Hi programmers/math people.
Okay, here's the problem. I have a hashmap with Strings as keys and values pointing to objects (as seen below, in Java):
HashMap<String, SomeObject>
The chars within a single string are elements of the set {0, 1, #}. # is a wildcard which can represent either a 0 or a 1.
When presented with a message, e.g. 011010111 (a message's chars are elements of the set {0, 1}), the following strings are satisfied: 01#01#111, #1101011#, 011010111, #########, etc., due to their wildcards.
Which lookup/sorting method would you use, such that you have the fastest algorithm to store the strings and also find the strings which are satisfied?
Brute force complexity: finding all satisfied strings is O(n*p), with n = population of strings, p = size of string.
Brute force works of course: for (each string in the hashmap) { is the string satisfied? save it; next string }
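As a minimal sketch (it assumes the SomeObject class from the map above; the class and method names are just mine):
import java.util.*;

// Brute force matcher: O(n*p) per message, n = number of stored strings, p = string length.
class BruteForceMatcher {
    // Every non-# char of the stored string must equal the message's char at that position.
    static boolean satisfies(String pattern, String message) {
        if (pattern.length() != message.length()) return false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c != '#' && c != message.charAt(i)) return false;
        }
        return true;
    }

    // Walk the whole map and collect the objects of all satisfied strings.
    static List<SomeObject> findSatisfied(HashMap<String, SomeObject> map, String message) {
        List<SomeObject> result = new ArrayList<SomeObject>();
        for (Map.Entry<String, SomeObject> e : map.entrySet()) {
            if (satisfies(e.getKey(), message)) result.add(e.getValue());
        }
        return result;
    }
}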
Tree complexity: O(p*n), but it's very unlikely that all strings are found in ONE leaf. Constructing the tree: O(2^p) (!HOLY FUCK!)
The idea is keeping a tree which branches every time a wildcard appears in a string. Each leaf in the tree has a hashmap which looks like the one above. The string's SomeObject, e.g. for 01#01#111, would then be found in 4 leaves of the tree: 010010111, 010011111, 011011111, 011010111.
The problem is constructing the tree... if the string is long, say 20-30 chars, the construction is simply too big to be feasible.
How would you do it?
|
What I want to do is solve the problem by "folding" the strings into unique integers somehow, but I'm not sure it can be done.
edit: I really don't think I can help you, sorry.
edit2: What about making a new hashmap with no wildcards by replicating each string/object pairing 2^(number of #s) times, then sorting the strings as integers? (Big setup time, subsequent searches are O(lg n).)
|
On May 22 2011 01:03 Cube wrote: What I want to do is solve the problem by "folding" the strings into unique integers somehow, but I'm not sure it can be done.
edit: I really don't think I can help you, sorry.
Might actually be a good idea.
Then it's an experiment in how much folding is required to sort it into several small hashmaps. Ignore what I wrote, I gotta think more about it.
|
What about brute-forcing the other way? Assuming you only care about getting the correct object and not the actual string, make an entry in the hash table for every possible message for each string with a wildcard.
|
Just to clarify, there are no limits on the sizes or types of data, e.g. a string could be a million characters long and the population of acceptable strings could be arbitrarily large? Also, are we assuming that all strings we're working with are of the same length?
I guess what I'm really asking is whether the problem is for an actual project in real life or just a theoretical puzzle?
|
On May 22 2011 01:33 Famulus wrote: What about brute-forcing the other way? Assuming you only care about getting the correct object and not the actual string, make an entry in the hash table for every possible message for each string with a wildcard.
It would be a good idea, but every possible message is 2^(length of string), that's
length -> combinations
20 -> 1,048,576
40 -> 1,099,511,627,776 (in my case)
Not scalable :/.
|
On May 22 2011 01:34 pullarius1 wrote: Just to clarify, there are no limits on the sizes or types of data, e.g. a string could be a million characters long and the population of acceptable strings could be arbitrarily large? Also, are we assuming that all strings we're working with are of the same length?
I guess what I'm really asking is whether the problem is for an actual project in real life or just a theoretical puzzle?
Perfectly good questions. All strings are the same length, and so is the message that needs to be satisfied. It's for an XCS engine - I'll provide a paper in a sec.
|
algorithmic description of XCS
It's an AI learning technique, based on "Learning classifier systems". You don't really need to read it to understand the problem though.
A bunch of strings built from {0, 1, #}, and a message built from {0, 1} which needs to find the strings it satisfies.
|
I am interested to see if someone comes up with an alternative to iteration for this, as I parse huge data files on a daily basis for work.
I often find myself setting up foreach loops with regular expressions to loop through hashtables in Perl, and I've always wondered if there was a more efficient way of doing it.
|
If you're not worried about memory, you can just take the initial HashMap and convert it into a new HashMap<String, ArrayList<SomeObject>> where the key is only {0,1}. You just iterate through the original HashMap and convert all '#' into '0' and '1'. This is worst case O(2^p) for a String of all '#' for storing, but gives you O(1) look-up time.
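A rough sketch of that (the recursive expand helper and the class name are just illustrative):
import java.util.*;

// Expand every wildcard key into all of its concrete {0,1} strings and store the
// object under each one. Storing is worst case O(2^p) per key (all '#'),
// but looking up a message is then a single O(1) HashMap get.
class ExpandedMap {
    private final HashMap<String, ArrayList<SomeObject>> map =
            new HashMap<String, ArrayList<SomeObject>>();

    void put(String wildcardKey, SomeObject value) {
        expand(wildcardKey.toCharArray(), 0, value);
    }

    // Recursively replace each '#' with '0' and '1', registering every expansion.
    private void expand(char[] s, int pos, SomeObject value) {
        if (pos == s.length) {
            String key = new String(s);
            ArrayList<SomeObject> bucket = map.get(key);
            if (bucket == null) {
                bucket = new ArrayList<SomeObject>();
                map.put(key, bucket);
            }
            bucket.add(value);
        } else if (s[pos] != '#') {
            expand(s, pos + 1, value);
        } else {
            s[pos] = '0'; expand(s, pos + 1, value);
            s[pos] = '1'; expand(s, pos + 1, value);
            s[pos] = '#'; // restore the wildcard before backtracking
        }
    }

    // Returns the objects of all satisfied strings, or null if there are none.
    ArrayList<SomeObject> find(String message) {
        return map.get(message);
    }
}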
|
On May 22 2011 02:18 Mx.DeeP wrote: If you're not worried about memory, you can just take the initial HashMap and convert it into a new HashMap<String, ArrayList<SomeObject>> where the key is only {0,1}. You just iterate through the original HashMap and convert all '#' into '0' and '1'. This is worst case O(2^p) for a String of all '#' for storing, but gives you O(1) look-up time.
Exactly - that's the "tree" I talked about.
My message (in my problem) has 40 bits; that's up to 2^40 entries in the construction of that tree... 1,099,511,627,776 nodes - it would take too long :/.
I'm gonna go work out and think about it.
|
I'm not sure how much you get to work with the lists beforehand, but obviously if you could sort the list of wild strings beforehand it would help a lot. But it would be a waste of time if the number of messages you were checking for matches were very low. That is, if you have 100 wild strings but only needed to find the matches for a few messages, sorting would probably hurt. But if you had 100 messages to match, it would probably be worth your while.
One thing I thought of that is probably not useful at all: for each string, rehash it into an integer in the following way. For each place number i, assign it the (2i-1)-th and 2i-th primes. If that place holds a 1, choose the odd-indexed prime (the (2i-1)-th); a 0, choose the even-indexed prime (the 2i-th); a #, choose neither. Multiply all the chosen primes together.
For instance, 10110 would be (2 or 3) (5 or 7) (11 or 13) (17 or 19) (23 or 29): 2*7*11*17*29 = 75,922
While #01#0 would be (2 or 3) (5 or 7) (11 or 13) (17 or 19) (23 or 29): 7*11*29 = 2,233
The benefit of this system would be that a wild string's number would exactly divide the numbers of the messages that satisfy it. For whatever that's worth.
...sometimes I wish I had taken some practical programming classes in school :-(.
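For what it's worth, a little sketch of the encoding (BigInteger because the products overflow a long quickly; the tiny prime table here only covers length-5 strings, and the example values are just illustrative):
import java.math.BigInteger;

// Prime-product encoding of a {0,1,#} string: 1-based position i contributes the
// (2i-1)-th prime for a '1', the 2i-th prime for a '0', and nothing for a '#'.
class PrimeHash {
    // First 10 primes, enough for length-5 strings; a real table needs 2*p primes.
    static final int[] PRIMES = { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29 };

    static BigInteger hash(String s) {
        BigInteger product = BigInteger.ONE;
        for (int i = 0; i < s.length(); i++) {   // i is 0-based, so the 1-based position is i + 1
            char c = s.charAt(i);
            if (c == '1') product = product.multiply(BigInteger.valueOf(PRIMES[2 * i]));     // (2(i+1)-1)-th prime
            if (c == '0') product = product.multiply(BigInteger.valueOf(PRIMES[2 * i + 1])); // 2(i+1)-th prime
        }
        return product;
    }

    public static void main(String[] args) {
        System.out.println(hash("10110")); // 2*7*11*17*29 = 75922
        System.out.println(hash("#01#0")); // 7*11*29 = 2233
        // A message satisfies a wild string exactly when the wild string's number divides the message's:
        System.out.println(hash("00110").mod(hash("#01#0"))); // 0
    }
}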
|
On May 22 2011 02:32 pullarius1 wrote: I'm not sure how much you get to work with the lists beforehand, but obviously if you could sort the list of wild strings beforehand it would help a lot. But it would be a waste of time if the number of messages you were checking for matches were very low. That is, if you have 100 wild strings but only needed to find the matches for a few messages, sorting would probably hurt. But if you had 100 messages to match, it would probably be worth your while.
One thing I thought of that is probably not useful at all: for each string, rehash it into an integer in the following way. For each place number i, assign it the (2i-1)-th and 2i-th primes. If that place holds a 1, choose the odd-indexed prime (the (2i-1)-th); a 0, choose the even-indexed prime (the 2i-th); a #, choose neither. Multiply all the chosen primes together.
For instance, 10110 would be (2 or 3) (5 or 7) (11 or 13) (17 or 19) (23 or 29): 2*7*11*17*29 = 75,922
While #01#0 would be (2 or 3) (5 or 7) (11 or 13) (17 or 19) (23 or 29): 7*11*29 = 2,233
The benefit of this system would be that a wild string's number would exactly divide the numbers of the messages that satisfy it. For whatever that's worth.
...sometimes I wish I had taken some practical programming classes in school :-(.
This is basically what I had in mind, but as the string size grows arbitrarily large this becomes impractical. :[
|
On May 22 2011 02:32 pullarius1 wrote: I'm not sure how much you get to work with the lists beforehand, but obviously if you could sort the list of wild strings beforehand it would help a lot. But it would be a waste of time if the number of messages you were checking for matches were very low. That is, if you have 100 wild strings but only needed to find the matches for a few messages, sorting would probably hurt. But if you had 100 messages to match, it would probably be worth your while.
One thing I thought of that is probably not useful at all: for each string, rehash it into an integer in the following way. For each place number i, assign it the (2i-1)-th and 2i-th primes. If that place holds a 1, choose the odd-indexed prime (the (2i-1)-th); a 0, choose the even-indexed prime (the 2i-th); a #, choose neither. Multiply all the chosen primes together.
For instance, 10110 would be (2 or 3) (5 or 7) (11 or 13) (17 or 19) (23 or 29): 2*7*11*17*29 = 75,922
While #01#0 would be (2 or 3) (5 or 7) (11 or 13) (17 or 19) (23 or 29): 7*11*29 = 2,233
The benefit of this system would be that a wild string's number would exactly divide the numbers of the messages that satisfy it. For whatever that's worth.
...sometimes I wish I had taken some practical programming classes in school :-(.
Well, that's actually a great solution, since if (hashed message) modulo (hashed key) = 0, then that entry is satisfied by the message.
So if you map every key to this form with the hash function and store it in the next slot in the database, along with its object pointer, then you simply do a linear search for (hashed message) modulo (hashed key) = 0 over each element of the array.
This circumvents directly hashing onto an array location, since that hash function's values grow faster than n factorial.
|
Okay, I gotta re-read it all, 'cos I'm a bit lost on this one.
Edit: okay, I read it! I need to write a sketch of it. It might actually work, taking the modulus with the message.
Your only problem is if the string has 40 wildcards in it - then it takes a long time to write all the possible prime combinations (right?)... Or, if you ignore wildcards, does a string of 40 #s hash to 0?
I'm gonna write an algorithm for this real quick.
|
So when you store an object by its key, you create an entry with the hashed version of the key (H_i), O(p), and its object pointer, O(1). Then you store both into the next available position in the dataset, O(1).
When you're searching for keys which satisfy a certain message: first hash the message, O(p), giving H_k. Then do a linear check over all database entries, returning every entry where H_k modulo H_i = 0. O(l)
p = length of string, l = size of dataset
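Roughly, in code - just a sketch on my part, reusing the PrimeHash encoding sketched in pullarius1's post above (the class and method names are made up):
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Each entry stores the key's prime-product hash (H_i) next to its object pointer.
// Insertion is O(p); answering a message is one hash plus one linear pass, O(p + l).
class PrimeDatabase {
    static class Entry {
        final BigInteger keyHash;  // H_i
        final SomeObject value;
        Entry(BigInteger keyHash, SomeObject value) { this.keyHash = keyHash; this.value = value; }
    }

    private final List<Entry> entries = new ArrayList<Entry>();

    void insert(String wildcardKey, SomeObject value) {
        entries.add(new Entry(PrimeHash.hash(wildcardKey), value));
    }

    List<SomeObject> satisfiedBy(String message) {
        BigInteger h = PrimeHash.hash(message);                  // H_k
        List<SomeObject> result = new ArrayList<SomeObject>();
        for (Entry e : entries) {
            // H_k mod H_i == 0 exactly when the stored key is satisfied by the message.
            // An all-'#' key hashes to 1, which divides everything, so it matches every message.
            if (h.mod(e.keyHash).signum() == 0) result.add(e.value);
        }
        return result;
    }
}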
|
The problem is then, what if the dataset is 5,000,000 strings? :/
|
Insertion is O(p); extraction is O(p+l), in which l will probably dominate p, so O(l).
Which is still acceptable by any means (l = length of the array, so linear time). 5,000,000 wouldn't take an enormous amount of time (in fact, 5,000,000 is really fast to compute).
|
I'm thinking it might be possible to speed up lookup, perhaps with a tree search or other sorting methods. Of course this would kill the insertion time.
|
^ Yes, it is. I've been thinking about this for the past hour or so and have coded up a working implementation in Java. I'll PM you once I write more tests.
|