|
Thread Rules

1. This is not a "do my homework for me" thread. If you have specific questions, ask, but don't post an assignment or homework problem and expect an exact solution.
2. No recruiting for your cockamamie projects (you won't replace Facebook with 3 dudes you found on the internet and $20).
3. If you can't articulate why a language is bad, don't start slinging shit about it. Just remember that nothing is worse than making CSS IE6 compatible.
4. Use [code] tags to format code blocks.
|
On December 13 2012 23:16 heishe wrote: Question: Given an ordinary 16-bit number, what's the fastest way to count the number of all consecutive 1s from the left? I've looked at the graphics.stanford site, but I can't apply anything.
E.g. 1110101011000101 should return 3; 1111100100100111... should return 5, etc.
Checking single bits is trivial, but it's also incredibly slow.
So far the only thing I've tried are lookup tables, which are even slower than looping through all bits manually (probably because a lookup table doesn't fit into the cache at all, resulting in tons of cache misses)

Lookup tables should be really fast. Storing your answers can be done in 64KB, which should fit in the cache. Because you cannot have more than 16 bits in a row, you could even store two values per byte, but that involves more math. If you are having cache misses you can try doing that; it only takes 32KB.

Finally, you can do the lookup 8 bits at a time instead, but this gets tricky when you have sequences that cross the midpoint. Basically you need 3 lookups per byte: the longest run of 1s, the 1s at the beginning, and the 1s at the end.

Your answer will then be the maximum of longest(highest 8 bits), longest(lowest 8 bits), and end(highest 8 bits) + beginning(lowest 8 bits).
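A minimal sketch of that split-byte lookup (the table and function names are made up; three 256-entry tables, 768 bytes total, finding the longest run of 1s in the 16-bit word):

[code]
#include <stdint.h>

/* 256-entry tables, built once: longest run of 1s in a byte,
   leading 1s (from bit 7 down), trailing 1s (from bit 0 up) */
static uint8_t longestRun[256], leadOnes[256], trailOnes[256];

static void buildTables(void)
{
    for (int b = 0; b < 256; b++) {
        int run = 0, best = 0;
        for (int i = 7; i >= 0; i--) {                       /* longest run in byte b */
            run = (b & (1 << i)) ? run + 1 : 0;
            if (run > best) best = run;
        }
        longestRun[b] = (uint8_t)best;
        int lead = 0;
        while (lead < 8 && (b & (0x80 >> lead))) lead++;     /* 1s from the top */
        leadOnes[b] = (uint8_t)lead;
        int trail = 0;
        while (trail < 8 && (b & (1 << trail))) trail++;     /* 1s from the bottom */
        trailOnes[b] = (uint8_t)trail;
    }
}

/* The longest run is either entirely in the high byte, entirely in the
   low byte, or it crosses the boundary (trailing 1s of the high byte
   plus leading 1s of the low byte). */
static uint8_t longestRun16(uint16_t n)
{
    uint8_t hi = (uint8_t)(n >> 8), lo = (uint8_t)(n & 0xFF);
    uint8_t best = longestRun[hi];
    if (longestRun[lo] > best) best = longestRun[lo];
    uint8_t cross = (uint8_t)(trailOnes[hi] + leadOnes[lo]);
    if (cross > best) best = cross;
    return best;
}
[/code]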
|
|
On December 14 2012 02:43 meadbert wrote: Lookup tables should be really fast. Storing your answers can be done in 64KB, which should fit in the cache. […]
I already implemented and tested a version with a 64k lookup table for the entire 16 bits, as well as a version with two 256-entry lookup tables. With both versions, L1 cache misses take away so much time that it negates any speed advantage I might have gained from the lookup tables, which results in the lookup-table versions actually being slower. Remember: a single L1 miss that has to go to L2 already costs you about 20-30 cycles. In those 30 cycles, I can almost read the value bitwise manually. And that's just for L1 misses. Since counting these leading zeroes / ones isn't the only thing I do, there are a lot of other things that need to be in cache as well.

The problem is that L1 caches are still very small. The processor I'm using at work (some i7 hexacore) has 32 KB of L1 cache per core. The lookup table requires 64k entries * 1 byte (the smallest value I can store in the array), which means that only a very small part of the lookup table fits into L1 cache at any given time. The 256-entry one actually fits into it entirely, but it still generates a lot of L1 cache misses because of the other things that fight for cache space as well.
|
Agh! I hope nobody minds me letting off some steam here. Who knows, maybe I'm just not seeing something that would resolve my problem and somebody can help. But I am currently really a bit frustrated with the STL container std::set.

The main reason? std::set::insert(...)! (see here) You can either a) specify an iterator position from which insert starts its search for the element to be inserted, which speeds up the search a lot if you know some handy information about your std::set, which I do; or b) get a bool returned which tells you whether the element was really inserted or was already there in the first place, which I desperately need, since I want to write a function that computes the symmetric difference.

But you cannot do both, and I cannot understand why!

And no, I sadly cannot use the std::set_symmetric_difference provided, because it creates a completely new container, which is problematic because my set is guaranteed to be several megabytes in size and the symmetric difference is something I want to compute over one hundred thousand times.
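For what it's worth, one way to get both things is to do the search yourself with lower_bound: it tells you whether the element is already present, and the resulting iterator doubles as the insertion hint. A rough sketch of an in-place symmetric difference built on that idea (names are made up):

[code]
#include <set>

// Toggle membership of x in s: erase it if present, otherwise insert it
// using the lower_bound result as the hint. Applying this to every element
// of another set turns s into the symmetric difference, in place.
template <typename T>
void toggle(std::set<T>& s, const T& x)
{
    auto it = s.lower_bound(x);          // first element not less than x
    if (it != s.end() && *it == x)
        s.erase(it);                     // was present -> remove
    else
        s.insert(it, x);                 // not present -> hinted insert
}

template <typename T>
void symmetric_difference_inplace(std::set<T>& s, const std::set<T>& other)
{
    for (const T& x : other)
        toggle(s, x);
}
[/code]

Since lower_bound already did the search, the hinted insert doesn't have to repeat it.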
Agh, time to go home, Ritmix already started anyways... -.-
|
On December 14 2012 03:05 heishe wrote: I already implemented and tested a version with a 64k lookup table for the entire 16 bits, as well as a version with two 256-entry lookup tables. With both versions, L1 cache misses take away so much time that it negates any speed advantage from the lookup tables. […]

Using a straight lookup table I was able to do 16 billion lookups in 7 seconds. Is that not fast enough? The bitwise slow way took 65 seconds. Storing 2 results per byte took 17 seconds, so fewer cache misses did not offset the extra computation. Finally, doing 8 bits at a time took 20 seconds.
|
On December 14 2012 03:33 meadbert wrote: Using a straight lookup table I was able to do 16 billion lookups in 7 seconds. Is that not fast enough? […]

In a closed test? That isn't representative at all. Like I said, there's a lot of other stuff going on as well. Besides, I already wrote that I tested it and it was too slow. Why is your data relevant?
|
Here is another fairly fast, more cache-friendly method. Have three lookup tables. The first is an 8-bit lookup table for the longest streak of ones in a byte. Next is an optional 8-bit lookup table for the streak of ones at the low end of a byte (on average this is fast to compute directly, so the table isn't strictly needed). Finally, you have a 256x9x9 matrix where you store the max streak given 3 variables: the lower 8 bits, the longest streak from the higher 8 bits, and the current streak carried over from the higher 8 bits.

This uses about 20K instead of 64K, so it fits within your 32K cache. Using a closed test I found it took 9 seconds as opposed to 7 for a straight lookup. Considering cache hits, this might be faster for you.

[code]
#include <stdlib.h>
#include <stdio.h>

/* longest streak of consecutive 1s anywhere in n (slow reference version) */
unsigned char maxBitsSlow(unsigned int n)
{
    unsigned char count = 0;
    unsigned char max = 0;
    while (n > 0) {
        if (n & 1) {
            count++;
            if (count > max) {
                max = count;
            }
        } else {
            count = 0;  /* streak broken */
        }
        n >>= 1;
    }
    return max;
}

/* streak of 1s at the low end of a byte: the part of a run that can
   continue across the boundary into the byte below */
unsigned char lowBitsSlow(unsigned char n)
{
    unsigned char count = 0;
    while ((n & 1) > 0) {
        count++;
        n >>= 1;
    }
    return count;
}

/* continue scanning through byte n from its top bit down, with an incoming
   streak of `count` carried from the byte above and a running `max` */
unsigned char rowContSlow(unsigned char n, unsigned char count, unsigned char max)
{
    unsigned char bit;
    for (bit = 0x80; bit > 0; bit >>= 1) {
        if (n & bit) {
            count++;
            if (count > max) {
                max = count;
            }
        } else {
            count = 0;  /* streak broken */
        }
    }
    return max;
}

unsigned char rowArray[1 << 8];          /* longest streak per byte            */
unsigned char lowArray[1 << 8];          /* low-end streak per byte            */
unsigned char rowMatrix[1 << 8][9][9];   /* continued scan per byte/count/max  */

void init(void)
{
    unsigned int i, j, k;
    for (i = 0; i < (1 << 8); i++) {
        rowArray[i] = maxBitsSlow(i);
        lowArray[i] = lowBitsSlow(i);
    }
    for (i = 0; i < (1 << 8); i++) {
        for (j = 0; j <= 8; j++) {
            for (k = 0; k <= 8; k++) {
                rowMatrix[i][j][k] = rowContSlow(i, j, k);
            }
        }
    }
}

unsigned char maxBitsCont(unsigned short n)
{
    unsigned char high = n >> 8;
    unsigned char low = n & 0xFF;
    unsigned char max = rowArray[high];    /* longest streak inside the high byte      */
    unsigned char count = lowArray[high];  /* streak that may cross into the low byte  */
    return rowMatrix[low][count][max];
}
[/code]
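A quick usage sketch, appended to the code above (0xEAC5 is heishe's first example, 1110101011000101):

[code]
int main(void)
{
    init();                               /* build the three tables once */
    printf("%d\n", maxBitsCont(0xEAC5));  /* 1110101011000101 -> prints 3 */
    return 0;
}
[/code]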
EDIT: How can I preserve my spacing?
|
On December 13 2012 23:16 heishe wrote: Question: Given an ordinary 16-bit number, what's the fastest way to count the number of all consecutive 1s from the left? I've looked at the graphics.stanford site, but I can't apply anything.
E.g. 1110101011000101 should return 3; 1111100100100111... should return 5, etc.
Checking single bits is trivial, but it's also incredibly slow.
So far the only thing I've tried are lookup tables, which are even slower than looping through all bits manually (probably because a lookup table doesn't fit into the cache at all, resulting in tons of cache misses)
In ASM, do the following (assuming your number is in the lower 16 bits of eax):

[code]
not ax
bsr ax, ax
[/code]

I am going to break it down quickly if you are not familiar with x86 assembly:

not will replace all 1s with 0s, and all 0s with 1s. bsr (Bit Scan Reverse) will then return the index of the highest bit still set to 1, which marks where the run of leading 1s in the original number ends.

This will be extremely fast, about 4 cycles (depending on your processor).
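A rough C sketch of that idea (function name made up), assuming GCC-style inline assembly on x86; the all-ones case has to be handled separately, since bsr is undefined for 0:

[code]
#include <stdint.h>

static inline int leading_ones16(uint16_t n)
{
    uint32_t inverted = (uint16_t)~n;     /* leading 1s become leading 0s */
    if (inverted == 0)
        return 16;                        /* all 16 bits were 1; bsr(0) is undefined */
    uint32_t idx;
    __asm__("bsr %1, %0" : "=r"(idx) : "r"(inverted));  /* index of highest 1 in ~n */
    return 15 - (int)idx;                 /* all bits above that index were 1 in n */
}
[/code]

On gcc, __builtin_clz(~n & 0xFFFF) - 16 computes the same thing without inline asm, with the same all-ones caveat.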
|
bsr doesn't work, it scans from the right (don't suggest bsf, it's the same thing, it just starts the search from the other side). In the example binaries I gave, your method would return 1 for the first number and 3 for the second. I need 3 and 5, though.

It effectively counts trailing zeroes, but I need something that counts leading zeroes. I could reverse the bit order of the relevant word (16 bit), but afaik there's no x86 instruction for that, and anything manual would take way too many cycles. At least I don't know an algorithm that does it in under ~5 cycles.
|
On December 14 2012 04:24 heishe wrote: bsr doesn't work, it scans from the right […]
You seem to be mixing up bsr and bsf?

And the index it gives is the index within the word, so you will need to subtract it from 15 to get the count you wanted; that's one more cycle.
|
On December 14 2012 04:24 heishe wrote: bsr doesn't work, it scans from the right […]
Hi Heishe... Here's an idea... I'd love to program it for you, but I have to take a final soon. (This will probably require assembly... though I guess you could use the ugly struct trick in C to make nibbles.)

Assume your number is in a 16-bit register. Example: 1111111011110000

Split the register into four 4-bit nibbles (or you can just have one nibble pointer at the top-most nibble in the register): 1111 1110 1111 0000

We know that 1111 is some constant value.

Check, starting from the top-most nibble, whether that nibble is equal to that 1111 constant. If it's not, then you know there is a zero in that bit pattern, and you can use more ordinary counting techniques to see how many 1 bits sit at its left end. If it is, you can increment your counter by four and move on to the next nibble. Hope you get the idea. This should be pretty quick if you program it properly in assembly.
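A rough C sketch of that nibble idea (the table and function names are made up; a 16-entry table gives the leading-1 count inside a nibble):

[code]
/* leading 1s of each 4-bit value 0..15 */
static const unsigned char nibbleLeadOnes[16] = {
    0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 3, 4
};

unsigned leadingOnes16(unsigned short n)
{
    unsigned count = 0;
    int shift;
    for (shift = 12; shift >= 0; shift -= 4) {
        unsigned nib = (n >> shift) & 0xF;
        count += nibbleLeadOnes[nib];
        if (nib != 0xF)        /* the run of 1s ends inside this nibble */
            break;
    }
    return count;              /* 1111111011110000 -> 7 */
}
[/code]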
*edit* haha, good catch on the bsr thing! Nevermind!
|
On December 14 2012 04:29 Denar wrote: You seem to be mixing up bsr and bsf? […]
Oh my god... I'm dumb. Your reply motivated me to do a dummy test of bsr and bsf (I had assumed they produce the same result). I wasted hours at work because I tested bsr and bsf incorrectly before and interpreted an answer from some guy on stackoverflow wrong.
Thank you very much.
|
I tried __builtin_clz for gcc and it took 18 seconds for 6 billion lookups.

I also tried the really easy 8-bit lookup where you do a lookup on the first 8 bits, and if it returns anything but 8 you are done. If it returns 8, then the answer is 8 + a lookup on the lower 8 bits. That took 8 seconds.

I am guessing that gcc is not implementing clz with a hardware instruction, or possibly the instruction does not exist on this machine.
[code]
/* array[] is the 256-entry "leading 1s per byte" table described above */
extern unsigned char array[1 << 8];

unsigned char leftBitsFast(unsigned int n)    /* n holds a 16-bit value */
{
    unsigned char high = n >> 8;
    unsigned char count = array[high];
    if (count != 8) {
        return count;                 /* the run of leading 1s ends in the high byte */
    }
    return 8 + array[n & 255];        /* high byte was all 1s; continue into the low byte */
}

unsigned char leftBitsAsm(unsigned short n)
{
    unsigned short flipped = ~n;
    if (flipped == 0) {
        return 16;                    /* clz undefined for 0 (all 16 bits were 1) */
    }
    return __builtin_clz(flipped) - 16;   /* clz counts over 32 bits, so drop the extra 16 */
}
[/code]
|
Game programmer here, we ask this question in interviews.
Brute force: -Loop through all bits, count 'em all.
Faster, but uses more memory: -Lookup tables of various sizes. They don't need to cover all 16 bits; you can do segments of 4 bits, otherwise the table size gets crazy.

New-age cool answer that is somewhat acceptable: -Do either of the two above, but fork to multiple threads.

Best answer that I wouldn't expect in an interview: the MIT HAKMEM algorithm (http://gurmeet.net/puzzles/fast-bit-counting-routines/)
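For reference, the classic HAKMEM bit-counting trick (item 169) counts the set bits of a 32-bit word with a handful of arithmetic ops and no table; a sketch (note this is a population count, not a run length):

[code]
/* HAKMEM item 169: number of 1 bits in a 32-bit word (the masks are octal) */
unsigned bitcount(unsigned n)
{
    unsigned tmp = n - ((n >> 1) & 033333333333) - ((n >> 2) & 011111111111);
    return ((tmp + (tmp >> 3)) & 030707070707) % 63;
}
[/code]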
|
On December 14 2012 05:18 Phunkapotamus wrote: Game programmer here, we ask this question in interviews. […]

Didn't you make impossible scenarios maps for BW?
|
On December 14 2012 05:18 Phunkapotamus wrote: Game programmer here, we ask this question in interviews. […]

Where X is the binary number:

[code]
NOT X
CLZ X
[/code]

Having a special function do it is cheating. :c
|
Okay, so I went through a giant 1500-page Java book and did a big project involving 2k lines of code, with a GUI and an SQL connection to a server to store/retrieve information. Does anyone know what the next step is in finding a job?
|
On December 14 2012 08:18 xavra41 wrote: Okay, so I went through a giant 1500-page Java book and did a big project involving 2k lines of code, with a GUI and an SQL connection to a server to store/retrieve information. Does anyone know what the next step is in finding a job?
Sending your CV.
|
On December 14 2012 08:18 xavra41 wrote: Okay, so I went through a giant 1500-page Java book and did a big project involving 2k lines of code, with a GUI and an SQL connection to a server to store/retrieve information. Does anyone know what the next step is in finding a job?

That's roughly equal to a single college course, only without any "proof" of completion / understanding.
It's a useful skill, but also something taught to sophomores in one of their first handful of programming courses. In other words: you've a long way to go still.
|
So what do I need before I can apply?
|