The Big Programming Thread - Page 644
Thread Rules
1. This is not a "do my homework for me" thread. If you have specific questions, ask, but don't post an assignment or homework problem and expect an exact solution.
2. No recruiting for your cockamamie projects (you won't replace facebook with 3 dudes you found on the internet and $20).
3. If you can't articulate why a language is bad, don't start slinging shit about it. Just remember that nothing is worse than making CSS IE6 compatible.
4. Use [code] tags to format code blocks.
Slithe
United States 985 Posts
Blisse
Canada 3710 Posts
It's an assignment, Java is required ^^

It's a 100K by 100K dense graph I need to count triangles on. Multi-threading is easy to do because I can abuse a little bit of parallelism for what to parse at what time. But I need to implement a single-threaded solution that doesn't cramp up.

Currently we split the edges in half to do triangle counting, but we have to do about 600M set contains() operations on essentially sets of ~100K integers, which Java chokes on because it can't keep all the edges in memory at the same time. I just have no clue what to do to keep all the integers in memory so that I can call contains() on different sets of size 100K and not have it cramp up. Each contains() operation is fast (<5ms), but doing 600M of them in serial hurts.

I've been trying to partition the data so it's potentially allocated closer together, but the benefits are negligible or it's worse most of the time. I might have to just accept this is as fast as I'm going to be able to call contains() 600M times. I'm more of a C guy, so doing this in Java is much more work :p
phar
United States 1080 Posts
Did you do an implementation in C and actually get faster results?

600 million 32-bit integers is less than 3GB of RAM.

What is the actual algorithm you're using to count triangles? What data structures are you using? How many edges do you actually have for 100K nodes? (Depending on the real-world application this may be only around 1M total edges, which you should easily be able to fit in memory in Java on a single machine.)

Have you actually verified you're running out of memory? If so: how are you invoking Java? Are you giving the JVM enough RAM? A lot of default JVM configurations will cap you at a 1G max heap, and if you're allocating >1G of objects you'll have a bad time. Try:

[code]java -XX:+PrintFlagsFinal -version | grep HeapSize[/code]

If your max heap size is tiny, try -Xmx to bump it up.
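For example, to give the JVM an 8G heap (class name here is hypothetical):

[code]java -Xmx8g TriangleCounter[/code]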
Nesserev
Belgium 2760 Posts
Acrofales
Spain 17848 Posts
On June 24 2015 12:25 Blisse wrote: It's an assignment, Java is required ^^ It's a 100K by 100K dense graph I need to count triangles on. Multi-threading is easy to do because I can abuse a little bit of parallelism for what to parse at what time. But I need to implement a single-threaded solution that doesn't cramp up. Currently we split the edges in half to do triangle counting, but we have to do about 600M set contains() operations on essentially sets of ~100K integers, which Java chokes on because it can't keep all the edges in memory at the same time. I just have no clue what to do to keep all the integers in memory so when I can call contains() on different sets of size 100K and not have it cramp up. The contains() operations are incredibly fast (<5ms), but having 600M of them in serial hurts. Been trying to partition the data so they're potentially allocated closer together but the benefits are intangible/worse most of the time. I might have to just accept this is as fast as I'm going to be able to call contains() 600M times. I'm more of a C guy so doing this in Java is much more work :p

Well, clearly you have to think about your data structure. You don't have lists of integers, you have a dense graph with 100K nodes. At least, that's what I understand from your "100K by 100K" (which makes it sound more like a matrix, but presumably that is your matrix representation of the graph?). There are many ways of representing that, the most common being a graph data structure and a matrix (basically a 2D array), both of which have good Java libraries for manipulating them.

However, given your task, I actually doubt your graph is dense. A fully connected graph of 100K nodes has C(100K, 3) = 166,661,666,700,000 triangles. Now, obviously dense does not mean fully connected, just mostly connected. If you store the output as a text file using 1 byte per triangle (an unrealistic assumption, you will use more), that is still ~166TB just to say what the triangles are. Even if the graph is not fully connected, the count only decreases linearly for a very long time, and for a densely connected graph you can count on a lower bound of about 100K^2 triangles, or, using 1 byte per triangle, ~10GB of triangles. In a real problem, these are the kinds of sizes you are looking at.

However, in real problems graphs are usually sparsely connected (small world). I did a quick google and cannot actually find a real-world problem that is represented as a dense graph. For a sparse graph you can use different data structures and algorithms. While you can always represent it as a dense matrix (read: a matrix where the 0s are explicitly represented), it is horribly inefficient.
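That count is easy to sanity-check with 64-bit arithmetic (a throwaway snippet, not from the thread; the product overflows int, hence the longs):

[code]
// Triangles in a fully connected graph on n nodes: C(n, 3) = n(n-1)(n-2)/6.
public class TriangleBound {
    public static void main(String[] args) {
        long n = 100_000L;
        System.out.println(n * (n - 1) * (n - 2) / 6); // prints 166661666700000
    }
}
[/code]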
Blisse
Canada 3710 Posts
On June 24 2015 12:59 phar wrote: Did you do an implementation in C and actually get faster results? 600 million 32 bit integers is like less than 3GB of RAM. What is the actual algorithm you're using to count triangles? What data structures are you using? How many edges do you actually have for 100K nodes? (Depending on the real-world application this may be only around 1M total edges, which you should easily be able to fit in memory in Java on a single machine). Have you actually verified you're running out of memory? If so: How are you invoking java? Are you giving the JVM enough RAM? For example a lot of default jvm will stick you at 1G max heap, if you're allocating >1G of objects you'll have a bad time. try java -XX:+PrintFlagsFinal -version | grep HeapSize if your max heap size is tiny, try -Xmx to bump it up

600M contains() operations. We use about 7-8GB of memory on the servers last time I checked; we might be at 5GB now. Our input is, per line, a vertex and a space-separated list of the other vertices it is connected to, i.e. (v1, v2) edge pairs. There are 100K vertices and 10M edges. We're not running out of memory (we probably will when we have to use the not-as-good servers, but that's for later).

The algorithm is: for each vertex, generate all size-2 ascending-order combinations from the list of adjacent vertices by partitioning the list into those less than the current vertex and those greater than the current vertex. For each size-2 combination, check if the edge exists. If it exists, this is a unique triangle, otherwise not.

We store the left (lower) combinations in a 100K list of lists of integers in sorted order (sorting takes <1s and gets us 15s faster due to locality). We store the right (upper) combinations in a 100K list of sets of integers backed by HashSets (so the contains() operations can be as fast as possible). The triangle check is just a for loop over >600M iterations of calls to contains(). This is as efficient as we think we can be without any more assumptions about the graph layout.

I think 10M edges in a graph of 100K vertices is about 0.2% of the total possible number of edges, which I assume can be considered dense. The reason I'm asking about this is that I believe I'm at the limits of the standard data structures in Java (only HashSets and sorted ArrayLists, really), but I have no clue because I use C/C++ more. LinkedHashSets to get sorted ordering add too much iteration overhead to get the locality benefits.
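If I'm reading that right, the counting pass is essentially this (a minimal sketch with made-up names; it assumes the two per-vertex structures are already built, and counts each triangle once at its middle vertex):

[code]
import java.util.HashSet;
import java.util.List;

public class TriangleCount {
    // lower.get(v): neighbours u < v, as a sorted int[] (locality).
    // higher.get(v): neighbours w > v, as a HashSet (fast contains()).
    static long countTriangles(List<int[]> lower, List<HashSet<Integer>> higher) {
        long triangles = 0;
        for (int v = 0; v < lower.size(); v++) {
            for (int u : lower.get(v)) {             // edge (u, v), u < v
                for (int w : higher.get(v)) {        // edge (v, w), v < w
                    if (higher.get(u).contains(w)) { // edge (u, w) exists?
                        triangles++;                 // then u < v < w is a triangle
                    }
                }
            }
        }
        return triangles;
    }
}
[/code]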
Acrofales
Spain 17848 Posts
If I understand your algorithm, what you do (conceptually) is:
1. Pre-process: for each node v, split its neighbour list into neighbours with ID < v and neighbours with ID > v.
2. For each node v, generate every pair (u, w) with u from the lower list and w from the higher list, so that u < v < w.
3. For each such pair, check whether the edge (u, w) exists; if it does, you have found a unique triangle.
That seems like a pretty good approach. I can't think of any direct improvements. All of these operations seem like they would be more efficient storing all the nodes in a hashmap with their id as key, where each node keeps two lists of neighbours: those with ID > node.id and those with ID < node.id. The total number of neighbours is on average about 200 IDs, making your entire hashmap about 100k * 200 + 100k = 20.1 million integers + overhead, which should not be a problem at all. Lookup in a hashtable is O(log n), so total time complexity for the algorithm this way is O(n * e * log n), which seems about as good as you can get it.
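For concreteness, a sketch of that layout (hypothetical names, not anyone's actual code):

[code]
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Nodes keyed by id, each holding its neighbours split by id.
class Node {
    final int id;
    final List<Integer> lowerNeighbours = new ArrayList<>();  // ids < this.id
    final List<Integer> higherNeighbours = new ArrayList<>(); // ids > this.id
    Node(int id) { this.id = id; }
}

class Graph {
    final Map<Integer, Node> nodes = new HashMap<>();

    void addEdge(int a, int b) {
        Node na = nodes.computeIfAbsent(a, Node::new);
        Node nb = nodes.computeIfAbsent(b, Node::new);
        if (a < b) { na.higherNeighbours.add(b); nb.lowerNeighbours.add(a); }
        else       { na.lowerNeighbours.add(b);  nb.higherNeighbours.add(a); }
    }
}
[/code]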
Blisse
Canada 3710 Posts
Yup that's basically our algorithm, though we already do the pre-processing. We have two 2D arrays, one array of arrays for the ID < node.id, one array of sets for ID > node.id. We originally had HashMaps instead of ArrayLists but it doesn't seem to have a significant impact for the outer array, no noticeable performance difference.

Might have to consider trying to store the < and > lists in a single class though instead of two lists, which might help a bit with locality...
Ropid
Germany 3557 Posts
Is that ArrayList<Integer>? What's the difference between Integer and int? Would the 8GB turn into 2GB or something if it could be int instead of Integer?
spinesheath
Germany 8679 Posts
On June 25 2015 01:57 Acrofales wrote: Lookup in a hashtable is O(log n), so total time complexity for the algorithm this way is O(n * e * log n), which seems about as good as you can get it.

That doesn't seem quite right. Is it that way in Java? That would be an awful implementation of a hashtable.

If your performance issues come from cache misses because the data is too large (do you have a way to properly verify that?), you definitely should look into ways to split the data up into multiple chunks and process them individually. Even if you have to "overlap" the chunks and execute a lot more instructions because of that, it should still be significantly faster. Cache misses are very expensive in high-performance computing scenarios.
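As a generic illustration of that kind of chunking (loop blocking; not tied to the triangle counter), processing a large 2D array in tiles keeps each tile's working set cache-resident:

[code]
// Cache-blocking sketch: transpose a large matrix in BLOCK x BLOCK tiles
// so the source and destination rows of each tile stay in cache.
public class BlockedTranspose {
    static final int BLOCK = 64; // tune to cache size

    static void transpose(double[][] in, double[][] out) {
        int n = in.length;
        for (int ib = 0; ib < n; ib += BLOCK) {
            for (int jb = 0; jb < n; jb += BLOCK) {
                int iMax = Math.min(ib + BLOCK, n);
                int jMax = Math.min(jb + BLOCK, n);
                for (int i = ib; i < iMax; i++) {
                    for (int j = jb; j < jMax; j++) {
                        out[j][i] = in[i][j];
                    }
                }
            }
        }
    }
}
[/code]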
Acrofales
Spain 17848 Posts
On June 25 2015 02:53 Blisse wrote: Yup that's basically our algorithm, though we already do the pre-processing. We have two 2D arrays, one array of arrays for the ID < node.id, one array of sets for ID > node.id. We originally had HashMaps instead of ArrayLists but it doesn't seem to have a significant impact for the outer array, no noticeable performance difference. Might have to consider trying to store the < and > lists in a single class though instead of two lists, which might help a bit with locality...

But then I don't know why you are having problems with your contains() calls. On average, every node should have 200 edges (so even if it's scale-free, the average is 200), and that check should be negligible: each edge is checked at most twice, so you will have at most 2e contains calls (in practice far less).

If you have a scale-free topology and are ending up doing your contains calls on the highly connected nodes all the time, switch your ordering (check the least-connected nodes first instead of the most-connected ones). In big-O complexity it makes no difference, but in practice it means you will have lots and lots of small contains() calls instead of a few large ones.
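A hedged sketch of that reordering idea (hypothetical names): relabel vertices by increasing degree before building the lower/higher split, so the pair generation happens at the low-degree end rather than at the hubs.

[code]
import java.util.Arrays;
import java.util.Comparator;

public class DegreeOrder {
    // rank[v] = position of v when vertices are sorted by increasing degree.
    // Partitioning neighbours by rank instead of raw ID pushes the hubs
    // to the "high" side, so pairs get generated at low-degree vertices.
    static int[] rankByDegree(int[] degree) {
        Integer[] byDegree = new Integer[degree.length];
        for (int v = 0; v < degree.length; v++) byDegree[v] = v;
        Arrays.sort(byDegree, Comparator.comparingInt(v -> degree[v]));
        int[] rank = new int[degree.length];
        for (int i = 0; i < byDegree.length; i++) rank[byDegree[i]] = i;
        return rank;
    }
}
[/code]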
Acrofales
Spain 17848 Posts
On June 25 2015 03:15 spinesheath wrote: That doesn't seem quite right. Is it that way in Java? That would be an awful implementation of a hashtable.

Yeah. Whoops. It's O(1). And you can probably make it a faster O(1) by implementing your own map for integers, as Blisse seems to have done. However, the log n in the complexity is for checking the presence of "second" in the "neighbours" list. That is log n (assuming neighbours is sorted, as Blisse said it was), not the lookup in the hashtable.
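That log n membership check on a sorted neighbour array is just a binary search (sketch):

[code]
import java.util.Arrays;

public class NeighbourCheck {
    // O(log n) membership test on a sorted neighbour array:
    // Arrays.binarySearch returns a non-negative index iff the key is present.
    static boolean hasNeighbour(int[] sortedNeighbours, int w) {
        return Arrays.binarySearch(sortedNeighbours, w) >= 0;
    }
}
[/code]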
Acrofales
Spain 17848 Posts
On June 25 2015 03:14 Ropid wrote: Is that ArrayList<Integer>? What's the difference between Integer and int? Would the 8GB turn into 2GB or something if it could be int instead of Integer?

http://stackoverflow.com/questions/8661420/do-the-java-integer-and-double-objects-have-unnecessary-overhead

Integers have 12 bytes of overhead (which, given that a 32-bit integer is 4 bytes, is quite significant). This seems to be referencing Java 1.3 (which is ancient), so perhaps it has changed. Using a library (or building your own data structure) that allows for the use of primitives might very well make a difference.

However, it shouldn't be 2GB in the first place for the size of graph he is talking about. So doing more in-depth profiling of what is eating all the memory and where the performance issues are coming from seems necessary before scrapping the standard library.
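A hedged illustration of the difference (the per-object sizes are typical 64-bit HotSpot figures, not guarantees):

[code]
import java.util.ArrayList;
import java.util.List;

// Boxed vs. primitive storage for 10M values. With boxing, each entry
// costs a reference (4-8 bytes) plus an Integer object (~16 bytes with
// header and padding on a typical 64-bit JVM); a primitive int[] costs
// 4 bytes per entry plus one array header. Run with a few hundred MB
// of heap (-Xmx) if you try this.
public class BoxedVsPrimitive {
    public static void main(String[] args) {
        int n = 10_000_000;

        List<Integer> boxed = new ArrayList<>(n); // roughly 20+ bytes per entry
        for (int i = 0; i < n; i++) boxed.add(i);

        int[] primitive = new int[n];             // ~40MB total
        for (int i = 0; i < n; i++) primitive[i] = i;
    }
}
[/code]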
phar
United States 1080 Posts
What's the distribution of your dataset like? Is it uniform, or does it follow a more real-world (e.g. power-law) distribution? If the latter, there are a rather large number of better algorithms you could use.
RoyGBiv_13
United States 1275 Posts
On June 24 2015 04:18 sabas123 wrote: You mean if the buffer is big you want to pass it by reference right? Nice guide btw, I enjoyed it a lot.

The data has to live somewhere on the system. You can declare it on the stack in, say, main(), then pass it by reference to helper functions that modify/generate/use the data. Once you have the buffer placed in memory, you should almost always pass it by reference, since you need the address of the buffer anyway to iterate through it.
SixStrings
Germany 2046 Posts
Hey girls, I probably have no business posting here, but here goes: I just finished the Khan course "Introduction to Javascript" and I enjoyed it in a brain-teaser / puzzle kind of way, so I want to get deeper into programming. There's a Python course starting in October; what could I do in the meanwhile?
Nesserev
Belgium 2760 Posts
Isualin
Turkey 1903 Posts
Acrofales
Spain 17848 Posts
On June 25 2015 18:45 SixStrings wrote: Hey girls, I probably have no business posting here, but here goes: I just finished the Khan course "Introduction to Javascript" and I enjoyed it in a brain-teaser / puzzle kind of way, so I want to get deeper into programming. There's a Python course starting in October, what could I do in the meanwhile?

Hey cutie, well shucks, I think you should probably stick to embroidery or crocheting. But if you actually want advice, how about you start by showing respect to the community and using a gender-neutral address? If not, then address the vast majority of people here, who are almost certainly going to be guys: you are on a gaming website dedicated to StarCraft (I don't remember the stats, but I believe it was about 90% male players), and in the programming thread (my CS course had about a 95% male population). Combine the two populations and the only girl here is the one giving you a blowjob while she's phoning her boyfriend.

+ Show Spoiler +
Dating how's your luck thread never forgets
Nesserev
Belgium 2760 Posts