The Big Programming Thread - Page 644
Thread Rules
1. This is not a "do my homework for me" thread. If you have specific questions, ask, but don't post an assignment or homework problem and expect an exact solution.
2. No recruiting for your cockamamie projects (you won't replace facebook with 3 dudes you found on the internet and $20).
3. If you can't articulate why a language is bad, don't start slinging shit about it. Just remember that nothing is worse than making CSS IE6 compatible.
4. Use [code] tags to format code blocks.
Slithe
United States 985 Posts
Blisse
Canada 3710 Posts
It's an assignment, Java is required ^^

It's a 100K by 100K dense graph I need to count triangles on. Multi-threading is easy to do because I can abuse a little bit of parallelism for what to parse at what time. But I need to implement a single-threaded solution that doesn't cramp up.

Currently we split the edges in half to do triangle counting, but we have to do about 600M set contains() operations on essentially sets of ~100K integers, which Java chokes on because it can't keep all the edges in memory at the same time. I just have no clue what to do to keep all the integers in memory so that I can call contains() on different sets of size 100K and not have it cramp up. Each contains() operation is fast (<5ms), but doing 600M of them in serial hurts.

I've been trying to partition the data so it's potentially allocated closer together, but the benefits are negligible or it's worse most of the time. I might have to just accept this is as fast as I'm going to be able to call contains() 600M times. I'm more of a C guy, so doing this in Java is much more work :p
phar
United States 1080 Posts
Did you do an implementation in C and actually get faster results?

600 million 32-bit integers is less than 3GB of RAM.

What is the actual algorithm you're using to count triangles? What data structures are you using? How many edges do you actually have for 100K nodes? (Depending on the real-world application this may be only around 1M total edges, which you should easily be able to fit in memory in Java on a single machine.)

Have you actually verified you're running out of memory? If so: how are you invoking Java? Are you giving the JVM enough RAM? A lot of default JVM configurations will cap you at a 1G max heap, and if you're allocating >1G of objects you'll have a bad time. Try:

[code]java -XX:+PrintFlagsFinal -version | grep HeapSize[/code]

If your max heap size is tiny, try -Xmx to bump it up.
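For example, to give the JVM an 8G heap (class name here is hypothetical):

[code]java -Xmx8g TriangleCounter[/code]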
Nesserev
Belgium 2760 Posts
Acrofales
Spain 17848 Posts
On June 24 2015 12:25 Blisse wrote: It's an assignment, Java is required ^^ It's a 100K by 100K dense graph I need to count triangles on. Multi-threading is easy to do because I can abuse a little bit of parallelism for what to parse at what time. But I need to implement a single-threaded solution that doesn't cramp up. Currently we split the edges in half to do triangle counting, but we have to do about 600M set contains() operations on essentially sets of ~100K integers, which Java chokes on because it can't keep all the edges in memory at the same time. I just have no clue what to do to keep all the integers in memory so when I can call contains() on different sets of size 100K and not have it cramp up. The contains() operations are incredibly fast (<5ms), but having 600M of them in serial hurts. Been trying to partition the data so they're potentially allocated closer together but the benefits are intangible/worse most of the time. I might have to just accept this is as fast as I'm going to be able to call contains() 600M times. I'm more of a C guy so doing this in Java is much more work :p

Well, clearly you have to think about your data structure. You don't have lists of integers, you have a dense graph with 100K nodes. At least, that's what I understand from your "100K by 100K" (which makes it sound more like a matrix, but presumably that is your matrix representation of the graph?). There are many ways of representing that, the most common being a graph data structure and a matrix (basically a 2D array), both of which have good Java libraries for manipulating them.

However, given your task, I actually doubt your graph is dense. A fully connected graph of 100K nodes has C(100K, 3) = 166,661,666,700,000 triangles. Now, obviously dense does not mean fully connected, just mostly connected. If you store the output as a text file using 1 byte per triangle (an unrealistic assumption, you will use more), that is still ~166TB just to say what the triangles are. Even if the graph is not fully connected, the count only decreases linearly for a very long time, and for a densely connected graph you can count on a lower bound of about 100K^2 triangles, or, using 1 byte per triangle, ~10GB of triangles. In a real problem, these are the kinds of sizes you are looking at.

However, in real problems graphs are usually sparsely connected (small world). I did a quick google and cannot actually find a real-world problem that is represented as a dense graph. For a sparse graph you can use different data structures and algorithms. While you can always represent it as a dense matrix (read: a matrix where the 0s are explicitly represented), it is horribly inefficient.
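That count is easy to sanity-check with 64-bit arithmetic (a throwaway snippet, not from the thread; the product overflows int, hence the longs):

[code]
// Triangles in a fully connected graph on n nodes: C(n, 3) = n(n-1)(n-2)/6.
public class TriangleBound {
    public static void main(String[] args) {
        long n = 100_000L;
        System.out.println(n * (n - 1) * (n - 2) / 6); // prints 166661666700000
    }
}
[/code]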
Blisse
Canada 3710 Posts
On June 24 2015 12:59 phar wrote: Did you do an implementation in C and actually get faster results? 600 million 32 bit integers is like less than 3GB of RAM. What is the actual algorithm you're using to count triangles? What data structures are you using? How many edges do you actually have for 100K nodes? (Depending on the real-world application this may be only around 1M total edges, which you should easily be able to fit in memory in Java on a single machine). Have you actually verified you're running out of memory? If so: How are you invoking java? Are you giving the JVM enough RAM? For example a lot of default jvm will stick you at 1G max heap, if you're allocating >1G of objects you'll have a bad time. try java -XX:+PrintFlagsFinal -version | grep HeapSize if your max heap size is tiny, try -Xmx to bump it up

600M contains() operations. We use about 7-8GB of memory on the servers last time I checked; we might be at 5GB now. Our input is, per line, a vertex and a space-separated list of the other vertices it is connected to, i.e. (v1, v2) edge pairs. There are 100K vertices and 10M edges. We're not running out of memory (we probably will when we have to use the not-as-good servers, but that's for later).

The algorithm is: for each vertex, generate all size-2 ascending-order combinations from the list of adjacent vertices by partitioning the list into those less than the current vertex and those greater than the current vertex. For each size-2 combination, check if the edge exists. If it exists, this is a unique triangle, otherwise not.

We store the left (lower) combinations in a 100K list of lists of integers in sorted order (sorting takes <1s and gets us 15s faster due to locality). We store the right (upper) combinations in a 100K list of sets of integers backed by HashSets (so the contains() operations can be as fast as possible). The triangle check is just a for loop over >600M iterations of calls to contains(). This is as efficient as we think we can be without any more assumptions about the graph layout.

I think 10M edges in a graph of 100K vertices is about 0.2% of the total possible number of edges, which I assume can be considered dense. The reason I'm asking about this is that I believe I'm at the limits of the standard data structures in Java (only HashSets and sorted ArrayLists, really), but I have no clue because I use C/C++ more. LinkedHashSets to get sorted ordering add too much iteration overhead to get the locality benefits.
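If I'm reading that right, the counting pass is essentially this (a minimal sketch with made-up names; it assumes the two per-vertex structures are already built, and counts each triangle once at its middle vertex):

[code]
import java.util.HashSet;
import java.util.List;

public class TriangleCount {
    // lower.get(v): neighbours u < v, as a sorted int[] (locality).
    // higher.get(v): neighbours w > v, as a HashSet (fast contains()).
    static long countTriangles(List<int[]> lower, List<HashSet<Integer>> higher) {
        long triangles = 0;
        for (int v = 0; v < lower.size(); v++) {
            for (int u : lower.get(v)) {             // edge (u, v), u < v
                for (int w : higher.get(v)) {        // edge (v, w), v < w
                    if (higher.get(u).contains(w)) { // edge (u, w) exists?
                        triangles++;                 // then u < v < w is a triangle
                    }
                }
            }
        }
        return triangles;
    }
}
[/code]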
Acrofales
Spain 17848 Posts
If I understand your algorithm, what you do (conceptually) is:
1. Pre-process: for each node v, split its neighbour list into neighbours with ID < v and neighbours with ID > v.
2. For each node v, generate every pair (u, w) with u from the lower list and w from the higher list, so that u < v < w.
3. For each such pair, check whether the edge (u, w) exists; if it does, you have found a unique triangle.
That seems like a pretty good approach. I can't think of any direct improvements. All of these operations seem like they would be more efficient storing all the nodes in a hashmap with their id as key, where each node keeps two lists of neighbours: those with ID > node.id and those with ID < node.id. The total number of neighbours is on average about 200 IDs, making your entire hashmap about 100k * 200 + 100k = 20.1 million integers + overhead, which should not be a problem at all. Lookup in a hashtable is O(log n), so total time complexity for the algorithm this way is O(n * e * log n), which seems about as good as you can get it.
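For concreteness, a sketch of that layout (hypothetical names, not anyone's actual code):

[code]
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Nodes keyed by id, each holding its neighbours split by id.
class Node {
    final int id;
    final List<Integer> lowerNeighbours = new ArrayList<>();  // ids < this.id
    final List<Integer> higherNeighbours = new ArrayList<>(); // ids > this.id
    Node(int id) { this.id = id; }
}

class Graph {
    final Map<Integer, Node> nodes = new HashMap<>();

    void addEdge(int a, int b) {
        Node na = nodes.computeIfAbsent(a, Node::new);
        Node nb = nodes.computeIfAbsent(b, Node::new);
        if (a < b) { na.higherNeighbours.add(b); nb.lowerNeighbours.add(a); }
        else       { na.lowerNeighbours.add(b);  nb.higherNeighbours.add(a); }
    }
}
[/code]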
Blisse
Canada 3710 Posts
Yup that's basically our algorithm, though we already do the pre-processing. We have two 2D arrays, one array of arrays for the ID < node.id, one array of sets for ID > node.id. We originally had HashMaps instead of ArrayLists but it doesn't seem to have a significant impact for the outer array, no noticeable performance difference.

Might have to consider trying to store the < and > lists in a single class though instead of two lists, which might help a bit with locality...
Ropid
Germany 3557 Posts
Is that ArrayList<Integer>? What's the difference between Integer and int? Would the 8GB turn into 2GB or something if it could be int instead of Integer?
spinesheath
Germany 8679 Posts
On June 25 2015 01:57 Acrofales wrote: Lookup in a hashtable is O(log n), so total time complexity for the algorithm this way is O(n * e * log n), which seems about as good as you can get it.

That doesn't seem quite right. Is it that way in Java? That would be an awful implementation of a hashtable.

If your performance issues come from cache misses because the data is too large (do you have a way to properly verify that?), you definitely should look into ways to split the data up into multiple chunks and process them individually. Even if you have to "overlap" the chunks and execute a lot more instructions because of that, it should still be significantly faster. Cache misses are very expensive in high-performance computing scenarios.
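As a generic illustration of that kind of chunking (loop blocking; not tied to the triangle counter), processing a large 2D array in tiles keeps each tile's working set cache-resident:

[code]
// Cache-blocking sketch: transpose a large matrix in BLOCK x BLOCK tiles
// so the source and destination rows of each tile stay in cache.
public class BlockedTranspose {
    static final int BLOCK = 64; // tune to cache size

    static void transpose(double[][] in, double[][] out) {
        int n = in.length;
        for (int ib = 0; ib < n; ib += BLOCK) {
            for (int jb = 0; jb < n; jb += BLOCK) {
                int iMax = Math.min(ib + BLOCK, n);
                int jMax = Math.min(jb + BLOCK, n);
                for (int i = ib; i < iMax; i++) {
                    for (int j = jb; j < jMax; j++) {
                        out[j][i] = in[i][j];
                    }
                }
            }
        }
    }
}
[/code]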
Acrofales
Spain 17848 Posts
On June 25 2015 02:53 Blisse wrote: Yup that's basically our algorithm, though we already do the pre-processing. We have two 2D arrays, one array of arrays for the ID < node.id, one array of sets for ID > node.id. We originally had HashMaps instead of ArrayLists but it doesn't seem to have a significant impact for the outer array, no noticeable performance difference. Might have to consider trying to store the < and > lists in a single class though instead of two lists, which might help a bit with locality...

But then I don't know why you are having problems with your contains() calls. On average, every node should have 200 edges (so even if it's scale-free, the average is 200), and that check should be negligible: each edge is checked at most twice, so you will have at most 2e contains calls (in practice far less).

If you have a scale-free topology and are ending up doing your contains calls on the highly connected nodes all the time, switch your ordering (check the least-connected nodes first instead of the most-connected ones). In big-O complexity it makes no difference, but in practice it means you will have lots and lots of small contains() calls instead of a few large ones.
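A hedged sketch of that reordering idea (hypothetical names): relabel vertices by increasing degree before building the lower/higher split, so the pair generation happens at the low-degree end rather than at the hubs.

[code]
import java.util.Arrays;
import java.util.Comparator;

public class DegreeOrder {
    // rank[v] = position of v when vertices are sorted by increasing degree.
    // Partitioning neighbours by rank instead of raw ID pushes the hubs
    // to the "high" side, so pairs get generated at low-degree vertices.
    static int[] rankByDegree(int[] degree) {
        Integer[] byDegree = new Integer[degree.length];
        for (int v = 0; v < degree.length; v++) byDegree[v] = v;
        Arrays.sort(byDegree, Comparator.comparingInt(v -> degree[v]));
        int[] rank = new int[degree.length];
        for (int i = 0; i < byDegree.length; i++) rank[byDegree[i]] = i;
        return rank;
    }
}
[/code]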
Acrofales
Spain 17848 Posts
On June 25 2015 03:15 spinesheath wrote: That doesn't seem quite right. Is it that way in Java? That would be an awful implementation of a hashtable.

Yeah. Whoops. It's O(1). And you can probably make it a faster O(1) by implementing your own map for integers, as Blisse seems to have done. However, the log n in the complexity is for checking the presence of "second" in the "neighbours" list. That is log n (assuming neighbours is sorted, as Blisse said it was), not the lookup in the hashtable.
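That log n membership check on a sorted neighbour array is just a binary search (sketch):

[code]
import java.util.Arrays;

public class NeighbourCheck {
    // O(log n) membership test on a sorted neighbour array:
    // Arrays.binarySearch returns a non-negative index iff the key is present.
    static boolean hasNeighbour(int[] sortedNeighbours, int w) {
        return Arrays.binarySearch(sortedNeighbours, w) >= 0;
    }
}
[/code]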
Acrofales
Spain 17848 Posts
On June 25 2015 03:14 Ropid wrote: Is that ArrayList<Integer>? What's the difference between Integer and int? Would the 8GB turn into 2GB or something if it could be int instead of Integer?

http://stackoverflow.com/questions/8661420/do-the-java-integer-and-double-objects-have-unnecessary-overhead

Integers have 12 bytes of overhead (which, given that a 32-bit integer is 4 bytes, is quite significant). This seems to be referencing Java 1.3 (which is ancient), so perhaps it has changed. Using a library (or building your own data structure) that allows for the use of primitives might very well make a difference.

However, it shouldn't be 2GB in the first place for the size of graph he is talking about. So doing more in-depth profiling of what is eating all the memory and where the performance issues are coming from seems necessary before scrapping the standard library.
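A hedged illustration of the difference (the per-object sizes are typical 64-bit HotSpot figures, not guarantees):

[code]
import java.util.ArrayList;
import java.util.List;

// Boxed vs. primitive storage for 10M values. With boxing, each entry
// costs a reference (4-8 bytes) plus an Integer object (~16 bytes with
// header and padding on a typical 64-bit JVM); a primitive int[] costs
// 4 bytes per entry plus one array header. Run with a few hundred MB
// of heap (-Xmx) if you try this.
public class BoxedVsPrimitive {
    public static void main(String[] args) {
        int n = 10_000_000;

        List<Integer> boxed = new ArrayList<>(n); // roughly 20+ bytes per entry
        for (int i = 0; i < n; i++) boxed.add(i);

        int[] primitive = new int[n];             // ~40MB total
        for (int i = 0; i < n; i++) primitive[i] = i;
    }
}
[/code]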
phar
United States 1080 Posts
What's the distribution of your dataset like? Is it uniform, or does it follow a more real-world (e.g. power-law) distribution? If the latter, there are a rather large number of better algorithms you could use.
RoyGBiv_13
United States 1275 Posts
On June 24 2015 04:18 sabas123 wrote: You mean if the buffer is big you want to pass it by reference right? Nice guide btw, I enjoyed it a lot.

The data has to live somewhere on the system. You can declare it on the stack in, say, main(), then pass it by reference to helper functions that modify/generate/use the data. Once you have the buffer placed in memory, you should almost always pass it by reference, since you need the address of the buffer anyway to iterate through it.
SixStrings
Germany 2046 Posts
Hey girls, I probably have no business posting here, but here goes: I just finished the Khan course "Introduction to Javascript" and I enjoyed it in a brain-teaser / puzzle kind of way, so I want to get deeper into programming. There's a Python course starting in October; what could I do in the meanwhile?
Nesserev
Belgium 2760 Posts
Isualin
Turkey 1903 Posts
Acrofales
Spain 17848 Posts
On June 25 2015 18:45 SixStrings wrote: Hey girls, I probably have no business posting here, but here goes: I just finished the Khan course "Introduction to Javascript" and I enjoyed it in a brain-teaser / puzzle kind of way, so I want to get deeper into programming. There's a Python course starting in October, what could I do in the meanwhile?

Hey cutie, well shucks, I think you should probably stick to embroidery or crocheting. But if you actually want advice, how about you start by showing respect to the community and using a gender-neutral address? If not, then address the vast majority of people here, who are almost certainly going to be guys: you are on a gaming website dedicated to StarCraft (I don't remember the stats, but I believe it was about 90% male players), and in the programming thread (my CS course had about a 95% male population). Combine the two populations and the only girl here is the one giving you a blowjob while she's phoning her boyfriend.

+ Show Spoiler +
Dating how's your luck thread never forgets
Nesserev
Belgium 2760 Posts