|
Thread Rules
1. This is not a "do my homework for me" thread. If you have specific questions, ask, but don't post an assignment or homework problem and expect an exact solution.
2. No recruiting for your cockamamie projects (you won't replace facebook with 3 dudes you found on the internet and $20).
3. If you can't articulate why a language is bad, don't start slinging shit about it. Just remember that nothing is worse than making CSS IE6 compatible.
4. Use [code] tags to format code blocks.
|
On October 19 2013 02:21 HumpingHydra wrote: Is there anyone in this thread who has dealt specifically with python having come from a background of zero programming whatsoever? How long will it take? If I plan to get some sort of coop job within the realm of bioinformatics, is there a chance I can learn a significant amount of python before applying for jobs in the fall and ultimately into summer? Any advice would be nice. Thanks guys.
Well, I will try to cover everything you want to know. First of all, Python is a really good choice as a first language to learn, because it relies on higher-level mechanisms compared to C, for example, which means that you are quite close to human language, as opposed to binary (assembly).
A lot of new programmers are taught programming using Python. You could try out the courses from Berkeley at http://inst.eecs.berkeley.edu/~cs61a/su12/ that cover a lot of Python and data structures. Practice is the key word, and the time required depends on how much work you put in. I guess that if you do this in your spare time you could understand the basics quite quickly.
I have no experience in bioinformatics so I can't tell the level of programming required. From what I have seen it is not trivial, but don't be afraid to apply and see what is expected of you.
|
Programming for sciences such as biology or physics is a different animal from computer science or software engineering or programming as a profession.
Usually you are writing programs to perform mathematical calculations that you have designed on paper but that are difficult to do by hand because they need to be applied to thousands of rows in a matrix. In that sense, getting familiar enough with Python to do such calculations shouldn't take too much effort. If it does take longer than a few afternoons, then you may want to find a mentor to help you get back on track.
The hardest part is just the first step. Once you can read some parts of the code, and look up what you don't know, then it's downhill from there. Think of it as another language to describe mathematics in.
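To give a rough idea of what "apply a calculation to every row of a matrix" looks like, here is a minimal sketch in plain Python. The data and the `normalize_row` helper are made up for illustration; a real project would likely use its own formula and possibly a numerical library.

```python
def normalize_row(row):
    """Scale a row so its values sum to 1 (a hypothetical example calculation)."""
    total = sum(row)
    if total == 0:
        return list(row)  # leave all-zero rows unchanged to avoid dividing by zero
    return [value / total for value in row]


def normalize_matrix(matrix):
    """Apply the same calculation to every row of the matrix."""
    return [normalize_row(row) for row in matrix]


# Made-up data: three rows instead of thousands.
matrix = [[1, 1, 2], [0, 5, 5], [3, 0, 0]]
for row in normalize_matrix(matrix):
    print(row)
```

The loop is the same whether the matrix has three rows or three hundred thousand, which is the whole point of writing it as a program.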
|
On October 19 2013 02:50 Zild. wrote: Well I will try to cover everything you want to know. First of all python is a really good choice as a first language to learn, because it relies on higher level mechanisms compared to C for example, which means that you are quite close to the human language, opposed to binary (assembly). A lot of new programmers are taught programming using Python, you could actually try out the courses from Berkeley at http://inst.eecs.berkeley.edu/~cs61a/su12/ that cover a lot of Python and data structures. Practice is the key word and the time required depends on your work, I guess that if you do this on your spare time you could actually understand the basics quite quickly. I have no experience in bioinformatics so I can't tell the level of programming required but from what I have seen it is not trivial, but don't be afraid to apply and see what is expected of you.
I might as well help with some courses.
https://www.coursera.org/course/programming2
https://www.coursera.org/course/programming1
Check out some of the books with good reviews on Amazon; if you cover a couple of courses and read the books you should be all set. Just remember something that all programmers are taught: you are never "through" with programming. It's not something you can absolutely know; each and every day will be a new experience (the same goes for every scientific field, just not on such a daily basis, I would dare to say).
Bioinformatics is a fast growing field and I applaud your interest, wish you good luck!
|
From what I know, programming for natural sciences usually is either fairly basic like the aforementioned "implement this calculation", or high performance software that is run on supercomputers. And if it's written in python, it sure isn't the latter.
|
On October 19 2013 02:21 HumpingHydra wrote: Hey guys. I am a third year student in university getting my degree in biochemistry. My coop supervisor told me that it would be worth my while to learn a bit of python in my spare time if I would like to get a job within the realm of bioinformatics, and that it's not too difficult (I mentioned that I've never really done any programming). However due to the fact I have essentially never done any programming I am unsure if this is worth pursuing. Is there anyone in this thread who has dealt specifically with python having come from a background of zero programming whatsoever? How long will it take? If I plan to get some sort of coop job within the realm of bioinformatics, is there a chance I can learn a significant amount of python before applying for jobs in the fall and ultimately into summer? Any advice would be nice. Thanks guys.
I learned Python starting from zero when I was in high school, and I'm now teaching my girlfriend to program using the same... It's a great language to learn in, and there's even a great wikibook called the Non-Programmer's Tutorial for Python 3 that will get you up and running with everything you need to know to start using Python pretty extensively in a couple of weeks if you work at it. Yes, there's a chance you can learn enough Python for the jobs you're applying for by next summer, and even a chance you could learn enough to be impressive on resumes you send out over the next couple of months. Python can be picked up pretty quickly, don't be scared of it!
|
Are people really using Python over R or matlab for bioinformatics?
I know Python is more popular but I would have thought that the sheer advantages R has would have made it better.
I feel R handles math better without the need for a math library, since R is a statistical language: a language built for the analysis of data, whereas Python is general purpose. R handles updates much better and you don't have to worry about people sticking to Python 2.7 instead of going to Python 3.x. R has more consistent syntax for print statements. It also has more consistent naming conventions (e.g. Tkinter vs tkinter in Python 2->3). R's environment is set up automatically. Everyone uses the RGui or command shell to run programs and it's easy to copy and paste. Python has several IDEs and the defaults like Idlex are unstable and don't copy and paste data from the clipboard into the shell very well. You're set up with R from the start. R has a beautiful CRAN repository that stores and keeps data on every single package and its compatibility and has an easy-to-set-up installer. With Python you have no idea if your package is compatible with your other packages or your environment (god help you if you're using Windows). There are so many dependencies, and binary installers are often hard to find.
I would have thought there was more support and better documentation for statistical analysis in R than in Python.
Have you done any research to determine if people in bioinformatics are looking for python instead of R?
|
I'm not sure about bioinfo, but I'm taking an AI (Machine Learning) class and the instructor recommended using Octave instead of matlab. According to my roommate, who took this class before, Octave is pretty much matlab but more... 'open'?
|
On October 19 2013 07:36 obesechicken13 wrote: Are people really using Python over R or matlab for bioinformatics?
I know Python is more popular but I would have thought that the sheer advantages R has would have made it better.
I feel R handles math better without the need for a math library, since R is a statistical language: a language built for the analysis of data, whereas Python is general purpose. R handles updates much better and you don't have to worry about people sticking to Python 2.7 instead of going to Python 3.x. R has more consistent syntax for print statements. It also has more consistent naming conventions (e.g. Tkinter vs tkinter in Python 2->3). R's environment is set up automatically. Everyone uses the RGui or command shell to run programs and it's easy to copy and paste. Python has several IDEs and the defaults like Idlex are unstable and don't copy and paste data from the clipboard into the shell very well. You're set up with R from the start. R has a beautiful CRAN repository that stores and keeps data on every single package and its compatibility and has an easy-to-set-up installer. With Python you have no idea if your package is compatible with your other packages or your environment (god help you if you're using Windows). There are so many dependencies, and binary installers are often hard to find.
I would have thought there was more support and better documentation for statistical analysis in R than in Python.
Have you done any research to determine if people in bioinformatics are looking for python instead of R?
R is definitely used for sequencing and large data sets for the reasons you mentioned.
However, any time that heavy mathematics/statistics is not required Python is preferred because of its ease of use, support advantages, and popularity in general in comparison to R.
Scripting and basic things in Python are much easier, and afaik there are more libraries to work with. It's definitely slower, but when you don't need to do intensive math Python is probably a better choice. If you're doing genomic sequencing or other things that require the analysis of large data sets, then R would clearly be the better choice.
As a completely unrelated aside, I've had some job interviews lately and I must stress that anyone looking for any sort of software development or data analysis job should be well-versed in algorithms and data structures. An interesting question I got the other day was this:
You're given an array of numbers such that every element in the array is duplicated except for one. Provide an efficient algorithm to determine the singleton element.
I came up with a really shitty answer at first in about 30 seconds, and then I was encouraged to find a more efficient solution, which took me another 3-4 minutes. Pretty interesting problem.
|
On October 19 2013 15:55 wherebugsgo wrote: R is definitely used for sequencing and large data sets for the reasons you mentioned. However, any time that heavy mathematics/statistics is not required Python is preferred because of its ease of use, support advantages, and popularity in general in comparison to R. Scripting and basic things in Python are much easier, and afaik there are more libraries to work with. It's definitely slower, but when you don't need to do intensive math Python is probably a better choice. If you're doing genomic sequencing or other things that require the analysis of large data sets, then R would clearly be the better choice. As a completely unrelated aside, I've had some job interviews lately and I must stress that anyone looking for any sort of software development or data analysis job should be well-versed in algorithms and data structures. An interesting question I got the other day was this: You're given an array of numbers such that every element in the array is duplicated except for one. Provide an efficient algorithm to determine the singleton element. I came up with a really shitty answer at first in about 30 seconds, and then I was encouraged to find a more efficient solution, which took me another 3-4 minutes. Pretty interesting problem.
If you are working with really large data sets you are not going to use just your hard drive; you are going to want some kind of distributed system, so the mathematical speed of R suddenly becomes less of an advantage and the scripting advantage of Python becomes even more useful.
What was your answer?
|
What are your opinions about working as an online freelancer vs working in a firm/office environment (as a programmer)?
|
On October 19 2013 15:55 wherebugsgo wrote: You're given an array of numbers such that every element in the array is duplicated except for one. Provide an efficient algorithm to determine the singleton element.
Is that the exact wording of the question? Does "duplicated" mean that the duplicate and the original are in unrelated slots of the array, or are they adjacent? Can there be multiple sets of duplicates of the same number, as in 2, 4, 6 of the number? What range are the numbers from?
Then there is also the question of how one defines efficient. It's an inefficient use of programmer time if you don't just use some stupid algorithm that works and is easy to read and understand, unless that piece of code is a performance bottleneck that needs to be faster. But I'm going to assume that low execution time is the goal here anyways.
|
On October 19 2013 16:07 Sub40APM wrote: If you are doing really large data sets you are not going to use just your hard drive, you are going to want to use some kind of distributed system so the mathematical speed of R suddenly becomes not as advantageous and the scripting advantage of Python gets even more useful. What was your answer?
What scripting advantage?
|
On October 19 2013 16:33 spinesheath wrote: Is that the exact wording of the question? Does "duplicated" mean that the duplicate and the original are in unrelated slots of the array, or are they adjacent? Can there be multiple sets of duplicates of the same number, as in 2, 4, 6 of the number? What range are the numbers from? Then there also is the question how one defines efficient. It's an inefficient use of programmer time if you don't just use some stupid algorithm that works and is easy to read and understand. Unless that piece of code is a performance bottleneck that needs to be faster. But I'm going to assume that low execution time is the goal here anyways.
I paraphrased, but basically you can expect to encounter two of every element in the array except for one element. The array is unsorted and the duplicates can appear anywhere. There won't be multiple (read: more than 2) instances of the same number, though the answer would remain the same for even-number duplicates, i.e. you could have 4 of the same number and 2 of another and 6 of another and finding the singleton still works the same.
There is no specification on the range of the numbers, though you can assume that you won't need multiple data structures to represent a single number. i.e. if it makes it easier for you, assume they are 32 bit ints or whatever.
My first solution required O(nlogn) time and O(1) space. I pretty quickly realized that the time complexity can be improved to O(n), and there's a relatively simple way of doing that by adding a data structure that takes O(n) space, but the "brilliant" answer (as the interviewer described it; I'd call it elegant) requires no additional space and is also O(n) time.
Spoiler [my shitty answer]: Sort the array, then step through it in pairs until you find a pair whose two elements are not the same; the first element of that pair is the singleton. Sorting takes O(nlogn) time and stepping through the array takes O(n) time; O(nlogn + n) is in O(nlogn).
You could use radix sort and get an O(n) sort, I suppose.
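In case it helps, here's a rough Python sketch of the sorting approach. The function name is mine, and it assumes a non-empty list where every value except one appears an even number of times. Note that `sorted` makes a copy, so this version does not have the O(1) space of an in-place sort.

```python
def find_singleton_sorted(numbers):
    """Find the unpaired value by sorting and comparing neighbours in pairs."""
    ordered = sorted(numbers)  # O(n log n); returns a sorted copy of the input
    # After sorting, equal values sit next to each other, so pairs occupy
    # indices (0, 1), (2, 3), ... until the singleton breaks the pattern.
    for i in range(0, len(ordered) - 1, 2):
        if ordered[i] != ordered[i + 1]:
            return ordered[i]
    # Every pair matched, so the singleton is the largest (last) element.
    return ordered[-1]


print(find_singleton_sorted([4, 1, 2, 1, 2]))  # 4
```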
Spoiler [an approximately equally shitty answer]: Instead of sorting the array, maintain a hash table and a second sum variable. Iterate through the array and for each integer, check if it's in the hash table. If it isn't, then add it to the sum, and store it. If it is, then subtract it from the sum. When we've gone through the entire array, the singleton is the only element we never subtracted, thus the sum variable is the singleton.
This works in O(n) time since hash accesses are O(1), and we need O(n) space for the hash table.
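A quick sketch of the hash table version in Python, using a `set` as the hash table. The function name is made up, and this assumes every duplicated value appears exactly twice (with four copies of a value the sum trick would break).

```python
def find_singleton_hashed(numbers):
    """Find the unpaired value using a set of seen values and a running sum."""
    seen = set()   # O(n) extra space
    total = 0
    for n in numbers:
        if n in seen:
            total -= n      # second sighting cancels the first
        else:
            seen.add(n)
            total += n      # first sighting
    return total            # only the singleton was never cancelled


print(find_singleton_hashed([4, 1, 2, 1, 2]))  # 4
```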
Spoiler [the model answer]: Just iterate through the array and XOR everything together: A[0] XOR A[1] XOR ... XOR A[n-1], where n is the length of the array. The result of this computation is the singleton element.
This works since XOR is commutative and associative. a XOR a is just 0, and 0 XOR a is a. Since all of the elements are duplicated except one of them, all the duplicates will XOR out to 0 and then the singleton will be left at the end.
This holds true also for arrays in which you have even multiples of elements, not just duplicates, though it fails when there are odd multiples.
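For anyone who wants to try it, the XOR answer fits in a couple of lines of Python. This is just a sketch with a name I picked, and it assumes the elements are integers.

```python
from functools import reduce
from operator import xor


def find_singleton_xor(numbers):
    """XOR all elements together; equal pairs cancel to 0, leaving the singleton."""
    return reduce(xor, numbers, 0)  # O(n) time, no extra data structure


print(find_singleton_xor([4, 1, 2, 1, 2]))        # 4
print(find_singleton_xor([9, 9, 9, 9, 3, 5, 5]))  # 3 (even multiples also cancel)
```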
If you guys like this sort of stuff or need interview practice, here's a few more interesting problems I encountered at career fairs/interviews and company visit sessions:
1. This problem is similar to the one I posed before, but it has a pretty major twist.
The problem statement is to find the missing integer in the array with the following rules:
a. An array A contains every integer from 0 to N except for one of them.
b. The elements of the array A may appear in any order.
c. We can't actually access an entire element of the array all at once. In other words, if you index the array you won't get back the whole element.
Spoiler: Normally, if you access an array element at index i, you would call A[i] and you would be returned the element at the ith position in A. In this problem you can't do that. Instead, what you can do is ask for the jth bit of the ith element of the array. For example, I could ask for the 4th bit of the first element of A, and I would be returned the bit in the 4th position of the element ordinarily given by A[0].
d. The algorithm must return the missing integer by looking through at most O(N) bits.
2. Give an approach (or code) to writing an equivalent of the C function atoi. If you're not a C programmer, atoi is a function that converts an ASCII string to an integer. Describe the limitations or shortcomings of your approach (or code) in as much detail as you can.
Spoiler: I've also been asked how to implement strlen.
3. Suppose we have a function that computes rand2() (i.e. a fair coin toss: it outputs 1 or 2 with equal probability). Design an algorithm that uses this function to simulate a fair die roll, i.e. a function that should output a random number from 1 to 6 with equal probability.
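Spoiler [one possible approach to problem 3]: This is only a sketch of a common technique (rejection sampling), not necessarily the answer the interviewers were looking for. The `rand2` below is a stand-in I wrote with Python's `random` module so the code runs; `rand6` is a made-up name.

```python
import random


def rand2():
    """Stand-in for the given fair coin: returns 1 or 2 with equal probability."""
    return random.randint(1, 2)


def rand6():
    """Simulate a fair die using only rand2()."""
    while True:
        # Three coin tosses form a 3-bit number from 0 to 7, each equally likely.
        value = 0
        for _ in range(3):
            value = value * 2 + (rand2() - 1)
        # Keep 0..5 and map to 1..6; throw away 6 and 7 and toss again.
        if value < 6:
            return value + 1


print([rand6() for _ in range(10)])
```

Each of the six accepted outcomes is equally likely, and since an attempt is rejected only 2 times out of 8, the loop needs about 1.33 attempts on average.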
|
I wouldn't call that either brilliant or elegant. Sure, you should know what the operation does, but that doesn't mean you should ever use it unless you're forced to by run time or other constraints. That's the kind of look-I'm-so-clever code I tried to write years ago. Now when I realize I wrote something like that I just delete the whole thing and replace it with a piece of code that doesn't require any bitwise operations on values that aren't explicitly declared as bitflags.
And god knows what will happen when you one day decide that you want floats instead of ints and forget to change that code.
Spoiler: Or am I the only one who finds that XOR barely ever is appropriate?
|
On October 19 2013 20:50 spinesheath wrote: I wouldn't call that either brilliant or elegant. Sure, you should know what the operation does, but that doesn't mean you should ever use it unless you're forced to by run time or other constraints. That's the kind of look-I'm-so-clever code I tried to write years ago. Now when I realize I wrote something like that I just delete the whole thing and replace it with a piece of code that doesn't require any bitwise operations on values that aren't explicitly declared as bitflags. And god knows what will happen when you one day decide that you want floats instead of ints and forget to change that code. Spoiler: Or am I the only one who finds that XOR barely ever is appropriate?
Why wouldn't it work on floats?
|
On October 19 2013 21:01 gedatsu wrote: Why wouldn't it work on floats?
I think this precise example works (because you're checking for equality, and floats are still equal if they have the same bits), but the general rule is that it's bad practice to use bit operations on numeric values. For example, if you divide an int by 2 or check whether it's even using bit shifting or masking, it won't work if you change the int to a float.
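Here's a small Python illustration of that last point. The helper names are made up. Python simply refuses bitwise operators on floats; other languages may behave differently (for example by failing to compile), so take this as one language's behavior only.

```python
def halve_with_shift(x):
    """Divide by 2 using a right shift; only meaningful for integers."""
    return x >> 1


def is_even_with_mask(x):
    """Check evenness by testing the lowest bit; only meaningful for integers."""
    return (x & 1) == 0


print(halve_with_shift(10))   # 5
print(is_even_with_mask(10))  # True

# Switching the value to a float breaks the bit tricks.
try:
    halve_with_shift(10.0)
except TypeError as error:
    print("float rejected:", error)

# Ordinary arithmetic keeps working for both types.
print(10 // 2, 10.0 / 2)
```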
|
On October 19 2013 09:35 NB wrote: Im not sure about bioinfo but im taking an AI(Machine Learning) class and the instructor recommend to use Octave instead of matlab. According to my roommate who took this class before, Octave is pretty much matlab but more... 'open'?
It's more or less the same, except that Matlab has many more libraries, functions, etcetera. For your basic work in the course Octave will suffice (I take it by professor you mean Dr. Ng from the course I recommended?), since the license for using Matlab costs A LOT!
|
On October 19 2013 21:01 gedatsu wrote: Why wouldn't it work on floats?
Well, if you try to consider this as a not completely made up problem, then the array would have to come from somewhere. Chances are, the numbers aren't created by exactly duplicating all but one value in another array. You're usually not guaranteed that the same float value always has the same bit pattern. Am I even guaranteed that after I assign one float to another, bitwise equality for the two floats holds? I guess it's likely, and probably stated so in the language specifications, but with denormalization and such there might be alternative implementations.
Generally, I don't like making assumptions about an implementation even on basic data types like integers, and especially not floats, unless I have a good reason. I prefer to treat numbers as numbers where I have normal arithmetic operations like addition and multiplication. If I use an integer as a string of bitflags, then I prefer to use bitwise operations only and never treat it as a number.
Sure, in this special case the XOR is fine as it's just one simple line of code. But in reality things often get more complex and then I'd prefer not having to think about the details of XOR in special contexts.
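To make the float concern concrete, here is a small sketch showing that numeric equality and bitwise equality are not the same thing. It assumes Python floats are IEEE 754 doubles, which is the case on common platforms; `float_bits` is a helper name I made up.

```python
import struct


def float_bits(x):
    """Return the 64-bit pattern of a float as an integer (assumes IEEE 754 doubles)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


# Positive and negative zero compare equal but have different bits,
# so XOR-ing their bit patterns would not cancel to 0.
print(0.0 == -0.0)                           # True
print(float_bits(0.0) == float_bits(-0.0))   # False

# NaN is the opposite case: identical bits, yet never equal to itself.
nan = float("nan")
print(nan == nan)                            # False
print(float_bits(nan) == float_bits(nan))    # True
```

So an XOR over raw bit patterns answers "which bit pattern is unpaired", which is not always the same question as "which number is unpaired".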
|