|
United States24615 Posts
Often in science we use equipment to collect data, and then test how well the data fits a theoretical equation. This is easy to do in Microsoft Excel when the theoretical curve is linear, parabolic, or one of the several other fitting functions built into the 'trendline' feature. However, how do you fit data to a curve when its equation is not one of the built-in functions?
First you need to install the Solver add-in, if it isn't already activated in your copy of Excel. Here is Microsoft's guide for how to do this: http://office.microsoft.com/en-us/excel-help/load-the-solver-add-in-HP010021570.aspx . In the 2007 version of Excel, you know the Solver is already available if you see it all the way on the right side of the 'Data' tab, next to 'Data Analysis.'
I will explain this procedure by using an example, and explaining each step along the way. Calculating a standard deviation is very simple, so feel free to review how it's done if you want some mathematical understanding of why I use the steps I do. Resources for standard deviation are readily available online, and I don't want to make this guide too long (nor am I an expert in statistics by any means).
Suppose we have the following data:
This data was measured using laboratory equipment. I am going to avoid addressing uncertainty/error to keep this simple, so let's assume we have the same uncertainty attached to each data point (I will also neglect vertical error bars for this demonstration). According to scientific theory, the equation that should relate the time t in seconds and the data f(t) in meters is f(t) = A*t + B/t, where A and B are constants to be determined.
I picked this formula because it is not available as a built-in Excel trend-line, but is a simple formula to use. I have no idea what the practical applications of this particular formula are.
The question I will show you how to solve is: For what values of A and B does the data best fit the theory? The first thing to do (after placing the data, shown above, into a spreadsheet) is to dedicate cells to house the values of A and B (if the equation was more complicated, you would need more than two cells). Here is my spreadsheet so far:
The next column over will be the 'theoretical' data: in other words, the f(t) values you get for each value of t using the equation shown above. These values depend on the values of A and B, so I will pick some arbitrary values for A and B to allow cells C2 to C11 to populate. My spreadsheet looks like this:
The formula for cell C2 is: =$B$13*A2+$B$14/A2
The formula for cell C3 is: =$B$13*A3+$B$14/A3
and so on.
Notice the dollar signs ($) before the column letter and row number of cells B13 and B14. This is called an 'absolute reference.' In other words, when you drag cell C2 down to populate cells C3 through C11 (or use the fill option), you want "A2" in the formula to change to A3, A4, etc., but you always want cells B13 and B14 to stay the same. The way to tell Excel not to shift to B14 and B15 when calculating the value of cell C3 is to make every mention of B13 and B14 in the formula for cell C2 absolute before populating cells C3 through C11. You do this either by manually typing the dollar signs as shown above, or by hitting the F4 key directly after typing the name of the cell you want to lock. For example, the way I typed the formula for C2 was:
- =
- B13
- F4 key
- *
- A2
- +
- B14
- F4 key
- /
- A2
- enter
Unless you picked really good values for A and B, the values in the data column and the theory column are probably way off. The next column you will need is a 'difference' column, which calculates the difference between the theoretical value and the data value for each time. For example, cell D2's formula is "=C2-B2" (doing "=B2-C2" would work just as well, since we are about to square the result and the sign doesn't matter). The final column, which I will call "square", squares the values of column D. For example, the formula for cell E2 should be "=D2^2". The spreadsheet now looks like this:
Next we need to add up the values in the square column. I put the word "sum" in cell D12 and then put the actual sum in cell E12 with the formula "=SUM(E2:E11)". It looks like this:
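If you like to sanity-check spreadsheet work outside Excel, here is a minimal Python/NumPy sketch of the same bookkeeping: columns C, D, and E plus the sum cell. The numbers below are placeholders rather than the data from my screenshots, so substitute your own measurements.

    import numpy as np

    # Placeholder measurements; substitute your own t (column A) and data (column B).
    t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
    data = np.array([12.0, 9.0, 9.5, 10.5, 12.0, 13.5, 15.0, 17.0, 19.0, 21.0])

    A, B = 1.0, 1.0            # the two fitting constants (cells B13 and B14)
    theory = A * t + B / t     # column C: f(t) = A*t + B/t
    diff = theory - data       # column D: theory minus data
    square = diff ** 2         # column E: squared differences
    sse = square.sum()         # cell E12: the number the Solver will minimize
    print(sse)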
Now click on the 'solver' located under the Data tab of Excel. Use these settings:
The target cell, in this case E12, needs to be minimized by adjusting the values of A and B (cells B13 and B14). Click 'Solve' and it will give you the best result it could find. I plotted the data and the theoretical data:
So in conclusion, with the given data and expected theoretical curve, values of A=1.74 and B=9.53 best fit the data. I won't go into determining how well the curve fits the data, as that gets much more complicated, but it is also much more useful!
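For anyone who wants to double-check what the Solver finds, here is a rough sketch of the same minimization in Python with scipy.optimize.minimize. The data values are placeholders again, so the numbers this prints will of course not be 1.74 and 9.53.

    import numpy as np
    from scipy.optimize import minimize

    # Placeholder measurements; substitute your own columns of t and data.
    t = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])
    data = np.array([12.0, 9.0, 9.5, 10.5, 12.0, 13.5, 15.0, 17.0, 19.0, 21.0])

    def sse(params):
        A, B = params
        residual = A * t + B / t - data   # theory minus data for every row
        return np.sum(residual ** 2)      # the 'sum of squares' cell

    # Start from an arbitrary guess, just like typing starting values into B13/B14.
    result = minimize(sse, x0=[1.0, 1.0])
    A_fit, B_fit = result.x
    print(A_fit, B_fit, result.fun)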
As a final tip, how would you fit data to an equation of a sine wave over x^2? I would do it as such:
f(x) = A*sin(B*x+C)/x^2+D
I would have the solver minimize the squares column by changing the values of A, B, C, and D.
Notice I needed more fitting parameters that time! This topic can actually be fun, so play around with it!
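Here is a hedged sketch of that last fit using scipy.optimize.curve_fit, with made-up data standing in for real measurements. With a sine in the model, the starting guesses matter a lot.

    import numpy as np
    from scipy.optimize import curve_fit

    def model(x, A, B, C, D):
        # f(x) = A*sin(B*x + C)/x^2 + D
        return A * np.sin(B * x + C) / x**2 + D

    # Made-up data; substitute your own measurements.
    x = np.linspace(1.0, 10.0, 50)
    y = model(x, 2.0, 1.2, 0.3, 4.0) + np.random.normal(0.0, 0.05, x.size)

    # Rough starting guesses for A, B, C, D; a poor guess can land in a local
    # minimum or fail to converge, so it pays to think about reasonable values.
    p0 = [1.0, 1.0, 0.0, np.mean(y)]
    popt, pcov = curve_fit(model, x, y, p0=p0)
    print(popt)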
   
|
You could also use a program like SciDaVis or LabPlot (both are open source but unmaintained). In fact, anything besides basic fitting (the usual functions) should be done with a dedicated program, because you need to know which algorithm was used to do the fitting.
Still, as a how-to for basic plot fitting, this is well written.
|
I do these types of fits with Mathematica, though it kinda takes the "magic" out of it. In the case of your example, I would give the command ("data" is the array of datapoints): Fit[data, {x, 1/x}, x] and the program will obtain the best coefficients for a fit with a linear combination of x and 1/x (so a * x + b * 1/x).
A similar function is available for the non-linear fitting function in your final example.
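Outside Mathematica, the same linear-combination fit can be done with plain linear algebra, since a * x + b * 1/x is linear in the coefficients. A rough NumPy sketch with made-up data points:

    import numpy as np

    # Made-up data points; the model a*x + b/x is linear in the coefficients a and b.
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([11.2, 8.9, 9.4, 10.6, 12.1])

    # Design matrix with one column per basis function (x and 1/x).
    M = np.column_stack([x, 1.0 / x])
    (a, b), residuals, rank, sv = np.linalg.lstsq(M, y, rcond=None)
    print(a, b)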
|
infinity21
Canada6683 Posts
I wouldn't recommend limiting yourself to a certain formula without strong theoretical reasons or a good intuition about the underlying structure of the data. If you want the line of best fit, you could look at something like LOESS to do some smoothing.
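As a rough illustration of what that looks like, the lowess smoother in statsmodels does this kind of smoothing (made-up data, and the frac parameter is just a starting point to play with):

    import numpy as np
    from statsmodels.nonparametric.smoothers_lowess import lowess

    # Made-up noisy data with no assumed functional form.
    x = np.linspace(0.0, 10.0, 100)
    y = np.sin(x) + np.random.normal(0.0, 0.3, x.size)

    # frac is the fraction of the data used for each local fit (larger = smoother).
    smoothed = lowess(y, x, frac=0.3)   # returns sorted (x, fitted y) pairs
    x_s, y_s = smoothed[:, 0], smoothed[:, 1]
    print(y_s[:5])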
|
United States24615 Posts
Good points. From a learning perspective I find doing it all manually has benefits, but I doubt most scientists do it this way.
|
That's not the most efficient way of doing it in Excel. Excel's built-in solver algorithm is pretty shitty at coming up with points that aren't near the "initial guesses" and can often send you to a non-global min/max.
I don't recall the exact method, but you essentially want to divide the error term by one of the variables to get each individual error term close to one.
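One common way to do something along those lines (not necessarily the exact method I'm thinking of) is to minimize relative rather than absolute squared errors, i.e. divide each residual by the measured value before squaring. A quick sketch with made-up numbers:

    import numpy as np
    from scipy.optimize import minimize

    # Made-up example data.
    t = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    data = np.array([15.0, 28.0, 44.0, 61.0, 80.0])

    def relative_sse(params):
        A, B = params
        theory = A * t + B / t
        rel = (theory - data) / data     # each error term scaled to be of order one
        return np.sum(rel ** 2)

    result = minimize(relative_sse, x0=[1.0, 1.0])
    print(result.x)

In the spreadsheet version, the equivalent change is to divide the column D differences by the column B measurements before squaring.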
|
On December 06 2012 04:42 micronesia wrote: Good points. From a learning perspective I find doing it all manually has benefits, but I doubt most scientists do it this way.
It gets a bit redundant to do all the fitting manually after doing it a bunch of times. So I like the methods of automation that are at my disposal.
The real fun is when you don't know which function fits the underlying data well. Guessing the correct form of the function is interesting. That is, if you don't go for the easy way out and just pick the polynomial of degree n-1 (for n datapoints) that perfectly fits your data ^^
|
On December 06 2012 04:08 micronesia wrote: As a final tip, how would you fit data to an equation of a sine wave over x^2? I would do it as such:
f(x) = A*sin(B*x+C)/x^2+D
I would have the solver minimize the squares column by changing the values of A, B, C, and D.
This is fine; you need all four of those constants (period adjustment, amplitude adjustment, left/right translations, up/down translations), and without more insight into whatever quantity you're trying to model there's no reason to include other terms.
Edit: I'm sure you know this, but just in case, if the whole thing stabilizes around some nonzero value, keep it as you wrote it, but if the whole thing eventually drops to [near] zero, put the D on top of the fraction (i.e. (Asin(Bx+C)+D)/x^2).
|
if u wanted to do it by hand how would you do it
|
United States24615 Posts
On December 06 2012 06:04 snively wrote: if u wanted to do it by hand how would you do it
Calculating the sum of the squares of the differences wouldn't be bad by hand (for a reasonable number of data points), but actually minimizing the sum by modifying the values of A and B (what the solver does) would be quite difficult to do by hand, and I don't know how you would.
|
Sounds like a function using the least square method... I think a lesson on how it works and a way to write it would make more sense since you seldom know the function of the measured data. I don't really agree on skipping the part of how accurate the result is, as that's all the reasoning behind using the technique. I would also believe this is the most accurate result you will get as you have minimized the error as much as possible without any oscillations between the points, which is usually a problem as the error goes to zero.
|
United States24615 Posts
On December 06 2012 07:05 ZpuX wrote: Sounds like a function using the least square method... I think a lesson on how it works and a way to write it would make more sense since you seldom know the function of the measured data. I often find myself knowing the formula... of course it depends on what you are doing. I can't really justify doing a lesson on how it works without already doing pretty much everything in the OP, so you are implying I would have been better off covering more material or doing nothing, which I don't agree with (this may not have been your intention). This has its uses and limitations, and I was up front about that.
I don't really agree on skipping the part of how accurate the result is, as that's all the reasoning behind using the technique.
For people who really want/need to understand the theory here and how to apply it to their experiment, they are going to need to branch away from what I can explain to them (I'm no statistician either). The purpose of this is to provide a mini knowhow-esque guide to how you can use excel to minimize the deviation of a function to a list of data points, without having a program do all of the work for you (of course the solver itself is utilized in this method).
I encourage you to write a guide on the things I haven't addressed, though!
|
On December 06 2012 07:17 micronesia wrote: I often find myself knowing the formula... of course it depends on what you are doing. I can't really justify doing a lesson on how it works without already doing pretty much everything in the OP, so you are implying I would have been better off covering more material or doing nothing, which I don't agree with (this may not have been your intention). This has its uses and limitations, and I was up front about that. For people who really want/need to understand the theory here and how to apply it to their experiment, they are going to need to branch away from what I can explain to them (I'm no statistician either). The purpose of this is to provide a mini knowhow-esque guide to how you can use excel to minimize the deviation of a function to a list of data points, without having a program do all of the work for you (of course the solver itself is utilized in this method). I encourage you to write a guide on the things I haven't addressed, though!
Fair enough. After both teaching and using quite a lot of numerical methods, both curve fitting and function solving, it just felt so wrong to see you explain the Excel steps in such detail while not discussing the actual methods at hand.
|
infinity21
Canada6683 Posts
On December 06 2012 07:17 micronesia wrote: I often find myself knowing the formula... of course it depends on what you are doing. I can't really justify doing a lesson on how it works without already doing pretty much everything in the OP, so you are implying I would have been better off covering more material or doing nothing, which I don't agree with (this may not have been your intention). This has its uses and limitations, and I was up front about that. For people who really want/need to understand the theory here and how to apply it to their experiment, they are going to need to branch away from what I can explain to them (I'm no statistician either). The purpose of this is to provide a mini knowhow-esque guide to how you can use excel to minimize the deviation of a function to a list of data points, without having a program do all of the work for you (of course the solver itself is utilized in this method). I encourage you to write a guide on the things I haven't addressed, though!
Can you give some examples of scenarios where you know the right formula to use? Is this more common in physics and/or in the classroom?
I agree that you have to draw the line somewhere. The section for when you don't know the underlying function is an entire field of study in itself. You could also play around with different cost functions (e.g. absolute difference vs. square difference) and other underlying functions to show the bias/variance trade-off (i.e. under/overfitting).
On December 06 2012 07:05 ZpuX wrote: Sounds like a function using the least square method... I think a lesson on how it works and a way to write it would make more sense since you seldom know the function of the measured data. I don't really agree on skipping the part of how accurate the result is, as that's all the reasoning behind using the technique. I would also believe this is the most accurate result you will get as you have minimized the error as much as possible without any oscillations between the points, which is usually a problem as the error goes to zero.
There are functions that have many local optima and finding a global minimum heuristically (or however Excel does it) would be impossible, so your last point is not necessarily true.
|
United States24615 Posts
On December 06 2012 08:13 infinity21 wrote: Can you give some examples of scenarios where you know the right formula to use? Is this more common in physics and/or in the classroom?
Most of the labs I did in physics classes in college required me to experimentally determine constants and compare them to known values. In these cases, you know the type of equation based on the theory, and use the experiment simply to determine the constants within the equations. For example...
Measure the electric field strength on the axis of a charged ring as a function of distance from the center of the ring.
You know from theory that the electric field looks like A*x / (x^2+B^2)^(3/2), and simply compare the values of A and B to what you would expect based on the amount of charge, size of ring, etc.
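Spelling that out (it's the standard textbook result, so you can see what the constants should be): E(x) = Q*x / (4*pi*epsilon_0 * (x^2 + R^2)^(3/2)), so the fitted A should come out near Q/(4*pi*epsilon_0) and B near the ring radius R.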
Of course there are many other applications of these techniques besides verifying formulas.
|
infinity21
Canada6683 Posts
Yeah, I just wanted to get a sense of what kind of scenarios you were using this in. I figured your experience with this was physics related.
To verify that your model is a good fit, you'll have to plot the residuals and test for homoscedasticity, use it for prediction on a different set of data and measure the error compared to other functions, etc.
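As a rough sketch of the first check, a residual plot (made-up numbers; assume the fitted values came out of whatever fit you ran):

    import numpy as np
    import matplotlib.pyplot as plt

    # Assume these came out of whatever fit you did (made-up numbers here).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
    measured = np.array([12.1, 9.0, 9.6, 10.4, 12.2, 13.9])
    fitted = np.array([11.8, 9.3, 9.5, 10.6, 12.0, 14.1])

    residuals = measured - fitted
    plt.scatter(x, residuals)
    plt.axhline(0.0, color="gray")
    plt.xlabel("x")
    plt.ylabel("residual")
    # Look for structure: a curved pattern or a funnel shape (non-constant spread)
    # suggests the model or the constant-variance assumption is off.
    plt.show()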
I'm a statistics major so I could talk about this stuff for hours on end lol
|
I picked this formula because it is not available as a built-in Excel trend-line, but is a simple formula to use. I have no idea what the practical applications of this particular formula are.
For your random knowledge, one practical application of the formula relates the viscosity of a polymer, which you measure with one of these cool things, to the average molecular weight of the polymer chain.
So... uh..... science!
|
On December 06 2012 08:13 infinity21 wrote: Can you give some examples of scenarios where you know the right formula to use? Is this more common in physics and/or in the classroom?
I know micronesia already answered your post, but this is more common than you might think in any sort of real-world modeling. The "standard" way to model a system (say, a population) is to write down a whole bunch of things that affect how it changes over time -- i.e. it grows proportional to this, shrinks proportional to this, grows inversely proportional to that, has an upper limit somewhere around here, and is stable when this other quantity is small -- and add them all up to yield a rough guess at what the derivative of that population with respect to time is. Using either knowledge of solutions to similar differential equations, a computer algebra system, or black magic, turn this into an expression for the population size as a function of time. Of course, you won't just get a function of time, you'll get a function of time and all those proportionality constants and other parameters. That's exactly the situation that the OP described. From there, you play around with possible values of all those constants, and settle on using the ones that produce curves that more or less appear to match your data. If you can't find any, reexamine the assumptions you made about the derivative and start over with the appropriate modifications.
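To make that concrete with the simplest textbook case: assume the population grows proportional to its size but is capped by a carrying capacity K. That assumption gives
dP/dt = r*P*(1 - P/K)
and solving it gives
P(t) = K / (1 + ((K - P0)/P0) * exp(-r*t))
where P0 is the starting population. The constants r, K, and P0 are exactly the knobs you would then tune against your data, the same way the OP tunes A and B.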
|
infinity21
Canada6683 Posts
On December 06 2012 10:15 Iranon wrote: I know micronesia already answered your post, but this is more common than you might think in any sort of real-world modeling. The "standard" way to model a system (say, a population) is to write down a whole bunch of things that affect how it changes over time -- i.e. it grows proportional to this, shrinks proportional to this, grows inversely proportional to that, has an upper limit somewhere around here, and is stable when this other quantity is small -- and add them all up to yield a rough guess at what the derivative of that population with respect to time is. Using either knowledge of solutions to similar differential equations, a computer algebra system, or black magic, turn this into an expression for the population size as a function of time. Of course, you won't just get a function of time, you'll get a function of time and all those proportionality constants and other parameters. That's exactly the situation that the OP described. From there, you play around with possible values of all those constants, and settle on using the ones that produce curves that more or less appear to match your data. If you can't find any, reexamine the assumptions you made about the derivative and start over with the appropriate modifications.
Yeah, I remember doing something like this in an intro to differential equations class. There are certainly scenarios in which known universal relationships between variables exist (e.g. physics) and scenarios in which certain assumptions, whether or not they're true, can help you model (e.g. population growth/decay).
For statistics, you rarely have these strong known/assumed properties so the focus is more on finding these underlying structures rather than knowing/assuming it from the get-go.
|