Introductory data analysis with Jupyter Notebooks

Specialized IPython/Jupyter Notebook material developed by Rebeckah Fussell, Megan Renz, Philip Krasicky, Robert Fulbright, Jr., and Natasha Holmes - Published July 8, 2022

DOI: 10.1119/PICUP.Exercise.DataAnalysis

Would you like your students to understand how to interpret the data they collect in experiments? Would you like them to discern what it means for two different measurements to be distinguishable? Would you like them to comprehend what it means to fit a set of data to a mathematical model? Here is a set of tutorials that introduce universally important concepts in the analysis of physics lab data and provide practice in using them: uncertainty and how to calculate it, determining the distinguishability of data sets, fitting linear data using weighted chi-squared, ethics in collecting and interpreting data, data linearization, and uncertainty propagation. These activities also introduce basic principles of interpreting and editing code for data analysis. Exercises contain a mixture of short-answer questions and questions in which students are asked to write and edit short snippets of code. The overall experience is intended to provide a foundation of experience developing and using standard tools for analyzing and interpreting real-world scientific experiments. We would like to acknowledge Cameron Flynn for his edits and comments on these tutorials and thank the Lab Teaching Assistants who helped us debug them.
Subject Areas Experimental / Labs and Programming Introductions
Levels First Year and Beyond the First Year
Specialized Implementation IPython/Jupyter Notebook
Learning Objectives
* Define basic programming structures, including variables, arrays, functions, and package imports (**Exercise 0**)
* Create a CSV file of lab data and load it into Python (**Exercises 0 and 5**)
* Describe the properties of standard deviation and standard uncertainty of the mean, and how statistical uncertainty interacts with instrumental uncertainty (**Exercise 1**)
* Make comparisons between two measurements with uncertainties (using $t^\prime$ to calculate distinguishability) (**Exercise 2**)
* Identify strategies to avoid both conscious and unconscious unethical behaviors in physical measurement and data analysis (**Exercise 3**)
* Describe the properties of weighted $\chi^2$ and assess goodness of fit using residuals plots (**Exercise 4**)
* Use our autoFit function to fit linear data and calculate uncertainty in the slope and intercept of the line (**Exercise 5**)
* Use code to calculate mean, standard deviation, standard uncertainty of the mean, $t^\prime$, and weighted $\chi^2$ and conduct their own data analysis (**Exercise 6**)
* Apply the $t^\prime$ and $\chi^2$ analysis tools in multiple contexts (**Exercise 6**)
* Linearize data from exponential and power law relationships using semi-log and log-log plots (**Exercise 7**)
* Propagate uncertainty through a formula and use code to perform uncertainty propagation (**Exercise 8**)
# Exercise 1: Uncertainty

In this tutorial we will learn about statistical tools for understanding repeated measurements (for example, a data set generated by repeatedly measuring the period of a pendulum without any changes to the lab setup).

**Question 1:** Generate 25 data values between 0 and 10 at random from a normal distribution and plot the histogram. Repeat this two more times with different data within the same range. How does the histogram change each time you plot new random data? What stays the same?

When we take the same measurement over and over, we will often find some natural variation in the data. If the variation is a result of statistical uncertainty, we can characterize the width of the statistical variation with the **standard deviation**, a statistical measure of the uncertainty in a single measurement. The standard deviation characterizes the spread in the distribution of individual values due to statistical uncertainty in the measurement process. It is calculated with the following formula:

$\sigma = \sqrt{\frac{1}{N-1} \sum_{i = 1}^{N} (x_i - \bar{x})^2}$

where $N$ is the number of measurements, $x_i$ is the $i^{th}$ value of the measurement $x$, and $\bar{x}$ is the mean.

**Question 2:** Explain how the formula for $\sigma$ takes into account each of these desired properties:

**a)** We want to incorporate all available points into the calculation.
**b)** The standard deviation should stay approximately the same as we add more points with the same level of variation.
**c)** The standard deviation should increase as we measure more points further away from the mean.
**d)** The units of standard deviation should be the same as the units of the measurement.
**e)** The standard deviation should treat values on either side of the mean the same way.

**Question 3:** Calculate the standard deviation of each of the data sets you generated in Question 1. What happened to the value of the standard deviation for each of your random data sets?
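As a sketch of how Questions 1 and 3 might be approached in a notebook (assuming NumPy and Matplotlib are available; the mean and width chosen here are illustrative, not prescribed by the exercise):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng()

# Draw 25 values from a normal distribution sitting inside the 0-10 range
# (the mean of 5 and width of 1.5 are illustrative choices).
data = rng.normal(loc=5.0, scale=1.5, size=25)

plt.hist(data, bins=10)
plt.xlabel("Measured value")
plt.ylabel("Count")
plt.show()

# Sample standard deviation with the N-1 correction from the formula above
sigma = np.std(data, ddof=1)
print(sigma)
```

Re-running this cell draws a fresh data set each time, which is one way to explore what changes between histograms and what stays the same.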
**Question 4:** Look back at the histograms you generated and the formula for standard deviation. Why can the standard deviation $\sigma$ also be viewed as an estimate of the uncertainty in any single measured value? Hint: consider the averaging behavior of the equation for standard deviation.

**Question 5:** If we added 10 more measurements to our data set, the standard deviation $\sigma$ would not necessarily get larger. Why not?

Although we calculate the standard deviation from our measurements, we use it as a representation of the physical phenomenon we are measuring. After taking repeated measurements of the same phenomenon, we expect that 68\% of our measurements will fall within one standard deviation of the mean.

A single measurement is informative, but we gain much more information from the mean of a set of multiple measurements. The mean of multiple measurements, therefore, should have a smaller uncertainty than a single measurement. We define the uncertainty in the mean as:

$\delta \bar{x} = \frac{\sigma}{\sqrt{N}}$

where $\sigma$ is the standard deviation and $N$ is the number of measurements. This definition rewards you for taking more data, so that the uncertainty in the mean of a collection of many measurements is smaller than the uncertainty in any single measurement.

Note: Read the symbol $\delta$ as "uncertainty in". As discussed above, $\bar{x}$ is the symbol for the mean, so $\delta \bar{x}$ should be read as "uncertainty in the mean". It does *not* mean "uncertainty times the mean".

**Question 6:** Say you have taken *m* measurements (where *m* is some arbitrary number). If you take one hundred times as many measurements (i.e., $100m$), how much smaller is the uncertainty in the mean compared to when you had only *m* measurements?

**Question 7:** Calculate the standard uncertainty of the mean of one of the data sets you generated in Question 1.
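In code, the standard uncertainty of the mean follows directly from the standard deviation; a minimal sketch assuming NumPy (the seeded data set here is a stand-in for one of yours):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(5.0, 1.5, size=25)   # stand-in for one of your data sets

sigma = np.std(data, ddof=1)               # standard deviation
mean_unc = sigma / np.sqrt(len(data))      # standard uncertainty of the mean
print(np.mean(data), sigma, mean_unc)

# Rearranging the definition gives N = (sigma / delta_xbar)**2: for example,
# shrinking a 0.1 s spread down to a 0.005 s uncertainty of the mean needs
# (0.1 / 0.005)**2 = 400 repeated measurements.
n_needed = (0.1 / 0.005) ** 2
print(n_needed)
```

The rearranged form in the final lines is the same scaling you will need when estimating how many trials a target uncertainty requires.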
Is the measurement of the uncertainty larger or smaller than the standard deviation? Explain whether or not that makes sense.

**Question 8:** Generate 7 more histograms of 25 random data points each in the range of 0 to 10 from a random distribution (so you should have 10 data sets and histograms in total). Calculate the mean and standard uncertainty of the mean of each data set. Finally, calculate the standard deviation of the 10 means from your 10 data sets.

**Question 9:** How does the standard deviation of the list of means compare to the standard uncertainty of each of the means (round each to 1 digit), and why does that make sense?

Because the uncertainty of the mean $\delta \bar{x}$ is less than the uncertainty in each individual measured value $\sigma$ by a factor of $1/\sqrt{N}$, we can reduce the uncertainty in our best estimate of a measured quantity by taking more and more repeated measurements.

**Question 10:** Generate a histogram from a random distribution with 900 repeated trials between values of 0 and 10. Plot the histogram and calculate the mean, standard deviation, and standard uncertainty of the mean. How do the mean, standard deviation, and standard uncertainty of the mean of this data set compare to those of the data sets with 25 repeated trials? Explain in your own words why these results make sense.

In comparing this histogram with the original histogram for 25 trials, three main features may be apparent:

i) The mean appears to be roughly the same as it was for 25 trials.
ii) The standard deviation $\sigma$ appears to be roughly the same as it was for 25 trials. The additional trials have simply "filled out" the distribution more smoothly so that the histogram has a more consistent shape.
iii) The uncertainty of the mean $\delta \bar{x}$ is noticeably less than it was for 25 trials.

Recall also that each of the measurements was an integer value, so we can say that the instrumental precision of these measurements was limited to $\pm 0.5$.
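A sketch of the 25-trial versus 900-trial comparison in code, assuming NumPy (the underlying distribution here is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(2024)

# Same underlying distribution, 25 trials vs. 900 trials
small = rng.normal(5.0, 1.5, size=25)
large = rng.normal(5.0, 1.5, size=900)

for label, d in (("25 trials", small), ("900 trials", large)):
    sigma = np.std(d, ddof=1)
    print(label, np.mean(d), sigma, sigma / np.sqrt(len(d)))
```

The means and standard deviations of the two sets should come out similar, while the uncertainty of the mean of the 900-trial set is roughly $\sqrt{900/25} = 6$ times smaller.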
**Question 11:** How does the uncertainty of the mean of our data set with 900 measurements compare with the instrumental precision of the measuring instrument?

**Question 12:** What does this say about the possibility of "beating instrumental precision" by taking repeated measurements?

**Question 13:** Write the best estimate for your measurement from these 900 trials in the form $X = \bar{x} \pm \delta \bar{x}$.

**Question 14:** Consider a measurement of the period of a pendulum using a stopwatch that can be read to 0.01 s, so that the instrumental precision is 0.005 s (half the smallest digit). If the standard deviation for repeated measurements of the period of this pendulum is around 0.1 s, roughly how many repeated measurements of the pendulum period would be needed to reduce the uncertainty of the mean to $1/2$ of the instrumental precision of 0.005 s (i.e., as small as 0.0025 s)?

# Exercise 2: Distinguishability

In this tutorial we will continue to apply the concepts of mean, standard deviation, and standard uncertainty of the mean. We will also learn how to compare the means of different data sets and test for distinguishability.

In the uncertainty tutorial we talked about taking repeated measurements of a phenomenon and expecting to see a distribution of values around the mean. On average, we would expect 68\% of our measurements to fall within one standard deviation of the mean. If we take many such sets of measurements, we would expect the mean of each set to fall within one uncertainty of the mean 68\% of the time.

Often we need to determine whether the means of two data sets of measurements are coming from the same underlying physical phenomenon or whether we are measuring two different phenomena. Even when we are measuring the same phenomenon through two different data sets, we would still expect to see some difference in the means of the two data sets.
For example, if we take several measurements of the period of a pendulum when released from 10 degrees and several more measurements of the period at 20 degrees, we need a method to know whether the average period is the same (and therefore we are measuring the same physical phenomenon) or different when released from the two angles. We are going to call this distinguishability.

To make comparisons between two measurements with uncertainties $A \pm \delta A$ and $B \pm \delta B$ (such as the means of two sets of repeated measurements with their uncertainties in the means), we use a quantity known as $t^\prime$ (pronounced "t-prime"). This is defined as:

$t^\prime = \left|\frac{A-B}{\sqrt{\delta A^2 + \delta B^2}}\right|$

$A$ and $B$ might be the means of two sets of repeated measurements, such as our means of repeated measurements of the period of pendula released from 10 and 20 degrees. $\delta$ should be read as "uncertainty in", so $\delta A$ and $\delta B$ are the uncertainties in the measurements $A$ and $B$ (*not* the uncertainty times the measurements).

**Question 1:** Explain how the formula for $t^\prime$ takes into account each of these three desired properties:

**a)** Takes into account the difference between the means of the two data sets, such that a larger difference between the means implies the values are more distinguishable.
**b)** Takes into account the uncertainty of the means of the two data sets, such that the difference in the means is balanced by how well we know the measurements.
**c)** Does not depend on units (so values of $t^\prime$ can be interpreted in a standard way).

**Question 2:** Say we measure $A = 2 \pm 1$ s and $B = 8 \pm 1$ s.

**a)** Intuitively, would you say that the measurements $A$ and $B$ are distinguishable or indistinguishable?
**b)** What about if the uncertainties were larger: are $C = 2 \pm 5$ s and $D = 8 \pm 5$ s distinguishable or indistinguishable?
**c)** Calculate the $t^\prime$ value between these two sets of measurements (i.e., between $A$ and $B$ and between $C$ and $D$).

You can think of $t^\prime$ as telling us how far apart the means of the distributions are in "units" of uncertainty. The larger the $t^\prime$ value, the less likely it is that the two sets of measurements came from the same distribution (i.e., are measuring the same physical phenomenon).

**Question 3:** Plot two Gaussian distributions with means of $\bar{x}_1 = 0.05$ and $\bar{x}_2 = 0.10$ and standard deviations of $\sigma_1 = 0.25$ and $\sigma_2 = 0.25$.

**a)** Calculate the $t^\prime$ assuming both measurements have $N=20$ trials. Change the values of $\bar{x}_2$, $\sigma_1$, $\sigma_2$, and $N$, and calculate the $t^\prime$ between the measurements each time.
**b)** What combination of factors causes $t^\prime$ to increase?
**c)** What causes it to decrease?
**d)** Does the difference in the means have to change for $t^\prime$ to change?
**e)** What combinations of values produce a $t^\prime \approx 1$?

If we have two sets of measurements of the same physical phenomenon, we would expect to see a $t^\prime$ value of approximately 1 on average. This is because, on average, most (68\%) of our measurements should be within one standard deviation of the mean, and so any two measurements should almost always be approximately one standard deviation away from each other.

After calculating $t^\prime$ for two measurements, you can evaluate their dissimilarity (or distinguishability) through the following interpretation:

a) $t^\prime\approx1$: If we have two sets of measurements and a $t^\prime$ value of approximately 1, then the sets are indistinguishable and they may represent the same physical phenomenon.
b) $t^\prime<<1$: If we have a $t^\prime$ value much less than 1, then it is possible either that we overestimated our uncertainties or that our current level of precision is not good enough for the phenomenon that we are trying to measure.

c) $1\lesssim t^\prime<3$: This is a grey area. It is still possible that our two sets of measurements are coming from the same phenomenon, but it is less likely than if our $t^\prime$ were somewhere close to 1.

d) $t^\prime >3$: If our $t^\prime$ is greater than 3, then it is unlikely that our two sets of measurements are measuring the same phenomenon. This means that we have likely distinguished between two physical phenomena.

NOTE: $|t^\prime| \le 1$ **does not** mean that A and B are the same. It only tells us that the given data cannot distinguish between the two sets. For example, if you do a better measurement and decrease the uncertainties, you might later uncover a difference between A and B. That is, poor precision may be hiding a subtle difference!

**Question 4:** Based on these interpretations and your exploration with the distributions above, what do you think we should do next in each of these 4 scenarios?

**a)** $t^\prime\approx1$
**b)** $t^\prime<<1$
**c)** $1\lesssim t^\prime<3$
**d)** $t^\prime >3$

***Example Problem***

Let's try an example problem. An experimenter measured the period of a pendulum at 10 degrees and 20 degrees using a simple stopwatch. They measured the time for the pendulum to swing for one single period and conducted 14 trials for each angle. The data they collected are given below. Generate a histogram for each of these data sets.

Period at 10 degrees (in s): 1.23, 1.36, 1.35, 1.36, 1.30, 1.27, 1.30, 1.32, 1.26, 1.26, 1.38, 1.29, 1.29, 1.32

Period at 20 degrees (in s): 1.37, 1.35, 1.38, 1.27, 1.33, 1.33, 1.26, 1.36, 1.36, 1.27, 1.29, 1.44, 1.29, 1.32

**Question 5:** Calculate the standard deviation of i) the 10 degree data set and ii) the 20 degree data set.
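To get started on the example problem, the data can be transcribed into arrays and histogrammed; a sketch assuming NumPy and Matplotlib, with a small helper for the $t^\prime$ formula above included for the later comparison:

```python
import numpy as np
import matplotlib.pyplot as plt

period_10 = np.array([1.23, 1.36, 1.35, 1.36, 1.30, 1.27, 1.30,
                      1.32, 1.26, 1.26, 1.38, 1.29, 1.29, 1.32])
period_20 = np.array([1.37, 1.35, 1.38, 1.27, 1.33, 1.33, 1.26,
                      1.36, 1.36, 1.27, 1.29, 1.44, 1.29, 1.32])

fig, axes = plt.subplots(1, 2, sharex=True, sharey=True)
for ax, data, label in zip(axes, (period_10, period_20),
                           ("10 degrees", "20 degrees")):
    ax.hist(data, bins=6)
    ax.set_title(label)
    ax.set_xlabel("Period (s)")
plt.show()

def t_prime(a, da, b, db):
    """Distinguishability of two measurements A +/- dA and B +/- dB."""
    return abs(a - b) / np.sqrt(da**2 + db**2)
```

For instance, `t_prime(10.2, 0.3, 9.8, 0.4)` evaluates to about 0.8 (a difference of 0.4 against a combined uncertainty of 0.5).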
Now we will shift to asking some questions about the experimenter's data set based on what you learned in class and in the statistics reading.

**Question 6:** Based on the data, what is the uncertainty in the measurements of the period from instrumental precision? Hint: look at the reported data values. Notice that all the data report the same number of digits after the decimal point. What does that say about the precision of the timer?

**Question 7:** Calculate the uncertainty in a single trial of each set of measurements of the period of this pendulum from: i) ten degrees and ii) twenty degrees. Hint: you will need to calculate either standard deviation, standard uncertainty of the mean, or $t^\prime$.

**Question 8:** Calculate the mean of each set of measurements of the period of this pendulum, including its uncertainty, from: i) ten degrees and ii) twenty degrees.

**Question 9:** Quantitatively compare the measurements of the period of this pendulum when released from 10 degrees and 20 degrees.

**Question 10:** Interpret the results of your quantitative comparison in the previous question. Propose at least two next steps the experimenter could reasonably take based on these results.

Interpretation:
Next step 1:
Next step 2:

# Exercise 3: Ethics

In 2002, a scandal rocked the physics community. A well-regarded scientist at Bell Labs was found to have falsified data in many papers that would have revolutionized the field of nanotechnology. Some of the data included in the papers appeared to have been generated by computer functions and then passed off as experimental results. This, combined with another scandal involving falsified data in the claimed creation of element 118, caused the American Physical Society to re-examine its ethical standards. Falsifying data is obviously unethical and is a major breach of trust in the scientific community. Still, roughly 2% of scientists admit to having falsified data or research.
However, this is not the only way that scientists can engage in unethical behavior.

**Question 1:** Give 3 other examples of unethical behavior that scientists can knowingly engage in.

The speed of light is a very important quantity in physics. Below is a graph of different measurements of the speed of light over time.

![](images/DataAnalysis/speedoflight.png "")

You may notice a clustering effect. There seemed to be some consensus between 1880-1900. Then the values jumped. Again, there seemed to be growing consensus between 1920-1945. Then the values jumped again, once more outside of the previous uncertainties.

**Question 2:** Assuming that the scientists making these measurements were honest, what do you think could have happened?

The focus of this assignment and this course is not going to be explicit unethical behavior. If you are tempted… don't. Remember, we are never trying to reproduce a specific result. We are interested in teaching you skills to pose and investigate scientific questions and to accurately describe what you found.

During your previous lab, we presented you with a model in which the period of a pendulum is independent of the oscillation amplitude. You were asked to perform experiments to test this model. Many students take this as a challenge to reproduce the result that "confirms" their hypothesis that the model is descriptive. In doing so, they may have missed an important point. The calculation that gives us this model is based on the small angle approximation. In order to derive the result we gave you, we simplified our problem by saying that $\sin(\theta) = \theta$. This is exactly true at $\theta = 0$, but the approximation gets less and less valid as $\theta$ moves away from zero. This causes a small difference in the period of the pendulum at different angles. As your experimental precision got better, you may have been able to see this effect.

You do not have to respond here (though you may if you wish), but mentally answer the questions:

a.
Were you looking for evidence to disconfirm your hypothesis?

b. Was it easier to believe that something was wrong with your experiment, or that the given model needed improvement?

c. What evidence would have convinced you that your model needed improvement?

The first lab was designed to get you thinking about these issues.

**Question 3:** Is there anything about your expectations, the way the lab was presented, or your prior experience that you think made you more likely to look for confirming rather than disconfirming evidence? What?

Here is an optional but quite entertaining YouTube video that shows you this effect: https://www.youtube.com/watch?v=vKA4w2O61Xo

People frequently reach conclusions based on intuition and then look for evidence to support or rationalize their intuition. This can be seen in physics problems and in many other examples, ranging from the justice system to the stock market. This can cause scientists to discount evidence that does not support their hypothesis, leading to confirmation bias.

**Question 4:** Give three real-life examples in which you may be implicitly motivated to look for evidence that confirms your hypothesis and models.

**Question 5:** Give a real-life example in which you think you are not implicitly motivated to look for evidence to confirm your intuition.

We can never eliminate biases. Just knowing that biases exist does not make you immune. However, we can employ strategies that uncover biases and mitigate their effects.

**Question 6:** Give three strategies scientists (on an individual basis) can use to reduce the effects of their biases.

**Question 7:** Give three strategies the scientific community as a whole can use to reduce the effects of biases on the work that is published.

One of the common ways that the scientific community tries to mitigate bias (and also ensure that scientific work is original and of good quality) is peer review. However, even peer review can have issues.
For example, reviewers often show a bias towards well-known authors and institutions. In addition, they may show confirmation bias, bias against an author's identity, and so on.

**Question 8:** Give 3 ways the scientific community can reduce bias in the peer review process.

Let's turn back to the experiment from the first lab. We told you that a simplification was made to arrive at the model that the period of a pendulum does not depend on the angle. A common first response students have on hearing this is that a good way to avoid bias is to stop making assumptions and simplifications. However, while the pendulum may appear to be a very simple system, any model we come up with will have inherent assumptions and simplifications built in.

**Question 9:** Why must we have assumptions and simplifications? Give an example.

It is important to note that one can never escape simplifications and assumptions, only mitigate their effects.

**Question 10:** Give three strategies you can use to mitigate the effects of assumptions and simplifications you know you are making.

**Question 11:** Give three strategies you can use to discover assumptions and simplifications that you did not know you were making.

Finally, we want to learn to accurately describe the findings of our experiments. For the first lab, students may have done their initial experiment timing the period at 10 and 20 degrees and found a $t^\prime$ value of 0.9 when they had uncertainties of 0.1 seconds. They repeated the experiment, reduced the uncertainties down to 0.04 seconds each, and got a $t^\prime$ value of 0.25. Therefore, they came to the conclusion: "Since the $t^\prime$ value was less than 1, the periods of the pendulum at 10 and 20 degrees are indistinguishable. When we reduced the uncertainties, $t^\prime$ got smaller.
Therefore, we have definitively proven that the period of a pendulum does not depend on the angle."

**Question 12:** Can you prove that the period of a pendulum does not depend on the angle with any experiment?

Upon talking with their peers, the students were less sure of their conclusion. They decided to re-phrase it to say: "Since the $t^\prime$ value was less than 1, the periods of the pendulum at 10 and 20 degrees are indistinguishable at the level of precision we were able to measure. The model describes our pendulum to within our uncertainties, and if we want to find a difference between the model and the pendulum, we would need to make our uncertainties smaller."

**Question 13:** Which conclusion do you think is more accurate? Why? What is the importance of qualifying our statements when reporting our results?

**Question 14:** How can you apply what you learned in this tutorial to what you want to do in your chosen field?

# Exercise 4: Fitting 1 - Manual Fitting and Metrics

Oftentimes, we will want to figure out the relationship between two variables $x$ and $y$ as a function: $f(x)=y$. The most common question will be whether the relationship between $x$ and $y$ is linear; in this case, we also need to figure out what the slope and intercept of that line should be.

Let's say we have some data, which we want to plot as $y$ vs. $x$ to find out whether the relationship between them is linear. Below we have a graph where the data are the blue dots and the solid red and dotted green lines show two attempts to fit the data.

![](images/DataAnalysis/H3_Fig1.png "")

**Question 1:** Which one of the two lines above seems like a better fit to the data? Please explain your reasoning.

Most of the time, the difference between possible fit lines will be a bit more subtle. In these cases, we want to come up with a way to make our goodness-of-fit assessment quantitative instead of qualitative.
To do that, we are going to use our data and our function to come up with a number or "score" that tells us whether our fit is good or bad. When the score is small, our fit is good; when the score is large, our fit is bad. We want to take several things into account:

a) The score should increase as the points get further away from the function, by our definition above.

b) We want points with smaller uncertainties to "count" more towards the score; if the function is far away from a point with a small uncertainty, our fit is worse than if the function is far away from a point that has a large uncertainty.

c) Our score should not depend on units. That is, we want the score to be dimensionless so we can have a standard way of interpreting a "good" or "bad" fit, regardless of the units in our data.

d) Our score should not change as we add more points that are similar to the ones we already have. That is, we want a standard way of interpreting a "good" or "bad" fit, regardless of the number of data points.

While there are many ways to assess how well a curve fits data, the method that we will use here is called chi-squared ("chi" is pronounced "kai", rhyming with "sky"):

$$\chi^2=\frac{1}{N} \sum_{i=1}^N \frac{(f(x_i)-y_i)^2}{\delta y_i^2}$$

where we have data points $(x_1, y_1) ... (x_N, y_N)$ with associated uncertainties $\delta y_i$, and $f(x_i)$ is the function we are fitting evaluated at $x_i$. In the graph above, the red and green lines are examples of possible functions $f(x)$.

**Question 2:** Explain how the formula for $\chi^2$ fulfills the four requirements above.

**Question 3:** Compare the equation for $\chi^2$ to the equation for $t^{\prime}$ from the distinguishability tutorial. In what ways are the equations similar and in what ways are they different?

**Question 4:** What might a small $\chi^2$ value mean? What should count as "small"?

**Question 5:** What might a large $\chi^2$ mean? What should count as "large"?
Let's take a look at fitting a line to some data. The table shows data from an experiment in which a spring is stretched and the spring's force ($y$) is measured at certain stretching distances ($x$).

| $x$ (Stretching distance) | $y$ (Force) | $\delta y$ (Uncertainty in Force) |
| ------------------------- | ----------- | --------------------------------- |
| 1.00 | 1.36 | 1 |
| 1.78 | 3.36 | 1 |
| 2.56 | 3.92 | 1 |
| 3.33 | 4.11 | 1 |
| 4.11 | 3.43 | 4 |
| 4.89 | 5.22 | 1 |
| 5.67 | 8.29 | 1 |
| 6.44 | 8.22 | 1 |
| 7.22 | 11.15 | 1 |
| 8.00 | 10.86 | 1 |

The first graph shows the data points (black dots with uncertainty bars) and a possible fit line (in blue). The second graph is a *residuals* plot, which shows the difference between $f(x_i)$ (the value of our function at $x_i$) and $y_i$ (the measured data at $x_i$) at each $x_i$.

![](images/DataAnalysis/H3_Fig2.png "")

Use the data in the table to generate the same two plots for the straight-line fit function $f(x) = x$ (that is, the slope is equal to 1.00 and the intercept is equal to 0.00).

**Question 6:** Calculate the $\chi^2$ value for this fit.

**Question 7:** Try picking different values for the slope and intercept. What happens to the $\chi^2$ value as you change the values of the slope and intercept? (You may want to look at corresponding changes on the graph to explain what is going on.)

**Question 8:** What values of the slope and intercept give the smallest value of $\chi^2$ (to the first decimal place)? What is the corresponding $\chi^2$ value?

In general, we are looking for a fit function, $f(x)$, that gives us the smallest possible $\chi^2$ value. That is, the best fit line should be the one that minimizes the (squared) distances between the points and the line.

This brings us to another question: what if our $\chi^2$ is really small? (This is a rhetorical question.) Attempt the same exercise as above, except with all of the uncertainties set to 5.
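One way to carry out that check in code, assuming NumPy (the data arrays are transcribed from the table; the two trial lines below are arbitrary illustrative choices):

```python
import numpy as np

# Data transcribed from the table, with every uncertainty set to 5
x = np.array([1.00, 1.78, 2.56, 3.33, 4.11, 4.89, 5.67, 6.44, 7.22, 8.00])
y = np.array([1.36, 3.36, 3.92, 4.11, 3.43, 5.22, 8.29, 8.22, 11.15, 10.86])
dy = np.full_like(y, 5.0)

def chi_squared(f, x, y, dy):
    """Weighted chi-squared of fit function f against data (x, y +/- dy)."""
    return np.mean(((f(x) - y) / dy) ** 2)

# Two rather different trial lines, both scored against the same data
for slope, intercept in ((2.2, -4.6), (0.6, 3.2)):
    score = chi_squared(lambda t: slope * t + intercept, x, y, dy)
    print(slope, intercept, score)
```

Swapping `dy` back to the table's original uncertainties lets you reuse the same helper for Questions 6-8.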
Now you should have a large range of values for which the $\chi^2$ value is quite small. For example, a fit with intercept = -4.6 and slope = 2.2 and one with intercept = 3.2 and slope = 0.6 both give $\chi^2$ values around 0.2.

**Question 9:** How confident are you that either of these sets of fit parameters is a good representation of the underlying phenomena? Do you trust them?

If your $\chi^2$ is too small (e.g., $\chi^2 <<1$), you may have overestimated your uncertainties. That is, your fit is telling you that you measured these data points much more precisely than you thought! Uncertainty overestimation is a problem because it makes it hard to identify which of the lines that appear to be a good fit actually reflect the underlying physics.

**Question 10:** What do you think you should do if you obtain a very small $\chi^2$ value?

A $\chi^2$ value larger than 9 is considered a very poor fit for the data. (Why 9?) For $\chi^2$, there are a few possible outcomes:

a) $\chi^2\approx1$
b) $\chi^2<<1$
c) $1\lesssim\chi^2<9$
d) $\chi^2 >9$

**Question 11:** Write down different interpretations for what each of these $\chi^2$ values could mean, and what you should do in each case. *Hint: refer to the interpretations of values of $t^\prime$ from the distinguishability tutorial.*

You should never manipulate your uncertainties to obtain a specific $\chi^2$ value. Your uncertainties should always reflect your real measurements.

Now let's further investigate the graph called "Residuals". This is a graph of $f(x_i)-y_i$, the difference between what our fit predicts and what we actually measured during the experiment. The x-axis is the same as in the "force vs. extension" graph, but the y-axis is the vertical distance between the line and the points.

**Question 12:** Given how you expect points to be distributed around a line of best fit, what do you expect to see in your residuals graph if $f(x_i)$ is a good fit?
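A residuals plot for a trial line can be produced from the same table of data; a sketch assuming NumPy and Matplotlib (the slope and intercept here are arbitrary trial values, not the best fit):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.00, 1.78, 2.56, 3.33, 4.11, 4.89, 5.67, 6.44, 7.22, 8.00])
y = np.array([1.36, 3.36, 3.92, 4.11, 3.43, 5.22, 8.29, 8.22, 11.15, 10.86])
dy = np.array([1, 1, 1, 1, 4, 1, 1, 1, 1, 1], dtype=float)

def f(t):
    # Trial fit line; the slope and intercept are yours to vary
    return 1.4 * t + 0.0

residuals = f(x) - y
plt.errorbar(x, residuals, yerr=dy, fmt="o")
plt.axhline(0.0, color="gray")
plt.xlabel("$x$ (stretching distance)")
plt.ylabel("$f(x_i) - y_i$")
plt.title("Residuals")
plt.show()
```

Changing the slope and intercept inside `f` and re-plotting is a quick way to see how the pattern of residuals responds.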
Looking at the residuals graph is a good way to tell whether you are trying to fit the right kind of function. The $\chi^2$ value does not necessarily tell the whole story.

**Question 13:** Return to the data in the original table above (i.e., with the smaller uncertainties) and fit a line with slope $=5$, intercept $=-18$. Qualitatively describe the shape of the residuals graph.

While the $\chi^2$ value tells us that this fit is bad (large $\chi^2$), the residuals graph can give us an idea about *why*. In this case, the residuals show a trend: the first half of the data points are systematically above the line and the second half are systematically below it. This clearly suggests that you should change the slope of the line! Consider the example in the figure below, where the residuals show an upside-down "v".

![](images/DataAnalysis/H3_Fig3_copy.png "")

**Question 14:** What do you think this shape of residuals might suggest about the fit? How might you change the function to get a better fit?

# Exercise 5: Fitting 2 - Automatic Fitting and Uncertainties in Fit Parameters

Fitting data by hand can be fun, but scientists rarely do this in real life (anymore). Before, we were looking for the values of the slope and intercept that gave us the smallest $\chi^2$ for the data. This sounds like a minimization problem! For linear fits, we can use simple derivatives to find the values of the slope, $m$, and intercept, $b$, that minimize $\chi^2$. For a derivation of these expressions, see J.R. Taylor, *An Introduction to Error Analysis*, Section 8.2.
This derivation will give you:

$m = \frac{\sum_i^N{\frac{1}{\delta y_i^2}}\sum_i^N{\frac{x_iy_i}{\delta y_i^2}}-\sum_i^N{\frac{x_i}{\delta y_i^2}}\sum_i^N{\frac{y_i}{\delta y_i^2}}}{\Delta}$

$b = \frac{\sum_i^N{\frac{x_i^2}{\delta y_i^2}}\sum_i^N{\frac{y_i}{\delta y_i^2}} - \sum_i^N{\frac{x_i}{\delta y_i^2}}\sum_i^N{\frac{x_iy_i}{\delta y_i^2}}}{\Delta}$

where

$\Delta = \sum_i^N{\frac{1}{\delta y_i^2}} \sum_i^N{\frac{x_i^2}{\delta y_i^2}}-\left(\sum_i^N{\frac{x_i}{\delta y_i^2}}\right)^2$

In these expressions, $x_i$ and $y_i$ are the individually measured values, with uncertainty $\delta y_i$ in each $y_i$. You can also derive the uncertainties in the fit parameters:

$\delta m = \sqrt{\frac{\sum_i^N{\frac{1}{\delta y_i^2}}}{\Delta}}$

$\delta b = \sqrt{\frac{\sum_i^N{\frac{x_i^2}{\delta y_i^2}}}{\Delta}}$

As an example, let's say you did an experiment where you stretched a rubber band to 6 different extensions, 5 times each. So you have measurements that look like:

| Extension (cm) | Force Trial 1 (N) | Trial 2 (N) | Trial 3 (N) | Trial 4 (N) | Trial 5 (N) |
| -------------- | ----------------- | ----------- | ----------- | ----------- | ----------- |
| 1.0 | 1.03 | 1.147 | 0.934 | 1.049 | 0.924 |
| 2.0 | 1.81 | 2.178 | 2.127 | 2.005 | 1.963 |
| 3.0 | 3.265 | 3.107 | 3.499 | 3.135 | 2.889 |
| 4.0 | 3.7 | 3.983 | 4.003 | 4.07 | 4.055 |
| 5.0 | 5.041 | 4.892 | 4.949 | 5.055 | 4.955 |
| 6.0 | 5.896 | 6.366 | 5.89 | 6.136 | 6.08 |

You want to figure out how well Hooke's law describes the rubber band for the data that you took. In order to do that, you are going to plot the data and fit a line to the points, just like in the manual fit, but this time you'll use the minimizing expressions to find the best fit line immediately.

**Question 1:** Use the data in the table to generate a set of $x$ and $y$ values with uncertainties $\delta y$ that give you the average force with its uncertainty for each extension.
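The closed-form expressions above translate directly into NumPy. This is only a sketch of the idea behind a function like `autoFit`; the actual `autoFit` provided with these tutorials may differ in interface and details.

```python
import numpy as np

def weighted_linear_fit(x, y, dy):
    """Weighted least-squares fit of y = m*x + b using the closed-form
    expressions (Taylor, Sec. 8.2). Returns m, b and uncertainties dm, db."""
    w = 1.0 / dy**2                                   # weights 1/(dy_i)^2
    Delta = np.sum(w) * np.sum(w * x**2) - np.sum(w * x)**2
    m = (np.sum(w) * np.sum(w * x * y) - np.sum(w * x) * np.sum(w * y)) / Delta
    b = (np.sum(w * x**2) * np.sum(w * y) - np.sum(w * x) * np.sum(w * x * y)) / Delta
    dm = np.sqrt(np.sum(w) / Delta)
    db = np.sqrt(np.sum(w * x**2) / Delta)
    return m, b, dm, db

# Quick sanity check with exact data: y = 2x should give slope 2, intercept 0
m, b, dm, db = weighted_linear_fit(np.array([1.0, 2.0, 3.0]),
                                   np.array([2.0, 4.0, 6.0]),
                                   np.array([1.0, 1.0, 1.0]))
print(m, b)  # → 2.0 0.0
```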
**Question 2:** Use the minimizing fit expressions above to find the values of $m$ and $b$ (with their uncertainties) that minimize the $\chi^2$ value between the fit line and the data.

**Question 3:** What is the $\chi^2$ value for the best fit slope and intercept?

**Question 4:** If you haven't already, plot a graph of the data and the best-fit line, as well as a graph of the residuals. From the graphs, does it appear that Hooke's law describes your rubber band so far?

**Question 5:** How much can you change the value of the slope from the best-fitting value before the $\chi^2$ value is no longer "small"? How much can you change the value of the intercept before the $\chi^2$ value is no longer "small"? How do these ranges compare to the uncertainties in the best fitting values of $m$ and $b$?

**Question 6:** Adjust the data so that the uncertainties are all very large (say, equal to 1.0). What happens to $\chi^2$ when you increase the uncertainties on the points?

**Question 7:** What happens to the uncertainties in the slope and intercept when you increase the uncertainties on the points?

Let's say you keep taking data for the rubber band at three more extensions.

| Extension (cm) | Force Trial 1 (N) | Trial 2 (N) | Trial 3 (N) | Trial 4 (N) | Trial 5 (N) |
| -------------- | ----------------- | ----------- | ----------- | ----------- | ----------- |
| 7.0 | 7.064 | 7.03 | 7.087 | 7.073 | 7.06 |
| 8.0 | 9.841 | 9.676 | 9.837 | 9.946 | 9.86 |
| 9.0 | 13.019 | 12.861 | 12.932 | 13.332 | 13.071 |

# Exercise 6: Analysis Practice

The questions in this activity refer to the data in the table below, which we will assume were collected previously from an experiment where the researchers expected that the relationship between $x$ and $y$ was modeled as $y=8x-5$.
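Several questions in this exercise use $t^\prime$. Assuming the definition from the distinguishability tutorial, $t^\prime = |A - B|/\sqrt{\delta A^2 + \delta B^2}$ for two values $A \pm \delta A$ and $B \pm \delta B$, a minimal helper might look like:

```python
import numpy as np

def t_prime(a, da, b, db):
    """Distinguishability of two values a +/- da and b +/- db."""
    return abs(a - b) / np.sqrt(da**2 + db**2)

# Hypothetical example values (not from the table below):
print(t_prime(5.2, 0.3, 4.6, 0.4))  # about 1.2
```

Compare the result to the interpretation thresholds given in the distinguishability tutorial.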
| x | y | dy |
| ---- | ---- | ---- |
| 1 | 3 | 1 |
| 2 | 6 | 3 |
| 3 | 19 | 4 |
| 4 | 33 | 10 |
| 5 | 42 | 10 |
| 6 | 47 | 4 |
| 7 | 59 | 3 |
| 8 | 72 | 1 |
| 9 | 74 | 7 |
| 10 | 80 | 2 |

**Question 1:** Using whatever software or technology you're comfortable with (e.g., calculator, Excel), determine whether the measurements of $y$ at $x = 1$ and $x = 2$ are distinguishable using a $t^\prime$.

**Question 2:** Generate a plot of the data in the table ($y$ versus $x$, with the uncertainties in $y$). From the plot, but without fitting a line, we can qualitatively see that the relationship between $x$ and $y$ appears fairly linear.

We'll explore two ways to check how well the data fit the function $f(x)=8x-5$. (Please refer to the Theory section for more information about least-squares fitting.)

**Question 3:** The first way is to evaluate how well the expected line fits the data. What is the $\chi^2$ value between the data and the fit line $f(x)=8x-5$? What does this say about how well the line $f(x)=8x-5$ fits the data?

**Question 4:** The second way is to find the best fitting line to the data and compare the best-fitting parameters to the proposed fit parameters. What is the best-fitting line to the data? Using the $\chi^2$ value, the plot, and the residuals graph, how well do you think the line fits the data?

**Question 5:** Use the uncertainties in the fit parameters to compare each fit parameter to the predicted values (slope $= 8$ and intercept $= -5$) using a $t^\prime$. *Hint: What are the uncertainties on your model when no measurement is involved?*

**Question 6:** Given the researchers' expectation for the relationship, what are three reasonable things the researchers could do next?

# Exercise 7: Linearization

There are multiple ways to analyze non-linear data, but many of them require us to have some sense of the form of the relationship.
Whenever possible, it is much easier to transform our data so that we can plot it as a straight line and use our linear fitting techniques. This is called "linearizing" our data. Linearization involves creating two types of plots using the **natural logarithm**: semi-log and log-log plots. Semi-log plots have the y-axis transformed to the natural logarithm (plotting $\ln y$ vs. $x$), while log-log plots have both the x- and y-axes transformed to their natural logarithms (plotting $\ln y$ vs. $\ln x$).

**Question 1:** Why are semi-log and log-log graphs useful for distinguishing power law from exponential relationships? *Hint: which type of graph makes each type of function linear?*

**Question 2:** For each graph type that linearizes a function (power law or exponential), how would the slope and intercept relate to the constants $A$ and $B$ from above? *Hint: write out what $\ln\left(f\left(x\right)\right)$ would be for each type of function and map it onto the equation for a straight line.*

For the next few questions, use the data in the table below:

| x | y | dy |
| ---- | ---- | ---- |
| 1 | 3 | 2 |
| 2 | 9 | 4 |
| 3 | 21 | 2 |
| 4 | 33 | 4 |
| 5 | 50 | 2 |
| 6 | 77 | 4 |
| 7 | 103 | 2 |
| 8 | 130 | 4 |
| 9 | 166 | 2 |
| 10 | 205 | 4 |

**Question 3:** Create three graphs: one with linear scales ($y$ vs. $x$), one with semi-log scales ($\ln y$ vs. $x$), and one with log-log scales ($\ln y$ vs. $\ln x$). Examine the three graphs qualitatively. Do you think $x$ and $y$ are related linearly, exponentially, or according to a power law? Please explain your reasoning.

**Question 4:** Using least-squares fitting, find the best fitting straight line to the data in each of the three graphs ($y$ vs. $x$, $\ln y$ vs. $x$, and $\ln y$ vs. $\ln x$). Examine the three fits quantitatively. Do you think $x$ and $y$ are related linearly, exponentially, or according to a power law? Please explain your reasoning.
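One way to set up the semi-log and log-log data sets in code is sketched below. The transformed uncertainty $\delta(\ln y) = \delta y / y$ is justified in Exercise 8; the arrays are transcribed from the table above.

```python
import numpy as np

x  = np.arange(1, 11, dtype=float)   # 1, 2, ..., 10
y  = np.array([3, 9, 21, 33, 50, 77, 103, 130, 166, 205], dtype=float)
dy = np.array([2, 4, 2, 4, 2, 4, 2, 4, 2, 4], dtype=float)

ln_x   = np.log(x)   # x-values for the log-log plot
ln_y   = np.log(y)   # y-values for both semi-log and log-log plots
d_ln_y = dy / y      # propagated uncertainty in ln y (see Exercise 8)
```

Each transformed data set can then be passed to the same plotting and weighted-fitting tools used in the earlier exercises.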
**Question 5:** Use the information from the best fit to draw a preliminary conclusion about an approximate relationship between $x$ and $y$, using the graphs to estimate numerical values for any relevant constants.

**Question 6:** Now plot the best fit function $f$ along with the data on the linear scale. Does it look like the function matches?

**Question 7:** Summarize, in your own words, why linearizing through log-log and semi-log plots is helpful for identifying non-linear relationships.

# Exercise 8: Uncertainty propagation

In the analysis activities so far, we've only needed to worry about the uncertainty in the raw measurements themselves. Now that we are linearizing our data, however, we need to think about what happens to the uncertainty when we manipulate a variable.

**Question 1:** Take a standard ruler (or a digital ruler) and some blank lined paper. Measure the distance between the first and last line on the page with an estimate of the uncertainty (approximately half the smallest division on the ruler). Report that value here.

**Question 2:** Measure the distance between two consecutive lines with uncertainty and multiply it by the number of line spacings between the first and last line on the page to estimate that total distance.

**Question 3:** Which of the two measurements you made do you think is more precise, and why?

**Question 4:** In measuring the area of a square, you could calculate its area by measuring $x$, the length of one side, and using the formula $A=x^2$. Or, you could measure each side, $x$ and $y$, and find the area with $A=xy$. Which method of determining the area would have the smaller uncertainty, according to both your intuition and the actual expressions? How might we make sense of the result from the expressions?

**Question 5:** Show that the general rule for propagating uncertainties with derivatives reproduces the rule for propagating uncertainties through multiplication (as laid out in the Theory section).
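For Question 5 (and to check your algebra in the questions that follow), the general derivative rule $\delta R = \sqrt{\left(\frac{\partial f}{\partial x}\delta x\right)^2 + \left(\frac{\partial f}{\partial y}\delta y\right)^2}$ can be verified numerically. A sketch using central-difference estimates of the partial derivatives, with made-up example values:

```python
import numpy as np

def propagate(f, x, dx, y, dy, h=1e-6):
    """General uncertainty propagation dR = sqrt((df/dx*dx)^2 + (df/dy*dy)^2),
    with the partial derivatives estimated by central differences."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.sqrt((dfdx * dx)**2 + (dfdy * dy)**2)

# Check against the multiplication rule for R = x*y:
# dR = |x*y| * sqrt((dx/x)^2 + (dy/y)^2)
x, dx, y, dy = 3.0, 0.1, 4.0, 0.2
print(propagate(lambda u, v: u * v, x, dx, y, dy))
print(abs(x * y) * np.sqrt((dx / x)**2 + (dy / y)**2))
```

Both lines print the same uncertainty, illustrating that the derivative rule reproduces the multiplication rule.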
**Question 6:** As practice, propagate uncertainty through the following functions for the measurements $x\pm \delta x$ and $y\pm \delta y$. The capital letters in each case are constants (no uncertainty).

**a)** $R_1 = f\left(x,y\right)=Ax^2y$

**b)** $R_2 = f\left(x\right) = \ln x$

**c)** $R_3 = f\left(x,y\right)= C\left(\sin x\right) + y$

NOTE: The rule in *b)* is what you will need for propagating uncertainty through our linearization methods.

**Question 7:** For each of the functions in Q6, what does the uncertainty reduce to if the uncertainty in $x$ (i.e., $\delta x$) is very small (i.e., $\delta x \ll 1$)?

**Question 8:** Use your measurements from Q1 and Q2 to define two measurements $x \pm \delta x$ (from Q1) and $y \pm \delta y$ (from Q2).

**Question 9:** Find the uncertainty in $R$ in each of the following cases using your measurements from Q8.

**a)** $R = x+y$

**b)** $R = x-y$

**c)** $R = xy$

**d)** $R = x/y$

**e)** $R = x^n$ (use $n=2$, with no uncertainty)

**f)** $R= \ln x$

**Question 10:** In your own words, explain why propagating uncertainty is common and important in data analysis.

# Credits and Licensing

Rebeckah Fussell, Megan Renz, Philip Krasicky, Robert Fulbright, Jr., and Natasha Holmes, "Introductory data analysis with Jupyter Notebooks," Published in the PICUP Collection, July 2022, https://doi.org/10.1119/PICUP.Exercise.DataAnalysis.


The instructor materials are ©2022 Rebeckah Fussell, Megan Renz, Philip Krasicky, Robert Fulbright, Jr., and Natasha Holmes.

The exercises are released under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license
