planetwater

groundwater, geostatistics, environmental engineering, earth science

Regression 101 (SD line, statistics)

It is no secret that I work in geostatistics. More specifically, I try to use copulas to model fields of spatially distributed parameters by mathematically describing their dependence structure. I set out to shed some light on what this all means, or what I understand of it, by writing a series of blog posts on planetwater.org.

In this first post I am going to try to explain a very basic tool of traditional statistics: regression. In future posts I will explain more topics, with the goal of describing how copulas can work in spatial statistics. Today, I am going to use the classic Galton data-set of pairwise measurements of the heights of 1078 fathers and their sons. In the next post I will try to show some of these concepts using some current meteorological data. Granted, some aspects of the analysis are adapted from the classic book “Statistics” by Freedman.

Here’s what Wikipedia has to say about the Galton data-set:

Pearson was instrumental in the development of this theory. One of his classic data sets (originally collected by Galton) involves the regression of sons’ height upon that of their fathers’.

The data for this post was taken from this server at Berkeley; an analysis similar to the one presented here can be found in this pdf by Michael Wichura. A scatterplot of these pairs of measurements is shown in Figure 1, with the fathers’ heights along the x-axis and the corresponding sons’ heights along the y-axis.

Figure 1: Scatterplot of Galton's data-set

Figure 1 shows that the measured humans were around 68 inches tall, which is the range of heights where most dots in the scatterplot occur. The point cloud is not circular but elliptical, pointing toward the top right. Hence, on average, within the measured population, the taller the fathers are, the taller their sons are.

Fair enough: generally, tall fathers have tall sons. But are the sons taller than the fathers? If yes, then by how much? To answer this question, let’s look a bit closer, beginning with some basic measures of descriptive statistics and then moving on to some classic regression analysis results.

Descriptive Statistics

The results of a few basic data-analysis calculations are presented in Table 1. They confirm in a little more detail what we saw at first glance in Figure 1:

• fathers and sons actually are around 68 inches tall
• on average, sons tend to be a little (0.99 inches) taller than their fathers.

Table 1: Values of traditional descriptive statistics for the Galton data-set of fathers’ and their sons’ heights, in inches.

          mean    std. dev.   min     max
fathers   67.69   2.74        59.00   75.43
sons      68.68   2.81        58.51   78.36

Note that the range for sons (19.85 inches) is bigger than for fathers (16.43 inches).

Regression

For the upcoming analysis, let’s call the sons’ heights y and the fathers’ heights x. Figure 2 summarizes the key results. A classic linear regression line is shown in solid blue and solid green. Both lines are identical; the only difference is that the blue line is calculated by a built-in function in Mathematica, whereas the green line is calculated directly from the following formulas:

$$r = \frac{s_{xy}}{s_{x} s_{y}}, \qquad m = \frac{s_{xy}}{s_{x}^{2}}$$

where $$r$$ is the correlation coefficient, which has a value of 0.51, $$s_{xy}$$ is the covariance, and $$s_{x}$$ and $$s_{y}$$ are the standard deviations of x and y, respectively. The equation for the regression line is then given by:

$$y = 0.51 \cdot x + 33.89$$

Both the green and the blue line go through the point $$( \bar{x}, \bar{y} )$$. Their slope m can be calculated by combining the first two equations:

$$m = r \cdot \frac{s_{y}}{s_{x}}$$
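As a minimal sketch of the formulas above, the following Python snippet computes r, the slope m and the intercept from paired data using only the standard library. The function name `regression_from_moments` and the five toy data pairs are made up for illustration; they are not the Galton data and not the Mathematica code used for this post.

```python
import math

def regression_from_moments(x, y):
    """Return (r, slope, intercept) of the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and standard deviations (n - 1 in the denominator).
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (n - 1))
    s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (n - 1))
    r = s_xy / (s_x * s_y)
    slope = r * s_y / s_x                # algebraically equal to s_xy / s_x**2
    intercept = mean_y - slope * mean_x  # forces the line through the means
    return r, slope, intercept

# Hypothetical toy heights in inches, NOT the Galton data.
toy_fathers = [65.0, 67.0, 68.0, 70.0, 72.0]
toy_sons = [66.5, 67.0, 69.5, 69.0, 71.0]
r, m, b = regression_from_moments(toy_fathers, toy_sons)
print(f"r = {r:.3f}, y = {m:.3f} * x + {b:.3f}")
```

The intercept is not fitted separately: it follows from requiring the line to pass through the point of means, as stated above.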

Figure 2 shows a third solid line, coloured in red, which points in a similar direction as the green/blue line and also goes through $$( \bar{x}, \bar{y} )$$. However, the slope of the red line is very close to 1.0. It is calculated in the same way as the slope of the green/blue line, except that sign(r) is used instead of r, which gives $$\text{sign}(r) \cdot s_{y} / s_{x}$$. Since $$s_{x} \approx s_{y}$$ (Table 1 gives 2.81 / 2.74 ≈ 1.03), the slope is about 1.0. The red line is also called the “SD line” because it passes through the points that lie the same number of standard deviations away from the mean in both x and y.

Now, what really is the difference between the two lines? The fathers who are one standard deviation above the average fathers’ height are plotted on the orange dashed line. Similarly, the sons who are one standard deviation above the average sons’ height are plotted on the yellow dashed line. Both dashed lines intersect, as expected, on the red SD line. However, most of the points along the orange dashed line are below the yellow dashed line (see also Figure 3). In other words, most of the sons whose fathers were one standard deviation above the average fathers’ height were quite a bit shorter than one standard deviation above the average sons’ height. This is where the correlation r of ~0.5 comes into play. Associated with an increase of one standard deviation in fathers’ height is an increase of only 0.5 standard deviations in sons’ height, on average. That point is exactly where the green solid regression line intersects the orange dashed line. This also means that for r = 1, the SD line and the regression line would be identical.

Figure 2: Classical regression line and SD line shown on the scatterplot of Figure 1.

Let’s try to shed some more light on the relation between r, the regression line, and the SD line. Let’s pretend we are moving along the orange dashed line in Figure 2 and, while we’re moving, we count how many dots we pass and what the corresponding y values are. When we’re done, we plot the “collected” data as a histogram, shown in Figure 3. It turns out that about half the points we encounter are below 70.1, indicated by the green line. 70.1 is the value of the regression line at 1 SD to the right of $$\bar{x}$$. The intersection with the SD line occurs at ~71.5, which is $$(1 - r) \cdot s_{y} \approx 1.4$$ inches higher than 70.1.

Figure 3: Histogram parallel to the y-axis of Figure 2, along the line 1 SD to the right of the mean of the x-values (dashed orange line). The two marked locations are where the regression line (green) and the SD line (red) intersect the dashed orange line.
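The two values quoted for the dashed orange line, roughly 70.1 and 71.5, can be reproduced from the summary statistics alone. The sketch below uses only the numbers from Table 1 and the rounded correlation of 0.51 given in the text; the variable names are my own. The gap between the two lines at that position is (1 − r) standard deviations of y, which for r close to 0.5 is nearly the same number as r standard deviations.

```python
# Summary statistics from Table 1 and the correlation quoted in the text.
mean_x, s_x = 67.69, 2.74   # fathers
mean_y, s_y = 68.68, 2.81   # sons
r = 0.51

x_plus_1sd = mean_x + s_x          # position of the dashed orange line
y_regression = mean_y + r * s_y    # regression line at that x
y_sd_line = mean_y + s_y           # SD line at that x (r is positive)
gap = y_sd_line - y_regression     # equals (1 - r) * s_y

print(f"x = {x_plus_1sd:.2f}")
print(f"regression line: {y_regression:.1f}")
print(f"SD line:         {y_sd_line:.1f}")
print(f"gap:             {gap:.1f}")
```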

Conclusion

The regression line is a linear least squares fit to your data. As Freedman puts it: “Among all lines, the regression line for y on x makes the smallest root mean square error in predicting y from x”. For each change of one SD in the x direction, the regression line changes in the y direction by $$r \cdot s_{y}$$. The SD line can be used as a different summary statistic of the scatterplot.
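Freedman’s statement can be checked numerically. The sketch below is an illustration under stated assumptions: it draws synthetic, seeded data loosely shaped like Table 1 (this is NOT the Galton data), then compares the root mean square error of the regression line with that of the SD line. The helper `rmse` and all parameter choices are hypothetical.

```python
import math
import random

def rmse(x, y, slope, intercept):
    """Root mean square error of predicting y by slope * x + intercept."""
    return math.sqrt(sum((b - (slope * a + intercept)) ** 2
                         for a, b in zip(x, y)) / len(x))

# Synthetic, seeded data loosely shaped like Table 1; NOT the Galton data.
random.seed(0)
n = 1000
x = [random.gauss(67.69, 2.74) for _ in range(n)]
y = [68.68 + 0.5 * (a - 67.69) + random.gauss(0.0, 2.4) for a in x]

# Population-style moments (denominator n) keep the identities exact.
mean_x, mean_y = sum(x) / n, sum(y) / n
s_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
s_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
r = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n * s_x * s_y)

reg_slope = r * s_y / s_x
sd_slope = math.copysign(1.0, r) * s_y / s_x

rmse_reg = rmse(x, y, reg_slope, mean_y - reg_slope * mean_x)
rmse_sd = rmse(x, y, sd_slope, mean_y - sd_slope * mean_x)
print(f"RMSE regression line: {rmse_reg:.3f}")
print(f"RMSE SD line:         {rmse_sd:.3f}")
```

For the regression line the error equals sqrt(1 − r²) · s_y, which is never larger than the error of the SD line through the same point of means.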

It remains to point out

• that linear regression, as described here, is only appropriate for linear relationships!
• that you should always think very carefully first about whether the two variables you are using for regression are really related!
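The first point can be illustrated with a deliberately simple, made-up example: y depends completely on x, but through a parabola instead of a line. The correlation coefficient, and with it the slope of the regression line, comes out as zero even though the two variables are perfectly related. The function name and the data are hypothetical.

```python
import math

def correlation(x, y):
    """Correlation coefficient r of the paired values x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data: y is fully determined by x, but not linearly.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [a ** 2 for a in x]
print(correlation(x, y))
```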

In the next post, I will do a very similar analysis with some real meteorological data. You can access the Mathematica v.7 notebook that I used for this post here.

Written by Claus

September 30th, 2009 at 2:39 pm


• NASA found 99% pure water ice on Mars: http://bit.ly/HrvtD
• Google sightseeing: The Vajont Dam <http://bit.ly/2i8X6x>

Written by Claus

September 28th, 2009 at 10:00 am

Posted in identi.ca

Tipping Point Crossed for “Planetary Boundaries”

Twenty-eight scientists published the concept of a “safe operating space for humanity” in “Nature” two days ago. Here’s the description of what this operating space is, straight from their paper:

To meet the challenge of maintaining the Holocene state, we propose a framework based on ‘planetary boundaries’. These boundaries define the safe operating space for humanity with respect to the Earth system and are associated with the planet’s biophysical subsystems or processes. Although Earth’s complex systems sometimes respond smoothly to changing pressures, it seems that this will prove to be the exception rather than the rule. Many subsystems of Earth react in a nonlinear, often abrupt, way, and are particularly sensitive around threshold levels of certain key variables. If these thresholds are crossed, then important subsystems, such as a monsoon system, could shift into a new state, often with deleterious or potentially even disastrous consequences for humans.

The figure below is used to illustrate their concept:

The inner green shading represents the proposed safe operating space for nine planetary systems. The red wedges represent an estimate of the current position for each variable. The boundaries in three systems (rate of biodiversity loss, climate change and human interference with the nitrogen cycle), have already been exceeded.

It should be noted that their analysis is based on data, even though I haven’t found a clear description of how they calculated the distance from the tipping point. Here is a more detailed description from their paper:

Three of the Earth-system processes — climate change, rate of biodiversity loss and interference with the nitrogen cycle — have already transgressed their boundaries. [This transgression] cannot continue without significantly eroding the resilience of major components of Earth-system functioning. Here we describe these three processes.

Although the planetary boundaries are described in terms of individual quantities and separate processes, the boundaries are tightly coupled. We do not have the luxury of concentrating our efforts on any one of them in isolation from the others. If one boundary is transgressed, then other boundaries are also under serious risk. For instance, significant land-use changes in the Amazon could influence water resources as far away as Tibet. The climate-change boundary depends on staying on the safe side of the freshwater, land, aerosol, nitrogen–phosphorus, ocean and stratospheric boundaries. Transgressing the nitrogen–phosphorus boundary can erode the resilience of some marine ecosystems, potentially reducing their capacity to absorb CO2 and thus affecting the climate boundary.

It seems like a good idea to promote the notion that we have to take care of many tipping points at the same time. It seems even more important to stress non-linear and non-reversible behaviour. This is nothing new, but it is worth stressing such important things once in a while. If contaminant plumes were reversible, many of our subsurface remediation problems would be solved quite easily. However, there is dispersion, and hence a plume cannot be reversed. A similar example, related to the contamination of a lake, is given by Shahid Naeem, as quoted by Carl Zimmer:

A lake, for example, can absorb a fair amount of phosphorus from fertilizer runoff without any sign of change. ‘You add a little, not much happens. Add a little more, not much happens. Add a little… then, all of sudden, you add a little more and — boom! — phytoplankton bloom, oxygen depletion, fish die-off, smelliness. Remove the little phosphorus that caused the tipping of the system, and it does not reverse. In fact, you have to go back to much cleaner water than you would have imagined.

To conclude, it seems like a neat idea to establish indicators that tell us in which areas we are doing ok and in which areas we have exceeded the threshold. However, such a compartmentalized visualization seems to contradict the intention of the authors, who write that they had the coupling of the compartments in mind.

Where does this leave us on an operational level? Are those guys going to publish their indicator-levels every half year from now on, and then we can see the areas where we improved and where things got worse? Could we even narrow all human activities down to one indicator? If not, then why those seven? And how come we exceeded the outer limit of earth for “Biodiversity loss” while we’re only one step outside the green zone for climate change?

It remains to be noted that neither “Atmospheric aerosol loading” nor “Chemical pollution” has been quantified yet, and it is not clear why.

Further resources

A safe operating space for humanity. Johan Rockström, Will Steffen, Kevin Noone, Åsa Persson, F. Stuart Chapin, III, Eric F. Lambin, Timothy M. Lenton, Marten Scheffer, Carl Folke, Hans Joachim Schellnhuber, Björn Nykvist, Cynthia A. de Wit, Terry Hughes, Sander van der Leeuw, Henning Rodhe, Sverker Sörlin, Peter K. Snyder, Robert Costanza, Uno Svedin, Malin Falkenmark, Louise Karlberg, Robert W. Corell, Victoria J. Fabry, James Hansen, Brian Walker, Diana Liverman, Katherine Richardson, Paul Crutzen & Jonathan A. Foley. Nature 461, 472-475 (24 September 2009). doi:10.1038/461472a

Written by Claus

September 25th, 2009 at 4:32 pm



Videos of ESRI Conference

ESRI posted videos of their recent user conference. I couldn’t find a way to link to the individual videos, but the link goes to an overview page from which you should be able to find your way.

There are two presentations that I found very interesting:

• A presentation by FedEx on how they use GIS. Real time baby!
• A keynote by Willie Smits, initiator of the Masarang foundation, on how he uses GIS to fundamentally analyze the situation of oil palms on Borneo. Here is his workflow for dealing with illegal logging: After a suspicion, on day 1, data is collected. On day 2 the data is processed, and a map with relevant locations is sent to the field. On day 3 the area is flown over with the help of ultra-light planes. On day 4 a crew is sent to the narrowed-down set of locations. On day 5 the illegal loggers are in jail.

Written by Claus

September 23rd, 2009 at 3:58 pm



Hydrogeology Books

Mirko asked me today what books I would recommend in the area of hydrogeology. This is a relatively broad area. However, I thought it might be useful to some of you if I wrote up a list of books that I remember finding useful when studying hydrogeology at Masters level.

Freeze and Cherry, 1979

This is a classic. No doubt. I think it’s really well written. There are lots of stories circulating about page 29, but hey, what a great page. Please note that the cover is upside down… Most importantly, I like the clarity of its style.

There are a variety of modelling-related books, but I’ll cover those the next time!

Written by Claus

September 15th, 2009 at 9:37 pm
