## Archive for September, 2009

## Regression 101 (SD line, statistics)

It is no secret that I work in geostatistics. More specifically, I try to use copulas to model fields of spatially distributed parameters by mathematically describing their dependence structure. I set out to shed some light on what this all means, or what I understand of it, by writing a series of blog posts on planetwater.org.

In this first post I am going to try and explain a very basic tool of traditional statistics: regression. In future posts I am going to try and explain more topics, with the goal of describing how copulas can work in spatial statistics. Today, I am going to use the classic Galton data-set of pairwise measurements of the heights of 1078 fathers and their sons. In the next post I will try to show some of these concepts using current meteorological data. Granted, some aspects of the analysis are adapted from the classic book “Statistics” by Freedman.

Here’s what wikipedia has to say about the Galton data-set:

Pearson was instrumental in the development of this theory. One of his classic data sets (originally collected by Galton) involves the regression of sons’ height upon that of their fathers’.

The data for this post was taken from this server at Berkeley; a similar analysis to the one presented here can be found in this pdf by Michael Wichura. A scatterplot of these pairs of measurements is shown in Figure 1, with the fathers’ heights along the x-axis and the corresponding sons’ heights along the y-axis.

Figure 1 shows that most of the measured men were around 68 inches tall: that is where the dots in the scatterplot are densest. The point cloud is not circular but elliptical, with its long axis pointing toward the top right. Hence, on average within the measured population, the taller the fathers are, the taller their sons are.

Fair enough, generally, tall fathers have tall sons. But are the sons taller than the fathers? If yes, then by how much? To answer this question, let’s look a bit closer, beginning with some basic measures of descriptive statistics and then moving on to some classic regression analysis results.

## Descriptive Statistics

The results of some basic descriptive calculations are presented in Table 1. They confirm in a little more detail what we saw at first glance in Figure 1:

- fathers and sons actually are around 68 inches tall
- on average, sons tend to be a little (0.99 inches) taller than their fathers.

Table 1: Summary statistics of the fathers’ and sons’ heights (in inches).

| | mean | std. dev. | min | max |
|---|---|---|---|---|
| fathers | 67.69 | 2.74 | 59.00 | 75.43 |
| sons | 68.68 | 2.81 | 58.51 | 78.36 |

Note that the range for sons (19.85 inches) is bigger than for fathers (16.43 inches).
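The two derived numbers quoted above (the 0.99 inch difference in means and the two ranges) can be re-computed directly from Table 1. The following is only a small sketch; the dictionary layout and the names `table1` and `height_range` are my own choice for illustration, not part of the original Mathematica notebook.

```python
# Summary values copied from Table 1 (heights in inches).
table1 = {
    "fathers": {"mean": 67.69, "sd": 2.74, "min": 59.00, "max": 75.43},
    "sons": {"mean": 68.68, "sd": 2.81, "min": 58.51, "max": 78.36},
}


def height_range(row):
    """Range = largest minus smallest observed height."""
    return row["max"] - row["min"]


mean_difference = table1["sons"]["mean"] - table1["fathers"]["mean"]
range_fathers = height_range(table1["fathers"])
range_sons = height_range(table1["sons"])

print(f"sons are on average {mean_difference:.2f} inches taller")
print(f"range of fathers' heights: {range_fathers:.2f} inches")
print(f"range of sons' heights:    {range_sons:.2f} inches")
```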

## Regression

For the upcoming analysis, let’s call the sons’ heights *y* and the fathers’ heights *x*. Figure 2 summarizes the key results. The classic linear regression line is drawn twice, in solid blue and in solid green. The two lines are identical; the only difference is that the blue line is calculated by a built-in Mathematica function, whereas the green line is calculated directly from the following formulas:

$$ r = \frac{s_{xy}}{s_{x} s_{y}}, \qquad m = \frac{s_{xy}}{s_{x}^{2}} $$

where $$r$$ is the correlation coefficient, which here has a value of 0.51, $$s_{xy}$$ is the covariance, $$s_{x}$$ and $$s_{y}$$ are the standard deviations of *x* and *y*, respectively, and $$m$$ is the slope of the regression line. The equation for the regression line is then given by:

$$ y = 0.51 \cdot x + 33.89 $$

Both the green and the blue line go through the point $$( \bar{x}, \bar{y} ) $$. Their common slope *m* can also be expressed in terms of *r* by combining the two equations above:

$$ m = r \cdot \frac{s_{y}}{s_{x}} $$
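To make the formulas concrete, here is a minimal sketch in Python (not the Mathematica notebook used for this post). It computes the covariance, the standard deviations, *r*, the slope and the intercept by hand, and then checks that the two expressions for the slope agree. The six height pairs are made up for illustration and are not the Galton data; I assume population statistics (dividing by n), which does not matter for *r* or the slope because the factor cancels.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance s_xy (divide by n)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def std_dev(values):
    """Population standard deviation, i.e. sqrt of the covariance with itself."""
    return sqrt(covariance(values, values))


def regression_line(xs, ys):
    """Return (r, slope, intercept) of the regression of y on x."""
    s_xy = covariance(xs, ys)
    s_x, s_y = std_dev(xs), std_dev(ys)
    r = s_xy / (s_x * s_y)            # correlation coefficient
    m = s_xy / s_x ** 2               # slope
    b = mean(ys) - m * mean(xs)       # line passes through (mean x, mean y)
    return r, m, b


# Made-up heights in inches, NOT the Galton data.
fathers = [65.0, 66.5, 67.0, 68.0, 69.5, 71.0]
sons = [66.0, 66.0, 68.5, 67.5, 70.0, 70.5]

r, m, b = regression_line(fathers, sons)
slope_from_r = r * std_dev(sons) / std_dev(fathers)

print(f"r = {r:.3f}, slope = {m:.3f}, intercept = {b:.3f}")
print(f"r * s_y / s_x = {slope_from_r:.3f}")
```

Both slope expressions give the same number, and evaluating the line at the mean of `fathers` returns the mean of `sons`.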

Figure 2 shows a third solid line, coloured in red, which points in a similar direction as the green/blue line and which also goes through $$( \bar{x}, \bar{y} ) $$. However, at about 1.03 (the ratio 2.81/2.74 of the standard deviations in Table 1), the slope of the red line is very close to 1.0. The slope of the red line is calculated in much the same way as for the green/blue line, except that sign(r) is used instead of r, which gives $$ \mathrm{sign}(r) \cdot s_{y} / s_{x} $$. Since, in addition, $$s_{x} \approx s_{y}$$, the slope is about 1.0. The red line is also called the “SD line” because it goes through the points that lie equal multiples of the standard deviations away from the mean, both for *x* and *y*.
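A short sketch of the SD line in Python, again using made-up heights rather than the Galton data. The helper name `sd_line` is hypothetical; it simply implements “slope = sign(r) times the ratio of the standard deviations, passing through the point of means”.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Population standard deviation (divide by n)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    s_xy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return s_xy / (std_dev(xs) * std_dev(ys))


def sd_line(xs, ys):
    """Return (slope, intercept) of the SD line."""
    sign = 1.0 if correlation(xs, ys) >= 0 else -1.0
    slope = sign * std_dev(ys) / std_dev(xs)
    intercept = mean(ys) - slope * mean(xs)
    return slope, intercept


# Made-up heights in inches, NOT the Galton data.
fathers = [65.0, 66.5, 67.0, 68.0, 69.5, 71.0]
sons = [66.0, 66.0, 68.5, 67.5, 70.0, 70.5]

slope, intercept = sd_line(fathers, sons)

# One SD to the right of the mean of x, the SD line is one SD above the mean of y.
x_one_sd = mean(fathers) + std_dev(fathers)
y_on_sd_line = slope * x_one_sd + intercept

print(f"SD line: y = {slope:.3f} * x + {intercept:.3f}")
print(f"at x = {x_one_sd:.2f} the SD line gives y = {y_on_sd_line:.2f}")
print(f"mean(y) + s_y = {mean(sons) + std_dev(sons):.2f}")
```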

Now, what really is the difference between the two lines? The fathers who are one standard deviation above the average fathers’ height are plotted along the orange dashed line. Similarly, the sons who are one standard deviation above the average sons’ height are plotted along the yellow dashed line. The two dashed lines intersect, as expected, on the red SD line. However, most of the points along the orange dashed line lie below the yellow dashed line (see also Figure 3). In other words, most of the sons whose fathers were one standard deviation above the average fathers’ height were quite a bit shorter than one standard deviation above the average sons’ height. This is where the correlation *r* of ~0.5 comes into play. Associated with an increase of one standard deviation in fathers’ height is an increase of only 0.5 standard deviations in sons’ height, on average. That point is exactly where the green solid regression line intersects the orange dashed line. This also means that for an *r* of exactly 1, the SD line and the regression line would be identical.

Let’s try and shed some more light on the relation between *r*, the regression line, and the SD line. Let’s pretend we are moving along the orange dashed line in Figure 2 and, while we’re moving, we count how many dots we pass and note their y values. When we’re done, we plot the “collected” data as a histogram, shown in Figure 3. It turns out that about half the points we encounter are below 70.1, indicated by the green line. 70.1 is the value of the regression line at one SD to the right of $$ \bar{x}$$, that is, $$ \bar{y} + r \cdot s_{y} $$. The intersection with the SD line occurs at ~71.5, that is, at $$ \bar{y} + s_{y} $$, which is $$ (1 - r) \cdot s_{y} $$ higher than 70.1.
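The two heights quoted in this paragraph can be reproduced from Table 1 and the correlation coefficient alone. This is only a sketch of the arithmetic in Python: one fathers’ SD to the right of the fathers’ mean, the regression line rises by $$ r \cdot s_{y} $$ above the sons’ mean, whereas the SD line rises by a full $$ s_{y} $$.

```python
# Values from Table 1 (inches) and the correlation coefficient quoted in the post.
mean_x, sd_x = 67.69, 2.74   # fathers
mean_y, sd_y = 68.68, 2.81   # sons
r = 0.51

x_one_sd = mean_x + sd_x             # position of the orange dashed line
y_regression = mean_y + r * sd_y     # regression line at that x, about 70.1
y_sd_line = mean_y + sd_y            # SD line at that x, about 71.5
gap = y_sd_line - y_regression       # equals (1 - r) * sd_y

print(f"fathers' height one SD above average: {x_one_sd:.1f}")
print(f"regression line there:                {y_regression:.1f}")
print(f"SD line there:                        {y_sd_line:.1f}")
print(f"gap between the two lines:            {gap:.1f}")
```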

## Conclusion

The regression line is a linear least squares fit to your data. As Freedman puts it: “Among all lines, the regression line for y on x makes the smallest root mean square error in predicting y from x”. For each change of one SD in the x direction, the regression line changes in the y direction by $$ r \cdot s_{y} $$. The SD line can be used as a different summary statistic of the scatterplot.
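Freedman’s statement can be checked numerically. The sketch below (Python, with made-up heights instead of the Galton data, and hypothetical helper names) computes the root mean square error of the regression line and of the SD line for the same sample. For the regression line, the RMS error should also equal $$ \sqrt{1 - r^{2}} \cdot s_{y} $$, which is a standard result when population standard deviations are used.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def std_dev(values):
    """Population standard deviation (divide by n)."""
    m = mean(values)
    return sqrt(mean([(v - m) ** 2 for v in values]))


def correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    s_xy = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return s_xy / (std_dev(xs) * std_dev(ys))


def line_through_means(xs, ys, slope):
    """Return (slope, intercept) of the line with this slope through the means."""
    return slope, mean(ys) - slope * mean(xs)


def rms_error(xs, ys, slope, intercept):
    """Root mean square of the vertical prediction errors."""
    return sqrt(mean([(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]))


# Made-up heights in inches, NOT the Galton data.
fathers = [65.0, 66.5, 67.0, 68.0, 69.5, 71.0]
sons = [66.0, 66.0, 68.5, 67.5, 70.0, 70.5]

r = correlation(fathers, sons)
ratio = std_dev(sons) / std_dev(fathers)

regression = line_through_means(fathers, sons, r * ratio)
sd_line = line_through_means(fathers, sons, ratio)  # sign(r) is +1 for this sample

rms_regression = rms_error(fathers, sons, *regression)
rms_sd_line = rms_error(fathers, sons, *sd_line)

print(f"RMS error of the regression line: {rms_regression:.3f}")
print(f"RMS error of the SD line:         {rms_sd_line:.3f}")
print(f"sqrt(1 - r^2) * s_y:              {sqrt(1 - r ** 2) * std_dev(sons):.3f}")
```

For any sample with |r| < 1 the SD line makes the larger error of the two, which is why it serves as a summary of the point cloud rather than as a predictor.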

It remains to point out

- that the linear regression described here is only appropriate for linear relationships!
- that under all circumstances you should first think very carefully about whether the two variables you are using for regression really are related!

In the next post, I will do a very similar analysis with some real meteorological data. You can access the Mathematica v.7 notebook that I used for this post here.

## Identi.ca Updates for 2009-09-28

- NASA found 99% pure water ice on Mars: http://bit.ly/HrvtD #
- Google sightseeing: The Vajont Dam: http://bit.ly/2i8X6x #

## Tipping Point Crossed for “Planetary Boundaries”

Twenty-nine scientists published the concept of a “safe operating space for humanity” in “Nature” two days ago. Here’s the description of what this operating space is, straight from their paper:

To meet the challenge of maintaining the Holocene state, we propose a framework based on ‘planetary boundaries’. These boundaries define the safe operating space for humanity with respect to the Earth system and are associated with the planet’s biophysical subsystems or processes. Although Earth’s complex systems sometimes respond smoothly to changing pressures, it seems that this will prove to be the exception rather than the rule. Many subsystems of Earth react in a nonlinear, often abrupt, way, and are particularly sensitive around threshold levels of certain key variables. If these thresholds are crossed, then important subsystems, such as a monsoon system, could shift into a new state, often with deleterious or potentially even disastrous consequences for humans.

The figure below is used to illustrate their concept:

It should be noted that their analysis is based on data, even though I haven’t found a clear description of how they calculated the distance from the tipping point. Here is a more detailed description from their paper:

Three of the Earth-system processes – climate change, rate of biodiversity loss and interference with the nitrogen cycle – have already transgressed their boundaries. [This transgression] cannot continue without significantly eroding the resilience of major components of Earth-system functioning. Here we describe these three processes.

Although the planetary boundaries are described in terms of individual quantities and separate processes, the boundaries are tightly coupled. We do not have the luxury of concentrating our efforts on any one of them in isolation from the others. If one boundary is transgressed, then other boundaries are also under serious risk. For instance, significant land-use changes in the Amazon could influence water resources as far away as Tibet. The climate-change boundary depends on staying on the safe side of the freshwater, land, aerosol, nitrogen–phosphorus, ocean and stratospheric boundaries. Transgressing the nitrogen–phosphorus boundary can erode the resilience of some marine ecosystems, potentially reducing their capacity to absorb CO2 and thus affecting the climate boundary.

It seems like a good idea to promote the idea that we have to take care of many tipping points at the same time. It seems even more important to stress non-linear and non-reversible behaviour. This is nothing new, but it is important to stress such things once in a while. If a contaminant plume were reversible, many of our subsurface remediation problems would be solved quite easily. However, there is dispersion, and hence a plume cannot be reversed. A similar example, related to the contamination of a lake, is given by Shahid Naeem, as quoted by Carl Zimmer:

A lake, for example, can absorb a fair amount of phosphorus from fertilizer runoff without any sign of change. ‘You add a little, not much happens. Add a little more, not much happens. Add a little… then, all of sudden, you add a little more and – boom! – phytoplankton bloom, oxygen depletion, fish die-off, smelliness. Remove the little phosphorus that caused the tipping of the system, and it does not reverse. In fact, you have to go back to much cleaner water than you would have imagined.’

To conclude, it seems like a neat idea to establish indicators that tell us in which areas we are doing ok and in which areas we have exceeded the threshold. However, such a compartmentalized visualization seems to contradict the intention of the authors, who write that they had the *coupling* of the compartments in mind.

Where does this leave us on an operational level? Are those guys going to publish their indicator-levels every half year from now on, and then we can see the areas where we improved and where things got worse? Could we even narrow all human activities down to one indicator? If not, then why those seven? And how come we exceeded the outer limit of earth for “Biodiversity loss” while we’re only one step outside the green zone for climate change?

It remains to be noted that both “Atmospheric aerosol loading” and “Chemical Pollution” are not yet quantified, and it is not clear why.

## Further resources

- The Stockholm Resilience Centre, where the lead author Johan Rockström is based;
- An editorial at Nature;
- Commentaries by “seven experts” (one for each category)
- discussion at wired.com

Johan Rockström, Will Steffen, Kevin Noone, Åsa Persson, F. Stuart Chapin, III, Eric F. Lambin, Timothy M. Lenton, Marten Scheffer, Carl Folke, Hans Joachim Schellnhuber, Björn Nykvist, Cynthia A. de Wit, Terry Hughes, Sander van der Leeuw, Henning Rodhe, Sverker Sörlin, Peter K. Snyder, Robert Costanza, Uno Svedin, Malin Falkenmark, Louise Karlberg, Robert W. Corell, Victoria J. Fabry, James Hansen, Brian Walker, Diana Liverman, Katherine Richardson, Paul Crutzen & Jonathan A. Foley: “A safe operating space for humanity”, Nature 461, 472–475 (24 September 2009), doi:10.1038/461472a

## Identi.ca Updates for 2009-09-24

- The NYTimes on violations against the Clean Water Act: http://twiturl.de/dadej #
- The economist on "the global water crisis": http://twiturl.de/capaj #
- overview article of the economist on water and tradable water rights: http://twiturl.de/depor #
- Plant ecologists predict: in 60 years: species from Spain and Turkey in Germany: http://twiturl.de/rimob via http://twiturl.de/necap #

## Videos of ESRI Conference

ESRI posted videos of their recent user conference. I couldn’t find a way to link to the individual videos, but the link goes to an overview page from which you should find your way.

There are two presentations that I found very interesting:

- A presentation by FedEx on how they use GIS. Real time baby!
- A keynote by Willie Smits, initiator of the Masarang foundation, on how he uses GIS to fundamentally analyze the situation of oil palms on Borneo. Here is his workflow for dealing with illegal logging: once there is a suspicion, data is collected on day 1. On day 2 the data is processed, and a map with the relevant locations is sent to the field. On day 3 the area is flown over with ultra-light planes. On day 4 a crew is sent to the narrowed-down set of locations. On day 5 the illegal loggers are in jail.

## Identi.ca Updates for 2009-09-23

- a look at the water-footprint of coffee: http://twiturl.de/lirul #

## Identi.ca Updates for 2009-09-22

- #hackspace in Stuttgart! http://bit.ly/3ohYdo! Cheers to the initiators! @ansi @momorientes @ jvanvinvkenroye #hsp0711 #

## Identi.ca Updates for 2009-09-21

- novel approach to waste water cleaning: decentralized, on roofs: http://twiturl.de/kihod #
- Regman is raffling off the Next Big Thing from Apple (tablet, iPhone??). I’m in, too: http://www.regman.de/news #

## Identi.ca Updates for 2009-09-16

- SevenSnap is raffling off a MacBook Pro. With this post I’m entering: http://www.sevensnap.de/win.php #

## Hydrogeology Books

Mirko asked me today what books I would recommend in the “hydrogeology – area”. This is a relatively broad area. However, I thought it might be useful to some of you to write up a list of books that I remember I found useful when studying hydrogeology at a Masters level.

“Applied Hydrogeology” by Fetter (2001): this is the book I learned most of my physical hydrogeology from. It’s ok.

This is a classic. No doubt. I think it’s really well written. There are lots of stories circulating about page 29, but hey, what a great page. Please note that the cover is upside down… Most importantly, I like the clarity of its style.

Also by Fetter, there’s the book “Contaminant Hydrogeology” (1998). I neither own it nor have I used it, but people tell me it’s ok.

When I learned contaminant hydrogeology, I guess I used a combination of three books: “Geochemistry, Groundwater and Pollution” by Appelo and Postma (2005), “Aquatic Chemistry: Chemical Equilibria and Rates in Natural Waters” by Stumm and Morgan (1995) and “Dense Chlorinated Solvents” edited by Pankow and Cherry (1996).

For the unsaturated zone I can highly recommend “Environmental Soil Physics: Fundamentals, Applications, and Environmental Considerations” by Hillel (1996) – very simple and “to the point” explanations! Another recommendation, although I don’t think you can get it anymore, is “Mechanics of heterogeneous fluids in porous media” by Corey (1977). “Corey” as in Brooks-Corey relationship.

On the practical side, I can recommend the legendary “Groundwater and Wells” by Driscoll (1986) (as a reference), “Analysis and Evaluation of Pumping Test Data” by Kruseman and de Ridder (1990) for hydraulic aquifer analyses, and, for general field-work (as a useful reminder of things not to forget), “Manual of Applied Field Hydrogeology” by Weight and Sonderegger (2000).

After having taken the short course that I keep mentioning, I turned into a big fan of “Dynamics of Fluids in Porous Media” by Jacob Bear (1972). Jacob Bear has also written other noteworthy books, and “Modeling Groundwater Flow and Contaminant Transport (Theory and Applications of Transport in Porous Media)” is due to be released in October.

There are a variety of modelling-related books, but I’ll cover those the next time!