## Crisp and Interval-Based Conditional Probabilities

“Censored data” is the common statistical term for values that are only known to lie within an interval. A typical example from environmental data-sets is a measurement below a certain detection limit. If a measurement is below the detection limit, due to the analytical performance of the device that measures the concentration, we don’t know its precise value, but we do know that it is somewhere in the interval \[(0, \textnormal{detection limit})\].
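One generic way to use this interval information is through the likelihood: a quantified measurement contributes its density, while a censored measurement contributes the probability of falling below the detection limit. The following is a minimal sketch of that textbook idea, not necessarily the method used in the publication mentioned below. The normal model for log-concentrations, the function name `censored_log_likelihood`, and all numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def censored_log_likelihood(mu, sigma, observed, n_censored, detection_limit):
    """Log-likelihood of a normal model for log-concentrations with left-censoring.

    observed: quantified concentrations (all above the detection limit)
    n_censored: number of measurements reported as below the detection limit
    """
    dist = NormalDist(mu, sigma)
    # Quantified measurements contribute through the density of their log-value.
    ll = sum(math.log(dist.pdf(math.log(c))) for c in observed)
    # Censored measurements contribute through P(log-value < log detection limit).
    ll += n_censored * math.log(dist.cdf(math.log(detection_limit)))
    return ll


# Hypothetical data: three quantified values and four values below the limit.
observed = [1.2, 3.5, 0.8]
for mu in (-1.0, 0.0, 1.0):
    ll = censored_log_likelihood(mu, 1.0, observed, n_censored=4, detection_limit=0.5)
    print(f"mu = {mu:+.1f}: log-likelihood = {ll:.3f}")
```

Maximising this function over `mu` and `sigma` would give parameter estimates that take the censored values into account without replacing them by an arbitrary number.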

## Application in Environmental Hydrology

A typical example where censored measurements play an important role is solute concentrations in groundwater. The measured concentration of a solute depends on the analytical method that was used to quantify it. Sometimes the concentration is so small that we cannot be certain about its value, and we assume that the true value is somewhere between zero and the analytical detection limit.

In a recent example, my coauthors and I demonstrated the importance of including censored measurements to derive a representative concentration of chlorinated solutes in a hydrogeological layer at two boreholes within a fractured sandstone. Due to the fractured nature of the sandstone, the concentrations at most depths were fairly small and frequently below the detection limit, whereas large concentrations were typically encountered in the fractures. Taking the censored measurements (the concentrations below the detection limit) into account in a statistically meaningful way led to an estimate of representative concentrations that corresponded to the conceptual hydrogeological site model at the upstream and downstream borehole, and can be important for site assessment.

Related to censored measurements, but different, are true zeros. An example of a true zero is a rain gauge that measures precipitation when it does not rain. The distinction between a true zero and a measurement below the detection limit can be tricky, because both are small values. If you’re interested in how to include true zeros in this approach, please continue to read here. A truly zero measurement means that its value is exactly zero and not somewhere in the interval between zero and the detection limit.

If you are interested in a statistically reasonable treatment of censored measurements, you can find the related publication in Environmental Science and Technology.

I’ll explain the basic underlying theory below.

## Basic Statistics Example

I wrote about conditional probabilities quite some time ago. This post can be viewed as an extension.

A crisp condition is something like “what is the probability of event A occurring, given that event B has occurred”. This is how conditional probabilities are typically taught. Compared to a univariate density, a conditional density should have a smaller variance, and it is shifted towards the condition. So far so good.

It turns out that there is also a “not-crisp” condition. This is something like “the probability of event A given that ‘event’ B is somewhere between zero and b”. The funny thing is that the uncertainty about this event is still smaller than that of the corresponding univariate normal distribution.

When looking at the figure below, this means:

- the yellow line indicates a standard normal (Gaussian) density with variance 1
- two crisp conditional densities are shown by the solid (\[p(x|y=-2.0)\]) and the dashed (\[p(x|y=+2.0)\]) lines. Both of these densities have a smaller uncertainty (variance) than the univariate standard normal
- two interval-based conditional densities are shown in red (\[p(x|y \leq -2.0)\]) and blue (\[p(x|y \leq +2.0)\]). The interval-based densities have the same location as the crisp conditionals. Their uncertainties are smaller than that of the corresponding univariate density, but larger than those of the crisp conditionals.
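The ordering of the variances can be checked numerically. The sketch below assumes a standard bivariate Gaussian for x and y with a correlation of 0.8; that value is an assumption, since the correlation behind the figure is not stated. It uses the closed-form variance of a truncated standard normal and only looks at the condition \[y \leq -2.0\]. The function names are made up for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, variance 1


def crisp_conditional_variance(rho):
    """Variance of x given y = b (does not depend on b for a bivariate Gaussian)."""
    return 1.0 - rho**2


def interval_conditional_variance(rho, b):
    """Variance of x given y <= b for a standard bivariate Gaussian."""
    # Variance of a standard normal truncated to (-inf, b].
    lam = STD_NORMAL.pdf(b) / STD_NORMAL.cdf(b)
    var_y_truncated = 1.0 - b * lam - lam**2
    # x = rho * y + sqrt(1 - rho^2) * noise, with noise independent of y.
    return (1.0 - rho**2) + rho**2 * var_y_truncated


rho = 0.8  # assumed correlation, not taken from the original figure
b = -2.0
var_crisp = crisp_conditional_variance(rho)
var_interval = interval_conditional_variance(rho, b)
print(f"univariate variance:           {1.0:.3f}")
print(f"interval condition (y <= {b}): {var_interval:.3f}")
print(f"crisp condition (y = {b}):     {var_crisp:.3f}")
```

Under these assumptions the interval-based variance falls between the crisp conditional variance and the univariate variance of 1.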


## Hockey Stats (Nürnberg Plays Cologne Tonight)

I have been playing a bit with hockey data. There is some data wrangling, there is some interesting basic statistics, and there is some Bayes. As this has nothing to do directly with water (other than that it’s played on frozen water), I posted here.

tl;dr: The statistics for both teams suggest that the series is very close. Guess what, this is also what I saw when I watched it. Despite this similarity, the numbers favour Cologne slightly but consistently. Granted, the analysis averages a lot and does not distinguish deeply.

## Digging Into My Research Database

A new version of Script Debugger was released recently, and I dug a bit into it, using my research database Papers.

For fun, I linked AppleScript (which digs into my database on macOS) with Python, which processes the data and creates a histogram.

The process worked nicely, and being able to debug AppleScript is wonderful.
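A rough sketch of how such a link can look is given below. It is not the script from the post: the AppleScript source is passed in as a plain string, the comma-separated list of publication years is a hypothetical output format, and the function names are invented for this example. Only the `osascript` command on macOS and the Python standard library are used.

```python
import subprocess
from collections import Counter


def run_applescript(source):
    """Run AppleScript source with the macOS `osascript` command and return its output."""
    result = subprocess.run(
        ["osascript", "-e", source], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()


def parse_years(output):
    """Parse a comma-separated list such as '2001, 2005' into integers."""
    return [int(item) for item in output.split(",") if item.strip()]


def year_histogram(years):
    """Count how often each year occurs, sorted by year."""
    return dict(sorted(Counter(years).items()))


def print_histogram(hist):
    """Print one row of '#' characters per year."""
    for year, count in hist.items():
        print(f"{year} {'#' * count}")


# Hypothetical output; on macOS it could come from run_applescript(...) instead.
sample_output = "2001, 2005, 2005, 2012"
print_histogram(year_histogram(parse_years(sample_output)))
```

Keeping the `osascript` call in its own function means the parsing and counting can be tried out on any machine, while the database query itself only runs on macOS.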

More info at claus-haslauer.de

## Smartphones and Creative Ideas

At NASA, at least some people get rid of “smart phones” to get creative ideas back: Lynda Barry at NASA’s Goddard Space Flight Center (via ruk.ca):

Barry’s impact on the assembled Goddard employees was immediate; from the moment she arrived, she insisted on abandoning all electronic devices. “They were really flipped out about it,” says Barry. “The phone gives us a lot but it takes away three key elements of discovery: loneliness, uncertainty and boredom. Those have always been where creative ideas come from.”

At the time of writing this, the Süddeutsche Zeitung insists that social media (WhatsApp) “belong in classrooms”.

**update 2017-Oct-11**

- Tagesschau reports that 14- to 29-year-old Germans are online for about 4.5 hours per day
- The Guardian has a longer report on how smartphones are hijacking our minds. The text warns about a much more severe consequence: “Drawing a straight line between addiction to social media and political earthquakes like Brexit and the rise of Donald Trump, they contend that digital forces have completely upended the political system and, left unchecked, could even render democracy as we know it obsolete.” The article goes on to explain how certain hooks are built into smartphone-related technology that are designed to keep you there and to earn advertising dollars for the companies.

## Days 2&3 at #spatialstatistics2017

It became increasingly difficult to post updates on the spatial statistics conference. There was the icebreaker, another day full of diverse and interesting talks, the dinner, and a final day that ended the conference with an interesting session honouring the achievements of Peter Diggle. Former and current colleagues such as Paulo Ribeiro and Emanuele Giorgi gave enlightening talks that stressed both the scientific achievements and the great kindness and humanity of Peter Diggle. CHICAS, the Centre for Health Informatics, Computing, and Statistics, is the current culmination of his efforts.

It’s hard to pick topics that stood out during the last two days of the conference, just because there were many great talks on a large variety of topics. Here is an attempt.

## Point Processes

There were a number of talks covering Point Processes, notably the keynotes by Thordis Thorarinsdottir and Rasmus Waagepetersen. Thordis had a variety of interesting quotes including this one by Frank H Bigelow from 1905:

There are three processes that are generally essential for the complete development of any branch of science, and they must be accurately applied before the subject can be considered to be satisfactorily explained. The first is the discovery of a mathematical analysis, the second is the discussion of numerous observations, and the third is a correct application of the mathematics to the observations, including a demonstration that these are in agreement.

Thordis urged the need for more and better inference methods. It might be worth pointing out that Bigelow went on to state that

Often a good theory is misapplied to good observations, or good observations are explained by a poor theory.

In summary, these thoughts are not too far away from Peter Diggle’s triangle, pictured above.

## Copulas

There were two nice talks that employed copulas for multivariate spatial models and one that I missed, unfortunately:

- Jonathan Tawn from the University of Lancaster presented on “*Modelling Spatial Extreme Events*”; he takes great care of marginal distributions and how to reasonably include extremes there for a better joint representation in copula space;
- Fakhereh Alidoost and Alfred Stein from the University of Twente presented on “*Interpolation of Daily Mean Air Temperature Data via Spatial and Non-Spatial Copulas*”;
- the talk that I missed was entitled “*Hierarchical Copula Regression Models for Areal Data*”, presented by D. Musgrove, J. Hughes and L. Eberly.

## Various

- Denis Allard presented on weather generators and the issues related to the different dependence structures of the variables that are typically included, and advertised a workshop on stochastic weather generators coming up in Berlin.
- Ricardo Carrizo Vergara, a student of Denis Allard, is investigating the relationship between SPDEs and geostatistics.
- Pierre Goovaerts showed his insight into the Flint water crisis, which is published in three papers (1, 2, 3).

## Day 1 at #spatialstatistics2017

Peter Atkinson opened the conference by pointing out its broad scope: “one health” (e.g., CDC, UC Davis), which relates to human, veterinary, and environmental health. I was glad that my talk on interpolating groundwater quality data fit right into that scope.

I saw too many interesting talks and met too many interesting and nice people to list everything here. Instead, this is a small selection.

## Connections

First off, it’s nice to encounter similarly minded work. Particularly, I was happy to see the following presentations:

- Emilie Chautru presented a poster entitled “Cokriging of Nonnegative Data on the L1 Sphere”, on Cokriging compositional data;
- Svenia Behm from the University of Passau presented a talk entitled “Statistical Inference in the RIO Model – the Detrending Step Revisited”. She calculates something similar to my “locally mixed distributions”;
- A. Lawson pointed out the importance of properly taking censored measurements and true zeros into account, both in his keynote (“One Health: Spatial Statistics at the Border of Human and Veterinary Health”) and in his talk (“Bayesian Cure-Rate Survival Model With Spatially Structured Censoring”). I didn’t talk about it at this conference, but it is dear to my heart;

## Cool Stuff

- M. Pereira showed cool images of road crash density estimates based on data from Paris, France. Benedikt Gräler showed a poster with the Envirocar initiative. Data related to driving patterns and fuel consumption is collected while driving, is analysed, and can be viewed online.
- Samir Bhatt gave a great keynote presentation on mapping malaria endemicity. Besides the interesting issues related directly to malaria, this talk raised some interesting questions on modelling philosophies. Samir Bhatt proposed “richer models” as a way forward beyond his current practice of using multivariate models. Alternatively, he phrased it as models that “include mechanisms”. Peter Diggle asked how his approach relates to the concept of parsimony. It is interesting to me that Samir Bhatt suggests including mechanistic models in his data-driven models, whereas for the groundwater quality mapping project I am working on, I have moved to a stochastic model. On the scale of the state, I see that deterministic, PDE-based models are not feasible (too many unknown parameters and processes).

## New Papers!

I published two new papers recently! Find the titles and the links to more information below. Happy reading!

- “*Detecting and Modelling Structures on the Micro and the Macro Scales: Assessing Their Effects on Solute Transport Behaviour*” – This paper sheds light onto a tricky issue: Is a spatial data-set stationary or not? This paper shows a method that can help to decide where to delineate a boundary (“macroscale”) between regions that are at least somewhat more stationary than the entire domain. Furthermore, this paper
  - validates the algorithm based on a data-set where a boundary layer has previously been delineated;
  - demonstrates the effects of the macro structure and the smaller scale heterogeneity (“micro structure”) on solute transport behaviour. The micro structure is modelled by multivariate Gaussian and multivariate non-Gaussian structures.
- “*Estimating a Representative Value and Proportion of True Zeros for Censored Analytical Data with Applications to Contaminated Site Assessment*” – True zeros such as no precipitation occur frequently in nature. This is one of the very few studies I know of that treats those values in a statistically meaningful way and is based on a real-world data-set. We applied the methodology to a data-set related to contaminated sites, but this has implications everywhere else.

## Own Your Writing

I just posted on claus-haslauer.de about “Own Your Writing!”.

In this post, I

## Thresholds

In my work about spatial dependence, I do see that in different ranges of quantiles, the type of dependence can differ. More generally, this means that thresholds are an important characteristic of environmental systems.

This is why I think this video that I noticed on kottke.org is so inspiring: sometimes something small leads to a big change — a “threshold” is “jumped over”: