
Biostatistics with R

An Introductory Guide for Field Biologists

Biostatistics with R provides a straightforward introduction on how to analyse data from the wide field of biological research, including nature protection and global change monitoring. The book is centred around traditional statistical approaches, focusing on those prevailing in research publications. The authors cover t-tests, ANOVA and regression models, but also the advanced methods of generalised linear models and classification and regression trees. Chapters usually start with several useful case examples, describing the structure of typical datasets and proposing research-related questions. All chapters are supplemented by example datasets and thoroughly explained, step-by-step R code demonstrating the analytical procedures and interpretation of results. The authors also provide examples of how to appropriately describe statistical procedures and results of analyses in research papers. This accessible textbook will serve a broad audience of interested readers, from students, researchers or professionals looking to improve their everyday statistical practice, to lecturers of introductory undergraduate courses. Additional resources are provided on www.cambridge.org/biostatistics.

Jan Lepš is Professor of Ecology in the Department of Botany, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic and senior researcher in the Biology Centre of the Czech Academy of Sciences in České Budějovice. His main research interests include plant functional ecology, particularly the mechanisms of species coexistence and stability, and ecological data analysis. He has taught many ecological and statistical courses and supervised more than 80 student theses, from undergraduate to PhD.

Petr Šmilauer is Associate Professor of Ecology in the Department of Ecosystem Biology, Faculty of Science, University of South Bohemia, České Budějovice, Czech Republic. His main research interests are multivariate statistical analysis, modern regression methods and the role of arbuscular mycorrhizal symbiosis in the functioning of plant communities. He is co-author of the multivariate analysis software Canoco 5, CANOCO for Windows 4.5 and TWINSPAN for Windows.


'We will never have a textbook of statistics for biologists that satisfies everybody. However, this book may come closest. It is based on many years of field research and the teaching of statistical methods by both authors. All useful classic and advanced statistical concepts and methods are explained and illustrated with data examples and R programming procedures. Besides traditional topics that are covered in the premier textbooks of biometry/biostatistics (e.g. R. R. Sokal & F. J. Rohlf, J. H. Zar), two extensive chapters on multivariate methods in classification and ordination add to the strength of this book. The text was originally published in Czech in 2016. The English edition has been substantially updated and two new chapters, 'Survival Analysis' and 'Classification and Regression Trees', have been added. The book will be essential reading for undergraduate and graduate students, professional researchers, and informed managers of natural resources.'

Marcel Rejmánek,
Department of Evolution and Ecology, University of California, Davis, CA, USA


Biostatistics with R
An Introductory Guide for Field Biologists

JAN LEPŠ
University of South Bohemia, Czech Republic

PETR ŠMILAUER
University of South Bohemia, Czech Republic


University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre, New Delhi 110025, India
79 Anson Road, #06-04/06, Singapore 079906

Cambridge University Press is part of the University of Cambridge. It furthers the University's mission by disseminating knowledge in the pursuit of education, learning, and research at the highest international levels of excellence.

www.cambridge.org
Information on this title: www.cambridge.org/9781108480383
DOI: 10.1017/9781108616041

© Jan Lepš and Petr Šmilauer 2020

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2020

Printed in the United Kingdom by TJ International Ltd, Padstow, Cornwall

A catalogue record for this publication is available from the British Library.

ISBN 978-1-108-48038-3 Hardback
ISBN 978-1-108-72734-1 Paperback

Additional resources for this publication at www.cambridge.org/biostatistics

Cambridge University Press has no responsibility for the persistence or accuracy of URLs for external or third-party internet websites referred to in this publication and does not guarantee that any content on such websites is, or will remain, accurate or appropriate.


Contents

Preface page xiii

Acknowledgements xvii

1 Basic Statistical Terms, Sample Statistics 1

1.1 Cases, Variables and Data Types 1

1.2 Population and Random Sample 3

1.3 Sample Statistics 4

1.4 Precision of Mean Estimate, Standard Error of Mean 9

1.5 Graphical Summary of Individual Variables 10

1.6 Random Variables, Distribution, Distribution Function, Density Distribution 10

1.7 Example Data 13

1.8 How to Proceed in R 13

1.9 Reporting Analyses 17

1.10 Recommended Reading 18

2 Testing Hypotheses, Goodness-of-Fit Test 19

2.1 Principles of Hypothesis Testing 19

2.2 Possible Errors in Statistical Tests of Hypotheses 21

2.3 Null Models with Parameters Estimated from the Data: Testing Hardy–Weinberg Equilibrium 26

2.4 Sample Size 26

2.5 Critical Values and Significance Level 27

2.6 Too Good to Be True 29

2.7 Bayesian Statistics: What is It? 30

2.8 The Dark Side of Significance Testing 32

2.9 Example Data 35

2.10 How to Proceed in R 35

2.11 Reporting Analyses 37

2.12 Recommended Reading 37

3 Contingency Tables 39

3.1 Two-Way Contingency Tables 39

3.2 Measures of Association Strength 44

3.3 Multidimensional Contingency Tables 46

3.4 Statistical and Causal Relationship 47

3.5 Visualising Contingency Tables 49

3.6 Example Data 50

3.7 How to Proceed in R 50


3.8 Reporting Analyses 54

3.9 Recommended Reading 54

4 Normal Distribution 55

4.1 Main Properties of a Normal Distribution 55

4.2 Skewness and Kurtosis 56

4.3 Standardised Normal Distribution 57

4.4 Verifying the Normality of a Data Distribution 58

4.5 Example Data 60

4.6 How to Proceed in R 60

4.7 Reporting Analyses 63

4.8 Recommended Reading 64

5 Student's t Distribution 65

5.1 Use Case Examples 65

5.2 t Distribution and its Relation to the Normal Distribution 66

5.3 Single Sample Test and Paired t Test 67

5.4 One-Sided Tests 70

5.5 Confidence Interval of the Mean 72

5.6 Test Assumptions 73

5.7 Reporting Data Variability and Mean Estimate Precision 74

5.8 How Large Should a Sample Size Be? 77

5.9 Example Data 79

5.10 How to Proceed in R 79

5.11 Reporting Analyses 82

5.12 Recommended Reading 83

6 Comparing Two Samples 84

6.1 Use Case Examples 84

6.2 Testing for Differences in Variance 85

6.3 Comparing Means 87

6.4 Example Data 88

6.5 How to Proceed in R 88

6.6 Reporting Analyses 91

6.7 Recommended Reading 91

7 Non-parametric Methods for Two Samples 92

7.1 Mann–Whitney Test 93

7.2 Wilcoxon Test for Paired Observations 95

7.3 Using Rank-Based Tests 97

7.4 Permutation Tests 97

7.5 Example Data 99

7.6 How to Proceed in R 99

7.7 Reporting Analyses 102

7.8 Recommended Reading 103


8 One-Way Analysis of Variance (ANOVA) and Kruskal–Wallis Test 104

8.1 Use Case Examples 104

8.2 ANOVA: A Method for Comparing More Than Two Means 104

8.3 Test Assumptions 105

8.4 Sum of Squares Decomposition and the F Statistic 106

8.5 ANOVA for Two Groups and the Two-Sample t Test 108

8.6 Fixed and Random Effects 108

8.7 F Test Power 109

8.8 Violating ANOVA Assumptions 110

8.9 Multiple Comparisons 111

8.10 Non-parametric ANOVA: Kruskal–Wallis Test 115

8.11 Example Data 116

8.12 How to Proceed in R 117

8.13 Reporting Analyses 127

8.14 Recommended Reading 128

9 Two-Way Analysis of Variance 129

9.1 Use Case Examples 129

9.2 Factorial Design 130

9.3 Sum of Squares Decomposition and Test Statistics 132

9.4 Two-Way ANOVA with and without Interactions 134

9.5 Two-Way ANOVA with No Replicates 135

9.6 Experimental Design 135

9.7 Multiple Comparisons 137

9.8 Non-parametric Methods 138

9.9 Example Data 139

9.10 How to Proceed in R 139

9.11 Reporting Analyses 149

9.12 Recommended Reading 150

10 Data Transformations for Analysis of Variance 151

10.1 Assumptions of ANOVA and their Possible Violations 151

10.2 Log-transformation 153

10.3 Arcsine Transformation 156

10.4 Square-Root and Box–Cox Transformation 156

10.5 Concluding Remarks 157

10.6 Example Data 158

10.7 How to Proceed in R 158

10.8 Reporting Analyses 163

10.9 Recommended Reading 163

11 Hierarchical ANOVA, Split-Plot ANOVA, Repeated Measurements 164

11.1 Hierarchical ANOVA 164

11.2 Split-Plot ANOVA 167

11.3 ANOVA for Repeated Measurements 169


11.4 Example Data 171

11.5 How to Proceed in R 171

11.6 Reporting Analyses 181

11.7 Recommended Reading 182

12 Simple Linear Regression: Dependency Between Two Quantitative Variables 183

12.1 Use Case Examples 183

12.2 Regression and Correlation 184

12.3 Simple Linear Regression 184

12.4 Testing Hypotheses 187

12.5 Confidence and Prediction Intervals 190

12.6 Regression Diagnostics and Transforming Data in Regression 190

12.7 Regression Through the Origin 195

12.8 Predictor with Random Variation 197

12.9 Linear Calibration 197

12.10 Example Data 198

12.11 How to Proceed in R 198

12.12 Reporting Analyses 204

12.13 Recommended Reading 205

13 Correlation: Relationship Between Two Quantitative Variables 206

13.1 Use Case Examples 206

13.2 Correlation as a Dependency Statistic for Two Variables on an Equal Footing 206

13.3 Test Power 209

13.4 Non-parametric Methods 212

13.5 Interpreting Correlations 212

13.6 Statistical Dependency and Causality 213

13.7 Example Data 216

13.8 How to Proceed in R 216

13.9 Reporting Analyses 218

13.10 Recommended Reading 218

14 Multiple Regression and General Linear Models 219

14.1 Use Case Examples 219

14.2 Dependency of a Response Variable on Multiple Predictors 219

14.3 Partial Correlation 223

14.4 General Linear Models and Analysis of Covariance 224

14.5 Example Data 225

14.6 How to Proceed in R 226

14.7 Reporting Analyses 237

14.8 Recommended Reading 238


15 Generalised Linear Models 239

15.1 Use Case Examples 239

15.2 Properties of Generalised Linear Models 240

15.3 Analysis of Deviance 242

15.4 Overdispersion 243

15.5 Log-linear Models 243

15.6 Predictor Selection 244

15.7 Example Data 245

15.8 How to Proceed in R 246

15.9 Reporting Analyses 250

15.10 Recommended Reading 251

16 Regression Models for Non-linear Relationships 252

16.1 Use Case Examples 252

16.2 Introduction 253

16.3 Polynomial Regression 253

16.4 Non-linear Regression 255

16.5 Example Data 256

16.6 How to Proceed in R 256

16.7 Reporting Analyses 259

16.8 Recommended Reading 260

17 Structural Equation Models 261

17.1 Use Case Examples 261

17.2 SEMs and Path Analysis 261

17.3 Example Data 265

17.4 How to Proceed in R 265

17.5 Reporting Analyses 272

17.6 Recommended Reading 272

18 Discrete Distributions and Spatial Point Patterns 274

18.1 Use Case Examples 274

18.2 Poisson Distribution 274

18.3 Comparing the Variance with the Mean to Measure Spatial Distribution 276

18.4 Spatial Pattern Analyses Based on the K-function 279

18.5 Binomial Distribution 280

18.6 Example Data 283

18.7 How to Proceed in R 283

18.8 Reporting Analyses 289

18.9 Recommended Reading 289

19 Survival Analysis 290

19.1 Use Case Examples 290

19.2 Survival Function and Hazard Rate 291


19.3 Differences in Survival Among Groups 293

19.4 Cox Proportional Hazard Model 293

19.5 Example Data 295

19.6 How to Proceed in R 295

19.7 Reporting Analyses 302

19.8 Recommended Reading 302

20 Classification and Regression Trees 303

20.1 Use Case Examples 303

20.2 Introducing CART 304

20.3 Pruning the Tree and Crossvalidation 306

20.4 Competing and Surrogate Predictors 307

20.5 Example Data 308

20.6 How to Proceed in R 309

20.7 Reporting Analyses 316

20.8 Recommended Reading 316

21 Classification 317

21.1 Use Case Examples 317

21.2 Aims and Properties of Classification 317

21.3 Input Data 319

21.4 Similarity and Distance 319

21.5 Clustering Algorithms 320

21.6 Displaying Results 320

21.7 Divisive Methods 321

21.8 Example Data 322

21.9 How to Proceed in R 322

21.10 Other Software 324

21.11 Reporting Analyses 325

21.12 Recommended Reading 325

22 Ordination 326

22.1 Use Case Examples 327

22.2 Unconstrained Ordination Methods 327

22.3 Constrained Ordination Methods 330

22.4 Discriminant Analysis 331

22.5 Example Data 333

22.6 How to Proceed in R 333

22.7 Alternative Software 340

22.8 Reporting Analyses 341

22.9 Recommended Reading 341

Appendix A: First Steps with R Software 343

A.1 Starting and Ending R, Command Line, Organising Data 343

A.2 Managing Your Data 349


A.3 Data Types in R 351

A.4 Importing Data into R 357

A.5 Simple Graphics 359

A.6 Frameworks for R 360

A.7 Other Introductions to Work with R 362

Index 363


Preface

Modern biology is a quantitative science. A biologist weighs, measures and counts, whether she works with aphid or fish individuals, with plant communities or with nuclear DNA. Every number obtained in this way, however, is affected by random variation. Aphid counts repeatedly obtained from the same plant individual will differ. The counts of aphids obtained from different plants will differ more, even if those plants belong to the same species, and samples coming from plants of different species are likely to differ even more. Similar differences will be found in the nuclear DNA content of plants from the same population, in nitrogen content of soil samples taken from the same or different sites, or in the population densities of copepods across repeated samplings from the same lake. We say that our data contain a random component: the values we obtain are random quantities, with a part of their variation resulting from randomness.

But what actually is this randomness? In posing such a question, we move into the realm of philosophy or to axioms of probability theory. But what is probability? A biologist is usually happy with a pragmatic concept: we consider an event to be random if we do not have a causal explanation for it. Statistics is a research field which provides recipes for how to work with data containing random components, and how to distinguish deterministic patterns from random variation. Popular wisdom says that statistics is a branch of science where precise work is carried out with imprecise numbers. But the term statistics has multiple meanings. The layman sees it as an assorted collection of values (football league statistics of goals and points, statistics of MP voting, statistics of cars passing along a highway, etc.). Statistics is also a research field (often called mathematical statistics) providing tools for obtaining useful information from such datasets. It is


a separate branch of science, to a certain extent representing an application of probability theory. The term statistic (often in singular form) is also used in another sense: a numerical characteristic computed from data. For example, the well-known arithmetic average is a statistic characterising a given data sample.

In scientific thinking, we can distinguish deductive and inductive approaches. The deductive approach leads us from known facts to their consequences. Sherlock Holmes may use the facts that a room is locked, has no windows and is empty to deduce that the room must have been locked from the outside. Mathematics is a typical example of a deductive system: based on axioms, we can use a purely logical (deductive) path to derive further statements, which are always correct if the initial axioms are also correct (unless we made a mistake in the derivation). Using the deductive approach, we proceed in a purely logical manner and do not need any comparison with the situation in real terms.

The inductive approach is different: we try to find general rules based on many observations. If we tread upon 1-cm-thick ice one hundred times and the ice breaks each time, we can conclude that ice of this thickness is unable to carry the weight of a grown person. We conclude this using inductive thinking. We could, however, also employ the deductive approach by using known physical laws, strength measurements of ice and the known weight of a grown person. But usually, when treading on thin ice, we do not know its exact thickness and sometimes the ice breaks and sometimes it does not. Usually we find, only after breaking through it, that the ice was quite thin. Sometimes even thicker ice breaks, but such an event is affected by many circumstances we are not able to quantify (ice structure, care in treading, etc.) and we therefore consider them as random. Using many observations, however, we can estimate the probability of breaking through ice based on its thickness by using the methods of mathematical statistics. Statistics is therefore a tool of inductive thinking in such cases, where the outcome of an experiment (or observation) is affected by random variability.

Thanks to advances in computer technology, statistics is now available to all biologists. Statistical analysis of data is a necessary prerequisite of manuscript acceptance in most biological journals. These days, it is impossible to fully understand most of the research papers in biological journals without understanding the basic principles of statistics. All biologists must plan their observations and experiments, as only correctly collected data can be useful when answering their questions with the aid of statistical methods. To collect your data correctly, you need to have a basic understanding of statistics.

A knowledge of statistics has therefore become essential for successful enquiry in almost all fields of biology. But statistics are also often misused. Some even say that there are three kinds of lies: a non-intentional lie, an intentional lie and statistics. We can 'adorn' bad data by employing a complex statistical method so that the result looks like a substantial contribution to our knowledge (even finding its way into prestigious journals). Another common case of statistical misuse is interpreting statistical ('correlational') dependency as causal. In this way, one can 'prove' almost anything. A knowledge of statistics also allows biologists to differentiate statements which provide new and useful information from those where statistics are used to simply mask a lack of information, or are misused to support incorrect statements.

The way statistics are used in the everyday practice of biology changed substantially with the increased availability of statistical software. Today, everyone can evaluate her/his data on a personal computer; the results are just a few mouse clicks away. While your


computer will (almost) always offer some results, often in the form of a nice-looking graph, this rather convenient process is not without its dangers. There are users who present the results provided to them by statistical programs without ever understanding what was computed. Our book therefore tries not only to teach you how to analyse your data, but also how to understand what the results of statistical processing mean.

What is biostatistics? We do not think that this is a separate research field. In using this term, we simply imply a focus on the application of statistics to biological problems. Alternatively, the term biometry is sometimes used in a similar sense. In our book, we place an emphasis on understanding the principles of the methods presented and the rules of their use, not on the mathematical derivation of the methods. We present individual methods in a way that we believe is convenient for biologists: we first show a few examples of biological problems that can be solved by a given method, and only then do we present its principles and assumptions. In our explanations we assume that the reader has attended an introductory undergraduate mathematical course, including the basics of the theory of probability. Even so, we try to avoid complex mathematical explanations whenever possible.

This book provides only basic information. We recommend that all readers continue a more detailed exploration of those methods of interest to them. The three most recommended textbooks for this are Quinn & Keough (2002), Sokal & Rohlf (2012) and Zar (2010). The first and last of these more closely reflect the mind of the biologist, as their authors have themselves participated in ecological research. In this book, we adopt some ideas from Zar's textbook about the sequence in which to present selected topics. After every chapter, we give page ranges for the three referred textbooks, each containing additional information about the particular methods. Our book is only a slight extension of a one-term course (2 hours of lectures + 2 hours of practicals per week) in Biostatistics, and therefore sufficient detail is lacking on some of the statistical methods useful for biologists. This primarily concerns the use of multivariate statistical analysis, traditionally addressed in separate textbooks and courses.

We assume that our readers will evaluate their data using a personal computer and we illustrate the required steps and the format of results using two different types of software. The program R lacks some of the user-friendliness provided by alternative statistical packages, but offers practically all known statistical methods, including the most modern ones, for free (more details at cran.r-project.org), and so it has become a de facto standard tool, prevailing in published biological research papers. We assume that the reader will have a basic working knowledge of R, including working with its user interface, importing data or exporting results. The knowledge required is, however, summarised in Appendix A of this book, which can be found after the last chapter. The program Statistica represents software for the less demanding user, with a convenient range of menu choices and extensive dialogue boxes, as well as an easily accessible and modifiable graphical presentation of results. Instructions for its use are available to the reader at the textbook's website: www.cambridge.org/biostatistics.

Example data used throughout this book are available at the same website, but also from our own university's web address: www.prf.jcu.cz/biostat-data-eng.xlsx.

Note that in most of our 'use case examples' (and often also in the example data), the actual (or suggested) number of replicates is very low, perhaps too low to provide reasonable support for a real-world study. This is just to make the data easily tractable while we demonstrate the computation of test statistics. For real-world studies, we recommend that the


reader strive to obtain more extensive datasets. If there is no citation for our example dataset, such data are not real.

In each chapter, we also show how the results derived from statistical software can be presented in research papers and also how to describe the particular statistical methods there.

In this book, we will most frequently refer to the following three statistical textbooks providing more details about the methods:

J. H. Zar (2010) Biostatistical Analysis, 5th edn. Pearson, San Francisco, CA.
G. P. Quinn & M. J. Keough (2002) Experimental Design and Data Analysis for Biologists. Cambridge University Press, Cambridge.
R. R. Sokal & F. J. Rohlf (2012) Biometry, 4th edn. W. H. Freeman, San Francisco, CA.

Other useful textbooks include:

R. H. Green (1979) Sampling Design and Statistical Methods for Environmental Biologists. Wiley, New York.
R. H. G. Jongman, C. J. F. ter Braak & O. F. R. van Tongeren (1995) Data Analysis in Community and Landscape Ecology. Cambridge University Press, Cambridge.
P. Šmilauer & J. Lepš (2014) Multivariate Analysis of Ecological Data Using Canoco 5, 2nd edn. Cambridge University Press, Cambridge.

More advanced readers will find the following textbook useful:

R. Mead (1990) The Design of Experiments. Statistical Principles for Practical Application. Cambridge University Press, Cambridge.

Where appropriate, we cite additional books and papers at the end of the corresponding chapter.


Acknowledgements

Both authors are thankful to their wives Olina and Majka for their ceaseless support and understanding. Our particular thanks go to Petr's wife Majka (Marie Šmilauerová), who created all the drawings which start and enliven each chapter.

We are grateful to Conor Redmond for his careful and efficient work at improving our English grammar and style.

The feedback of our students was of great help when writing this book, particularly the in-depth review from a student point of view provided by Václava Hazuková. We appreciate the revision of Section 2.7, kindly provided by Cajo ter Braak.


1 Basic Statistical Terms, Sample Statistics

1.1 Cases, Variables and Data Types

In our research, we observe a set of objects (cases) of interest and record some information for each of them. We call all of this collected information the data. If plants are our cases, for example, then the data might contain information about flower colour, number of leaves, height of the plant stem or plant biomass. Each characteristic that is measured or estimated for our cases is called a variable. We can distinguish several data types, each differing in their properties and consequently in the way we handle the corresponding variables during statistical analysis.

Data on a ratio scale, such as plant height, number of leaves, animal weight, etc., are usually quantitative (numerical) data, representing some measurable amount: mass, length, energy. Such data have a constant distance between any adjacent unit values (e.g. the difference between lengths of 5 and 6 cm is the same as between 8 and 9 cm) and a naturally defined zero value. We can also think about such data as ratios, e.g. a length of 8 cm is twice the length of 4 cm. Usually, these data are non-negative (i.e. their value is either zero or positive).

Data on an interval scale, such as temperature readings in degrees Celsius, are again quantitative data with a constant distance (interval) between adjacent unit values, but there is no naturally defined zero. When we compare e.g. the temperature scales of Celsius and Fahrenheit, both have a zero value at different temperatures, which are defined rather


arbitrarily. For such scales it makes no sense to consider ratios of their values: we cannot say that 8 °C is twice as high a temperature as 4 °C. These scales usually cover negative, zero, as well as positive values. On the contrary, temperature values in Kelvin (K) can be considered a variable on a ratio scale.

A special case of data on an interval scale are circular scale data: time of day, days in a year, compass bearing azimuth, used often in field ecology to describe the exposition of a slope. The maximum value for such scales is usually identical with (or adjacent to) the minimum value (e.g. 0° and 360°). Data on a circular scale must be treated in a specific way and thus there is a special research area developing the appropriate statistical methods to do so (so-called circular statistics).

Data on an ordinal scale can be exemplified by the state of health of some individuals: excellent health, lightly ill, heavily ill, dead. A typical property of such data is that there is no constant distance between adjacent values, as this distance cannot be quantified. But we can order the individual values, i.e. comparatively relate any two distinct values (greater than, equal to, less than). In biological research, data on an ordinal scale are employed when the use of quantitative data is generally not possible or meaningful, e.g. when measuring the strength of a reaction in ethological studies. Measurements on an ordinal scale are also often used as a surrogate when the ideal approach to measuring a characteristic (i.e. in a quantitative manner, using a ratio or interval scale) is simply too laborious. This happens e.g. when recording the degree of herbivory damage on a leaf as none, low, medium, high. In this case it would of course be possible to attain a more quantitative description by scanning the leaves and calculating the proportion of area lost, but this might be too time-demanding.

Data on a nominal scale (also called categorical or categorial variables, or factors). To give some examples, a nominal variable can describe colour, species identity, location, identity of experimental block or bedrock type. Such data define membership of a particular case in a class, i.e. a qualitative characteristic of the object. For this scale, there are no constant (or even quantifiable) differences among categories, neither can we order the cases based on such a variable. Categorical data with just two possible values (very often yes and no) are often called binary data. Most often they represent the presence or absence of a character (leaves glabrous or hairy, males or females, organism is alive or dead, etc.).

Ordinal as well as categorical variables are often coded in statistical software as natural numbers. For example, if we are sampling in multiple locations, we would naturally code the first location as 1, the second as 2, the third as 3, etc. The software might not know that these values represent categorical data (if we do not tell it in some way) and be willing to compute e.g. an arithmetic average of the location identity, quite a nonsensical value. So beware, some operations can only be done with particular types of data.
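A minimal R sketch of this warning (our illustration, not a code listing from the book; the location codes are made-up values):

# Location identity coded as numbers, as described above
location <- c(1, 2, 2, 3, 1, 3)
mean(location)                 # returns 2, a nonsensical 'average location'
location <- factor(location)   # tell R the variable is categorical
summary(location)              # now we get counts per location instead of an average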

Quantitative data (on an interval or a ratio scale) can be further distinguished into discrete vs. continuous data. For continuous data (such as weights), between any two measurement values there may typically lie another. In contrast we have discrete data, which are most often (but not always) counts (e.g. number of leaves per plant), that is non-negative integer numbers. In biological research, the distinction between discrete and continuous data is often blurred. For example, the counts of algal cells per 1 ml of water can be considered as a continuous variable (usually the measurement precision is less than 1 cell). In contrast, when we estimate tree height in the field using a hypsometer (an optical instrument for measuring tree height quickly), measurement precision is usually 0.5 m (modern devices using lasers may be more precise), despite the fact that tree height is a continuous variable. So even when


the measured variable is continuous, the obtained values have a discrete nature. But this is an artefact of our measurement method, not a property of the measured characteristic: although the recorded values of tree height will be repeated across the dataset, the probability of finding two trees in a forest with identical height is close to zero.

1.2 Population and Random Sample

Our research usually refers to a large (potentially even infinitely large) group of cases, the statistical population (or statistical universe), but our conclusions are based on a smaller group of cases, representing collected observations. This smaller group of observations is called the random sample, or often simply the sample. Even when we do not use the word random, we assume randomness in the choice of cases included in our sample. The term (statistical) population is often not related to what a biologist calls a population. In statistics this word has a more general meaning. The process of obtaining the sample is called sampling.

To obtain a random sample (as is generally assumed by statistical methods), we must follow certain rules during case selection: each member (e.g. an individual) in the statistical population must have the same and independent chance of being selected. The randomness of our choice should be assured by using random numbers. In the simplest (but often not workable) approach, we would label all cases in the sampled population with numbers from 1 to N. We then obtain the required sample of size n by choosing n random whole numbers from the interval (1, N) in such a way that each number in that interval has the same chance of being selected, and we reject the random numbers suggested by the software where the same choice is repeated. We then proceed by measuring the cases labelled with the selected n numbers.
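In R, this selection step might look as follows (a minimal sketch under assumed values of N and n, not the book's own code):

N <- 500                                           # hypothetical size of the statistical population
n <- 20                                            # required sample size
chosen <- sample(1:N, size = n, replace = FALSE)   # n distinct random labels from 1 to N
sort(chosen)                                       # the cases we will go and measure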

In field studies estimating e.g. the aboveground biomass in an area, we would proceed by selecting several sample plots in the area in which the biomass is being collected. Those plots are chosen by defining a system of rectangular coordinates for the whole area and then generating random coordinates for the centres of individual plots. Here we assume that the sampled area has a rectangular shape¹ and is large enough so that we can ignore the possibility that the sample plots will overlap.

¹ But if not, we can still use a rectangular envelope enclosing the more complex area and simply reject the random coordinates falling outside the actual area.
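The idea of generating random plot centres can be sketched in R as follows (our illustration with made-up area dimensions; for a non-rectangular area, coordinates falling outside the area would simply be rejected, as noted in the footnote):

n.plots <- 10
x <- runif(n.plots, min = 0, max = 200)    # metres along the first coordinate axis
y <- runif(n.plots, min = 0, max = 150)    # metres along the second coordinate axis
cbind(x, y)                                # random coordinates of the plot centres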

It is much more difficult to select e.g. the individuals from a population of freely living organisms, because it is not possible to number all existing individuals. For this, we typically sample in a way that is assumed to be close to random sampling, and subsequently work with the sample as if it were random, while often not appreciating the possible dangers of our results being affected by sampling bias. To give an example, we might want to study a dormouse population in a forest. We could sample them using traps without knowing the size of the sampled population. We can consider the individuals caught in traps as a random sample, but this is likely not a correct expectation. Older, more experienced individuals are probably better at avoiding traps and therefore will be less represented in our sample. To adequately account for the possible consequences of this bias, and/or to develop a better sampling strategy, we need to know a lot about the life history of the dormouse.

But even sampling sedentary organisms is not easy. Numbering all plant individuals in an area of five acres and then selecting a truly random sample, while certainly possible in principle, is often unmanageable in practical terms. We therefore require a sampling method


suitable for the target objects and their spatial distribution. It is important to note that a frequently used sampling strategy in which we choose a random location in the study area (by generating point coordinates using random values) and then select an individual closest to this point is not truly random sampling. This is because solitary individuals have a higher chance of being sampled than those growing in a group. If individuals growing in groups are smaller (as is often the case due to competition), our estimates of plant characteristics based on this sampling procedure will be biased.

Stratified sampling represents a specific group of sampling strategies. In this approach, the statistical population is first split into multiple, more homogeneous subsets and then each subset is randomly sampled. For example, in a morphometric study of a spider species we can randomly sample males and females to achieve a balanced representation of both sexes. To take another example, in a study examining the effects of an invasive plant species on the richness of native communities, we can randomly sample within different climatic regions.

Subjectively choosing individuals, either considered typical for the subject or seemingly randomly chosen (e.g. following a line across a sampling location and occasionally picking an individual), is not random sampling and therefore is not recommended to define a dataset for subsequent statistical analysis.

The sampled population can sometimes be defined solely in a hypothetical manner. For example, in a glasshouse experiment with 10 individuals of meadow sweetgrass (Poa pratensis), the reference population is a potential set of all possible individuals of this species, grown under comparable conditions, in the same season, etc.

1.3 Sample Statistics

Let us assume we want to describe the height for a set of 50 pine (Pinus sp.) trees. Fifty values of their height would represent a complete, albeit somewhat complex, view of the trees. We therefore need to simplify (summarise) this information, but with a minimal loss of detail. This type of summarisation can be achieved in two general ways: we can transform our numerical data into a graphical form (visualise them) or we can describe the set of values with a few descriptive statistics that summarise the most important properties of the whole dataset.

Among the choice of graphical summaries we have at our disposal, one of the most often used is the frequency histogram (see Fig. 1.2 later). We can construct a frequency histogram for a particular numerical variable by dividing the range of values into several classes (sub-ranges) of the same width and plotting (as the vertical height of each bar) the count of cases in each class. Sometimes we might want to plot the relative frequencies of cases rather than simple counts, e.g. as the percentage of the total number of cases in the whole sample (the histogram's shape or the information it portrays does not change, only the scale used on the vertical axis). When we have a sufficient number of cases and sufficiently narrow classes (intervals), the shape of the histogram approaches a characteristic of the variable's distribution called probability density (see Section 1.6 and Fig. 1.2 later). Further information about graphical summaries is provided in a separate section on graphical data summaries (Section 1.5).
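A histogram of this kind can be produced in R with the hist function (a minimal sketch using simulated pine heights, not the book's example data):

heights <- rnorm(50, mean = 990, sd = 110)   # 50 simulated pine heights in cm
hist(heights, xlab = "Height [cm]", main = "Pine height")
hist(heights, freq = FALSE, xlab = "Height [cm]",
     main = "Pine height")                   # freq = FALSE rescales the bars to a probability density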

Alternatively, we can summarise our data using descriptive statistics. Using our pine heights example, we are interested primarily in two aspects of our dataset: what is the typical ('mean') height of the trees and how much do the individual heights in our sample


differ. The first aspect is quantified using the characteristics of position (also called central tendency), the second by the characteristics of variability. The characteristics of a finite set of values (of a random sample or a finite statistical population) can be determined precisely. In contrast, the characteristics of an infinitely large statistical population (or of a population for which we have not measured all the cases) must be estimated using a random sample. As a formal rule, the characteristics of a statistical population are labelled by Greek letters, while we label the characteristics of a random sample using standard (Latin) letters. The counts of cases represent an exception: N is the number of cases in a statistical population, while n is the number of cases (size) of a random sample.

1.3.1 Characteristics of Position

Example questions: What is the height of pine trees in a particular valley? What is the pH of water in the brooks of a particular region? For trees, we can either measure all of them or be happy with a random sample. For water pH, we must rely on a random sample, measuring its values at certain places within certain parts of the season.

Both examples demonstrate how important it is to have a well-defined statistical population (universe). In the case of our pine trees, we would probably be interested in mature individuals, because mixing the height of mature individuals with that of seedlings and saplings will not provide useful information. This means that in practice, we will need an operational definition of a 'mature individual' (e.g. at least 20 years old, as estimated by coring at a specific height).

Similarly, for water pH measurements, we would need to specify the type of streams we are interested in (and then, probably using a geographic information system (GIS), we select the sampling sites in a way that will correspond to random sampling). Further, because pH varies systematically during each day, and around the year, we will also need to specify some time window when we should perform our measurements. In each case, we need to think carefully about what we consider to be our statistical population with respect to the aims of study. Mixing pH of various water types might blur the information we want to obtain. It might be better to have a narrow time window to avoid circadian variability, but we must consider how informative is, say, the morning pH for the whole ecosystem. It is probably not reasonable to pool samples from various seasons. In any case, all these decisions must be specified when reporting the results. Saying that the average pH of streams in an area is 6.3 without further specification is not very informative, and might be misleading if we used a narrow subset of all possible streams or a narrow time window. Both of these examples also demonstrate the difficulty of obtaining a truly random sample; often we must simply try our best to select cases that will at least resemble a random sample.

Generally, we are interested in the 'mean' value of some characteristic, so we ask what the location of values on the chosen measurement scale is. Such an intuitively understood mean value can be described by multiple characteristics. We will discuss some of these next.

1.3.1.1 Arithmetic Mean (Average)

The arithmetic mean of the statistical population, μ, is

\mu = \frac{\sum_{i=1}^{N} X_i}{N}    (1.1)


while the arithmetic mean of a random sample, \bar{X}, is

\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}    (1.2)

Example calculation: The height of five pine trees (in centimetres, measured with a precision of 10 cm) was 950, 1120, 830, 990, 1060. The arithmetic average is then (950 + 1120 + 830 + 990 + 1060)/5 = 990 cm. The mean is calculated in exactly the same way whether the five individuals represent our entire population (i.e. all individuals which we are interested in, say for example if we planted these five individuals 20 years ago and wish to examine their success) or whether these five individuals form our random sample representing all of the individuals in the study area, this being our statistical population. In the first case, we will denote the mean by μ, and this is an exact value. In the second scenario (much more typical in biological sciences), we will never know the exact value of μ, i.e. the mean height of all the individuals in the area, but we use the sample mean \bar{X} to estimate its value (i.e. \bar{X} is the estimate of μ).
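The same calculation in R (our sketch, not the book's code listing for this chapter):

pines <- c(950, 1120, 830, 990, 1060)   # the five pine heights in cm
mean(pines)                             # 990, the sample estimate of the population mean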

Be aware that the arithmetic mean (or any other characteristics of location) cannot be used for raw data measured on a circular scale. Imagine we are measuring the geographic exposition of tree trunks bearing a particular lichen species. We obtain the following values in degrees (where both 0 and 360 degrees represent north): 5, 10, 355, 350, 15, 145. Applying Eq. (1.2), we obtain an average value of 180, suggesting that the mean orientation is facing south, but actually most trees have a northward orientation. The correct approach to working with circular data is outlined e.g. in Zar (2010, pp. 605–668).

1.3.1.2 Median and Other Quantiles

The median is defined as a value which has an identical number of cases, both above and below this particular value. Or we can say (for an infinitely large set) that the probability of the value for a randomly chosen case being larger than the median (but also smaller than the median) is identical, i.e. equal to 0.5. For theoretical data distributions (see Section 1.6 later in this chapter), the median is the value of a random variable with a corresponding distribution function value equal to 0.5. We can use the median statistic for data on ratio, interval or ordinal scales. There is no generally accepted symbol for the median statistic.

Besides the median, we can also use other quantiles. The most frequently used are the two quartiles: the upper quartile, defined as the value that separates one-quarter of the highest-value cases, and the lower quartile, defined as the value that separates one-quarter of the lowest-value cases. The other quantiles can be defined similarly, and we will return to this topic when describing the properties of distributions.

In our pine heights example (see Section 1.3.1.1), the median value is equal to 990 cm (which is equal to the mean, just by chance). We estimate the median by first sorting the values according to their size. When the sample size (n) is odd, the median is equal to X_{(n+1)/2}, i.e. to the value in the centre of the list of sorted cases. When n is even, the median is estimated as the centre of the interval between the two middle observations, i.e. as (X_{n/2} + X_{n/2+1})/2. For example, if we are dealing with animal weights equal to 50, 52, 60, 63, 70, 94 g, the median estimate is 61.5 g. The median is sometimes calculated in a special way when its location falls among multiple cases with identical values (tied observations), see Zar (2010, p. 26).

As we will see later, the population median value is identical to the value of the arithmetic mean if the data have a symmetrical distribution. The manner in which the arithmetic mean and median differ in asymmetrical distributions (see also Fig. 1.1) is shown


below. In this example we are comparing two groups of organisms which differ in the way they obtain their food, with each group comprising 11 individuals. The amount of food (transformed into grams of organic C per day) obtained by each individual was as follows:

Group 1: 15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 21
Group 2: 5, 5, 6, 6, 7, 8, 9, 15, 35, 80, 120

In the first group, the arithmetic average of consumed C is 17.8 g, while the average for the second group is 26.9 g. The average consumption is therefore higher in the second group. But if we use medians, the value for the first group is 18, but just 8 in the second group. A typical individual (characterised by the fact that half of the individuals consume more and the other half less) consumes much more in the first group.
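Computed in R, the contrast between the two characteristics of position looks like this (our sketch using the values listed above):

group1 <- c(15, 16, 16, 17, 17, 18, 18, 19, 19, 20, 21)
group2 <- c(5, 5, 6, 6, 7, 8, 9, 15, 35, 80, 120)
mean(group1); mean(group2)       # 17.8 g and 26.9 g
median(group1); median(group2)   # 18 g and 8 g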

1.3.1.3 Mode

The mode is defined as the most frequent value. For data with a continuous distribution, this is the variable value corresponding to the local maximum (or local maxima) of the probability density. There might be more than one mode value for a particular variable, as a distribution can also be bimodal (with two mode values) or even polymodal. The mode is defined for all data types. For continuous data it is usually estimated as the centre of the value interval for the highest bar in a frequency histogram. If this is a polymodal distribution, we can use the bars with heights exceeding the height of surrounding bars. It is worth noting that such an estimate depends on our choice of intervals in the frequency histogram. The fact that we can obtain a sample histogram that has multiple modes (given the choice of intervals) is not sufficient evidence of a polymodal distribution for our sampled population values.
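For a continuous variable, one possible way to estimate the mode in R (not described in the book; an alternative to reading the highest histogram bar, and equally dependent on the chosen smoothing) is to locate the peak of a kernel density estimate:

pines <- c(950, 1120, 830, 990, 1060)   # far too few values for real use, but enough to show the call
d <- density(pines)                     # kernel density estimate
d$x[which.max(d$y)]                     # value at which the estimated density peaks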

1.3.1.4 Geometric Mean

The geometric mean is defined as the n-th root of a multiple (the Π operator represents the multiplication) of n values in our sample:

GM = \sqrt[n]{\prod_{i=1}^{n} X_i} = \left( \prod_{i=1}^{n} X_i \right)^{1/n}    (1.3)

The geometric mean of our five pines example will be (950 × 1120 × 830 × 990 × 1060)^{1/5} = 984.9. The geometric mean is generally used for data on a ratio scale which do not contain zeros, and its value is smaller than the arithmetic mean.
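Base R has no built-in geometric mean function, but it is easily computed (our sketch for the five pine heights):

pines <- c(950, 1120, 830, 990, 1060)
prod(pines)^(1 / length(pines))   # 984.9, as in the text
exp(mean(log(pines)))             # the same value; numerically safer for long vectors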

[Figure 1.1 near here: three panels A, B and C, each marking the positions of the mean, median and mode.]

Figure 1.1 Frequency histograms idealised into probability density curves, with marked locations indicating different characteristics of position. Data values are plotted along the horizontal axis and frequency (probability) on the vertical axis. The distribution in plot A is symmetrical, while in plot B it is positively skewed and in plot C it is negatively skewed.


1.3.2 Characteristics of Variability (Spread)

Besides the 'mean value' of the characteristic under observation, we are often interested in the extent of differences among individual values in the sample, i.e. how variable they are. This is addressed by the characteristics of variability.

Example question: How variable is the height of our pine trees?

1.3.2.1 Range

The range is the difference between the largest (maximum) and the smallest (minimum) values in our dataset. In the tree height example the range is 290 cm. Please note that the range of values grows with increasing sample size. Therefore, the range estimated from a random sample is not a good estimate of the range in the sampled statistical population.

1.3.2.2 Variance

The variance and the statistics derived from it are the most often used characteristics of variability. The variance is defined as an average value of the second powers (squares) of the deviations of individual observed values from their arithmetic average. For a statistical population, the variance is defined as follows:

\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}    (1.4)

For a sample, the variance is defined as

s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}    (1.5)

The s² term is sometimes replaced with var or VAR. The variance of a sample is the best (unbiased) estimate of the variance of the sampled population.

Example calculation: For our pine trees, the variance is calculated (if we consider the five trees as the whole population) as ((950 − 990)^2 + (1120 − 990)^2 + (830 − 990)^2 + (990 − 990)^2 + (1060 − 990)^2)/5 = 9800. However, it is more likely that these values would represent a random sample, so the proper estimate of variance is calculated as ((950 − 990)^2 + (1120 − 990)^2 + (830 − 990)^2 + (990 − 990)^2 + (1060 − 990)^2)/4 = 12,250. Comparing Eqs (1.4) and (1.5), we can see that the difference between these two estimates diminishes with increasing n: for five specimens the difference is relatively large, but it is more or less negligible for large n.

The denominator value, i.e. n − 1 and not n, is used for the sample because we do not know the real mean and thus must estimate it. Naturally, the larger our n is, the smaller the difference between the estimate \bar{X} and the (unknown) real value of the mean \mu.
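Both versions of the calculation can be sketched in R as follows (the vector name is an assumption):

heights <- c(950, 1120, 830, 990, 1060)

# sample variance, Eq. (1.5): divides by n - 1
var(heights)                                         # 12250

# 'population' variance, Eq. (1.4): divides by n
sum((heights - mean(heights))^2) / length(heights)   # 9800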

1.3.2.3 Standard Deviation

The standard deviation is the square root of the variance (for both a sample and a population). Besides being denoted by an s, it is often marked as s.d., S.D. or SD. The standard deviation of a statistical population is defined as

\sigma = \sqrt{\sigma^2}    (1.6)

The standard deviation of a sample is defined as

s = \sqrt{s^2}    (1.7)

When we consider the five tree heights as a random sample, s = \sqrt{12,250 cm^2} = 110.68 cm.
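In R, sd() computes the sample standard deviation of Eq. (1.7) directly:

heights <- c(950, 1120, 830, 990, 1060)

sd(heights)          # 110.68
sqrt(var(heights))   # identical result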


1.3.2.4 Coefficient of Variation

In many variables measured on a ratio scale, the standard deviation scales with the mean (the sizes of individuals are a typical example). We can ask whether the height of individuals is more variable in a population of the plant species Impatiens glandulifera (with a typical height of about 2 m) or in a population of Impatiens noli-tangere (with a typical height of about 30 cm). We must therefore relate the variation to the average height of both groups. In this and other similar cases, we characterise variability by the coefficient of variation (CV, sometimes also CoV), which is the standard deviation estimate divided by the arithmetic mean:

CV = \frac{s}{\bar{X}}    (1.8)

The coefficient of variation is meaningful for data on a ratio scale. It is used when we want to compare the variability of two or more groups of objects differing in their mean values. In contrast, it is not possible to use this coefficient for data on an interval scale, such as comparing the variation in temperature among groups differing in their average temperature. There is no natural zero value and hence the coefficient of variation gives different results depending on the chosen temperature scale (e.g. degrees Celsius vs. degrees Fahrenheit). Similarly, it does not make sense to use the CV for log-transformed data (including pH). In many cases the standard deviation of log-transformed data provides information similar to the CV.
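A short R sketch of Eq. (1.8) for the pine example (vector name assumed; the CV is often reported as a percentage):

heights <- c(950, 1120, 830, 990, 1060)

sd(heights) / mean(heights)          # ~0.112
100 * sd(heights) / mean(heights)    # ~11.2 %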

1.3.2.5 Interquartile Range

The interquartile range, calculated as the difference between the upper and lower quartiles, is also a measure of variation. It is a better characteristic of variation than the range, as it is not systematically related to the size of our sample. The interquartile range as a measure of variation (spread) is a natural counterpart to the median as a measure of position (location).
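Base R provides both the quartiles and their difference (again using the assumed heights vector):

heights <- c(950, 1120, 830, 990, 1060)

quantile(heights, c(0.25, 0.75))   # lower and upper quartiles
IQR(heights)                       # their difference, the interquartile range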

1.4 Precision of Mean Estimate, Standard Error of Mean

The sample arithmetic mean is also a random variable (while the arithmetic mean of a

statistical population is not). So this estimate also has its own variation: if we sample a

statistical population repeatedly, the means calculated from individual samples will differ.

Their variation can be estimated using the variance of the statistical population (or of its

estimate, as the true value is usually not available). The variance of the arithmetic average is

s_{\bar{X}}^2 = \frac{s_X^2}{n}    (1.9)

The square root of this variance is the standard deviation of the mean's estimate and is typically called the standard error of the mean. It is often labelled as s_{\bar{X}}, SEM or s.e.m., and is the most commonly employed characteristic of precision for an estimate of the arithmetic mean. Another often-used statistic is the confidence interval, calculated from the standard error and discussed later in Chapter 5. Based on Eq. (1.9), we can obtain a formula for directly computing the standard error of the mean:

s_{\bar{X}} = \frac{s_X}{\sqrt{n}}    (1.10)
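Base R has no dedicated function for the standard error of the mean, but Eq. (1.10) is a one-liner; a sketch with the assumed heights vector:

heights <- c(950, 1120, 830, 990, 1060)

# standard error of the mean, Eq. (1.10)
sd(heights) / sqrt(length(heights))   # ~49.5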


Do not confuse the standard deviation and the standard error of the mean: the standard deviation describes the variation in sampled data and its estimate is not systematically dependent on the sample size; the standard error of the mean characterises the precision of our estimate and its value decreases with increasing sample size: the larger the sample, the greater the precision of the mean's estimate.

1.5 Graphical Summary of Individual Variables

Most research papers present the characteristics under investigation using the arithmetic mean

and standard deviation, and/or the standard error of the mean estimate. In this way, however,

we lose a great deal of information about our data, e.g. about their distribution. In general,

a properly chosen graph summarising our data can provide much more information than just

one or a couple of numerical statistics.

To summarise the shape of our data distribution, it is easiest to plot a frequency histogram (see Figs 1.2 and 1.3 below). Another type of graph summarising a variable's distribution is the box-and-whisker plot (see Fig. 1.4 explaining the individual components of this plot type and Fig. 1.5 providing an example of its use). Some statistical software packages (this does not concern R) use the box-and-whisker plot by default to present an arithmetic mean and standard deviation. Such an approach is suitable only if we can assume that the statistical population for the visualised variable's values has a normal (Gaussian) distribution (see Chapter 4). Generally, however, it is more informative to plot such a graph based on the median and quartiles, as this clearly shows any existing peculiarities of the data distribution and possibly also identifies unusual values included in our sample.
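A minimal sketch of both plot types in base R (the data here are simulated for illustration, not taken from the book's example datasets):

set.seed(1)
x <- rlnorm(200, meanlog = 5, sdlog = 0.3)   # right-skewed example data

hist(x, main = "Frequency histogram", xlab = "Value")

# box-and-whisker plot based on the median and quartiles
boxplot(x, ylab = "Value")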

1.6 Random Variables, Distribution, Distribution Function,

Density Distribution

All the equations provided so far can be used only for datasets and samples of finite size. As an example, to calculate the mean for a set of values, we must measure all cases in that set and this is possible only for a set of finite size. Imagine now, however, that our sampled statistical population is infinite, or that we are observing some random process which can be repeated any number of times and which results in producing a particular value, a particular random entity. For example, when studying the distribution of plant seeds, we can release each seed using a tube at a particular height above the soil surface and subsequently measure its speed at the end of the tube.[2] Such a measurement process can be repeated an infinite number of times.[3] The measured speed can be considered a random variable and the measured values are the realisations of that random variable. Observed values of a random variable are actually a random sample from a potentially infinite set of values, in this case all possible speeds of the seeds. This is true for almost all variables we measure in our research, whether in the field or in the lab.

[2] So-called terminal velocity, considered to be a good characteristic of a seed's ability to disperse in the wind.

[3] In practice this is not so simple. When we aim to characterise the dispersal ability of a plant species we should vary the identity of the seeds, with the tested seeds being a random sample from all the seeds of the given species.

