R in Action, Second Edition
Data analysis and graphics with R
Robert I. Kabacoff
Manning

Praise for the First Edition

"Lucid and engaging—this is without doubt the fun way to learn R!" —Amos A. Folarin, University College London
"Be prepared to quickly raise the bar with the sheer quality that R can produce." —Patrick Breen, Rogers Communications Inc.
"An excellent introduction and reference on R from the author of the best R website." —Christopher Williams, University of Idaho
"Thorough and readable. A great R companion for the student or researcher." —Samuel McQuillin, University of South Carolina
"Finally, a comprehensive introduction to R for programmers." —Philipp K. Janert, author of Gnuplot in Action
"Essential reading for anybody moving to R for the first time." —Charles Malpas, University of Melbourne
"One of the quickest routes to R proficiency. You can buy the book on Friday and have a working program by Monday." —Elizabeth Ostrowski, Baylor College of Medicine
"One usually buys a book to solve the problems they know they have. This book solves problems you didn't know you had." —Carles Fenollosa, Barcelona Supercomputing Center
"Clear, precise, and comes with a lot of explanations and examples… the book can be used by beginners and professionals alike, and even for teaching R!" —Atef Ouni, Tunisian National Institute of Statistics
"A great balance of targeted tutorials and in-depth examples." —Landon Cox, 360VL Inc.

R in Action, Second Edition: Data analysis and graphics with R
Robert I. Kabacoff
Manning, Shelter Island

For online information and ordering of this and other Manning books, please visit www.manning.com. The publisher offers discounts on this book when ordered in quantity. For more information, please contact Special Sales Department, Manning Publications Co., 20 Baldwin Road, PO Box 761, Shelter Island, NY 11964. Email: orders@manning.com

©2015 by Manning Publications Co. All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by means electronic, mechanical, photocopying, or otherwise, without prior written permission of the publisher.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in the book, and Manning Publications was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Recognizing the importance of preserving what has been written, it is Manning's policy to have the books we publish printed on acid-free paper, and we exert our best efforts to that end. Recognizing also our responsibility to conserve the resources of our planet, Manning books are printed on paper that is at least 15 percent recycled and processed without elemental chlorine.

Development editor: Jennifer Stout
Copyeditor: Tiffany Taylor
Proofreader: Toma Mulligan
Typesetter: Marija Tudor
Cover designer: Marija Tudor

ISBN: 9781617291388
Printed in the United States of America

Brief contents

Part 1  Getting started
  1  Introduction to R
  2  Creating a dataset
  3  Getting started with graphs
  4  Basic data management
  5  Advanced data management

Part 2  Basic methods
  6  Basic graphs
  7  Basic statistics

Part 3  Intermediate methods
  8  Regression
  9  Analysis of variance
  10  Power analysis
  11  Intermediate graphs
  12  Resampling statistics and bootstrapping

Part 4  Advanced methods
  13  Generalized linear models
  14  Principal components and factor analysis
  15  Time series
  16  Cluster analysis
  17  Classification
  18  Advanced methods for missing data

Part 5  Expanding your skills
  19  Advanced graphics with ggplot2
  20  Advanced programming
  21  Creating a package
  22  Creating dynamic reports
  23  Advanced graphics with the lattice package (online only)

Contents

preface
acknowledgments
about this book
about the cover illustration

Part 1  Getting started

1  Introduction to R
   1.1  Why use R?
   1.2  Obtaining and installing R
   1.3  Working with R
   1.4  Packages
   1.5  Batch processing
   1.6  Using output as input: reusing results
   1.7  Working with large datasets
   1.8  Working through an example
   1.9  Summary

2  Creating a dataset
   2.1  Understanding datasets
   2.2  Data structures
   2.3  Data input
   2.4  Annotating datasets
   2.5  Useful functions for working with data objects
   2.6  Summary

3  Getting started with graphs
   3.1  Working with graphs
   3.2  A simple example
   3.3  Graphical parameters
   3.4  Adding text, customized axes, and legends
   3.5  Combining graphs
   3.6  Summary

4  Basic data management
   4.1  A working example
   4.2  Creating new variables
   4.3  Recoding variables
   4.4  Renaming variables
   4.5  Missing values
   4.6  Date values
   4.7  Type conversions
   4.8  Sorting data
   4.9  Merging datasets
   4.10  Subsetting datasets
   4.11  Using SQL statements to manipulate data frames
   4.12  Summary

5  Advanced data management
   5.1  A data-management challenge
   5.2  Numerical and character functions
   5.3  A solution for the data-management challenge
   5.4  Control flow
   5.5  User-written functions
   5.6  Aggregation and reshaping
   5.7  Summary

Part 2  Basic methods

6  Basic graphs
   6.1  Bar plots
   6.2  Pie charts
   6.3  Histograms
   6.4  Kernel density plots
   6.5  Box plots
   6.6  Dot plots
   6.7  Summary

7  Basic statistics
   7.1  Descriptive statistics
   7.2  Frequency and contingency tables
   7.3  Correlations
   7.4  T-tests
   7.5  Nonparametric tests of group differences
   7.6  Visualizing group differences
   7.7  Summary

Part 3  Intermediate methods

8  Regression
   8.1  The many faces of regression
   8.2  OLS regression
   8.3  Regression diagnostics
   8.4  Unusual observations
   8.5  Corrective measures
   8.6  Selecting the "best" regression model
   8.7  Taking the analysis further
   8.8  Summary

9  Analysis of variance
   9.1  A crash course on terminology
   9.2  Fitting ANOVA models
   9.3  One-way ANOVA
   9.4  One-way ANCOVA
   9.5  Two-way factorial ANOVA
   9.6  Repeated measures ANOVA
   9.7  Multivariate analysis of variance (MANOVA)
   9.8  ANOVA as regression
   9.9  Summary

10  Power analysis
   10.1  A quick review of hypothesis testing
   10.2  Implementing power analysis with the pwr package
   10.3  Creating power analysis plots
   10.4  Other packages
   10.5  Summary

11  Intermediate graphs
   11.1  Scatter plots
   11.2  Line charts
   11.3  Corrgrams
   11.4  Mosaic plots
   11.5  Summary

12  Resampling statistics and bootstrapping
   12.1  Permutation tests
   12.2  Permutation tests with the coin package
   12.3  Permutation tests with the lmPerm package
   12.4  Additional comments on permutation tests
   12.5  Bootstrapping
   12.6  Bootstrapping with the boot package
   12.7  Summary

Part 4  Advanced methods

13  Generalized linear models
   13.1  Generalized linear models and the glm() function
   13.2  Logistic regression
   13.3  Poisson regression
   13.4  Summary

14  Principal components and factor analysis
   14.1  Principal components and factor analysis in R
   14.2  Principal components
   14.3  Exploratory factor analysis
   14.4  Other latent variable models
   14.5  Summary

15  Time series
   15.1  Creating a time-series object in R
   15.2  Smoothing and seasonal decomposition
   15.3  Exponential forecasting models
   15.4  ARIMA forecasting models
   15.5  Going further
   15.6  Summary

16  Cluster analysis
   16.1  Common steps in cluster analysis
   16.2  Calculating distances
   16.3  Hierarchical cluster analysis
   16.4  Partitioning cluster analysis
   16.5  Avoiding nonexistent clusters
   16.6  Summary

17  Classification
   17.1  Preparing the data
   17.2  Logistic regression
   17.3  Decision trees
   17.4  Random forests
   17.5  Support vector machines
   17.6  Choosing a best predictive solution
   17.7  Using the rattle package for data mining
   17.8  Summary

18  Advanced methods for missing data
   18.1  Steps in dealing with missing data
   18.2  Identifying missing values
   18.3  Exploring missing-values patterns
   18.4  Understanding the sources and impact of missing data
   18.5  Rational approaches for dealing with incomplete data
   18.6  Complete-case analysis (listwise deletion)
   18.7  Multiple imputation
   18.8  Other approaches to missing data
   18.9  Summary

Part 5  Expanding your skills

19  Advanced graphics with ggplot2
   19.1  The four graphics systems in R
   19.2  An introduction to the ggplot2 package
   19.3  Specifying the plot type with geoms
   19.4  Grouping
   19.5  Faceting
   19.6  Adding smoothed lines
   19.7  Modifying the appearance of ggplot2 graphs
   19.8  Saving graphs
   19.9  Summary

20  Advanced programming
   20.1  A review of the language
   20.2  Working with environments
   20.3  Object-oriented programming
   20.4  Writing efficient code
   20.5  Debugging
   20.6  Going further
   20.7  Summary

21  Creating a package
   21.1  Nonparametric analysis and the npar package
   21.2  Developing the package
   21.3  Creating the package documentation
   21.4  Building the package
   21.5  Going further
   21.6  Summary

22  Creating dynamic reports
   22.1  A template approach to reports
   22.2  Creating dynamic reports with R and Markdown
   22.3  Creating dynamic reports with R and LaTeX
   22.4  Creating dynamic reports with R and Open Document
   22.5  Creating dynamic reports with R and Microsoft Word
   22.6  Summary

afterword  Into the rabbit hole
appendix A  Graphical user interfaces
appendix B  Customizing the startup environment
appendix C  Exporting data from R
appendix D  Matrix algebra in R
appendix E  Packages used in this book
appendix F  Working with large datasets
appendix G  Updating an R installation
references
index
bonus chapter 23  Advanced graphics with the lattice package (available online at manning.com/RinActionSecondEdition and in the eBook)

Preface

What is the use of a book, without pictures or conversations?
—Alice, Alice's Adventures in Wonderland

It's wondrous, with treasures to satiate desires both subtle and gross; but it's not for the timid.
—Q, "Q Who?" Star Trek: The Next Generation

When I began writing this book, I spent quite a bit of time searching for a good quote to start things off. I ended up with two. R is a wonderfully flexible platform and language for exploring, visualizing, and understanding data. I chose the quote from Alice's Adventures in Wonderland to capture the flavor of statistical analysis today—an interactive process of exploration, visualization, and interpretation.

The second quote reflects the generally held notion that R is difficult to learn. What I hope to show you is that it doesn't have to be. R is broad and powerful, with so many analytic and graphic functions available (more than 50,000 at last count) that it easily intimidates novice and experienced users alike. But there is rhyme and reason to the apparent madness.
With guidelines and instructions, you can navigate the tremendous resources available, selecting the tools you need to accomplish your work with style, elegance, efficiency—and more than a little coolness. I first encountered R several years ago, when applying for a new statistical consulting position. The prospective employer asked in the pre-interview material if I was conversant in R. Following the standard advice of recruiters, I immediately said yes, xvii www.it-ebooks.info xviii PREFACE and set off to learn it. I was an experienced statistician and researcher, had 25 years experience as an SAS and SPSS programmer, and was fluent in a half dozen programming languages. How hard could it be? Famous last words. As I tried to learn the language (as fast as possible, with an interview looming), I found either tomes on the underlying structure of the language or dense treatises on specific advanced statistical methods, written by and for subject-matter experts. The online help was written in a spartan style that was more reference than tutorial. Every time I thought I had a handle on the overall organization and capabilities of R, I found something new that made me feel ignorant and small. To make sense of it all, I approached R as a data scientist. I thought about what it takes to successfully process, analyze, and understand data, including ■ ■ ■ ■ ■ ■ ■ Accessing the data (getting the data into the application from multiple sources) Cleaning the data (coding missing data, fixing or deleting miscoded data, transforming variables into more useful formats) Annotating the data (in order to remember what each piece represents) Summarizing the data (getting descriptive statistics to help characterize the data) Visualizing the data (because a picture really is worth a thousand words) Modeling the data (uncovering relationships and testing hypotheses) Preparing the results (creating publication-quality tables and graphs) Then I tried to understand how I could use R to accomplish each of these tasks. Because I learn best by teaching, I eventually created a website (www.statmethods.net) to document what I had learned. Then, about a year later, Marjan Bace, Manning’s publisher, called and asked if I would like to write a book on R. I had already written 50 journal articles, 4 technical manuals, numerous book chapters, and a book on research methodology, so how hard could it be? At the risk of sounding repetitive—famous last words. A year after the first edition came out in 2011, I started working on the second edition. The R platform is evolving, and I wanted to describe these new developments. I also wanted to expand the coverage of predictive analytics and data mining—important topics in the world of big data. Finally, I wanted to add chapters on advanced data visualization, software development, and dynamic report writing. The book you’re holding is the one that I wished I had so many years ago. I have tried to provide you with a guide to R that will allow you to quickly access the power of this great open source endeavor, without all the frustration and angst. I hope you enjoy it. P.S. I was offered the job but didn’t take it. But learning R has taken my career in directions that I could never have anticipated. Life can be funny. www.it-ebooks.info acknowledgments A number of people worked hard to make this a better book. They include ■ ■ ■ ■ ■ ■ ■ Marjan Bace, Manning’s publisher, who asked me to write this book in the first place. 
Sebastian Stirling and Jennifer Stout, development editors on the first and second editions, respectively. Each spent many hours helping me organize the material, clarify concepts, and generally make the text more interesting. Pablo Domínguez Vaselli, technical proofreader, who helped uncover areas of confusion and provided an independent and expert eye for testing code. I came to rely on his vast knowledge, careful reviews, and considered judgment. Olivia Booth, the review editor, who helped obtain reviewers and coordinate the review process. Mary Piergies, who helped shepherd this book through the production process, and her team of Tiffany Taylor, Toma Mulligan, Janet Vail, David Novak, and Marija Tudor. The peer reviewers who spent hours of their own time carefully reading through the material, finding typos, and making valuable substantive suggestions: Bryce Darling, Christian Theil Have, Cris Weber, Deepak Vohra, Dwight Barry, George Gaines, Indrajit Sen Gupta, Dr. L. Duleep Kumar Samuel, Mahesh Srinivason, Marc Paradis, Peter Rabinovitch, Ravishankar Rajagopalan, Samuel Dale McQuillin, and Zekai Otles. The many Manning Early Access Program (MEAP) participants who bought the book before it was finished, asked great questions, pointed out errors, and made helpful suggestions. xix www.it-ebooks.info xx ACKNOWLEDGMENTS Each contributor has made this a better and more comprehensive book. I would also like to acknowledge the many software authors who have contributed to making R such a powerful data-analytic platform. They include not only the core developers, but also the selfless individuals who have created and maintain contributed packages, extending R’s capabilities greatly. Appendix E provides a list of the authors of contributed packages described in this book. In particular, I would like to mention John Fox, Hadley Wickham, Frank E. Harrell, Jr., Deepayan Sarkar, and William Revelle, whose works I greatly admire. I have tried to represent their contributions accurately, and I remain solely responsible for any errors or distortions inadvertently included in this book. I really should have started this book by thanking my wife and partner, Carol Lynn. Although she has no intrinsic interest in statistics or programming, she read each chapter multiple times and made countless corrections and suggestions. No greater love has any person than to read multivariate statistics for another. Just as important, she suffered the long nights and weekends that I spent writing this book, with grace, support, and affection. There is no logical explanation why I should be this lucky. There are two other people I would like to thank. One is my father, whose love of science was inspiring and who gave me an appreciation of the value of data. I miss him dearly. The other is Gary K. Burger, my mentor in graduate school. Gary got me interested in a career in statistics and teaching when I thought I wanted to be a clinician. This is all his fault. www.it-ebooks.info about this book If you picked up this book, you probably have some data that you need to collect, summarize, transform, explore, model, visualize, or present. If so, then R is for you! R has become the worldwide language for statistics, predictive analytics, and data visualization. It offers the widest range of methodologies for understanding data currently available, from the most basic to the most complex and bleeding edge. As an open source project it’s freely available for a range of platforms, including Windows, Mac OS X, and Linux. 
It’s under constant development, with new procedures added daily. Additionally, R is supported by a large and diverse community of data scientists and programmers who gladly offer their help and advice to users. Although R is probably best known for its ability to create beautiful and sophisticated graphs, it can handle just about any statistical problem. The base installation provides hundreds of data-management, statistical, and graphical functions out of the box. But some of its most powerful features come from the thousands of extensions (packages) provided by contributing authors. This breadth comes at a price. It can be hard for new users to get a handle on what R is and what it can do. Even the most experienced R user is surprised to learn about features they were unaware of. R in Action, Second Edition provides you with a guided introduction to R, giving you a 2,000-foot view of the platform and its capabilities. It will introduce you to the most important functions in the base installation and more than 90 of the most useful contributed packages. Throughout the book, the goal is practical application—how you can make sense of your data and communicate that understanding to others. When you finish, you should have a good grasp of how R works and what it can do and where xxi www.it-ebooks.info xxii ABOUT THIS BOOK you can go to learn more. You’ll be able to apply a variety of techniques for visualizing data, and you’ll have the skills to tackle both basic and advanced data analytic problems. What’s new in the second edition If you want to delve into the use of R more deeply, the second edition offers more than 200 pages of new material. Concentrated in the second half of the book are new chapters on data mining, predictive analytics, and advanced programming. In particular, chapters 15 (time series), 16 (cluster analysis), 17 (classification), 19 (ggplot2 graphics), 20 (advanced programming), 21 (creating a package), and 22 (creating dynamic reports) are new. In addition, chapter 2 (creating a dataset) has more detailed information on importing data from text and SAS files, and appendix F (working with large datasets) has been expanded to include new tools for working with big data problems. Finally, numerous updates and corrections have been made throughout the text. Who should read this book R in Action, Second Edition should appeal to anyone who deals with data. No background in statistical programming or the R language is assumed. Although the book is accessible to novices, there should be enough new and practical material to satisfy even experienced R mavens. Users without a statistical background who want to use R to manipulate, summarize, and graph data should find chapters 1–6, 11, and 19 easily accessible. Chapters 7 and 10 assume a one-semester course in statistics; and readers of chapters 8, 9, and 12–18 will benefit from two semesters of statistics. Chapters 20–22 offer a deeper dive into the R language and have no statistical prerequisites. I’ve tried to write each chapter in such a way that both beginning and expert data analysts will find something interesting and useful. Roadmap This book is designed to give you a guided tour of the R platform, with a focus on those methods most immediately applicable for manipulating, visualizing, and understanding data. The book has 22 chapters and is divided into 5 parts: “Getting Started,” “Basic Methods,” “Intermediate Methods,” “Advanced Methods,” and “Expanding Your Skills." Additional topics are covered in seven appendices. 
Chapter 1 begins with an introduction to R and the features that make it so useful as a data-analysis platform. The chapter covers how to obtain the program and how to enhance the basic installation with extensions that are available online. The remainder of the chapter is spent exploring the user interface and learning how to run programs interactively and in batch. Chapter 2 covers the many methods available for getting data into R. The first half of the chapter introduces the data structures R uses to hold data, and how to enter www.it-ebooks.info ABOUT THIS BOOK xxiii data from the keyboard. The second half discusses methods for importing data into R from text files, web pages, spreadsheets, statistical packages, and databases. Many users initially approach R because they want to create graphs, so we jump right into that topic in chapter 3. No waiting required. We review methods of creating graphs, modifying them, and saving them in a variety of formats. Chapter 4 covers basic data management, including sorting, merging, and subsetting datasets, and transforming, recoding, and deleting variables. Building on the material in chapter 4, chapter 5 covers the use of functions (mathematical, statistical, character) and control structures (looping, conditional execution) for data management. I then discuss how to write your own R functions and how to aggregate data in various ways. Chapter 6 demonstrates methods for creating common univariate graphs, such as bar plots, pie charts, histograms, density plots, box plots, and dot plots. Each is useful for understanding the distribution of a single variable. Chapter 7 starts by showing how to summarize data, including the use of descriptive statistics and cross-tabulations. We then look at basic methods for understanding relationships between two variables, including correlations, t-tests, chi-square tests, and nonparametric methods. Chapter 8 introduces regression methods for modeling the relationship between a numeric outcome variable and a set of one or more numeric predictor variables. Methods for fitting these models, evaluating their appropriateness, and interpreting their meaning are discussed in detail. Chapter 9 considers the analysis of basic experimental designs through the analysis of variance and its variants. Here we’re usually interested in how treatment combinations or conditions affect a numerical outcome. Methods for assessing the appropriateness of the analyses and visualizing the results are also covered. Chapter 10 provides a detailed treatment of power analysis. Starting with a discussion of hypothesis testing, the chapter focuses on how to determine the sample size necessary to detect a treatment effect of a given size with a given degree of confidence. This can help you to plan experimental and quasi-experimental studies that are likely to yield useful results. Chapter 11 expands on the material in chapter 6, covering the creation of graphs that help you to visualize relationships among two or more variables. These include various types of 2D and 3D scatter plots, scatter-plot matrices, line plots, correlograms, and mosaic plots. Chapter 12 presents analytic methods that work well in cases where data are sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is too complex and mathematically intractable. 
They include both resampling and bootstrapping approaches—computer-intensive methods that are easily implemented in R. Chapter 13 expands on the regression methods in chapter 8 to cover data that are not normally distributed. The chapter starts with a discussion of generalized linear www.it-ebooks.info xxiv ABOUT THIS BOOK models and then focuses on cases where you’re trying to predict an outcome variable that is either categorical (logistic regression) or a count (Poisson regression). One of the challenges of multivariate data problems is simplification. Chapter 14 describes methods of transforming a large number of correlated variables into a smaller set of uncorrelated variables (principal component analysis), as well as methods for uncovering the latent structure underlying a given set of variables (factor analysis). The many steps involved in an appropriate analysis are covered in detail. Chapter 15 describes methods for creating, manipulating, and modeling time series data. It covers visualizing and decomposing time series data, as well as exponential and ARIMA approaches to forecasting future values. Chapter 16 illustrates methods of clustering observations into naturally occurring groups. The chapter begins with a discussion of the common steps in a comprehensive cluster analysis, followed by a presentation of hierarchical clustering and partitioning methods. Several methods for determining the proper number of clusters are presented. Chapter 17 presents popular supervised machine-learning methods for classifying observations into groups. Decision trees, random forests, and support vector machines are considered in turn. You’ll also learn about methods for evaluating the accuracy of each approach. In keeping with my attempt to present practical methods for analyzing data, chapter 18 considers modern approaches to the ubiquitous problem of missing data values. R supports a number of elegant approaches for analyzing datasets that are incomplete for various reasons. Several of the best are described here, along with guidance for which ones to use when, and which ones to avoid. Chapter 19 wraps up the discussion of graphics with a presentation of one of R’s most useful and advanced approaches to visualizing data: ggplot2. The ggplot2 package implements a grammar of graphics that provides a powerful and consistent set of tools for graphing multivariate data. Chapter 20 covers advanced programming techniques. You’ll learn about objectoriented programming techniques and debugging approaches. The chapter also presents a variety of tips for efficient programming. This chapter will be particularly helpful if you’re seeking a greater understanding of how R works, and it’s a prerequisite for chapter 21. Chapter 21 provides a step-by-step guide to creating R packages. This will allow you to create more sophisticated programs, document them efficiently, and share them with others. Finally, chapter 22 offers several methods for creating attractive reports from within R. You’ll learn how to generate web pages, reports, articles, and even books from your R code. The resulting documents can include your code, tables of results, graphs, and commentary. The afterword points you to many of the best internet sites for learning more about R, joining the R community, getting questions answered, and staying current with this rapidly changing product. 
www.it-ebooks.info ABOUT THIS BOOK xxv Last, but not least, the seven appendices (A through G) extend the text’s coverage to include such useful topics as R graphic user interfaces, customizing and upgrading an R installation, exporting data to other applications, using R for matrix algebra (à la MATLAB), and working with very large datasets. We also offer a bonus chapter, which is available online only from the publisher’s website at manning.com/RinActionSecondEdition. Online chapter 23 covers the lattice package, which is introduced in chapter 19. Advice for data miners Data mining is a field of analytics concerned with discovering patterns in large data sets. Many data-mining specialists are turning to R for its cutting-edge analytical capabilities. If you’re a data miner making the transition to R and want to access the language as quickly as possible, I recommend the following reading sequence: chapter 1 (introduction), chapter 2 (data structures and those portions of importing data that are relevant to your setting), chapter 4 (basic data management), chapter 7 (descriptive statistics), chapter 8 (sections 1, 2, and 6; regression), chapter 13 (section 2; logistic regression), chapter 16 (clustering), chapter 17 (classification), and appendix F (working with large datasets). Then review the other chapters as needed. Code examples In order to make this book as broadly applicable as possible, I’ve chosen examples from a range of disciplines, including psychology, sociology, medicine, biology, business, and engineering. None of these examples require a specialized knowledge of that field. The datasets used in these examples were selected because they pose interesting questions and because they’re small. This allows you to focus on the techniques described and quickly understand the processes involved. When you’re learning new methods, smaller is better. The datasets are provided with the base installation of R or available through add-on packages that are available online. The source code for each example is available from www.manning.com/RinAction SecondEdition and at www.github.com/kabacoff/RiA2. To get the most out of this book, I recommend that you try the examples as you read them. Finally, a common maxim states that if you ask two statisticians how to analyze a dataset, you’ll get three answers. The flip side of this assertion is that each answer will move you closer to an understanding of the data. I make no claim that a given analysis is the best or only approach to a given problem. Using the skills taught in this text, I invite you to play with the data and see what you can learn. R is interactive, and the best way to learn is to experiment. Code conventions The following typographical conventions are used throughout this book: ■ A monospaced font is used for code listings that should be typed as is. www.it-ebooks.info xxvi ABOUT THIS BOOK ■ ■ ■ ■ ■ A monospaced font is also used within the general text to denote code words or previously defined objects. Italics within code listings indicate placeholders. You should replace them with appropriate text and values for the problem at hand. For example, path_to _my_file would be replaced with the actual path to a file on your computer. R is an interactive language that indicates readiness for the next line of user input with a prompt (> by default). Many of the listings in this book capture interactive sessions. When you see code lines that start with >, don’t type the prompt. 
Code annotations are used in place of inline comments (a common convention in Manning books). Additionally, some annotations appear with numbered bullets like b that refer to explanations appearing later in the text. To save room or make text more legible, the output from interactive sessions may include additional white space or omit text that is extraneous to the point under discussion. Author Online Purchase of R in Action, Second Edition includes free access to a private web forum run by Manning Publications where you can make comments about the book, ask technical questions, and receive help from the author and from other users. To access the forum and subscribe to it, point your web browser to www.manning.com/RinActionSecond Edition. This page provides information on how to get on the forum once you’re registered, what kind of help is available, and the rules of conduct on the forum. Manning’s commitment to our readers is to provide a venue where a meaningful dialog between individual readers and between readers and the author can take place. It isn’t a commitment to any specific amount of participation on the part of the author, whose contribution to the AO forum remains voluntary (and unpaid). We suggest you try asking the author some challenging questions, lest his interest stray! The AO forum and the archives of previous discussions will be accessible from the publisher’s website as long as the book is in print. About the author Dr. Robert Kabacoff is Vice President of Research for Management Research Group, an international organizational development and consulting firm. He has more than 20 years of experience providing research and statistical consultation to organizations in health care, financial services, manufacturing, behavioral sciences, government, and academia. Prior to joining MRG, Dr. Kabacoff was a professor of psychology at Nova Southeastern University in Florida, where he taught graduate courses in quantitative methods and statistical programming. For the past five years, he has managed Quick-R (www.statmethods.net), a popular R tutorial website. www.it-ebooks.info about the cover illustration The figure on the cover of R in Action, Second Edition is captioned “A man from Zadar.” The illustration is taken from a reproduction of an album of Croatian traditional costumes from the mid-nineteenth century by Nikola Arsenovic, published by the Ethnographic Museum in Split, Croatia, in 2003. The illustrations were obtained from a helpful librarian at the Ethnographic Museum in Split, itself situated in the Roman core of the medieval center of the town: the ruins of Emperor Diocletian’s retirement palace from around AD 304. The book includes finely colored illustrations of figures from different regions of Croatia, accompanied by descriptions of the costumes and of everyday life. Zadar is an old Roman-era town on the northern Dalmatian coast of Croatia. It’s over 2,000 years old and served for hundreds of years as an important port on the trading route from Constantinople to the West. Situated on a peninsula framed by small Adriatic islands, the city is picturesque and has become a popular tourist destination with its architectural treasures of Roman ruins, moats, and old stone walls. The figure on the cover wears blue woolen trousers and a white linen shirt, over which he dons a blue vest and jacket trimmed with the colorful embroidery typical for this region. A red woolen belt and cap complete the costume. 
Dress codes and lifestyles have changed over the last 200 years, and the diversity by region, so rich at the time, has faded away. It’s now hard to tell apart the inhabitants of different continents, let alone of different hamlets or towns separated by only a few miles. Perhaps we have traded this cultural diversity for a more varied personal life— certainly for a more varied and fast-paced technological life. xxvii www.it-ebooks.info xxviii ABOUT THE COVER ILLUSTRATION Manning celebrates the inventiveness and initiative of the computer business with book covers based on the rich diversity of regional life of two centuries ago, brought back to life by illustrations from old books and collections like this one. www.it-ebooks.info Part 1 Getting started W elcome to R in Action! R is one of the most popular platforms for data analysis and visualization currently available. It’s free, open source software, available for Windows, Mac OS X, and Linux operating systems. This book will provide you with the skills needed to master this comprehensive software and apply it effectively to your own data. The book is divided into four sections. Part I covers the basics of installing the software, learning to navigate the interface, importing data, and massaging it into a useful format for further analysis. Chapter 1 is all about becoming familiar with the R environment. The chapter begins with an overview of R and the features that make it such a powerful platform for modern data analysis. After briefly describing how to obtain and install the software, the user interface is explored through a series of simple examples. Next, you’ll learn how to enhance the functionality of the basic installation with extensions (called contributed packages), that can be freely downloaded from online repositories. The chapter ends with an example that allows you to test out your new skills. Once you’re familiar with the R interface, the next challenge is to get your data into the program. In today’s information-rich world, data can come from many sources and in many formats. Chapter 2 covers the wide variety of methods available for importing data into R. The first half of the chapter introduces the data structures R uses to hold data and describes how to input data manually. The second half discusses methods for importing data from text files, web pages, spreadsheets, statistical packages, and databases. www.it-ebooks.info 2 Getting started CHAPTER From a workflow point of view, it would probably make sense to discuss data management and data cleaning next. But many users approach R for the first time out of an interest in its powerful graphics capabilities. Rather than frustrating that interest and keeping you waiting, we dive right into graphics in chapter 3. The chapter reviews methods for creating graphs, customizing them, and saving them in a variety of formats. The chapter describes how to specify the colors, symbols, lines, fonts, axes, titles, labels, and legends used in a graph, and ends with a description of how to combine several graphs into a single plot. Once you’ve had a chance to try out R’s graphics capabilities, it’s time to get back to the business of analyzing data. Data rarely comes in a readily usable format. Significant time must often be spent combining data from different sources, cleaning messy data (miscoded data, mismatched data, missing data), and creating new variables (combined variables, transformed variables, recoded variables) before the questions of interest can be addressed. 
Chapter 4 covers basic data-management tasks in R, including sorting, merging, and subsetting datasets, and transforming, recoding, and deleting variables.

Chapter 5 builds on the material in chapter 4. It covers the use of numeric (arithmetic, trigonometric, and statistical) and character functions (string subsetting, concatenation, and substitution) in data management. A comprehensive example is used throughout this section to illustrate many of the functions described. Next, control structures (looping, conditional execution) are discussed, and you'll learn how to write your own R functions. Writing custom functions allows you to extend R's capabilities by encapsulating many programming steps into a single, flexible function call. Finally, powerful methods for reorganizing (reshaping) and aggregating data are discussed. Reshaping and aggregation are often useful in preparing data for further analyses.

After having completed part 1, you'll be thoroughly familiar with programming in the R environment. You'll have the skills needed to enter or access your data, clean it up, and prepare it for further analyses. You'll also have experience creating, customizing, and saving a variety of graphs.

1  Introduction to R

This chapter covers
■ Installing R
■ Understanding the R language
■ Running programs

How we analyze data has changed dramatically in recent years. With the advent of personal computers and the internet, the sheer volume of data we have available has grown enormously. Companies have terabytes of data about the consumers they interact with, and governmental, academic, and private research institutions have extensive archival and survey data on every manner of research topic. Gleaning information (let alone wisdom) from these massive stores of data has become an industry in itself. At the same time, presenting the information in easily accessible and digestible ways has become increasingly challenging.

The science of data analysis (statistics, psychometrics, econometrics, and machine learning) has kept pace with this explosion of data. Before personal computers and the internet, new statistical methods were developed by academic researchers who published their results as theoretical papers in professional journals. It could take years for these methods to be adapted by programmers and incorporated into the statistical packages widely available to data analysts.
Rather than setting up a Fit a statistical model complete data analysis all at once, the process has become Evaluate the model fit highly interactive, with the outCross-validate the model put from each stage serving as the input for the next stage. An Evaluate model prediction on new data example of a typical analysis is shown in figure 1.1. At any Produce report point, the cycles may include transforming the data, imputing Figure 1.1 Steps in a typical data analysis missing values, adding or deleting variables, and looping back through the whole process again. The process stops when the analyst believes they understand the data intimately and have answered all the relevant questions that can be answered. The advent of personal computers (and especially the availability of high-resolution monitors) has also had an impact on how results are understood and presented. A picture really can be worth a thousand words, and human beings are adept at extracting useful information from visual presentations. Modern data analysis increasingly relies on graphical presentations to uncover meaning and convey results. Today’s data analysts need to access data from a wide range of sources (database management systems, text files, statistical packages, and spreadsheets), merge the pieces of data together, clean and annotate them, analyze them with the latest methods, present the findings in meaningful and graphically appealing ways, and incorporate the results into attractive reports that can be distributed to stakeholders and the public. As you’ll see in the following pages, R is a comprehensive software package that’s ideally suited to accomplish these goals. www.it-ebooks.info Why use R? 1.1 5 Why use R? R is a language and environment for statistical computing and graphics, similar to the S language originally developed at Bell Labs. It’s an open source solution to data analysis that’s supported by a large and active worldwide research community. But there are many popular statistical and graphing packages available (such as Microsoft Excel, SAS, IBM SPSS, Stata, and Minitab). Why turn to R? R has many features to recommend it: ■ ■ ■ ■ ■ ■ ■ ■ ■ ■ Most commercial statistical software platforms cost thousands, if not tens of thousands, of dollars. R is free! If you’re a teacher or a student, the benefits are obvious. R is a comprehensive statistical platform, offering all manner of data-analytic techniques. Just about any type of data analysis can be done in R. R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis. If you’re a SAS user, imagine getting a new SAS PROC every few days. R has state-of-the-art graphics capabilities. If you want to visualize complex data, R has the most comprehensive and powerful feature set available. R is a powerful platform for interactive data analysis and exploration. From its inception, it was designed to support the approach outlined in figure 1.1. For example, the results of any analytic step can easily be saved, manipulated, and used as input for additional analyses. Getting data into a usable form from multiple sources can be a challenging proposition. R can easily import data from a wide variety of sources, including text files, database-management systems, statistical packages, and specialized data stores. It can write data out to these systems as well. R can also access data directly from web pages, social media sites, and a wide range of online data services. 
1.1  Why use R?

R is a language and environment for statistical computing and graphics, similar to the S language originally developed at Bell Labs. It's an open source solution to data analysis that's supported by a large and active worldwide research community. But there are many popular statistical and graphing packages available (such as Microsoft Excel, SAS, IBM SPSS, Stata, and Minitab). Why turn to R? R has many features to recommend it:

■ Most commercial statistical software platforms cost thousands, if not tens of thousands, of dollars. R is free! If you're a teacher or a student, the benefits are obvious.
■ R is a comprehensive statistical platform, offering all manner of data-analytic techniques. Just about any type of data analysis can be done in R.
■ R contains advanced statistical routines not yet available in other packages. In fact, new methods become available for download on a weekly basis. If you're a SAS user, imagine getting a new SAS PROC every few days.
■ R has state-of-the-art graphics capabilities. If you want to visualize complex data, R has the most comprehensive and powerful feature set available.
■ R is a powerful platform for interactive data analysis and exploration. From its inception, it was designed to support the approach outlined in figure 1.1. For example, the results of any analytic step can easily be saved, manipulated, and used as input for additional analyses.
■ Getting data into a usable form from multiple sources can be a challenging proposition. R can easily import data from a wide variety of sources, including text files, database-management systems, statistical packages, and specialized data stores. It can write data out to these systems as well. R can also access data directly from web pages, social media sites, and a wide range of online data services.
■ R provides an unparalleled platform for programming new statistical methods in an easy, straightforward manner. It's easily extensible and provides a natural language for quickly programming recently published methods.
■ R functionality can be integrated into applications written in other languages, including C++, Java, Python, PHP, Pentaho, SAS, and SPSS. This allows you to continue working in a language that you may be familiar with, while adding R's capabilities to your applications.
■ R runs on a wide array of platforms, including Windows, Unix, and Mac OS X. It's likely to run on any computer you may have. (I've even come across guides for installing R on an iPhone, which is impressive but probably not a good idea.)
■ If you don't want to learn a new language, a variety of graphic user interfaces (GUIs) are available, offering the power of R through menus and dialogs.

You can see an example of R's graphic capabilities in figure 1.2. This graph, created with a single line of code, describes the relationships between income, education, and prestige for blue-collar, white-collar, and professional jobs. Technically, it's a scatterplot matrix with groups displayed by color and symbol, two types of fit lines (linear and loess), confidence ellipses, and two types of density display (kernel density estimation and rug plots). Additionally, the largest outlier in each scatter plot has been automatically labeled. If these terms are unfamiliar to you, don't worry. We'll cover them in later chapters. For now, trust me that they're really cool (and that the statisticians reading this are salivating).

Figure 1.2  Relationships between income, education, and prestige for blue-collar (bc), white-collar (wc), and professional (prof) jobs. Source: car package (scatterplotMatrix() function) written by John Fox. Graphs like this are difficult to create in other statistical programming languages but can be created with a line or two of code in R.

Basically, this graph indicates the following:
■ Education, income, and job prestige are linearly related.
■ In general, blue-collar jobs involve lower education, income, and prestige, whereas professional jobs involve higher education, income, and prestige. White-collar jobs fall in between.
■ There are some interesting exceptions. Railroad engineers have high income and low education. Ministers have high prestige and low income.

Chapter 8 will have much more to say about this type of graph. The important point is that R allows you to create elegant, informative, highly customized graphs in a simple and straightforward fashion. Creating similar plots in other statistical languages would be difficult, time-consuming, or impossible.
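As a rough illustration of the "single line of code" claim (not the exact call used to produce figure 1.2, which appears in chapter 8), a sketch along the following lines, using the Duncan occupational-prestige data supplied with the car package, produces a similar grouped scatterplot matrix; the plotting options shown are illustrative assumptions rather than the book's settings.

# Illustrative sketch only, not the exact call used in the book.
# Assumes the car package (which supplies the Duncan occupational data) is installed.
library(car)
scatterplotMatrix(~ income + education + prestige | type, data = Duncan,
                  main = "Income, education, and prestige by occupation type")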
Unfortunately, R can have a steep learning curve. Because it can do so much, the documentation and help files available are voluminous. Additionally, because much of the functionality comes from optional modules created by independent contributors, this documentation can be scattered and difficult to locate. In fact, getting a handle on all that R can do is a challenge.

The goal of this book is to make access to R quick and easy. We'll tour the many features of R, covering enough material to get you started on your data, with pointers on where to go when you need to learn more. Let's begin by installing the program.

1.2  Obtaining and installing R

R is freely available from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org. Precompiled binaries are available for Linux, Mac OS X, and Windows. Follow the directions for installing the base product on the platform of your choice. Later we'll talk about adding functionality through optional modules called packages (also available from CRAN). Appendix G describes how to update an existing R installation to a newer version.

1.3  Working with R

R is a case-sensitive, interpreted language. You can enter commands one at a time at the command prompt (>) or run a set of commands from a source file. There are a wide variety of data types, including vectors, matrices, data frames (similar to datasets), and lists (collections of objects). We'll discuss each of these data types in chapter 2.

Most functionality is provided through built-in and user-created functions and the creation and manipulation of objects. An object is basically anything that can be assigned a value. For R, that is just about everything (data, functions, graphs, analytic results, and more). Every object has a class attribute telling R how to handle it. All objects are kept in memory during an interactive session. Basic functions are available by default. Other functions are contained in packages that can be attached to a current session as needed.

Statements consist of functions and assignments. R uses the symbol <- for assignments, rather than the typical = sign. For example, the statement x <- rnorm(5) creates a vector object named x containing five random deviates from a standard normal distribution.

NOTE  R allows the = sign to be used for object assignments. But you won't find many programs written that way, because it's not standard syntax, there are some situations in which it won't work, and R programmers will make fun of you. You can also reverse the assignment direction. For instance, rnorm(5) -> x is equivalent to the previous statement. Again, doing so is uncommon and isn't recommended in this book.

Comments are preceded by the # symbol. Any text appearing after the # is ignored by the R interpreter.
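The following short snippet (a made-up illustration rather than one of the book's listings) shows the three assignment forms and a comment side by side:

x <- rnorm(5)      # preferred form: assign five standard-normal deviates to x
rnorm(5) -> x      # legal but uncommon: right-pointing assignment
x = rnorm(5)       # works at the top level, but isn't standard R style
# Everything after a # on a line is ignored by the interpreter.
mean(x)            # returns the mean of the five values currently stored in x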
The mean and standard deviation of the weights, along with the correlation between age and weight, are provided by the functions mean(), sd(), and cor(), respectively. Finally, age is plotted against weight using the plot() function, allowing you to visually inspect the trend. The q() function ends the session and lets you quit.

Listing 1.1 A sample R session

> age <- c(1,3,5,2,11,9,3,9,12,3)
> weight <- c(4.4,5.3,7.2,5.2,8.5,7.3,6.0,10.4,10.2,6.1)
> mean(weight)
[1] 7.06
> sd(weight)
[1] 2.077498
> cor(age,weight)
[1] 0.9075655
> plot(age,weight)
> q()

You can see from listing 1.1 that the mean weight for these 10 infants is 7.06 kilograms, that the standard deviation is 2.08 kilograms, and that there is a strong linear relationship between age in months and weight in kilograms (correlation = 0.91). The relationship can also be seen in the scatter plot in figure 1.4. Not surprisingly, as infants get older, they tend to weigh more.

Figure 1.4 Scatter plot of infant weight (kg) by age (mo)

The scatter plot in figure 1.4 is informative but somewhat utilitarian and unattractive. In later chapters, you’ll see how to customize graphs to suit your needs.

TIP To get a sense of what R can do graphically, enter demo() at the command prompt. A sample of the graphs produced is included in figure 1.5. Other demonstrations include demo(Hershey), demo(persp), and demo(image). To see a complete list of demonstrations, enter demo() without parameters.

Figure 1.5 A sample of the graphs created with the demo() function

1.3.2 Getting help

R provides extensive help facilities, and learning to navigate them will help you significantly in your programming efforts. The built-in help system provides details, references, and examples of any function contained in a currently installed package. You can obtain help using the functions listed in table 1.2.

Table 1.2 R help functions

help.start()                        General help
help("foo") or ?foo                 Help on function foo (quotation marks optional)
help.search("foo") or ??foo         Searches the help system for instances of the string foo
example("foo")                      Examples of function foo (quotation marks optional)
RSiteSearch("foo")                  Searches for the string foo in online help manuals and archived mailing lists
apropos("foo", mode="function")     Lists all available functions with foo in their name
data()                              Lists all available example datasets contained in currently loaded packages
vignette()                          Lists all available vignettes for currently installed packages
vignette("foo")                     Displays specific vignettes for topic foo

The function help.start() opens a browser window with access to introductory and advanced manuals, FAQs, and reference materials. The RSiteSearch() function searches for a given topic in online help manuals and archives of the R-Help discussion list and returns the results in a browser window. The vignettes returned by the vignette() function are practical introductory articles provided in PDF format. Not all packages have vignettes. As you can see, R provides extensive help facilities, and learning to navigate them will definitely aid your programming efforts. It’s a rare session that I don’t use ? to look up the features (such as options or return values) of some function.

1.3.3 The workspace

The workspace is your current R working environment and includes any user-defined objects (vectors, matrices, functions, data frames, and lists).
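For example, in a fresh session you might create an object, list what the workspace contains, and then remove it (a quick sketch; the object name is arbitrary):

> x <- rnorm(20)    # creates a vector object named x in the workspace
> ls()              # lists the objects in the current workspace
[1] "x"
> rm(x)             # deletes x from the workspace

These and other workspace commands are summarized in table 1.3.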
At the end of an R session, you can save an image of the current workspace that’s automatically reloaded the next time R starts. Commands are entered interactively at the R user prompt. You can use the up and down arrow keys to scroll through your command history. Doing so allows you to select a previous command, edit it if desired, and resubmit it using the Enter key. The current working directory is the directory from which R will read files and to which it will save results by default. You can find out what the current working directory is by using the getwd() function. You can set the current working directory by using the setwd() function. If you need to input a file that isn’t in the current working directory, use the full pathname in the call. Always enclose the names of files and www.it-ebooks.info 12 CHAPTER 1 Introduction to R directories from the operating system in quotation marks. Some standard commands for managing your workspace are listed in table 1.3. Table 1.3 Functions for managing the R workspace Function Action getwd() Lists the current working directory. setwd("mydirectory") Changes the current working directory to mydirectory. ls() Lists the objects in the current workspace. rm(objectlist) Removes (deletes) one or more objects. help(options) Provides information about available options. options() Lets you view or set current options. history(#) Displays your last # commands (default = 25). savehistory("myfile") Saves the commands history to myfile (default = .Rhistory). loadhistory("myfile") Reloads a command’s history (default = .Rhistory). save.image("myfile") Saves the workspace to myfile (default = .RData). save(objectlist, file="myfile") Saves specific objects to a file. load("myfile") Loads a workspace into the current session. q() Quits R. You’ll be prompted to save the workspace. To see these commands in action, look at the following listing. Listing 1.2 An example of commands used to manage the R workspace setwd("C:/myprojects/project1") options() options(digits=3) x <- runif(20) summary(x) hist(x) q() First, the current working directory is set to C:/myprojects/project1, the current option settings are displayed, and numbers are formatted to print with three digits after the decimal place. Next, a vector with 20 uniform random variates is created, and summary statistics and a histogram based on this data are generated. When the q() function is executed, the user is prompted to save their workspace. If they type y, the session history is saved to the file .Rhistory, and the workspace (including vector x) is saved to the file .RData in the current directory. The session is ended, and R closes. Note the forward slashes in the pathname of the setwd() command. R treats the backslash (\) as an escape character. Even when you’re using R on a Windows www.it-ebooks.info 13 Working with R platform, use forward slashes in pathnames. Also note that the setwd() function won’t create a directory that doesn’t exist. If necessary, you can use the dir.create() function to create a directory and then use setwd() to change to its location. It’s a good idea to keep your projects in separate directories. You may want to start an R session by issuing the setwd() command with the appropriate path to a project, followed by the load(".RData") command. This lets you start up where you left off in your last session and keeps both your objects and history separate between projects. On Windows and Mac OS X platforms, it’s even easier. 
Just navigate to the project directory and double-click the saved image file. Doing so starts R, loads the saved workspace, and sets the current working directory to this location. 1.3.4 Input and output By default, launching R starts an interactive session with input from the keyboard and output to the screen. But you can also process commands from a script file (a file containing R statements) and direct output to a variety of destinations. INPUT The source("filename") function submits a script to the current session. If the filename doesn’t include a path, the file is assumed to be in the current working directory. For example, source("myscript.R") runs a set of R statements contained in the file myscript.R. By convention, script filenames end with an .R extension, but this isn’t required. TEXT OUTPUT The sink("filename") function redirects output to the file filename. By default, if the file already exists, its contents are overwritten. Include the option append=TRUE to append text to the file rather than overwriting it. Including the option split=TRUE will send output to both the screen and the output file. Issuing the command sink() without options will return output to the screen alone. GRAPHIC OUTPUT Although sink()redirects text output, it has no effect on graphic output. To redirect graphic output, use one of the functions listed in table 1.4. Use dev.off() to return output to the terminal. Table 1.4 Functions for saving graphic output Function Output bmp("filename.bmp") BMP file jpeg("filename.jpg") JPEG file pdf("filename.pdf") PDF file png("filename.png") PNG file postscript("filename.ps") PostScript file www.it-ebooks.info 14 CHAPTER 1 Table 1.4 Introduction to R Functions for saving graphic output (continued) Function Output svg("filename.svg") SVG file win.metafile("filename.wmf") Windows metafile Let’s put it all together with an example. Assume that you have three script files containing R code (script1.R, script2.R, and script3.R). Issuing the statement source("script1.R") submits the R code from script1.R to the current session, and the results appear on the screen. If you then issue the statements sink("myoutput", append=TRUE, split=TRUE) pdf("mygraphs.pdf") source("script2.R") the R code from file script2.R is submitted, and the results again appear on the screen. In addition, the text output is appended to the file myoutput, and the graphic output is saved to the file mygraphs.pdf. Finally, if you issue the statements sink() dev.off() source("script3.R") source("script1.R") the R code from script3.R is submitted, and the results appear on the screen. This time, no text or graphic output is saved to files. The sequence is outlined in figure 1.6. R provides quite a bit of flexibility and control over where input comes from and where it goes. In section 1.5, you’ll learn how to run a program in batch mode. script1.R Current session sink("myoutput", append=TRUE, split=TRUE) source("script2.R") Current session script2.R myoutput pdf("mygraphs.pdf") sink(), dev.off() source("script3.R") Figure 1.6 Input with the source() function and output with the sink() function script3.R www.it-ebooks.info Current session Output added to the ile Packages 1.4 15 Packages R comes with extensive capabilities right out of the box. But some of its most exciting features are available as optional modules that you can download and install. There are more than 5,500 user-contributed modules called packages that you can download from http://cran.r-project.org/web/packages. 
They provide a tremendous range of new capabilities, from the analysis of geospatial data to protein mass spectra processing to the analysis of psychological tests! You’ll use many of these optional packages in this book. 1.4.1 What are packages? Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored on your computer is called the library. The function .libPaths() shows you where your library is located, and the function library() shows you what packages you’ve saved in your library. R comes with a standard set of packages (including base, datasets, utils, grDevices, graphics, stats, and methods). They provide a wide range of functions and datasets that are available by default. Other packages are available for download and installation. Once installed, they must be loaded into the session in order to be used. The command search() tells you which packages are loaded and ready to use. 1.4.2 Installing a package A number of R functions let you manipulate packages. To install a package for the first time, use the install.packages() command. For example, install.packages() without options brings up a list of CRAN mirror sites. Once you select a site, you’re presented with a list of all available packages. Selecting one downloads and installs it. If you know what package you want to install, you can do so directly by providing it as an argument to the function. For example, the gclus package contains functions for creating enhanced scatter plots. You can download and install the package with the command install.packages("gclus"). You only need to install a package once. But like any software, packages are often updated by their authors. Use the command update.packages() to update any packages that you’ve installed. To see details on your packages, you can use the installed.packages() command. It lists the packages you have, along with their version numbers, dependencies, and other information. 1.4.3 Loading a package Installing a package downloads it from a CRAN mirror site and places it in your library. To use it in an R session, you need to load the package using the library() command. For example, to use the package gclus, issue the command library(gclus). Of course, you must have installed a package before you can load it. You’ll only have to load the package once in a given session. If desired, you can customize your startup environment to automatically load the packages you use most often. Customizing your startup is covered in appendix B. www.it-ebooks.info 16 1.4.4 CHAPTER 1 Introduction to R Learning about a package When you load a package, a new set of functions and datasets becomes available. Small illustrative datasets are provided along with sample code, allowing you to try out the new functionalities. The help system contains a description of each function (along with examples) and information about each dataset included. Entering help(package="package_name") provides a brief description of the package and an index of the functions and datasets included. Using help() with any of these function or dataset names provides further details. The same information can be downloaded as a PDF manual from CRAN. Common mistakes in R programming Some common mistakes are made frequently by both beginning and experienced R programmers. If your program generates an error, be sure to check for the following: ■ ■ ■ ■ ■ Using the wrong case—help(), Help(), and HELP() are three different functions (only the first will work). 
Forgetting to use quotation marks when they’re needed—install.packages("gclus") works, whereas install.packages(gclus) generates an error. Forgetting to include the parentheses in a function call—For example, help() works, but help doesn’t. Even if there are no options, you still need the (). Using the \ in a pathname on Windows—R sees the backslash character as an escape character. setwd("c:\mydata") generates an error. Use setwd("c:/ mydata") or setwd("c:\\mydata") instead. Using a function from a package that’s not loaded—The function order.clusters() is contained in the gclus package. If you try to use it before loading the package, you’ll get an error. The error messages in R can be cryptic, but if you’re careful to follow these points, you should avoid seeing many of them. 1.5 Batch processing Most of the time, you’ll be running R interactively, entering commands at the command prompt and seeing the results of each statement as it’s processed. Occasionally, you may want to run an R program in a repeated, standard, and possibly unattended fashion. For example, you may need to generate the same report once a month. You can write your program in R and run it in batch mode. How you run R in batch mode depends on your operating system. On Linux or Mac OS X systems, you can use the following command in a terminal window R CMD BATCH options infile outfile where infile is the name of the file containing R code to be executed, outfile is the name of the file receiving the output, and options lists options that control execution. By convention, infile is given the extension .R, and outfile is given the extension .Rout. www.it-ebooks.info Working with large datasets 17 For Windows, use "C:\Program Files\R\R-3.1.0\bin\R.exe" CMD BATCH ➥ --vanilla --slave "c:\my projects\myscript.R" adjusting the paths to match the location of your R.exe binary and your script file. For additional details on how to invoke R, including the use of command-line options, see the “Introduction to R” documentation available from CRAN (http:// cran.r-project.org). 1.6 Using output as input: reusing results One of the most useful design features of R is that the output of analyses can easily be saved and used as input to additional analyses. Let’s walk through an example, using one of the datasets that comes preinstalled with R. If you don’t understand the statistics involved, don’t worry. We’re focusing on the general principle here. First, run a simple linear regression predicting miles per gallon (mpg) from car weight (wt), using the automotive dataset mtcars. This is accomplished with the following function call: lm(mpg~wt, data=mtcars) The results are displayed on the screen, and no information is saved. Next, run the regression, but store the results in an object: lmfit <- lm(mpg~wt, data=mtcars) The assignment creates a list object called lmfit that contains extensive information from the analysis (including the predicted values, residuals, regression coefficients, and more). Although no output is sent to the screen, the results can be both displayed and manipulated further. Typing summary(lmfit) displays a summary of the results, and plot(lmfit) produces diagnostic plots. The statement cook<-cooks.distance(lmfit) generates and stores influence statistics, and plot(cook) graphs them. To predict miles per gallon from car weight in a new set of data, you’d use predict(lmfit, mynewdata). To see what a function returns, look at the Value section of the R help page for that function. Here you’d look at help(lm) or ?lm. 
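You can also inspect a fitted object directly at the console. Here is a quick sketch, assuming the lmfit object created above:

> names(lmfit)              # the names of the components stored in lmfit
> str(lmfit, max.level=1)   # a compact overview of each component
> coef(lmfit)               # an extractor function that returns the regression coefficients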
This tells you what’s saved when you assign the results of that function to an object. 1.7 Working with large datasets Programmers frequently ask me if R can handle large data problems. Typically, they work with massive amounts of data gathered from web research, climatology, or genetics. Because R holds objects in memory, you’re generally limited by the amount of RAM available. For example, on my 5-year-old Windows PC with 2 GB of RAM, I can easily handle datasets with 10 million elements (100 variables by 100,000 observations). On an iMac with 4 GB of RAM, I can usually handle 100 million elements without difficulty. But there are two issues to consider: the size of the dataset and the statistical methods that will be applied. R can handle data analysis problems in the gigabyte to www.it-ebooks.info 18 CHAPTER 1 Introduction to R terabyte range, but specialized procedures are required. The management and analysis of very large datasets is discussed in appendix F. 1.8 Working through an example We’ll finish this chapter with an example that ties together many of these ideas. Here’s the task: 1 2 3 4 5 6 7 Open the general help, and look at the “Introduction to R” section. Install the vcd package (a package for visualizing categorical data that you’ll be using in chapter 11). List the functions and datasets available in this package. Load the package, and read the description of the dataset Arthritis. Print out the Arthritis dataset (entering the name of an object will list it). Run the example that comes with the Arthritis dataset. Don’t worry if you don’t understand the results; it basically shows that arthritis patients receiving treatment improved much more than patients receiving a placebo. Quit. Figure 1.7 Output from listing 1.3, including (left to right) output from the arthritis example, general help, information about the vcd package, information about the Arthritis dataset, and a graph displaying the relationship between arthritis treatment and outcome www.it-ebooks.info Summary 19 The code required is provided in the following listing, with a sample of the results displayed in figure 1.7. As this short exercise demonstrates, you can accomplish a great deal with a small amount of code. Listing 1.3 Working with a new package help.start() install.packages("vcd") help(package="vcd") library(vcd) help(Arthritis) Arthritis example(Arthritis) q() 1.9 Summary In this chapter, we looked at some of the strengths that make R an attractive option for students, researchers, statisticians, and data analysts trying to understand the meaning of their data. We walked through the program’s installation and talked about how to enhance R’s capabilities by downloading additional packages. We explored the basic interface, running programs interactively and in a batch, and produced a few sample graphs. You also learned how to save your work to both text and graphic files. Because R can be a complex program, we spent some time looking at how to access the extensive help that’s available. Hopefully you’re getting a sense of how powerful this freely available software can be. Now that you have R up and running, it’s time to get your data into the mix. In the next chapter, we’ll look at the types of data R can handle and how to import them into R from text files, other programs, and database management systems. 
www.it-ebooks.info Creating a dataset This chapter covers ■ Exploring R data structures ■ Using data entry ■ Importing data ■ Annotating datasets The first step in any data analysis is the creation of a dataset containing the information to be studied, in a format that meets your needs. In R, this task involves the following: ■ ■ Selecting a data structure to hold your data Entering or importing your data into the data structure The first part of this chapter (sections 2.1–2.2) describes the wealth of structures that R can use to hold data. In particular, section 2.2 describes vectors, factors, matrices, data frames, and lists. Familiarizing yourself with these structures (and the notation used to access elements within them) will help you tremendously in understanding how R works. You might want to take your time working through this section. The second part of this chapter (section 2.3) covers the many methods available for importing data into R. Data can be entered manually or imported from an 20 www.it-ebooks.info 21 Understanding datasets external source. These data sources can include text files, spreadsheets, statistical packages, and database-management systems. For example, the data that I work with typically comes from SQL databases. On occasion, though, I receive data from legacy DOS systems and from current SAS and SPSS databases. It’s likely that you’ll only have to use one or two of the methods described in this section, so feel free to choose those that fit your situation. Once a dataset is created, you’ll typically annotate it, adding descriptive labels for variables and variable codes. The third portion of this chapter (section 2.4) looks at annotating datasets and reviews some useful functions for working with datasets (section 2.5). Let’s start with the basics. 2.1 Understanding datasets A dataset is usually a rectangular array of data with rows representing observations and columns representing variables. Table 2.1 provides an example of a hypothetical patient dataset. Table 2.1 A patient dataset PatientID AdmDate Age Diabetes Status 1 10/15/2014 25 Type1 Poor 2 11/01/2014 34 Type2 Improved 3 10/21/2014 28 Type1 Excellent 4 10/28/2014 52 Type1 Poor Different traditions have different names for the rows and columns of a dataset. Statisticians refer to them as observations and variables, database analysts call them records and fields, and those from the data-mining and machine-learning disciplines call them examples and attributes. We’ll use the terms observations and variables throughout this book. You can distinguish between the structure of the dataset (in this case, a rectangular array) and the contents or data types included. In the dataset shown in table 2.1, PatientID is a row or case identifier, AdmDate is a date variable, Age is a continuous variable, Diabetes is a nominal variable, and Status is an ordinal variable. R contains a wide variety of structures for holding data, including scalars, vectors, arrays, data frames, and lists. Table 2.1 corresponds to a data frame in R. This diversity of structures provides the R language with a great deal of flexibility in dealing with data. The data types or modes that R can handle include numeric, character, logical (TRUE/FALSE), complex (imaginary numbers), and raw (bytes). In R, PatientID, AdmDate, and Age are numeric variables, whereas Diabetes and Status are character variables. 
Additionally, you need to tell R that PatientID is a case identifier, that AdmDate contains dates, and that Diabetes and Status are nominal and ordinal www.it-ebooks.info 22 CHAPTER 2 Creating a dataset variables, respectively. R refers to case identifiers as rownames and categorical variables (nominal, ordinal) as factors. We’ll cover each of these in the next section. You’ll learn about dates in chapter 3. 2.2 Data structures R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists. They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements. Figure 2.1 shows a diagram of these data structures. Let’s look at each structure in turn, starting with vectors. (b) Matrix (c) Array (a) Vector (d) Data frame Vectors (e) List Columns can be different modes Figure 2.1 Arrays Data frames Lists R data structures Some definitions Several terms are idiosyncratic to R and thus confusing to new users. In R, an object is anything that can be assigned to a variable. This includes constants, data structures, functions, and even graphs. An object has a mode (which describes how the object is stored) and a class (which tells generic functions like print how to handle it). A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata). The columns are variables, and the rows are observations. You can have variables of different types (for example, numeric or character) in the same data frame. Data frames are the main structures you use to store datasets. Factors are nominal or ordinal variables. They’re stored and treated specially in R. You’ll learn about factors in section 2.2.5. Most other terms used in R should be familiar to you and follow the terminology used in statistics and computing in general. 2.2.1 Vectors Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. The combine function c() is used to form the vector. Here are examples of each type of vector: a <- c(1, 2, 5, 3, 6, -2, 4) b <- c("one", "two", "three") c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE) www.it-ebooks.info 23 Data structures Here, a is a numeric vector, b is a character vector, and c is a logical vector. Note that the data in a vector must be only one type or mode (numeric, character, or logical). You can’t mix modes in the same vector. Scalars are one-element vectors. Examples include f <- 3, g <- "US", and h <- TRUE. They’re used to hold constants. NOTE You can refer to elements of a vector using a numeric vector of positions within brackets. For example, a[c(2, 4)] refers to the second and fourth elements of vector a. Here are additional examples: > a <- c("k", "j", "h", "a", "c", "m") > a[3] [1] "h" > a[c(1, 3, 5)] [1] "k" "h" "c" > a[2:6] [1] "j" "h" "a" "c" "m" The colon operator used in the last statement generates a sequence of numbers. For example, a <- c(2:6) is equivalent to a <- c(2, 3, 4, 5, 6). 2.2.2 Matrices A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical). Matrices are created with the matrix() function. 
The general format is myymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list( char_vector_rownames, char_vector_colnames)) where vector contains the elements for the matrix, nrow and ncol specify the row and column dimensions, and dimnames contains optional row and column labels stored in character vectors. The option byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column. The following listing demonstrates the matrix function. Listing 2.1 Creating matrices > y <- matrix(1:20, nrow=5, ncol=4) > y Creates a 5 × 4 matrix [,1] [,2] [,3] [,4] [1,] 1 6 11 16 [2,] 2 7 12 17 [3,] 3 8 13 18 [4,] 4 9 14 19 [5,] 5 10 15 20 > cells <- c(1,26,24,68) > rnames <- c("R1", "R2") > cnames <- c("C1", "C2") > mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, 2 × 2 matrix dimnames=list(rnames, cnames)) filled by rows b c www.it-ebooks.info 24 CHAPTER 2 Creating a dataset > mymatrix C1 C2 R1 1 26 R2 24 68 > mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames)) > mymatrix C1 C2 R1 1 24 R2 26 68 d 2 × 2 matrix filled by columns First you create a 5 × 4 matrix b. Then you create a 2 × 2 matrix with labels and fill the matrix by rows c. Finally, you create a 2 × 2 matrix and fill the matrix by columns d. You can identify rows, columns, or elements of a matrix by using subscripts and brackets. X[i,] refers to the ith row of matrix X, X[,j] refers to the j th column, and X[i, j] refers to the ij th element, respectively. The subscripts i and j can be numeric vectors in order to select multiple rows or columns, as shown in the following listing. Listing 2.2 Using matrix subscripts > x <- matrix(1:10, nrow=2) > x [,1] [,2] [,3] [,4] [,5] [1,] 1 3 5 7 9 [2,] 2 4 6 8 10 > x[2,] [1] 2 4 6 8 10 > x[,2] [1] 3 4 > x[1,4] [1] 7 > x[1, c(4,5)] [1] 7 9 First a 2 × 5 matrix is created containing the numbers 1 to 10. By default, the matrix is filled by column. Then the elements in the second row are selected, followed by the elements in the second column. Next, the element in the first row and fourth column is selected. Finally, the elements in the first row and the fourth and fifth columns are selected. Matrices are two-dimensional and, like vectors, can contain only one data type. When there are more than two dimensions, you use arrays (section 2.2.3). When there are multiple modes of data, you use data frames (section 2.2.4). 2.2.3 Arrays Arrays are similar to matrices but can have more than two dimensions. They’re created with an array function of the following form myarray <- array(vector, dimensions, dimnames) where vector contains the data for the array, dimensions is a numeric vector giving the maximal index for each dimension, and dimnames is an optional list of dimension www.it-ebooks.info Data structures 25 labels. The following listing gives an example of creating a three-dimensional (2 × 3 × 4) array of numbers. Listing 2.3 Creating an array > > > > > , dim1 <- c("A1", "A2") dim2 <- c("B1", "B2", "B3") dim3 <- c("C1", "C2", "C3", "C4") z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3)) z , C1 B1 B2 B3 A1 1 3 5 A2 2 4 6 , , C2 B1 B2 B3 A1 7 9 11 A2 8 10 12 , , C3 B1 B2 B3 A1 13 15 17 A2 14 16 18 , , C4 B1 B2 B3 A1 19 21 23 A2 20 22 24 As you can see, arrays are a natural extension of matrices. They can be useful in programming new statistical methods. Like matrices, they must be a single mode. 
Identifying elements follows what you’ve seen for matrices. In the previous example, the z[1,2,3] element is 15. 2.2.4 Data frames A data frame is more general than a matrix in that different columns can contain different modes of data (numeric, character, and so on). It’s similar to the dataset you’d typically see in SAS, SPSS, and Stata. Data frames are the most common data structure you’ll deal with in R. The patient dataset in table 2.1 consists of numeric and character data. Because there are multiple modes of data, you can’t contain the data in a matrix. In this case, a data frame is the structure of choice. A data frame is created with the data.frame() function mydata <- data.frame(col1, col2, col3,...) where col1, col2, col3, and so on are column vectors of any type (such as character, numeric, or logical). Names for each column can be provided with the names function. The following listing makes this clear. www.it-ebooks.info 26 CHAPTER 2 Creating a dataset Listing 2.4 Creating a data frame > > > > > > 1 2 3 4 patientID <- c(1, 2, 3, 4) age <- c(25, 34, 28, 52) diabetes <- c("Type1", "Type2", "Type1", "Type1") status <- c("Poor", "Improved", "Excellent", "Poor") patientdata <- data.frame(patientID, age, diabetes, status) patientdata patientID age diabetes status 1 25 Type1 Poor 2 34 Type2 Improved 3 28 Type1 Excellent 4 52 Type1 Poor Each column must have only one mode, but you can put columns of different modes together to form the data frame. Because data frames are close to what analysts typically think of as datasets, we’ll use the terms columns and variables interchangeably when discussing data frames. There are several ways to identify the elements of a data frame. You can use the subscript notation you used before (for example, with matrices), or you can specify column names. Using the patientdata data frame created earlier, the following listing demonstrates these approaches. Listing 2.5 Specifying elements of a data frame > patientdata[1:2] patientID age 1 1 25 2 2 34 3 3 28 4 4 52 > patientdata[c("diabetes", "status")] diabetes status 1 Type1 Poor 2 Type2 Improved 3 Type1 Excellent Indicates the age variable 4 Type1 Poor in the patient data frame > patientdata$age [1] 25 34 28 52 b The $ notation in the third example is new b. It’s used to indicate a particular variable from a given data frame. For example, if you want to cross-tabulate diabetes type by status, you can use the following code: > table(patientdata$diabetes, patientdata$status) Type1 Type2 Excellent Improved Poor 1 0 2 0 1 0 It can get tiresome typing patientdata$ at the beginning of every variable name, so shortcuts are available. You can use either the attach() and detach() or with() functions to simplify your code. www.it-ebooks.info 27 Data structures ATTACH, DETACH, AND WITH The attach() function adds the data frame to the R search path. When a variable name is encountered, data frames in the search path are checked for the variable in order. Using the mtcars data frame from chapter 1 as an example, you could use the following code to obtain summary statistics for automobile mileage (mpg) and plot this variable against engine displacement (disp) and weight (wt): summary(mtcars$mpg) plot(mtcars$mpg, mtcars$disp) plot(mtcars$mpg, mtcars$wt) This can also be written as follows: attach(mtcars) summary(mpg) plot(mpg, disp) plot(mpg, wt) detach(mtcars) The detach() function removes the data frame from the search path. Note that detach() does nothing to the data frame itself. 
The statement is optional but is good programming practice and should be included routinely. (I’ll sometimes ignore this sage advice in later chapters in order to keep code fragments simple and short.) The limitations with this approach are evident when more than one object can have the same name. Consider the following code: > mpg <- c(25, 36, 47) > attach(mtcars) The following object(s) are masked _by_ '.GlobalEnv': > plot(mpg, wt) Error in xy.coords(x, y, xlabel, ylabel, log) : 'x' and 'y' lengths differ > mpg [1] 25 36 47 mpg Here you already have an object named mpg in your environment when the mtcars data frame is attached. In such cases, the original object takes precedence, which isn’t what you want. The plot statement fails because mpg has 3 elements and disp has 32 elements. The attach() and detach() functions are best used when you’re analyzing a single data frame and you’re unlikely to have multiple objects with the same name. In any case, be vigilant for warnings that say that objects are being masked. An alternative approach is to use the with() function. You can write the previous example as with(mtcars, { print(summary(mpg)) plot(mpg, disp) plot(mpg, wt) }) In this case, the statements within the {} brackets are evaluated with reference to the mtcars data frame. You don’t have to worry about name conflicts. If there’s only one statement (for example, summary(mpg)), the {} brackets are optional. www.it-ebooks.info 28 CHAPTER 2 Creating a dataset The limitation of the with() function is that assignments exist only within the function brackets. Consider the following: > with(mtcars, { stats <- summary(mpg) stats }) Min. 1st Qu. Median Mean 3rd Qu. 10.40 15.43 19.20 20.09 22.80 > stats Error: object 'stats' not found Max. 33.90 If you need to create objects that will exist outside of the with() construct, use the special assignment operator <<- instead of the standard one (<-). It saves the object to the global environment outside of the with() call. This can be demonstrated with the following code: > with(mtcars, { nokeepstats <- summary(mpg) keepstats <<- summary(mpg) }) > nokeepstats Error: object 'nokeepstats' not found > keepstats Min. 1st Qu. Median Mean 3rd Qu. 10.40 15.43 19.20 20.09 22.80 Max. 33.90 Most books on R recommend using with() instead of attach(). I think that ultimately the choice is a matter of preference and should be based on what you’re trying to achieve and your understanding of the implications. You’ll use both in this book. CASE IDENTIFIERS In the patient data example, patientID is used to identify individuals in the dataset. In R, case identifiers can be specified with a rowname option in the data-frame function. For example, the statement patientdata <- data.frame(patientID, age, diabetes, status, row.names=patientID) specifies patientID as the variable to use in labeling cases on various printouts and graphs produced by R. 2.2.5 Factors As you’ve seen, variables can be described as nominal, ordinal, or continuous. Nominal variables are categorical, without an implied order. Diabetes (Type1, Type2) is an example of a nominal variable. Even if Type1 is coded as a 1 and Type2 is coded as a 2 in the data, no order is implied. Ordinal variables imply order but not amount. Status (poor, improved, excellent) is a good example of an ordinal variable. You know that a patient with a poor status isn’t doing as well as a patient with an improved status, but not by how much. 
Continuous variables can take on any value within some range, and both order and amount are implied. Age in years is a continuous variable and can www.it-ebooks.info Data structures 29 take on values such as 14.5 or 22.8 and any value in between. You know that someone who is 15 is one year older than someone who is 14. Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors. Factors are crucial in R because they determine how data is analyzed and presented visually. You’ll see examples of this throughout the book. The function factor() stores the categorical values as a vector of integers in the range [1… k], (where k is the number of unique values in the nominal variable) and an internal vector of character strings (the original values) mapped to these integers. For example, assume that you have this vector: diabetes <- c("Type1", "Type2", "Type1", "Type1") The statement diabetes <- factor(diabetes) stores this vector as (1, 2, 1, 1) and associates it with 1 = Type1 and 2 = Type2 internally (the assignment is alphabetical). Any analyses performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this level of measurement. For vectors representing ordinal variables, you add the parameter ordered=TRUE to the factor() function. Given the vector status <- c("Poor", "Improved", "Excellent", "Poor") the statement status <- factor(status, ordered=TRUE) will encode the vector as (3, 2, 1, 3) and associate these values internally as 1 = Excellent, 2 = Improved, and 3 = Poor. Additionally, any analyses performed on this vector will treat the variable as ordinal and select the statistical methods appropriately. By default, factor levels for character vectors are created in alphabetical order. This worked for the status factor, because the order “Excellent,” “Improved,” “Poor” made sense. There would have been a problem if “Poor” had been coded as “Ailing” instead, because the order would have been “Ailing,” “Excellent,” “Improved.” A similar problem would exist if the desired order was “Poor,” “Improved,” “Excellent.” For ordered factors, the alphabetical default is rarely sufficient. You can override the default by specifying a levels option. For example, status <- factor(status, order=TRUE, levels=c("Poor", "Improved", "Excellent")) assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent. Be sure the specified levels match your actual data values. Any data values not in the list will be set to missing. Numeric variables can be coded as factors using the levels and labels options. If sex was coded as 1 for male and 2 for female in the original data, then sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female")) would convert the variable to an unordered factor. Note that the order of the labels must match the order of the levels. In this example, sex would be treated as categorical, the labels “Male” and “Female” would appear in the output instead of 1 and 2, and any sex value that wasn’t initially coded as a 1 or 2 would be set to missing. www.it-ebooks.info 30 CHAPTER 2 Creating a dataset The following listing demonstrates how specifying factors and ordered factors impacts data analyses. Listing 2.6 Using factors > patientID <- c(1, 2, 3, 4) > age <- c(25, 34, 28, 52) Enter data > diabetes <- c("Type1", "Type2", "Type1", "Type1") as vectors. 
> status <- c("Poor", "Improved", "Excellent", "Poor") > diabetes <- factor(diabetes) > status <- factor(status, order=TRUE) > patientdata <- data.frame(patientID, age, diabetes, status) > str(patientdata) Displays ‘data.frame’: 4 obs. of 4 variables: the object $ patientID: num 1 2 3 4 structure $ age : num 25 34 28 52 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1 $ status : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3 > summary(patientdata) Displays patientID age diabetes status the object Min. :1.00 Min. :25.00 Type1:3 Excellent:1 summary 1st Qu.:1.75 1st Qu.:27.25 Type2:1 Improved :1 Median :2.50 Median :31.00 Poor :2 Mean :2.50 Mean :34.75 3rd Qu.:3.25 3rd Qu.:38.50 Max. :4.00 Max. :52.00 b c d First you enter the data as vectors b. Then you specify that diabetes is a factor and status is an ordered factor. Finally, you combine the data into a data frame. The function str(object) provides information about an object in R (the data frame, in this case) c. It clearly shows that diabetes is a factor and status is an ordered factor, along with how they’re coded internally. Note that the summary() function treats the variables differently d. It provides the minimum, maximum, mean, and quartiles for the continuous variable age, and frequency counts for the categorical variables diabetes and status. 2.2.6 Lists Lists are the most complex of the R data types. Basically, a list is an ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists. You create a list using the list() function mylist <- list(object1, object2, ...) where the objects are any of the structures seen so far. Optionally, you can name the objects in a list: mylist <- list(name1=object1, name2=object2, ...) The following listing shows an example. www.it-ebooks.info Data structures 31 Listing 2.7 Creating a list > g <- "My First List" > h <- c(25, 26, 18, 39) > j <- matrix(1:10, nrow=5) Creates a list > k <- c("one", "two", "three") > mylist <- list(title=g, ages=h, j, k) > mylist Prints the entire list $title [1] "My First List" $ages [1] 25 26 18 39 [[3]] [,1] [,2] [1,] 1 6 [2,] 2 7 [3,] 3 8 [4,] 4 9 [5,] 5 10 [[4]] [1] "one" "two" "three" Prints the second component > mylist[[2]] [1] 25 26 18 39 > mylist[["ages"]] [[1] 25 26 18 39 In this example, you create a list with four components: a string, a numeric vector, a matrix, and a character vector. You can combine any number of objects and save them as a list. You can also specify elements of the list by indicating a component number or a name within double brackets. In this example, mylist[[2]] and mylist[["ages"]] both refer to the same four-element numeric vector. For named components, mylist$ages would also work. Lists are important R structures for two reasons. First, they allow you to organize and recall disparate information in a simple way. Second, the results of many R functions return lists. It’s up to the analyst to pull out the components that are needed. You’ll see numerous examples of functions that return lists in later chapters. A note for programmers Experienced programmers typically find several aspects of the R language unusual. Here are some features of the language you should be aware of: ■ The period (.) has no special significance in object names. 
The dollar sign ($) has a somewhat analogous meaning to the period in other object-oriented languages and can be used to identify the parts of a data frame or list. For example, A$x refers to variable x in data frame A. www.it-ebooks.info 32 CHAPTER 2 Creating a dataset (continued) ■ ■ R doesn’t provide multiline or block comments. You must start each line of a multiline comment with #. For debugging purposes, you can also surround code that you want the interpreter to ignore with the statement if(FALSE){...}. Changing the FALSE to TRUE allows the code to be executed. Assigning a value to a nonexistent element of a vector, matrix, array, or list expands that structure to accommodate the new value. For example, consider the following: > x <- c(8, 6, 4) > x[7] <- 10 > x [1] 8 6 4 NA NA NA 10 The vector x has expanded from three to seven elements through the assignment. x <- x[1:3] would shrink it back to three elements. ■ ■ ■ R doesn’t have scalar values. Scalars are represented as one-element vectors. Indices in R start at 1, not at 0. In the vector earlier, x[1] is 8. Variables can’t be declared. They come into existence on first assignment. To learn more, see John Cook’s excellent blog post, “R Language for Programmers” (http://mng.bz/6NwQ). Programmers looking for stylistic guidance may also want to check out “Google’s R Style Guide” (http://mng.bz/i775). 2.3 Data input Now that you have data structures, you need to put some data in them! As a data analyst, you’re typically faced with data that comes from a variety of sources and in a variety of formats. Your task is to import the data into your tools, analyze the data, and report on the results. R provides a wide range of tools for importing data. The definitive guide for importing data in R is the R Data Import/Export manual available at http://mng.bz/urwn. As you can see in figure 2.2, R can import data from the keyboard, from text files, from Microsoft Excel and Access, from popular statistical packages, from a variety of Statistical packages SAS SPSS Stata Keyboard ASCII Text files XML Excel R netCFD Webscraping SQL Other HDF5 MySQL Oracle Access Database management systems www.it-ebooks.info Figure 2.2 Sources of data that can be imported into R Data input 33 relational database management systems, from specialty databases, and from web sites and online services. Because you never know where your data will come from, we’ll cover each of them here. You only need to read about the ones you’re going to be using. 2.3.1 Entering data from the keyboard Perhaps the simplest way to enter data is from the keyboard. There are two common methods: entering data through R’s built-in text editor and embedding data directly into your code. We’ll consider the editor first. The edit() function in R invokes a text editor that lets you enter data manually. Here are the steps: 1 2 Create an empty data frame (or matrix) with the variable names and modes you want to have in the final dataset. Invoke the text editor on this data object, enter your data, and save the results to the data object. The following example creates a data frame named mydata with three variables: age (numeric), gender (character), and weight (numeric). You then invoke the text editor, add your data, and save the results: mydata <- data.frame(age=numeric(0), gender=character(0), weight=numeric(0)) mydata <- edit(mydata) Assignments like age=numeric(0) create a variable of a specific mode, but without actual data. 
Note that the result of the editing is assigned back to the object itself. The edit() function operates on a copy of the object. If you don’t assign it a destination, all of your edits will be lost! The results of invoking the edit() function on a Windows platform are shown in figure 2.3. In this figure, I’ve added some data. If you click a column title, the editor Figure 2.3 Entering data via the built-in editor on a Windows platform www.it-ebooks.info 34 CHAPTER 2 Creating a dataset gives you the option of changing the variable name and type (numeric or character). You can add variables by clicking the titles of unused columns. When the text editor is closed, the results are saved to the object assigned (mydata, in this case). Invoking mydata <- edit(mydata) again allows you to edit the data you’ve entered and to add new data. A shortcut for mydata <- edit(mydata) is fix(mydata). Alternatively, you can embed the data directly in your program. For example, the code mydatatxt <- " age gender weight 25 m 166 30 f 115 18 f 120 " mydata <- read.table(header=TRUE, text=mydatatxt) creates the same data frame as that created with the edit() function. A character string is created containing the raw data, and the read.table() function is used to process the string and return a data frame. The read.table() function is described more fully in the next section. Keyboard data entry can be convenient when you’re working with small datasets. For larger datasets, you’ll want to use the methods described next: importing data from existing text files, Excel spreadsheets, statistical packages, or databasemanagement systems. 2.3.2 Importing data from a delimited text file You can import data from delimited text files using read.table(), a function that reads a file in table format and saves it as a data frame. Each row of the table appears as one line in the file. The syntax is mydataframe <- read.table(file, options) where file is a delimited ASCII file and the options are parameters controlling how data is processed. The most common options are listed in table 2.2. Table 2.2 read.table() options Option Description header A logical value indicating whether the file contains the variable names in the first line. sep The delimiter separating data values. The default is sep="", which denotes one or more spaces, tabs, new lines, or carriage returns. Use sep="," to read commadelimited files, and sep="\t" to read tab-delimited files. row.names An optional parameter specifying one or more variables to represent row identifiers. col.names If the first row of the data file doesn’t contain variable names (header=FALSE), you can use col.names to specify a character vector containing the variable names. If header=FALSE and the col.names option is omitted, variables will be named V1, V2, and so on. www.it-ebooks.info 35 Data input Table 2.2 read.table() options Option Description na.strings Optional character vector indicating missing-values codes. For example, na.strings =c("-9", "?") converts each -9 and ? value to NA as the data is read. colClasses Optional vector of classes to be assign to the columns. For example, colClasses =c("numeric", "numeric", "character", "NULL", "numeric") reads the first two columns as numeric, reads the third column as character, skips the fourth column, and reads the fifth column as numeric. If there are more than five columns in the data, the values in colClasses are recycled. When you’re reading large text files, including the colClasses option can speed up processing considerably. 
quote Character(s) used to delimit strings that contain special characters. By default this is either double (") or single (') quotes. skip The number of lines in the data file to skip before beginning to read the data. This option is useful for skipping header comments in the file. stringsAsFactors A logical value indicating whether character variables should be converted to factors. The default is TRUE unless this is overridden by colClasses. When you’re processing large text files, setting stringsAsFactors=FALSE can speed up processing. text A character string specifying a text string to process. If text is specified, leave file blank. An example is given in section 2.3.1. Consider a text file named studentgrades.csv containing students’ grades in math, science, and social studies. Each line of the file represents a student. The first line contains the variable names, separated with commas. Each subsequent line contains a student’s information, also separated with commas. The first few lines of the file are as follows: StudentID,First,Last,Math,Science,Social Studies 011,Bob,Smith,90,80,67 012,Jane,Weary,75,,80 010,Dan,"Thornton, III",65,75,70 040,Mary,"O'Leary",90,95,92 The file can be imported into a data frame using the following code: grades <- read.table("studentgrades.csv", header=TRUE, row.names="StudentID", sep=",") The results are as follows: > grades 11 12 10 40 First Bob Jane Dan Mary Last Math Science Social.Studies Smith 90 80 67 Weary 75 NA 80 Thornton, III 65 75 70 O'Leary 90 95 92 > str(grades) www.it-ebooks.info 36 CHAPTER 2 Creating a dataset 'data.frame': 4 obs. of 5 variables: $ First : Factor w/ 4 levels "Bob","Dan","Jane",..: 1 3 2 4 $ Last : Factor w/ 4 levels "O'Leary","Smith",..: 2 4 3 1 $ Math : int 90 75 65 90 $ Science : int 80 NA 75 95 $ Social.Studies: int 67 80 70 92 There are several interesting things to note about how the data is imported. The variable name Social Studies is automatically renamed to follow R conventions. The StudentID column is now the row name, no longer has a label, and has lost its leading zero. The missing science grade for Jane is correctly read as missing. I had to put quotation marks around Dan's last name in order to escape the comma between Thornton and III. Otherwise, R would have seen seven values on that line, rather than six. I also had to put quotation marks around O'Leary. Otherwise, R would have read the single quote as a string delimiter (which isn’t what I want). Finally, the first and last names are converted to factors. By default, read.table() converts character variables to factors, which may not always be desirable. For example, there would be little reason to convert a character variable containing a respondent’s comments into a factor. You can suppress this behavior in a number of ways. Including the option stringsAsFactors=FALSE turns off this behavior for all character variables. Alternatively, you can use the colClasses option to specify a class (for example, logical, numeric, character, or factor) for each column. Importing the same data with grades <- read.table("studentgrades.csv", header=TRUE, row.names="StudentID", sep=",", colClasses=c("character", "character", "character", "numeric", "numeric", "numeric")) produces the following data frame: > grades 011 012 010 040 First Bob Jane Dan Mary Last Math Science Social.Studies Smith 90 80 67 Weary 75 NA 80 Thornton, III 65 75 70 O'Leary 90 95 92 > str(grades) 'data.frame': 4 obs. 
of 5 variables: $ First : chr "Bob" "Jane" "Dan" "Mary" $ Last : chr "Smith" "Weary" "Thornton, III" "O'Leary" $ Math : num 90 75 65 90 $ Science : num 80 NA 75 95 $ Social.Studies: num 67 80 70 92 Note that the row names retain their leading zero and First and Last are no longer factors. Additionally, the grades are stored as real values rather than integers. www.it-ebooks.info Data input 37 The read.table() function has many options for fine-tuning data imports. See help(read.table) for details. Importing data via connections Many of the examples in this chapter import data from files that exist on your computer. R provides several mechanisms for accessing data via connections as well. For example, the functions file(), gzfile(), bzfile(), xzfile(), unz(), and url() can be used in place of the filename. The file() function allows you to access files, the clipboard, and C-level standard input. The gzfile(), bzfile(), xzfile(), and unz() functions let you read compressed files. The url() function lets you access internet files through a complete URL that includes http://, ftp://, or file://. For HTTP and FTP, proxies can be specified. For convenience, complete URLs (surrounded by double quotation marks) can usually be used directly in place of filenames as well. See help(file) for details. 2.3.3 Importing data from Excel The best way to read an Excel file is to export it to a comma-delimited file from Excel and import it into R using the method described earlier. Alternatively, you can import Excel worksheets directly using the xlsx package. Be sure to download and install it before you first use it. You’ll also need the xlsxjars and rJava packages and a working installation of Java (http://java.com). The xlsx package can be used to read, write, and format Excel 97/2000/XP/ 2003/2007 files. The read.xlsx() function imports a worksheet into a data frame. The simplest format is read.xlsx(file, n) where file is the path to an Excel workbook, n is the number of the worksheet to be imported, and the first line of the worksheet contains the variable names. For example, on a Windows platform, the code library(xlsx) workbook <- "c:/myworkbook.xlsx" mydataframe <- read.xlsx(workbook, 1) imports the first worksheet from the workbook myworkbook.xlsx stored on the C: drive and saves it as the data frame mydataframe. The read.xlsx() function has options that allow you to specify specific rows (rowIndex) and columns (colIndex) of the worksheet, along with the class of each column (colClasses). For large worksheets (say, 100,000+ cells), you can also use read.xlsx2(). It performs more of the processing work in Java, resulting in significant performance gains. See help(read.xlsx) for details. There are other packages that can help you work with Excel files. Alternatives include the XLConnect and openxlsx packages; XLConnect depends on Java, but openxlsx doesn’t. All of these package can do more than import worksheets—they can create and manipulate Excel files as well. Programmers who need to develop an interface between R and Excel should check out one or more of these packages. www.it-ebooks.info 38 2.3.4 CHAPTER 2 Creating a dataset Importing data from XML Increasingly, data is provided in the form of files encoded in XML. R has several packages for handling XML files. For example, the XML package written by Duncan Temple Lang allows you to read, write, and manipulate XML files. 
Coverage of XML is beyond the scope of this text; if you’re interested in accessing XML documents from within R, see the excellent package documentation at www.omegahat.org/RSXML. 2.3.5 Importing data from the web Data can be obtained from the web via webscraping or the use of application programming interfaces (APIs). Webscraping is used to extract the information embedded in specific web pages, whereas APIs allow you to interact with web services and online data stores. Typically, webscraping is used to extract data from a web page and save it into an R structure for further analysis. For example, the text on a web page can be downloaded into an R character vector using the readLines() function and manipulated with functions such as grep() and gsub(). For complex web pages, the RCurl and XML packages can be used to extract the information desired. For more information, including examples, see “Webscraping Using readLines and RCurl,” available from the website Programming with R (www.programmingr.com). APIs specify how software components should interact with each other. A number of R packages use this approach to extract data from web-accessible resources. These include data sources in biology, medicine, Earth sciences, physical science, economics and business, finance, literature, marketing, news, and sports. For example, if you’re interested in social media, you can access Twitter data via twitteR, Facebook data via Rfacebook, and Flickr data via Rflickr. Other packages allow you to access popular web services provided by Google, Amazon, Dropbox, Salesforce, and others. For a comprehensive list of R packages that can help you access web-based resources, see the CRAN Task view on Web Technologies and Services (http://mng.bz/370r). 2.3.6 Importing data from SPSS IBM SPSS datasets can be imported into R via the read.spss() function in the for- eign package. Alternatively, you can use the spss.get() function in the Hmisc package. spss.get() is a wrapper function that automatically sets many parameters of read.spss() for you, making the transfer easier and more consistent with what data analysts expect as a result. First, download and install the Hmisc package (the foreign package is already installed by default): install.packages("Hmisc") Then use the following code to import the data: library(Hmisc) mydataframe <- spss.get("mydata.sav", use.value.labels=TRUE) www.it-ebooks.info Data input 39 In this code, mydata.sav is the SPSS data file to be imported, use.value.labels=TRUE tells the function to convert variables with value labels into R factors with those same levels, and mydataframe is the resulting R data frame. 2.3.7 Importing data from SAS A number of functions in R are designed to import SAS datasets, including read.ssd() in the foreign package, sas.get() in the Hmisc package, and read.sas7bdat() in the sas7bdat package. If you have SAS installed, sas.get() can be a good option. Let’s say that you want to import an SAS dataset named clients.sas7bdat that resides in the C:/mydata directory on a Windows machine. The following code imports the data and saves it as an R data frame: library(Hmisc) datadir <- "C:/mydata" sasexe <- "C:/Program Files/SASHome/SASFoundation/9.4/sas.exe" mydata <- sas.get(libraryName=datadir, member="clients", sasprog=sasexe) libraryName is a directory containing the SAS dataset, member is the dataset name (excluding the sas7bdat extension), and sasprog is the full path to the SAS executable. 
Many additional options are available; see help(sas.get) for details. You can also save the SAS dataset as a comma-delimited text file from within SAS using PROC EXPORT, and you can read the resulting file into R using the method described in section 2.3.2. Here’s an example: SAS program: libname datadir "C:\mydata"; proc export data=datadir.clients outfile="clients.csv" dbms=csv; run; R program: mydata <- read.table("clients.csv", header=TRUE, sep=",") The previous two approaches require that you have a fully functional version of SAS installed. If you don’t have access to SAS, the read.sas7bdat() function may be a good alternative. The function can read an SAS dataset in sas7bdat format directly. The code for this example would be library(sas7bdat) mydata <- read.sas7bdat("C:/mydata/clients.sas7bdat") Unlike sas.get(), the read.sas7bdat() function ignores SAS user-defined formats. Additionally, it takes significantly longer to run. Although I’ve had good luck with this package, it’s still considered experimental. Finally, a commercial product named Stat/Transfer (described in section 2.3.12) does an excellent job of saving SAS datasets (including any existing variable formats) as R data frames. As with read.sas7dbat(), access to an SAS installation isn’t required. www.it-ebooks.info 40 2.3.8 CHAPTER 2 Creating a dataset Importing data from Stata Importing data from Stata to R is straightforward. The necessary code looks like this: library(foreign) mydataframe <- read.dta("mydata.dta") Here, mydata.dta is the Stata dataset, and mydataframe is the resulting R data frame. 2.3.9 Importing data from NetCDF Unidata’s Network Common Data Form (NetCDF) open source software contains machine-independent data formats for the creation and distribution of array-oriented scientific data. NetCDF is commonly used to store geophysical data. The ncdf and ncdf4 packages provide high-level R interfaces to NetCDF data files. The ncdf package provides support for data files created with Unidata’s NetCDF library (version 3 or earlier) and is available for Windows, Mac OS X, and Linux platforms. The ncdf4 package supports version 4 or earlier but isn’t yet available for Windows. Consider this code: library(ncdf) nc <- nc_open("mynetCDFfile") myarray <- get.var.ncdf(nc, myvar) In this example, all the data from the variable myvar, contained in the NetCDF file mynetCDFfile, is read and saved into an R array called myarray. Note that both the ncdf and ncdf4 packages have received major recent upgrades and may operate differently than previous versions. Additionally, function names in the two packages differ. Read the online help for details. 2.3.10 Importing data from HDF5 Hierarchical Data Format (HDF5) is a software technology suite for the management of extremely large and complex data collections. The rhdf5 package provides an R interface for HDF5. The package is available on the Bioconductor website rather than CRAN. You can install it with the following code: source("http://bioconductor.org/biocLite.R") biocLite("rhdf5") Like XML, HDF5 is beyond the scope of this book. To learn more, visit the HDF Group website (www.hdfgroup.org). There is an excellent tutorial for the rhdf5 package by Bernd Fischer at http://mng.bz/eg6j. 2.3.11 Accessing database management systems (DBMSs) R can interface with a wide variety of relational database management systems (DBMSs), including Microsoft SQL Server, Microsoft Access, MySQL, Oracle, PostgreSQL, DB2, Sybase, Teradata, and SQLite. 
Some packages provide access through native database drivers, whereas others offer access via ODBC or JDBC. Using R to www.it-ebooks.info 41 Data input access data stored in external DMBSs can be an efficient way to analyze large datasets (see appendix F) and takes advantage of the power of both SQL and R. THE ODBC INTERFACE Perhaps the most popular method of accessing a DBMS in R is through the RODBC package, which allows R to connect to any DBMS that has an ODBC driver. This includes all the DBMSs listed earlier. The first step is to install and configure the appropriate ODBC driver for your platform and database (these drivers aren’t part of R). If the requisite drivers aren’t already installed on your machine, an internet search should provide you with options. Once the drivers are installed and configured for the database(s) of your choice, install the RODBC package. You can do so by using the install.packages("RODBC") command. The primary functions included with RODBC are listed in table 2.3. Table 2.3 RODBC functions Function Description odbcConnect(dsn,uid="",pwd="") Opens a connection to an ODBC database sqlFetch(channel,sqltable) Reads a table from an ODBC database into a data frame sqlQuery(channel,query) Submits a query to an ODBC database and returns the results sqlSave(channel,mydf,tablename = sqltable,append=FALSE) Writes or updates (append=TRUE) a data frame to a table in the ODBC database sqlDrop(channel,sqltable) Removes a table from the ODBC database close(channel) Closes the connection The RODBC package allows two-way communication between R and an ODBCconnected SQL database. This means you can not only read data from a connected database into R, but also use R to alter the contents of the database itself. Assume that you want to import two tables (Crime and Punishment) from a DBMS into two R data frames called crimedat and pundat, respectively. You can accomplish this with code similar to the following: library(RODBC) myconn <-odbcConnect("mydsn", uid="Rob", pwd="aardvark") crimedat <- sqlFetch(myconn, Crime) pundat <- sqlQuery(myconn, "select * from Punishment") close(myconn) Here, you load the RODBC package and open a connection to the ODBC database through a registered data source name (mydsn) with a security UID (rob) and password (aardvark). The connection string is passed to sqlFetch(), which copies the table Crime into the R data frame crimedat. You then run the SQL select statement www.it-ebooks.info 42 CHAPTER 2 Creating a dataset against the table Punishment and save the results to the data frame pundat. Finally, you close the connection. The sqlQuery() function is powerful because any valid SQL statement can be inserted. This flexibility allows you to select specific variables, subset the data, create new variables, and recode and rename existing variables. DBI-RELATED PACKAGES The DBI package provides a general and consistent client-side interface to a DBMS. Building on this framework, the RJDBC package provides access to a DBMS via a JDBC driver. Be sure to install the necessary JDBC drivers for your platform and database. Other useful DBI-based packages include RMySQL, ROracle, RPostgreSQL, and RSQLite. These packages provide native database drivers for their respective databases but may not be available on all platforms. Check the documentation on CRAN (http://cran .r-project.org) for details. 
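To give a flavor of the DBI workflow, here is a minimal sketch using RSQLite; the database file mydb.sqlite and the table customers are hypothetical, but dbConnect(), dbGetQuery(), and dbDisconnect() are the core DBI functions:

library(DBI)
library(RSQLite)
con <- dbConnect(RSQLite::SQLite(), dbname="mydb.sqlite")   # open a connection through the native SQLite driver
customers <- dbGetQuery(con, "SELECT * FROM customers")     # submit SQL; the result is returned as a data frame
dbDisconnect(con)                                           # close the connection when you're finished

Because the driver talks to the database directly, no ODBC or JDBC configuration is required, which can simplify setup on platforms where a native driver is available.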
2.3.12 Importing data via Stat/Transfer Before we end our discussion of importing data, it’s worth mentioning a commercial product that can make the task significantly easier. Stat/Transfer (www.stattransfer .com) is a standalone application that can transfer data among 34 data formats, including R (see figure 2.4). Stat/Transfer is available for Windows, Mac, and Unix platforms. It supports the latest versions of the statistical packages we’ve discussed so far, as well as ODBCaccessed DBMSs such as Oracle, Sybase, Informix, and DB/2. Figure 2.4 Stat/Transfer’s main dialog on Windows www.it-ebooks.info Useful functions for working with data objects 2.4 43 Annotating datasets Data analysts typically annotate datasets to make the results easier to interpret. Annotating generally includes adding descriptive labels to variable names and value labels to the codes used for categorical variables. For example, for the variable age, you might want to attach the more descriptive label “Age at hospitalization (in years).” For the variable gender, coded 1 or 2, you might want to associate the labels “male” and “female.” 2.4.1 Variable labels Unfortunately, R’s ability to handle variable labels is limited. One approach is to use the variable label as the variable’s name and then refer to the variable by its position index. Consider the earlier example, where you have a data frame containing patient data. The second column, age, contains the ages at which individuals were first hospitalized. The code names(patientdata)[2] <- "Age at hospitalization (in years)" renames age to "Age at hospitalization (in years)". Clearly this new name is too long to type repeatedly. Instead, you can refer to this variable as patientdata[2], and the string "Age at hospitalization (in years)" will print wherever age would have originally. Obviously, this isn’t an ideal approach, and you may be better off trying to come up with better variable names (for example, admissionAge). 2.4.2 Value labels The factor() function can be used to create value labels for categorical variables. Continuing the example, suppose you have a variable named gender, which is coded 1 for male and 2 for female. You can create value labels with the code patientdata$gender <- factor(patientdata$gender, levels = c(1,2), labels = c("male", "female")) Here levels indicates the actual values of the variable, and labels refers to a character vector containing the desired labels. 2.5 Useful functions for working with data objects We’ll end this chapter with a brief summary of useful functions for working with data objects (see table 2.4). Table 2.4 Useful functions for working with data objects Function Purpose length(object) Gives the number of elements/components. dim(object) Gives the dimensions of an object. str(object) Gives the structure of an object. www.it-ebooks.info 44 CHAPTER 2 Table 2.4 Creating a dataset Useful functions for working with data objects (continued) Function Purpose class(object) Gives the class of an object. mode(object) Determines how an object is stored. names(object) Gives the names of components in an object. c(object, object,...) Combines objects into a vector. cbind(object, object, ...) Combines objects as columns. rbind(object, object, ...) Combines objects as rows. object Prints an object. head(object) Lists the first part of an object. tail(object) Lists the last part of an object. ls() Lists current objects. rm(object, object, ...) Deletes one or more objects. 
The statement rm(list = ls()) removes most objects from the working environment. newobject <- edit(object) Edits object and saves it as newobject. fix(object) Edits an object in place. We’ve already discussed most of these functions. head() and tail() are useful for quickly scanning large datasets. For example, head(patientdata) lists the first six rows of the data frame, whereas tail(patientdata) lists the last six. We’ll cover functions such as length(), cbind(), and rbind() in the next chapter; they’re gathered here as a reference. 2.6 Summary One of the most challenging tasks in data analysis is data preparation. We’ve made a good start in this chapter by outlining the various structures that R provides for holding data and the many methods available for importing data from both keyboard and external sources. In particular, we’ll use the definitions of vector, matrix, data frame, and list again and again in later chapters. Your ability to specify elements of these structures via the bracket notation will be particularly important in selecting, subsetting, and transforming data. As you’ve seen, R offers a wealth of functions for accessing external data. This includes data from flat files, web files, statistical packages, spreadsheets, and databases. Although the focus of this chapter has been on importing data into R, you can also export data from R into these external formats. Exporting data is covered in appendix C, and methods of working with large datasets (in the gigabyte to terabyte range) are covered in appendix F. www.it-ebooks.info Summary 45 Once you import your datasets into R, it’s likely that you’ll have to manipulate them into a more conducive format (actually, I find guilt works well). In chapter 4, we’ll explore ways to create new variables, transform and recode existing variables, merge datasets, and select observations. But before turning to data-management tasks, let’s spend some time with R graphics. Many readers have turned to R out of an interest in its graphing capabilities, and I don’t want to make you wait any longer. In the next chapter, we’ll jump directly into the creation of graphs. The emphasis will be on general methods for managing and customizing graphs that can be applied throughout the remainder of this book. www.it-ebooks.info Getting started with graphs This chapter covers ■ Creating and saving graphs ■ Customizing symbols, lines, colors, and axes ■ Annotating with text and titles ■ Controlling a graph’s dimensions ■ Combining multiple graphs into one On many occasions, I’ve presented clients with carefully crafted statistical results in the form of numbers and text, only to have their eyes glaze over while the chirping of crickets permeated the room. Yet those same clients had enthusiastic “Ah-ha!” moments when I presented the same information to them in the form of graphs. Often I can see patterns in data or detect anomalies in data values by looking at graphs—patterns or anomalies that I completely missed when conducting more formal statistical analyses. Human beings are remarkably adept at discerning relationships from visual representations. A well-crafted graph can help you make meaningful comparisons among thousands of pieces of information, extracting patterns not easily found through other methods. This is one reason why advances in the field of statistical 46 www.it-ebooks.info Working with graphs 47 graphics have had such a major impact on data analysis. Data analysts need to look at their data, and this is one area where R shines. 
In this chapter, we’ll review general methods for working with graphs. We’ll start with how to create and save graphs. Then we’ll look at how to modify the features that are found in any graph. These features include graph titles, axes, labels, colors, lines, symbols, and text annotations. Our focus will be on generic techniques that apply across graphs. (In later chapters, we’ll focus on specific types of graphs.) Finally, we’ll investigate ways to combine multiple graphs into one overall graph. 3.1 Working with graphs R is an amazing platform for building graphs. I’m using the term building intentionally. In a typical interactive session, you build a graph one statement at a time, adding features, until you have what you want. Consider the following five lines: attach(mtcars) plot(wt, mpg) abline(lm(mpg~wt)) title("Regression of MPG on Weight") detach(mtcars) The first statement attaches the data frame mtcars. The second statement opens a graphics window and generates a scatter plot between automobile weight on the horizontal axis and miles per gallon on the vertical axis. The third statement adds a line of best fit. The fourth statement adds a title. The final statement detaches the data frame. In R, graphs are typically created in this interactive fashion (see figure 3.1). You can save your graphs via code or through GUI menus. To save a graph via code, sandwich the statements that produce the graph between a statement that sets a destination and a statement that closes that destination. For example, the following will Figure 3.1 Creating a graph www.it-ebooks.info 48 CHAPTER 3 Getting started with graphs save the graph as a PDF document named mygraph.pdf in the current working directory: pdf("mygraph.pdf") attach(mtcars) plot(wt, mpg) abline(lm(mpg~wt)) title("Regression of MPG on Weight") detach(mtcars) dev.off() In addition to pdf(), you can use the functions win.metafile(), png(), jpeg(), bmp(), tiff(), xfig(), and postscript() to save graphs in other formats. (Note: The Windows metafile format is only available on Windows platforms.) See chapter 1, section 1.3.4 for more details on sending graphic output to files. Saving graphs via the GUI is platform specific. On a Windows platform, select File > Save As from the graphics window, and choose the format and location desired in the resulting dialog. On a Mac, choose File > Save As from the menu bar when the Quartz graphics window is highlighted. The only output format provided is PDF. On a Unix platform, graphs must be saved via code. In appendix A, we’ll consider alternative GUIs for each platform that will give you more options. Creating a new graph by issuing a high-level plotting command such as plot(), hist() (for histograms), or boxplot() typically overwrites a previous graph. How can you create more than one graph and still have access to each? There are several methods. First, you can open a new graph window before creating a new graph: dev.new() statements to create graph 1 dev.new() statements to create a graph 2 etc. Each new graph will appear in the most recently opened window. Second, you can access multiple graphs via the GUI. On a Mac platform, you can step through the graphs at any time using Back and Forward on the Quartz menu. On a Windows platform, you must use a two-step process. After opening the first graph window, choose History > Recording. Then use the Previous and Next menu items to step through the graphs that are created. 
Finally, you can use the functions dev.new(), dev.next(), dev.prev(), dev.set(), and dev.off() to have multiple graph windows open at one time and choose which output is sent to which windows. This approach works on any platform. See help(dev.cur) for details on this approach. R creates attractive graphs with a minimum of input on your part. But you can also use graphical parameters to specify fonts, colors, line styles, axes, reference lines, and annotations. This flexibility allows for a wide degree of customization. In this chapter, we’ll start with a simple graph and explore the ways you can modify and enhance it to meet your needs. Then we’ll look at more complex examples that www.it-ebooks.info 49 A simple example illustrate additional customization methods. The focus will be on techniques that you can apply to a wide range of the graphs you’ll create in R. The methods discussed here will work on all the graphs described in this book, with the exception of those created with the ggplot2 package in chapter 19. (The ggplot2 package has its own methods for customizing a graph’s appearance.) In other chapters, we’ll explore each specific type of graph and discuss where and when each is most useful. A simple example Let’s begin with the simple fictitious dataset given in table 3.1. It describes patient responses to two drugs at five dosage levels. Table 3.1 Patient responses to two drugs at five dosage levels Dosage Response to Drug A Response to Drug B 20 16 15 30 20 18 40 27 25 45 40 31 60 60 40 You can input this data using the following code: dose <- c(20, 30, 40, 45, 60) drugA <- c(16, 20, 27, 40, 60) drugB <- c(15, 18, 25, 31, 40) A simple line graph relating dose to response for drug A can be created using plot(dose, drugA, type="b") 50 40 drugA 30 axis, plots the (x, y) data points, and connects them with line segments. The option type="b" indicates that both points and lines should be plotted. Use help(plot) to view other options. The graph is displayed in figure 3.2. 60 plot() is a generic function that plots objects in R (its output varies according to the type of object being plotted). In this case, plot(x, y, type="b") places x on the horizontal axis and y on the vertical Figure 3.2 Line plot of dose vs. response for drug A 20 3.2 20 30 40 dose www.it-ebooks.info 50 60 50 CHAPTER 3 Getting started with graphs Line plots are covered in detail in chapter 11. Now let’s modify the appearance of this graph. Graphical parameters You can customize many features of a graph (fonts, colors, axes, and labels) through options called graphical parameters. One way is to specify these options through the par() function. Values set in this manner will be in effect for the rest of the session or until they’re changed. The format is par(optionname=value, optionname=value, ...). Specifying par() without parameters produces a list of the current graphical settings. Adding the no.readonly=TRUE option produces a list of current graphical settings that can be modified. Continuing the example, let’s say that you’d like to use a solid triangle rather than an open circle as your plotting symbol, and connect points using a dashed line rather than a solid line. You can do so with the following code: 50 40 dr ugA 30 The resulting graph is shown in figure 3.3. The first statement makes a copy of the current settings. The second statement changes the default line type to dashed (lty=2) and the default symbol for plotting points to a solid triangle (pch=17). 
You then generate the plot and restore the original settings. Line types and symbols are covered in section 3.3.1. You can have as many par() functions as desired, so par(lty=2, pch=17) could also be written as 60 opar <- par(no.readonly=TRUE) par(lty=2, pch=17) plot(dose, drugA, type="b") par(opar) 20 3.3 20 30 40 50 60 dose Figure 3.3 Line plot of dose vs. response for drug A with modified line type and symbol par(lty=2) par(pch=17) A second way to specify graphical parameters is by providing the optionname=value pairs directly to a high-level plotting function. In this case, the options are only in effect for that specific graph. You could generate the same graph with this code: plot(dose, drugA, type="b", lty=2, pch=17) www.it-ebooks.info 51 Graphical parameters Not all high-level plotting functions allow you to specify all possible graphical parameters. See the help for a specific plotting function (such as ?plot, ?hist, or ?boxplot) to determine which graphical parameters can be set in this way. The remainder of section 3.3 describes many of the important graphical parameters that you can set. 3.3.1 Symbols and lines As you’ve seen, you can use graphical parameters to specify the plotting symbols and lines used in your graphs. The relevant parameters are shown in table 3.2. Table 3.2 Parameters for specifying symbols and lines Parameter Description pch Specifies the symbol to use when plotting points (see figure 3.4). cex Specifies the symbol size. cex is a number indicating the amount by which plotting symbols should be scaled relative to the default. 1 = default, 1.5 is 50% larger, 0.5 is 50% smaller, and so forth. lty Specifies the line type (see figure 3.5). lwd Specifies the line width. lwd is expressed relative to the default (1 = default). For example, lwd=2 generates a line twice as wide as the default. The pch= option specifies the symbols to use when plotting points. Possible values are shown in figure 3.4. For symbols 21 through 25, you can also specify the border (col=) and fill (bg=) colors. Use lty= to specify the type of line desired. The option values are shown in figure 3.5. Taking these options together, the code plot(dose, drugA, type="b", lty=3, lwd=3, pch=15, cex=2) line types: lty= plot symbols: pch= 6 0 5 10 15 20 25 1 6 11 16 21 2 7 12 17 22 3 8 13 18 23 2 4 9 14 19 24 1 5 4 Figure 3.4 Plotting symbols specified with the pch parameter 3 Figure 3.5 Line types specified with the lty parameter www.it-ebooks.info Figure 3.6 Line plot of dose vs. response for drug A with modified line type, line width, symbol, and symbol width 50 40 drugA would produce a plot with a dotted line that was three times wider than the default width, connecting points displayed as filled squares that are twice as large as the default symbol size. The results are shown in figure 3.6. Next, let’s look at specifying colors. 60 Getting started with graphs 30 CHAPTER 3 20 52 20 30 40 50 60 dose 3.3.2 Colors There are several color-related parameters in R. Table 3.3 shows some of the common ones. Table 3.3 Parameters for specifying colors Parameter Description col Default plotting color. Some functions (such as lines and pie) accept a vector of values that are recycled. For example, if col=c("red", "blue") and three lines are plotted, the first line will be red, the second blue, and the third red. col.axis Color for axis text. col.lab Color for axis labels. col.main Color for titles. col.sub Color for subtitles. fg Color for the plot’s foreground. bg Color for the plot’s background. 
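To see a few of these parameters in action, the following sketch redraws the drug A line plot with arbitrary color choices (dose and drugA are the vectors defined earlier in this chapter):

plot(dose, drugA, type="b",
     col="red",              # color of the points and connecting lines
     col.axis="blue",        # color of the axis text
     col.lab="darkgreen",    # color of the axis labels
     col.main="purple",      # color of the title
     main="Dose vs. Response for Drug A")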
You can specify colors in R by index, name, hexadecimal, RGB, or HSV. For example, col=1, col="white", col="#FFFFFF", col=rgb(1,1,1), and col=hsv(0,0,1) are equivalent ways of specifying the color white. The function rgb()creates colors based on red-green-blue values, whereas hsv() creates colors based on hue-saturation values. See the help feature on these functions for more details. The function colors() returns all available color names. Earl F. Glynn has created an excellent online chart of R colors, available at http://mng.bz/9C5p. R also has a number of functions that can be used to create vectors of contiguous colors. These www.it-ebooks.info Graphical parameters 53 include rainbow(), heat.colors(), terrain.colors(), topo.colors(), and cm.colors(). For example, rainbow(10) produces 10 contiguous “rainbow” colors. The RColorBrewer package is particularly popular for creating attractive color palettes. Be sure to download it (install.packages("RColorBrewer")) before first use. Once it’s installed, use the brewer.pal(n, name) function to generate a vector of col- ors. For example, the code library(RColorBrewer) n <- 7 mycolors <- brewer.pal(n, "Set1") barplot(rep(1,n), col=mycolors) returns a vector of seven colors in hexadecimal format from the Set1 palette. To get a list of the available palettes, type brewer.pal.info; or type display.brewer.all() to produces a plot of each palette in a single display. See help(RColorBrewer) for more details. Finally, gray levels are generated with the gray() function in the base installation. In this case, you specify gray levels as a vector of numbers between 0 and 1. gray(0:10/10) produces 10 gray levels. Try the following code to see how this works: n <- 10 mycolors <- rainbow(n) pie(rep(1, n), labels=mycolors, col=mycolors) mygrays <- gray(0:n/n) pie(rep(1, n), labels=mygrays, col=mygrays) As you can see, R provides numerous methods for generating color vectors. You’ll see examples that use color parameters throughout this chapter. 3.3.3 Text characteristics Graphic parameters are also used to specify text size, font, and style. Parameters controlling text size are explained in table 3.4. Font family and style can be controlled with font options (see table 3.5). Table 3.4 Parameters specifying text size Parameter Description cex Number indicating the amount by which plotted text should be scaled relative to the default. 1 = default, 1.5 is 50% larger, 0.5 is 50% smaller, and so on. cex.axis Magnification of axis text relative to cex. cex.lab Magnification of axis labels relative to cex. cex.main Magnification of titles relative to cex. cex.sub Magnification of subtitles relative to cex. For example, all graphs created after the statement par(font.lab=3, cex.lab=1.5, font.main=4, cex.main=2) www.it-ebooks.info 54 CHAPTER 3 Getting started with graphs will have italic axis labels that are 1.5 times the default text size and bold italic titles that are twice the default text size. Table 3.5 Parameters specifying font family, size, and style Parameter Description font Integer specifying the font to use for plotted text. 1 = plain, 2 = bold, 3 = italic, 4 = bold italic, and 5=symbol (in Adobe symbol encoding). font.axis Font for axis text. font.lab Font for axis labels. font.main Font for titles. font.sub Font for subtitles. ps Font point size (roughly 1/72 inch). The text size = ps*cex. family Font family for drawing text. Standard values are serif, sans, and mono. Whereas font size and style are easily set, font family is a bit more complicated. 
This is because the mappings of serif, sans, and mono are device dependent. For example, on Windows platforms, mono is mapped to TT Courier New, serif is mapped to TT Times New Roman, and sans is mapped to TT Arial (TT stands for TrueType). If you’re satisfied with this mapping, you can use parameters like family="serif" to get the results you want. If not, you need to create a new mapping. On Windows, you can create this mapping via the windowsFont() function. For example, after issuing this statement, you can use A, B, and C as family values: windowsFonts( A=windowsFont("Arial Black"), B=windowsFont("Bookman Old Style"), C=windowsFont("Comic Sans MS") ) In this case, par(family="A") specifies an Arial Black font. (Listing 3.2 in section 3.4.2 provides an example of modifying text parameters.) Note that the windowsFont() function only works for Windows. On a Mac, use quartzFonts() instead. If graphs will be output in PDF or PostScript format, changing the font family is relatively straightforward. For PDFs, use names(pdfFonts())to find out which fonts are available on your system and pdf(file="myplot.pdf", family="fontname") to generate the plots. For graphs that are output in PostScript format, use names(postscriptFonts()) and postscript(file="myplot.ps", family="fontname"). See the online help for more information. 3.3.4 Graph and margin dimensions Finally, you can control the plot dimensions and margin sizes using the parameters listed in table 3.6. www.it-ebooks.info Graphical parameters Table 3.6 55 Parameters for graph and margin dimensions Parameter Description pin Plot dimensions (width, height) in inches. mai Numerical vector indicating margin size, where c(bottom, left, top, right) is expressed in inches. mar Numerical vector indicating margin size, where c(bottom, left, top, right) is expressed in lines. The default is c(5, 4, 4, 2) + 0.1. The code par(pin=c(4,3), mai=c(1,.5, 1, .2)) produces graphs that are 4 inches wide by 3 inches tall, with a 1-inch margin on the bottom and top, a 0.5-inch margin on the left, and a 0.2-inch margin on the right. For more on margins, see Earl F. Glynn’s comprehensive online tutorial (http://mng.bz/ 6aMp). Let’s use the options we’ve covered so far to enhance the simple example. The code in the following listing produces the graphs in figure 3.7. Listing 3.1 Using graphical parameters to control graph appearance dose <- c(20, 30, 40, 45, 60) drugA <- c(16, 20, 27, 40, 60) drugB <- c(15, 18, 25, 31, 40) opar <- par(no.readonly=TRUE) par(pin=c(2, 3)) par(lwd=2, cex=1.5) par(cex.axis=.75, font.axis=3) plot(dose, drugA, type="b", pch=19, lty=2, col="red") plot(dose, drugB, type="b", pch=23, lty=6, col="blue", bg="green") par(opar) First you enter your data as vectors, and then you save the current graphical parameter settings (so that you can restore them later). You modify the default graphical parameters so that graphs will be 2 inches wide by 3 inches tall. Additionally, lines will be twice the default width and symbols will be 1.5 times the default size. Axis text will be set to italic and scaled to 75% of the default. The first plot is then created using filled red circles and dashed lines. The second plot is created using filled green diamonds and a blue border and blue dashed lines. Finally, you restore the original graphical parameter settings. Note that parameters set with the par() function apply to both graphs, whereas parameters specified in the plot() functions only apply to that specific graph. 
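If the distinction isn't clear, this short sketch (reusing the dose, drugA, and drugB vectors from listing 3.1) may help:

opar <- par(no.readonly=TRUE)
par(lwd=2)                            # doubles the line width for every graph that follows
plot(dose, drugA, type="b")           # drawn with the session-wide line width of 2
plot(dose, drugB, type="b", lwd=4)    # lwd=4 affects this plot only
par(opar)                             # restores the original settings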
Looking at figure 3.7, you can see some limitations in the presentation. The graphs lack titles, and the vertical axes aren’t on the same scale, limiting your ability to compare the two drugs directly. The axis labels could also be more informative. www.it-ebooks.info 56 Getting started with graphs 30 25 drugB 40 15 20 20 30 drugA 50 35 40 60 CHAPTER 3 20 30 40 50 60 20 30 dose Figure 3.7 40 50 60 dose Line plot of dose vs. response for both drug A and drug B In the next section, we’ll turn to the customization of text annotations (such as titles and labels) and axes. For more information on the graphical parameters that are available, take a look at help(par). 3.4 Adding text, customized axes, and legends Many high-level plotting functions (for example, plot, hist, and boxplot) allow you to include axis and text options, as well as graphical parameters. For example, the following adds a title (main), a subtitle (sub), axis labels (xlab, ylab), and axis ranges (xlim, ylim). The results are presented in figure 3.8: plot(dose, drugA, type="b", col="red", lty=2, pch=2, lwd=2, main="Clinical Trials for Drug A", sub="This is hypothetical data", xlab="Dosage", ylab="Drug Response", xlim=c(0, 60), ylim=c(0, 70)) Again, not all functions allow you to add these options. See the help for the function of interest to see what options are accepted. For finer control and for modularization, you can use the functions described in the remainder of this section to control titles, axes, legends, and text annotations. NOTE Some high-level plotting functions include default titles and labels. You can remove them by adding ann=FALSE in the plot() statement or in a separate par() statement. 3.4.1 Titles Use the title() function to add a title and axis labels to a plot. The format is title(main="main title", sub="subtitle", xlab="x-axis label", ylab="y-axis label") www.it-ebooks.info Adding text, customized axes, and legends 57 40 30 10 20 Drug Response 50 60 70 Clinical Trials for Drug A 0 Figure 3.8 Line plot of dose vs. response for drug A with title, subtitle, and modified axes 0 10 20 30 40 50 60 Dosage This is hypothetical data Graphical parameters (such as text size, font, rotation, and color) can also be specified in title(). For example, the following code produces a red title and a blue subtitle, and creates green x and y labels that are 25% smaller than the default text size: title(main="My Title", col.main="red", sub="My Subtitle", col.sub="blue", xlab="My X label", ylab="My Y label", col.lab="green", cex.lab=0.75) The title() function is typically used to add information to a plot in which the default title and axis labels have been suppressed via the ann=FALSE option. 3.4.2 Axes Rather than use R’s default axes, you can create custom axes with the axis() function. The format is axis(side, at=, labels=, pos=, lty=, col=, las=, tck=, ...) where each parameter is described in table 3.7. Table 3.7 Axis options Option Description side Integer indicating the side of the graph on which to draw the axis (1 = bottom, 2 = left, 3 = top, and 4 = right). at Numeric vector indicating where tick marks should be drawn. www.it-ebooks.info 58 CHAPTER 3 Table 3.7 Getting started with graphs Axis options (continued) Option Description labels Character vector of labels to be placed at the tick marks (if NULL, the at values are used). pos Coordinate at which the axis line is to be drawn (that is, the value on the other axis where it crosses). lty Line type. col Line and tick mark color. 
las Specifies that labels are parallel (= 0) or perpendicular (= 2) to the axis. tck Length of each tick mark as a fraction of the plotting region (a negative number is outside the graph, a positive number is inside, 0 suppresses ticks, and 1 creates gridlines). The default is –0.01. (...) Other graphical parameters. When creating a custom axis, you should suppress the axis that’s automatically generated by the high-level plotting function. The option axes=FALSE suppresses all axes (including all axis frame lines, unless you add the option frame.plot=TRUE). The options xaxt="n" and yaxt="n" suppress the x-axis and y-axis, respectively (leaving the frame lines, without ticks). Listing 3.2 is a somewhat silly and overblown example that demonstrates each of the features we’ve discussed so far. The resulting graph is presented in figure 3.9. An Example of Creative Axes 10 10 9 8 Y=X 7 6 y=1/x 5 5 4 3.33 3 2.5 2 2 1 1.67 1.43 1.25 1.11 1 2 4 6 8 10 X values www.it-ebooks.info Figure 3.9 A demonstration of axis options 59 Adding text, customized axes, and legends Listing 3.2 An example of custom axes x <y <z <opar c(1:10) x 10/x <- par(no.readonly=TRUE) Specifies data Increases margins par(mar=c(5, 4, 4, 8) + 0.1) plot(x, y, type="b", Plots x vs. y, suppressing annotations pch=21, col="red", yaxt="n", lty=3, ann=FALSE) lines(x, z, type="b", pch=22, col="blue", lty=2) Adds an x versus 1/x line axis(2, at=x, labels=x, col.axis="red", las=2) axis(4, at=z, labels=round(z, digits=2), col.axis="blue", las=2, cex.axis=0.7, tck=-.01) Draws the axes mtext("y=1/x", side=4, line=3, cex.lab=1, las=2, col="blue") title("An Example of Creative Axes", xlab="X values", ylab="Y=X") Adds titles and text par(opar) At this point, we’ve covered everything in listing 3.2 except the line() and mtext() statements. A plot() statement starts a new graph. By using line() instead, you can add new graph elements to an existing graph. You’ll use it again when you plot the response of drug A and drug B on the same graph in section 3.4.4. The mtext() function is used to add text to the margins of the plot. mtext() is covered in section 3.4.5, and line() is covered more fully in chapter 11. Minor tick marks Notice that each of the graphs you’ve created so far has major tick marks but not minor tick marks. To create minor tick marks, you need the minor.tick() function in the Hmisc package. If you don’t already have Hmisc installed, be sure to install it first (see chapter 1, section 1.4.2). You can add minor tick marks with the code library(Hmisc) minor.tick(nx=n, ny=n, tick.ratio=n) where nx and ny specify the number of intervals into which to divide the area between major tick marks on the x-axis and y-axis, respectively. tick.ratio is the size of the minor tick mark relative to the major tick mark. The current length of the major tick mark can be retrieved using par("tck"). For example, the following statement adds one tick mark between each major tick mark on the x-axis and two tick marks between each major tick mark on the y-axis: minor.tick(nx=2, ny=3, tick.ratio=0.5) These tick marks will be 50% as long as the major tick marks. An example of minor tick marks is given in section 3.4.4 (listing 3.3 and figure 3.10). www.it-ebooks.info 60 3.4.3 CHAPTER 3 Getting started with graphs Reference lines The abline() function is used to add reference lines to a graph. The format is abline(h=yvalues, v=xvalues) Other graphical parameters (such as line type, color, and width) can also be specified in the abline() function. 
For example abline(h=c(1,5,7)) adds solid horizontal lines at y = 1, 5, and 7, whereas the code abline(v=seq(1, 10, 2), lty=2, col="blue") adds dashed blue vertical lines at x = 1, 3, 5, 7, and 9. Listing 3.3, in the next section, creates a reference line for the drug example at y = 30. The resulting graph is displayed in figure 3.10 (also in the next section). 3.4.4 Legend When more than one set of data or group is incorporated into a graph, a legend can help you to identify what’s being represented by each bar, pie slice, or line. A legend can be added (not surprisingly) with the legend() function. The format is legend(location, title, legend, ...) The common options are described in table 3.8. Table 3.8 Legend options Option Description location There are several ways to indicate the location of the legend. You can give an x,y coordinate for its upper-left corner. You can use locator(1), in which case you use the mouse to indicate the legend’s location. You can also use the keyword bottom, bottomleft, left, topleft, top, topright, right, bottomright, or center to place the legend in the graph. If you use one of these keywords, you can also use inset= to specify an amount to move the legend into the graph (as a fraction of the plot region). title Character string for the legend title (optional). legend Character vector with the labels. ... Other options. If the legend labels colored lines, specify col= and a vector of colors. If the legend labels point symbols, specify pch= and a vector of point symbols. If the legend labels line width or line style, use lwd= or lty= and a vector of widths or styles. To create colored boxes for the legend (common in bar, box, and pie charts), use fill= and a vector of colors. Other common legend options include bty for box type, bg for background color, cex for size, and text.col for text color. Specifying horiz=TRUE sets the legend horizontally rather than vertically. For more on legends, see help(legend). The examples in the help file are particularly informative. www.it-ebooks.info 61 Adding text, customized axes, and legends Let’s take a look at an example using the drug data (listing 3.3). Again, you’ll use a number of the features that we’ve covered up to this point. The resulting graph is presented in figure 3.10. Listing 3.3 Comparing drug A and drug B response by dose dose <- c(20, 30, 40, 45, 60) drugA <- c(16, 20, 27, 40, 60) drugB <- c(15, 18, 25, 31, 40) opar <- par(no.readonly=TRUE) Increases line, text, symbol, and label size par(lwd=2, cex=1.5, font.lab=2) plot(dose, drugA, type="b", pch=15, lty=1, col="red", ylim=c(0, 60), main="Drug A vs. Drug B", xlab="Drug Dosage", ylab="Drug Response") Generates the graph lines(dose, drugB, type="b", pch=17, lty=2, col="blue") abline(h=c(30), lwd=1.5, lty=2, col="gray") library(Hmisc) minor.tick(nx=3, ny=3, tick.ratio=0.5) Adds minor tick marks legend("topleft", inset=.05, title="Drug Type", c("A","B") lty=c(1, 2), pch=c(15, 17), col=c("red", "blue")) Adds a legend par(opar) Text can be added to graphs using the text() and mtext() functions. text() places text within the graph, whereas mtext() places text in one of the four margins. The formats are 60 50 20 30 40 A B 10 Text annotations Drug Type 0 3.4.5 Drug A vs. Drug B Drug Response Almost all aspects of the graph in figure 3.10 can be modified using the options discussed in this chapter. Additionally, there are many ways to specify the options desired. The final annotation to consider is the addition of text to the plot itself. 
This topic is covered in the next section. 20 30 40 50 Drug Dosage Figure 3.10 and drug B www.it-ebooks.info An annotated comparison of drug A 60 62 CHAPTER 3 Getting started with graphs text(location, "text to place", pos, ...) mtext("text to place", side, line=n, ...) and the common options are described in table 3.9. Other common options are cex, col, and font (for size, color, and font style, respectively). Table 3.9 Options for the text() and mtext() functions Option Description location Location can be an x,y coordinate. Alternatively, you can place the text interactively via mouse by specifying location as locator(1). pos Position relative to location. 1 = below, 2 = left, 3 = above, and 4 = right. If you specify pos, you can specify offset= as a percentage of character width. side Which margin to place text in, where 1 = bottom, 2 = left, 3 = top, and 4 = right. You can specify line= to indicate the line in the margin, starting with 0 (closest to the plot area) and moving out. You can also specify adj=0 for left/bottom alignment or adj=1 for top/right alignment. The text() function is typically used for labeling points as well as for adding other text annotations. Specify location as a set of x,y coordinates, and specify the text to place as a vector of labels. The x, y, and label vectors should all be the same length. An example is given next, and the resulting graph is shown in figure 3.11: Mileage vs. Car Weight Toyota Corolla 30 Fiat 128 Lotus Europa Honda Civic Fiat X1−9 25 Porsche 914−2 Mileage Merc 240D Merc 230 Datsun 710 20 Toyota Corona Volvo 142E Hornet 4 Drive Mazda RX4 Mazda RX4 Wag Ferrari Dino Merc 280 Pontiac Firebird Hornet Sportabout Valiant Merc 280C Merc 450SL 15 Merc 450SE Ford Pantera L Dodge Challenger Merc 450SLC AMCMaserati Javelin Bora Chrys Duster 360 Camaro Z28 10 Cadillac Lin 2 3 4 Weight www.it-ebooks.info 5 Figure 3.11 Example of a scatter plot (car weight vs. mileage) with labeled points (car make and model) Adding text, customized axes, and legends 63 attach(mtcars) plot(wt, mpg, main="Mileage vs. Car Weight", xlab="Weight", ylab="Mileage", pch=18, col="blue") text(wt, mpg, row.names(mtcars), cex=0.6, pos=4, col="red") detach(mtcars) This example plots car mileage versus car weight for the 32 automobile makes provided in the mtcars data frame. The text() function is used to add the car make to the right of each data point. The point labels are shrunk by 40% and presented in red. As a second example, the following code can be used to display font families: opar <- par(no.readonly=TRUE) par(cex=1.5) plot(1:7,1:7,type="n") text(3,3,"Example of default text") text(4,4,family="mono","Example of mono-spaced text") text(5,5,family="serif","Example of serif text") par(opar) The results, produced on a Windows platform, are shown in figure 3.12. Here the par() function was used to increase the font size to produce a better display. The resulting plot will differ from platform to platform, because plain, mono, and serif text are mapped to different font families on different systems. What does it look like on yours? Math annotations 4 5 6 7 Finally, you can add mathematical symbols and formulas to a graph using TeX-like rules. See help(plotmath) for details and examples. You can also try demo(plotmath) 2 3 Example of default text Figure 3.12 Examples of font families on a Windows platform 1 1:7 3.4.6 1 2 3 4 5 6 1:7 www.it-ebooks.info 7 64 CHAPTER 3 Getting started with graphs Figure 3.13 Partial results from demo(plotmath) to see this in action. 
A portion of the results is presented in figure 3.13. The plotmath() function can be used to add mathematical symbols to titles, axis labels, or text annotations in the body or margins of a graph. You can often gain greater insight into your data by comparing several graphs at one time. So, we’ll end this chapter by looking at ways to combine more than one graph into a single image. 3.5 Combining graphs R makes it easy to combine several graphs into one overall graph, using either the par() or layout() function. At this point, don’t worry about the specific types of graphs being combined; our focus here is on the general methods used to combine them. The creation and interpretation of each graph type are covered in later chapters. With the par() function, you can include the graphical parameter mfrow=c(nrows, ncols) to create a matrix of nrows × ncols plots that are filled in by row. Alternatively, you can use mfcol=c(nrows, ncols) to fill the matrix by columns. For example, the following code creates four plots and arranges them into two rows and two columns: www.it-ebooks.info 65 Combining graphs attach(mtcars) opar <- par(no.readonly=TRUE) par(mfrow=c(2,2)) plot(wt,mpg, main="Scatterplot of wt vs. mpg") plot(wt,disp, main="Scatterplot of wt vs. disp") hist(wt, main="Histogram of wt") boxplot(wt, main="Boxplot of wt") par(opar) detach(mtcars) The results are presented in figure 3.14. As a second example, let’s arrange three plots in three rows and one column. Here’s the code: attach(mtcars) opar <- par(no.readonly=TRUE) par(mfrow=c(3,1)) hist(wt) hist(mpg) hist(disp) par(opar) detach(mtcars) Scatterplot of wt vs. disp 200 300 disp 20 10 100 15 mpg 25 30 400 Scatterplot of wt vs. mpg 3 4 5 3 4 wt Histogram of wt Boxplot of wt 8 4 6 3 4 0 2 2 Frequency 2 wt 5 2 2 3 4 5 wt Figure 3.14 Graph combining four figures through par(mfrow=c(2,2)) www.it-ebooks.info 5 66 CHAPTER 3 Getting started with graphs The graph is displayed in figure 3.15. Note that the high-level function hist() includes a default title (use main="" to suppress it, or ann=FALSE to suppress all titles and labels). The layout() function has the form layout(mat), where mat is a matrix object specifying the location of the multiple plots to combine. In the following code, one figure is placed in row 1 and two figures are placed in row 2: attach(mtcars) layout(matrix(c(1,1,2,3), 2, 2, byrow = TRUE)) hist(wt) hist(mpg) hist(disp) detach(mtcars) The resulting graph is presented in figure 3.16. 6 4 2 0 Frequency 8 Histogram of wt 2 3 4 5 wt Frequency 0 2 4 6 8 12 Histogram of mpg 10 15 20 25 30 35 mpg 6 4 2 0 Frequency Histogram of disp 100 200 300 400 disp Figure 3.15 Graph combining three figures through par(mfrow=c(3,1)) www.it-ebooks.info 500 67 Combining graphs 6 4 0 2 Frequency 8 Histogram of wt 2 3 4 5 wt Histogram of disp 5 4 2 3 Frequency 8 6 4 0 0 1 2 Frequency 10 6 7 12 Histogram of mpg 10 15 20 25 30 35 100 mpg Figure 3.16 200 300 400 500 disp Graph combining three figures using the layout() function with default widths Optionally, you can include widths= and heights= options in the layout() function to control the size of each figure more precisely. These options have the following form: ■ ■ widths—A vector of values for the widths of columns heights—A vector of values for the heights of rows Relative widths are specified with numeric values. Absolute widths (in centimeters) are specified with the lcm() function. In the following code, one figure is again placed in row 1 and two figures are placed in row 2. 
But the figure in row 1 is one-third the height of the figures in row 2. Additionally, the figure in the bottom-right cell is one-fourth the width of the figure in the bottom-left cell: attach(mtcars) layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE), widths=c(3, 1), heights=c(1, 2)) hist(wt) hist(mpg) hist(disp) detach(mtcars) www.it-ebooks.info 68 CHAPTER 3 Getting started with graphs 8 4 0 Frequency Histogram of wt 2 3 4 5 wt Histogram of disp 4 3 Frequency 6 0 0 1 2 2 4 Frequency 8 5 10 6 7 12 Histogram of mpg 10 15 20 25 30 mpg Figure 3.17 35 100 400 disp Graph combining three figures using the layout() function with specified widths The graph is presented in figure 3.17. As you can see, layout() gives you easy control over both the number and placement of graphs in a final image and the relative sizes of these graphs. See help(layout) for more details. 3.5.1 Creating a figure arrangement with fine control There are times when you want to arrange or superimpose several figures to create a single meaningful plot. Doing so requires fine control over the placement of the figures. You can accomplish this with the fig= graphical parameter. In the following listing, two box plots are added to a scatter plot to create a single enhanced graph. The resulting graph is shown in figure 3.18. Listing 3.4 Fine placement of figures in a graph opar <- par(no.readonly=TRUE) par(fig=c(0, 0.8, 0, 0.8)) plot(mtcars$wt, mtcars$mpg, xlab="Miles Per Gallon", ylab="Car Weight") www.it-ebooks.info Sets up the scatter plot 69 Combining graphs par(fig=c(0, 0.8, 0.55, 1), new=TRUE) boxplot(mtcars$wt, horizontal=TRUE, axes=FALSE) par(fig=c(0.65, 1, 0, 0.8), new=TRUE) boxplot(mtcars$mpg, axes=FALSE) Adds a box plot above Adds a box plot to the right mtext("Enhanced Scatterplot", side=3, outer=TRUE, line=-3) par(opar) 20 15 Car Weight 25 30 Enhanced Scatterplot 10 Figure 3.18 A scatter plot with two box plots added to the margins 2 3 4 5 Miles Per Gallon To understand how this graph is created, think of the full graph area as going from (0,0) in the lower-left corner to (1,1) in the upper-right corner. Figure 3.19 will help you visualize this. The format of the fig= parameter is a numerical vector of the form c(x1, x2, y1, y2). The first fig= sets up the scatter plot going from 0 to 0.8 on the x-axis and 0 to 0.8 on the y-axis. The top box (1,1) y2 y1 x1 (0,0) Figure 3.19 Specifying locations using the fig= graphical parameter www.it-ebooks.info x2 70 CHAPTER 3 Getting started with graphs plot goes from 0 to 0.8 on the x-axis and 0.55 to 1 on the y-axis. The box plot on the right goes from 0.65 to 1 on the x-axis and 0 to 0.8 on the y-axis. fig= starts a new plot, so when you add a figure to an existing graph, include the new=TRUE option. I chose 0.55 rather than 0.8 so that the top figure would be pulled closer to the scatter plot. Similarly, I chose 0.65 to pull the box plot on the right closer to the scatter plot. You have to experiment to get the placement correct. The amount of space needed for individual subplots can be device dependent. If you get “Error in plot.new(): figure margins too large,” try varying the area given for each portion of the overall graph. NOTE You can use the fig= graphical parameter to combine several plots into any arrangement within a single graph. With a little practice, this approach gives you a great deal of flexibility when creating complex visual presentations. 3.6 Summary In this chapter, we reviewed methods for creating graphs and saving them in a variety of formats. 
The majority of the chapter was concerned with modifying the default graphs produced by R, in order to arrive at more useful or attractive plots. You learned how to modify a graph’s axes, fonts, symbols, lines, and colors, as well as how to add titles, subtitles, labels, plotted text, legends, and reference lines. You saw how to specify the size of the graph and margins, and how to combine multiple graphs into a single useful image. Our focus in this chapter was on general techniques that you can apply to all graphs (with the exception of ggplot2 graphs, discussed in chapter 19). Later chapters look at specific types of graphs. For example, chapter 6 covers methods for graphing a single variable. Graphing relationships between variables will be described in chapter 11. In chapter 19, we discuss advanced graphic methods, including innovative methods for displaying multivariate data. In other chapters, we’ll discuss methods of visualizing data that are particularly useful for the statistical approaches under consideration. Graphs are a central part of modern data analysis, and I’ll endeavor to incorporate them into each of the statistical approaches we discuss. In the previous chapter, we discussed a range of methods for inputting or importing data into R. Unfortunately, in the real world, your data is rarely usable in the format in which you first get it. The next chapter looks at ways to transform and massage your data into a state that’s more useful and conducive to analysis. www.it-ebooks.info Basic data management This chapter covers ■ Manipulating dates and missing values ■ Understanding data type conversions ■ Creating and recoding variables ■ Sorting, merging, and subsetting datasets ■ Selecting and dropping variables In chapter 2, we covered a variety of methods for importing data into R. Unfortunately, getting your data in the rectangular arrangement of a matrix or data frame is only the first step in preparing it for analysis. To paraphrase Captain Kirk in the Star Trek episode “A Taste of Armageddon” (and proving my geekiness once and for all), “Data is a messy business—a very, very messy business.” In my own work, as much as 60% of the time I spend on data analysis is focused on preparing the data for analysis. I’ll go out a limb and say that the same is probably true in one form or another for most real-world data analysts. Let’s take a look at an example. 4.1 A working example One of the topics that I study in my current job is how men and women differ in the ways they lead their organizations. Typical questions might be 71 www.it-ebooks.info 72 CHAPTER 4 ■ ■ Basic data management Do men and women in management positions differ in the degree to which they defer to superiors? Does this vary from country to country, or are these gender differences universal? One way to address these questions is to have bosses in multiple countries rate their managers on deferential behavior, using questions like the following: This manager asks my opinion before making personnel decisions. 1 2 3 4 5 strongly disagree disagree neither agree nor disagree agree strongly agree The resulting data might resemble that in table 4.1. Each row represents the ratings given to a manager by his or her boss. 
Table 4.1 Gender differences in leadership behavior

Manager  Date      Country  Gender  Age  q1  q2  q3  q4  q5
1        10/24/14  US       M       32   5   4   5   5   5
2        10/28/14  US       F       45   3   5   2   5   5
3        10/01/14  UK       F       25   3   5   5   5   2
4        10/12/14  UK       M       39   3   3   4
5        05/01/14  UK       F       99   2   2   1   2   1

Here, each manager is rated by their boss on five statements (q1 to q5) related to deference to authority. For example, manager 1 is a 32-year-old male working in the US and is rated deferential by his boss, whereas manager 5 is a female of unknown age (99 probably indicates that the information is missing) working in the UK and is rated low on deferential behavior. The Date column captures when the ratings were made.

Although a dataset might have dozens of variables and thousands of observations, we've included only 10 columns and 5 rows to simplify the examples. Additionally, we've limited the number of items pertaining to the managers' deferential behavior to five. In a real-world study, you'd probably use 10–20 such items to improve the reliability and validity of the results. You can create a data frame containing the data in table 4.1 using the following code.

Listing 4.1 Creating the leadership data frame

manager <- c(1, 2, 3, 4, 5)
date <- c("10/24/08", "10/28/08", "10/1/08", "10/12/08", "5/1/09")
country <- c("US", "US", "UK", "UK", "UK")
gender <- c("M", "F", "F", "M", "F")
age <- c(32, 45, 25, 39, 99)
q1 <- c(5, 3, 3, 3, 2)
q2 <- c(4, 5, 5, 3, 2)
q3 <- c(5, 2, 5, 4, 1)
q4 <- c(5, 5, 5, NA, 2)
q5 <- c(5, 5, 2, NA, 1)
leadership <- data.frame(manager, date, country, gender, age,
                         q1, q2, q3, q4, q5, stringsAsFactors=FALSE)

In order to address the questions of interest, you must first deal with several data-management issues. Here's a partial list:

■ The five ratings (q1 to q5) need to be combined, yielding a single mean deferential score for each manager.
■ In surveys, respondents often skip questions. For example, the boss rating manager 4 skipped questions 4 and 5. You need a method of handling incomplete data. You also need to recode values like 99 for age to missing.
■ There may be hundreds of variables in a dataset, but you may only be interested in a few. To simplify matters, you'll want to create a new dataset with only the variables of interest.
■ Past research suggests that leadership behavior may change as a function of the manager's age. To examine this, you may want to recode the current values of age into a new categorical age grouping (for example, young, middle-aged, elder).
■ Leadership behavior may change over time. You might want to focus on deferential behavior during the recent global financial crisis. To do so, you may want to limit the study to data gathered during a specific period of time (say, January 1, 2009 to December 31, 2009).

We'll work through each of these issues in this chapter, as well as other basic data-management tasks such as combining and sorting datasets. Then, in chapter 5, we'll look at some advanced topics.

4.2 Creating new variables

In a typical research project, you'll need to create new variables and transform existing ones. This is accomplished with statements of the form

variable <- expression

A wide array of operators and functions can be included in the expression portion of the statement. Table 4.2 lists R's arithmetic operators. Arithmetic operators are used when developing formulas.

Table 4.2 Arithmetic operators

Operator    Description
+           Addition.
-           Subtraction.
www.it-ebooks.info 74 CHAPTER 4 Table 4.2 Basic data management Arithmetic operators (continued) Operator Description * Multiplication. / Division. ^ or ** Exponentiation. x%%y Modulus (x mod y): for example, 5%%2 is 1. x%/%y Integer division: for example, 5%/%2 is 2. Let’s say you have a data frame named mydata, with variables x1 and x2, and you want to create a new variable sumx that adds these two variables and a new variable called meanx that averages the two variables. If you use the code sumx <- x1 + x2 meanx <- (x1 + x2)/2 you’ll get an error, because R doesn’t know that x1 and x2 are from the data frame mydata. If you use this code instead sumx <- mydata$x1 + mydata$x2 meanx <- (mydata$x1 + mydata$x2)/2 the statements will succeed but you’ll end up with a data frame (mydata) and two separate vectors (sumx and meanx). This probably isn’t the result you want. Ultimately, you want to incorporate new variables into the original data frame. The following listing provides three separate ways to accomplish this goal. The one you choose is up to you; the results will be the same. Listing 4.2 Creating new variables mydata<-data.frame(x1 = c(2, 2, 6, 4), x2 = c(3, 4, 2, 8)) mydata$sumx <- mydata$x1 + mydata$x2 mydata$meanx <- (mydata$x1 + mydata$x2)/2 attach(mydata) mydata$sumx <- x1 + x2 mydata$meanx <- (x1 + x2)/2 detach(mydata) mydata <- transform(mydata, sumx = x1 + x2, meanx = (x1 + x2)/2) Personally, I prefer the third method, exemplified by the use of the transform() function. It simplifies inclusion of as many new variables as desired and saves the results to the data frame. www.it-ebooks.info Recoding variables 4.3 75 Recoding variables Recoding involves creating new values of a variable conditional on the existing values of the same and/or other variables. For example, you may want to ■ ■ ■ Change a continuous variable into a set of categories Replace miscoded values with correct values Create a pass/fail variable based on a set of cutoff scores To recode data, you can use one or more of R’s logical operators (see table 4.3). Logical operators are expressions that return TRUE or FALSE. Table 4.3 Logical operators Operator Description < Less than <= Less than or equal to > Greater than >= Greater than or equal to == Exactly equal to != Not equal to !x Not x x | y x or y x & y x and y isTRUE(x) Tests whether x is TRUE Let’s say you want to recode the ages of the managers in the leadership dataset from the continuous variable age to the categorical variable agecat (Young, Middle Aged, Elder). First, you must recode the value 99 for age to indicate that the value is missing using code such as leadership$age[leadership$age == 99] <- NA The statement variable[condition] <- expression will only make the assignment when condition is TRUE. Once missing values for age have been specified, you can then use the following code to create the agecat variable: leadership$agecat[leadership$age > 75] leadership$agecat[leadership$age >= 55 & leadership$age <= 75] leadership$agecat[leadership$age < 55] <- "Elder" <- "Middle Aged" <- "Young" You include the data-frame names in leadership$agecat to ensure that the new variable is saved back to the data frame. (I defined middle aged as 55 to 75 so I won’t feel www.it-ebooks.info 76 CHAPTER 4 Basic data management so old.) Note that if you hadn’t recoded 99 as missing for age first, manager 5 would’ve erroneously been given the value “Elder” for agecat. 
This code can be written more compactly as follows: leadership <- within(leadership,{ agecat <- NA agecat[age > 75] <- "Elder" agecat[age >= 55 & age <= 75] <- "Middle Aged" agecat[age < 55] <- "Young" }) The within() function is similar to the with() function (section 2.2.4), but it allows you to modify the data frame. First the variable agecat is created and set to missing for each row of the data frame. Then the remaining statements within the braces are executed in order. Remember that agecat is a character variable; you’re likely to want to turn it into an ordered factor, as explained in section 2.2.5. Several packages offer useful recoding functions; in particular, the car package’s recode() function recodes numeric and character vectors and factors very simply. The package doBy offers recodeVar(), another popular function. Finally, R ships with cut(), which allows you to divide the range of a numeric variable into intervals, returning a factor. 4.4 Renaming variables If you’re not happy with your variable names, you can change them interactively or programmatically. Let’s say you want to change the variable manager to managerID and date to testDate. You can use the following statement to invoke an interactive editor: fix(leadership) Then you click the variable names and rename them in the dialogs that are presented (see figure 4.1). Figure 4.1 Renaming variables interactively using the fix() function www.it-ebooks.info 77 Missing values Programmatically, you can rename variables via the names() function. For example, this statement names(leadership)[2] <- "testDate" renames date to testDate as demonstrated in the following code: > names(leadership) [1] "manager" "date" "country" "gender" "age" [8] "q3" "q4" "q5" > names(leadership)[2] <- "testDate" > leadership manager testDate country gender age q1 q2 q3 q4 q5 1 1 10/24/08 US M 32 5 4 5 5 5 2 2 10/28/08 US F 45 3 5 2 5 5 3 3 10/1/08 UK F 25 3 5 5 5 2 4 4 10/12/08 UK M 39 3 3 4 NA NA 5 5 5/1/09 UK F 99 2 2 1 2 1 "q1" "q2" In a similar fashion, the statement names(leadership)[6:10] <- c("item1", "item2", "item3", "item4", "item5") renames q1 through q5 to item1 through item5. Finally, the plyr package has a rename() function that’s useful for altering the names of variables. The plyr package isn’t installed by default, so you’ll need to install it on first use using the install.packages("plyr") command. The format of the rename() function is rename(dataframe, c(oldname="newname", oldname="newname",...)) Here’s an example with the leadership data: library(plyr) leadership <- rename(leadership, c(manager="managerID", date="testDate")) The plyr package has a powerful set of functions for manipulating datasets. You can learn more about it at http://had.co.nz/plyr. 4.5 Missing values In a project of any size, data is likely to be incomplete because of missed questions, faulty equipment, or improperly coded data. In R, missing values are represented by the symbol NA (not available). Unlike programs such as SAS, R uses the same missingvalue symbol for character and numeric data. R provides a number of functions for identifying observations that contain missing values. The function is.na() allows you to test for the presence of missing values. Assume that you have this vector: y <- c(1, 2, 3, NA) Then the following function returns c(FALSE, FALSE, FALSE, TRUE): is.na(y) www.it-ebooks.info 78 CHAPTER 4 Basic data management Notice how the is.na() function works on an object. 
It returns an object of the same size, with the entries replaced by TRUE if the element is a missing value or FALSE if the element isn’t a missing value. The following listing applies this to the leadership example. Listing 4.3 Applying the is.na() function > is.na(leadership[,6:10]) q1 q2 q3 q4 [1,] FALSE FALSE FALSE FALSE [2,] FALSE FALSE FALSE FALSE [3,] FALSE FALSE FALSE FALSE [4,] FALSE FALSE FALSE TRUE [5,] FALSE FALSE FALSE FALSE q5 FALSE FALSE FALSE TRUE FALSE Here, leadership[,6:10] limits the data frame to columns 6 to 10, and is.na() identifies which values are missing. There are two important things to keep in mind when you’re working with missing values in R. First, missing values are considered noncomparable, even to themselves. This means you can’t use comparison operators to test for the presence of missing values. For example, the logical test myvar == NA is never TRUE. Instead, you have to use missing-value functions like is.na() to identify the missing values in R data objects. Second, R doesn’t represent infinite or impossible values as missing values. Again, this is different than the way other programs like SAS handle such data. Positive and negative infinity are represented by the symbols Inf and –Inf, respectively. Thus 5/0 returns Inf. Impossible values (for example, sin(Inf)) are represented by the symbol NaN (not a number). To identify these values, you need to use is.infinite() or is.nan(). 4.5.1 Recoding values to missing As demonstrated in section 4.3, you can use assignments to recode values to missing. In the leadership example, missing age values are coded as 99. Before analyzing this dataset, you must let R know that the value 99 means missing in this case (otherwise, the mean age for this sample of bosses will be way off!). You can accomplish this by recoding the variable: leadership$age[leadership$age == 99] <- NA Any value of age that’s equal to 99 is changed to NA. Be sure that any missing data is properly coded as missing before you analyze the data, or the results will be meaningless. 4.5.2 Excluding missing values from analyses Once you’ve identified missing values, you need to eliminate them in some way before analyzing your data further. The reason is that arithmetic expressions and functions that contain missing values yield missing values. For example, consider the following code: x <- c(1, 2, NA, 3) y <- x[1] + x[2] + x[3] + x[4] z <- sum(x) www.it-ebooks.info 79 Date values Both y and z will be NA (missing) because the third element of x is missing. Luckily, most numeric functions have an na.rm=TRUE option that removes missing values prior to calculations and applies the function to the remaining values: x <- c(1, 2, NA, 3) y <- sum(x, na.rm=TRUE) Here, y is equal to 6. When using a function with incomplete data, be sure to check how that function handles missing data by looking at its online help (for example, help(sum)). The sum() function is only one of many functions we’ll consider in chapter 5. Functions allow you to transform data with flexibility and ease. You can remove any observation with missing data by using the na.omit() function. na.omit() deletes any rows with missing data. Let’s apply this to the leadership dataset in the following listing. 
Listing 4.4 Using na.omit() to delete incomplete observations > leadership manager date country gender age q1 q2 q3 q4 q5 1 1 10/24/08 US M 32 5 4 5 5 5 2 2 10/28/08 US F 40 3 5 2 5 5 3 3 10/01/08 UK F 25 3 5 5 5 2 4 4 10/12/08 UK M 39 3 3 4 NA NA 5 5 05/01/09 UK F NA 2 2 1 2 1 > newdata > newdata manager 1 1 2 2 3 3 Data frame with missing data <- na.omit(leadership) date country gender age q1 q2 q3 q4 q5 10/24/08 US M 32 5 4 5 5 5 10/28/08 US F 40 3 5 2 5 5 10/01/08 UK F 25 3 5 5 5 2 Data frame with complete cases only Any rows containing missing data are deleted from leadership before the results are saved to newdata. Deleting all observations with missing data (called listwise deletion) is one of several methods of handling incomplete datasets. If there are only a few missing values or they’re concentrated in a small number of observations, listwise deletion can provide a good solution to the missing-values problem. But if missing values are spread throughout the data or there’s a great deal of missing data in a small number of variables, listwise deletion can exclude a substantial percentage of your data. We’ll explore several more sophisticated methods of dealing with missing values in chapter 18. Next, let’s look at dates. 4.6 Date values Dates are typically entered into R as character strings and then translated into date variables that are stored numerically. The function as.Date() is used to make this translation. The syntax is as.Date(x, "input_format"), where x is the character data and input_format gives the appropriate format for reading the date (see table 4.4). www.it-ebooks.info 80 CHAPTER 4 Table 4.4 Symbol Basic data management Date formats Meaning Example %d Day as a number (0–31) 01–31 %a %A Abbreviated weekday Unabbreviated weekday Mon Monday %m Month (00–12) 00–12 %b %B Abbreviated month Unabbreviated month Jan January %y %Y Two-digit year Four-digit year 07 2007 The default format for inputting dates is yyyy-mm-dd. The statement mydates <- as.Date(c("2007-06-22", "2004-02-13")) converts the character data to dates using this default format. In contrast, strDates <- c("01/05/1965", "08/16/1975") dates <- as.Date(strDates, "%m/%d/%Y") reads the data using a mm/dd/yyyy format. In the leadership dataset, date is coded as a character variable in mm/dd/yy format. Therefore: myformat <- "%m/%d/%y" leadership$date <- as.Date(leadership$date, myformat) uses the specified format to read the character variable and replace it in the data frame as a date variable. Once the variable is in date format, you can analyze and plot the dates using the wide range of analytic techniques covered in later chapters. Two functions are especially useful for time-stamping data. Sys.Date() returns today’s date, and date() returns the current date and time. As I write this, it’s November 27, 2014 at 1:21 pm. So executing those functions produces > Sys.Date() [1] "2014-11-27" > date() [1] "Fri Nov 27 13:21:54 2014" You can use the format(x, format="output_format") function to output dates in a specified format and to extract portions of dates: > today <- Sys.Date() > format(today, format="%B %d %Y") [1] "November 27 2014" > format(today, format="%A") [1] "Thursday" The format() function takes an argument (a date in this case) and applies an output format (in this case, assembled from the symbols in table 4.4). The important result here is that there are only two more days until the weekend! 
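Because format() returns character strings, a small additional sketch of my own (not one of the book's listings) shows how to turn an extracted date component back into a number when you want to compute with it; as.numeric() is covered formally in section 4.7:

today <- Sys.Date()
month_num <- as.numeric(format(today, format="%m"))   # month as a number (1-12)
month_num >= 10                                       # TRUE during October-December

Here the two-character month produced by format() is converted into a value you can compare or do arithmetic with.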
www.it-ebooks.info Type conversions 81 When R stores dates internally, they’re represented as the number of days since January 1, 1970, with negative values for earlier dates. That means you can perform arithmetic operations on them. For example, > startdate <- as.Date("2004-02-13") > enddate <- as.Date("2011-01-22") > days <- enddate - startdate > days Time difference of 2535 days displays the number of days between February 13, 2004 and January 22, 2011. Finally, you can also use the function difftime() to calculate a time interval and express it as seconds, minutes, hours, days, or weeks. Let’s assume that I was born on October 12, 1956. How old am I? > today <- Sys.Date() > dob <- as.Date("1956-10-12") > difftime(today, dob, units="weeks") Time difference of 3033 weeks Apparently I am 3,033 weeks old. Who knew? Final test: On which day of the week was I born? 4.6.1 Converting dates to character variables You can also convert date variables to character variables. Date values can be converted to character values using the as.character() function: strDates <- as.character(dates) The conversion allows you to apply a range of character functions to the data values (subsetting, replacement, concatenation, and so on). We’ll cover character functions in detail in chapter 5. 4.6.2 Going further To learn more about converting character data to dates, look at help(as.Date) and help(strftime). To learn more about formatting dates and times, see help(ISOdatetime). The lubridate package contains a number of functions that simplify working with dates, including functions to identify and parse date-time data, extract date-time components (for example, years, months, days, and so on), and perform arithmetic calculations on date-times. If you need to do complex calculations with dates, the timeDate package can also help. It provides a myriad of functions for dealing with dates, can handle multiple time zones at once, and provides sophisticated calendar manipulations that recognize business days, weekends, and holidays. 4.7 Type conversions In the previous section, we discussed how to convert character data to date values, and vice versa. R provides a set of functions to identify an object’s data type and convert it to a different data type. www.it-ebooks.info 82 CHAPTER 4 Basic data management Type conversions in R work in a similar fashion to those in other statistical programming languages. For example, adding a character string to a numeric vector converts all the elements in the vector to character values. You can use the functions listed in table 4.5 to test for a data type and to convert it to a given type. Table 4.5 Type-conversion functions Test Convert is.numeric() as.numeric() is.character() as.character() is.vector() as.vector() is.matrix() as.matrix() is.data.frame() as.data.frame() is.factor() as.factor() is.logical() as.logical() Functions of the form is.datatype() return TRUE or FALSE, whereas as.datatype() converts the argument to that type. The following listing provides an example. Listing 4.5 Converting from one data type to another > a <- c(1,2,3) > a [1] 1 2 3 > is.numeric(a) [1] TRUE > is.vector(a) [1] TRUE > a <- as.character(a) > a [1] "1" "2" "3" > is.numeric(a) [1] FALSE > is.vector(a) [1] TRUE > is.character(a) [1] TRUE When combined with the flow controls (such as if-then) that we’ll discuss in chapter 5, the is.datatype() function can be a powerful tool, allowing you to handle data in different ways depending on its type. 
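As a brief sketch of that idea (my own example, not one of the book's listings), a type test can guard a conversion before you compute:

v <- c("1", "2", "3", "four")
if (is.character(v)) {
    v <- as.numeric(v)      # "four" can't be converted and becomes NA, with a warning
}
mean(v, na.rm=TRUE)         # returns 2

The is.character() check ensures the conversion is attempted only when it's needed, and na.rm=TRUE (section 4.5.2) handles the value that couldn't be converted.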
Additionally, some R functions require data of a specific type (character or numeric, matrix or data frame), and as.datatype() lets you transform your data into the format required prior to analyses. 4.8 Sorting data Sometimes, viewing a dataset in a sorted order can tell you quite a bit about the data. For example, which managers are most deferential? To sort a data frame in R, you use www.it-ebooks.info Merging datasets 83 the order() function. By default, the sorting order is ascending. Prepend the sorting variable with a minus sign to indicate descending order. The following examples illustrate sorting with the leadership data frame. The statement newdata <- leadership[order(leadership$age),] creates a new dataset containing rows sorted from youngest manager to oldest manager. The statement attach(leadership) newdata <- leadership[order(gender, age),] detach(leadership) sorts the rows into female followed by male, and youngest to oldest within each gender. Finally, attach(leadership) newdata <-leadership[order(gender, -age),] detach(leadership) sorts the rows by gender, and then from oldest to youngest manager within each gender. 4.9 Merging datasets If your data exists in multiple locations, you’ll need to combine it before moving forward. This section shows you how to add columns (variables) and rows (observations) to a data frame. 4.9.1 Adding columns to a data frame To merge two data frames (datasets) horizontally, you use the merge() function. In most cases, two data frames are joined by one or more common key variables (that is, an inner join). For example, total <- merge(dataframeA, dataframeB, by="ID") merges dataframeA and dataframeB by ID. Similarly, total <- merge(dataframeA, dataframeB, by=c("ID","Country")) merges the two data frames by ID and Country. Horizontal joins like this are typically used to add variables to a data frame. Horizontal concatenation with cbind() If you’re joining two matrices or data frames horizontally and don’t need to specify a common key, you can use the cbind() function: total <- cbind(A, B) This function horizontally concatenates objects A and B. For the function to work properly, each object must have the same number of rows and be sorted in the same order. www.it-ebooks.info 84 4.9.2 CHAPTER 4 Basic data management Adding rows to a data frame To join two data frames (datasets) vertically, use the rbind() function: total <- rbind(dataframeA, dataframeB) The two data frames must have the same variables, but they don’t have to be in the same order. If dataframeA has variables that dataframeB doesn’t, then before joining them, do one of the following: ■ ■ Delete the extra variables in dataframeA. Create the additional variables in dataframeB, and set them to NA (missing). Vertical concatenation is typically used to add observations to a data frame. 4.10 Subsetting datasets R has powerful indexing features for accessing the elements of an object. These features can be used to select and exclude variables, observations, or both. The following sections demonstrate several methods for keeping or deleting variables and observations. 4.10.1 Selecting (keeping) variables It’s a common practice to create a new dataset from a limited number of variables chosen from a larger dataset. In chapter 2, you saw that the elements of a data frame are accessed using the notation dataframe[row indices, column indices]. You can use this to select variables. 
For example, newdata <- leadership[, c(6:10)] selects variables q1, q2, q3, q4, and q5 from the leadership data frame and saves them to the data frame newdata. Leaving the row indices blank (,) selects all the rows by default. The statements myvars <- c("q1", "q2", "q3", "q4", "q5") newdata <-leadership[myvars] accomplish the same variable selection. Here, variable names (in quotes) are entered as column indices, thereby selecting the same columns. Finally, you could use myvars <- paste("q", 1:5, sep="") newdata <- leadership[myvars] This example uses the paste() function to create the same character vector as in the previous example. paste() will be covered in chapter 5. 4.10.2 Excluding (dropping) variables There are many reasons to exclude variables. For example, if a variable has many missing values, you may want to drop it prior to further analyses. Let’s look at some methods of excluding variables. www.it-ebooks.info 85 Subsetting datasets You can exclude variables q3 and q4 with these statements: myvars <- names(leadership) %in% c("q3", "q4") newdata <- leadership[!myvars] 1 In order to understand why this works, you need to break it down: names(leadership) produces a character vector containing the variable names: c("managerID","testDate","country","gender","age","q1","q2","q3","q4","q5") 2 names(leadership) %in% c("q3", "q4") returns a logical vector with TRUE for each element in names(leadership)that matches q3 or q4 and FALSE otherwise: c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE) 3 The not (!) operator reverses the logical values: c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE) 4 leadership[c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE)] selects columns with TRUE logical values, so q3 and q4 are excluded. Knowing that q3 and q4 are the eighth and ninth variables, you can exclude them with the following statement: newdata <- leadership[c(-8,-9)] This works because prepending a column index with a minus sign (-) excludes that column. Finally, the same deletion can be accomplished via leadership$q3 <- leadership$q4 <- NULL Here you set columns q3 and q4 to undefined (NULL). Note that NULL isn’t the same as NA (missing). Dropping variables is the converse of keeping variables. The choice depends on which is easier to code. If there are many variables to drop, it may be easier to keep the ones that remain, or vice versa. 4.10.3 Selecting observations Selecting or excluding observations (rows) is typically a key aspect of successful data preparation and analysis. Several examples are given in the following listing. Listing 4.6 Selecting observations Asks for rows 1 through 3 (the first three observations) newdata <- leadership[1:3,] newdata <- leadership[leadership$gender=="M" & leadership$age > 30,] attach(leadership) newdata <- leadership[gender=='M' & age > 30,] detach(leadership) www.it-ebooks.info b Selects all men over 30 Uses attach() so you don’t have to prepend variable names with data-frame names 86 CHAPTER 4 Basic data management Each of these examples provides the row indices and leaves the column indices blank (therefore choosing all columns). Let’s break down the line of code at B in order to understand it: 1 2 3 4 The logical comparison leadership$gender=="M" produces the vector c(TRUE, FALSE, FALSE, TRUE, FALSE). The logical comparison leadership$age > 30 produces the vector c(TRUE, TRUE, FALSE, TRUE, TRUE). 
The logical comparison c(TRUE, FALSE, FALSE, TRUE, FALSE) & c(TRUE, TRUE, FALSE, TRUE, TRUE) produces the vector c(TRUE, FALSE, FALSE, TRUE, FALSE). leadership[c(TRUE, FALSE, FALSE, TRUE, FALSE),] selects the first and fourth observations from the data frame (when the row index is TRUE, the row is included; when it’s FALSE, the row is excluded). This meets the selection crite- ria (men over 30). At the beginning of this chapter, I suggested that you might want to limit your analyses to observations collected between January 1, 2009 and December 31, 2009. How can you do this? Here’s one solution: Converts the date values read in originally as character values to date values using the format mm/dd/yy leadership$date <- as.Date(leadership$date, "%m/%d/%y") Creates starting date startdate <- as.Date("2009-01-01") enddate <- as.Date("2009-10-31") Creates ending date newdata <- leadership[which(leadership$date >= startdate & leadership$date <= enddate),] Selects cases meeting your desired criteria, as in the previous example Note that the default for the as.Date() function is yyyy-mm-dd, so you don’t have to supply it here. 4.10.4 The subset() function The examples in the previous two sections are important because they help describe the ways in which logical vectors and comparison operators are interpreted in R. Understanding how these examples work will help you to interpret R code in general. Now that you’ve done things the hard way, let’s look at a shortcut. The subset() function is probably the easiest way to select variables and observations. Here are two examples: newdata <- subset(leadership, age >= 35 | age < 24, select=c(q1, q2, q3, q4)) Selects all rows that have a value of age greater than or equal to 35 or less than 24. Keeps variables q1 through q4. newdata <- subset(leadership, gender=="M" & age > 25, select=gender:q4) Selects all men over the age of 25, and keeps variables gender through q4 (gender, q4, and all columns between them) www.it-ebooks.info Using SQL statements to manipulate data frames 87 You saw the colon operator from:to in chapter 2. Here, it provides all variables in a data frame between the from variable and the to variable, inclusive. 4.10.5 Random samples Sampling from larger datasets is a common practice in data mining and machine learning. For example, you may want to select two random samples, creating a predictive model from one and validating its effectiveness on the other. The sample() function enables you to take a random sample (with or without replacement) of size n from a dataset. You could take a random sample of size 3 from the leadership dataset using the following statement: mysample <- leadership[sample(1:nrow(leadership), 3, replace=FALSE),] The first argument to sample() is a vector of elements to choose from. Here, the vector is 1 to the number of observations in the data frame. The second argument is the number of elements to be selected, and the third argument indicates sampling without replacement. sample() returns the randomly sampled elements, which are then used to select rows from the data frame. R has extensive facilities for sampling, including drawing and calibrating survey samples (see the sampling package) and analyzing complex survey data (see the survey package). Other methods that rely on sampling, including bootstrapping and resampling statistics, are described in chapter 12. 4.11 Using SQL statements to manipulate data frames Until now, you’ve been using R statements to manipulate data. 
But many data analysts come to R well versed in Structured Query Language (SQL). It would be a shame to lose all that accumulated knowledge. Therefore, before we end, let me briefly mention the existence of the sqldf package. (If you’re unfamiliar with SQL, please feel free to skip this section.) After downloading and installing the package (install.packages("sqldf")), you can use the sqldf() function to apply SQL SELECT statements to data frames. Two examples are given in the following listing. Listing 4.7 Using SQL statements to manipulate data frames Selects all variables (columns) from data frame mtcars, keeps only automobiles (rows) with one carburetor (carb), sorts in ascending order by mpg, and saves the results as the data frame newdf. The option row.names=TRUE carries the row names from the original data frame over to the new one. > library(sqldf) > newdf <- sqldf("select * from mtcars where carb=1 order by mpg", row.names=TRUE) > newdf mpg cyl disp hp drat wt qsec vs am gear carb Valiant 18.1 6 225.0 105 2.76 3.46 20.2 1 0 3 1 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.21 19.4 1 0 3 1 www.it-ebooks.info 88 CHAPTER 4 Toyota Corona Datsun 710 Fiat X1-9 Fiat 128 Toyota Corolla 21.5 22.8 27.3 32.4 33.9 4 120.1 4 108.0 4 79.0 4 78.7 4 71.1 Basic data management 97 93 66 66 65 3.70 3.85 4.08 4.08 4.22 2.46 2.32 1.94 2.20 1.83 20.0 18.6 18.9 19.5 19.9 1 1 1 1 1 0 1 1 1 1 3 4 4 4 4 1 1 1 1 1 > sqldf("select avg(mpg) as avg_mpg, avg(disp) as avg_disp, gear from mtcars where cyl in (4, 6) group by gear") avg_mpg avg_disp gear Prints the mean mpg and disp within 1 20.3 201 3 each level of gear for automobiles 2 24.5 123 4 with four or six cylinders (cyl) 3 25.4 120 5 Experienced SQL users will find the sqldf package a useful adjunct to data management in R. See the project home page (http://code.google.com/p/sqldf/) for more details. 4.12 Summary This chapter covered a lot of ground. First we examined how R stores missing and date values and explored various ways of handling them. Next, you learned how to determine the data type of an object and how to convert it to other types. Simple formulas were used to create new variables and recode existing variables. You learned how to sort data, rename variables, and merge data with other datasets both horizontally (adding variables) and vertically (adding observations). Finally, we discussed how to keep or drop variables and how to select observations based on a variety of criteria. In the next chapter, we’ll look at the myriad of arithmetic, character, and statistical functions that R makes available for creating and transforming variables. After exploring ways of controlling program flow, you’ll see how to write your own functions. We’ll also explore how you can use these functions to aggregate and summarize your data. By the end of chapter 5, you’ll have most of the tools necessary to manage complex datasets. (And you’ll be the envy of data analysts everywhere!) www.it-ebooks.info Advanced data management This chapter covers ■ Mathematical and statistical functions ■ Character functions ■ Looping and conditional execution ■ User-written functions ■ Ways to aggregate and reshape data In chapter 4, we reviewed the basic techniques used for managing datasets in R. In this chapter, we’ll focus on advanced topics. The chapter is divided into three basic parts. In the first part, we’ll take a whirlwind tour of R’s many functions for mathematical, statistical, and character manipulation. 
To give this section relevance, we begin with a data-management problem that can be solved using these functions. After covering the functions themselves, we’ll look at one possible solution to the data-management problem. Next, we cover how to write your own functions to accomplish data-management and -analysis tasks. First, we’ll explore ways of controlling program flow, including looping and conditional statement execution. Then we’ll investigate the structure of user-written functions and how to invoke them once created. 89 www.it-ebooks.info 90 CHAPTER 5 Advanced data management Then, we’ll look at ways of aggregating and summarizing data, along with methods of reshaping and restructuring datasets. When aggregating data, you can specify the use of any appropriate built-in or user-written function to accomplish the summarization, so the topics you learn in the first two parts of the chapter will provide a real benefit. 5.1 A data-management challenge To begin our discussion of numerical and character functions, let’s consider a datamanagement problem. A group of students have taken exams in math, science, and English. You want to combine these scores in order to determine a single performance indicator for each student. Additionally, you want to assign an A to the top 20% of students, a B to the next 20%, and so on. Finally, you want to sort the students alphabetically. The data are presented in table 5.1. Table 5.1 Student exam data Student Math Science English John Davis 502 95 25 Angela Williams 600 99 22 Bullwinkle Moose 412 80 18 David Jones 358 82 15 Janice Markhammer 495 75 20 Cheryl Cushing 512 85 28 Reuven Ytzrhak 410 80 15 Greg Knox 625 95 30 Joel England 573 89 27 Mary Rayburn 522 86 18 Looking at this dataset, several obstacles are immediately evident. First, scores on the three exams aren’t comparable. They have widely different means and standard deviations, so averaging them doesn’t make sense. You must transform the exam scores into comparable units before combining them. Second, you’ll need a method of determining a student’s percentile rank on this score in order to assign a grade. Third, there’s a single field for names, complicating the task of sorting students. You’ll need to split their names into first name and last name in order to sort them properly. Each of these tasks can be accomplished through the judicious use of R’s numerical and character functions. After working through the functions described in the next section, we’ll consider a possible solution to this data-management challenge. www.it-ebooks.info 91 Numerical and character functions 5.2 Numerical and character functions In this section, we’ll review functions in R that can be used as the basic building blocks for manipulating data. They can be divided into numerical (mathematical, statistical, probability) and character functions. After we review each type, I’ll show you how to apply functions to the columns (variables) and rows (observations) of matrices and data frames (see section 5.2.6). 5.2.1 Mathematical functions Table 5.2 lists common mathematical functions along with short examples. Table 5.2 Mathematical functions Function Description abs(x) Absolute value abs(-4) returns 4. sqrt(x) Square root sqrt(25) returns 5. This is the same as 25^(0.5). ceiling(x) Smallest integer not less than x ceiling(3.475) returns 4. floor(x) Largest integer not greater than x floor(3.475) returns 3. trunc(x) Integer formed by truncating values in x toward 0 trunc(5.99) returns 5. 
round(x, digits=n) Rounds x to the specified number of decimal places round(3.475, digits=2) returns 3.48. signif(x, digits=n) Rounds x to the specified number of significant digits signif(3.475, digits=2) returns 3.5. cos(x), sin(x), tan(x) Cosine, sine, and tangent cos(2) returns –0.416. acos(x), asin(x), atan(x) Arc-cosine, arc-sine, and arc-tangent acos(-0.416) returns 2. cosh(x), sinh(x), tanh(x) Hyperbolic cosine, sine, and tangent sinh(2) returns 3.627. acosh(x), asinh(x), atanh(x) Hyperbolic arc-cosine, arc-sine, and arc-tangent asinh(3.627) returns 2. log(x,base=n) log(x) log10(x) Logarithm of x to the base n For convenience: ■ log(x) is the natural logarithm. ■ log10(x) is the common logarithm. ■ log(10) returns 2.3026. ■ log10(10) returns 1. exp(x) Exponential function exp(2.3026) returns 10. www.it-ebooks.info 92 CHAPTER 5 Advanced data management Data transformation is one of the primary uses for these functions. For example, you often transform positively skewed variables such as income to a log scale before further analyses. Mathematical functions are also used as components in formulas, in plotting functions (for example, x versus sin(x)), and in formatting numerical values prior to printing. The examples in table 5.2 apply mathematical functions to scalars (individual numbers). When these functions are applied to numeric vectors, matrices, or data frames, they operate on each individual value. For example, sqrt(c(4, 16, 25)) returns c(2, 4, 5). 5.2.2 Statistical functions Common statistical functions are presented in table 5.3. Many of these functions have optional parameters that affect the outcome. For example, y <- mean(x) provides the arithmetic mean of the elements in object x, and z <- mean(x, trim = 0.05, na.rm=TRUE) provides the trimmed mean, dropping the highest and lowest 5% of scores and any missing values. Use the help() function to learn more about each function and its arguments. Table 5.3 Statistical functions Function Description mean(x) Mean mean(c(1,2,3,4)) returns 2.5. median(x) Median median(c(1,2,3,4)) returns 2.5. sd(x) Standard deviation sd(c(1,2,3,4)) returns 1.29. var(x) Variance var(c(1,2,3,4)) returns 1.67. mad(x) Median absolute deviation mad(c(1,2,3,4)) returns 1.48. quantile(x, probs) Quantiles where x is the numeric vector, where quantiles are desired and probs is a numeric vector with probabilities in [0,1] # 30th and 84th percentiles of x y <- quantile(x, c(.3,.84)) range(x) Range x <- c(1,2,3,4) range(x) returns c(1,4). diff(range(x)) returns 3. sum(x) Sum sum(c(1,2,3,4)) returns 10. www.it-ebooks.info Numerical and character functions Table 5.3 93 Statistical functions Function Description diff(x, lag=n) Lagged differences, with lag indicating which lag to use. The default lag is 1. x<- c(1, 5, 23, 29) diff(x) returns c(4, 18, 6). min(x) Minimum min(c(1,2,3,4)) returns 1. max(x) Maximum max(c(1,2,3,4)) returns 4. scale(x, center=TRUE, scale=TRUE) Column center (center=TRUE) or standardize (center=TRUE, scale=TRUE) data object x. An example is given in listing 5.6. To see these functions in action, look at the next listing. This example demonstrates two ways to calculate the mean and standard deviation of a vector of numbers. 
Listing 5.1 Calculating the mean and standard deviation > x <- c(1,2,3,4,5,6,7,8) > mean(x) [1] 4.5 > sd(x) [1] 2.449490 Short way > n <- length(x) > meanx <- sum(x)/n > css <- sum((x - meanx)^2) > sdx <- sqrt(css / (n-1)) > meanx [1] 4.5 > sdx [1] 2.449490 Long way It’s instructive to view how the corrected sum of squares (css) is calculated in the second approach: 1 2 x equals c(1, 2, 3, 4, 5, 6, 7, 8), and mean x equals 4.5 (length(x) returns the number of elements in x). (x – meanx) subtracts 4.5 from each element of x, resulting in c(-3.5, -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, 3.5) 3 (x – meanx)^2 squares each element of (x - meanx), resulting in c(12.25, 6.25, 2.25, 0.25, 0.25, 2.25, 6.25, 12.25) 4 sum((x - meanx)^2) sums each of the elements of (x - meanx)^2), resulting in 42. Writing formulas in R has much in common with matrix-manipulation languages such as MATLAB (we’ll look more specifically at solving matrix algebra problems in appendix D). www.it-ebooks.info 94 CHAPTER 5 Advanced data management Standardizing data By default, the scale() function standardizes the specified columns of a matrix or data frame to a mean of 0 and a standard deviation of 1: newdata <- scale(mydata) To standardize each column to an arbitrary mean and standard deviation, you can use code similar to the following newdata <- scale(mydata)*SD + M where M is the desired mean and SD is the desired standard deviation. Using the scale() function on non-numeric columns produces an error. To standardize a specific column rather than an entire matrix or data frame, you can use code such as this: newdata <- transform(mydata, myvar = scale(myvar)*10+50) This code standardizes the variable myvar to a mean of 50 and standard deviation of 10. You’ll use the scale() function in the solution to the data-management challenge in section 5.3. 5.2.3 Probability functions You may wonder why probability functions aren’t listed with the statistical functions (it was really bothering you, wasn’t it?). Although probability functions are statistical by definition, they’re unique enough to deserve their own section. Probability functions are often used to generate simulated data with known characteristics and to calculate probability values within user-written statistical functions. In R, probability functions take the form [dpqr]distribution_abbreviation() where the first letter refers to the aspect of the distribution returned: d p q r = = = = density distribution function quantile function random generation (random deviates) The common probability functions are listed in table 5.4. Table 5.4 Probability distributions Distribution Abbreviation Distribution Abbreviation Beta beta Logistic logis Binomial binom Multinomial multinom Cauchy cauchy Negative binomial nbinom Chi-squared (noncentral) chisq Normal norm Exponential exp Poisson pois www.it-ebooks.info 95 Numerical and character functions Table 5.4 Probability distributions Distribution Abbreviation Distribution Abbreviation F f Wilcoxon signed rank signrank Gamma gamma T t Geometric geom Uniform unif Hypergeometric hyper Weibull weibull Lognormal lnorm Wilcoxon rank sum wilcox To see how these work, let’s look at functions related to the normal distribution. If you don’t specify a mean and a standard deviation, the standard normal distribution is assumed (mean=0, sd=1). Examples of the density (dnorm), distribution (pnorm), quantile (qnorm), and random deviate generation (rnorm) functions are given in table 5.5. 
Table 5.5 Normal distribution functions

Problem: Plot the standard normal curve on the interval [–3,3] (see the figure).
Solution:
x <- pretty(c(-3,3), 30)
y <- dnorm(x)
plot(x, y,
     type = "l",
     xlab = "Normal Deviate",
     ylab = "Density",
     yaxs = "i")

[Figure: the standard normal density curve, with Normal Deviate on the x-axis and Density on the y-axis]

Problem: What is the area under the standard normal curve to the left of z=1.96?
Solution: pnorm(1.96) equals 0.975.

Problem: What is the value of the 90th percentile of a normal distribution with a mean of 500 and a standard deviation of 100?
Solution: qnorm(.9, mean=500, sd=100) equals 628.16.

Problem: Generate 50 random normal deviates with a mean of 50 and a standard deviation of 10.
Solution: rnorm(50, mean=50, sd=10)

Don't worry if the plot() function options are unfamiliar. They're covered in detail in chapter 11; pretty() is explained in table 5.7 later in this chapter.

SETTING THE SEED FOR RANDOM NUMBER GENERATION

Each time you generate pseudo-random deviates, a different seed, and therefore different results, are produced. To make your results reproducible, you can specify the seed explicitly, using the set.seed() function. An example is given in the next listing. Here, the runif() function is used to generate pseudo-random numbers from a uniform distribution on the interval 0 to 1.

Listing 5.2 Generating pseudo-random numbers from a uniform distribution

> runif(5)
[1] 0.8725344 0.3962501 0.6826534 0.3667821 0.9255909
> runif(5)
[1] 0.4273903 0.2641101 0.3550058 0.3233044 0.6584988
> set.seed(1234)
> runif(5)
[1] 0.1137034 0.6222994 0.6092747 0.6233794 0.8609154
> set.seed(1234)
> runif(5)
[1] 0.1137034 0.6222994 0.6092747 0.6233794 0.8609154

By setting the seed manually, you're able to reproduce your results. This ability can be helpful in creating examples you can access in the future and share with others.

GENERATING MULTIVARIATE NORMAL DATA

In simulation research and Monte Carlo studies, you often want to draw data from a multivariate normal distribution with a given mean vector and covariance matrix. The mvrnorm() function in the MASS package makes this easy. The function call is

mvrnorm(n, mean, sigma)

where n is the desired sample size, mean is the vector of means, and sigma is the variance-covariance (or correlation) matrix. Listing 5.3 samples 500 observations from a three-variable multivariate normal distribution for which the following are true:

Mean vector:        230.7   146.7    3.6

Covariance matrix:  15360.8  6721.2  -47.1
                     6721.2  4700.9  -16.5
                      -47.1   -16.5    0.3

Listing 5.3 Generating data from a multivariate normal distribution

> library(MASS)
> options(digits=3)
> set.seed(1234)                                  b Sets the random number seed
> mean <- c(230.7, 146.7, 3.6)                    c Specifies the mean vector
> sigma <- matrix(c(15360.8, 6721.2, -47.1,         and covariance matrix
                    6721.2, 4700.9, -16.5,
                    -47.1, -16.5, 0.3), nrow=3, ncol=3)
> mydata <- mvrnorm(500, mean, sigma)             d Generates data
> mydata <- as.data.frame(mydata)
> names(mydata) <- c("y", "x1", "x2")
> dim(mydata)
[1] 500   3
> head(mydata, n=10)                              e Views the results
       y    x1   x2
1   98.8  41.3 4.35
2  244.5 205.2 3.57
3  375.7 186.7 3.69
4  -59.2  11.2 4.23
5  313.0 111.0 2.91
6  288.8 185.1 4.18
7  134.8 165.0 3.68
8  171.7  97.4 3.81
9  167.3 101.0 4.01
10 121.1  94.5 3.76

In listing 5.3, you set a random number seed so that you can reproduce the results at a later time b. You specify the desired mean vector and variance-covariance matrix c and generate 500 pseudo-random observations d.
For convenience, the results are converted from a matrix to a data frame, and the variables are given names. Finally, you confirm that you have 500 observations and 3 variables, and you print out the first 10 observations e. Note that because a correlation matrix is also a covariance matrix, you could have specified the correlation structure directly. The probability functions in R allow you to generate simulated data, sampled from distributions with known characteristics. Statistical methods that rely on simulated data have grown exponentially in recent years, and you’ll see several examples of these in later chapters. 5.2.4 Character functions Whereas mathematical and statistical functions operate on numerical data, character functions extract information from textual data or reformat textual data for printing and reporting. For example, you may want to concatenate a person’s first name and last name, ensuring that the first letter of each is capitalized. Or you may want to count the instances of obscenities in open-ended feedback. Some of the most useful character functions are listed in table 5.6. Table 5.6 Character functions Function Description nchar(x) Counts the number of characters of x. x <- c("ab", "cde", "fghij") length(x) returns 3 (see table 5.7). nchar(x[3]) returns 5. substr(x, start, stop) Extracts or replaces substrings in a character vector. x <- "abcdef" substr(x, 2, 4) returns bcd. substr(x, 2, 4) <- "22222" (x is now "a222ef"). www.it-ebooks.info 98 CHAPTER 5 Table 5.6 Advanced data management Character functions (continued) Function Description grep(pattern, x, ignore.case=FALSE, fixed=FALSE) Searches for pattern in x. If fixed=FALSE, then pattern is a regular expression. If fixed=TRUE, then pattern is a text string. Returns the matching indices. grep("A", c("b","A","c"), fixed=TRUE) returns 2. sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE) Finds pattern in x and substitutes the replacement text. If fixed=FALSE, then pattern is a regular expression. If fixed=TRUE, then pattern is a text string. sub("\\s",".","Hello There") returns Hello.There. Note that "\s" is a regular expression for finding whitespace; use "\\s" instead, because "\" is R’s escape character (see section 1.3.3). strsplit(x, split, fixed=FALSE) Splits the elements of character vector x at split. If fixed=FALSE, then pattern is a regular expression. If fixed=TRUE, then pattern is a text string. y <- strsplit("abc", "") returns a one-component, three-element list containing "a" "b" "c" unlist(y)[2] and sapply(y, "[", 2) both return “b”. paste(..., sep="") Concatenates strings after using the sep string to separate them. paste("x", 1:3, sep="") returns c("x1", "x2", "x3"). paste("x",1:3,sep="M") returns c("xM1","xM2" "xM3"). paste("Today is", date()) returns Today is Mon Dec 28 14:17:32 2015 (I changed the date to appear more current.) toupper(x) Uppercase. toupper("abc") returns “ABC”. tolower(x) Lowercase. tolower("ABC") returns “abc”. Note that the functions grep(), sub(), and strsplit() can search for a text string (fixed=TRUE) or a regular expression (fixed=FALSE); FALSE is the default. Regular expressions provide a clear and concise syntax for matching a pattern of text. For example, the regular expression ^[hc]?at matches any string that starts with 0 or one occurrences of h or c, followed by at. The expression therefore matches hat, cat, and at, but not bat. To learn more, see the regular expression entry in Wikipedia. 
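To see a few of these character functions working together, here's a short sketch of my own (not one of the book's listings) that capitalizes a name and tests the regular expression described above:

x <- "john davis"
parts <- unlist(strsplit(x, " "))                        # "john"  "davis"
capped <- paste(toupper(substr(parts, 1, 1)),
                substr(parts, 2, nchar(parts)), sep="")  # "John"  "Davis"
paste(capped, collapse=" ")                              # returns "John Davis"
grep("^[hc]?at", c("hat", "cat", "at", "bat"))           # returns 1 2 3

The collapse argument to paste() joins the elements of a single vector into one string; you'll see strsplit() and sapply() used in a similar way in the data-management solution in section 5.3.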
5.2.5 Other useful functions The functions in table 5.7 are also quite useful for data-management and manipulation, but they don’t fit cleanly into the other categories. www.it-ebooks.info 99 Numerical and character functions Table 5.7 Other useful functions Function Description length(x) Returns the length of object x. x <- c(2, 5, 6, 9) length(x) returns 4. seq(from, to, by) Generates a sequence. indices <- seq(1,10,2) indices is c(1, 3, 5, 7, 9). rep(x, n) Repeats x n times. y <- rep(1:3, 2) y is c(1, 2, 3, 1, 2, 3). cut(x, n) Divides the continuous variable x into a factor with n levels. To create an ordered factor, include the option ordered_result = TRUE. pretty(x, n) Creates pretty breakpoints. Divides a continuous variable x into n intervals by selecting n + 1 equally spaced rounded values. Often used in plotting. cat(... , file = "myfile", append = FALSE) Concatenates the objects in … and outputs them to the screen or to a file (if one is declared). name <- c("Jane") cat("Hello" , name, "\n") The last example in the table demonstrates the use of escape characters in printing. Use \n for new lines, \t for tabs, \' for a single quote, \b for backspace, and so forth (type ?Quotes for more information). For example, the code name <- "Bob" cat( "Hello", name, "\b.\n", "Isn\'t R", "\t", "GREAT?\n") produces Hello Bob. Isn't R GREAT? Note that the second line is indented one space. When cat concatenates objects for output, it separates each by a space. That’s why you include the backspace (\b) escape character before the period. Otherwise it would produce “Hello Bob .” How you apply the functions covered so far to numbers, strings, and vectors is intuitive and straightforward, but how do you apply them to matrices and data frames? That’s the subject of the next section. 5.2.6 Applying functions to matrices and data frames One of the interesting features of R functions is that they can be applied to a variety of data objects (scalars, vectors, matrices, arrays, and data frames). The following listing provides an example. www.it-ebooks.info 100 CHAPTER 5 Advanced data management Listing 5.4 Applying functions to data objects > a <- 5 > sqrt(a) [1] 2.236068 > b <- c(1.243, 5.654, 2.99) > round(b) [1] 1 6 3 > c <- matrix(runif(12), nrow=3) > c [,1] [,2] [,3] [,4] [1,] 0.4205 0.355 0.699 0.323 [2,] 0.0270 0.601 0.181 0.926 [3,] 0.6682 0.319 0.599 0.215 > log(c) [,1] [,2] [,3] [,4] [1,] -0.866 -1.036 -0.358 -1.130 [2,] -3.614 -0.508 -1.711 -0.077 [3,] -0.403 -1.144 -0.513 -1.538 > mean(c) [1] 0.444 Notice that the mean of matrix c in listing 5.4 results in a scalar (0.444). The mean() function takes the average of all 12 elements in the matrix. But what if you want the three row means or the four column means? R provides a function, apply(), that allows you to apply an arbitrary function to any dimension of a matrix, array, or data frame. The format for the apply() function is apply(x, MARGIN, FUN, ...) where x is the data object, MARGIN is the dimension index, FUN is a function you specify, and ... are any parameters you want to pass to FUN. In a matrix or data frame, MARGIN=1 indicates rows and MARGIN=2 indicates columns. Look at the following examples. 
Listing 5.5 Calculates the trimmed column means e Applying a function to the rows (columns) of a matrix > mydata <- matrix(rnorm(30), nrow=6) > mydata [,1] [,2] [,3] [,4] [,5] [1,] 0.71298 1.368 -0.8320 -1.234 -0.790 [2,] -0.15096 -1.149 -1.0001 -0.725 0.506 [3,] -1.77770 0.519 -0.6675 0.721 -1.350 [4,] -0.00132 -0.308 0.9117 -1.391 1.558 [5,] -0.00543 0.378 -0.0906 -1.485 -0.350 [6,] -0.52178 -0.539 -1.7347 2.050 1.569 > apply(mydata, 1, mean) [1] -0.155 -0.504 -0.511 0.154 -0.310 0.165 > apply(mydata, 2, mean) [1] -0.2907 0.0449 -0.5688 -0.3442 0.1906 > apply(mydata, 2, mean, trim=0.2) [1] -0.1699 0.0127 -0.6475 -0.6575 0.2312 b c Generates data Calculates the row means d Calculates the column means You start by generating a 6 × 5 matrix containing random normal variates b. Then you calculate the six row means c and five column means d. Finally, you calculate www.it-ebooks.info 101 A solution for the data-management challenge the trimmed column means (in this case, means based on the middle 60% of the data, with the bottom 20% and top 20% of the values discarded) e. Because FUN can be any R function, including a function that you write yourself (see section 5.4), apply() is a powerful mechanism. Whereas apply() applies a function over the margins of an array, lapply() and sapply() apply a function over a list. You’ll see an example of sapply() (which is a user-friendly version of lapply()) in the next section. You now have all the tools you need to solve the data challenge presented in section 5.1, so let’s give it a try. 5.3 A solution for the data-management challenge Your challenge from section 5.1 is to combine subject test scores into a single performance indicator for each student, grade each student from A to F based on their relative standing (top 20%, next 20%, and so on), and sort the roster by last name followed by first name. A solution is given in the following listing. 
Listing 5.6 A solution to the learning example > options(digits=2) Step 1 b Step 2 c Step 3 d Step 4 e Step 5 f Step 6 g Step 7 h > Student <- c("John Davis", "Angela Williams", "Bullwinkle Moose", "David Jones", "Janice Markhammer", "Cheryl Cushing", "Reuven Ytzrhak", "Greg Knox", "Joel England", "Mary Rayburn") > Math <- c(502, 600, 412, 358, 495, 512, 410, 625, 573, 522) > Science <- c(95, 99, 80, 82, 75, 85, 80, 95, 89, 86) > English <- c(25, 22, 18, 15, 20, 28, 15, 30, 27, 18) > roster <- data.frame(Student, Math, Science, English, stringsAsFactors=FALSE) > z <- scale(roster[,2:4]) > score <- apply(z, 1, mean) > roster <- cbind(roster, score) Obtains the performance scores > > > > > > y <- quantile(score, c(.8,.6,.4,.2)) roster$grade[score >= y[1]] <- "A" roster$grade[score < y[1] & score >= y[2]] <- "B" roster$grade[score < y[2] & score >= y[3]] <- "C" roster$grade[score < y[3] & score >= y[4]] <- "D" roster$grade[score < y[4]] <- "F" > > > > name <- strsplit((roster$Student), " ") Lastname <- sapply(name, "[", 2) Firstname <- sapply(name, "[", 1) roster <- cbind(Firstname,Lastname, roster[,-1]) Grades the students Extracts the last and first names > roster <- roster[order(Lastname,Firstname),] Step 8 i Sorts by last and first names > roster www.it-ebooks.info 102 CHAPTER 5 6 1 9 4 8 5 3 10 2 7 Firstname Cheryl John Joel David Greg Janice Bullwinkle Mary Angela Reuven Advanced data management Lastname Cushing Davis England Jones Knox Markhammer Moose Rayburn Williams Ytzrhak Math 512 502 573 358 625 495 412 522 600 410 Science 85 95 89 82 95 75 80 86 99 80 English 28 25 27 15 30 20 18 18 22 15 score 0.35 0.56 0.70 -1.16 1.34 -0.63 -0.86 -0.18 0.92 -1.05 grade C B B F A D D C A F The code is dense, so let’s walk through the solution step by step. b The original student roster is given. options(digits=2) limits the number of digits printed after the decimal place and makes the printouts easier to read: > options(digits=2) > roster Student 1 John Davis 2 Angela Williams 3 Bullwinkle Moose 4 David Jones 5 Janice Markhammer 6 Cheryl Cushing 7 Reuven Ytzrhak 8 Greg Knox 9 Joel England 10 Mary Rayburn Math 502 600 412 358 495 512 410 625 573 522 Science 95 99 80 82 75 85 80 95 89 86 English 25 22 18 15 20 28 15 30 27 18 c Because the math, science, and English tests are reported on different scales (with widely differing means and standard deviations), you need to make them comparable before combining them. One way to do this is to standardize the variables so that each test is reported in standard-deviation units, rather than in their original scales. 
You can do this with the scale() function: > z <- scale(roster[,2:4]) > z Math Science English [1,] 0.013 1.078 0.587 [2,] 1.143 1.591 0.037 [3,] -1.026 -0.847 -0.697 [4,] -1.649 -0.590 -1.247 [5,] -0.068 -1.489 -0.330 [6,] 0.128 -0.205 1.137 [7,] -1.049 -0.847 -1.247 [8,] 1.432 1.078 1.504 [9,] 0.832 0.308 0.954 [10,] 0.243 -0.077 -0.697 www.it-ebooks.info 103 A solution for the data-management challenge d You can then get a performance score for each student by calculating the row means using the mean() function and adding them to the roster using the cbind() function: > score <- apply(z, 1, mean) > roster <- cbind(roster, score) > roster Student Math Science 1 John Davis 502 95 2 Angela Williams 600 99 3 Bullwinkle Moose 412 80 4 David Jones 358 82 5 Janice Markhammer 495 75 6 Cheryl Cushing 512 85 7 Reuven Ytzrhak 410 80 8 Greg Knox 625 95 9 Joel England 573 89 10 Mary Rayburn 522 86 English 25 22 18 15 20 28 15 30 27 18 score 0.559 0.924 -0.857 -1.162 -0.629 0.353 -1.048 1.338 0.698 -0.177 e The quantile() function gives you the percentile rank of each student’s performance score. You see that the cutoff for an A is 0.74, for a B is 0.44, and so on: > y <- quantile(roster$score, c(.8,.6,.4,.2)) > y 80% 60% 40% 20% 0.74 0.44 -0.36 -0.89 f Using logical operators, you can recode students’ percentile ranks into a new categorical grade variable. This code creates the variable grade in the roster data frame: > > > > > > roster$grade[score roster$grade[score roster$grade[score roster$grade[score roster$grade[score roster Student 1 John Davis 2 Angela Williams 3 Bullwinkle Moose 4 David Jones 5 Janice Markhammer 6 Cheryl Cushing 7 Reuven Ytzrhak 8 Greg Knox 9 Joel England 10 Mary Rayburn >= y[1]] <- "A" < y[1] & score >= y[2]] <- "B" < y[2] & score >= y[3]] <- "C" < y[3] & score >= y[4]] <- "D" < y[4]] <- "F" Math 502 600 412 358 495 512 410 625 573 522 Science 95 99 80 82 75 85 80 95 89 86 English 25 22 18 15 20 28 15 30 27 18 score 0.559 0.924 -0.857 -1.162 -0.629 0.353 -1.048 1.338 0.698 -0.177 grade B A D F D C F A B C g You use the strsplit() function to break the student names into first name and last name at the space character. Applying strsplit() to a vector of strings returns a list: > name <- strsplit((roster$Student), " ") > name www.it-ebooks.info 104 CHAPTER 5 [[1]] [1] "John" [[2]] [1] "Angela" Advanced data management "Davis" "Williams" [[3]] [1] "Bullwinkle" "Moose" [[4]] [1] "David" "Jones" [[5]] [1] "Janice" "Markhammer" [[6]] [1] "Cheryl" "Cushing" [[7]] [1] "Reuven" "Ytzrhak" [[8]] [1] "Greg" "Knox" [[9]] [1] "Joel" "England" [[10]] [1] "Mary" "Rayburn" h You use the sapply() function to take the first element of each component and put it in a Firstname vector, and the second element of each component and put it in a Lastname vector. "[" is a function that extracts part of an object—here the first or second component of the list name. You use cbind() to add these elements to the roster. 
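If using "[" as a function looks strange, this short sketch (hypothetical values, not the roster data) shows that calling "[" directly is just another way of subscripting, which is why sapply() can apply it to every element of a list:

parts <- list(c("John", "Davis"), c("Mary", "Rayburn"))  # a hypothetical list of split names
"["(parts[[1]], 2)      # equivalent to parts[[1]][2]; returns "Davis"
sapply(parts, "[", 2)   # returns c("Davis", "Rayburn")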
Because you no longer need the student variable, you drop it (with the –1 in the roster index): > > > > Firstname <- sapply(name, "[", 1) Lastname <- sapply(name, "[", 2) roster <- cbind(Firstname, Lastname, roster[,-1]) roster Firstname Lastname Math Science English 1 John Davis 502 95 25 2 Angela Williams 600 99 22 3 Bullwinkle Moose 412 80 18 4 David Jones 358 82 15 5 Janice Markhammer 495 75 20 6 Cheryl Cushing 512 85 28 7 Reuven Ytzrhak 410 80 15 8 Greg Knox 625 95 30 9 Joel England 573 89 27 10 Mary Rayburn 522 86 18 www.it-ebooks.info score 0.559 0.924 -0.857 -1.162 -0.629 0.353 -1.048 1.338 0.698 -0.177 grade B A D F D C F A B C 105 Control flow i Finally, you sort the dataset by first and last name using the order() function: > roster[order(Lastname,Firstname),] Firstname Cheryl John Joel David Greg Janice Bullwinkle Mary Angela Reuven 6 1 9 4 8 5 3 10 2 7 Lastname Cushing Davis England Jones Knox Markhammer Moose Rayburn Williams Ytzrhak Math 512 502 573 358 625 495 412 522 600 410 Science 85 95 89 82 95 75 80 86 99 80 English 28 25 27 15 30 20 18 18 22 15 score 0.35 0.56 0.70 -1.16 1.34 -0.63 -0.86 -0.18 0.92 -1.05 grade C B B F A D D C A F Voilà! Piece of cake! There are many other ways to accomplish these tasks, but this code helps capture the flavor of these functions. Now it’s time to look at control structures and userwritten functions. 5.4 Control flow In the normal course of events, the statements in an R program are executed sequentially from the top of the program to the bottom. But there are times that you’ll want to execute some statements repetitively while executing other statements only if certain conditions are met. This is where control-flow constructs come in. R has the standard control structures you’d expect to see in a modern programming language. First we’ll go through the constructs used for conditional execution, followed by the constructs used for looping. For the syntax examples throughout this section, keep the following in mind: ■ ■ ■ ■ statement is a single R statement or a compound statement (a group of R statements enclosed in curly braces {} and separated by semicolons). cond is an expression that resolves to TRUE or FALSE. expr is a statement that evaluates to a number or character string. seq is a sequence of numbers or character strings. After we discuss control-flow constructs, you’ll learn how to write your own functions. 5.4.1 Repetition and looping Looping constructs repetitively execute a statement or series of statements until a condition isn’t true. These include the for and while structures. FOR The for loop executes a statement repetitively until a variable’s value is no longer contained in the sequence seq. The syntax is for (var in seq) statement www.it-ebooks.info 106 CHAPTER 5 Advanced data management In this example for (i in 1:10) print("Hello") the word Hello is printed 10 times. WHILE A while loop executes a statement repetitively until the condition is no longer true. The syntax is while (cond) statement In a second example, the code i <- 10 while (i > 0) {print("Hello"); i <- i - 1} once again prints the word Hello 10 times. Make sure the statements inside the brackets modify the while condition so that sooner or later it’s no longer true—otherwise the loop will never end! In the previous example, the statement i <- i – 1 subtracts 1 from object i on each loop, so that after the tenth loop it’s no longer larger than 0. If you instead added 1 on each loop, R would never stop saying hello. 
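One defensive pattern, offered here as my own sketch rather than anything in the listing above, is to add an iteration cap so that a mistake in the loop condition or body can't run forever:

i <- 10
iter <- 0
while (i > 0 && iter < 1000) {   # the cap on iter is a safety net against runaway loops
  print("Hello")
  i <- i - 1
  iter <- iter + 1
}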
This is why while loops can be more dangerous than other looping constructs. Looping in R can be inefficient and time consuming when you’re processing the rows or columns of large datasets. Whenever possible, it’s better to use R’s built-in numerical and character functions in conjunction with the apply family of functions. 5.4.2 Conditional execution In conditional execution, a statement or statements are executed only if a specified condition is met. These constructs include if-else, ifelse, and switch. IF-ELSE The if-else control structure executes a statement if a given condition is true. Optionally, a different statement is executed if the condition is false. The syntax is if (cond) statement if (cond) statement1 else statement2 Here are some examples: if (is.character(grade)) grade <- as.factor(grade) if (!is.factor(grade)) grade <- as.factor(grade) else print("Grade already is a factor") In the first instance, if grade is a character vector, it’s converted into a factor. In the second instance, one of two statements is executed. If grade isn’t a factor (note the ! symbol), it’s turned into one. If it’s a factor, then the message is printed. www.it-ebooks.info User-written functions 107 IFELSE The ifelse construct is a compact and vectorized version of the if-else construct. The syntax is ifelse(cond, statement1, statement2) The first statement is executed if cond is TRUE. If cond is FALSE, the second statement is executed. Here are some examples: ifelse(score > 0.5, print("Passed"), print("Failed")) outcome <- ifelse (score > 0.5, "Passed", "Failed") Use ifelse when you want to take a binary action or when you want to input and output vectors from the construct. SWITCH switch chooses statements based on the value of an expression. The syntax is switch(expr, ...) where ... represents statements tied to the possible outcome values of expr. It’s easiest to understand how switch works by looking at the example in the following listing. Listing 5.7 A switch example > feelings <- c("sad", "afraid") > for (i in feelings) print( switch(i, happy = "I am glad you are happy", afraid = "There is nothing to fear", sad = "Cheer up", angry = "Calm down now" ) ) [1] "Cheer up" [1] "There is nothing to fear" This is a silly example, but it shows the main features. You’ll learn how to use switch in user-written functions in the next section. 5.5 User-written functions One of R’s greatest strengths is the user’s ability to add functions. In fact, many of the functions in R are functions of existing functions. The structure of a function looks like this: myfunction <- function(arg1, arg2, ... ){ statements return(object) } Objects in the function are local to the function. The object returned can be any data type, from scalar to list. Let’s look at an example. www.it-ebooks.info 108 CHAPTER 5 Advanced data management Say you’d like to have a function that calculates the central tendency and spread of data objects. The function should give you a choice between parametric (mean and standard deviation) and nonparametric (median and median absolute deviation) statistics. The results should be returned as a named list. Additionally, the user should have the choice of automatically printing the results or not. Unless otherwise specified, the function’s default behavior should be to calculate parametric statistics and not print the results. One solution is given in the following listing. 
Listing 5.8 mystats(): a user-written function for summary statistics mystats <- function(x, parametric=TRUE, print=FALSE) { if (parametric) { center <- mean(x); spread <- sd(x) } else { center <- median(x); spread <- mad(x) } if (print & parametric) { cat("Mean=", center, "\n", "SD=", spread, "\n") } else if (print & !parametric) { cat("Median=", center, "\n", "MAD=", spread, "\n") } result <- list(center=center, spread=spread) return(result) } To see this function in action, first generate some data (a random sample of size 500 from a normal distribution): set.seed(1234) x <- rnorm(500) After executing the statement y <- mystats(x) y$center contains the mean (0.00184) and y$spread contains the standard deviation (1.03). No output is produced. If you execute the statement y <- mystats(x, parametric=FALSE, print=TRUE) y$center contains the median (–0.0207) and y$spread contains the median absolute deviation (1.001). In addition, the following output is produced: Median= -0.0207 MAD= 1 Next, let’s look at a user-written function that uses the switch construct. This function gives the user a choice regarding the format of today’s date. Values that are assigned to parameters in the function declaration are taken as defaults. In the mydate() function, long is the default format for dates if type isn’t specified: mydate <- function(type="long") { switch(type, long = format(Sys.time(), "%A %B %d %Y"), www.it-ebooks.info Aggregation and reshaping 109 short = format(Sys.time(), "%m-%d-%y"), cat(type, "is not a recognized type\n") ) } Here’s the function in action: > mydate("long") [1] "Monday July 14 2014" > mydate("short") [1] "07-14-14" > mydate() [1] "Monday July 14 2014" > mydate("medium") medium is not a recognized type Note that the cat() function is executed only if the entered type doesn’t match "long" or "short". It’s usually a good idea to have an expression that catches usersupplied arguments that have been entered incorrectly. Several functions are available that can help add error trapping and correction to your functions. You can use the function warning() to generate a warning message, message() to generate a diagnostic message, and stop() to stop execution of the current expression and carry out an error action. Error trapping and debugging are discussed more fully in section 20.5. After creating your own functions, you may want to make them available in every session. Appendix B describes how to customize the R environment so that userwritten functions are loaded automatically at startup. We’ll look at additional examples of user-written functions in chapters 6 and 8. You can accomplish a great deal using the basic techniques provided in this section. Control flow and other programming topics are covered in greater detail in chapter 20. Creating a package is covered in chapter 21. If you’d like to explore the subtleties of function writing, or you want to write professional-level code that you can distribute to others, I recommend reading these two chapters and then reviewing two excellent books that you’ll find in the References section at the end of this book: Venables & Ripley (2000) and Chambers (2008). Together, they provide a significant level of detail and breadth of examples. Now that we’ve covered user-written functions, we’ll end this chapter with a discussion of data aggregation and reshaping. 5.6 Aggregation and reshaping R provides a number of powerful methods for aggregating and reshaping data. 
When you aggregate data, you replace groups of observations with summary statistics based on those observations. When you reshape data, you alter the structure (rows and columns) determining how the data is organized. This section describes a variety of methods for accomplishing these tasks.

In the next two subsections, we'll use the mtcars data frame that's included with the base installation of R. This dataset, extracted from Motor Trend magazine (1974), describes the design and performance characteristics (number of cylinders, displacement, horsepower, mpg, and so on) for 32 automobiles. To learn more about the dataset, see help(mtcars).

5.6.1 Transpose
Transposing (reversing rows and columns) is perhaps the simplest method of reshaping a dataset. Use the t() function to transpose a matrix or a data frame. In the latter case, row names become variable (column) names. An example is presented in the next listing.

Listing 5.9 Transposing a dataset
> cars <- mtcars[1:5, 1:4]
> cars
                   mpg cyl disp  hp
Mazda RX4         21.0   6  160 110
Mazda RX4 Wag     21.0   6  160 110
Datsun 710        22.8   4  108  93
Hornet 4 Drive    21.4   6  258 110
Hornet Sportabout 18.7   8  360 175
> t(cars)
     Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive Hornet Sportabout
mpg         21            21       22.8           21.4              18.7
cyl          6             6        4.0            6.0               8.0
disp       160           160      108.0          258.0             360.0
hp         110           110       93.0          110.0             175.0

Listing 5.9 uses a subset of the mtcars dataset in order to conserve space on the page. You'll see a more flexible way of transposing data when we look at the reshape2 package later in this section.

5.6.2 Aggregating data
It's relatively easy to collapse data in R using one or more by variables and a defined function. The format is

aggregate(x, by, FUN)

where x is the data object to be collapsed, by is a list of variables that will be crossed to form the new observations, and FUN is the scalar function used to calculate summary statistics that will make up the new observation values.

As an example, let's aggregate the mtcars data by number of cylinders and gears, returning means for each of the numeric variables.

Listing 5.10 Aggregating data
> options(digits=3)
> attach(mtcars)
> aggdata <- aggregate(mtcars, by=list(cyl, gear), FUN=mean, na.rm=TRUE)
> aggdata
  Group.1 Group.2  mpg cyl disp  hp drat   wt qsec  vs   am gear carb
1       4       3 21.5   4  120  97 3.70 2.46 20.0 1.0 0.00    3 1.00
2       6       3 19.8   6  242 108 2.92 3.34 19.8 1.0 0.00    3 1.00
3       8       3 15.1   8  358 194 3.12 4.10 17.1 0.0 0.00    3 3.08
4       4       4 26.9   4  103  76 4.11 2.38 19.6 1.0 0.75    4 1.50
5       6       4 19.8   6  164 116 3.91 3.09 17.7 0.5 0.50    4 4.00
6       4       5 28.2   4  108 102 4.10 1.83 16.8 0.5 1.00    5 2.00
7       6       5 19.7   6  145 175 3.62 2.77 15.5 0.0 1.00    5 6.00
8       8       5 15.4   8  326 300 3.88 3.37 14.6 0.0 1.00    5 6.00

In these results, Group.1 represents the number of cylinders (4, 6, or 8), and Group.2 represents the number of gears (3, 4, or 5). For example, cars with 4 cylinders and 3 gears have a mean of 21.5 miles per gallon (mpg).

When you're using the aggregate() function, the by variables must be in a list (even if there's only one). You can declare a custom name for the groups from within the list, for instance, using by=list(Group.cyl=cyl, Group.gears=gear). The function specified can be any built-in or user-provided function; a brief sketch illustrating these options follows below. This gives the aggregate command a great deal of power. But when it comes to power, nothing beats the reshape2 package.
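Here is that sketch (mine, not one of the book's numbered listings): it names the grouping variables and swaps in a different summary function, the median. Group.cyl and Group.gear are arbitrary labels chosen for this example, and mtcars$cyl and mtcars$gear are used explicitly so the code runs without attach():

aggdata2 <- aggregate(mtcars,
                      by=list(Group.cyl=mtcars$cyl, Group.gear=mtcars$gear),
                      FUN=median, na.rm=TRUE)
head(aggdata2)   # the same cylinder-by-gear groups, now summarized by medians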
5.6.3 The reshape2 package The reshape2 package is a tremendously versatile approach to both restructuring and aggregating datasets. Because of this versatility, it can be a bit challenging to learn. We’ll go through the process slowly and use a small dataset so it’s clear what’s happening. Because reshape2 isn’t included in the standard installation of R, you’ll need to install it one time, using install.packages("reshape2"). Basically, you melt data so that each row is a unique ID-variable combination. Then you cast the melted data into any shape you desire. During the cast, you can aggregate the data with any function you wish. The dataset you’ll be working with is shown in table 5.8. Table 5.8 The original dataset (mydata) ID Time X1 X2 1 1 5 6 1 2 3 5 2 1 6 1 2 2 2 4 In this dataset, the measurements are the values in the last two columns (5, 6, 3, 5, 6, 1, 2, and 4). Each measurement is uniquely identified by a combination of ID variables (in this case ID, Time, and whether the measurement is on X1 or X2). For example, the measured value 5 in the first row is uniquely identified by knowing that it’s from observation (ID) 1, at Time 1, and on variable X1. www.it-ebooks.info 112 Advanced data management CHAPTER 5 MELTING When you melt a dataset, you restructure it into a format in which each measured variable is in its own row along with the ID variables needed to uniquely identify it. If you melt the data from table 5.8 using the following code, you end up with the structure shown in table 5.9. library(reshape2) md <- melt(mydata, id=c("ID", "Time")) Table 5.9 The melted dataset ID Time variable value 1 1 X1 5 1 2 X1 3 2 1 X1 6 2 2 X1 2 1 1 X2 6 1 2 X2 5 2 1 X2 1 2 2 X2 4 Note that you must specify the variables needed to uniquely identify each measurement (ID and Time) and that the variable indicating the measurement variable names (X1 or X2) is created for you automatically. Now that you have your data in a melted form, you can recast it into any shape, using the dcast() function. CASTING The dcast() function starts with a melted data frame and reshapes it into a new data frame using a formula that you provide and an (optional) function used to aggregate the data. The format is newdata <- dcast(md, formula, fun.aggregate) where md is the melted data, formula describes the desired end result, and fun.aggregate is the (optional) aggregating function. The formula takes the form rowvar1 + rowvar2 + ... ~ colvar1 + colvar2 + ... In this formula, rowvar1 + rowvar2 + ... defines the set of crossed variables that defines the rows, and colvar1 + colvar2 + ... defines the set of crossed variables that defines the columns. See the examples in figure 5.1. Because the formulas on the right side (d, e, and f) don’t include a function, the data is reshaped. 
In contrast, the examples on the left side (a, b, and c) specify the www.it-ebooks.info 113 Summary Reshaping a dataset With aggregation dcast(md, ID~variable, mean) ID X1 X2 1 4 5.5 2 4 2.5 Without aggregation mydata ID Time X1 X2 1 1 5 6 1 2 3 5 2 1 6 1 2 2 2 4 dcast(md, ID+Time~variable) ID Time X1 X2 1 1 5 6 1 2 3 5 2 1 6 1 2 2 2 4 (d) (a) dcast(md, ID+variable~Time) md <- melt(mydata, ID=c("ID", "Time")) dcast(md, Time~variable, mean) ID Time variable value ID variable Time1 Time 2 Time X1 X2 1 1 X1 5 1 X1 5 3 1 5.5 3.5 1 2 X1 3 1 X2 6 5 2 2.5 4.5 2 1 X1 6 2 X1 6 2 2 2 X1 2 2 X2 1 4 1 1 X2 6 1 2 X2 5 2 1 X2 1 2 2 X2 4 (b) dcast(md, ID~Time, mean) ID Time1 Time2 1 5.5 4 2 3.5 3 (e) dcast(md, ID~variable+Time) (c) ID X1 Time1 X1 Time2 X2 Time1 X2 Time2 1 5 3 6 5 2 6 2 1 4 (f) Figure 5.1 Reshaping data with the melt() and dcast() functions mean as an aggregating function. Thus the data is not only reshaped but aggregated as well. For example, example a gives the means on X1 and X2 averaged over time for each observation. Example b gives the mean scores of X1 and X2 at Time 1 and Time 2, averaged over observations. In example c, you have the mean score for each observation at Time 1 and Time 2, averaged over X1 and X2. As you can see, the flexibility provided by the melt() and dcast() functions is amazing. There are many times when you’ll have to reshape or aggregate data prior to analysis. For example, you’ll typically need to place your data in what’s called long format, resembling table 5.9, when analyzing repeated-measures data (data where multiple measures are recorded for each observation). See section 9.6 for an example. 5.7 Summary This chapter reviewed dozens of mathematical, statistical, and probability functions that are useful for manipulating data. You saw how to apply these functions to a wide range of data objects including vectors, matrices, and data frames. You learned to use control-flow constructs for looping and branching to execute some statements repetitively and execute other statements only when certain conditions are met. You then www.it-ebooks.info 114 CHAPTER 5 Advanced data management had a chance to write your own functions and apply them to data. Finally, we explored ways of collapsing, aggregating, and restructuring data. Now that you’ve gathered the tools you need to get your data into shape (no pun intended), you’re ready to bid part 1 goodbye and enter the exciting world of data analysis! In upcoming chapters, we’ll begin to explore the many statistical and graphical methods available for turning data into information. www.it-ebooks.info Part 2 Basic methods I n part 1, we explored the R environment and discussed how to input data from a wide variety of sources, combine and transform it, and prepare it for further analyses. Once your data has been input and cleaned up, the next step is typically to explore the variables one at a time. This provides you with information about the distribution of each variable, which is useful in understanding the characteristics of the sample, identifying unexpected or problematic values, and selecting appropriate statistical methods. Next, variables are typically studied two at a time. This can help you to uncover basic relationships among variables and is a useful first step in developing more complex models. Part 2 focuses on graphical and statistical techniques for obtaining basic information about data. Chapter 6 describes methods for visualizing the distribution of individual variables. 
For categorical variables, this includes bar plots, pie charts, and the newer fan plot. For numeric variables, this includes histograms, density plots, box plots, dot plots, and the less well-known violin plot. Each type of graph is useful for understanding the distribution of a single variable. Chapter 7 describes statistical methods for summarizing individual variables and bivariate relationships. The chapter starts with coverage of descriptive statistics for numerical data based on the dataset as a whole and on subgroups of interest. Next, the use of frequency tables and cross-tabulations for summarizing categorical data is described. The chapter ends by discussing basic inferential methods for understanding relationships between two variables at a time, including bivariate correlations, chi-square tests, t-tests, and nonparametric methods. When you have finished this part of the book, you’ll be able to use basic graphical and statistical methods available in R to describe your data, explore group differences, and identify significant relationships among variables. www.it-ebooks.info 116 CHAPTER www.it-ebooks.info Basic graphs This chapter covers ■ Bar, box, and dot plots ■ Pie and fan charts ■ Histograms and kernel density plots Whenever we analyze data, the first thing we should do is look at it. For each variable, what are the most common values? How much variability is present? Are there any unusual observations? R provides a wealth of functions for visualizing data. In this chapter, we’ll look at graphs that help you understand a single categorical or continuous variable. This topic includes ■ ■ Visualizing the distribution of a variable Comparing groups on an outcome variable In both cases, the variable can be continuous (for example, car mileage as miles per gallon) or categorical (for example, treatment outcome as none, some, or marked). In later chapters, we’ll explore graphs that display bivariate and multivariate relationships among variables. The following sections explore the use of bar plots, pie charts, fan charts, histograms, kernel density plots, box plots, violin plots, and dot plots. Some of these may 117 www.it-ebooks.info 118 CHAPTER 6 Basic graphs be familiar to you, whereas others (such as fan plots or violin plots) may be new to you. The goal, as always, is to understand your data better and to communicate this understanding to others. Let’s start with bar plots. 6.1 Bar plots A bar plot displays the distribution (frequency) of a categorical variable through vertical or horizontal bars. In its simplest form, the format of the barplot() function is barplot(height) where height is a vector or matrix. In the following examples, you’ll plot the outcome of a study investigating a new treatment for rheumatoid arthritis. The data are contained in the Arthritis data frame distributed with the vcd package. This package isn’t included in the default R installation, so install it before first use (install.packages("vcd")). Note that the vcd package isn’t needed to create bar plots. You’re loading it in order to gain access to the Arthritis dataset. But you’ll need the vcd package when creating spinograms, which are described in section 6.1.5. 6.1.1 Simple bar plots If height is a vector, the values determine the heights of the bars in the plot, and a vertical bar plot is produced. Including the option horiz=TRUE produces a horizontal bar chart instead. You can also add annotating options. 
The main option adds a plot title, whereas the xlab and ylab options add x-axis and y-axis labels, respectively. In the Arthritis study, the variable Improved records the patient outcomes for individuals receiving a placebo or drug: > library(vcd) > counts <- table(Arthritis$Improved) > counts None Some Marked 42 14 28 Here, you see that 28 patients showed marked improvement, 14 showed some improvement, and 42 showed no improvement. We’ll discuss the use of the table() function to obtain cell counts more fully in chapter 7. You can graph the variable counts using a vertical or horizontal bar plot. The code is provided in the following listing, and the resulting graphs are displayed in figure 6.1. Listing 6.1 Simple bar plots barplot(counts, main="Simple Bar Plot", xlab="Improvement", ylab="Frequency") barplot(counts, main="Horizontal Bar Plot", xlab="Frequency", ylab="Improvement", horiz=TRUE) www.it-ebooks.info Simple bar plot Horizontal bar plot 119 Bar plots Horizontal Bar Plot None Some Improvement 20 10 Frequency 30 Marked 40 Simple Bar Plot 0 Figure 6.1 Simple vertical and horizontal bar charts None Some Marked 0 10 Improvement 20 30 40 Frequency Creating bar plots with factor variables If the categorical variable to be plotted is a factor or ordered factor, you can create a vertical bar plot quickly with the plot() function. Because Arthritis$Improved is a factor, the code plot(Arthritis$Improved, main="Simple Bar Plot", xlab="Improved", ylab="Frequency") plot(Arthritis$Improved, horiz=TRUE, main="Horizontal Bar Plot", xlab="Frequency", ylab="Improved") will generate the same bar plots as those in listing 6.1, but without the need to tabulate values with the table() function. What happens if you have long labels? In section 6.1.4, you’ll see how to tweak labels so that they don’t overlap. 6.1.2 Stacked and grouped bar plots If height is a matrix rather than a vector, the resulting graph will be a stacked or grouped bar plot. If beside=FALSE (the default), then each column of the matrix produces a bar in the plot, with the values in the column giving the heights of stacked “sub-bars.” If beside=TRUE, each column of the matrix represents a group, and the values in each column are juxtaposed rather than stacked. Consider the cross-tabulation of treatment type and improvement status: > library(vcd) > counts <- table(Arthritis$Improved, Arthritis$Treatment) > counts Treatment Improved Placebo Treated None 29 13 Some 7 7 Marked 7 21 www.it-ebooks.info 120 CHAPTER 6 Basic graphs Grouped Bar Plot 15 20 25 None Some Marked 10 20 Frequency 30 Marked Some None 10 Frequency 40 Stacked Bar Plot 0 0 5 Figure 6.2 Stacked and grouped bar plots Placebo Treated Placebo Treatment Treated Treatment You can graph the results as either a stacked or a grouped bar plot (see the next listing). The resulting graphs are displayed in figure 6.2. Listing 6.2 Stacked and grouped bar plots barplot(counts, main="Stacked Bar Plot", xlab="Treatment", ylab="Frequency", col=c("red", "yellow","green"), legend=rownames(counts)) barplot(counts, main="Grouped Bar Plot", xlab="Treatment", ylab="Frequency", col=c("red", "yellow", "green"), legend=rownames(counts), beside=TRUE) Stacked bar plot Grouped bar plot The first barplot() function produces a stacked bar plot, whereas the second produces a grouped bar plot. We’ve also added the col option to add color to the bars plotted. The legend.text parameter provides bar labels for the legend (which are only useful when height is a matrix). 
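If the default legend crowds the bars, one way to reposition it (a sketch that assumes the counts table from the listing above is still in your workspace) is to pass placement options through args.legend, an argument of barplot() that forwards them to legend():

barplot(counts, main="Grouped Bar Plot",
        xlab="Treatment", ylab="Frequency",
        col=c("red", "yellow", "green"),
        legend.text=rownames(counts),
        args.legend=list(x="topleft", bty="n"),  # place the legend at top left, without a box
        beside=TRUE)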
In chapter 3, we covered ways to format and place the legend to maximum benefit. See if you can rearrange the legend to avoid overlap with the bars. 6.1.3 Mean bar plots Bar plots needn’t be based on counts or frequencies. You can create bar plots that represent means, medians, standard deviations, and so forth by using the aggregate function and passing the results to the barplot() function. The following listing shows an example, which is displayed in figure 6.3. Listing 6.3 Bar plot for sorted mean values > states <- data.frame(state.region, state.x77) > means <- aggregate(states$Illiteracy, by=list(state.region), FUN=mean) > means www.it-ebooks.info 121 Bar plots Group.1 x 1 Northeast 1.00 2 South 1.74 3 North Central 0.70 4 West 1.02 > means <- means[order(means$x),] > means Group.1 x 3 North Central 0.70 1 Northeast 1.00 4 West 1.02 2 South 1.74 > barplot(means$x, names.arg=means$Group.1) > title("Mean Illiteracy Rate") b 6.1.4 c Adds title 0.5 1.0 1.5 Mean Illiteracy Rate 0.0 Listing 6.3 sorts the means from smallest to largest b. Also note that using the title() function c is equivalent to adding the main option in the plot call. means$x is the vector containing the heights of the bars, and the option names.arg=means$Group.1 is added to provide labels. You can take this example further. The bars can be connected with straight-line segments using the lines() function. You can also create mean bar plots with superimposed confidence intervals using the barplot2() function in the gplots package. See help(barplot2) for examples. Sorts means, smallest to largest North Central Northeast West South Figure 6.3 Bar plot of mean illiteracy rates for US regions sorted by rate Tweaking bar plots There are several ways to tweak the appearance of a bar plot. For example, with many bars, bar labels may start to overlap. You can decrease the font size using the cex.names option. Specifying values smaller than 1 will shrink the size of the labels. Optionally, the names.arg argument allows you to specify a character vector of names used to label the bars. You can also use graphical parameters to help text spacing. An example is given in the following listing, with the output displayed in figure 6.4. Listing 6.4 Fitting labels in a bar plot par(mar=c(5,8,4,2)) Increases the size of the y margin par(las=2) Rotates the counts <- table(Arthritis$Improved) FL bar labels barplot(counts, main="Treatment Outcome", Decreases the font size in order horiz=TRUE, to fit the labels comfortably cex.names=0.8, names.arg=c("No Improvement", "Some Improvement", Changes the label text "Marked Improvement")) www.it-ebooks.info 122 CHAPTER 6 Basic graphs Treatment Outcome Marked Improvement Some Improvement Figure 6.4 Horizontal bar plot with tweaked labels 40 30 20 10 0 No Improvement The par() function allows you to make extensive modifications to the graphs that R produces by default. See chapter 3 for more details. Spinograms Before finishing our discussion of bar plots, let’s take a look at a specialized version called a spinogram. In a spinogram, a stacked bar plot is rescaled so that the height of each bar is 1 and the segment heights represent proportions. Spinograms are created through the spine() function of the vcd package. The following code produces a simple spinogram: library(vcd) attach(Arthritis) counts <- table(Treatment, Improved) spine(counts, main="Spinogram Example") detach(Arthritis) Spinogram Example Marked 1 Some 0.8 0.6 Improved The output is provided in figure 6.5. 
The larger percentage of patients with marked improvement in the Treated condition is quite evident when compared with the Placebo condition. In addition to bar plots, pie charts are a popular vehicle for displaying the distribution of a categorical variable. We’ll consider them next. 0.4 None 6.1.5 0.2 0 Figure 6.5 Spinogram of arthritis treatment outcome Placebo Treated Treatment www.it-ebooks.info 123 Pie charts Simple Pie Chart Pie Chart with Percentages UK UK 24% US US 20% Australia 8% Australia France 16% France Germany Germany 32% Pie Chart from a Table (with sample sizes) 3D Pie Chart South 16 UK Northeast 9 US Australia France North Central 12 West 13 Germany Figure 6.6 6.2 Pie chart examples Pie charts Whereas pie charts are ubiquitous in the business world, they’re denigrated by most statisticians, including the authors of the R documentation. They recommend bar or dot plots over pie charts because people are able to judge length more accurately than volume. Perhaps for this reason, the pie chart options in R are limited when compared with other statistical software. Pie charts are created with the function pie(x, labels) where x is a non-negative numeric vector indicating the area of each slice and labels provides a character vector of slice labels. Four examples are given in the next listing; the resulting plots are provided in figure 6.6. Listing 6.5 Pie charts par(mfrow=c(2, 2)) slices <- c(10, 12,4, 16, 8) lbls <- c("US", "UK", "Australia", "Germany", "France") pie(slices, labels = lbls, main="Simple Pie Chart") pct <- round(slices/sum(slices)*100) lbls2 <- paste(lbls, " ", pct, "%", sep="") pie(slices, labels=lbls2, col=rainbow(length(lbls2)), main="Pie Chart with Percentages") www.it-ebooks.info b Combines four graphs into one c Adds percentages to the pie chart 124 CHAPTER 6 Basic graphs library(plotrix) pie3D(slices, labels=lbls,explode=0.1, Creates a chart main="3D Pie Chart ") from the table mytable <- table(state.region) lbls3 <- paste(names(mytable), "\n", mytable, sep="") pie(mytable, labels = lbls3, main="Pie Chart from a Table\n (with sample sizes)") d First you set up the plot so that four graphs are combined into one b. (Combining multiple graphs is covered in chapter 3.) Then you input the data that will be used for the first three graphs. For the second pie chart c, you convert the sample sizes to percentages and add the information to the slice labels. The second pie chart also defines the colors of the slices using the rainbow() function, described in chapter 3. Here rainbow(length(lbls2)) resolves to rainbow(5), providing five colors for the graph. The third pie chart is a 3D chart created using the pie3D() function from the plotrix package. Be sure to download and install this package before using it for the first time. If statisticians dislike pie charts, they positively despise 3D pie charts (although they may secretly find them pretty). This is because the 3D effect adds no additional insight into the data and is considered distracting eye candy. The fourth pie chart demonstrates how to create a chart from a table d. In this case, you count the number of states by US region and append the information to the labels before producing the plot. Pie charts make it difficult to compare the values of the slices (unless the values are appended to the labels). For example, looking at the simple pie chart, can you tell how the US compares to Germany? (If you can, you’re more perceptive than I am.) 
In an attempt to improve on this situation, a variation of the pie chart, called a fan plot, has been developed. The fan plot (Lemon & Tyagi, 2009) provides you with a way to display both relative quantities and differences. In R, it’s implemented through the fan.plot() function in the plotrix package. Consider the following code and the resulting graph (figure 6.7): library(plotrix) slices <- c(10, 12,4, 16, 8) lbls <- c("US", "UK", "Australia", "Germany", "France") fan.plot(slices, labels = lbls, main="Fan Plot") In a fan plot, the slices are rearranged to overlap each other, and the radii are modified so that each slice is visible. Here you can see that Germany is the largest slice and that the US slice is roughly 60% as large. France appears to be half as large as Germany and twice as large as Australia. Remember that the width of the slice and not the radius is what’s important here. Fan Plot France Australia Figure 6.7 www.it-ebooks.info US UK Germany A fan plot of the country data 125 Histograms As you can see, it’s much easier to determine the relative sizes of the slice in a fan plot than in a pie chart. Fan plots haven’t caught on yet, but they’re new. Now that we’ve covered pie and fan charts, let’s move on to histograms. Unlike bar plots and pie charts, histograms describe the distribution of a continuous variable. 6.3 Histograms Histograms display the distribution of a continuous variable by dividing the range of scores into a specified number of bins on the x-axis and displaying the frequency of scores in each bin on the y-axis. You can create histograms with the function hist(x) where x is a numeric vector of values. The option freq=FALSE creates a plot based on probability densities rather than frequencies. The breaks option controls the number of bins. The default produces equally spaced breaks when defining the cells of the histogram. The following listing provides the code for four variations of a histogram; the results are plotted in figure 6.8. Listing 6.6 Histograms par(mfrow=c(2,2)) b Simple histogram hist(mtcars$mpg) hist(mtcars$mpg, breaks=12, col="red", xlab="Miles Per Gallon", main="Colored histogram with 12 bins") c With specified bins and color hist(mtcars$mpg, freq=FALSE, breaks=12, col="red", xlab="Miles Per Gallon", main="Histogram, rug plot, density curve") rug(jitter(mtcars$mpg)) lines(density(mtcars$mpg), col="blue", lwd=2) d With a rug plot x <- mtcars$mpg h<-hist(x, breaks=12, col="red", xlab="Miles Per Gallon", main="Histogram with normal curve and box") xfit<-seq(min(x), max(x), length=40) yfit<-dnorm(xfit, mean=mean(x), sd=sd(x)) yfit <- yfit*diff(h$mids[1:2])*length(x) lines(xfit, yfit, col="blue", lwd=2) box() www.it-ebooks.info e With a normal curve and frame 126 CHAPTER 6 Basic graphs The first histogram b demonstrates the default plot when no options are specified. In this case, five bins are created, and the default axis labels and titles are printed. For the second histogram c, you specified 12 bins, a red fill for the bars, and more attractive and informative labels and title. The third histogram d maintains the same colors, bins, labels, and titles as the previous plot but adds a density curve and rug-plot overlay. The density curve is a kernel density estimate and is described in the next section. It provides a smoother description of the distribution of scores. You use the lines() function to overlay this curve in a blue color and a width that’s twice the default thickness for lines. 
Finally, a rug plot is a one-dimensional representation of the actual data values. If there are many tied values, you can jitter the data on the rug plot using code like the following:

rug(jitter(mtcars$mpg, amount=0.01))

This adds a small random value to each data point (a uniform random variate between ±amount) in order to avoid overlapping points.

The fourth histogram e is similar to the second but has a superimposed normal curve and a box around the figure. The code for superimposing the normal curve comes from a suggestion posted to the R-help mailing list by Peter Dalgaard. The surrounding box is produced by the box() function.

[Figure 6.8 Histogram examples: the default histogram of mtcars$mpg, a colored histogram with 12 bins, a histogram with rug plot and density curve, and a histogram with normal curve and box. The x-axis in each panel is Miles Per Gallon.]

6.4 Kernel density plots
In the previous section, you saw a kernel density plot superimposed on a histogram. Technically, kernel density estimation is a nonparametric method for estimating the probability density function of a random variable. Although the mathematics are beyond the scope of this text, in general, kernel density plots can be an effective way to view the distribution of a continuous variable. The format for a density plot (that's not being superimposed on another graph) is

plot(density(x))

where x is a numeric vector. Because the plot() function begins a new graph, use the lines() function (listing 6.6) when superimposing a density curve on an existing graph. Two kernel density examples are given in the next listing, and the results are plotted in figure 6.9.

Listing 6.7 Kernel density plots
par(mfrow=c(2,1))

d <- density(mtcars$mpg)                             # Creates the minimal graph with all the defaults in place
plot(d)

d <- density(mtcars$mpg)
plot(d, main="Kernel Density of Miles Per Gallon")   # Adds a title
polygon(d, col="red", border="blue")                 # Colors the curve blue and fills the area under the curve with solid red
rug(mtcars$mpg, col="brown")                         # Adds a brown rug

The polygon() function draws a polygon whose vertices are given by x and y. These values are provided by the density() function in this case.

Kernel density plots can be used to compare groups. This is a highly underutilized approach, probably due to a general lack of easily accessible software. Fortunately, the sm package fills this gap nicely.

[Figure 6.9 Kernel density plots of mtcars$mpg: the default density.default(x = mtcars$mpg) plot (N = 32, Bandwidth = 2.477) on top, and the annotated Kernel Density of Miles Per Gallon plot below.]
Listing 6.8 Comparative kernel density plots library(sm) attach(mtcars) cyl.f <- factor(cyl, levels= c(4,6,8), labels = c("4 cylinder", "6 cylinder", "8 cylinder")) b sm.density.compare(mpg, cyl, xlab="Miles Per Gallon") title(main="MPG Distribution by Car Cylinders") colfill<-c(2:(1+length(levels(cyl.f)))) legend(locator(1), levels(cyl.f), fill=colfill) c d Creates a grouping factor Plots the densities Adds a legend via mouse click detach(mtcars) 0.10 0.00 0.05 Density 0.15 0.20 First, the sm package is loaded and the mtcars data frame is attached. In the mtcars data frame b, the variable cyl is a numeric variable coded 4, 6, or 8. cyl is transformed into a factor named cyl.f, in order to provide value labels for the MPG Distribution by Car Cylinders plot. The sm.density.compare() function creates the plot c, and 4 cylinder a title() statement adds a main 6 cylinder 8 cylinder title. Finally, you add a legend to improve interpretability d. (Legends are covered in chapter 3.) A vector of colors is created; here, colfill is c(2,3,4). Then the legend is added to the plot via the legend() function. The locator(1) option indicates that you’ll place the legend interactively by clicking in the graph where you want the leg5 10 15 20 25 30 35 40 Miles Per Gallon end to appear. The second option provides a character vec- Figure 6.10 Kernel density plots of mpg by number of cylinders www.it-ebooks.info 129 Box plots tor of the labels. The third option assigns a color from the vector colfill to each level of cyl.f. The results are displayed in figure 6.10. Overlapping kernel density plots can be a powerful way to compare groups of observations on an outcome variable. Here you can see both the shapes of the distribution of scores for each group and the amount of overlap between groups. (The moral of the story is that my next car will have four cylinders—or a battery.) Box plots are also a wonderful (and more commonly used) graphical approach to visualizing distributions and differences among groups. We’ll discuss them next. Box plots A box-and-whiskers plot describes the distribution of a continuous variable by plotting its five-number summary: the minimum, lower quartile (25th percentile), median (50th percentile), upper quartile (75th percentile), and maximum. It can also display observations that may be outliers (values outside the range of ± 1.5*IQR, where IQR is the interquartile range defined as the upper quartile minus the lower quartile). For example, this statement produces the plot shown in figure 6.11: Box plot Miles Per Gallon 6.5 Figure 6.11 Box plot with annotations added by hand boxplot(mtcars$mpg, main="Box plot", ylab="Miles per Gallon") I added annotations by hand to illustrate the components. By default, each whisker extends to the most extreme data point, which is no more than 1.5 times the interquartile range for the box. Values outside this range are depicted as dots (not shown here). For example, in the sample of cars, the median mpg is 19.2, 50% of the scores fall between 15.3 and 22.8, the smallest value is 10.4, and the largest value is 33.9. How did I read this so precisely from the graph? Issuing boxplot.stats(mtcars$mpg) prints the statistics used to build the graph (in other words, I cheated). There don’t appear to be any outliers, and there is a mild positive skew (the upper whisker is longer than the lower whisker). 6.5.1 Using parallel box plots to compare groups Box plots can be created for individual variables or for variables by group. 
The format is boxplot(formula, data=dataframe) www.it-ebooks.info CHAPTER 6 Basic graphs where formula is a formula and dataframe denotes the data frame (or list) providing the data. An example of a formula is y ~ A, where a separate box plot for numeric variable y is generated for each value of categorical variable A. The formula y ~ A*B would produce a box plot of numeric variable y, for each combination of levels in categorical variables A and B. Adding the option varwidth=TRUE makes the box-plot widths proportional to the square root of their sample sizes. Add horizontal=TRUE to reverse the axis orientation. The following code revisits the impact of four, six, and eight cylinders on auto mpg with parallel box plots. The plot is provided in figure 6.12: boxplot(mpg ~ cyl, data=mtcars, main="Car Mileage Data", xlab="Number of Cylinders", ylab="Miles Per Gallon") 25 20 15 Miles Per Gallon 30 You can see in figure 6.12 that Car Mileage Data there’s a good separation of groups based on gas mileage. You can also see that the distribution of mpg for six-cylinder cars is more symmetrical than for the other two car types. Cars with four cylinders show the greatest spread (and positive skew) of mpg scores, when compared with six- and eightcylinder cars. There’s also an outlier in the eight-cylinder group. Box plots are very versatile. 4 6 8 By adding notch=TRUE, you get Number of Cylinders notched box plots. If two boxes’ notches don’t overlap, there’s Figure 6.12 Box plots of car mileage vs. number of cylinders strong evidence that their medians differ (Chambers et al., 1983, p. 62). The following code creates notched box plots for the mpg example: 10 130 boxplot(mpg ~ cyl, data=mtcars, notch=TRUE, varwidth=TRUE, col="red", main="Car Mileage Data", xlab="Number of Cylinders", ylab="Miles Per Gallon") The col option fills the box plots with a red color, and varwidth=TRUE produces box plots with widths that are proportional to their sample sizes. www.it-ebooks.info 131 Box plots 25 20 15 Miles Per Gallon 30 Car Mileage Data 10 Figure 6.13 Notched box plots for car mileage vs. number of cylinders 4 6 8 Number of Cylinders You can see in figure 6.13 that the median car mileage for four-, six-, and eight-cylinder cars differs. Mileage clearly decreases with number of cylinders. Finally, you can produce box plots for more than one grouping factor. Listing 6.9 provides box plots for mpg versus the number of cylinders and transmission type in an automobile (see figure 6.14). Again, you use the col option to fill the box plots with color. Note that colors recycle; in this case, there are six box plots and only two specified colors, so the colors repeat three times. Listing 6.9 Box plots for two crossed factors mtcars$cyl.f <- factor(mtcars$cyl, levels=c(4,6,8), labels=c("4","6","8")) mtcars$am.f <- factor(mtcars$am, levels=c(0,1), labels=c("auto", "standard")) boxplot(mpg ~ am.f *cyl.f, data=mtcars, varwidth=TRUE, col=c("gold","darkgreen"), main="MPG Distribution by Auto Type", xlab="Auto Type", ylab="Miles Per Gallon") Creates a factor for the number of cylinders Creates a factor for transmission type Generates the box plot From figure 6.14, it’s again clear that median mileage decreases with number of cylinders. For four- and six-cylinder cars, mileage is higher for standard transmissions. But www.it-ebooks.info 132 CHAPTER 6 Basic graphs 25 20 15 Miles Per Gallon 30 MPG Distribution by Auto Type 10 Figure 6.14 Box plots for car mileage vs. 
transmission type and number of cylinders auto.4 standard.4 auto.6 standard.6 auto.8 standard.8 Auto Type for eight-cylinder cars, there doesn’t appear to be a difference. You can also see from the widths of the box plots that standard four-cylinder and automatic eight-cylinder cars are the most common in this dataset. 6.5.2 Violin plots Before we end our discussion of box plots, it’s worth examining a variation called a violin plot. A violin plot is a combination of a box plot and a kernel density plot. You can create one using the vioplot() function from the vioplot package. Be sure to install the vioplot package before first use. The format for the vioplot() function is vioplot(x1, x2, ... , names=, col=) where x1, x2, ... represent one or more numeric vectors to be plotted (one violin plot is produced for each vector). The names parameter provides a character vector of labels for the violin plots, and col is a vector specifying the colors for each violin plot. An example is given in the following listing. Listing 6.10 Violin plots library(vioplot) x1 <- mtcars$mpg[mtcars$cyl==4] x2 <- mtcars$mpg[mtcars$cyl==6] x3 <- mtcars$mpg[mtcars$cyl==8] vioplot(x1, x2, x3, names=c("4 cyl", "6 cyl", "8 cyl"), col="gold") www.it-ebooks.info 133 Dot plots 25 20 15 Miles Per Gallon 30 Violin Plots of Miles Per Gallon 10 Figure 6.15 Violin plots of mpg vs. number of cylinders 4 cyl 6 cyl 8 cyl Number of Cylinders title("Violin Plots of Miles Per Gallon", ylab="Miles Per Gallon", xlab="Number of Cylinders") Note that the vioplot() function requires you to separate the groups to be plotted into separate variables. The results are displayed in figure 6.15. Violin plots are basically kernel density plots superimposed in a mirror-image fashion over box plots. Here, the white dot is the median, the black boxes range from the lower to the upper quartile, and the thin black lines represent the whiskers. The outer shape provides the kernel density plot. Violin plots haven’t really caught on yet. Again, this may be due to a lack of easily accessible software; time will tell. We’ll end this chapter with a look at dot plots. Unlike the graphs you’ve seen previously, dot plots plot every value for a variable. 6.6 Dot plots Dot plots provide a method of plotting a large number of labeled values on a simple horizontal scale. You create them with the dotchart() function, using the format dotchart(x, labels=) where x is a numeric vector and labels specifies a vector that labels each point. You can add a groups option to designate a factor specifying how the elements of x are grouped. If so, the option gcolor controls the color of the groups label, and cex controls the size of the labels. Here’s an example with the mtcars dataset: dotchart(mtcars$mpg, labels=row.names(mtcars), cex=.7, main="Gas Mileage for Car Models", xlab="Miles Per Gallon") www.it-ebooks.info 134 CHAPTER 6 Basic graphs Gas Mileage for Car Models Volvo 142E Maserati Bora Ferrari Dino Ford Pantera L Lotus Europa Porsche 914−2 Fiat X1−9 Pontiac Firebird Camaro Z28 AMC Javelin Dodge Challenger Toyota Corona Toyota Corolla Honda Civic Fiat 128 Chrysler Imperial Lincoln Continental Cadillac Fleetwood Merc 450SLC Merc 450SL Merc 450SE Merc 280C Merc 280 Merc 230 Merc 240D Duster 360 Valiant Hornet Sportabout Hornet 4 Drive Datsun 710 Mazda RX4 Wag Mazda RX4 10 15 20 25 30 Miles Per Gallon Figure 6.16 Dot plot of mpg for each car model The resulting plot is given in figure 6.16. 
This graph allows you to see the mpg for each make of car on the same horizontal axis. Dot plots typically become most interesting when they’re sorted and grouping factors are distinguished by symbol and color. An example is given in the following listing and shown in figure 6.17. Listing 6.11 Dot plot grouped, sorted, and colored Transforms the numeric vector cyl into a factor x <- mtcars[order(mtcars$mpg),] x$cyl <- factor(x$cyl) x$color[x$cyl==4] <- "red" x$color[x$cyl==6] <- "blue" x$color[x$cyl==8] <- "darkgreen" Sorts the data frame mtcars by mpg (lowest to highest) and saves it as data frame x Adds a character vector (color) to data frame x containing the value "red", "blue", or "darkgreen" depending on the value of cyl www.it-ebooks.info 135 Dot plots dotchart(x$mpg, labels = row.names(x), The labels for the data Prints the numbers cex=.7, points are taken from 4, 6, and 8 in black the row names of the groups = x$cyl, Groups data points by data frame (car makes). gcolor = "black", number of cylinders The colors of the color = x$color, points and labels pch=19, are derived from main = "Gas Mileage for Car Models\ngrouped by cylinder", the color vector. xlab = "Miles Per Gallon") In figure 6.17, a number of features become evident for the first time. Again, you see an increase in gas mileage as the number of cylinders decreases. But you also see exceptions. For example, the Pontiac Firebird, with eight cylinders, gets higher gas mileage than the Mercury 280C and the Valiant, each with six cylinders. The Hornet 4 Drive, with six cylinders, gets the same miles per gallon as the Volvo 142E, which has Gas Mileage for Car Models grouped by cylinder 4 Toyota Corolla Fiat 128 Lotus Europa Honda Civic Fiat X1−9 Porsche 914−2 Merc 240D Merc 230 Datsun 710 Toyota Corona Volvo 142E 6 Hornet 4 Drive Mazda RX4 Wag Mazda RX4 Ferrari Dino Merc 280 Valiant Merc 280C 8 Pontiac Firebird Hornet Sportabout Merc 450SL Merc 450SE Ford Pantera L Dodge Challenger AMC Javelin Merc 450SLC Maserati Bora Chrysler Imperial Duster 360 Camaro Z28 Lincoln Continental Cadillac Fleetwood 10 15 20 25 30 Miles Per Gallon Figure 6.17 Dot plot of mpg for car models grouped by number of cylinders www.it-ebooks.info 136 CHAPTER 6 Basic graphs four cylinders. It’s also clear that the Toyota Corolla gets the best gas mileage by far, whereas the Lincoln Continental and Cadillac Fleetwood are outliers on the low end. You can gain significant insight from a dot plot in this example because each point is labeled, the value of each point is inherently meaningful, and the points are arranged in a manner that promotes comparisons. But as the number of data points increases, the utility of the dot plot decreases. NOTE There are many variations of the dot plot. Jacoby (2006) provides a very informative discussion of the dot plot and includes R code for innovative applications. Additionally, the Hmisc package offers a dot-plot function (aptly named dotchart2()) with a number of additional features. 6.7 Summary In this chapter, you learned how to describe continuous and categorical variables. You saw how bar plots and (to a lesser extent) pie charts can be used to gain insight into the distribution of a categorical variable, and how stacked and grouped bar charts can help you understand how groups differ on a categorical outcome. We also explored how histograms, plots, box plots, rug plots, and dot plots can help you visualize the distribution of continuous variables. 
Finally, we explored how overlapping kernel density plots, parallel box plots, and grouped dot plots can help you visualize group differences on a continuous outcome variable. In later chapters, we’ll extend this univariate focus to include bivariate and multivariate graphical methods. You’ll see how to visually depict relationships among many variables at once using such methods as scatter plots, multigroup line plots, mosaic plots, correlograms, lattice graphs, and more. In the next chapter, we’ll look at basic statistical methods for describing distributions and bivariate relationships numerically, as well as inferential methods for evaluating whether relationships among variables exist or are due to sampling error. www.it-ebooks.info Basic statistics This chapter covers ■ Descriptive statistics ■ Frequency and contingency tables ■ Correlations and covariances ■ t-tests ■ Nonparametric statistics In previous chapters, you learned how to import data into R and use a variety of functions to organize and transform the data into a useful format. We then reviewed basic methods for visualizing data. Once your data is properly organized and you’ve begun to explore the data visually, the next step is typically to describe the distribution of each variable numerically, followed by an exploration of the relationships among selected variables two at a time. The goal is to answer questions like these: ■ What kind of mileage are cars getting these days? Specifically, what’s the distribution of miles per gallon (mean, standard deviation, median, range, and so on) in a survey of automobile makes and models? 137 www.it-ebooks.info 138 CHAPTER 7 ■ ■ ■ Basic statistics After a new drug trial, what’s the outcome (no improvement, some improvement, marked improvement) for drug versus placebo groups? Does the gender of the participants have an impact on the outcome? What’s the correlation between income and life expectancy? Is it significantly different from zero? Are you more likely to receive imprisonment for a crime in different regions of the United States? Are the differences between regions statistically significant? In this chapter, we’ll review R functions for generating basic descriptive and inferential statistics. First we’ll look at measures of location and scale for quantitative variables. Then you’ll learn how to generate frequency and contingency tables (and associated chi-square tests) for categorical variables. Next, we’ll examine the various forms of correlation coefficients available for continuous and ordinal variables. Finally, we’ll turn to the study of group differences through parametric (t-tests) and nonparametric (Mann–Whitney U test, Kruskal–Wallis test) methods. Although our focus is on numerical results, we’ll refer to graphical methods for visualizing these results throughout. The statistical methods covered in this chapter are typically taught in a first-year undergraduate statistics course. If these methodologies are unfamiliar to you, two excellent references are McCall (2000) and Kirk (2007). Alternatively, many informative online resources are available (such as Wikipedia) for each of the topics covered. 7.1 Descriptive statistics In this section, we’ll look at measures of central tendency, variability, and distribution shape for continuous variables. For illustrative purposes, we’ll use several of the variables from the Motor Trend Car Road Tests (mtcars) dataset you first saw in chapter 1. 
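If the dataset isn't fresh in your mind, it's worth taking a quick look at its structure and documentation before summarizing it. Here's a minimal sketch using base R functions only (mtcars ships with every R installation, so nothing needs to be installed):

data(mtcars)    # make sure the built-in dataset is loaded
str(mtcars)     # variable names, types, and the first few values of each column
dim(mtcars)     # 32 cars (rows) measured on 11 variables (columns)
help(mtcars)    # documentation describing each variable and its coding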
Our focus will be on miles per gallon (mpg), horsepower (hp), and weight (wt):

> myvars <- c("mpg", "hp", "wt")
> head(mtcars[myvars])
                   mpg  hp   wt
Mazda RX4         21.0 110 2.62
Mazda RX4 Wag     21.0 110 2.88
Datsun 710        22.8  93 2.32
Hornet 4 Drive    21.4 110 3.21
Hornet Sportabout 18.7 175 3.44
Valiant           18.1 105 3.46

First we'll look at descriptive statistics for all 32 cars. Then we'll examine descriptive statistics by transmission type (am) and number of cylinders (cyl). Transmission type is a dichotomous variable coded 0=automatic, 1=manual, and the number of cylinders can be 4, 6, or 8.

7.1.1 A menagerie of methods

When it comes to calculating descriptive statistics, R has an embarrassment of riches. Let's start with functions that are included in the base installation. Then we'll look at extensions that are available through the use of user-contributed packages.

In the base installation, you can use the summary() function to obtain descriptive statistics. An example is presented in the following listing.

Listing 7.1 Descriptive statistics via summary()

> myvars <- c("mpg", "hp", "wt")
> summary(mtcars[myvars])
      mpg             hp              wt
 Min.   :10.4   Min.   : 52.0   Min.   :1.51
 1st Qu.:15.4   1st Qu.: 96.5   1st Qu.:2.58
 Median :19.2   Median :123.0   Median :3.33
 Mean   :20.1   Mean   :146.7   Mean   :3.22
 3rd Qu.:22.8   3rd Qu.:180.0   3rd Qu.:3.61
 Max.   :33.9   Max.   :335.0   Max.   :5.42

The summary() function provides the minimum, maximum, quartiles, and mean for numerical variables and frequencies for factors and logical vectors. You can use the apply() or sapply() function from chapter 5 to provide any descriptive statistics you choose. For the sapply() function, the format is

sapply(x, FUN, options)

where x is the data frame (or matrix) and FUN is an arbitrary function. If options are present, they're passed to FUN. Typical functions that you can plug in here are mean(), sd(), var(), min(), max(), median(), length(), range(), and quantile(). The function fivenum() returns Tukey's five-number summary (minimum, lower-hinge, median, upper-hinge, and maximum).

Surprisingly, the base installation doesn't provide functions for skew and kurtosis, but you can add your own. The example in the next listing provides several descriptive statistics, including skew and kurtosis.

Listing 7.2 Descriptive statistics via sapply()

> mystats <- function(x, na.omit=FALSE){
    if (na.omit)
      x <- x[!is.na(x)]
    m <- mean(x)
    n <- length(x)
    s <- sd(x)
    skew <- sum((x-m)^3/s^3)/n
    kurt <- sum((x-m)^4/s^4)/n - 3
    return(c(n=n, mean=m, stdev=s, skew=skew, kurtosis=kurt))
  }
> myvars <- c("mpg", "hp", "wt")
> sapply(mtcars[myvars], mystats)
             mpg      hp      wt
n         32.000  32.000 32.0000
mean      20.091 146.688  3.2172
stdev      6.027  68.563  0.9785
skew       0.611   0.726  0.4231
kurtosis  -0.373  -0.136 -0.0227

For cars in this sample, the mean mpg is 20.1, with a standard deviation of 6.0. The distribution is skewed to the right (+0.61) and is somewhat flatter than a normal distribution (–0.37). This is most evident if you graph the data. Note that if you wanted to omit missing values, you could use sapply(mtcars[myvars], mystats, na.omit=TRUE).

7.1.2 Even more methods

Several user-contributed packages offer functions for descriptive statistics, including Hmisc, pastecs, and psych. Because these packages aren't included in the base distribution, you'll need to install them on first use (see section 1.4).
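For example, you can install all three packages in a single step; this is something you need to do only once per R installation:

install.packages(c("Hmisc", "pastecs", "psych"))

After that, a library() call (shown in the listings that follow) loads each package into the current session.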
The describe() function in the Hmisc package returns the number of variables and observations, the number of missing and unique values, the mean, quantiles, and the five highest and lowest values. An example is provided in the following listing. Listing 7.3 Descriptive statistics via describe() in the Hmisc package > library(Hmisc) > myvars <- c("mpg", "hp", "wt") > describe(mtcars[myvars]) 3 Variables 32 Observations --------------------------------------------------------------------------mpg n missing unique Mean .05 .10 .25 .50 .75 .90 .95 32 0 25 20.09 12.00 14.34 15.43 19.20 22.80 30.09 31.30 lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9 --------------------------------------------------------------------------hp n missing unique Mean .05 .10 .2 .50 .75 .90 .95 32 0 22 146.7 63.65 66.00 96.50 123.00 180.00 243.50 253.55 lowest : 52 62 65 66 91, highest: 215 230 245 264 335 --------------------------------------------------------------------------wt n missing unique Mean .05 .10 .25 .50 .75 .90 .95 32 0 29 3.217 1.736 1.956 2.581 3.325 3.610 4.048 5.293 lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424 --------------------------------------------------------------------------- The pastecs package includes a function named stat.desc() that provides a wide range of descriptive statistics. The format is stat.desc(x, basic=TRUE, desc=TRUE, norm=FALSE, p=0.95) where x is a data frame or time series. If basic=TRUE (the default), the number of values, null values, missing values, minimum, maximum, range, and sum are provided. If desc=TRUE (also the default), the median, mean, standard error of the mean, 95% confidence interval for the mean, variance, standard deviation, and coefficient of variation are also provided. Finally, if norm=TRUE (not the default), normal distribution statistics are returned, including skewness and kurtosis (and their statistical significance) and www.it-ebooks.info Descriptive statistics 141 the Shapiro–Wilk test of normality. A p-value option is used to calculate the confidence interval for the mean (.95 by default). The next listing gives an example. Listing 7.4 Descriptive statistics via stat.desc() in the pastecs package > library(pastecs) > myvars <- c("mpg", "hp", "wt") > stat.desc(mtcars[myvars]) mpg hp wt nbr.val 32.00 32.000 32.000 nbr.null 0.00 0.000 0.000 nbr.na 0.00 0.000 0.000 min 10.40 52.000 1.513 max 33.90 335.000 5.424 range 23.50 283.000 3.911 sum 642.90 4694.000 102.952 median 19.20 123.000 3.325 mean 20.09 146.688 3.217 SE.mean 1.07 12.120 0.173 CI.mean.0.95 2.17 24.720 0.353 var 36.32 4700.867 0.957 std.dev 6.03 68.563 0.978 coef.var 0.30 0.467 0.304 As if this isn’t enough, the psych package also has a function called describe() that provides the number of nonmissing observations, mean, standard deviation, median, trimmed mean, median absolute deviation, minimum, maximum, range, skew, kurtosis, and standard error of the mean. You can see an example in the following listing. 
Listing 7.5 Descriptive statistics via describe() in the psych package > library(psych) Attaching package: 'psych' The following object(s) are masked from package:Hmisc : describe > myvars <- c("mpg", "hp", "wt") > describe(mtcars[myvars]) var n mean sd median trimmed mad min max mpg 1 32 20.09 6.03 19.20 19.70 5.41 10.40 33.90 hp 2 32 146.69 68.56 123.00 141.19 77.10 52.00 335.00 wt 3 32 3.22 0.98 3.33 3.15 0.77 1.51 5.42 range skew kurtosis se mpg 23.50 0.61 -0.37 1.07 hp 283.00 0.73 -0.14 12.12 wt 3.91 0.42 -0.02 0.17 I told you that it was an embarrassment of riches! In the previous examples, the packages psych and Hmisc both provide a function named describe(). How does R know which one to use? Simply put, the package last loaded takes precedence, as shown in listing 7.5. Here, psych is loaded after Hmisc, and a message is printed indicating that the describe() function in Hmisc is masked by the function in psych. When you NOTE www.it-ebooks.info 142 CHAPTER 7 Basic statistics type in the describe() function and R searches for it, R comes to the psych package first and executes it. If you want the Hmisc version instead, you can type Hmisc::describe(mt). The function is still there. You have to give R more information to find it. Now that you know how to generate descriptive statistics for the data as a whole, let’s review how to obtain statistics for subgroups of the data. 7.1.3 Descriptive statistics by group When comparing groups of individuals or observations, the focus is usually on the descriptive statistics of each group, rather than the total sample. Again, there are several ways to accomplish this in R. We’ll start by getting descriptive statistics for each level of transmission type. In chapter 5, we discussed methods of aggregating data. You can use the aggregate() function (section 5.6.2) to obtain descriptive statistics by group, as shown in the following listing. Listing 7.6 Descriptive statistics by group using aggregate() > myvars <- c("mpg", "hp", "wt") > aggregate(mtcars[myvars], by=list(am=mtcars$am), mean) am mpg hp wt 1 0 17.1 160 3.77 2 1 24.4 127 2.41 > aggregate(mtcars[myvars], by=list(am=mtcars$am), sd) am mpg hp wt 1 0 3.83 53.9 0.777 2 1 6.17 84.1 0.617 Note the use of list(am=mtcars$am). If you used list(mtcars$am), the am column would be labeled Group.1 rather than am. You use the assignment to provide a more useful column label. If you have more than one grouping variable, you can use code like by=list(name1=groupvar1, name2=groupvar2, ... , nameN=groupvarN). Unfortunately, aggregate() only allows you to use single-value functions such as mean, standard deviation, and the like in each call. It won’t return several statistics at once. For that task, you can use the by() function. The format is by(data, INDICES, FUN) where data is a data frame or matrix, INDICES is a factor or list of factors that defines the groups, and FUN is an arbitrary function that operates on all the columns of a data frame. The next listing provides an example. 
Listing 7.7 Descriptive statistics by group using by() > dstats <- function(x)sapply(x, mystats) > myvars <- c("mpg", "hp", "wt") > by(mtcars[myvars], mtcars$am, dstats) www.it-ebooks.info Descriptive statistics 143 mtcars$am: 0 mpg hp wt n 19.000 19.0000 19.000 mean 17.147 160.2632 3.769 stdev 3.834 53.9082 0.777 skew 0.014 -0.0142 0.976 kurtosis -0.803 -1.2097 0.142 ---------------------------------------mtcars$am: 1 mpg hp wt n 13.0000 13.000 13.000 mean 24.3923 126.846 2.411 stdev 6.1665 84.062 0.617 skew 0.0526 1.360 0.210 kurtosis -1.4554 0.563 -1.174 In this case, dstats() applies the mystats() function from listing 7.2 to each column of the data frame. Placing it in the by() function gives you summary statistics for each level of am. 7.1.4 Additional methods by group The doBy package and the psych package also provide functions for descriptive statistics by group. Again, they aren’t distributed in the base installation and must be installed before first use. The summaryBy() function in the doBy package has the format summaryBy(formula, data=dataframe, FUN=function) where the formula takes the form var1 + var2 + var3 + ... + varN ~ groupvar1 + groupvar2 + ... + groupvarN Variables on the left of the ~ are the numeric variables to be analyzed, and variables on the right are categorical grouping variables. The function can be any built-in or user-created R function. An example using the mystats() function created in section 7.2.1 is shown in the following listing. Listing 7.8 Summary statistics by group using summaryBy() in the doBy package > library(doBy) > summaryBy(mpg+hp+wt~am, data=mtcars, FUN=mystats) am mpg.n mpg.mean mpg.stdev mpg.skew mpg.kurtosis hp.n hp.mean hp.stdev 1 0 19 17.1 3.83 0.0140 -0.803 19 160 53.9 2 1 13 24.4 6.17 0.0526 -1.455 13 127 84.1 hp.skew hp.kurtosis wt.n wt.mean wt.stdev wt.skew wt.kurtosis 1 -0.0142 -1.210 19 3.77 0.777 0.976 0.142 2 1.3599 0.563 13 2.41 0.617 0.210 -1.174 The describeBy() function contained in the psych package provides the same descriptive statistics as describe(), stratified by one or more grouping variables, as you can see in the following listing. www.it-ebooks.info 144 CHAPTER 7 Basic statistics Listing 7.9 Summary statistics by group using describe.by() in the psych package > library(psych) > myvars <- c("mpg", "hp", "wt") > describeBy(mtcars[myvars], list(am=mtcars$am)) am: 0 var n mean sd median trimmed mad min max 1 19 17.15 3.83 17.30 17.12 3.11 10.40 24.40 2 19 160.26 53.91 175.00 161.06 77.10 62.00 245.00 3 19 3.77 0.78 3.52 3.75 0.45 2.46 5.42 range skew kurtosis se mpg 14.00 0.01 -0.80 0.88 hp 183.00 -0.01 -1.21 12.37 wt 2.96 0.98 0.14 0.18 ---------------------------------------------------------------------------am: 1 var n mean sd median trimmed mad min max mpg 1 13 24.39 6.17 22.80 24.38 6.67 15.00 33.90 hp 2 13 126.85 84.06 109.00 114.73 63.75 52.00 335.00 wt 3 13 2.41 0.62 2.32 2.39 0.68 1.51 3.57 range skew kurtosis se mpg 18.90 0.05 -1.46 1.71 hp 283.00 1.36 0.56 23.31 wt 2.06 0.21 -1.17 0.17 mpg hp wt Unlike the previous example, the describeBy() function doesn’t allow you to specify an arbitrary function, so it’s less generally applicable. If there’s more than one grouping variable, you can write them as list(name1=groupvar1, name2=groupvar2, ... , nameN=groupvarN). But this will work only if there are no empty cells when the grouping variables are crossed. Data analysts have their own preferences for which descriptive statistics to display and how they like to see them formatted. 
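To illustrate stratifying on more than one grouping variable, the following sketch crosses transmission type (am) with engine shape (vs, coded 0=V-shaped, 1=straight, a variable not used elsewhere in this section); in mtcars this crossing has no empty cells, so describeBy() should accept it:

library(psych)
myvars <- c("mpg", "hp", "wt")
describeBy(mtcars[myvars], list(am=mtcars$am, vs=mtcars$vs))

The output repeats the describe() statistics once for each of the four am-by-vs combinations.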
This is probably why there are many variations available. Choose the one that works best for you, or create your own! 7.1.5 Visualizing results Numerical summaries of a distribution’s characteristics are important, but they’re no substitute for a visual representation. For quantitative variables, you have histograms (section 6.3), density plots (section 6.4), box plots (section 6.5), and dot plots (section 6.6). They can provide insights that are easily missed by reliance on a small set of descriptive statistics. The functions considered so far provide summaries of quantitative variables. The functions in the next section allow you to examine the distributions of categorical variables. 7.2 Frequency and contingency tables In this section, we’ll look at frequency and contingency tables from categorical variables, along with tests of independence, measures of association, and methods for www.it-ebooks.info 145 Frequency and contingency tables graphically displaying results. We’ll be using functions in the basic installation, along with functions from the vcd and gmodels packages. In the following examples, assume that A, B, and C represent categorical variables. The data for this section come from the Arthritis dataset included with the vcd package. The data are from Kock & Edward (1988) and represent a double-blind clinical trial of new treatments for rheumatoid arthritis. Here are the first few observations: > library(vcd) > head(Arthritis) ID Treatment 1 57 Treated 2 46 Treated 3 77 Treated 4 17 Treated 5 36 Treated 6 23 Treated Sex Male Male Male Male Male Male Age 27 29 30 32 46 58 Improved Some None None Marked Marked Marked Treatment (Placebo, Treated), Sex (Male, Female), and Improved (None, Some, Marked) are all categorical factors. In the next section, you’ll create frequency and contingency tables (cross-classifications) from the data. 7.2.1 Generating frequency tables R provides several methods for creating frequency and contingency tables. The most important functions are listed in table 7.1. Table 7.1 Functions for creating and manipulating contingency tables Function Description table(var1, var2, ..., varN) Creates an N-way contingency table from N categorical variables (factors) xtabs(formula, data) Creates an N-way contingency table based on a formula and a matrix or data frame prop.table(table, margins) Expresses table entries as fractions of the marginal table defined by the margins margin.table(table, margins) Computes the sum of table entries for a marginal table defined by the margins addmargins(table, margins) Puts summary margins (sums by default) on a table ftable(table) Creates a compact, “flat” contingency table In the following sections, we’ll use each of these functions to explore categorical variables. We’ll begin with simple frequencies, followed by two-way contingency tables, and end with multiway contingency tables. The first step is to create a table using either the table() or xtabs() function and then manipulate it using the other functions. www.it-ebooks.info 146 CHAPTER 7 Basic statistics ONE-WAY TABLES You can generate simple frequency counts using the table() function. 
Here’s an example: > mytable <- with(Arthritis, table(Improved)) > mytable Improved None Some Marked 42 14 28 You can turn these frequencies into proportions with prop.table() > prop.table(mytable) Improved None Some Marked 0.500 0.167 0.333 or into percentages using prop.table()*100: > prop.table(mytable)*100 Improved None Some Marked 50.0 16.7 33.3 Here you can see that 50% of study participants had some or marked improvement (16.7 + 33.3). TWO-WAY TABLES For two-way tables, the format for the table() function is mytable <- table(A, B) where A is the row variable and B is the column variable. Alternatively, the xtabs() function allows you to create a contingency table using formula-style input. The format is mytable <- xtabs(~ A + B, data=mydata) where mydata is a matrix or data frame. In general, the variables to be cross-classified appear on the right of the formula (that is, to the right of the ~) separated by + signs. If a variable is included on the left side of the formula, it’s assumed to be a vector of frequencies (useful if the data have already been tabulated). For the Arthritis data, you have > mytable <- xtabs(~ Treatment+Improved, data=Arthritis) > mytable Improved Treatment None Some Marked Placebo 29 7 7 Treated 13 7 21 You can generate marginal frequencies and proportions using the margin.table() and prop.table() functions, respectively. For row sums and row proportions, you have > margin.table(mytable, 1) Treatment Placebo Treated 3 41 www.it-ebooks.info Frequency and contingency tables 147 > prop.table(mytable, 1) Improved Treatment None Some Marked Placebo 0.674 0.163 0.163 Treated 0.317 0.171 0.512 The index (1) refers to the first variable in the table() statement. Looking at the table, you can see that 51% of treated individuals had marked improvement, compared to 16% of those receiving a placebo. For column sums and column proportions, you have > margin.table(mytable, 2) Improved None Some Marked 42 14 28 > prop.table(mytable, 2) Improved Treatment None Some Marked Placebo 0.690 0.500 0.250 Treated 0.310 0.500 0.750 Here, the index (2) refers to the second variable in the table() statement. Cell proportions are obtained with this statement: > prop.table(mytable) Improved Treatment None Some Placebo 0.3452 0.0833 Treated 0.1548 0.0833 Marked 0.0833 0.2500 You can use the addmargins() function to add marginal sums to these tables. For example, the following code adds a Sum row and column: > addmargins(mytable) Improved Treatment None Some Marked Placebo 29 7 7 Treated 13 7 21 Sum 42 14 28 > addmargins(prop.table(mytable)) Improved Treatment None Some Marked Placebo 0.3452 0.0833 0.0833 Treated 0.1548 0.0833 0.2500 Sum 0.5000 0.1667 0.3333 Sum 43 41 84 Sum 0.5119 0.4881 1.0000 When using addmargins(), the default is to create sum margins for all variables in a table. In contrast, the following code adds a Sum column alone: > addmargins(prop.table(mytable, 1), 2) Improved Treatment None Some Marked Sum Placebo 0.674 0.163 0.163 1.000 Treated 0.317 0.171 0.512 1.000 www.it-ebooks.info 148 CHAPTER 7 Basic statistics Similarly, this code adds a Sum row: > addmargins(prop.table(mytable, 2), 1) Improved Treatment None Some Marked Placebo 0.690 0.500 0.250 Treated 0.310 0.500 0.750 Sum 1.000 1.000 1.000 In the table, you see that 25% of those patients with marked improvement received a placebo. The table() function ignores missing values (NAs) by default. To include NA as a valid category in the frequency counts, include the table option useNA="ifany". 
NOTE A third method for creating two-way tables is the CrossTable() function in the gmodels package. The CrossTable() function produces two-way tables modeled after PROC FREQ in SAS or CROSSTABS in SPSS. The following listing shows an example. Listing 7.10 Two-way table using CrossTable > library(gmodels) > CrossTable(Arthritis$Treatment, Arthritis$Improved) Cell Contents |-------------------------| | N | | Chi-square contribution | | N / Row Total | | N / Col Total | | N / Table Total | |-------------------------| Total Observations in Table: 84 | Arthritis$Improved Arthritis$Treatment | None | Some | Marked | Row Total | --------------------|-----------|-----------|-----------|-----------| Placebo | 29 | 7 | 7 | 43 | | 2.616 | 0.004 | 3.752 | | | 0.674 | 0.163 | 0.163 | 0.512 | | 0.690 | 0.500 | 0.250 | | | 0.345 | 0.083 | 0.083 | | --------------------|-----------|-----------|-----------|-----------| Treated | 13 | 7 | 21 | 41 | | 2.744 | 0.004 | 3.935 | | | 0.317 | 0.171 | 0.512 | 0.488 | | 0.310 | 0.500 | 0.750 | | | 0.155 | 0.083 | 0.250 | | --------------------|-----------|-----------|-----------|-----------| Column Total | 42 | 14 | 28 | 84 | | 0.500 | 0.167 | 0.333 | | --------------------|-----------|-----------|-----------|-----------| www.it-ebooks.info 149 Frequency and contingency tables The CrossTable() function has options to report percentages (row, column, and cell); specify decimal places; produce chi-square, Fisher, and McNemar tests of independence; report expected and residual values (Pearson, standardized, and adjusted standardized); include missing values as valid; annotate with row and column titles; and format as SAS or SPSS style output. See help(CrossTable) for details. If you have more than two categorical variables, you’re dealing with multidimensional tables. We’ll consider these next. MULTIDIMENSIONAL TABLES Both table() and xtabs() can be used to generate multidimensional tables based on three or more categorical variables. The margin.table(), prop.table(), and addmargins() functions extend naturally to more than two dimensions. Additionally, the ftable() function can be used to print multidimensional tables in a compact and attractive manner. An example is given in the next listing. 
Listing 7.11 Three-way contingency table > mytable <- xtabs(~ Treatment+Sex+Improved, data=Arthritis) > mytable , , Improved = None Treatment Placebo Treated Sex Female 19 6 Male 10 7 , , Improved = Some Treatment Placebo Treated Sex Female 7 5 Male 0 2 , , Improved = Marked Treatment Placebo Treated Sex Female 6 16 Male 1 5 > ftable(mytable) Sex Female Male Treatment Improved Placebo None Some Marked Treated None Some Marked 19 7 6 6 5 16 10 0 1 7 2 5 c > margin.table(mytable, 1) www.it-ebooks.info Marginal frequencies b Cell frequencies 150 CHAPTER 7 Basic statistics Treatment Placebo Treated 43 41 > margin.table(mytable, 2) Sex Female Male 59 25 > margin.table(mytable, 3) Improved None Some Marked Treatment × Improved 42 14 28 marginal frequencies > margin.table(mytable, c(1, 3)) Improved Treatment None Some Marked Placebo 29 7 7 Improved proportions Treated 13 7 21 for Treatment × Sex > ftable(prop.table(mytable, c(1, 2))) Improved None Some Marked Treatment Sex Placebo Female 0.594 0.219 0.188 Male 0.909 0.000 0.091 Treated Female 0.222 0.185 0.593 Male 0.500 0.143 0.357 d e > ftable(addmargins(prop.table(mytable, c(1, Improved None Some Marked Treatment Sex Placebo Female 0.594 0.219 0.188 Male 0.909 0.000 0.091 Treated Female 0.222 0.185 0.593 Male 0.500 0.143 0.357 2)), 3)) Sum 1.000 1.000 1.000 1.000 The code at b produces cell frequencies for the three-way classification. The code also demonstrates how the ftable() function can be used to print a more compact and attractive version of the table. The code at c produces the marginal frequencies for Treatment, Sex, and Improved. Because you created the table with the formula ~Treatment+Sex+ Improved, Treatment is referred to by index 1, Sex is referred to by index 2, and Improved is referred to by index 3. The code at d produces the marginal frequencies for the Treatment x Improved classification, summed over Sex. The proportion of patients with None, Some, and Marked improvement for each Treatment × Sex combination is provided in e. Here you see that 36% of treated males had marked improvement, compared to 59% of treated females. In general, the proportions will add to 1 over the indices not included in the prop.table() call (the third index, or Improved in this case). You can see this in the last example, where you add a sum margin over the third index. If you want percentages instead of proportions, you can multiply the resulting table by 100. For example, this statement ftable(addmargins(prop.table(mytable, c(1, 2)), 3)) * 100 www.it-ebooks.info 151 Frequency and contingency tables produces this table: Sex Female Male Sum 65.5 100.0 85.7 46.2 71.4 76.2 34.5 0.0 14.3 53.8 28.6 23.8 100.0 100.0 100.0 100.0 100.0 100.0 Treatment Improved Placebo None Some Marked Treated None Some Marked Contingency tables tell you the frequency or proportions of cases for each combination of the variables that make up the table, but you’re probably also interested in whether the variables in the table are related or independent. Tests of independence are covered in the next section. 7.2.2 Tests of independence R provides several methods of testing the independence of categorical variables. The three tests described in this section are the chi-square test of independence, the Fisher exact test, and the Cochran-Mantel–Haenszel test. CHI-SQUARE TEST OF INDEPENDENCE You can apply the function chisq.test() to a two-way table in order to produce a chisquare test of independence of the row and column variables. See the next listing for an example. 
Listing 7.12 Chi-square test of independence > library(vcd) > mytable <- xtabs(~Treatment+Improved, data=Arthritis) > chisq.test(mytable) Pearson’s Chi-squared test Treatment and Improved data: mytable aren’t independent. X-squared = 13.1, df = 2, p-value = 0.001463 b > mytable <- xtabs(~Improved+Sex, data=Arthritis) > chisq.test(mytable) Pearson's Chi-squared test data: mytable X-squared = 4.84, df = 2, p-value = 0.0889 c Gender and Improved are independent. Warning message: In chisq.test(mytable) : Chi-squared approximation may be incorrect From the results b, there appears to be a relationship between treatment received and level of improvement (p < .01). But there doesn’t appear to be a relationship c between patient sex and improvement (p > .05). The p-values are the probability of obtaining the sampled results, assuming independence of the row and column variables in the population. Because the probability is small for b, you reject the hypothesis that treatment type and outcome are independent. Because the probability for c isn’t small, it’s not unreasonable to assume that outcome and gender are indepen- www.it-ebooks.info 152 CHAPTER 7 Basic statistics dent. The warning message in listing 7.13 is produced because one of the six cells in the table (male-some improvement) has an expected value less than five, which may invalidate the chi-square approximation. FISHER’S EXACT TEST You can produce a Fisher’s exact test via the fisher.test() function. Fisher’s exact test evaluates the null hypothesis of independence of rows and columns in a contingency table with fixed marginals. The format is fisher.test(mytable), where mytable is a two-way table. Here’s an example: > mytable <- xtabs(~Treatment+Improved, data=Arthritis) > fisher.test(mytable) Fisher's Exact Test for Count Data data: mytable p-value = 0.001393 alternative hypothesis: two.sided In contrast to many statistical packages, the fisher.test() function can be applied to any two-way table with two or more rows and columns, not a 2 × 2 table. COCHRAN–MANTEL–HAENSZEL TEST The mantelhaen.test() function provides a Cochran–Mantel–Haenszel chi-square test of the null hypothesis that two nominal variables are conditionally independent in each stratum of a third variable. The following code tests the hypothesis that the Treatment and Improved variables are independent within each level for Sex. The test assumes that there’s no three-way (Treatment × Improved × Sex) interaction: > mytable <- xtabs(~Treatment+Improved+Sex, data=Arthritis) > mantelhaen.test(mytable) Cochran-Mantel-Haenszel test data: mytable Cochran-Mantel-Haenszel M^2 = 14.6, df = 2, p-value = 0.0006647 The results suggest that the treatment received and the improvement reported aren’t independent within each level of Sex (that is, treated individuals improved more than those receiving placebos when controlling for sex). 7.2.3 Measures of association The significance tests in the previous section evaluate whether sufficient evidence exists to reject a null hypothesis of independence between variables. If you can reject the null hypothesis, your interest turns naturally to measures of association in order to gauge the strength of the relationships present. The assocstats() function in the vcd package can be used to calculate the phi coefficient, contingency coefficient, and Cramer’s V for a two-way table. An example is given in the following listing. 
Listing 7.13 Measures of association for a two-way table > library(vcd) > mytable <- xtabs(~Treatment+Improved, data=Arthritis) > assocstats(mytable) www.it-ebooks.info Correlations 153 X^2 df P(> X^2) Likelihood Ratio 13.530 2 0.0011536 Pearson 13.055 2 0.0014626 Phi-Coefficient : 0.394 Contingency Coeff.: 0.367 Cramer's V : 0.394 In general, larger magnitudes indicate stronger associations. The vcd package also provides a kappa() function that can calculate Cohen’s kappa and weighted kappa for a confusion matrix (for example, the degree of agreement between two judges classifying a set of objects into categories). 7.2.4 Visualizing results R has mechanisms for visually exploring the relationships among categorical variables that go well beyond those found in most other statistical platforms. You typically use bar charts to visualize frequencies in one dimension (see section 6.1). The vcd package has excellent functions for visualizing relationships among categorical variables in multidimensional datasets using mosaic and association plots (see section 11.4). Finally, correspondence-analysis functions in the ca package allow you to visually explore relationships between rows and columns in contingency tables using various geometric representations (Nenadic and Greenacre, 2007). This ends the discussion of contingency tables, until we take up more advanced topics in chapters 11 and 15. Next, let’s look at various types of correlation coefficients. 7.3 Correlations Correlation coefficients are used to describe relationships among quantitative variables. The sign ± indicates the direction of the relationship (positive or inverse), and the magnitude indicates the strength of the relationship (ranging from 0 for no relationship to 1 for a perfectly predictable relationship). In this section, we’ll look at a variety of correlation coefficients, as well as tests of significance. We’ll use the state.x77 dataset available in the base R installation. It provides data on the population, income, illiteracy rate, life expectancy, murder rate, and high school graduation rate for the 50 US states in 1977. There are also temperature and land-area measures, but we’ll drop them to save space. Use help(state.x77) to learn more about the file. In addition to the base installation, we’ll be using the psych and ggm packages. 7.3.1 Types of correlations R can produce a variety of correlation coefficients, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial. Let’s look at each in turn. PEARSON, SPEARMAN, AND KENDALL CORRELATIONS The Pearson product-moment correlation assesses the degree of linear relationship between two quantitative variables. Spearman’s rank-order correlation coefficient www.it-ebooks.info 154 CHAPTER 7 Basic statistics assesses the degree of relationship between two rank-ordered variables. Kendall’s tau is also a nonparametric measure of rank correlation. The cor() function produces all three correlation coefficients, whereas the cov() function provides covariances. There are many options, but a simplified format for producing correlations is cor(x, use= , method= ) The options are described in table 7.2. Table 7.2 cor/cov options Option Description x Matrix or data frame. use Specifies the handling of missing data. The options are all.obs (assumes no missing data—missing data will produce an error), everything (any correlation involving a case with missing values will be set to missing), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion). 
method Specifies the type of correlation. The options are pearson, spearman, and kendall. The default options are use="everything" and method="pearson". You can see an example in the following listing. Listing 7.14 Covariances and correlations > states<- state.x77[,1:6] > cov(states) Population Income Illiteracy Life Exp Murder HS Grad Population 19931684 571230 292.868 -407.842 5663.52 -3551.51 Income 571230 377573 -163.702 280.663 -521.89 3076.77 Illiteracy 293 -164 0.372 -0.482 1.58 -3.24 Life Exp -408 281 -0.482 1.802 -3.87 6.31 Murder 5664 -522 1.582 -3.869 13.63 -14.55 HS Grad -3552 3077 -3.235 6.313 -14.55 65.24 > cor(states) Population Income Illiteracy Life Exp Murder Population 1.0000 0.208 0.108 -0.068 0.344 Income 0.2082 1.000 -0.437 0.340 -0.230 Illiteracy 0.1076 -0.437 1.000 -0.588 0.703 Life Exp -0.0681 0.340 -0.588 1.000 -0.781 Murder 0.3436 -0.230 0.703 -0.781 1.000 HS Grad -0.0985 0.620 -0.657 0.582 -0.488 > cor(states, method="spearman") Population Income Illiteracy Life Exp Murder Population 1.000 0.125 0.313 -0.104 0.346 Income 0.125 1.000 -0.315 0.324 -0.217 Illiteracy 0.313 -0.315 1.000 -0.555 0.672 Life Exp -0.104 0.324 -0.555 1.000 -0.780 Murder 0.346 -0.217 0.672 -0.780 1.000 HS Grad -0.383 0.510 -0.655 0.524 -0.437 www.it-ebooks.info HS Grad -0.0985 0.6199 -0.6572 0.5822 -0.4880 1.0000 HS Grad -0.383 0.510 -0.655 0.524 -0.437 1.000 Correlations 155 The first call produces the variances and covariances. The second provides Pearson product-moment correlation coefficients, and the third produces Spearman rankorder correlation coefficients. You can see, for example, that a strong positive correlation exists between income and high school graduation rate and that a strong negative correlation exists between illiteracy rates and life expectancy. Notice that you get square matrices by default (all variables crossed with all other variables). You can also produce nonsquare matrices, as shown in the following example: > x <- states[,c("Population", "Income", "Illiteracy", "HS Grad")] > y <- states[,c("Life Exp", "Murder")] > cor(x,y) Life Exp Murder Population -0.068 0.344 Income 0.340 -0.230 Illiteracy -0.588 0.703 HS Grad 0.582 -0.488 This version of the function is particularly useful when you’re interested in the relationships between one set of variables and another. Notice that the results don’t tell you if the correlations differ significantly from 0 (that is, whether there’s sufficient evidence based on the sample data to conclude that the population correlations differ from 0). For that, you need tests of significance (described in section 7.3.2). PARTIAL CORRELATIONS A partial correlation is a correlation between two quantitative variables, controlling for one or more other quantitative variables. You can use the pcor() function in the ggm package to provide partial correlation coefficients. The ggm package isn’t installed by default, so be sure to install it on first use. The format is pcor(u, S) where u is a vector of numbers, with the first two numbers being the indices of the variables to be correlated, and the remaining numbers being the indices of the conditioning variables (that is, the variables being partialed out). S is the covariance matrix among the variables. 
An example will help clarify this: > library(ggm) > colnames(states) [1] "Population" "Income" "Illiteracy" "Life Exp" "Murder" "HS Grad" > pcor(c(1,5,2,3,6), cov(states)) [1] 0.346 In this case, 0.346 is the correlation between population (variable 1) and murder rate (variable 5), controlling for the influence of income, illiteracy rate, and high school graduation rate (variables 2, 3, and 6 respectively). The use of partial correlations is common in the social sciences. OTHER TYPES OF CORRELATIONS The hetcor() function in the polycor package can compute a heterogeneous correlation matrix containing Pearson product-moment correlations between numeric variables, polyserial correlations between numeric and ordinal variables, polychoric www.it-ebooks.info 156 CHAPTER 7 Basic statistics correlations between ordinal variables, and tetrachoric correlations between two dichotomous variables. Polyserial, polychoric, and tetrachoric correlations assume that the ordinal or dichotomous variables are derived from underlying normal distributions. See the documentation that accompanies this package for more information. 7.3.2 Testing correlations for significance Once you’ve generated correlation coefficients, how do you test them for statistical significance? The typical null hypothesis is no relationship (that is, the correlation in the population is 0). You can use the cor.test() function to test an individual Pearson, Spearman, and Kendall correlation coefficient. A simplified format is cor.test(x, y, alternative = , method = ) where x and y are the variables to be correlated, alternative specifies a two-tailed or one-tailed test ("two.side", "less", or "greater"), and method specifies the type of correlation ("pearson", "kendall", or "spearman") to compute. Use alternative ="less" when the research hypothesis is that the population correlation is less than 0. Use alternative="greater" when the research hypothesis is that the population correlation is greater than 0. By default, alternative="two.side" (population correlation isn’t equal to 0) is assumed. See the following listing for an example. Listing 7.15 Testing a correlation coefficient for significance > cor.test(states[,3], states[,5]) Pearson's product-moment correlation data: states[, 3] and states[, 5] t = 6.85, df = 48, p-value = 1.258e-08 alternative hypothesis: true correlation is not equal to 0 95 percent confidence interval: 0.528 0.821 sample estimates: cor 0.703 This code tests the null hypothesis that the Pearson correlation between life expectancy and murder rate is 0. Assuming that the population correlation is 0, you’d expect to see a sample correlation as large as 0.703 less than 1 time out of 10 million (that is, p = 1.258e-08). Given how unlikely this is, you reject the null hypothesis in favor of the research hypothesis, that the population correlation between life expectancy and murder rate is not 0. Unfortunately, you can test only one correlation at a time using cor.test(). Luckily, the corr.test() function provided in the psych package allows you to go further. The corr.test() function produces correlations and significance levels for matrices of Pearson, Spearman, and Kendall correlations. An example is given in the following listing. 
www.it-ebooks.info Correlations 157 Listing 7.16 Correlation matrix and tests of significance via corr.test() > library(psych) > corr.test(states, use="complete") Call:corr.test(x = states, use = "complete") Correlation matrix Population Income Illiteracy Life Exp Murder HS Grad Population 1.00 0.21 0.11 -0.07 0.34 -0.10 Income 0.21 1.00 -0.44 0.34 -0.23 0.62 Illiteracy 0.11 -0.44 1.00 -0.59 0.70 -0.66 Life Exp -0.07 0.34 -0.59 1.00 -0.78 0.58 Murder 0.34 -0.23 0.70 -0.78 1.00 -0.49 HS Grad -0.10 0.62 -0.66 0.58 -0.49 1.00 Sample Size [1] 50 Probability value Population Income Illiteracy Life Exp Murder HS Grad Population 0.00 0.15 0.46 0.64 0.01 0.5 Income 0.15 0.00 0.00 0.02 0.11 0.0 Illiteracy 0.46 0.00 0.00 0.00 0.00 0.0 Life Exp 0.64 0.02 0.00 0.00 0.00 0.0 Murder 0.01 0.11 0.00 0.00 0.00 0.0 HS Grad 0.50 0.00 0.00 0.00 0.00 0.0 The use= options can be "pairwise" or "complete" (for pairwise or listwise deletion of missing values, respectively). The method= option is "pearson" (the default), "spearman", or "kendall". Here you see that the correlation between population size and high school graduation rate (–0.10) is not significantly different from 0 (p = 0.5). OTHER TESTS OF SIGNIFICANCE In section 7.4.1, we looked at partial correlations. The pcor.test() function in the psych package can be used to test the conditional independence of two variables controlling for one or more additional variables, assuming multivariate normality. The format is pcor.test(r, q, n) where r is the partial correlation produced by the pcor() function, q is the number of variables being controlled, and n is the sample size. Before leaving this topic, it should be mentioned that the r.test() function in the psych package also provides a number of useful significance tests. The function can be used to test the following: ■ ■ ■ ■ The significance of a correlation coefficient The difference between two independent correlations The difference between two dependent correlations sharing a single variable The difference between two dependent correlations based on completely different variables See help(r.test) for details. www.it-ebooks.info 158 7.3.3 CHAPTER 7 Basic statistics Visualizing correlations The bivariate relationships underlying correlations can be visualized through scatter plots and scatter plot matrices, whereas correlograms provide a unique and powerful method for comparing a large number of correlation coefficients in a meaningful way. Each is covered in chapter 11. 7.4 T-tests The most common activity in research is the comparison of two groups. Do patients receiving a new drug show greater improvement than patients using an existing medication? Does one manufacturing process produce fewer defects than another? Which of two teaching methods is most cost-effective? If your outcome variable is categorical, you can use the methods described in section 7.3. Here, we’ll focus on group comparisons, where the outcome variable is continuous and assumed to be distributed normally. For this illustration, we’ll use the UScrime dataset distributed with the MASS package. It contains information about the effect of punishment regimes on crime rates in 47 US states in 1960. The outcome variables of interest will be Prob (the probability of imprisonment), U1 (the unemployment rate for urban males ages 14–24), and U2 (the unemployment rate for urban males ages 35–39). The categorical variable So (an indicator variable for Southern states) will serve as the grouping variable. 
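Before running any tests, it's worth glancing at these variables. Here's a minimal sketch (MASS is a recommended package that's normally installed along with R itself):

library(MASS)
table(UScrime$So)                        # number of non-Southern (0) and Southern (1) states
summary(UScrime[c("Prob", "U1", "U2")])  # ranges and quartiles of the three outcome variables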
The data have been rescaled by the original authors. (Note: I considered naming this section “Crime and Punishment in the Old South,” but cooler heads prevailed.) 7.4.1 Independent t-test Are you more likely to be imprisoned if you commit a crime in the South? The comparison of interest is Southern versus non-Southern states, and the dependent variable is the probability of incarceration. A two-group independent t-test can be used to test the hypothesis that the two population means are equal. Here, you assume that the two groups are independent and that the data is sampled from normal populations. The format is either t.test(y ~ x, data) where y is numeric and x is a dichotomous variable, or t.test(y1, y2) where y1 and y2 are numeric vectors (the outcome variable for each group). The optional data argument refers to a matrix or data frame containing the variables. In contrast to most statistical packages, the default test assumes unequal variance and applies the Welsh degrees-of-freedom modification. You can add a var.equal=TRUE option to specify equal variances and a pooled variance estimate. By default, a twotailed alternative is assumed (that is, the means differ but the direction isn’t specified). You can add the option alternative="less" or alternative="greater" to specify a directional test. www.it-ebooks.info T-tests 159 The following code compares Southern (group 1) and non-Southern (group 0) states on the probability of imprisonment using a two-tailed test without the assumption of equal variances: > library(MASS) > t.test(Prob ~ So, data=UScrime) Welch Two Sample t-test data: Prob by So t = -3.8954, df = 24.925, p-value = 0.0006506 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.03852569 -0.01187439 sample estimates: mean in group 0 mean in group 1 0.03851265 0.06371269 You can reject the hypothesis that Southern states and non-Southern states have equal probabilities of imprisonment (p < .001). Because the outcome variable is a proportion, you might try to transform it to normality before carrying out the t-test. In the current case, all reasonable transformations of the outcome variable (Y/1-Y, log(Y/1-Y), arcsin(Y), and arcsin(sqrt(Y)) would lead to the same conclusions. Transformations are covered in detail in chapter 8. NOTE 7.4.2 Dependent t-test As a second example, you might ask if the unemployment rate for younger males (14– 24) is greater than for older males (35–39). In this case, the two groups aren’t independent. You wouldn’t expect the unemployment rate for younger and older males in Alabama to be unrelated. When observations in the two groups are related, you have a dependent-groups design. Pre-post or repeated-measures designs also produce dependent groups. A dependent t-test assumes that the difference between groups is normally distributed. In this case, the format is t.test(y1, y2, paired=TRUE) where y1 and y2 are the numeric vectors for the two dependent groups. 
The results are as follows: > library(MASS) > sapply(UScrime[c("U1","U2")], function(x)(c(mean=mean(x),sd=sd(x)))) U1 U2 mean 95.5 33.98 sd 18.0 8.45 > with(UScrime, t.test(U1, U2, paired=TRUE)) Paired t-test data: U1 and U2 www.it-ebooks.info 160 CHAPTER 7 Basic statistics t = 32.4066, df = 46, p-value < 2.2e-16 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: 57.67003 65.30870 sample estimates: mean of the differences 61.48936 The mean difference (61.5) is large enough to warrant rejection of the hypothesis that the mean unemployment rate for older and younger males is the same. Younger males have a higher rate. In fact, the probability of obtaining a sample difference this large if the population means are equal is less than 0.00000000000000022 (that is, 2.2e–16). 7.4.3 7.5 When there are more than two groups What do you do if you want to compare more than two groups? If you can assume that the data are independently sampled from normal populations, you can use analysis of variance (ANOVA). ANOVA is a comprehensive methodology that covers many experimental and quasi-experimental designs. As such, it has earned its own chapter. Feel free to abandon this section and jump to chapter 9 at any time. Nonparametric tests of group differences If you’re unable to meet the parametric assumptions of a t-test or ANOVA, you can turn to nonparametric approaches. For example, if the outcome variables are severely skewed or ordinal in nature, you may wish to use the techniques in this section. 7.5.1 Comparing two groups If the two groups are independent, you can use the Wilcoxon rank sum test (more popularly known as the Mann–Whitney U test) to assess whether the observations are sampled from the same probability distribution (that is, whether the probability of obtaining higher scores is greater in one population than the other). The format is either wilcox.test(y ~ x, data) where y is numeric and x is a dichotomous variable, or wilcox.test(y1, y2) where y1 and y2 are the outcome variables for each group. The optional data argument refers to a matrix or data frame containing the variables. The default is a two-tailed test. You can add the option exact to produce an exact test, and alternative="less" or alternative="greater" to specify a directional test. If you apply the Mann–Whitney U test to the question of incarceration rates from the previous section, you’ll get these results: > with(UScrime, by(Prob, So, median)) So: 0 [1] 0.0382 www.it-ebooks.info Nonparametric tests of group differences 161 -------------------So: 1 [1] 0.0556 > wilcox.test(Prob ~ So, data=UScrime) Wilcoxon rank sum test data: Prob by So W = 81, p-value = 8.488e-05 alternative hypothesis: true location shift is not equal to 0 Again, you can reject the hypothesis that incarceration rates are the same in Southern and non-Southern states (p < .001). The Wilcoxon signed rank test provides a nonparametric alternative to the dependent sample t-test. It’s appropriate in situations where the groups are paired and the assumption of normality is unwarranted. The format is identical to the Mann–Whitney U test, but you add the paired=TRUE option. 
Let’s apply it to the unemployment question from the previous section: > sapply(UScrime[c("U1","U2")], median) U1 U2 92 34 > with(UScrime, wilcox.test(U1, U2, paired=TRUE)) Wilcoxon signed rank test with continuity correction data: U1 and U2 V = 1128, p-value = 2.464e-09 alternative hypothesis: true location shift is not equal to 0 Again, you reach the same conclusion reached with the paired t-test. In this case, the parametric t-tests and their nonparametric equivalents reach the same conclusions. When the assumptions for the t-tests are reasonable, the parametric tests are more powerful (more likely to find a difference if it exists). The nonparametric tests are more appropriate when the assumptions are grossly unreasonable (for example, rank-ordered data). 7.5.2 Comparing more than two groups When there are more than two groups to be compared, you must turn to other methods. Consider the state.x77 dataset from section 7.4. It contains population, income, illiteracy rate, life expectancy, murder rate, and high school graduation rate data for US states. What if you want to compare the illiteracy rates in four regions of the country (Northeast, South, North Central, and West)? This is called a one-way design, and there are both parametric and nonparametric approaches available to address the question. If you can’t meet the assumptions of ANOVA designs, you can use nonparametric methods to evaluate group differences. If the groups are independent, a Kruskal–Wallis test provides a useful approach. If the groups are dependent (for example, repeated measures or randomized block design), the Friedman test is more appropriate. www.it-ebooks.info 162 CHAPTER 7 Basic statistics The format for the Kruskal–Wallis test is kruskal.test(y ~ A, data) where y is a numeric outcome variable and A is a grouping variable with two or more levels (if there are two levels, it’s equivalent to the Mann–Whitney U test). For the Friedman test, the format is friedman.test(y ~ A | B, data) where y is the numeric outcome variable, A is a grouping variable, and B is a blocking variable that identifies matched observations. In both cases, data is an option argument specifying a matrix or data frame containing the variables. Let’s apply the Kruskal–Wallis test to the illiteracy question. First, you’ll have to add the region designations to the dataset. These are contained in the dataset state.region distributed with the base installation of R: states <- data.frame(state.region, state.x77) Now you can apply the test: > kruskal.test(Illiteracy ~ state.region, data=states) Kruskal-Wallis rank sum test data: states$Illiteracy by states$state.region Kruskal-Wallis chi-squared = 22.7, df = 3, p-value = 4.726e-05 The significance test suggests that the illiteracy rate isn’t the same in each of the four regions of the country (p <.001). Although you can reject the null hypothesis of no difference, the test doesn’t tell you which regions differ significantly from each other. To answer this question, you could compare groups two at a time using the Wilcoxon test. A more elegant approach is to apply a multiple-comparisons procedure that computes all pairwise comparisons, while controlling the type I error rate (the probability of finding a difference that isn’t there). I have created a function called wmc() that can be used for this purpose. It compares groups two at a time using the Wilcoxon test and adjusts the probability values using the p.adj() function. 
To be honest, I’m stretching the definition of basic in the chapter title quite a bit, but because the function fits well here, I hope you’ll bear with me. You can download a text file containing wmc() from www.statmethods.net/RiA/wmc.txt. The following listing uses this function to compare the illiteracy rates in the four US regions. Listing 7.17 Nonparametric multiple comparisons > source("http://www.statmethods.net/RiA/wmc.txt") > states <- data.frame(state.region, state.x77) > wmc(Illiteracy ~ state.region, data=states, method="holm") Descriptive Statistics n West North Central Northeast South 13.00 12.00 9.0 16.00 www.it-ebooks.info c Basic statistics b Accesses the function 163 Visualizing group differences median mad 0.60 0.15 0.70 0.15 1.1 0.3 1.75 0.59 Multiple Comparisons (Wilcoxon Rank Sum Tests) Probability Adjustment = holm Group.1 Group.2 1 West North Central 2 West Northeast 3 West South 4 North Central Northeast 5 North Central South 6 Northeast South --Signif. codes: 0 '***' 0.001 W 88 46 39 20 2 18 d Pairwise comparisons p 8.7e-01 8.7e-01 1.8e-02 * 5.4e-02 . 8.1e-05 *** 1.2e-02 * '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 The source() function downloads and executes the R script defining the wmc() function b. The function’s format is wmc(y ~ A, data, method), where y is a numeric outcome variable, A is a grouping variable, data is the data frame containing these variables, and method is the approach used to limit Type I errors. Listing 7.17 uses an adjustment method developed by Holm (1979). It provides strong control of the family-wise error rate (the probability of making one or more Type I errors in a set of comparisons). See help(p.adjust) for a description of the other methods available. The wmc() function first provides the sample sizes, medians, and median absolute deviations for each group c. The West has the lowest illiteracy rate, and the South has the highest. The function then generates six statistical comparisons (West versus North Central, West versus Northeast, West versus South, North Central versus Northeast, North Central versus South, and Northeast versus South) d. You can see from the two-sided p-values (p) that the South differs significantly from the other three regions and that the other three regions don’t differ from each other at a p < .05 level. Nonparametric multiple comparisons are a useful set of techniques that aren’t easily accessible in R. In chapter 21, you’ll have an opportunity to expand the wmc() function into a fully developed package that includes error checking and informative graphics. 7.6 Visualizing group differences In sections 7.4 and 7.5, we looked at statistical methods for comparing groups. Examining group differences visually is also a crucial part of a comprehensive data-analysis strategy. It allows you to assess the magnitude of the differences, identify any distributional characteristics that influence the results (such as skew, bimodality, or outliers), and evaluate the appropriateness of the test assumptions. R provides a wide range of graphical methods for comparing groups, including box plots (simple, notched, and violin), covered in section 6.5; overlapping kernel density plots, covered in section 6.4.1; and graphical methods for visualizing outcomes in an ANOVA framework, discussed in chapter 9. Advanced methods for visualizing group differences, including grouping and faceting, are discussed in chapter 19. 
7.7 Summary
In this chapter, we reviewed the functions in R that provide basic statistical summaries and tests. We looked at sample statistics and frequency tables, tests of independence and measures of association for categorical variables, correlations between quantitative variables (and their associated significance tests), and comparisons of two or more groups on a quantitative outcome variable.
In the next chapter, we'll explore simple and multiple regression, where the focus is on understanding relationships between one (simple) or more than one (multiple) predictor variables and a predicted or criterion variable. Graphical methods will help you diagnose potential problems, evaluate and improve the fit of your models, and uncover unexpected gems of information in your data.

Part 3 Intermediate methods
Whereas part 2 of this book covered basic graphical and statistical methods, part 3 discusses intermediate methods. We move from describing the relationship between two variables to, in chapter 8, using regression models to model the relationship between a numerical outcome variable and a set of numeric and/or categorical predictor variables. Modeling data is typically a complex, multistep, interactive process. Chapter 8 provides step-by-step coverage of the methods available for fitting linear models, evaluating their appropriateness, and interpreting their meaning.
Chapter 9 considers the analysis of basic experimental and quasi-experimental designs through the analysis of variance and its variants. Here we're interested in how treatment combinations or conditions affect a numerical outcome variable. The chapter introduces the functions in R that are used to perform an analysis of variance, analysis of covariance, repeated measures analysis of variance, multifactor analysis of variance, and multivariate analysis of variance. Methods for assessing the appropriateness of these analyses and visualizing the results are also discussed.
In designing experimental and quasi-experimental studies, it's important to determine whether the sample size is adequate for detecting the effects of interest (power analysis). Otherwise, why conduct the study? A detailed treatment of power analysis is provided in chapter 10. Starting with a discussion of hypothesis testing, the presentation focuses on how to use R functions to determine the sample size necessary to detect a treatment effect of a given size with a given degree of confidence. This can help you to plan studies that are likely to yield useful results.
Chapter 11 expands on the material in chapter 5 by covering the creation of graphs that help you to visualize relationships among two or more variables. This includes the various types of 2D and 3D scatter plots, scatter-plot matrices, line plots, and bubble plots. It also introduces the very useful, but less well-known, corrgrams and mosaic plots.
The linear models described in chapters 8 and 9 assume that the outcome or response variable is not only numeric, but also randomly sampled from a normal distribution. There are situations where this distributional assumption is untenable. Chapter 12 presents analytic methods that work well in cases where data is sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is mathematically intractable.
They include both resampling and bootstrapping approaches—computer-intensive methods that are powerfully implemented in R. The methods described in this chapter will allow you to devise hypothesis tests for data that don’t fit traditional parametric assumptions. After completing part 3, you’ll have the tools to analyze most common dataanalytic problems encountered in practice. And you’ll be able to create some gorgeous graphs! www.it-ebooks.info Regression This chapter covers ■ Fitting and interpreting linear models ■ Evaluating model assumptions ■ Selecting among competing models In many ways, regression analysis lives at the heart of statistics. It’s a broad term for a set of methodologies used to predict a response variable (also called a dependent, criterion, or outcome variable) from one or more predictor variables (also called independent or explanatory variables). In general, regression analysis can be used to identify the explanatory variables that are related to a response variable, to describe the form of the relationships involved, and to provide an equation for predicting the response variable from the explanatory variables. For example, an exercise physiologist might use regression analysis to develop an equation for predicting the expected number of calories a person will burn while exercising on a treadmill. The response variable is the number of calories burned (calculated from the amount of oxygen consumed), and the predictor variables might include duration of exercise (minutes), percentage of time spent at their target heart rate, average speed (mph), age (years), gender, and body mass index (BMI). 167 www.it-ebooks.info 168 CHAPTER 8 Regression From a theoretical point of view, the analysis will help answer such questions as these: ■ ■ ■ What’s the relationship between exercise duration and calories burned? Is it linear or curvilinear? For example, does exercise have less impact on the number of calories burned after a certain point? How does effort (the percentage of time at the target heart rate, the average walking speed) factor in? Are these relationships the same for young and old, male and female, heavy and slim? From a practical point of view, the analysis will help answer such questions as the following: ■ ■ ■ How many calories can a 30-year-old man with a BMI of 28.7 expect to burn if he walks for 45 minutes at an average speed of 4 miles per hour and stays within his target heart rate 80% of the time? What’s the minimum number of variables you need to collect in order to accurately predict the number of calories a person will burn when walking? How accurate will your prediction tend to be? Because regression analysis plays such a central role in modern statistics, we’ll cover it in some depth in this chapter. First, we’ll look at how to fit and interpret regression models. Next, we’ll review a set of techniques for identifying potential problems with these models and how to deal with them. Third, we’ll explore the issue of variable selection. Of all the potential predictor variables available, how do you decide which ones to include in your final model? Fourth, we’ll address the question of generalizability. How well will your model work when you apply it in the real world? Finally, we’ll consider relative importance. Of all the predictors in your model, which one is the most important, the second most important, and the least important? As you can see, we’re covering a lot of ground. 
Effective regression analysis is an interactive, holistic process with many steps, and it involves more than a little skill. Rather than break it up into multiple chapters, I’ve opted to present this topic in a single chapter in order to capture this flavor. As a result, this will be the longest and most involved chapter in the book. Stick with it to the end, and you’ll have all the tools you need to tackle a wide variety of research questions. Promise! 8.1 The many faces of regression The term regression can be confusing because there are so many specialized varieties (see table 8.1). In addition, R has powerful and comprehensive features for fitting regression models, and the abundance of options can be confusing as well. For example, in 2005, Vito Ricci created a list of more than 205 functions in R that are used to generate regression analyses (http://mng.bz/NJhu). www.it-ebooks.info The many faces of regression Table 8.1 169 Varieties of regression analysis Type of regression Typical use Simple linear Predicting a quantitative response variable from a quantitative explanatory variable. Polynomial Predicting a quantitative response variable from a quantitative explanatory variable, where the relationship is modeled as an nth order polynomial. Multiple linear Predicting a quantitative response variable from two or more explanatory variables. Multilevel Predicting a response variable from data that have a hierarchical structure (for example, students within classrooms within schools). Also called hierarchical, nested, or mixed models. Multivariate Predicting more than one response variable from one or more explanatory variables. Logistic Predicting a categorical response variable from one or more explanatory variables. Poisson Predicting a response variable representing counts from one or more explanatory variables. Cox proportional hazards Predicting time to an event (death, failure, relapse) from one or more explanatory variables. Time-series Modeling time-series data with correlated errors. Nonlinear Predicting a quantitative response variable from one or more explanatory variables, where the form of the model is nonlinear. Nonparametric Predicting a quantitative response variable from one or more explanatory variables, where the form of the model is derived from the data and not specified a priori. Robust Predicting a quantitative response variable from one or more explanatory variables using an approach that’s resistant to the effect of influential observations. In this chapter, we’ll focus on regression methods that fall under the rubric of ordinary least squares (OLS) regression, including simple linear regression, polynomial regression, and multiple linear regression. OLS regression is the most common variety of statistical analysis today. Other types of regression models (including logistic regression and Poisson regression) will be covered in chapter 13. 8.1.1 Scenarios for using OLS regression In OLS regression, a quantitative dependent variable is predicted from a weighted sum of predictor variables, where the weights are parameters estimated from the data. Let’s take a look at a concrete example (no pun intended), loosely adapted from Fwa (2006). www.it-ebooks.info 170 CHAPTER 8 Regression An engineer wants to identify the most important factors related to bridge deterioration (such as age, traffic volume, bridge design, construction materials and methods, construction quality, and weather conditions) and determine the mathematical form of these relationships. 
She collects data on each of these variables from a representative sample of bridges and models the data using OLS regression. The approach is highly interactive. She fits a series of models, checks their compliance with underlying statistical assumptions, explores any unexpected or aberrant findings, and finally chooses the “best” model from among many possible models. If successful, the results will help her to ■ ■ ■ Focus on important variables, by determining which of the many collected variables are useful in predicting bridge deterioration, along with their relative importance. Look for bridges that are likely to be in trouble, by providing an equation that can be used to predict bridge deterioration for new cases (where the values of the predictor variables are known, but the degree of bridge deterioration isn’t). Take advantage of serendipity, by identifying unusual bridges. If she finds that some bridges deteriorate much faster or slower than predicted by the model, a study of these outliers may yield important findings that could help her to understand the mechanisms involved in bridge deterioration. Bridges may hold no interest for you. I’m a clinical psychologist and statistician, and I know next to nothing about civil engineering. But the general principles apply to an amazingly wide selection of problems in the physical, biological, and social sciences. Each of the following questions could also be addressed using an OLS approach: ■ ■ ■ ■ ■ ■ What’s the relationship between surface stream salinity and paved road surface area (Montgomery, 2007)? What aspects of a user’s experience contribute to the overuse of massively multiplayer online role playing games (MMORPGs) (Hsu, Wen, & Wu, 2009)? Which qualities of an educational environment are most strongly related to higher student achievement scores? What’s the form of the relationship between blood pressure, salt intake, and age? Is it the same for men and women? What’s the impact of stadiums and professional sports on metropolitan area development (Baade & Dye, 1990)? What factors account for interstate differences in the price of beer (Culbertson & Bradford, 1991)? (That one got your attention!) Our primary limitation is our ability to formulate an interesting question, devise a useful response variable to measure, and gather appropriate data. 8.1.2 What you need to know For the remainder of this chapter, I’ll describe how to use R functions to fit OLS regression models, evaluate the fit, test assumptions, and select among competing www.it-ebooks.info 171 OLS regression models. I assume you’ve had exposure to least squares regression as typically taught in a second-semester undergraduate statistics course. But I’ve made efforts to keep the mathematical notation to a minimum and focus on practical rather than theoretical issues. A number of excellent texts are available that cover the statistical material outlined in this chapter. My favorites are John Fox’s Applied Regression Analysis and Generalized Linear Models (for theory) and An R and S-Plus Companion to Applied Regression (for application). They both served as major sources for this chapter. A good nontechnical overview is provided by Licht (1995). 8.2 OLS regression For most of this chapter, we’ll be predicting the response variable from a set of predictor variables (also called regressing the response variable on the predictor variables— hence the name) using OLS. OLS regression fits models of the form Yi = β0 + β1 X1i + ... + βk Xki i = 1 ... 
n where n is the number of observations and k is the number of predictor variables. (Although I’ve tried to keep equations out of these discussions, this is one of the few places where it simplifies things.) In this equation: Yi is the predicted value of the dependent variable for observation i (specifically, it’s the estimated mean of the Y distribution, conditional on the set of predictor values). Xji is the jth predictor value for the ith observation. β0 is the intercept (the predicted value of Y when all the predictor variables equal zero). βj is the regression coefficient for the jth predictor (slope representing the change in Y for a unit change in Xj). Our goal is to select model parameters (intercept and slopes) that minimize the difference between actual response values and those predicted by the model. Specifically, model parameters are selected to minimize the sum of squared residuals: n ∑ ( Yi − Yi)2 = i =1 n ∑ ( Yi − β0 + β1X1i + ... + βkXki)2 = i =1 n ∑ i =1 ε i2 To properly interpret the coefficients of the OLS model, you must satisfy a number of statistical assumptions: ■ ■ ■ Normality—For fixed values of the independent variables, the dependent variable is normally distributed. Independence—The Yi values are independent of each other. Linearity—The dependent variable is linearly related to the independent variables. www.it-ebooks.info 172 CHAPTER 8 ■ Regression Homoscedasticity—The variance of the dependent variable doesn’t vary with the levels of the independent variables. (I could call this constant variance, but saying homoscedasticity makes me feel smarter.) If you violate these assumptions, your statistical significance tests and confidence intervals may not be accurate. Note that OLS regression also assumes that the independent variables are fixed and measured without error, but this assumption is typically relaxed in practice. 8.2.1 Fitting regression models with lm() In R, the basic function for fitting a linear model is lm(). The format is myfit <- lm(formula, data) where formula describes the model to be fit and data is the data frame containing the data to be used in fitting the model. The resulting object (myfit, in this case) is a list that contains extensive information about the fitted model. The formula is typically written as Y ~ X1 + X2 + ... + Xk where the ~ separates the response variable on the left from the predictor variables on the right, and the predictor variables are separated by + signs. Other symbols can be used to modify the formula in various ways (see table 8.2). Table 8.2 Symbols commonly used in R formulas Symbol Usage ~ Separates response variables on the left from the explanatory variables on the right. For example, a prediction of y from x, z, and w would be coded y ~ x + z + w. + Separates predictor variables. : Denotes an interaction between predictor variables. A prediction of y from x, z, and the interaction between x and z would be coded y ~ x + z + x:z. * A shortcut for denoting all possible interactions. The code y ~ x * z * w expands to y ~ x + z + w + x:z + x:w + z:w + x:z:w. ^ Denotes interactions up to a specified degree. The code y ~ (x + z + w)^2 expands to y ~ x + z + w + x:z + x:w + z:w. . A placeholder for all other variables in the data frame except the dependent variable. For example, if a data frame contained the variables x, y, z, and w, then the code y ~ . would expand to y ~ x + z + w. - A minus sign removes a variable from the equation. For example, y ~ (x + z + w)^2 – x:w expands to y ~ x + z + w + x:z + z:w. 
-1 Suppresses the intercept. For example, the formula y ~ x -1 fits a regression of y on x, and forces the line through the origin at x=0. www.it-ebooks.info 173 OLS regression Table 8.2 Symbols commonly used in R formulas Symbol Usage I() Elements within the parentheses are interpreted arithmetically. For example, y ~ x + (z + w)^2 would expand to y ~ x + z + w + z:w. In contrast, the code y ~ x + I((z + w)^2) would expand to y ~ x + h, where h is a new variable created by squaring the sum of z and w. function Mathematical functions can be used in formulas. For example, log(y) ~ x + z + w would predict log(y) from x, z, and w. In addition to lm(), table 8.3 lists several functions that are useful when generating a simple or multiple regression analysis. Each of these functions is applied to the object returned by lm() in order to generate additional information based on that fitted model. Table 8.3 Other functions that are useful when fitting linear models Function Action summary() Displays detailed results for the fitted model coefficients() Lists the model parameters (intercept and slopes) for the fitted model confint() Provides confidence intervals for the model parameters (95% by default) fitted() Lists the predicted values in a fitted model residuals() Lists the residual values in a fitted model anova() Generates an ANOVA table for a fitted model, or an ANOVA table comparing two or more fitted models vcov() Lists the covariance matrix for model parameters AIC() Prints Akaike’s Information Criterion plot() Generates diagnostic plots for evaluating the fit of a model predict() Uses a fitted model to predict response values for a new dataset When the regression model contains one dependent variable and one independent variable, the approach is called simple linear regression. When there’s one predictor variable but powers of the variable are included (for example, X, X2, X3), it’s called polynomial regression. When there’s more than one predictor variable, it’s called multiple linear regression. We’ll start with an example of simple linear regression, then progress to examples of polynomial and multiple linear regression, and end with an example of multiple regression that includes an interaction among the predictors. 8.2.2 Simple linear regression Let’s look at the functions in table 8.3 through a simple regression example. The dataset women in the base installation provides the height and weight for a set of 15 women www.it-ebooks.info Figure 8.1 Scatter plot with regression line for weight predicted from height 160 150 140 Weight (in pounds) ages 30 to 39. Suppose you want to predict weight from height. Having an equation for predicting weight from height can help you to identify overweight or underweight individuals. The analysis is provided in the following listing, and the resulting graph is shown in figure 8.1. Regression 130 CHAPTER 8 120 174 58 60 62 64 66 Height (in inches) Listing 8.1 Simple linear regression > fit <- lm(weight ~ height, data=women) > summary(fit) Call: lm(formula=weight ~ height, data=women) Residuals: Min 1Q Median -1.733 -1.133 -0.383 3Q 0.742 Max 3.117 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -87.5167 5.9369 -14.7 1.7e-09 *** height 3.4500 0.0911 37.9 1.1e-14 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 '' 1 Residual standard error: 1.53 on 13 degrees of freedom Multiple R-squared: 0.991, Adjusted R-squared: 0.99 F-statistic: 1.43e+03 on 1 and 13 DF, p-value: 1.09e-14 > women$weight [1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164 > fitted(fit) 1 2 3 4 5 6 7 8 9 112.58 116.03 119.48 122.93 126.38 129.83 133.28 136.73 140.18 10 11 12 13 14 15 143.63 147.08 150.53 153.98 157.43 160.88 > residuals(fit) www.it-ebooks.info 68 70 72 OLS regression 1 2.42 12 -0.53 2 0.97 13 0.02 3 0.52 14 1.57 175 4 5 6 7 8 9 10 11 0.07 -0.38 -0.83 -1.28 -1.73 -1.18 -1.63 -1.08 15 3.12 > plot(women$height,women$weight, xlab="Height (in inches)", ylab="Weight (in pounds)") > abline(fit) From the output, you see that the prediction equation is ^ Weight = − 87.52 + 3.45 × Height Because a height of 0 is impossible, you wouldn’t try to give a physical interpretation to the intercept. It merely becomes an adjustment constant. From the Pr(>|t|) column, you see that the regression coefficient (3.45) is significantly different from zero (p < 0.001) and indicates that there’s an expected increase of 3.45 pounds of weight for every 1 inch increase in height. The multiple R-squared (0.991) indicates that the model accounts for 99.1% of the variance in weights. The multiple R-squared is also the squared correlation between the actual and predicted value (that is, R2 = ryy). The residual standard error (1.53 pounds) can be thought of as the average error in predicting weight from height using this model. The F statistic tests whether the predictor variables, taken together, predict the response variable above chance levels. Because there’s only one predictor variable in simple regression, in this example the F test is equivalent to the t-test for the regression coefficient for height. For demonstration purposes, we’ve printed out the actual, predicted, and residual values. Evidently, the largest residuals occur for low and high heights, which can also be seen in the plot (figure 8.1). The plot suggests that you might be able to improve on the prediction by using a line with one bend. For example, a model of the form Yi = β0 + β1X + β2X2 may provide a better fit to the data. Polynomial regression allows you to predict a response variable from an explanatory variable, where the form of the relationship is an nthdegree polynomial. 8.2.3 Polynomial regression The plot in figure 8.1 suggests that you might be able to improve your prediction using a regression with a quadratic term (that is, X 2). You can fit a quadratic equation using the statement fit2 <- lm(weight ~ height + I(height^2), data=women) The new term I(height^2) requires explanation. height^2 adds a height-squared term to the prediction equation. The I function treats the contents within the parentheses as an R regular expression. You need this because the ^ operator has a special meaning in formulas that you don’t want to invoke here (see table 8.2). www.it-ebooks.info 176 Regression CHAPTER 8 The following listing shows the results of fitting the quadratic equation. Listing 8.2 Polynomial regression > fit2 <- lm(weight ~ height + I(height^2), data=women) > summary(fit2) Call: lm(formula=weight ~ height + I(height^2), data=women) Residuals: Min 1Q Median -0.5094 -0.2961 -0.0094 3Q 0.2862 Max 0.5971 Coefficients: Estimate Std. Error t (Intercept) 261.87818 25.19677 height -7.34832 0.77769 I(height^2) 0.08306 0.00598 --Signif. codes: 0 '***' 0.001 '**' value Pr(>|t|) 10.39 2.4e-07 *** -9.45 6.6e-07 *** 13.89 9.3e-09 *** 0.01 '*' 0.05 '.' 
0.1 ' ' 1 Residual standard error: 0.384 on 12 degrees of freedom Multiple R-squared: 0.999, Adjusted R-squared: 0.999 F-statistic: 1.14e+04 on 2 and 12 DF, p-value: <2e-16 > plot(women$height,women$weight, xlab="Height (in inches)", ylab="Weight (in lbs)") > lines(women$height,fitted(fit2)) From this new analysis, the prediction equation is ^ 150 140 130 120 Weight (in lbs) and both regression coefficients are significant at the p < 0.0001 level. The amount of variance accounted for has increased to 99.9%. The significance of the squared term (t = 13.89, p < .001) suggests that inclusion of the quadratic term improves the model fit. If you look at the plot of fit2 (figure 8.2) you can see that the curve does indeed provide a better fit. 160 Weight = 261.88 − 7 . 3 5 × Height + 0.083 × Height2 Figure 8.2 Quadratic regression for weight predicted by height 58 60 62 64 66 Height (in inches) www.it-ebooks.info 68 70 72 177 OLS regression Linear vs. nonlinear models Note that this polynomial equation still fits under the rubric of linear regression. It’s linear because the equation involves a weighted sum of predictor variables (height and height-squared in this case). Even a model such as Yi = β0 × logX1 + β2 × sinX2 would be considered a linear model (linear in terms of the parameters) and fit with the formula Y ~ log(X1) + sin(X2) In contrast, here’s an example of a truly nonlinear model: Yi = β0 + β1 ex/β2 Nonlinear models of this form can be fit with the nls() function. In general, an nth-degree polynomial produces a curve with n-1 bends. To fit a cubic polynomial, you’d use fit3 <- lm(weight ~ height + I(height^2) +I(height^3), data=women) Although higher polynomials are possible, I’ve rarely found that terms higher than cubic are necessary. Before we move on, I should mention that the scatterplot() function in the car package provides a simple and convenient method of plotting a bivariate relationship. The code library(car) scatterplot(weight ~ height, data=women, spread=FALSE, smoother.args=list(lty=2), pch=19, main="Women Age 30-39", xlab="Height (inches)", ylab="Weight (lbs.)") Figure 8.3 Scatter plot of height by weight, with linear and smoothed fits, and marginal box plots 140 130 120 Weight (lbs.) 150 160 produces the graph in figure 8.3. Women Age 30−39 58 60 62 64 66 Height (inches) www.it-ebooks.info 68 70 72 178 CHAPTER 8 Regression This enhanced plot provides the scatter plot of weight with height, box plots for each variable in their respective margins, the linear line of best fit, and a smoothed (loess) fit line. The spread=FALSE options suppress spread and asymmetry information. The smoother.args=list(lty=2)option specifies that the loess fit be rendered as a dashed line. The pch=19 options display points as filled circles (the default is open circles). You can tell at a glance that the two variables are roughly symmetrical and that a curved line will fit the data points better than a straight line. 8.2.4 Multiple linear regression When there’s more than one predictor variable, simple linear regression becomes multiple linear regression, and the analysis grows more involved. Technically, polynomial regression is a special case of multiple regression. Quadratic regression has two predictors (X and X 2), and cubic regression has three predictors (X, X 2, and X 3). Let’s look at a more general example. We’ll use the state.x77 dataset in the base package for this example. 
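Because state.x77 is distributed as a numeric matrix rather than a data frame, it's worth a quick look at its structure before modeling; a minimal sketch:

# state.x77 is a numeric matrix (50 states by 8 variables) with state names
# as row names, so it will need conversion to a data frame before using lm()
class(state.x77)
colnames(state.x77)
head(state.x77[, c("Murder", "Population", "Illiteracy", "Income", "Frost")])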
Suppose you want to explore the relationship between a state’s murder rate and other characteristics of the state, including population, illiteracy rate, average income, and frost levels (mean number of days below freezing). Because the lm() function requires a data frame (and the state.x77 dataset is contained in a matrix), you can simplify your life with the following code: states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) This code creates a data frame called states, containing the variables you’re interested in. You’ll use this new data frame for the remainder of the chapter. A good first step in multiple regression is to examine the relationships among the variables two at a time. The bivariate correlations are provided by the cor() function, and scatter plots are generated from the scatterplotMatrix() function in the car package (see the following listing and figure 8.4). Listing 8.3 Examining bivariate relationships > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > cor(states) Murder Population Illiteracy Income Frost Murder 1.00 0.34 0.70 -0.23 -0.54 Population 0.34 1.00 0.11 0.21 -0.33 Illiteracy 0.70 0.11 1.00 -0.44 -0.67 Income -0.23 0.21 -0.44 1.00 0.23 Frost -0.54 -0.33 -0.67 0.23 1.00 > library(car) > scatterplotMatrix(states, spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix") www.it-ebooks.info 179 OLS regression Scatterplot matrix 10000 20000 3000 4500 6000 14 0 20000 2 6 10 Murder 0 10000 Population 6000 0.5 1.5 2.5 Illiteracy 3000 4500 Income 0 50 100 Frost 2 6 10 14 0.5 1.5 2.5 0 50 100 Figure 8.4 Scatter plot matrix of dependent and independent variables for the states data, including linear and smoothed fits, and marginal distributions (kernel-density plots and rug plots) By default, the scatterplotMatrix() function provides scatter plots of the variables with each other in the off-diagonals and superimposes smoothed (loess) and linear fit lines on these plots. The principal diagonal contains density and rug plots for each variable. You can see that murder rate may be bimodal and that each of the predictor variables is skewed to some extent. Murder rates rise with population and illiteracy, and they fall with higher income levels and frost. At the same time, colder states have lower illiteracy rates and population and higher incomes. Now let’s fit the multiple regression model with the lm() function (see the following listing). Listing 8.4 Multiple linear regression > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) > summary(fit) www.it-ebooks.info 180 CHAPTER 8 Regression Call: lm(formula=Murder ~ Population + Illiteracy + Income + Frost, data=states) Residuals: Min 1Q Median -4.7960 -1.6495 -0.0811 3Q 1.4815 Max 7.6210 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.23e+00 3.87e+00 0.32 0.751 Population 2.24e-04 9.05e-05 2.47 0.017 * Illiteracy 4.14e+00 8.74e-01 4.74 2.2e-05 *** Income 6.44e-05 6.84e-04 0.09 0.925 Frost 5.81e-04 1.01e-02 0.06 0.954 --Signif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.v 0.1 'v' 1 Residual standard error: 2.5 on 45 degrees of freedom Multiple R-squared: 0.567, Adjusted R-squared: 0.528 F-statistic: 14.7 on 4 and 45 DF, p-value: 9.13e-08 When there’s more than one predictor variable, the regression coefficients indicate the increase in the dependent variable for a unit change in a predictor variable, holding all other predictor variables constant. For example, the regression coefficient for Illiteracy is 4.14, suggesting that an increase of 1% in illiteracy is associated with a 4.14% increase in the murder rate, controlling for population, income, and temperature. The coefficient is significantly different from zero at the p < .0001 level. On the other hand, the coefficient for Frost isn’t significantly different from zero (p = 0.954) suggesting that Frost and Murder aren’t linearly related when controlling for the other predictor variables. Taken together, the predictor variables account for 57% of the variance in murder rates across states. Up to this point, we’ve assumed that the predictor variables don’t interact. In the next section, we’ll consider a case in which they do. 8.2.5 Multiple linear regression with interactions Some of the most interesting research findings are those involving interactions among predictor variables. Consider the automobile data in the mtcars data frame. Let’s say that you’re interested in the impact of automobile weight and horsepower on mileage. You could fit a regression model that includes both predictors, along with their interaction, as shown in the next listing. Listing 8.5 Multiple linear regression with a significant interaction term > fit <- lm(mpg ~ hp + wt + hp:wt, data=mtcars) > summary(fit) www.it-ebooks.info OLS regression 181 Call: lm(formula=mpg ~ hp + wt + hp:wt, data=mtcars) Residuals: Min 1Q Median -3.063 -1.649 -0.736 3Q 1.421 Max 4.551 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 49.80842 3.60516 13.82 5.0e-14 *** hp -0.12010 0.02470 -4.86 4.0e-05 *** wt -8.21662 1.26971 -6.47 5.2e-07 *** hp:wt 0.02785 0.00742 3.75 0.00081 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.1 on 28 degrees of freedom Multiple R-squared: 0.885, Adjusted R-squared: 0.872 F-statistic: 71.7 on 3 and 28 DF, p-value: 2.98e-13 You can see from the Pr(>|t|) column that the interaction between horsepower and car weight is significant. What does this mean? A significant interaction between two predictor variables tells you that the relationship between one predictor and the response variable depends on the level of the other predictor. Here it means the relationship between miles per gallon and horsepower varies by car weight. The model for predicting mpg is mpg = 49.81 – 0.12 × hp – 8.22 × wt + 0.03 × hp × wt. To interpret the interaction, you can plug in various values of wt and simplify the equation. For example, you can try the mean of wt (3.2) and one standard deviation below and above the mean (2.2 and 4.2, respectively). For wt=2.2, the equation simplifies to mpg = 49.81 – 0.12 × hp – 8.22 × (2.2) + 0.03 × hp × (2.2) = 31.41 – 0.06 × hp. For wt=3.2, this becomes mpg = 23.37 – 0.03 × hp. Finally, for wt=4.2 the equation becomes mpg = 15.33 – 0.003 × hp. You see that as weight increases (2.2, 3.2, 4.2), the expected change in mpg from a unit increase in hp decreases (0.06, 0.03, 0.003). You can visualize interactions using the effect() function in the effects package. 
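Before turning to the effect() function, here's a small sketch (not from the book) showing how those conditional slopes can be pulled directly from the fitted coefficients rather than computed by hand:

# Slope of hp at selected values of wt, taken from the coefficients of
# mpg = b0 + b1*hp + b2*wt + b3*hp*wt (the fit from listing 8.5)
b <- coef(fit)
for (w in c(2.2, 3.2, 4.2)) {
    cat("wt =", w, "  hp slope =", round(b["hp"] + b["hp:wt"]*w, 3), "\n")
}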
The format is plot(effect(term, mod,, xlevels), multiline=TRUE) where term is the quoted model term to plot, mod is the fitted model returned by lm(), and xlevels is a list specifying the variables to be set to constant values and the values to employ. The multiline=TRUE option superimposes the lines being plotted. For the previous model, this becomes library(effects) plot(effect("hp:wt", fit,, list(wt=c(2.2,3.2,4.2))), multiline=TRUE) The resulting graph is displayed in figure 8.5. www.it-ebooks.info 182 hp*wt effect plot wt 2.2 3.2 4.2 25 mpg You can see from this graph that as the weight of the car increases, the relationship between horsepower and miles per gallon weakens. For wt=4.2, the line is almost horizontal, indicating that as hp increases, mpg doesn’t change. Unfortunately, fitting the model is only the first step in the analysis. Once you fit a regression model, you need to evaluate whether you’ve met the statistical assumptions underlying your approach before you can have confidence in the inferences you draw. This is the topic of the next section. 8.3 Regression CHAPTER 8 20 15 50 100 150 200 250 300 Figure 8.5 Interaction plot for hp*wt. This plot displays the relationship between mpg and hp at three values of wt. Regression diagnostics In the previous section, you used the lm() function to fit an OLS regression model and the summary() function to obtain the model parameters and summary statistics. Unfortunately, nothing in this printout tells you whether the model you’ve fit is appropriate. Your confidence in inferences about regression parameters depends on the degree to which you’ve met the statistical assumptions of the OLS model. Although the summary() function in listing 8.4 describes the model, it provides no information concerning the degree to which you’ve satisfied the statistical assumptions underlying the model. Why is this important? Irregularities in the data or misspecifications of the relationships between the predictors and the response variable can lead you to settle on a model that’s wildly inaccurate. On the one hand, you may conclude that a predictor and a response variable are unrelated when, in fact, they are. On the other hand, you may conclude that a predictor and a response variable are related when, in fact, they aren’t! You may also end up with a model that makes poor predictions when applied in real-world settings, with significant and unnecessary error. Let’s look at the output from the confint() function applied to the states multiple regression problem in section 8.2.4: > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) > confint(fit) 2.5 % 97.5 % (Intercept) -6.55e+00 9.021318 www.it-ebooks.info Regression diagnostics Population Illiteracy Income Frost 4.14e-05 2.38e+00 -1.31e-03 -1.97e-02 183 0.000406 5.903874 0.001441 0.020830 The results suggest that you can be 95% confident that the interval [2.38, 5.90] contains the true change in murder rate for a 1% change in illiteracy rate. Additionally, because the confidence interval for Frost contains 0, you can conclude that a change in temperature is unrelated to murder rate, holding the other variables constant. But your faith in these results is only as strong as the evidence you have that your data satisfies the statistical assumptions underlying the model. 
A set of techniques called regression diagnostics provides the necessary tools for evaluating the appropriateness of the regression model and can help you to uncover and correct problems. We’ll start with a standard approach that uses functions that come with R’s base installation. Then we’ll look at newer, improved methods available through the car package. 8.3.1 A typical approach R’s base installation provides numerous methods for evaluating the statistical assumptions in a regression analysis. The most common approach is to apply the plot() function to the object returned by the lm(). Doing so produces four graphs that are useful for evaluating the model fit. Applying this approach to the simple linear regression example fit <- lm(weight ~ height, data=women) par(mfrow=c(2,2)) plot(fit) produces the graphs shown in figure 8.6. The par(mfrow=c(2,2)) statement is used to combine the four plots produced by the plot() function into one large 2 × 2 graph. The par() function is described in chapter 3. To understand these graphs, consider the assumptions of OLS regression: ■ ■ Normality—If the dependent variable is normally distributed for a fixed set of predictor values, then the residual values should be normally distributed with a mean of 0. The Normal Q-Q plot (upper right) is a probability plot of the standardized residuals against the values that would be expected under normality. If you’ve met the normality assumption, the points on this graph should fall on the straight 45-degree line. Because they don’t, you’ve clearly violated the normality assumption. Independence—You can’t tell if the dependent variable values are independent from these plots. You have to use your understanding of how the data was collected. There’s no a priori reason to believe that one woman’s weight influences another woman’s weight. If you found out that the data were sampled from families, you might have to adjust your assumption of independence. www.it-ebooks.info 184 CHAPTER 8 Regression Residuals vs Fitted Normal Q−Q 120 130 140 2 1 −1 −2 8 150 1 0 2 1 0 −1 Residuals 15 Standardized residuals 3 15 1 8 160 −1 Fitted values 1.5 1 1 120 130 140 150 160 1 0 Cook’s distance 0.00 0.05 Fitted values Figure 8.6 0.5 14 −1 Standardized residuals 0.5 1.0 8 2 15 0.0 Standardized residuals Residuals vs Leverage 15 1 ■ 1 Theoretical Quantiles Scale−Location ■ 0 0.10 0.15 0.20 0.25 Leverage Diagnostic plots for the regression of weight on height Linearity—If the dependent variable is linearly related to the independent variables, there should be no systematic relationship between the residuals and the predicted (that is, fitted) values. In other words, the model should capture all the systematic variance present in the data, leaving nothing but random noise. In the Residuals vs. Fitted graph (upper left), you see clear evidence of a curved relationship, which suggests that you may want to add a quadratic term to the regression. Homoscedasticity—If you’ve met the constant variance assumption, the points in the Scale-Location graph (bottom left) should be a random band around a horizontal line. You seem to meet this assumption. Finally, the Residuals vs. Leverage graph (bottom right) provides information about individual observations that you may wish to attend to. The graph identifies outliers, high-leverage points, and influential observations. Specifically: ■ An outlier is an observation that isn’t predicted well by the fitted regression model (that is, has a large positive or negative residual). 
www.it-ebooks.info 185 Regression diagnostics ■ ■ An observation with a high leverage value has an unusual combination of predictor values. That is, it’s an outlier in the predictor space. The dependent variable value isn’t used to calculate an observation’s leverage. An influential observation is an observation that has a disproportionate impact on the determination of the model parameters. Influential observations are identified using a statistic called Cook’s distance, or Cook’s D. To be honest, I find the Residuals vs. Leverage plot difficult to read and not useful. You’ll see better representations of this information in later sections. To complete this section, let’s look at the diagnostic plots for the quadratic fit. The necessary code is fit2 <- lm(weight ~ height + I(height^2), data=women) par(mfrow=c(2,2)) plot(fit2) and the resulting graph is provided in figure 8.7. This second set of plots suggests that the polynomial regression provides a better fit with regard to the linearity assumption, normality of residuals (except for observation Normal Q−Q 2 120 13 130 140 150 0 1 2 15 −1 0.2 Standardized residuals 15 −0.2 −0.6 Residuals 0.6 Residuals vs Fitted 13 160 −1 Fitted values 1.5 1 Residuals vs Leverage 2 1 0.5 1 1.0 0.5 15 0 13 −1 Standardized residuals 15 0.5 Cook’s 13 distance2 0.0 Standardized residuals 0 Theoretical Quantiles Scale−Location 2 120 130 140 150 160 0.0 Fitted values Figure 8.7 2 0.1 0.2 0.3 1 0.4 Leverage Diagnostic plots for the regression of weight on height and height-squared www.it-ebooks.info 186 CHAPTER 8 Regression 13), and homoscedasticity (constant residual variance). Observation 15 appears to be influential (based on a large Cook’s D value), and deleting it has an impact on the parameter estimates. In fact, dropping both observations 13 and 15 produces a better model fit. To see this, try newfit <- lm(weight~ height + I(height^2), data=women[-c(13,15),]) for yourself. But you need to be careful when deleting data. Your models should fit your data, not the other way around! Finally, let’s apply the basic approach to the states multiple regression problem: states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) par(mfrow=c(2,2)) plot(fit) The results are displayed in figure 8.8. As you can see from the graph, the model assumptions appear to be well satisfied, with the exception that Nevada is an outlier. Although these standard diagnostic plots are helpful, better tools are now available in R and I recommend their use over the plot(fit) approach. Residuals vs Fitted Normal Q−Q 4 6 3 2 0 1 Alaska −2 8 10 12 Nevada −1 0 Massachusetts Rhode Island −5 Residuals 5 Standardized residuals Nevada 14 Rhode Island −2 −1 0 1 2 Fitted values Theoretical Quantiles Scale−Location Residuals vs Leverage 3 Nevada 2 1 0.5 0 1 Alaska −1 Standardized residuals 1.5 Hawaii −2 0.5 1.0 Rhode Island Alaska 0.0 Standardized residuals Nevada 0.5 Cook’s distance 1 4 6 8 10 12 14 0.0 Fitted values Figure 8.8 0.1 0.2 0.3 0.4 Leverage Diagnostic plots for the regression of murder rate on state characteristics www.it-ebooks.info Regression diagnostics 8.3.2 187 An enhanced approach The car package provides a number of functions that significantly enhance your ability to fit and evaluate regression models (see table 8.4). 
Table 8.4 Useful functions for regression diagnostics (car package) Function Purpose qqPlot() Quantile comparisons plot durbinWatsonTest() Durbin–Watson test for autocorrelated errors crPlots() Component plus residual plots ncvTest() Score test for nonconstant error variance spreadLevelPlot() Spread-level plots outlierTest() Bonferroni outlier test avPlots() Added variable plots influencePlot() Regression influence plots scatterplot() Enhanced scatter plots scatterplotMatrix() Enhanced scatter plot matrixes vif() Variance inflation factors It’s important to note that there are many changes between version 1.x and version 2.x of the car package, including changes in function names and behavior. This chapter is based on version 2. In addition, the gvlma package provides a global test for linear model assumptions. Let’s look at each in turn, by applying them to our multiple regression example. NORMALITY The qqPlot() function provides a more accurate method of assessing the normality assumption than that provided by the plot() function in the base package. It plots the studentized residuals (also called studentized deleted residuals or jackknifed residuals) against a t distribution with n – p – 1 degrees of freedom, where n is the sample size and p is the number of regression parameters (including the intercept). The code follows: library(car) states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) qqPlot(fit, labels=row.names(states), id.method="identify", simulate=TRUE, main="Q-Q Plot") www.it-ebooks.info 188 CHAPTER 8 Regression 2 1 0 Studentized Residuals(fit) 3 The qqPlot() function generates Q-Q Plot the probability plot displayed in Nevada figure 8.9. The option id.method ="identify" makes the plot interactive—after the graph is drawn, mouse clicks on points in the graph will label them with values specified in the labels option of the function. Pressing the Esc key, selecting Stop from the graph’s drop-down menu, or right-clicking the graph turns off this interactive mode. Here, I identified Nevada. When simulate=TRUE, a 95% con0 1 2 fidence envelope is produced t Quantiles using a parametric bootstrap. Figure 8.9 Q-Q plot for studentized residuals (Bootstrap methods are considered in chapter 12.) With the exception of Nevada, all the points fall close to the line and are within the confidence envelope, suggesting that you’ve met the normality assumption fairly well. But you should definitely look at Nevada. It has a large positive residual (actual – predicted), indicating that the model underestimates the murder rate in this state. Specifically: > states["Nevada",] Nevada Murder Population Illiteracy Income Frost 11.5 590 0.5 5149 188 > fitted(fit)["Nevada"] Nevada 3.878958 > residuals(fit)["Nevada"] Nevada 7.621042 > rstudent(fit)["Nevada"] Nevada 3.542929 Here you see that the murder rate is 11.5%, but the model predicts a 3.9% murder rate. The question that you need to ask is, “Why does Nevada have a higher murder rate than predicted from population, income, illiteracy, and temperature?” Anyone (who hasn’t see Goodfellas) want to guess? www.it-ebooks.info 189 Regression diagnostics For completeness, here’s another way of visualizing errors. Take a look at the code in the next listing. The residplot() function generates a histogram of the studentized residuals and superimposes a normal curve, kernel-density curve, and rug plot. It doesn’t require the car package. 
Listing 8.6 Function for plotting studentized residuals residplot <- function(fit, nbreaks=10) { z <- rstudent(fit) hist(z, breaks=nbreaks, freq=FALSE, xlab="Studentized Residual", main="Distribution of Errors") rug(jitter(z), col="brown") curve(dnorm(x, mean=mean(z), sd=sd(z)), add=TRUE, col="blue", lwd=2) lines(density(z)$x, density(z)$y, col="red", lwd=2, lty=2) legend("topright", legend = c( "Normal Curve", "Kernel Density Curve"), lty=1:2, col=c("blue","red"), cex=.7) } residplot(fit) The results are displayed in figure 8.10. As you can see, the errors follow a normal distribution quite well, with the exception of a large outlier. Although the Q-Q plot is probably more informative, I’ve always 0.4 Distribution of Errors 0.2 0.1 Figure 8.10 Distribution of studentized residuals using the residplot() function 0.0 Density 0.3 Normal Curve Kernel Density Cur ve −2 −1 0 1 2 Studentized Residual www.it-ebooks.info 3 4 190 CHAPTER 8 Regression found it easier to gauge the skew of a distribution from a histogram or density plot than from a probability plot. Why not use both? INDEPENDENCE OF ERRORS As indicated earlier, the best way to assess whether the dependent variable values (and thus the residuals) are independent is from your knowledge of how the data were collected. For example, time series data often display autocorrelation—observations collected closer in time are more correlated with each other than with observations distant in time. The car package provides a function for the Durbin–Watson test to detect such serially correlated errors. You can apply the Durbin–Watson test to the multiple-regression problem with the following code: > durbinWatsonTest(fit) lag Autocorrelation D-W Statistic p-value 1 -0.201 2.32 0.282 Alternative hypothesis: rho != 0 The nonsignificant p-value (p=0.282) suggests a lack of autocorrelation and, conversely, an independence of errors. The lag value (1 in this case) indicates that each observation is being compared with the one next to it in the dataset. Although appropriate for time-dependent data, the test is less applicable for data that isn’t clustered in this fashion. Note that the durbinWatsonTest() function uses bootstrapping (see chapter 12) to derive p-values. Unless you add the option simulate=FALSE, you’ll get a slightly different value each time you run the test. LINEARITY You can look for evidence of nonlinearity in the relationship between the dependent variable and the independent variables by using component plus residual plots (also known as partial residual plots). The plot is produced by the crPlots() function in the car package. You’re looking for any systematic departure from the linear model that you’ve specified. To create a component plus residual plot for variable X, you plot the points εi + (β0 + β1 × X1i + ... + βk × Xki) vs. X i where the residuals are based on the full model, and i =1 … n. The straight line in each graph is given by (β0 + β1 × X1i + ... + βk × Xki) vs. Xi . Loess fit lines are described in chapter 11. The code to produce these plots is as follows: > library(car) > crPlots(fit) The resulting plots are provided in figure 8.11. Nonlinearity in any of these plots suggests that you may not have adequately modeled the functional form of that predictor in the regression. If so, you may need to add curvilinear components such as polynomial terms, transform one or more variables (for example, use log(X) instead of X), or abandon linear regression in favor of some other regression variant. 
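For example, if the Population panel had shown clear curvature, a remedy might look like the following sketch (a hypothetical fix using base R's update() function; the states data show no such problem):

# Hypothetical remedies if crPlots() had revealed nonlinearity in Population
fit_quad <- update(fit, . ~ . + I(Population^2))              # add a quadratic term
fit_log  <- update(fit, . ~ . - Population + log(Population)) # or transform the predictor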
Transformations are discussed later in this chapter. www.it-ebooks.info 191 Regression diagnostics 6 4 2 0 −2 −4 Component+Residual(Murder) 6 4 2 0 −2 −4 −6 Component+Residual(Murder) 8 Component + Residual Plots 0 5000 10000 15000 20000 0.5 1.0 2.0 2.5 4000 5000 −2 0 2 4 6 8 3000 −4 Component+Residual(Murder) 6 4 2 0 −2 −4 Component+Residual(Murder) 1.5 Illiteracy 8 Population 6000 Income 0 50 100 150 Frost Figure 8.11 Component plus residual plots for the regression of murder rate on state characteristics The component plus residual plots confirm that you’ve met the linearity assumption. The form of the linear model seems to be appropriate for this dataset. HOMOSCEDASTICITY The car package also provides two useful functions for identifying non-constant error variance. The ncvTest() function produces a score test of the hypothesis of constant error variance against the alternative that the error variance changes with the level of the fitted values. A significant result suggests heteroscedasticity (nonconstant error variance). The spreadLevelPlot() function creates a scatter plot of the absolute standardized residuals versus the fitted values and superimposes a line of best fit. Both functions are demonstrated in the next listing. www.it-ebooks.info 192 CHAPTER 8 Regression Listing 8.7 Assessing homoscedasticity > library(car) > ncvTest(fit) Non-constant Variance Score Test Variance formula: ~ fitted.values Chisquare=1.7 Df=1 p=0.19 > spreadLevelPlot(fit) Suggested power transformation: 1.2 The score test is nonsignificant (p = 0.19), suggesting that you’ve met the constant variance assumption. You can also see this in the spread-level plot (figure 8.12). The points form a random horizontal band around a horizontal line of best fit. If you’d violated the assumption, you’d expect to see a nonhorizontal line. The suggested power transformation in listing 8.7 is the suggested power p (Y p) that would stabilize the nonconstant error variance. For example, if the plot showed a nonhorizontal trend and the suggested power transformation was 0.5, then using Y rather than Y in the regression equation might lead to a model that satisfied homoscedasticity. If the suggested power was 0, you’d use a log transformation. In the current example, there’s no evidence of heteroscedasticity, and the suggested power is close to 1 (no transformation required). 1.00 0.50 0.20 0.10 0.05 Absolute Studentized Residuals 2.00 Spread−Level Plot for fit 4 6 8 10 12 14 Fitted Values Figure 8.12 Spread-level plot for assessing constant error variance www.it-ebooks.info Regression diagnostics 8.3.3 193 Global validation of linear model assumption Finally, let’s examine the gvlma() function in the gvlma package. Written by Pena and Slate (2006), the gvlma() function performs a global validation of linear model assumptions as well as separate evaluations of skewness, kurtosis, and heteroscedasticity. In other words, it provides a single omnibus (go/no go) test of model assumptions. The following listing applies the test to the states data. Listing 8.8 Global test of linear model assumptions > library(gvlma) > gvmodel <- gvlma(fit) > summary(gvmodel) ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM: Level of Significance= 0.05 Call: gvlma(x=fit) Global Stat Skewness Kurtosis Link Function Heteroscedasticity Value p-value Decision 2.773 0.597 Assumptions acceptable. 1.537 0.215 Assumptions acceptable. 0.638 0.425 Assumptions acceptable. 0.115 0.734 Assumptions acceptable. 
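If the spread-level plot had instead suggested a power well away from 1 (say, 0.5), one way to proceed is simply to refit the model with the transformed response and rerun the score test. A hedged sketch, purely hypothetical for these data:

# Hypothetical: apply a suggested power of 0.5 (a square-root transformation
# of the response) and recheck the constant variance assumption
fit_t <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost,
            data=states)
ncvTest(fit_t)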
0.482 0.487 Assumptions acceptable. You can see from the printout (the Global Stat line) that the data meet all the statistical assumptions that go with the OLS regression model (p = 0.597). If the decision line indicated that the assumptions were violated (say, p < 0.05), you’d have to explore the data using the previous methods discussed in this section to determine which assumptions were the culprit. 8.3.4 Multicollinearity Before leaving this section on regression diagnostics, let’s focus on a problem that’s not directly related to statistical assumptions but is important in allowing you to interpret multiple regression results. Imagine you’re conducting a study of grip strength. Your independent variables include date of birth (DOB) and age. You regress grip strength on DOB and age and find a significant overall F test at p < .001. But when you look at the individual regression coefficients for DOB and age, you find that they’re both nonsignificant (that is, there’s no evidence that either is related to grip strength). What happened? The problem is that DOB and age are perfectly correlated within rounding error. A regression coefficient measures the impact of one predictor variable on the response variable, holding all other predictor variables constant. This amounts to looking at the relationship of grip strength and age, holding age constant. The problem is called multicollinearity. It leads to large confidence intervals for model parameters and makes the interpretation of individual coefficients difficult. www.it-ebooks.info 194 CHAPTER 8 Regression Multicollinearity can be detected using a statistic called the variance inflation factor (VIF). For any predictor variable, the square root of the VIF indicates the degree to which the confidence interval for that variable’s regression parameter is expanded relative to a model with uncorrelated predictors (hence the name). VIF values are provided by the vif() function in the car package. As a general rule, vif > 2 indicates a multicollinearity problem. The code is provided in the following listing. The results indicate that multicollinearity isn’t a problem with these predictor variables. Listing 8.9 Evaluating multicollinearity > library(car) > vif(fit) Population Illiteracy 1.2 2.2 Income 1.3 Frost 2.1 > sqrt(vif(fit)) > 2 # problem? Population Illiteracy FALSE FALSE 8.4 Income FALSE Frost FALSE Unusual observations A comprehensive regression analysis will also include a screening for unusual observations—namely outliers, high-leverage observations, and influential observations. These are data points that warrant further investigation, either because they’re different than other observations in some way, or because they exert a disproportionate amount of influence on the results. Let’s look at each in turn. 8.4.1 Outliers Outliers are observations that aren’t predicted well by the model. They have unusually large positive or negative residuals (Yi _ Yi). Positive residuals indicate that the model is underestimating the response value, whereas negative residuals indicate an overestimation. You’ve already seen one way to identify outliers. Points in the Q-Q plot of figure 8.9 that lie outside the confidence band are considered outliers. A rough rule of thumb is that standardized residuals that are larger than 2 or less than –2 are worth attention. The car package also provides a statistical test for outliers. 
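Before the formal test, the rough rule of thumb above is easy to check directly with base R's rstandard() function; a minimal sketch:

# Flag observations whose standardized residuals exceed 2 in absolute value
z <- rstandard(fit)
z[abs(z) > 2]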
The outlierTest() function reports the Bonferroni adjusted p-value for the largest absolute studentized residual: > library(car) > outlierTest(fit) Nevada rstudent unadjusted p-value Bonferonni p 3.5 0.00095 0.048 Here, you see that Nevada is identified as an outlier (p = 0.048). Note that this function tests the single largest (positive or negative) residual for significance as an outlier. www.it-ebooks.info 195 Unusual observations If it isn’t significant, there are no outliers in the dataset. If it’s significant, you must delete it and rerun the test to see if others are present. High-leverage points Observations that have high leverage are outliers with regard to the other predictors. In other words, they have an unusual combination of predictor values. The response value isn’t involved in determining leverage. Observations with high leverage are identified through the hat statistic. For a given dataset, the average hat value is p/n, where p is the number of parameters estimated in the model (including the intercept) and n is the sample size. Roughly speaking, an observation with a hat value greater than 2 or 3 times the average hat value should be examined. The code that follows plots the hat values: hat.plot <- function(fit) { p <- length(coefficients(fit)) n <- length(fitted(fit)) plot(hatvalues(fit), main="Index Plot of Hat Values") abline(h=c(2,3)*p/n, col="red", lty=2) identify(1:n, hatvalues(fit), names(hatvalues(fit))) } hat.plot(fit) The resulting graph is shown in figure 8.13. Horizontal lines are drawn at 2 and 3 times the average hat value. The locator function places the graph in interactive mode. Clicking points of interest labels them Index Plot of Hat Values 0.4 Alaska 0.3 California New York Washington 0.2 Hawaii 0.1 hatvalues(fit) 8.4.2 Figure 8.13 Index plot of hat values for assessing observations with high leverage 0 10 20 30 40 Index www.it-ebooks.info 50 196 CHAPTER 8 Regression until the user presses Esc, selects Stop from the graph drop-down menu, or rightclicks the graph. Here you see that Alaska and California are particularly unusual when it comes to their predictor values. Alaska has a much higher income than other states, while having a lower population and temperature. California has a much higher population than other states, while having a higher income and higher temperature. These states are atypical compared with the other 48 observations. High-leverage observations may or may not be influential observations. That will depend on whether they’re also outliers. Influential observations Influential observations have a disproportionate impact on the values of the model parameters. Imagine finding that your model changes dramatically with the removal of a single observation. It’s this concern that leads you to examine your data for influential points. There are two methods for identifying influential observations: Cook’s distance (or D statistic) and added variable plots. Roughly speaking, Cook’s D values greater than 4/(n – k – 1), where n is the sample size and k is the number of predictor variables, indicate influential observations. You can create a Cook’s D plot (figure 8.14) with the following code: cutoff <- 4/(nrow(states)-length(fit$coefficients)-2) plot(fit, which=4, cook.levels=cutoff) abline(h=cutoff, lty=2, col="red") The graph identifies Alaska, Hawaii, and Nevada as influential observations. 
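If you'd rather see the numbers than read them off the graph, the same information can be pulled from the model with the cooks.distance() function. The following sketch assumes the fit and cutoff objects defined above:

> cd <- cooks.distance(fit)     # Cook's D for each observation
> sort(cd[cd > cutoff])         # observations exceeding the cutoff

This should return Alaska, Hawaii, and Nevada, matching the plot.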
Deleting these states will have a notable impact on the values of the intercept and slopes in the 0.5 Cook’s distance 0.3 0.2 Nevada Hawaii 0.1 Cook’s distance 0.4 Alaska Figure 8.14 Cook’s D plot for identifying influential observations 0.0 8.4.3 0 10 20 30 40 Obs. number lm(Murder ~ Population + Illiteracy + Income + Frost) www.it-ebooks.info 50 197 Unusual observations regression model. Note that although it’s useful to cast a wide net when searching for influential observations, I tend to find a cutoff of 1 more generally useful than 4/(n – k – 1). Given a criterion of D=1, none of the observations in the dataset would appear to be influential. Cook’s D plots can help identify influential observations, but they don’t provide information about how these observations affect the model. Added-variable plots can help in this regard. For one response variable and k predictor variables, you’d create k added-variable plots as follows. For each predictor Xk, plot the residuals from regressing the response variable on the other k – 1 predictors versus the residuals from regressing Xk on the other k – 1 predictors. Added-variable plots can be created using the avPlots() function in the car package: library(car) avPlots(fit, ask=FALSE, id.method="identify") The resulting graphs are provided in figure 8.15. The graphs are produced one at a time, and users can click points to identify them. Press Esc, choose Stop from the graph’s menu, or right-click to move to the next plot. Here, I’ve identified Alaska in the bottom-left plot. 4 2 −2 0 Murder | others 2 0 −2 −4 −4 Murder | others 4 6 6 8 Added−Variable Plots −5000 0 5000 10000 −1.0 −0.5 0.5 1.0 2 0 −2 Murder | others 4 6 8 6 4 0 2 Alaska −4 −4 −2 Murder | others 0.0 Illiteracy | others 8 Population | others −500 0 500 1500 −100 Income | others Figure 8.15 −50 0 50 Frost | others Added-variable plots for assessing the impact of influential observations www.it-ebooks.info 198 CHAPTER 8 Regression Influence Plot 2 Alaska Figure 8.16 Influence plot. States above +2 or below –2 on the vertical axis are considered outliers. States above 0.2 or 0.3 on the horizontal axis have high leverage (unusual combinations of predictor values). Circle size is proportional to influence. Observations depicted by large circles may have disproportionate influence on the parameter estimates of the model. 1 0 Studentized Residuals 3 Nevada Washington California −1 New York −2 Hawaii Rhode Island 0.1 0.2 0.3 0.4 Hat−Values Circle size is proportial to Cook’s Distance The straight line in each plot is the actual regression coefficient for that predictor variable. You can see the impact of influential observations by imagining how the line would change if the point representing that observation was deleted. For example, look at the graph of Murder | Others versus Income | Others in the lower-left corner. You can see that eliminating the point labeled Alaska would move the line in a negative direction. In fact, deleting Alaska changes the regression coefficient for Income from positive (.00006) to negative (–.00085). 
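You can confirm this kind of statement by refitting the model without the observation in question. Here's a sketch using the states data frame from earlier; the comments reflect the sign change described above:

> states.noAK <- states[rownames(states) != "Alaska", ]      # drop Alaska
> fit.noAK <- lm(Murder ~ Population + Illiteracy + Income + Frost,
                 data=states.noAK)
> coef(fit)["Income"]        # Income coefficient with Alaska included (positive)
> coef(fit.noAK)["Income"]   # Income coefficient with Alaska removed (negative)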
You can combine the information from outlier, leverage, and influence plots into one highly informative plot using the influencePlot() function from the car package: library(car) influencePlot(fit, id.method="identify", main="Influence Plot", sub="Circle size is proportional to Cook's distance") The resulting plot (figure 8.16) shows that Nevada and Rhode Island are outliers; New York, California, Hawaii, and Washington have high leverage; and Nevada, Alaska, and Hawaii are influential observations. 8.5 Corrective measures Having spent the last 20 pages learning about regression diagnostics, you may ask, “What do you do if you identify problems?” There are four approaches to dealing with violations of regression assumptions: ■ ■ Deleting observations Transforming variables www.it-ebooks.info 199 Corrective measures ■ ■ Adding or deleting variables Using another regression approach Let’s look at each in turn. 8.5.1 Deleting observations Deleting outliers can often improve a dataset’s fit to the normality assumption. Influential observations are often deleted as well, because they have an inordinate impact on the results. The largest outlier or influential observation is deleted, and the model is refit. If there are still outliers or influential observations, the process is repeated until an acceptable fit is obtained. Again, I urge caution when considering the deletion of observations. Sometimes you can determine that the observation is an outlier because of data errors in recording, or because a protocol wasn’t followed, or because a test subject misunderstood instructions. In these cases, deleting the offending observation seems perfectly reasonable. In other cases, the unusual observation may be the most interesting thing about the data you’ve collected. Uncovering why an observation differs from the rest can contribute great insight to the topic at hand and to other topics you might not have thought of. Some of our greatest advances have come from the serendipity of noticing that something doesn’t fit our preconceptions (pardon the hyperbole). 8.5.2 Transforming variables When models don’t meet the normality, linearity, or homoscedasticity assumptions, transforming one or more variables can often improve or correct the situation. Transformations typically involve replacing a variable Y with Y λ. Common values of λ and their interpretations are given in table 8.5. If Y is a proportion, a logit transformation [ln (Y/1-Y)] is often used. Table 8.5 Common transformations λ Transformation -2 -1 1/Y2 1/Y -0.5 1/ Y 0 log(Y) 0.5 Y 1 None 2 Y2 When the model violates the normality assumption, you typically attempt a transformation of the response variable. You can use the powerTransform() function in the car package to generate a maximum-likelihood estimation of the power λ most likely to normalize the variable X λ. In the next listing, this is applied to the states data. Listing 8.10 Box–Cox transformation to normality > library(car) > summary(powerTransform(states$Murder)) bcPower Transformation to Normality www.it-ebooks.info 200 CHAPTER 8 states$Murder Regression Est.Power Std.Err. Wald Lower Bound Wald Upper Bound 0.6 0.26 0.088 1.1 Likelihood ratio tests about transformation parameters LRT df pval LR test, lambda=(0) 5.7 1 0.017 LR test, lambda=(1) 2.1 1 0.145 The results suggest that you can normalize the variable Murder by replacing it with Murder0.6. Because 0.6 is close to 0.5, you could try a square-root transformation to improve the model’s fit to normality. 
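If you wanted to pursue the square-root transformation, the refit is a one-liner. This sketch assumes the states data frame and uses the qqPlot() function from the car package to re-check normality of the transformed model:

> fit.sqrt <- lm(sqrt(Murder) ~ Population + Illiteracy + Income + Frost,
                 data=states)
> qqPlot(fit.sqrt, simulate=TRUE)   # reassess normality after the transformation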
But in this case, the hypothesis that λ=1 can’t be rejected (p = 0.145), so there’s no strong evidence that a transformation is needed in this case. This is consistent with the results of the Q-Q plot in figure 8.9. When the assumption of linearity is violated, a transformation of the predictor variables can often help. The boxTidwell() function in the car package can be used to generate maximum-likelihood estimates of predictor powers that can improve linearity. An example of applying the Box–Tidwell transformations to a model that predicts state murder rates from their population and illiteracy rates follows: > library(car) > boxTidwell(Murder~Population+Illiteracy,data=states) Population Illiteracy Score Statistic p-value MLE of lambda -0.32 0.75 0.87 0.62 0.54 1.36 The results suggest trying the transformations Population.87 and Population1.36 to achieve greater linearity. But the score tests for Population (p = .75) and Illiteracy (p = .54) suggest that neither variable needs to be transformed. Again, these results are consistent with the component plus residual plots in figure 8.11. Finally, transformations of the response variable can help in situations of heteroscedasticity (nonconstant error variance). You saw in listing 8.7 that the spreadLevelPlot() function in the car package offers a power transformation for improving homoscedasticity. Again, in the case of the states example, the constant error-variance assumption is met, and no transformation is necessary. A caution concerning transformations There’s an old joke in statistics: if you can’t prove A, prove B and pretend it was A. (For statisticians, that’s pretty funny.) The relevance here is that if you transform your variables, your interpretations must be based on the transformed variables, not the original variables. If the transformation makes sense, such as the log of income or the inverse of distance, the interpretation is easier. But how do you interpret the relationship between the frequency of suicidal ideation and the cube root of depression? If a transformation doesn’t make sense, you should avoid it. www.it-ebooks.info Selecting the “best” regression model 8.5.3 201 Adding or deleting variables Changing the variables in a model will impact the fit of the model. Sometimes, adding an important variable will correct many of the problems that we’ve discussed. Deleting a troublesome variable can do the same thing. Deleting variables is a particularly important approach for dealing with multicollinearity. If your only goal is to make predictions, then multicollinearity isn’t a problem. But if you want to make interpretations about individual predictor variables, then you must deal with it. The most common approach is to delete one of the variables involved in the multicollinearity (that is, one of the variables with a vif > 2). An alternative is to use ridge regression, a variant of multiple regression designed to deal with multicollinearity situations. 8.5.4 Trying a different approach As you’ve just seen, one approach to dealing with multicollinearity is to fit a different type of model (ridge regression in this case). If there are outliers and/or influential observations, you can fit a robust regression model rather than an OLS regression. If you’ve violated the normality assumption, you can fit a nonparametric regression model. If there’s significant nonlinearity, you can try a nonlinear regression model. 
If you’ve violated the assumptions of independence of errors, you can fit a model that specifically takes the error structure into account, such as time-series models or multilevel regression models. Finally, you can turn to generalized linear models to fit a wide range of models in situations where the assumptions of OLS regression don’t hold. We’ll discuss some of these alternative approaches in chapter 13. The decision regarding when to try to improve the fit of an OLS regression model and when to try a different approach is a complex one. It’s typically based on knowledge of the subject matter and an assessment of which approach will provide the best result. Speaking of best results, let’s turn now to the problem of deciding which predictor variables to include in a regression model. 8.6 Selecting the “best” regression model When developing a regression equation, you’re implicitly faced with a selection of many possible models. Should you include all the variables under study, or drop ones that don’t make a significant contribution to prediction? Should you add polynomial and/or interaction terms to improve the fit? The selection of a final regression model always involves a compromise between predictive accuracy (a model that fits the data as well as possible) and parsimony (a simple and replicable model). All things being equal, if you have two models with approximately equal predictive accuracy, you favor the simpler one. This section describes methods for choosing among competing models. The word “best” is in quotation marks because there’s no single criterion you can use to make the decision. The final decision requires judgment on the part of the investigator. (Think of it as job security.) www.it-ebooks.info 202 8.6.1 CHAPTER 8 Regression Comparing models You can compare the fit of two nested models using the anova() function in the base installation. A nested model is one whose terms are completely included in the other model. In the states multiple-regression model, you found that the regression coefficients for Income and Frost were nonsignificant. You can test whether a model without these two variables predicts as well as one that includes them (see the following listing). Listing 8.11 Comparing nested models using the anova() function > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) > fit2 <- lm(Murder ~ Population + Illiteracy, data=states) > anova(fit2, fit1) Analysis of Variance Table Model 1: Model 2: Res.Df 1 47 2 45 Murder ~ Murder ~ RSS 289.246 289.167 Population + Illiteracy Population + Illiteracy + Income + Frost Df Sum of Sq F Pr(>F) 2 0.079 0.0061 0.994 Here, model 1 is nested within model 2. The anova() function provides a simultaneous test that Income and Frost add to linear prediction above and beyond Population and Illiteracy. Because the test is nonsignificant (p = .994), you conclude that they don’t add to the linear prediction and you’re justified in dropping them from your model. The Akaike Information Criterion (AIC) provides another method for comparing models. The index takes into account a model’s statistical fit and the number of parameters needed to achieve this fit. Models with smaller AIC values—indicating adequate fit with fewer parameters—are preferred. The criterion is provided by the AIC() function (see the following listing). 
Listing 8.12 Comparing models with the AIC > fit1 <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) > fit2 <- lm(Murder ~ Population + Illiteracy, data=states) > AIC(fit1,fit2) fit1 fit2 df AIC 6 241.6429 4 237.6565 The AIC values suggest that the model without Income and Frost is the better model. Note that although the ANOVA approach requires nested models, the AIC approach doesn’t. Comparing two models is relatively straightforward, but what do you do when there are 4, or 10, or 100 possible models to consider? That’s the topic of the next section. www.it-ebooks.info Selecting the “best” regression model 8.6.2 203 Variable selection Two popular approaches to selecting a final set of predictor variables from a larger pool of candidate variables are stepwise methods and all-subsets regression. STEPWISE REGRESSION In stepwise selection, variables are added to or deleted from a model one at a time, until some stopping criterion is reached. For example, in forward stepwise regression, you add predictor variables to the model one at a time, stopping when the addition of variables would no longer improve the model. In backward stepwise regression, you start with a model that includes all predictor variables, and then you delete them one at a time until removing variables would degrade the quality of the model. In stepwise stepwise regression (usually called stepwise to avoid sounding silly), you combine the forward and backward stepwise approaches. Variables are entered one at a time, but at each step, the variables in the model are reevaluated, and those that don’t contribute to the model are deleted. A predictor variable may be added to, and deleted from, a model several times before a final solution is reached. The implementation of stepwise regression methods varies by the criteria used to enter or remove variables. The stepAIC() function in the MASS package performs stepwise model selection (forward, backward, or stepwise) using an exact AIC criterion. The next listing applies backward stepwise regression to the multiple regression problem. Listing 8.13 Backward stepwise selection > library(MASS) > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) > stepAIC(fit, direction="backward") Start: AIC=97.75 Murder ~ Population + Illiteracy + Income + Frost - Frost - Income <none> - Population - Illiteracy Df Sum of Sq RSS AIC 1 0.02 289.19 95.75 1 0.06 289.22 95.76 289.17 97.75 1 39.24 328.41 102.11 1 144.26 433.43 115.99 Step: AIC=95.75 Murder ~ Population + Illiteracy + Income - Income <none> - Population - Illiteracy Df Sum of Sq RSS AIC 1 0.06 289.25 93.76 289.19 95.75 1 43.66 332.85 100.78 1 236.20 525.38 123.61 www.it-ebooks.info 204 CHAPTER 8 Regression Step: AIC=93.76 Murder ~ Population + Illiteracy Df Sum of Sq <none> - Population - Illiteracy 1 1 RSS AIC 289.25 93.76 48.52 337.76 99.52 299.65 588.89 127.31 Call: lm(formula=Murder ~ Population + Illiteracy, data=states) Coefficients: (Intercept) Population 1.6515497 0.0002242 Illiteracy 4.0807366 You start with all four predictors in the model. For each step, the AIC column provides the model AIC resulting from the deletion of the variable listed in that row. The AIC value for <none> is the model AIC if no variables are removed. In the first step, Frost is removed, decreasing the AIC from 97.75 to 95.75. In the second step, Income is removed, decreasing the AIC to 93.76. 
Deleting any more variables would increase the AIC, so the process stops. Stepwise regression is controversial. Although it may find a good model, there’s no guarantee that it will find the “best” model. This is because not every possible model is evaluated. An approach that attempts to overcome this limitation is all subsets regression. ALL SUBSETS REGRESSION In all subsets regression, every possible model is inspected. The analyst can choose to have all possible results displayed or ask for the nbest models of each subset size (one predictor, two predictors, and so on). For example, if nbest=2, the two best onepredictor models are displayed, followed by the two best two-predictor models, followed by the two best three-predictor models, up to a model with all predictors. All subsets regression is performed using the regsubsets() function from the leaps package. You can choose the R-squared, Adjusted R-squared, or Mallows Cp statistic as your criterion for reporting “best” models. As you’ve seen, R-squared is the amount of variance accounted for in the response variable by the predictors variables. Adjusted R-squared is similar but takes into account the number of parameters in the model. R-squared always increases with the addition of predictors. When the number of predictors is large compared to the sample size, this can lead to significant overfitting. The Adjusted R-squared is an attempt to provide a more honest estimate of the population R-squared—one that’s less likely to take advantage of chance variation in the data. The Mallows Cp statistic is also used as a stopping rule in stepwise regression. It has been widely suggested that a good model is one in which the Cp statistic is close to the number of model parameters (including the intercept). In listing 8.14, we’ll apply all subsets regression to the states data. The results can be plotted with either the plot() function in the leaps package or the subsets() function in the car package. An example of the former is provided in figure 8.17, and an example of the latter is given in figure 8.18. www.it-ebooks.info 205 Selecting the “best” regression model 0.55 0.54 0.54 0.53 0.48 adjr2 0.48 0.48 0.48 0.31 0.29 0.28 0.1 Figure 8.17 Frost Income Illiteracy Population (Intercept) 0.033 Best four models for each subset size based on Adjusted R-square Listing 8.14 All subsets regression library(leaps) states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) leaps <-regsubsets(Murder ~ Population + Illiteracy + Income + Frost, data=states, nbest=4) plot(leaps, scale="adjr2") library(car) subsets(leaps, statistic="cp", main="Cp Plot for All Subsets Regression") abline(1,1,lty=2,col="red") Figure 8.17 can be confusing to read. Looking at the first row (starting at the bottom), you can see that a model with the intercept and Income has an adjusted R-square of 0.33. A model with the intercept and Population has an adjusted R-square of 0.1. Jumping to the 12th row, a model with the intercept, Population, Illiteracy, and Income has an adjusted R-square of 0.54, whereas one with the intercept, Population, and Illiteracy alone has an adjusted R-square of 0.55. Here you see that a model with fewer predictors has a larger adjusted R-square (something that can’t happen with an unadjusted R-square). The graph suggests that the two-predictor model (Population and Illiteracy) is the best. 
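If figure 8.17 is hard to read, you can extract the same values from the regsubsets object and inspect them numerically. The sketch below assumes the leaps object created in listing 8.14:

> ss <- summary(leaps)
> ss$which    # logical matrix showing which predictors are in each model
> ss$adjr2    # adjusted R-squared for each model
> ss$cp       # Mallows Cp for each model (used in figure 8.18)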
www.it-ebooks.info 206 CHAPTER 8 Regression Cp Plot for All Subsets Regression 50 In P: Population Il: Illiteracy In: Income F: Frost 30 F P−F P−In−F 10 20 Statistic: cp 40 P Il−In−F Il−In Il−F Il 0 1.0 1.5 2.0 P−Il−In−F P−Il−F P−Il−In P−Il 2.5 3.0 3.5 Figure 8.18 Best four models for each subset size based on the Mallows Cp statistic 4.0 Subset Size Figure 8.18 shows the best four models for each subset size based on the Mallows Cp statistic. Better models will fall close to a line with intercept 1 and slope 1. The plot suggests that you consider a two-predictor model with Population and Illiteracy; a three-predictor model with Population, Illiteracy, and Frost, or Population, Illiteracy, and Income (they overlap on the graph and are hard to read); or a four-predictor model with Population, Illiteracy, Income, and Frost. You can reject the other possible models. In most instances, all subsets regression is preferable to stepwise regression, because more models are considered. But when the number of predictors is large, the procedure can require significant computing time. In general, automated variableselection methods should be seen as an aid rather than a directing force in model selection. A well-fitting model that doesn’t make sense doesn’t help you. Ultimately, it’s your knowledge of the subject matter that should guide you. 8.7 Taking the analysis further We’ll end our discussion of regression by considering methods for assessing model generalizability and predictor relative importance. 8.7.1 Cross-validation In the previous section, we examined methods for selecting the variables to include in a regression equation. When description is your primary goal, the selection and interpretation of a regression model signals the end of your labor. But when your goal is prediction, you can justifiably ask, “How well will this equation perform in the real world?” www.it-ebooks.info Taking the analysis further 207 By definition, regression techniques obtain model parameters that are optimal for a given set of data. In OLS regression, the model parameters are selected to minimize the sum of squared errors of prediction (residuals) and, conversely, maximize the amount of variance accounted for in the response variable (R-squared). Because the equation has been optimized for the given set of data, it won’t perform as well with a new set of data. We began this chapter with an example involving a research physiologist who wanted to predict the number of calories an individual will burn from the duration and intensity of their exercise, age, gender, and BMI. If you fit an OLS regression equation to this data, you’ll obtain model parameters that uniquely maximize the Rsquared for this particular set of observations. But our researcher wants to use this equation to predict the calories burned by individuals in general, not only those in the original study. You know that the equation won’t perform as well with a new sample of observations, but how much will you lose? Cross-validation is a useful method for evaluating the generalizability of a regression equation. In cross-validation, a portion of the data is selected as the training sample, and a portion is selected as the hold-out sample. A regression equation is developed on the training sample and then applied to the hold-out sample. Because the hold-out sample wasn’t involved in the selection of the model parameters, the performance on this sample is a more accurate estimate of the operating characteristics of the model with new data. 
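The idea behind a single training/hold-out split can be sketched in a few lines of base R. The 70/30 split and the random seed below are arbitrary choices used only for illustration, applied to the states data from this chapter:

> set.seed(1234)                                   # for reproducibility
> n <- nrow(states)
> train <- sample(1:n, size=round(0.7 * n))        # 70% of rows for training
> fit.train <- lm(Murder ~ Population + Illiteracy + Income + Frost,
                  data=states[train, ])
> pred <- predict(fit.train, newdata=states[-train, ])   # predict hold-out rows
> cor(pred, states$Murder[-train])^2               # hold-out R-squared

The hold-out R-squared will typically be smaller than the R-squared obtained on the training rows.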
In k-fold cross-validation, the sample is divided into k subsamples. Each of the k subsamples serves as a hold-out group, and the combined observations from the remaining k – 1 subsamples serve as the training group. The performance for the k prediction equations applied to the k hold-out samples is recorded and then averaged. (When k equals n, the total number of observations, this approach is called jackknifing.) You can perform k-fold cross-validation using the crossval() function in the bootstrap package. The following listing provides a function (called shrinkage()) for cross-validating a model’s R-square statistic using k-fold cross-validation. Listing 8.15 Function for k-fold cross-validated R-square shrinkage <- function(fit, k=10){ require(bootstrap) theta.fit <- function(x,y){lsfit(x,y)} theta.predict <- function(fit,x){cbind(1,x)%*%fit$coef} x <- fit$model[,2:ncol(fit$model)] y <- fit$model[,1] results <- crossval(x, y, theta.fit, theta.predict, ngroup=k) r2 <- cor(y, fit$fitted.values)^2 r2cv <- cor(y, results$cv.fit)^2 cat("Original R-square =", r2, "\n") cat(k, "Fold Cross-Validated R-square =", r2cv, "\n") cat("Change =", r2-r2cv, "\n") } www.it-ebooks.info 208 CHAPTER 8 Regression Using this listing, you define your functions, create a matrix of predictor and predicted values, get the raw R-squared, and get the cross-validated R-squared. (Chapter 12 covers bootstrapping in detail.) The shrinkage() function is then used to perform a 10-fold cross-validation with the states data, using a model with all four predictor variables: > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > fit <- lm(Murder ~ Population + Income + Illiteracy + Frost, data=states) > shrinkage(fit) Original R-square=0.567 10 Fold Cross-Validated R-square=0.4481 Change=0.1188 You can see that the R-square based on the sample (0.567) is overly optimistic. A better estimate of the amount of variance in murder rates that this model will account for with new data is the cross-validated R-square (0.448). (Note that observations are assigned to the k groups randomly, so you’ll get a slightly different result each time you execute the shrinkage() function.) You could use cross-validation in variable selection by choosing a model that demonstrates better generalizability. For example, a model with two predictors (Population and Illiteracy) shows less R-square shrinkage (.03 versus .12) than the full model: > fit2 <- lm(Murder ~ Population + Illiteracy,data=states) > shrinkage(fit2) Original R-square=0.5668327 10 Fold Cross-Validated R-square=0.5346871 Change=0.03214554 This may make the two-predictor model a more attractive alternative. All other things being equal, a regression equation that’s based on a larger training sample and one that’s more representative of the population of interest will cross-validate better. You’ll get less R-squared shrinkage and make more accurate predictions. 8.7.2 Relative importance Up to this point in the chapter, we’ve been asking, “Which variables are useful for predicting the outcome?” But often your real interest is in the question, “Which variables are most important in predicting the outcome?” You implicitly want to rank-order the predictors in terms of relative importance. There may be practical grounds for asking the second question. For example, if you could rank-order leadership practices by their relative importance for organizational success, you could help managers focus on the behaviors they most need to develop. 
If predictor variables were uncorrelated, this would be a simple task. You would rank-order the predictor variables by their correlation with the response variable. In most cases, though, the predictors are correlated with each other, and this complicates the task significantly. www.it-ebooks.info 209 Taking the analysis further There have been many attempts to develop a means for assessing the relative importance of predictors. The simplest has been to compare standardized regression coefficients. Standardized regression coefficients describe the expected change in the response variable (expressed in standard deviation units) for a standard deviation change in a predictor variable, holding the other predictor variables constant. You can obtain the standardized regression coefficients in R by standardizing each of the variables in your dataset to a mean of 0 and standard deviation of 1 using the scale() function, before submitting the dataset to a regression analysis. (Note that because the scale() function returns a matrix and the lm() function requires a data frame, you convert between the two in an intermediate step.) The code and results for the multiple regression problem are shown here: > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > zstates <- as.data.frame(scale(states)) > zfit <- lm(Murder~Population + Income + Illiteracy + Frost, data=zstates) > coef(zfit) (Intercept) -9.406e-17 Population 2.705e-01 Income 1.072e-02 Illiteracy 6.840e-01 Frost 8.185e-03 Here you see that a one-standard-deviation increase in illiteracy rate yields a 0.68 standard deviation increase in murder rate, when controlling for population, income, and temperature. Using standardized regression coefficients as your guide, Illiteracy is the most important predictor and Frost is the least. There have been many other attempts at quantifying relative importance. Relative importance can be thought of as the contribution each predictor makes to R-square, both alone and in combination with other predictors. Several possible approaches to relative importance are captured in the relaimpo package written by Ulrike Grömping (http://mng.bz/KDYF). A new method called relative weights shows significant promise. The method closely approximates the average increase in R-square obtained by adding a predictor variable across all possible submodels (Johnson, 2004; Johnson and Lebreton, 2004; LeBreton and Tonidandel, 2008). A function for generating relative weights is provided in the next listing. Listing 8.16 relweights() for calculating relative importance of predictors relweights <- function(fit,...){ R <- cor(fit$model) nvar <- ncol(R) rxx <- R[2:nvar, 2:nvar] rxy <- R[2:nvar, 1] svd <- eigen(rxx) evec <- svd$vectors ev <- svd$values delta <- diag(sqrt(ev)) lambda <- evec %*% delta %*% t(evec) www.it-ebooks.info 210 CHAPTER 8 Regression lambdasq <- lambda ^ 2 beta <- solve(lambda) %*% rxy rsquare <- colSums(beta ^ 2) rawwgt <- lambdasq %*% beta ^ 2 import <- (rawwgt / rsquare) * 100 import <- as.data.frame(import) row.names(import) <- names(fit$model[2:nvar]) names(import) <- "Weights" import <- import[order(import),1, drop=FALSE] dotchart(import$Weights, labels=row.names(import), xlab="% of R-Square", pch=19, main="Relative Importance of Predictor Variables", sub=paste("Total R-Square=", round(rsquare, digits=3)), ...) return(import) } NOTE The code in listing 8.16 is adapted from an SPSS program generously provided by Dr. Johnson. 
See Johnson (2000, Multivariate Behavioral Research, 35, 1–19) for an explanation of how the relative weights are derived. In listing 8.17, the relweights() function is applied to the states data with murder rate predicted by the population, illiteracy, income, and temperature. You can see from figure 8.19 that the total amount of variance accounted for by the model (R-square=0.567) has been divided among the predictor variables. Illiteracy accounts for 59% of the R-square, Frost accounts for 20.79%, and so forth. Based on Relative Importance of Predictor Variables Illiteracy Frost Figure 8.19 Dot chart of relative weights for the states multiple regression problem. Larger weights indicate relatively more important predictors. For example, Illiteracy accounts for 59% of the total explained variance (0.567), whereas Income only accounts for 5.49%. Thus Illiteracy has greater relative importance than Income in this model. Population Income 10 20 30 40 50 % of R−Square Total R−Square= 0.567 www.it-ebooks.info 60 Summary 211 the method of relative weights, Illiteracy has the greatest relative importance, followed by Frost, Population, and Income, in that order. Listing 8.17 Applying the relweights() function > states <- as.data.frame(state.x77[,c("Murder", "Population", "Illiteracy", "Income", "Frost")]) > fit <- lm(Murder ~ Population + Illiteracy + Income + Frost, data=states) > relweights(fit, col="blue") Weights Income 5.49 Population 14.72 Frost 20.79 Illiteracy 59.00 Relative-importance measures (and, in particular, the method of relative weights) have wide applicability. They come much closer to our intuitive conception of relative importance than standardized regression coefficients do, and I expect to see their use increase dramatically in coming years. 8.8 Summary Regression analysis is a term that covers a broad range of methodologies in statistics. You’ve seen that it’s a highly interactive approach that involves fitting models, assessing their fit to statistical assumptions, modifying both the data and the models, and refitting to arrive at a final result. In many ways, this final result is based on art and skill as much as science. This has been a long chapter, because regression analysis is a process with many parts. We’ve discussed fitting OLS regression models, using regression diagnostics to assess the data’s fit to statistical assumptions, and methods for modifying the data to meet these assumptions more closely. We looked at ways of selecting a final regression model from many possible models, and you learned how to evaluate its likely performance on new samples of data. Finally, we tackled the thorny problem of variable importance: identifying which variables are the most important for predicting an outcome. In each of the examples in this chapter, the predictor variables have been quantitative. However, there are no restrictions against using categorical variables as predictors as well. Using a categorical predictor such as gender, treatment type, or manufacturing process allows you to examine group differences on a response or outcome variable. This is the focus of our next chapter. www.it-ebooks.info Analysis of variance This chapter covers ■ Using R to model basic experimental designs ■ Fitting and interpreting ANOVA type models ■ Evaluating model assumptions In chapter 7, we looked at regression models for predicting a quantitative response variable from quantitative predictor variables. 
But there’s no reason that we couldn’t have included nominal or ordinal factors as predictors as well. When factors are included as explanatory variables, our focus usually shifts from prediction to understanding group differences, and the methodology is referred to as analysis of variance (ANOVA). ANOVA methodology is used to analyze a wide variety of experimental and quasi-experimental designs. This chapter provides an overview of R functions for analyzing common research designs. First we’ll look at design terminology, followed by a general discussion of R’s approach to fitting ANOVA models. Then we’ll explore several examples that illustrate the analysis of common designs. Along the way, you’ll treat anxiety disorders, lower blood cholesterol levels, help pregnant mice have fat babies, assure that pigs grow long in the tooth, facilitate breathing in plants, and learn which grocery shelves to avoid. 212 www.it-ebooks.info A crash course on terminology 213 In addition to the base installation, you’ll be using the car, gplots, HH, rrcov, multicomp, effects, MASS, and mvoutlier packages in the examples. Be sure to install them before trying out the sample code. 9.1 A crash course on terminology Experimental design in general, and analysis of variance in particular, has its own language. Before discussing the analysis of these designs, we’ll quickly review some important terms. We’ll use a series of increasingly complex study designs to introduce the most significant concepts. Say you’re interested in studying the treat- Table 9.1 One-way between-groups ANOVA ment of anxiety. Two popular therapies for Treatment anxiety are cognitive behavior therapy (CBT) and eye movement desensitization and reproCBT EMDR cessing (EMDR). You recruit 10 anxious indis1 s6 viduals and randomly assign half of them to s2 s7 receive five weeks of CBT and half to receive s3 s8 five weeks of EMDR. At the conclusion of therapy, each patient is asked to complete the s4 s9 State-Trait Anxiety Inventory (STAI), a selfs5 s10 report measure of anxiety. The design is outlined in table 9.1. In this design, Treatment is a between-groups factor with two levels (CBT, EMDR). It’s called a between-groups factor because patients are assigned to one and only one group. No patient receives both CBT and EMDR. The s characters represent the subjects (patients). STAI is the dependent variable, and Treatment is the independent variable. Because there is an equal number of observations in each treatment condition, you have a balanced design. When the sample sizes Table 9.2 One-way within-groups ANOVA are unequal across the cells of a design, you have an unbalanced design. Time Patient The statistical design in table 9.1 is called a 5 weeks 6 months one-way ANOVA because there’s a single classifis1 cation variable. Specifically, it’s a one-way between-groups ANOVA. Effects in ANOVA s2 designs are primarily evaluated through F s3 tests. If the F test for Treatment is significant, s4 you can conclude that the mean STAI scores s5 for two therapies differed after five weeks of s6 treatment. If you were interested in the effect of CBT on s7 anxiety over time, you could place all 10 s8 patients in the CBT group and assess them at s9 the conclusion of therapy and again six months s10 later. This design is displayed in table 9.2. www.it-ebooks.info 214 CHAPTER 9 Analysis of variance Time is a within-groups factor with two levels (five weeks, six months). It’s called a within-groups factor because each patient is measured under both levels. 
The statistical design is a one-way within-groups ANOVA. Because each subject is measured more than once, the design is also called a repeated measures ANOVA. If the F test for Time is significant, you can conclude that patients’ mean STAI scores changed between five weeks and six months. If you were interested in both treatment differences and change over time, you could combine the first two study designs and randomly assign five patients to CBT and five patients to EMDR, and assess their STAI results at the end of therapy (five weeks) and at six months (see table 9.3). By including both Therapy and Time as factors, you’re able to examine the impact of Therapy (averaged across time), Time (averaged across therapy type), and the interaction of Therapy and Time. The first two are called the main effects, whereas the interaction is (not surprisingly) called an interaction effect. When you cross two or more factors, as is done here, you have a factorial ANOVA design. Crossing two factors produces a two-way ANOVA, crossing three factors produces a three-way ANOVA, and so forth. When a factorial design includes both between-groups and within-groups factors, it’s also called a mixed-model ANOVA. The current design is a two-way mixed-model factorial ANOVA (phew!). In this case, you’ll have three F tests: one for Therapy, one for Time, and one for the Therapy × Time interaction. A significant result for Therapy indicates that CBT and EMDR differ in their impact on anxiety. A significant result for Time indicates that Table 9.3 Two-way factorial ANOVA with one between-groups and one within-groups factor Time Patient 5 weeks s1 s2 CBT s3 s4 s5 Therapy s6 s7 EMDR s8 s9 s10 www.it-ebooks.info 6 months Fitting ANOVA models 215 anxiety changed from week five to the six-month follow-up. A significant Therapy × Time interaction indicates that the two treatments for anxiety had a differential impact over time (that is, the change in anxiety from five weeks to six months was different for the two treatments). Now let’s extend the design a bit. It’s known that depression can have an impact on therapy, and that depression and anxiety often co-occur. Even though subjects were randomly assigned to treatment conditions, it’s possible that the two therapy groups differed in patient depression levels at the initiation of the study. Any posttherapy differences might then be due to the preexisting depression differences and not to your experimental manipulation. Because depression could also explain the group differences on the dependent variable, it’s a confounding factor. And because you’re not interested in depression, it’s called a nuisance variable. If you recorded depression levels using a self-report depression measure such as the Beck Depression Inventory (BDI) when patients were recruited, you could statistically adjust for any treatment group differences in depression before assessing the impact of therapy type. In this case, BDI would be called a covariate, and the design would be called an analysis of covariance (ANCOVA). Finally, you’ve recorded a single dependent variable in this study (the STAI). You could increase the validity of this study by including additional measures of anxiety (such as family ratings, therapist ratings, and a measure assessing the impact of anxiety on their daily functioning). When there’s more than one dependent variable, the design is called a multivariate analysis of variance (MANOVA). 
If there are covariates present, it’s called a multivariate analysis of covariance (MANCOVA). Now that you have the basic terminology under your belt, you’re ready to amaze your friends, dazzle new acquaintances, and learn how to fit ANOVA/ANCOVA/ MANOVA models with R. 9.2 Fitting ANOVA models Although ANOVA and regression methodologies developed separately, functionally they’re both special cases of the general linear model. You could analyze ANOVA models using the same lm() function used for regression in chapter 7. But you’ll primarily use the aov() function in this chapter. The results of lm() and aov() are equivalent, but the aov() function presents these results in a format that’s more familiar to ANOVA methodologists. For completeness, I’ll provide an example using lm() at the end of this chapter. 9.2.1 The aov() function The syntax of the aov() function is aov(formula, data=dataframe). Table 9.4 describes special symbols that can be used in the formulas. In this table, y is the dependent variable and the letters A, B, and C represent factors. www.it-ebooks.info 216 CHAPTER 9 Table 9.4 Analysis of variance Special symbols used in R formulas Symbol ~ Usage Separates response variables on the left from the explanatory variables on the right. For example, a prediction of y from A, B, and C would be coded y ~ A + B + C : Denotes an interaction between variables. A prediction of y from A, B, and the interaction between A and B would be coded y ~ A + B + A:B * Denotes the complete crossing variables. The code y ~ A*B*C expands to y ~ A + B + C + A:B + A:C + B:C + A:B:C ^ Denotes crossing to a specified degree. The code y ~ (A+B+C)^2 expands to y ~ A + B + C + A:B + A:C + A:B . Denotes all remaining variables. The code y ~ . expands to y ~ A + B + C Table 9.5 provides formulas for several common research designs. In this table, lowercase letters are quantitative variables, uppercase letters are grouping factors, and Subject is a unique identifier variable for subjects. Table 9.5 Formulas for common research designs Design Formula One-way ANOVA y ~ A One-way ANCOVA with 1 covariate y ~ x + A Two-way factorial ANOVA y ~ A * B Two-way factorial ANCOVA with 2 covariates y ~ x1 + x2 + A * B Randomized block y ~ B + A (where B is a blocking factor) One-way within-groups ANOVA y ~ A + Error(Subject/A) Repeated measures ANOVA with 1 within-groups factor (W) and 1 between-groups factor (B) y ~ B * W + Error(Subject/W) We’ll explore in-depth examples of several of these designs later in this chapter. 9.2.2 The order of formula terms The order in which the effects appear in a formula matters when (a) there’s more than one factor and the design is unbalanced, or (b) covariates are present. When either of these two conditions is present, the variables on the right side of the equation will be correlated with each other. In this case, there’s no unambiguous way to divide up their impact on the dependent variable. For example, in a two-way ANOVA www.it-ebooks.info Fitting ANOVA models 217 with unequal numbers of observations in the treatment combinations, the model y ~ A*B will not produce the same results as the model y ~ B*A. By default, R employs the Type I (sequential) approach to calculating ANOVA effects (see the sidebar “Order counts!”). The first model can be written as y ~ A + B + A:B. The resulting R ANOVA table will assess ■ ■ ■ The impact of A on y The impact of B on y, controlling for A The interaction of A and B, controlling for the A and B main effects Order counts! 
When independent variables are correlated with each other or with covariates, there’s no unambiguous method for assessing the independent contributions of these variables to the dependent variable. Consider an unbalanced two-way factorial design with factors A and B and dependent variable y. There are three effects in this design: the A and B main effects and the A × B interaction. Assuming that you’re modeling the data using the formula Y ~ A + B + A:B there are three typical approaches for partitioning the variance in y among the effects on the right side of this equation. TYPE I (SEQUENTIAL) Effects are adjusted for those that appear earlier in the formula. A is unadjusted. B is adjusted for the A. The A:B interaction is adjusted for A and B. TYPE II (HIERARCHICAL) Effects are adjusted for other effects at the same or lower level. A is adjusted for B. B is adjusted for A. The A:B interaction is adjusted for both A and B. TYPE III (MARGINAL) Each effect is adjusted for every other effect in the model. A is adjusted for B and A:B. B is adjusted for A and A:B. The A:B interaction is adjusted for A and B. R employs the Type I approach by default. Other programs such as SAS and SPSS employ the Type III approach by default. The greater the imbalance in sample sizes, the greater the impact that the order of the terms will have on the results. In general, more fundamental effects should be listed earlier in the formula. In particular, covariates should be listed first, followed by main effects, followed by two-way interactions, followed by three-way interactions, and so on. For main effects, more fundamental variables should be listed first. Thus gender would be listed before treatment. Here’s the bottom line: when the research design isn’t orthogonal (that is, when the factors and/or covariates are correlated), be careful when specifying the order of effects. www.it-ebooks.info 218 CHAPTER 9 Analysis of variance Before moving on to specific examples, note that the Anova() function in the car package (not to be confused with the standard anova() function) provides the option of using the Type II or Type III approach, rather than the Type I approach used by the aov() function. You may want to use the Anova() function if you’re concerned about matching your results to those provided by other packages such as SAS and SPSS. See help(Anova, package="car") for details. 9.3 One-way ANOVA In a one-way ANOVA, you’re interested in comparing the dependent variable means of two or more groups defined by a categorical grouping factor. This example comes from the cholesterol dataset in the multcomp package, taken from Westfall, Tobia, Rom, & Hochberg (1999). Fifty patients received one of five cholesterol-reducing drug regimens (trt). Three of the treatment conditions involved the same drug administered as 20 mg once per day (1time), 10mg twice per day (2times), or 5 mg four times per day (4times). The two remaining conditions (drugD and drugE) represented competing drugs. Which drug regimen produced the greatest cholesterol reduction (response)? The analysis is provided in the following listing. 
Listing 9.1 One-way ANOVA > library(multcomp) > attach(cholesterol) > table(trt) trt 1time 2times 4times drugD 10 10 10 10 b Group sample sizes drugE 10 c > aggregate(response, by=list(trt), FUN=mean) Group.1 x 1 1time 5.78 2 2times 9.22 3 4times 12.37 4 drugD 15.36 5 drugE 20.95 d Group means Group standard deviations > aggregate(response, by=list(trt), FUN=sd) Group.1 x 1 1time 2.88 2 2times 3.48 3 4times 2.92 4 drugD 3.45 5 drugE 3.35 > fit <- aov(response ~ trt) > summary(fit) Df Sum Sq Mean Sq trt 4 1351 338 Residuals 45 469 10 --- e F value 32.4 www.it-ebooks.info Pr(>F) 9.8e-13 *** Tests for group differences (ANOVA) 219 One-way ANOVA Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > library(gplots) > plotmeans(response ~ trt, xlab="Treatment", ylab="Response", main="Mean Plot\nwith 95% CI") > detach(cholesterol) f Plots group means and confidence intervals 15 5 10 Response 20 Looking at the output, you can Mean Plot with 95% CI see that 10 patients received each of the drug regimens b. From the means, it appears that drugE produced the greatest cholesterol reduction, whereas 1time produced the least c. Standard deviations were relatively constant across the five groups, ranging from 2.88 to 3.48 d. The ANOVA F test for treatment (trt) is significant (p < .0001), providing evidence that the five treatments aren’t all equally effective e. n=10 n=10 n=10 n=10 n=10 The plotmeans() function in 1time 2times 4times drugD drugE the gplots package can be used Treatment to produce a graph of group Figure 9.1 Treatment group means with 95% confidence intervals for five cholesterol-reducing drug regimens means and their confidence intervals f. A plot of the treatment means, with 95% confidence limits, is provided in figure 9.1 and allows you to clearly see these treatment differences. 9.3.1 Multiple comparisons The ANOVA F test for treatment tells you that the five drug regimens aren’t equally effective, but it doesn’t tell you which treatments differ from one another. You can use a multiple comparison procedure to answer this question. For example, the TukeyHSD() function provides a test of all pairwise differences between group means, as shown next. Listing 9.2 Tukey HSD pairwise group comparisons > TukeyHSD(fit) Tukey multiple comparisons of means 95% family-wise confidence level Fit: aov(formula = response ~ trt) $trt www.it-ebooks.info 220 CHAPTER 9 Analysis of variance diff lwr upr p adj 2times-1time 3.44 -0.658 7.54 0.138 4times-1time 6.59 2.492 10.69 0.000 drugD-1time 9.58 5.478 13.68 0.000 drugE-1time 15.17 11.064 19.27 0.000 4times-2times 3.15 -0.951 7.25 0.205 drugD-2times 6.14 2.035 10.24 0.001 drugE-2times 11.72 7.621 15.82 0.000 drugD-4times 2.99 -1.115 7.09 0.251 drugE-4times 8.57 4.471 12.67 0.000 drugE-drugD 5.59 1.485 9.69 0.003 > par(las=2) > par(mar=c(5,8,4,2)) > plot(TukeyHSD(fit)) For example, the mean cholesterol reductions for 1time and 2times aren’t significantly different from each other (p = 0.138), whereas the difference between 1time and 4times is significantly different (p < .001). The pairwise comparisons are plotted in figure 9.2. The first par statement rotates the axis labels, and the second one increases the left margin area so that the labels fit (par options are covered in chapter 3). In this graph, confidence intervals that include 0 indicate treatments that aren’t significantly different (p > 0.5). 
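If you'd like the significant pairs as a table rather than reading them from the plot, you can subset the TukeyHSD() results directly. A short sketch, using the fit object from listing 9.1:

> tuk <- TukeyHSD(fit)$trt                              # matrix of pairwise comparisons
> as.data.frame(tuk[tuk[, "p adj"] < .05, , drop=FALSE]) # keep only the significant pairs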
Figure 9.2 Plot of Tukey HSD pairwise mean comparisons (differences in mean levels of trt with 95% family-wise confidence intervals)

The glht() function in the multcomp package provides a much more comprehensive set of methods for multiple mean comparisons that you can use for both linear models (such as those described in this chapter) and generalized linear models (covered in chapter 13). The following code reproduces the Tukey HSD test, along with a different graphical representation of the results (figure 9.3):

> library(multcomp)
> par(mar=c(5,4,6,2))
> tuk <- glht(fit, linfct=mcp(trt="Tukey"))
> plot(cld(tuk, level=.05), col="lightgrey")

Figure 9.3 Tukey HSD tests provided by the multcomp package (box plots of response by trt, with letter groupings)

In this code, the par statement increases the top margin to fit the letter array. The level option in the cld() function provides the significance level to use (0.05, or 95% confidence in this case). Groups (represented by box plots) that have the same letter don't have significantly different means. You can see that 1time and 2times aren't significantly different (they both have the letter a) and that 2times and 4times aren't significantly different (they both have the letter b); but that 1time and 4times are different (they don't share a letter). Personally, I find figure 9.3 easier to read than figure 9.2. It also has the advantage of providing information on the distribution of scores within each group.

From these results, you can see that taking the cholesterol-lowering drug in 5 mg doses four times a day was better than taking a 20 mg dose once per day. The competitor drugD wasn't superior to this four-times-per-day regimen. But competitor drugE was superior to both drugD and all three dosage strategies for the focus drug.

Multiple comparisons methodology is a complex and rapidly changing area of study. To learn more, see Bretz, Hothorn, and Westfall (2010).

9.3.2 Assessing test assumptions
As you saw in the previous chapter, confidence in results depends on the degree to which your data satisfies the assumptions underlying the statistical tests. In a one-way ANOVA, the dependent variable is assumed to be normally distributed and have equal variance in each group. You can use a Q-Q plot to assess the normality assumption:

> library(car)
> qqPlot(lm(response ~ trt, data=cholesterol),
         simulate=TRUE, main="Q-Q Plot", labels=FALSE)

Note that the qqPlot() function requires an lm() fit. The graph is provided in figure 9.4. The data falls within the 95% confidence envelope, suggesting that the normality assumption has been met fairly well.

Figure 9.4 Test of normality (studentized residuals versus t quantiles)

R provides several tests for the equality (homogeneity) of variances. For example, you can perform Bartlett's test with this code:

> bartlett.test(response ~ trt, data=cholesterol)

        Bartlett test of homogeneity of variances

data:  response by trt
Bartlett's K-squared = 0.5797, df = 4, p-value = 0.9653

Bartlett's test indicates that the variances in the five groups don't differ significantly (p = 0.97).
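Levene's test, provided by the leveneTest() function in the car package, is another commonly used check and can be run the same way (output not shown):

> library(car)
> leveneTest(response ~ trt, data=cholesterol)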
Other possible tests include the Fligner–Killeen test (provided by the fligner.test() function) and the Brown–Forsythe test (provided by the hov() function in the HH package). Although not shown, the other two tests reach the same conclusion. Finally, analysis of variance methodologies can be sensitive to the presence of outliers. You can test for outliers using the outlierTest() function in the car package: > library(car) > outlierTest(fit) No Studentized residuals with Bonferonni p < 0.05 Largest |rstudent|: rstudent unadjusted p-value Bonferonni p 19 2.251149 0.029422 NA From the output, you can see that there’s no indication of outliers in the cholesterol data (NA occurs when p > 1). Taking the Q-Q plot, Bartlett’s test, and outlier test together, the data appear to fit the ANOVA model quite well. This, in turn, adds to your confidence in the results. 9.4 One-way ANCOVA A one-way analysis of covariance (ANCOVA) extends the one-way ANOVA to include one or more quantitative covariates. This example comes from the litter dataset in the multcomp package (see Westfall et al., 1999). Pregnant mice were divided into four treatment groups; each group received a different dose of a drug (0, 5, 50, or 500). The mean post-birth weight for each litter was the dependent variable, and gestation time was included as a covariate. The analysis is given in the following listing. Listing 9.3 One-way ANCOVA > data(litter, package="multcomp") > attach(litter) > table(dose) dose 0 5 50 500 20 19 18 17 > aggregate(weight, by=list(dose), FUN=mean) Group.1 x 1 0 32.3 2 5 29.3 3 50 29.9 4 500 29.6 > fit <- aov(weight ~ gesttime + dose) > summary(fit) Df Sum Sq Mean Sq F value Pr(>F) gesttime 1 134.30 134.30 8.0493 0.005971 ** dose 3 137.12 45.71 2.7394 0.049883 * Residuals 69 1151.27 16.69 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 www.it-ebooks.info 224 CHAPTER 9 Analysis of variance From the table() function, you can see that there is an unequal number of litters at each dosage level, with 20 litters at zero dosage (no drug) and 17 litters at dosage 500. Based on the group means provided by the aggregate() function, the no-drug group had the highest mean litter weight (32.3). The ANCOVA F tests indicate that (a) gestation time was related to birth weight, and (b) drug dosage was related to birth weight after controlling for gestation time. The mean birth weight isn’t the same for each of the drug dosages, after controlling for gestation time. Because you’re using a covariate, you may want to obtain adjusted group means— that is, the group means obtained after partialing out the effects of the covariate. You can use the effect() function in the effects library to calculate adjusted means: > library(effects) > effect("dose", fit) dose effect dose 0 5 50 500 32.4 28.9 30.6 29.3 In this case, the adjusted means are similar to the unadjusted means produced by the aggregate() function, but this won’t always be the case. The effects package provides a powerful method of obtaining adjusted means for complex research designs and presenting them visually. See the package documentation on CRAN for more details. As with the one-way ANOVA example in the last section, the F test for dose indicates that the treatments don’t have the same mean birth weight, but it doesn’t tell you which means differ from one another. Again you can use the multiple comparison procedures provided by the multcomp package to compute all pairwise mean comparisons. 
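For example, here is a sketch of the Tukey-style pairwise comparisons among the dose groups, adjusted for gesttime (output omitted; the pattern is the same glht() call used in section 9.3.1):

> library(multcomp)
> summary(glht(fit, linfct=mcp(dose="Tukey")))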
Additionally, the multcomp package can be used to test specific user-defined hypotheses about the means. Suppose you’re interested in whether the no-drug condition differs from the threedrug condition. The code in the following listing can be used to test this hypothesis. Listing 9.4 Multiple comparisons employing user-supplied contrasts > library(multcomp) > contrast <- rbind("no drug vs. drug" = c(3, -1, -1, -1)) > summary(glht(fit, linfct=mcp(dose=contrast))) Multiple Comparisons of Means: User-defined Contrasts Fit: aov(formula = weight ~ gesttime + dose) Linear Hypotheses: Estimate Std. Error t value Pr(>|t|) no drug vs. drug == 0 8.284 3.209 2.581 0.0120 * --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 www.it-ebooks.info One-way ANCOVA 225 The contrast c(3, -1, -1, -1) specifies a comparison of the first group with the average of the other three. The hypothesis is tested with a t statistic (2.581 in this case), which is significant at the p < .05 level. Therefore, you can conclude that the no-drug group has a higher birth weight than drug conditions. Other contrasts can be added to the rbind() function (see help(glht) for details). 9.4.1 Assessing test assumptions ANCOVA designs make the same normality and homogeneity of variance assumptions described for ANOVA designs, and you can test these assumptions using the same procedures described in section 9.3.2. In addition, standard ANCOVA designs assume homogeneity of regression slopes. In this case, it’s assumed that the regression slope for predicting birth weight from gestation time is the same in each of the four treatment groups. A test for the homogeneity of regression slopes can be obtained by including a gestation × dose interaction term in your ANCOVA model. A significant interaction would imply that the relationship between gestation and birth weight depends on the level of the dose variable. The code and results are provided in the following listing. Listing 9.5 Testing for homogeneity of regression slopes > library(multcomp) > fit2 <- aov(weight ~ gesttime*dose, data=litter) > summary(fit2) Df Sum Sq Mean Sq F value Pr(>F) gesttime 1 134 134 8.29 0.0054 ** dose 3 137 46 2.82 0.0456 * gesttime:dose 3 82 27 1.68 0.1789 Residuals 66 1069 16 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 The interaction is nonsignificant, supporting the assumption of equality of slopes. If the assumption is untenable, you could try transforming the covariate or dependent variable, using a model that accounts for separate slopes, or employing a nonparametric ANCOVA method that doesn’t require homogeneity of regression slopes. See the sm.ancova() function in the sm package for an example of the latter. 9.4.2 Visualizing the results The ancova() function in the HH package provides a plot of the relationship between the dependent variable, the covariate, and the factor. For example, > library(HH) > ancova(weight ~ gesttime + dose, data=litter) produces the plot shown in figure 9.5. (The figure has been modified to display better in black and white and will look slightly different when you run the code yourself.) 
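If you're mainly interested in the covariate-adjusted group means rather than the individual regression lines, the effects package introduced earlier can also plot them (a quick sketch; output not shown):

> library(effects)
> plot(effect("dose", fit))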
www.it-ebooks.info 226 CHAPTER 9 Analysis of variance weight ~ gesttime + dose 21.5 0 22.5 21.5 5 50 22.5 500 superpose weight 35 30 0 5 50 500 dose 25 Figure 9.5 Plot of the relationship between gestation time and birth weight for each of four drug treatment groups 20 21.5 22.5 21.5 22.5 21.5 22.5 gesttime Here you can see that the regression lines for predicting birth weight from gestation time are parallel in each group but have different intercepts. As gestation time increases, birth weight increases. Additionally, you can see that the zero-dose group has the largest intercept and the five-dose group has the lowest intercept. The lines are parallel because they’ve been specified to be. If you used the statement ancova(weight ~ gesttime*dose) instead, you’d generate a plot that allows both the slopes and intercepts to vary by group. This approach is useful for visualizing the case where the homogeneity of regression slopes doesn’t hold. 9.5 Two-way factorial ANOVA In a two-way factorial ANOVA, subjects are assigned to groups that are formed from the cross-classification of two factors. This example uses the ToothGrowth dataset in the base installation to demonstrate a two-way between-groups ANOVA. Sixty guinea pigs are randomly assigned to receive one of three levels of ascorbic acid (0.5, 1, or 2 mg) and one of two delivery methods (orange juice or Vitamin C), under the restriction that each treatment combination has 10 guinea pigs. The dependent variable is tooth length. The following listing shows the code for the analysis. www.it-ebooks.info Two-way factorial ANOVA 227 Listing 9.6 Two-way ANOVA > attach(ToothGrowth) > table(supp, dose) dose supp 0.5 1 2 OJ 10 10 10 VC 10 10 10 > aggregate(len, by=list(supp, dose), FUN=mean) Group.1 Group.2 x 1 OJ 0.5 13.23 2 VC 0.5 7.98 3 OJ 1.0 22.70 4 VC 1.0 16.77 5 OJ 2.0 26.06 6 VC 2.0 26.14 > aggregate(len, by=list(supp, dose), FUN=sd) Group.1 Group.2 x 1 OJ 0.5 4.46 2 VC 0.5 2.75 3 OJ 1.0 3.91 4 VC 1.0 2.52 5 OJ 2.0 2.66 6 VC 2.0 4.80 > dose <- factor(dose) > fit <- aov(len ~ supp*dose) > summary(fit) Df Sum Sq Mean Sq F value Pr(>F) supp 1 205 205 15.57 0.00023 *** dose 2 2426 1213 92.00 < 2e-16 *** supp:dose 2 108 54 4.11 0.02186 * Residuals 54 712 13 --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > detach(ToothGrowth) The table statement indicates that you have a balanced design (equal sample sizes in each cell of the design), and the aggregate statements provide the cell means and standard deviations. The dose variable is converted to a factor so that the aov() function will treat it as a grouping variable, rather than a numeric covariate. The ANOVA table provided by the summary() function indicates that both main effects (supp and dose) and the interaction between these factors are significant. You can visualize the results in several ways. You can use the interaction.plot() function to display the interaction in a two-way ANOVA. The code is interaction.plot(dose, supp, len, type="b", col=c("red","blue"), pch=c(16, 18), main = "Interaction between Dose and Supplement Type") www.it-ebooks.info CHAPTER 9 Analysis of variance and the resulting plot is presented in figure 9.6. The plot provides the mean tooth length for each supplement at each dosage. Interaction between Dose and Supplement Type 25 supp mean of len 10 Figure 9.6 Interaction between dose and delivery mechanism on tooth growth. The plot of means was created using the interaction.plot() function. 
15 20 VC OJ 0.5 2 1 dose With a little finesse, you can get an interaction plot out of the plotmeans() function in the gplots package. The following code produces the graph in figure 9.7: library(gplots) plotmeans(len ~ interaction(supp, dose, sep=" "), connect=list(c(1,3,5),c(2,4,6)), col=c("red", "darkgreen"), main = "Interaction Plot with 95% CIs", xlab="Treatment and Dose Combination") The graph includes the means, as well as error bars (95% confidence intervals) and sample sizes. 15 len 20 25 30 Interaction Plot with 95% CIs Figure 9.7 Interaction between dose and delivery mechanism on tooth growth. The mean plot with 95% confidence intervals was created by the plotmeans() function. 10 228 n=10 n=10 n=10 n=10 n=10 n=10 OJ 0.5 VC 0.5 OJ 1 VC 1 OJ 2 VC 2 Treatment and Dose Combination www.it-ebooks.info 229 Repeated measures ANOVA len: main effects and 2−way interactions len ~ supp | dose len ~ dose | dose 30 25 dose 0.5 1 2 20 len 15 response.var 10 5 len ~ supp | supp len ~ dose | supp 30 25 supp 20 OJ VC 15 10 5 OJ VC 0.5 supp 1 len Figure 9.8 Main effects and two-way interaction for the ToothGrowth dataset. This plot was created by the interaction2way() function. 2 dose x.values Finally, you can use the interaction2wt() function in the HH package to produce a plot of both main effects and two-way interactions for any factorial design of any order (figure 9.8): library(HH) interaction2wt(len~supp*dose) Again, this figure has been modified to display more clearly in black and white and will look slightly different when you run the code yourself. All three graphs indicate that tooth growth increases with the dose of ascorbic acid for both orange juice and Vitamin C. For the 0.5 and 1 mg doses, orange juice produced more tooth growth than Vitamin C. For 2 mg of ascorbic acid, both delivery methods produced identical growth. Of the three plotting methods provided, I prefer the interaction2wt() function in the HH package. It displays both the main effects (the box plots) and the two-way interactions for designs of any complexity (two-way ANOVA, three-way ANOVA, and so on). Although I don’t cover the tests of model assumptions and mean comparison procedures, they’re a natural extension of the methods you’ve seen so far. Additionally, the design is balanced, so you don’t have to worry about the order of effects. 9.6 Repeated measures ANOVA In repeated measures ANOVA, subjects are measured more than once. This section focuses on a repeated measures ANOVA with one within-groups and one www.it-ebooks.info 230 CHAPTER 9 Analysis of variance between-groups factor (a common design). We’ll take our example from the field of physiological ecology. Physiological ecologists study how the physiological and biochemical processes of living systems respond to variations in environmental factors (a crucial area of study given the realities of global warming). The CO2 dataset included in the base installation contains the results of a study of cold tolerance in Northern and Southern plants of the grass species Echinochloa crus-galli (Potvin, Lechowicz, & Tardif, 1990). The photosynthetic rates of chilled plants were compared with the photosynthetic rates of nonchilled plants at several ambient CO 2 concentrations. Half the plants were from Quebec, and half were from Mississippi. In this example, we’ll focus on chilled plants. 
The dependent variable is carbon dioxide uptake (uptake) in ml/L, and the independent variables are Type (Quebec versus Mississippi) and ambient CO 2 concentration (conc) with seven levels (ranging from 95 to 1000 umol/m^2 sec). Type is a between-groups factor, and conc is a withingroups factor. Type is already stored as a factor, but you’ll need to convert conc to a factor before continuing. The analysis is presented in the next listing. Listing 9.7 > > > > Repeated measures ANOVA with one between- and within-groups factor CO2$conc <- factor(CO2$conc) w1b1 <- subset(CO2, Treatment=='chilled') fit <- aov(uptake ~ conc*Type + Error(Plant/(conc)), w1b1) summary(fit) Error: Plant Df Sum Sq Mean Sq F value Pr(>F) Type 1 2667 2667 60.4 0.0015 ** Residuals 4 177 44 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Error: Plant:conc Df Sum Sq Mean Sq F value Pr(>F) conc 6 1472 245.4 52.5 1.3e-12 *** conc:Type 6 429 71.5 15.3 3.7e-07 *** Residuals 24 112 4.7 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > par(las=2) > par(mar=c(10,4,4,2)) > with(w1b1, interaction.plot(conc,Type,uptake, type="b", col=c("red","blue"), pch=c(16,18), main="Interaction Plot for Plant Type and Concentration")) > boxplot(uptake ~ Type*conc, data=w1b1, col=(c("gold", "green")), main="Chilled Quebec and Mississippi Plants", ylab="Carbon dioxide uptake rate (umol/m^2 sec)") The ANOVA table indicates that the Type and concentration main effects and the Type × concentration interaction are all significant at the 0.01 level. The interaction is plotted via the interaction.plot() function in figure 9.9. www.it-ebooks.info 231 Repeated measures ANOVA Interaction Plot for Plant Type and Concentration 40 Type Quebec Mississippi mean of uptake 35 30 25 20 Figure 9.9 Interaction of ambient CO2 concentration and plant type on CO2 uptake. Graph produced by the interaction.plot() function. 15 1000 675 500 350 250 175 95 10 conc In order to demonstrate a different presentation of the interaction, the boxplot() function is used to plot the same data. The results are provided in figure 9.10. Carbon dioxide uptake rate (umol/m^2 sec) Chilled Quebec and Mississippi Plants 40 35 30 25 20 15 Mississippi.1000 Quebec.1000 Mississippi.675 Quebec.675 Mississippi.500 Quebec.500 Mississippi.350 Quebec.350 Mississippi.250 Quebec.250 Quebec.175 Mississippi.175 Mississippi.95 Quebec.95 10 Figure 9.10 Interaction of ambient CO2 concentration and plant type on CO2 uptake. Graph produced by the boxplot() function. www.it-ebooks.info 232 CHAPTER 9 Analysis of variance From either graph, you can see that there’s a greater carbon dioxide uptake in plants from Quebec compared to Mississippi. The difference is more pronounced at higher ambient CO2 concentrations. NOTE Datasets are typically in wide format, where columns are variables and rows are observations, and there’s a single row for each subject. The litter data frame from section 9.4 is a good example. When dealing with repeated measures designs, you typically need the data in long format before fitting models. In long format, each measurement of the dependent variable is placed in its own row. The CO2 dataset follows this form. Luckily, the reshape package described in chapter 5 (section 5.6.3) can easily reorganize your data into the required format. The many approaches to mixed-model designs The CO2 example in this section was analyzed using a traditional repeated measures ANOVA. 
The approach assumes that the covariance matrix for any within-groups factor follows a specified form known as sphericity. Specifically, it assumes that the variances of the differences between any two levels of the within-groups factor are equal. In real-world data, it’s unlikely that this assumption will be met. This has led to a number of alternative approaches, including the following: ■ ■ ■ ■ Using the lmer() function in the lme4 package to fit linear mixed models (Bates, 2005) Using the Anova() function in the car package to adjust traditional test statistics to account for lack of sphericity (for example, the Geisser–Greenhouse correction) Using the gls() function in the nlme package to fit generalized least squares models with specified variance-covariance structures (UCLA, 2009) Using multivariate analysis of variance to model repeated measured data (Hand, 1987) Coverage of these approaches is beyond the scope of this text. If you’re interested in learning more, check out Pinheiro and Bates (2000) and Zuur et al. (2009). Up to this point, all the methods in this chapter have assumed that there’s a single dependent variable. In the next section, we’ll briefly consider designs that include more than one outcome variable. 9.7 Multivariate analysis of variance (MANOVA) If there’s more than one dependent (outcome) variable, you can test them simultaneously using a multivariate analysis of variance (MANOVA). The following example is based on the UScereal dataset in the MASS package. The dataset comes from Venables & Ripley (1999). In this example, you’re interested in whether the calories, fat, and sugar content of US cereals vary by store shelf, where 1 is the bottom shelf, 2 is the middle shelf, and 3 is the top shelf. Calories, fat, and sugars are the dependent www.it-ebooks.info 233 Multivariate analysis of variance (MANOVA) variables, and shelf is the independent variable, with three levels (1, 2, and 3). The analysis is presented in the following listing. Listing 9.8 One-way MANOVA > > > > > library(MASS) attach(UScereal) shelf <- factor(shelf) y <- cbind(calories, fat, sugars) aggregate(y, by=list(shelf), FUN=mean) 1 2 3 Group.1 calories fat sugars 1 119 0.662 6.3 2 130 1.341 12.5 3 180 1.945 10.9 > cov(y) calories fat sugars calories fat sugars 3895.2 60.67 180.38 60.7 2.71 4.00 180.4 4.00 34.05 > fit <- manova(y ~ shelf) > summary(fit) Df Pillai approx F num Df den Df Pr(>F) shelf 2 0.402 5.12 6 122 1e-04 *** Residuals 62 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > summary.aov(fit) b Prints univariate results Response calories : Df Sum Sq Mean Sq F value Pr(>F) shelf 2 50435 25218 7.86 0.00091 *** Residuals 62 198860 3207 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Response fat : Df Sum Sq Mean Sq F value Pr(>F) shelf 2 18.4 9.22 3.68 0.031 * Residuals 62 155.2 2.50 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Response sugars : Df Sum Sq Mean Sq F value Pr(>F) shelf 2 381 191 6.58 0.0026 ** Residuals 62 1798 29 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 www.it-ebooks.info 234 CHAPTER 9 Analysis of variance First, the shelf variable is converted to a factor so that it can represent a grouping variable in the analyses. Next, the cbind() function is used to form a matrix of the three dependent variables (calories, fat, and sugars). The aggregate() function provides the shelf means, and the cov() function provides the variance and the covariances across cereals. 
The manova() function provides the multivariate test of group differences. The significant F value indicates that the three groups differ on the set of nutritional measures. Note that the shelf variable was converted to a factor so that it can represent a grouping variable. Because the multivariate test is significant, you can use the summary.aov() function to obtain the univariate one-way ANOVAs b. Here, you see that the three groups differ on each nutritional measure considered separately. Finally, you can use a mean comparison procedure (such as TukeyHSD) to determine which shelves differ from each other for each of the three dependent variables (omitted here to save space). 9.7.1 Assessing test assumptions The two assumptions underlying a one-way MANOVA are multivariate normality and homogeneity of variance-covariance matrices. The first assumption states that the vector of dependent variables jointly follows a multivariate normal distribution. You can use a Q-Q plot to assess this assumption (see the sidebar “A theory interlude” for a statistical explanation of how this works). A theory interlude If you have p × 1 multivariate normal random vector x with mean µ and covariance matrix Σ, then the squared Mahalanobis distance between x and µ is chi-square distributed with p degrees of freedom. The Q-Q plot graphs the quantiles of the chi-square distribution for the sample against the Mahalanobis D-squared values. To the degree that the points fall along a line with slope 1 and intercept 0, there’s evidence that the data is multivariate normal. The code is provided in the following listing, and the resulting graph is displayed in figure 9.11. Listing 9.9 Assessing multivariate normality > > > > > > center <- colMeans(y) n <- nrow(y) p <- ncol(y) cov <- cov(y) d <- mahalanobis(y,center,cov) coord <- qqplot(qchisq(ppoints(n),df=p), d, main="Q-Q Plot Assessing Multivariate Normality", ylab="Mahalanobis D2") > abline(a=0,b=1) > identify(coord$x, coord$y, labels=row.names(UScereal)) www.it-ebooks.info Multivariate analysis of variance (MANOVA) 235 Q−Q Plot Assessing Multivariate Normality 20 Wheaties 10 Mahalanobis D2 30 40 Wheaties Honey Gold 0 Figure 9.11 A Q-Q plot for assessing multivariate normality 0 4 2 6 8 10 12 qchisq(ppoints(n), df = p) If the data follow a multivariate normal distribution, then points will fall on the line. The identify() function allows you to interactively identify points in the graph. (The identify() function is covered in section 16.4.) Here, the dataset appears to violate multivariate normality, primarily due to the observations for Wheaties Honey Gold and Wheaties. You may want to delete these two cases and rerun the analyses. The homogeneity of variance-covariance matrices assumption requires that the covariance matrix for each group is equal. The assumption is usually evaluated with a Box’s M test. R doesn’t include a function for Box’s M, but an internet search will provide the appropriate code. Unfortunately, the test is sensitive to violations of normality, leading to rejection in most typical cases. This means that we don’t yet have a good working method for evaluating this important assumption (but see Anderson [2006] and Silva et al. [2008] for interesting alternative approaches not yet available in R). Finally, you can test for multivariate outliers using the aq.plot() function in the mvoutlier package. The code in this case looks like this: library(mvoutlier) outliers <- aq.plot(y) outliers Try it, and see what you get! 
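If you'd like a quick, non-interactive list of the most extreme cases instead, you can sort the Mahalanobis distances computed in listing 9.9 (a small sketch that reuses the d vector created there):

> names(d) <- row.names(UScereal)            # label the distances with the cereal names
> round(sort(d, decreasing=TRUE)[1:3], 1)    # the three largest D2 values

The cases with the largest distances are the prime candidates for multivariate outliers.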
9.7.2 Robust MANOVA If the assumptions of multivariate normality or homogeneity of variance-covariance matrices are untenable, or if you’re concerned about multivariate outliers, you may want to consider using a robust or nonparametric version of the MANOVA test instead. A robust version of the one-way MANOVA is provided by the Wilks.test() function in www.it-ebooks.info 236 CHAPTER 9 Analysis of variance the rrcov package. The adonis() function in the vegan package can provide the equivalent of a nonparametric MANOVA. The following listing applies Wilks.test() to the example. Listing 9.10 Robust one-way MANOVA library(rrcov) > Wilks.test(y,shelf,method="mcd") Robust One-way MANOVA (Bartlett Chi2) data: x Wilks' Lambda = 0.511, Chi2-Value = 23.96, DF = 4.98, p-value = 0.0002167 sample estimates: calories fat sugars 1 120 0.701 5.66 2 128 1.185 12.54 3 161 1.652 10.35 From the results, you can see that using a robust test that’s insensitive to both outliers and violations of MANOVA assumptions still indicates that the cereals on the top, middle, and bottom store shelves differ in their nutritional profiles. 9.8 ANOVA as regression In section 9.2, we noted that ANOVA and regression are both special cases of the same general linear model. As such, the designs in this chapter could have been analyzed using the lm() function. But in order to understand the output, you need to understand how R deals with categorical variables when fitting models. Consider the one-way ANOVA problem in section 9.3, which compares the impact of five cholesterol-reducing drug regimens (trt): > library(multcomp) > levels(cholesterol$trt) [1] "1time" "2times" "4times" "drugD" "drugE" First, let’s fit the model using the aov() function: > fit.aov <- aov(response ~ trt, data=cholesterol) > summary(fit.aov) trt Residuals Df 4 45 Sum Sq 1351.37 468.75 Mean Sq 337.84 10.42 F value 32.433 Pr(>F) 9.819e-13 *** Now, let’s fit the same model using lm(). In this case, you get the results shown in the next listing. Listing 9.11 A regression approach to the ANOVA problem in section 9.3 > fit.lm <- lm(response ~ trt, data=cholesterol) > summary(fit.lm) www.it-ebooks.info 237 ANOVA as regression Coefficients: Estimate Std. Error t value (Intercept) 5.782 1.021 5.665 trt2times 3.443 1.443 2.385 trt4times 6.593 1.443 4.568 trtdrugD 9.579 1.443 6.637 trtdrugE 15.166 1.443 10.507 Pr(>|t|) 9.78e-07 0.0213 3.82e-05 3.53e-08 1.08e-13 *** * *** *** *** Residual standard error: 3.227 on 45 degrees of freedom Multiple R-squared: 0.7425, Adjusted R-squared: 0.7196 F-statistic: 32.43 on 4 and 45 DF, p-value: 9.819e-13 What are you looking at? Because linear models require numeric predictors, when the lm() function encounters a factor, it replaces that factor with a set of numeric variables representing contrasts among the levels. If the factor has k levels, k – 1 contrast variables are created. R provides five built-in methods for creating these contrast variables (see table 9.6). You can also create your own (we won’t cover that here). By default, treatment contrasts are used for unordered factors, and orthogonal polynomials are used for ordered factors. Table 9.6 Built-in contrasts Contrast Description contr.helmert Contrasts the second level with the first, the third level with the average of the first two, the fourth level with the average of the first three, and so on. contr.poly Contrasts are used for trend analysis (linear, quadratic, cubic, and so on) based on orthogonal polynomials. Use for ordered factors with equally spaced levels. 
contr.sum Contrasts are constrained to sum to zero. Also called deviation contrasts, they compare the mean of each level to the overall mean across levels. contr.treatment Contrasts each level with the baseline level (first level by default). Also called dummy coding. contr.SAS Similar to contr.treatment, but the baseline level is the last level. This produces coefficients similar to contrasts used in most SAS procedures. With treatment contrasts, the first level of the factor becomes the reference group, and each subsequent level is compared with it. You can see the coding scheme via the contrasts() function: > contrasts(cholesterol$trt) 2times 4times drugD 1time 0 0 0 2times 1 0 0 4times 0 1 0 drugD 0 0 1 drugE 0 0 0 drugE 0 0 0 0 1 If a patient is in the drugD condition, then the variable drugD equals 1, and the variables 2times, 4times, and drugE each equal zero. You don’t need a variable for the first www.it-ebooks.info 238 CHAPTER 9 Analysis of variance group, because a zero on each of the four indicator variables uniquely determines that the patient is in the 1times condition. In listing 9.11, the variable trt2times represents a contrast between the levels 1time and 2time. Similarly, trt4times is a contrast between 1time and 4times, and so on. You can see from the probability values in the output that each drug condition is significantly different from the first (1time). You can change the default contrasts used in lm() by specifying a contrasts option. For example, you can specify Helmert contrasts by using fit.lm <- lm(response ~ trt, data=cholesterol, contrasts="contr.helmert") You can change the default contrasts used during an R session via the options() function. For example, options(contrasts = c("contr.SAS", "contr.helmert")) would set the default contrast for unordered factors to contr.SAS and for ordered factors to contr.helmert. Although we’ve limited our discussion to the use of contrasts in linear models, note that they’re applicable to other modeling functions in R. This includes the generalized linear models covered in chapter 13. 9.9 Summary In this chapter, we reviewed the analysis of basic experimental and quasi-experimental designs using ANOVA/ANCOVA/MANOVA methodology. We reviewed the basic terminology used and looked at examples of between- and within-groups designs, including the one-way ANOVA, one-way ANCOVA, two-way factorial ANOVA, repeated measures ANOVA, and one-way MANOVA. In addition to the basic analyses, we reviewed methods of assessing model assumptions and applying multiple comparison procedures following significant omnibus tests. Finally, we explored a wide variety of methods for displaying the results visually. If you’re interested in learning more about the design of experiments (DOE) using R, be sure to see the CRAN View provided by Groemping (2009). Chapters 8 and 9 have covered the statistical methods most often used by researchers in a wide variety of fields. In the next chapter, we’ll address issues of power analysis. Power analysis helps us to determine the sample sizes needed to detect an effect of a given size with a given degree of confidence and is a crucial component of research design. www.it-ebooks.info Power analysis This chapter covers ■ Determining sample size requirements ■ Calculating effect sizes ■ Assessing statistical power As a statistical consultant, I’m often asked, “How many subjects do I need for my study?” Sometimes the question is phrased this way: “I have x number of people available for this study. 
Is the study worth doing?” Questions like these can be answered through power analysis, an important set of techniques in experimental design. Power analysis allows you to determine the sample size required to detect an effect of a given size with a given degree of confidence. Conversely, it allows you to determine the probability of detecting an effect of a given size with a given level of confidence, under sample size constraints. If the probability is unacceptably low, you’d be wise to alter or abandon the experiment. In this chapter, you’ll learn how to conduct power analyses for a variety of statistical tests, including tests of proportions, t-tests, chi-square tests, balanced one-way ANOVA, tests of correlations, and linear models. Because power analysis applies to hypothesis testing situations, we’ll start with a brief review of null hypothesis significance testing (NHST). Then we’ll review conducting power analyses within R, focus239 www.it-ebooks.info 240 CHAPTER 10 Power analysis ing primarily on the pwr package. Finally, we’ll consider other approaches to power analysis available with R. 10.1 A quick review of hypothesis testing To help you understand the steps in a power analysis, we’ll briefly review statistical hypothesis testing in general. If you have a statistical background, feel free to skip to section 10.2. In statistical hypothesis testing, you specify a hypothesis about a population parameter (your null hypothesis, or H0). You then draw a sample from this population and calculate a statistic that’s used to make inferences about the population parameter. Assuming that the null hypothesis is true, you calculate the probability of obtaining the observed sample statistic or one more extreme. If the probability is sufficiently small, you reject the null hypothesis in favor of its opposite (referred to as the alternative or research hypothesis, H1). An example will clarify the process. Say you’re interested in evaluating the impact of cell phone use on driver reaction time. Your null hypothesis is Ho: µ 1 – µ 2 = 0, where µ 1 is the mean response time for drivers using a cell phone and µ 2 is the mean response time for drivers that are cell phone free (here, µ1 – µ2 is the population parameter of interest). If you reject this null hypothesis, you’re left with the alternate or research hypothesis, namely H1: µ 1 – µ 2 ≠ 0. This is equivalent to µ1 ≠ µ2, that the mean reaction times for the two conditions are not equal. A sample of individuals is selected and randomly assigned to one of two conditions. In the first condition, participants react to a series of driving challenges in a simulator while talking on a cell phone. In the second condition, participants complete the same series of challenges but without a cell phone. Overall reaction time is assessed for each individual. _ _ _ Based_ on the sample data, you can calculate the statistic (X1 − X2)/(s/ n) , where X 1 and X 2 are the sample reaction time means in the two conditions, s is the pooled sample standard deviation, and n is the number of participants in each condition. If the null hypothesis is true and you can assume that reaction times are normally distributed, this sample statistic will follow a t distribution with 2n – 2 degrees of freedom. Using this fact, you can calculate the probability of obtaining a sample statistic this large or larger. If the probability (p) is smaller than some predetermined cutoff (say p < .05), you reject the null hypothesis in favor of the alternate hypothesis. 
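To make the procedure concrete, here's a tiny simulated version of this comparison (the data are randomly generated purely for illustration and aren't from any real study):

> set.seed(1234)
> phone   <- rnorm(30, mean=5.4, sd=1.25)    # hypothetical reaction times with a cell phone
> nophone <- rnorm(30, mean=4.8, sd=1.25)    # hypothetical reaction times without one
> t.test(phone, nophone, var.equal=TRUE)$p.value

If the resulting p-value falls below the chosen cutoff, you'd reject the null hypothesis of equal means.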
This predetermined cutoff (0.05) is called the significance level of the test. Note that you use sample data to make an inference about the population it's drawn from. Your null hypothesis is that the mean reaction time of all drivers talking on cell phones isn't different from the mean reaction time of all drivers who aren't talking on cell phones, not just those drivers in your sample.
The four possible outcomes from your decision are as follows:
■ If the null hypothesis is false and the statistical test leads you to reject it, you've made a correct decision. You've correctly determined that reaction time is affected by cell phone use.
■ If the null hypothesis is true and you don't reject it, again you've made a correct decision. Reaction time isn't affected by cell phone use.
■ If the null hypothesis is true but you reject it, you've committed a Type I error. You've concluded that cell phone use affects reaction time when it doesn't.
■ If the null hypothesis is false and you fail to reject it, you've committed a Type II error. Cell phone use affects reaction time, but you've failed to discern this.
Each of these outcomes is illustrated in the following table:

                          Decision
  Actual          Reject H0         Fail to reject H0
  H0 true         Type I error      correct
  H0 false        correct           Type II error

Controversy surrounding null hypothesis significance testing
Null hypothesis significance testing isn't without controversy; detractors have raised numerous concerns about the approach, particularly as practiced in the field of psychology. They point to a widespread misunderstanding of p values, reliance on statistical significance over practical significance, the fact that the null hypothesis is never exactly true and will always be rejected for sufficient sample sizes, and a number of logical inconsistencies in NHST practices. An in-depth discussion of this topic is beyond the scope of this book. Interested readers are referred to Harlow, Mulaik, and Steiger (1997).

In planning research, the researcher typically pays special attention to four quantities (see figure 10.1):
■ Sample size refers to the number of observations in each condition/group of the experimental design.
■ The significance level (also referred to as alpha) is defined as the probability of making a Type I error. The significance level can also be thought of as the probability of finding an effect that is not there.
■ Power is defined as one minus the probability of making a Type II error. Power can be thought of as the probability of finding an effect that is there.
■ Effect size is the magnitude of the effect under the alternate or research hypothesis. The formula for effect size depends on the statistical methodology employed in the hypothesis testing.

Figure 10.1 Four primary quantities considered in a study design power analysis. Given any three, you can calculate the fourth.

Although the sample size and significance level are under the direct control of the researcher, power and effect size are affected more indirectly. For example, as you relax the significance level (in other words, make it easier to reject the null hypothesis), power increases. Similarly, increasing the sample size increases power.
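As a quick illustration of how these quantities trade off, base R's power.t.test() function (a simpler relative of the pwr functions covered next) will solve for whichever quantity you leave out. Using the reaction-time example with a 1-second difference and a standard deviation of 1.25:

> power.t.test(n=20, delta=1, sd=1.25, sig.level=.05)        # solves for power
> power.t.test(power=.9, delta=1, sd=1.25, sig.level=.05)    # solves for n per group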
Your research goal is typically to maximize the power of your statistical tests while maintaining an acceptable significance level and employing as small a sample size as possible. That is, you want to maximize the chances of finding a real effect and minimize the chances of finding an effect that isn’t really there, while keeping study costs within reason. The four quantities (sample size, significance level, power, and effect size) have an intimate relationship. Given any three, you can determine the fourth. You’ll use this fact to carry out various power analyses throughout the remainder of the chapter. In the next section, we’ll look at ways of implementing power analyses using the R package pwr. Later, we’ll briefly look at some highly specialized power functions that are used in biology and genetics. 10.2 Implementing power analysis with the pwr package The pwr package, developed by Stéphane Champely, implements power analysis as outlined by Cohen (1988). Some of the more important functions are listed in table 10.1. For each function, you can specify three of the four quantities (sample size, significance level, power, effect size), and the fourth will be calculated. Table 10.1 pwr package functions Function Power calculations for … pwr.2p.test Two proportions (equal n) pwr.2p2n.test Two proportions (unequal n) pwr.anova.test Balanced one-way ANOVA pwr.chisq.test Chi-square test pwr.f2.test General linear model pwr.p.test Proportion (one sample) pwr.r.test Correlation pwr.t.test t-tests (one sample, two samples, paired) pwr.t2n.test t-test (two samples with unequal n) www.it-ebooks.info Implementing power analysis with the pwr package 243 Of the four quantities, effect size is often the most difficult to specify. Calculating effect size typically requires some experience with the measures involved and knowledge of past research. But what can you do if you have no clue what effect size to expect in a given study? We’ll look at this difficult question in section 10.2.7. In the remainder of this section, we’ll look at the application of pwr functions to common statistical tests. Before invoking these functions, be sure to install and load the pwr package. 10.2.1 t-tests When the statistical test to be used is a t-test, the pwr.t.test() function provides a number of useful power analysis options. The format is pwr.t.test(n=, d=, sig.level=, power=, type=, alternative=) where ■ ■ n is the sample size. d is the effect size defined as the standardized mean difference. d = ■ ■ ■ ■ μ1 − μ2 σ where μ1 = mean of group 1 μ2 = mean of group 2 σ2 = common error variance sig.level is the significance level (0.05 is the default). power is the power level. type is a two-sample t-test ("two.sample"), a one-sample t-test ("one.sample"), or a dependent sample t-test ( "paired"). A two-sample test is the default. alternative indicates whether the statistical test is two-sided ("two.sided") or one-sided ("less" or "greater"). A two-sided test is the default. Let’s work through an example. Continuing the experiment from section 10.1 involving cell phone use and driving reaction time, assume that you’ll be using a two-tailed independent sample t-test to compare the mean reaction time for participants in the cell phone condition with the mean reaction time for participants driving unencumbered. Let’s assume that you know from past experience that reaction time has a standard deviation of 1.25 seconds. Also suppose that a 1-second difference in reaction time is considered an important difference. 
You’d therefore like to conduct a study in which you’re able to detect an effect size of d = 1/1.25 = 0.8 or larger. Additionally, you want to be 90% sure to detect such a difference if it exists, and 95% sure that you won’t declare a difference to be significant when it’s actually due to random variability. How many participants will you need in your study? Entering this information in the pwr.t.test() function, you have the following: > library(pwr) > pwr.t.test(d=.8, sig.level=.05, power=.9, type="two.sample", alternative="two.sided") www.it-ebooks.info 244 CHAPTER 10 Power analysis Two-sample t test power calculation n d sig.level power alternative = = = = = 34 0.8 0.05 0.9 two.sided NOTE: n is number in *each* group The results suggest that you need 34 participants in each group (for a total of 68 participants) in order to detect an effect size of 0.8 with 90% certainty and no more than a 5% chance of erroneously concluding that a difference exists when, in fact, it doesn’t. Let’s alter the question. Assume that in comparing the two conditions you want to be able to detect a 0.5 standard deviation difference in population means. You want to limit the chances of falsely declaring the population means to be different to 1 out of 100. Additionally, you can only afford to include 40 participants in the study. What’s the probability that you’ll be able to detect a difference between the population means that’s this large, given the constraints outlined? Assuming that an equal number of participants will be placed in each condition, you have > pwr.t.test(n=20, d=.5, sig.level=.01, type="two.sample", alternative="two.sided") Two-sample t test power calculation n d sig.level power alternative = = = = = 20 0.5 0.01 0.14 two.sided NOTE: n is number in *each* group With 20 participants in each group, an a priori significance level of 0.01, and a dependent variable standard deviation of 1.25 seconds, you have less than a 14% chance of declaring a difference of 0.625 seconds or less significant (d = 0.5 = 0.625/1.25). Conversely, there’s an 86% chance that you’ll miss the effect that you’re looking for. You may want to seriously rethink putting the time and effort into the study as it stands. The previous examples assumed that there are equal sample sizes in the two groups. If the sample sizes for the two groups are unequal, the function pwr.t2n.test(n1=, n2=, d=, sig.level=, power=, alternative=) can be used. Here, n1 and n2 are the sample sizes, and the other parameters are the same as for pwer.t.test. Try varying the values input to the pwr.t2n.test function and see the effect on the output. www.it-ebooks.info Implementing power analysis with the pwr package 245 10.2.2 ANOVA The pwr.anova.test() function provides power analysis options for a balanced oneway analysis of variance. The format is pwr.anova.test(k=, n=, f=, sig.level=, power=) where k is the number of groups and n is the common sample size in each group. For a one-way ANOVA, effect size is measured by f, where pi = ni/N ni = number of observations in group i N = total number of observations μi = mean of group i μ = grand mean σ2 = error variance within groups Let’s try an example. For a one-way ANOVA comparing five groups, calculate the sample size needed in each group to obtain a power of 0.80, when the effect size is 0.25 and a significance level of 0.05 is employed. 
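If you do have hypothesized group means and a common standard deviation in mind, you can compute f yourself before calling pwr.anova.test(). The formula is f = sqrt(sum(pi × (μi − μ)^2) / σ2), using the symbols defined above; the values below are made up purely to show the arithmetic:

> mu    <- c(12, 13, 14, 15, 16)            # hypothesized group means (invented for illustration)
> sigma <- 5                                # hypothesized within-group standard deviation (invented)
> p     <- rep(1/5, 5)                      # equal group sizes, so each pi = 1/5
> grand <- sum(p * mu)                      # grand mean
> sqrt(sum(p * (mu - grand)^2) / sigma^2)   # Cohen's f, about 0.28 for these values

With a value for f in hand, the sample-size calculation itself is a single call.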
The code looks like this:

> pwr.anova.test(k=5, f=.25, sig.level=.05, power=.8)

     Balanced one-way analysis of variance power calculation

              k = 5
              n = 39
              f = 0.25
      sig.level = 0.05
          power = 0.8

NOTE: n is number in each group

The total sample size is therefore 5 × 39, or 195. Note that this example requires you to estimate what the means of the five groups will be, along with the common variance. When you have no idea what to expect, the approaches described in section 10.2.7 may help.

10.2.3 Correlations
The pwr.r.test() function provides a power analysis for tests of correlation coefficients. The format is as follows

pwr.r.test(n=, r=, sig.level=, power=, alternative=)

where n is the number of observations, r is the effect size (as measured by a linear correlation coefficient), sig.level is the significance level, power is the power level, and alternative specifies a two-sided ("two.sided") or a one-sided ("less" or "greater") significance test.
For example, let's assume that you're studying the relationship between depression and loneliness. Your null and research hypotheses are

H0: ρ ≤ 0.25 versus H1: ρ > 0.25

where ρ is the population correlation between these two psychological variables. You've set your significance level to 0.05, and you want to be 90% confident that you'll reject H0 if it's false. How many observations will you need? This code provides the answer:

> pwr.r.test(r=.25, sig.level=.05, power=.90, alternative="greater")

     approximate correlation power calculation (arctangh transformation)

              n = 134
              r = 0.25
      sig.level = 0.05
          power = 0.9
    alternative = greater

Thus, you need to assess depression and loneliness in 134 participants in order to be 90% confident that you'll reject the null hypothesis if it's false.

10.2.4 Linear models
For linear models (such as multiple regression), the pwr.f2.test() function can be used to carry out a power analysis. The format is

pwr.f2.test(u=, v=, f2=, sig.level=, power=)

where u and v are the numerator and denominator degrees of freedom and f2 is the effect size, defined as either

f2 = R2 / (1 - R2)

where R2 = population squared multiple correlation, or as

f2 = (R2AB - R2A) / (1 - R2AB)

where R2A  = variance accounted for in the population by variable set A
      R2AB = variance accounted for in the population by variable sets A and B together

The first formula for f2 is appropriate when you're evaluating the impact of a set of predictors on an outcome. The second formula is appropriate when you're evaluating the impact of one set of predictors above and beyond a second set of predictors (or covariates).
Let's say you're interested in whether a boss's leadership style impacts workers' satisfaction above and beyond the salary and perks associated with the job. Leadership style is assessed by four variables, and salary and perks are associated with three variables. Past experience suggests that salary and perks account for roughly 30% of the variance in worker satisfaction. From a practical standpoint, it would be interesting if leadership style accounted for at least 5% above this figure. Assuming a significance level of 0.05, how many subjects would be needed to identify such a contribution with 90% confidence?
Here, sig.level=0.05, power=0.90, u=3 (total number of predictors minus the number of predictors in set B), and the effect size is f2 = (.35 - .30)/(1 - .35) = 0.0769.
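You can check that effect-size arithmetic directly in R (a one-line verification, not part of the original example):

> f2 <- (.35 - .30)/(1 - .35)    # 0.0769, matching the value above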
Entering this into the function yields the following: www.it-ebooks.info Implementing power analysis with the pwr package 247 > pwr.f2.test(u=3, f2=0.0769, sig.level=0.05, power=0.90) Multiple regression power calculation u v f2 sig.level power = = = = = 3 184.2426 0.0769 0.05 0.9 In multiple regression, the denominator degrees of freedom equals N – k – 1, where N is the number of observations and k is the number of predictors. In this case, N – 7 – 1 = 185, which means the required sample size is N = 185 + 7 + 1 = 193. 10.2.5 Tests of proportions The pwr.2p.test() function can be used to perform a power analysis when comparing two proportions. The format is pwr.2p.test(h=, n=, sig.level=, power=) where h is the effect size and n is the common sample size in each group. The effect size h is defined as h = 2 arcsin p1 − 2 arcsin p2 and can be calculated with the function ES.h(p1, p2). For unequal ns, the desired function is pwr.2p2n.test(h=, n1=, n2=, sig.level=, power=) The alternative= option can be used to specify a two-tailed ("two.sided") or onetailed ("less" or "greater") test. A two-tailed test is the default. Let’s say that you suspect that a popular medication relieves symptoms in 60% of users. A new (and more expensive) medication will be marketed if it improves symptoms in 65% of users. How many participants will you need to include in a study comparing these two medications if you want to detect a difference this large? Assume that you want to be 90% confident in a conclusion that the new drug is better and 95% confident that you won’t reach this conclusion erroneously. You’ll use a one-tailed test because you’re only interested in assessing whether the new drug is better than the standard. The code looks like this: > pwr.2p.test(h=ES.h(.65, .6), sig.level=.05, power=.9, alternative="greater") Difference of proportion power calculation for binomial distribution (arcsine transformation) h = 0.1033347 n = 1604.007 sig.level = 0.05 power = 0.9 alternative = greater NOTE: same sample sizes www.it-ebooks.info 248 CHAPTER 10 Power analysis Based on these results, you’ll need to conduct a study with 1,605 individuals receiving the new drug and 1,605 receiving the existing drug in order to meet the criteria. 10.2.6 Chi-square tests Chi-square tests are often used to assess the relationship between two categorical variables. The null hypothesis is typically that the variables are independent versus a research hypothesis that they aren’t. The pwr.chisq.test() function can be used to evaluate the power, effect size, or requisite sample size when employing a chi-square test. The format is pwr.chisq.test(w=, N=, df=, sig.level=, power=) where w is the effect size, N is the total sample size, and df is the degrees of freedom. Here, effect size w is defined as where p0i = cell probability in ith cell under H0 p1i = cell probability in ith cell under H1 The summation goes from 1 to m, where m is the number of cells in the contingency table. The function ES.w2(P) can be used to calculate the effect size corresponding to the alternative hypothesis in a two-way contingency table. Here, P is a hypothesized two-way probability table. As a simple example, let’s assume that you’re looking at the relationship between ethnicity and promotion. You anticipate that 70% of your sample will be Caucasian, 10% will be African-American, and 20% will be Hispanic. Further, you believe that 60% of Caucasians tend to be promoted, compared with 30% for African-Americans and 50% for Hispanics. 
Your research hypothesis is that the probability of promotion follows the values in table 10.2. Table 10.2 Proportion of individuals expected to be promoted based on the research hypothesis Ethnicity Promoted Not promoted Caucasian 0.42 0.28 African-American 0.03 0.07 Hispanic 0.10 0.10 For example, you expect that 42% of the population will be promoted Caucasians (.42 = .70 × .60) and 7% of the population will be nonpromoted African-Americans (.07 = .10 × .70). Let’s assume a significance level of 0.05 and that the desired power level is 0.90. The degrees of freedom in a two-way contingency table are (r– 1)× (c – 1), where r is the number of rows and c is the number of columns. You can calculate the hypothesized effect size with the following code: www.it-ebooks.info 249 Implementing power analysis with the pwr package > prob <- matrix(c(.42, .28, .03, .07, .10, .10), byrow=TRUE, nrow=3) > ES.w2(prob) [1] 0.1853198 Using this information, you can calculate the necessary sample size like this: > pwr.chisq.test(w=.1853, df=2, sig.level=.05, power=.9) Chi squared power calculation w N df sig.level power = = = = = 0.1853 368.5317 2 0.05 0.9 NOTE: N is the number of observations The results suggest that a study with 369 participants will be adequate to detect a relationship between ethnicity and promotion given the effect size, power, and significance level specified. 10.2.7 Choosing an appropriate effect size in novel situations In power analysis, the expected effect size is the most difficult parameter to determine. It typically requires that you have experience with the subject matter and the measures employed. For example, the data from past studies can be used to calculate effect sizes, which can then be used to plan future studies. But what can you do when the research situation is completely novel and you have no past experience to call upon? In the area of behavioral sciences, Cohen (1988) attempted to provide benchmarks for “small,” “medium,” and “large” effect sizes for various statistical tests. These guidelines are provided in table 10.3. Table 10.3 Cohen’s effect size benchmarks Statistical method Effect size measures Suggested guidelines for effect size Small Medium Large t-test d 0.20 0.50 0.80 ANOVA f 0.10 0.25 0.40 f2 0.02 0.15 0.35 Test of proportions h 0.20 0.50 0.80 Chi-square w 0.10 0.30 0.50 Linear models When you have no idea what effect size may be present, this table may provide some guidance. For example, what’s the probability of rejecting a false null hypothesis (that www.it-ebooks.info 250 CHAPTER 10 Power analysis is, finding a real effect) if you’re using a one-way ANOVA with 5 groups, 25 subjects per group, and a significance level of 0.05? Using the pwr.anova.test() function and the suggestions in the f row of table 10.3, the power would be 0.118 for detecting a small effect, 0.574 for detecting a moderate effect, and 0.957 for detecting a large effect. Given the sample-size limitations, you’re only likely to find an effect if it’s large. It’s important to keep in mind that Cohen’s benchmarks are just general suggestions derived from a range of social research studies and may not apply to your particular field of research. An alternative is to vary the study parameters and note the impact on such things as sample size and power. For example, again assume that you want to compare five groups using a one-way ANOVA and a 0.05 significance level. 
The following listing computes the sample sizes needed to detect a range of effect sizes and plots the results in figure 10.2. Listing 10.1 Sample sizes for detecting significant effects in a one-way ANOVA library(pwr) es <- seq(.1, .5, .01) nes <- length(es) samsize <- NULL for (i in 1:nes){ result <- pwr.anova.test(k=5, f=es[i], sig.level=.05, power=.9) samsize[i] <- ceiling(result$n) } plot(samsize,es, type="l", lwd=2, col="red", ylab="Effect Size", xlab="Sample Size (per cell)", main="One Way ANOVA with Power=.90 and Alpha=.05") 0.3 0.2 Figure 10.2 Sample size needed to detect various effect sizes in a one-way ANOVA with five groups (assuming a power of 0.90 and significance level of 0.05) 0.1 Effect Size 0.4 0.5 One Way ANOVA with Power=.90 and Alpha=.05 50 100 150 200 250 300 Sample Size (per cell) www.it-ebooks.info 251 Creating power analysis plots Graphs such as these can help you estimate the impact of various conditions on your experimental design. For example, there appears to be little bang for the buck in increasing the sample size above 200 observations per group. We’ll look at another plotting example in the next section. 10.3 Creating power analysis plots Before leaving the pwr package, let’s look at a more involved graphing example. Suppose you’d like to see the sample size necessary to declare a correlation coefficient statistically significant for a range of effect sizes and power levels. You can use the pwr.r.test() function and for loops to accomplish this task, as shown in the following listing. Listing 10.2 Sample-size curves for detecting correlations of various sizes library(pwr) r <- seq(.1,.5,.01) nr <- length(r) b p <- seq(.4,.9,.1) np <- length(p) Sets the range of correlations and power values samsize <- array(numeric(nr*np), dim=c(nr,np)) for (i in 1:np){ for (j in 1:nr){ result <- pwr.r.test(n = NULL, r = r[j], sig.level = .05, power = p[i], alternative = "two.sided") samsize[j,i] <- ceiling(result$n) } } c Obtains sample size xrange <- range(r) yrange <- round(range(samsize)) Sets up the graph colors <- rainbow(length(p)) plot(xrange, yrange, type="n", xlab="Correlation Coefficient (r)", ylab="Sample Size (n)" ) d for (i in 1:np){ lines(r, samsize[,i], type="l", lwd=2, col=colors[i]) } e Adds power curves abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89") abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2, col="gray89") title("Sample Size Estimation for Correlation Studies\n Sig=0.05 (Two-tailed)") legend("topright", title="Power", as.character(p), fill=colors) g f Adds grid lines Adds annotations Listing 10.2 uses the seq function to generate a range of effect sizes r (correlation coefficients under H1) and power levels p b. It then uses two for loops to cycle www.it-ebooks.info 252 CHAPTER 10 Power analysis Sample Size Estimation for Correlation Studies Sig=0.05 (Two−tailed) 1000 Power 600 400 0 200 Sample Siz e (n) 800 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 Correlation Coefficient (r) Figure 10.3 Sample size curves for detecting a significant correlation at various power levels through these effect sizes and power levels, calculating the corresponding sample sizes required and saving them in the array samsize c. The graph is set up with the appropriate horizontal and vertical axes and labels d. Power curves are added using lines rather than points e. Finally, a grid f and legend g are added to aid in reading the graph. The resulting graph is displayed in figure 10.3. 
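Incidentally, the same samsize matrix can be built more compactly with nested sapply() calls in place of the for loops, reusing the r and p vectors defined in listing 10.2 (a sketch; the plotting code is unchanged):

samsize <- sapply(p, function(pow)
    sapply(r, function(rho)
        ceiling(pwr.r.test(r=rho, sig.level=.05, power=pow,
                           alternative="two.sided")$n)))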
As you can see from the graph, you’d need a sample size of approximately 75 to detect a correlation of 0.20 with 40% confidence. You’d need approximately 185 additional observations (n = 260) to detect the same correlation with 90% confidence. With simple modifications, the same approach can be used to create sample size and power curve graphs for a wide range of statistical tests. We’ll close this chapter by briefly looking at other R functions that are useful for power analysis. 10.4 Other packages There are several other packages in R that can be useful in the planning stages of studies (see table 10.4). Some contain general tools, whereas some are highly specialized. The last five in the table are particularly focused on power analysis in genetic studies. Genome-wide association studies (GWAS) are studies used to identify genetic www.it-ebooks.info 253 Summary associations with observable traits. For example, these studies would focus on why some people get a specific type of heart disease. Table 10.4 Specialized power-analysis packages Package Purpose asypow Power calculations via asymptotic likelihood ratio methods longpower Sample-size calculations for longitudinal data PwrGSD Power analysis for group sequential designs pamm Power analysis for random effects in mixed models powerSurvEpi Power and sample-size calculations for survival analysis in epidemiological studies powerMediation Power and sample-size calculations for mediation effects in linear, logistic, Poisson, and cox regression powerpkg Power analyses for the affected sib pair and the TDT (transmission disequilibrium test) design powerGWASinteraction Power calculations for interactions for GWAS pedantics Functions to facilitate power analyses for genetic studies of natural populations gap Functions for power and sample-size calculations in case-cohort designs ssize.fdr Sample-size calculations for microarray experiments Finally, the MBESS package contains a wide range of functions that can be used for various forms of power analysis and sample size determination. The functions are particularly relevant for researchers in the behavioral, educational, and social sciences. 10.5 Summary In chapters 7, 8, and 9, we explored a wide range of R functions for statistical hypothesis testing. In this chapter, we focused on the planning stages of such research. Power analysis helps you to determine the sample sizes needed to discern an effect of a given size with a given degree of confidence. It can also tell you the probability of detecting such an effect for a given sample size. You can directly see the tradeoff between limiting the likelihood of wrongly declaring an effect significant (a Type I error) with the likelihood of rightly identifying a real effect (power). The bulk of this chapter has focused on the use of functions provided by the pwr package. These functions can be used to carry out power and sample-size determinations for common statistical methods (including t-tests, chi-square tests, and tests of proportions, ANOVA, and regression). Pointers to more specialized methods were provided in the final section. www.it-ebooks.info 254 CHAPTER 10 Power analysis Power analysis is typically an interactive process. The investigator varies the parameters of sample size, effect size, desired significance level, and desired power to observe their impact on each other. The results are used to plan studies that are more likely to yield meaningful results. 
Information from past research (particularly regarding effect sizes) can be used to design more effective and efficient future research. An important side benefit of power analysis is the shift that it encourages, away from a singular focus on binary hypothesis testing (that is, does an effect exist or not), toward an appreciation of the size of the effect under consideration. Journal editors are increasingly requiring authors to include effect sizes as well as p values when reporting research results. This helps you to determine both the practical implications of the research and provides you with information that can be used to plan future studies. In the next chapter, we’ll look at additional and novel ways to visualize multivariate relationships. These graphic methods can complement and enhance the analytic methods that we’ve discussed so far and prepare you for the advanced methods covered in part 3. www.it-ebooks.info Intermediate graphs This chapter covers ■ Visualizing bivariate and multivariate relationships ■ Working with scatter and line plots ■ Understanding corrgrams ■ Using mosaic and association plots In chapter 6 (basic graphs), we considered a wide range of graph types for displaying the distribution of single categorical or continuous variables. Chapter 8 (regression) reviewed graphical methods that are useful when predicting a continuous outcome variable from a set of predictor variables. In chapter 9 (analysis of variance), we considered techniques that are particularly useful for visualizing how groups differ on a continuous outcome variable. In many ways, the current chapter is a continuation and extension of the topics covered so far. In this chapter, we’ll focus on graphical methods for displaying relationships between two variables (bivariate relationships) and between many variables (multivariate relationships). For example: ■ What’s the relationship between automobile mileage and car weight? Does it vary by the number of cylinders the car has? 255 www.it-ebooks.info 256 CHAPTER 11 ■ ■ ■ ■ ■ ■ Intermediate graphs How can you picture the relationships among an automobile’s mileage, weight, displacement, and rear axle ratio in a single graph? When plotting the relationship between two variables drawn from a large dataset (say, 10,000 observations), how can you deal with the massive overlap of data points you’re likely to see? In other words, what do you do when your graph is one big smudge? How can you visualize the multivariate relationships among three variables at once (given a 2D computer screen or sheet of paper, and a budget slightly less than that for Avatar)? How can you display the growth of several trees over time? How can you visualize the correlations among a dozen variables in a single graph? How does it help you to understand the structure of your data? How can you visualize the relationship of class, gender, and age with passenger survival on the Titanic? What can you learn from such a graph? These are the types of questions that can be answered with the methods described in this chapter. The datasets that we’ll use are examples of what’s possible. It’s the general techniques that are most important. If the topic of automobile characteristics or tree growth isn’t interesting to you, plug in your own data! We’ll start with scatter plots and scatter-plot matrices. Then, we’ll explore line charts of various types. These approaches are well known and widely used in research. 
Next, we’ll review the use of corrgrams for visualizing correlations and mosaic plots for visualizing multivariate relationships among categorical variables. These approaches are also useful but much less well known among researchers and data analysts. You’ll see examples of how you can use each of these approaches to gain a better understanding of your data and communicate these findings to others. 11.1 Scatter plots As you’ve seen in previous chapters, scatter plots describe the relationship between two continuous variables. In this section, we’ll start with a depiction of a single bivariate relationship (x versus y). We’ll then explore ways to enhance this plot by superimposing additional information. Next, you’ll learn how to combine several scatter plots into a scatter-plot matrix so that you can view many bivariate relationships at once. We’ll also review the special case where many data points overlap, limiting your ability to picture the data, and we’ll discuss a number of ways around this difficulty. Finally, we’ll extend the two-dimensional graph to three dimensions, with the addition of a third continuous variable. This will include 3D scatter plots and bubble plots. Each can help you understand the multivariate relationship among three variables at once. The basic function for creating a scatter plot in R is plot(x, y), where x and y are numeric vectors denoting the (x, y) points to plot. The following listing presents an example. www.it-ebooks.info 257 Scatter plots Listing 11.1 A scatter plot with best-fit lines attach(mtcars) plot(wt, mpg, main="Basic Scatter plot of MPG vs. Weight", xlab="Car Weight (lbs/1000)", ylab="Miles Per Gallon ", pch=19) abline(lm(mpg~wt), col="red", lwd=2, lty=1) lines(lowess(wt,mpg), col="blue", lwd=2, lty=2) The resulting graph is provided in figure 11.1. The code in listing 11.1 attaches the mtcars data frame and creates a basic scatter plot using filled circles for the plotting symbol. As expected, as car weight increases, miles per gallon decreases, although the relationship isn’t perfectly linear. The abline() function is used to add a linear line of best fit, and the lowess() function is used to add a smoothed line. This smoothed line is a nonparametric fit line based on locally weighted polynomial regression. See Cleveland (1981) for details on the algorithm. NOTE R has two functions for producing lowess fits: lowess() and loess(). The loess() function is a newer, formula-based version of lowess() and is more powerful. The two functions have different defaults, so be careful not to confuse them. The scatterplot() function in the car package offers many enhanced features and convenience functions for producing scatter plots, including fit lines, marginal box plots, confidence ellipses, plotting by subgroups, and interactive point identification. 25 20 15 Figure 11.1 Scatter plot of car mileage vs. weight, with superimposed linear and lowess fit lines 10 Miles Per Gallon 30 Basic Scatter plot of MPG vs. Weight 2 3 4 Car Weight (lbs/1000) www.it-ebooks.info 5 258 CHAPTER 11 Intermediate graphs For example, a more complex version of the previous plot is produced by the following code: library(car) scatterplot(mpg ~ wt | cyl, data=mtcars, lwd=2, span=0.75, main="Scatter Plot of MPG vs. 
Weight by # Cylinders", xlab="Weight of Car (lbs/1000)", ylab="Miles Per Gallon", legend.plot=TRUE, id.method="identify", labels=row.names(mtcars), boxplots="xy" ) Here, the scatterplot() function is used to plot miles per gallon versus weight for automobiles that have four, six, or eight cylinders. The formula mpg ~ wt | cyl indicates conditioning (that is, separate plots between mpg and wt for each level of cyl). The graph is provided in figure 11.2. By default, subgroups are differentiated by color and plotting symbol, and separate linear and loess lines are fit. The span parameter controls the amount of smoothing in the loess line. Larger values lead to smoother fits. The id.method option indicates that points will be identified interactively by mouse clicks, until you select Stop (via the Graphics or context-sensitive menu) or press the Esc key. The labels option indicates that points will be identified with their row names. Here you see that the Toyota Corolla and Fiat 128 have unusually good gas mileage, given their weights. The cyl Scatter Plot of MPG vs. Weight by # Cylinders 25 20 15 10 Miles Per Gallon 30 4 6 8 2 3 4 Weight of Car (lbs/1000) www.it-ebooks.info 5 Figure 11.2 Scatter plot with subgroups and separately estimated fit lines 259 Scatter plots legend.plot option adds a legend to the upper-left margin, and marginal box plots for mpg and weight are requested with the boxplots option. The scatterplot() function has many features worth investigating, including robust options and data concentration ellipses not covered here. See help(scatterplot) for more details. Scatter plots help you visualize relationships between quantitative variables, two at a time. But what if you wanted to look at the bivariate relationships between automobile mileage, weight, displacement (cubic inch), and rear axle ratio? One way is to arrange these six scatter plots in a matrix. When there are several quantitative variables, you can represent their relationships in a scatter-plot matrix, which is covered next. 11.1.1 Scatter-plot matrices There are many useful functions for creating scatter-plot matrices in R. A basic scatterplot matrix can be created with the pairs() function. The following code produces a scatter-plot matrix for the variables mpg, disp, drat, and wt: pairs(~mpg+disp+drat+wt, data=mtcars, main="Basic Scatter Plot Matrix") All the variables on the right of the ~ are included in the plot. The graph is provided in figure 11.3. Here you can see the bivariate relationship among all the variables specified. For example, the scatter plot between mpg and disp is found at the row and column Basic Scatter Plot Matrix 200 300 400 2 3 4 5 25 30 100 300 400 10 15 20 mpg 4.0 4.5 5.0 100 200 disp 4 5 3.0 3.5 drat 2 3 wt 10 15 20 25 30 3.0 3.5 4.0 4.5 5.0 www.it-ebooks.info Figure 11.3 Scatter-plot matrix created by the pairs() function 260 CHAPTER 11 Intermediate graphs intersection of those two variables. Note that the six scatter plots below the principal diagonal are the same as those above the diagonal. This arrangement is a matter of convenience. By adjusting the options, you could display just the lower or upper triangle. For example, the option upper.panel=NULL would produce a graph with just the lower triangle of plots. The scatterplotMatrix() function in the car package can also produce scatterplot matrices and can optionally do the following: Condition the scatter plot matrix on a factor. Include linear and loess fit lines. 
Place box plots, densities, or histograms in the principal diagonal. Add rug plots in the margins of the cells. ■ ■ ■ ■ Here’s an example: library(car) scatterplotMatrix(~ mpg + disp + drat + wt, data=mtcars, spread=FALSE, smoother.args=list(lty=2), main="Scatter Plot Matrix via car Package") The graph is provided in figure 11.4. Here you can see that linear and smoothed (loess) fit lines are added by default and that kernel density and rug plots are added to the principal diagonal. The spread=FALSE option suppresses lines showing spread Scatter Plot Matrix via car Package 100 200 300 400 2 3 4 10 15 20 25 30 mpg 5 5.0 100 200 300 400 disp 3.0 3.5 4.0 4.5 drat 2 3 4 5 wt 10 15 20 25 30 3.0 3.5 4.0 4.5 5.0 www.it-ebooks.info Figure 11.4 Scatter-plot matrix created with the scatterplotMatrix() function. The graph includes kernel density and rug plots in the principal diagonal and linear and loess fit lines. 261 Scatter plots and asymmetry, and the smoother.args=list(lty=2) option displays the loess fit lines using dashed rather than solid lines. R provides many other ways to create scatter-plot matrices. You may want to explore the cpars() function in the glus package, the pairs2() function in the TeachingDemos package, the xysplom() function in the HH package, the kepairs() function in the ResourceSelection package, and pairs.mod() in the SMPracticals package. Each adds its own unique twist. Analysts must love scatter-plot matrices! 11.1.2 High-density scatter plots When there’s a significant overlap among data points, scatter plots become less useful for observing relationships. Consider the following contrived example with 10,000 observations falling into two overlapping clusters of data: set.seed(1234) n <- 10000 c1 <- matrix(rnorm(n, mean=0, sd=.5), ncol=2) c2 <- matrix(rnorm(n, mean=3, sd=2), ncol=2) mydata <- rbind(c1, c2) mydata <- as.data.frame(mydata) names(mydata) <- c("x", "y") If you generate a standard scatter plot between these variables using the following code with(mydata, plot(x, y, pch=19, main="Scatter Plot with 10,000 Observations")) you’ll obtain a graph like the one in figure 11.5. 0 y 5 10 Scatter Plot with 10,000 Observations −5 Figure 11.5 Scatter plot with 10,000 observations and significant overlap of data points. Note that the overlap of data points makes it difficult to discern where the concentration of data is greatest. −5 0 5 x www.it-ebooks.info 10 262 Intermediate graphs CHAPTER 11 The overlap of data points in figure 11.5 makes it difficult to discern the relationship between x and y. R provides several graphical approaches that can be used when this occurs. They include the use of binning, color, and transparency to indicate the number of overprinted data points at any point on the graph. The smoothScatter() function uses a kernel-density estimate to produce smoothed color density representations of the scatter plot. The following code with(mydata, smoothScatter(x, y, main="Scatter Plot Colored by Smoothed Densities")) produces the graph in figure 11.6. Using a different approach, the hexbin() function in the hexbin package provides bivariate binning into hexagonal cells (it looks better than it sounds). Applying this function to the dataset library(hexbin) with(mydata, { bin <- hexbin(x, y, xbins=50) plot(bin, main="Hexagonal Binning with 10,000 Observations") }) you get the scatter plot in figure 11.7. −5 0 y 5 10 Scatter Plot Colored by Smoothed Densities −5 0 5 10 x Figure 11.6 Scatter plot using smoothScatter() to plot smoothed density estimates. 
Densities are easy to read from the graph. www.it-ebooks.info 263 Scatter plots Hexagonal Binning with 10,000 Observations 10 Counts 262 246 229 213 197 180 164 148 132 115 99 83 66 50 34 17 1 y 5 0 −5 −5 0 5 10 x Figure 11.7 Scatter plot using hexagonal binning to display the number of observations at each point. Data concentrations are easy to see, and counts can be read from the legend. It’s useful to note that the smoothScatter() function in the base package, along with the ipairs() function in the IDPmisc package, can be used to create readable scatter plot matrices for large datasets as well. See ?smoothScatter and ?ipairs for examples. 11.1.3 3D scatter plots Scatter plots and scatter-plot matrices display bivariate relationships. What if you want to visualize the interaction of three quantitative variables at once? In this case, you can use a 3D scatter plot. For example, say that you’re interested in the relationship between automobile mileage, weight, and displacement. You can use the scatterplot3d() function in the scatterplot3d package to picture their relationship. The format is scatterplot3d(x, y, z) where x is plotted on the horizontal axis, y is plotted on the vertical axis, and z is plotted in perspective. Continuing the example, library(scatterplot3d) attach(mtcars) scatterplot3d(wt, disp, mpg, main="Basic 3D Scatter Plot") www.it-ebooks.info 264 Intermediate graphs CHAPTER 11 disp 500 20 mpg 25 30 35 Basic 3D Scatter Plot 400 15 300 200 10 100 0 1 2 3 4 5 Figure 11.8 3D scatter plot of miles per gallon, auto weight, and displacement 6 wt produces the 3D scatter plot in figure 11.8. The scatterplot3d() function offers many options, including the ability to specify symbols, axes, colors, lines, grids, highlighting, and angles. For example, the code library(scatterplot3d) attach(mtcars) scatterplot3d(wt, disp, mpg, pch=16, highlight.3d=TRUE, type="h", main="3D Scatter Plot with Vertical Lines") produces a 3D scatter plot with highlighting to enhance the impression of depth, and vertical lines connecting points to the horizontal plane (see figure 11.9). 400 15 300 200 Figure 11.9 3D scatter plot with vertical lines and shading 10 100 0 1 2 3 4 wt www.it-ebooks.info 5 6 disp 500 20 mpg 25 30 35 3D Scatter Plot with Vertical Lines 265 Scatter plots As a final example, let’s take the previous graph and add a regression plane. The necessary code is library(scatterplot3d) attach(mtcars) s3d <-scatterplot3d(wt, disp, mpg, pch=16, highlight.3d=TRUE, type="h", main="3D Scatter Plot with Vertical Lines and Regression Plane") fit <- lm(mpg ~ wt+disp) s3d$plane3d(fit) 20 500 disp mpg 25 30 35 3D Scatter Plot with Vertical Lines and Regression Plane 400 15 300 200 100 10 The resulting graph is provided in figure 11.10. The graph allows you to visualize the prediction of miles per gallon from automobile weight and displacement using a multipleregression equation. The plane represents the predicted values, and the points are the actual values. The vertical distances from the plane to the points are the residuals. Points that lie above the plane are under-predicted, whereas points that lie below the line are over-predicted. Multiple regression is covered in chapter 8. 0 1 2 3 4 5 6 wt Figure 11.10 3D scatter plot with vertical lines, shading, and overlaid regression plane 11.1.4 Spinning 3D scatter plots Three-dimensional scatter plots are much easier to interpret if you can interact with them. 
R provides several mechanisms for rotating graphs so that you can see the plotted points from more than one angle. For example, you can create an interactive 3D scatter plot using the plot3d() function in the rgl package. It creates a spinning 3D scatter plot that can be rotated with the mouse. The format is plot3d(x, y, z) where x, y, and z are numeric vectors representing points. You can also add options like col and size to control the color and size of the points, respectively. Continuing the example, try this code: library(rgl) attach(mtcars) plot3d(wt, disp, mpg, col="red", size=5) www.it-ebooks.info 266 CHAPTER 11 Intermediate graphs You should get a graph like the one depicted in figure 11.11. Use the mouse to rotate the axes. I think you’ll find that being able to rotate the scatter plot in three dimensions makes the graph much easier to understand. You can perform a similar function with scatter3d() in the car package: library(car) with(mtcars, scatter3d(wt, disp, mpg)) The results are displayed in figure 11.12. The scatter3d() function can include a variety of regression surfaces, such as linear, quadratic, smooth, and additive. The linear surface depicted is the default. Additionally, there are options for interactively identifying points. See help(scatter3d) for more details. Figure 11.11 Rotating 3D scatter plot produced by the plot3d() function in the rgl package 11.1.5 Bubble plots In the previous section, you displayed the relationship between three quantitative variables using a 3D scatter plot. Another approach is to create a 2D scatter plot and use the size of the plotted point to represent the value of the third variable. This approach is referred to as a bubble plot. You can create a bubble plot Figure 11.12 Spinning 3D scatter plot produced by using the symbols() function. This the scatter3d() function in the car package function can be used to draw circles, squares, stars, thermometers, and box plots at a specified set of (x, y) coordinates. For plotting circles, the format is symbols(x, y, circle=radius) where x , y, and radius are vectors specifying the x and y coordinates and circle radii, respectively. www.it-ebooks.info 267 Scatter plots You want the areas, rather than the radii, of the circles to be proportional to the A , the values of a third variable. Given the formula for the radius of a circle r = − π proper call is symbols(x, y, circle=sqrt(z/pi)) where z is the third variable to be plotted. Let’s apply this to the mtcars data, plotting car weight on the x-axis, miles per gallon on the y-axis, and engine displacement as the bubble size. The following code attach(mtcars) r <- sqrt(disp/pi) symbols(wt, mpg, circle=r, inches=0.30, fg="white", bg="lightblue", main="Bubble Plot with point size proportional to displacement", ylab="Miles Per Gallon", xlab="Weight of Car (lbs/1000)") text(wt, mpg, rownames(mtcars), cex=0.6) detach(mtcars) produces the graph in figure 11.13. The option inches is a scaling factor that can be used to control the size of the circles (the default is to make the largest circle 1 inch). The text() function is optional. Here it is used to add the names of the cars to the plot. From the figure, you can see that increased gas mileage is associated with both decreased car weight and engine displacement. In general, statisticians involved in the R project tend to avoid bubble plots for the same reason they avoid pie charts. 
Humans typically have a harder time making 35 Bubble Plot with point size proportional to displacement Toyota Corolla 30 Fiat 128 Lotus Europa Honda Civic 25 Porsche 914−2 Merc 240D Datsun 710 Merc 230 15 20 Toyota Corona Volvo 142E Hornet 4 Drive Mazda Mazda RX4RX4 Wag Ferrari Dino Merc 280 Pontiac Firebird Hornet Spor tabout Valiant Merc 280C Merc 450SL Merc 450SE Ford Pantera DodgeLChallenger AMC JavMerc elinBor 450SLC Maserati a Duster 360 Chr ysler Imperial Camaro Z28 Cadillac Lincoln Fleetwood Continental 10 Miles Per Gallon Fiat X1−9 Weight of Car (lbs/1000) www.it-ebooks.info Figure 11.13 Bubble plot of car weight vs. mpg, where point size is proportional to engine displacement 268 CHAPTER 11 Intermediate graphs judgments about volume than distance. But bubble charts are popular in the business world, so I’m including them here for completeness. l’ve certainly had a lot to say about scatter plots. This attention to detail is due, in part, to the central place that scatter plots hold in data analysis. Although simple, they can help you visualize your data in an immediate and straightforward manner, uncovering relationships that might otherwise be missed. 11.2 Line charts If you connect the points in a scatter plot moving from left to right, you have a line plot. The dataset Orange that come with the base installation contains age and circumference data for five orange trees. Consider the growth of the first orange tree, depicted in figure 11.14. The plot on the left is a scatter plot, and the plot on the right is a line chart. As you can see, line charts are particularly good vehicles for conveying change. The graphs in figure 11.14 were created with the code in the following listing. Listing 11.2 Creating side-by-side scatter and line plots opar <- par(no.readonly=TRUE) par(mfrow=c(1,2)) t1 <- subset(Orange, Tree==1) plot(t1$age, t1$circumference, xlab="Age (days)", ylab="Circumference (mm)", main="Orange Tree 1 Growth") plot(t1$age, t1$circumference, xlab="Age (days)", ylab="Circumference (mm)", main="Orange Tree 1 Growth", type="b") par(opar) Orange Tree 1 Growth 140 120 100 80 40 60 Circumference (mm) 120 100 80 60 40 Circumference (mm) 140 Orange Tree 1 Growth 500 1000 1500 Age (days) Figure 11.14 500 1000 Age (days) Comparison of a scatter plot and a line plot www.it-ebooks.info 1500 269 Line charts You’ve seen the elements that make up this code in chapter 3, so I won’t go into detail here. The main difference between the two plots in figure 11.14 is produced by the option type="b". In general, line charts are created with one of the following two functions plot(x, y, type=) lines(x, y, type=) where x and y are numeric vectors of (x,y) points to connect. The option type= can take the values described in table 11.1. Examples of each type are given in figure 11.15. As you can see, type="p" produces the typical scatter plot. The option type="b" is the most common for line charts. The difference between b and c is whether the points appear or gaps are left instead. Both type="s" and type="S" produce stair steps (step functions). The first runs, then rises, whereas the second rises, then runs. 
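If you'd like to see all of these options at once on your own machine, here is a short sketch in the spirit of figure 11.15 (the data values are made up purely for illustration):

x <- 1:5
y <- c(1, 3, 2, 5, 4)
opar <- par(no.readonly=TRUE)
par(mfrow=c(3, 3))                                   # one panel per type= value
for (t in c("p", "l", "o", "b", "c", "s", "S", "h", "n")) {
    plot(x, y, type=t, main=paste0('type="', t, '"'))
}
par(opar)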
4 5 Type Points only l Lines only o Over-plotted points (that is, lines overlaid on top of points) b, c Points (empty if c) joined by lines s, S Stair steps h Histogram-line vertical lines n Doesn’t produce any points or lines (used to set up the axes for later commands) 3 4 5 5 4 y 1 2 y 2 2 3 4 5 type= "b" 1 1 What is plotted p 3 4 y 2 1 3 1 2 3 4 5 1 3 4 x type= "c" type= "s" type= "S" type= "h" 4 3 3 x 4 5 1 2 3 x 4 5 Figure 11.15 type= options in the plot() and lines() functions 1 2 y 4 1 1 2 2 y 3 5 4 3 y 4 3 2 5 5 x 5 x 2 1 2 x 1 y 3 4 y 3 2 1 2 5 1 Line chart options type= "o" 5 type= "l" 5 type= "p" Table 11.1 1 2 3 4 5 x www.it-ebooks.info 1 2 3 x 4 5 270 CHAPTER 11 Intermediate graphs There’s an important difference between the plot() and lines() functions. The plot() function creates a new graph when invoked. The lines() function adds information to an existing graph but can’t produce a graph on its own. Because of this, the lines() function is typically used after a plot() command has produced a graph. If desired, you can use the type="n" option in the plot() function to set up the axes, titles, and other graph features, and then use the lines() function to add various lines to the plot. To demonstrate the creation of a more complex line chart, let’s plot the growth of all five orange trees over time. Each tree will have its own distinctive line. The code is shown in the next listing and the results in figure 11.16. Listing 11.3 Line chart displaying the growth of five orange trees over time Orange$Tree <- as.numeric(Orange$Tree) ntrees <- max(Orange$Tree) Converts a factor to numeric for convenience xrange <- range(Orange$age) yrange <- range(Orange$circumference) plot(xrange, yrange, type="n", xlab="Age (days)", ylab="Circumference (mm)" ) Sets up the plot colors <- rainbow(ntrees) linetype <- c(1:ntrees) plotchar <- seq(18, 18+ntrees, 1) for (i in 1:ntrees) { tree <- subset(Orange, Tree==i) lines(tree$age, tree$circumference, type="b", lwd=2, lty=linetype[i], col=colors[i], pch=plotchar[i] ) } Adds lines title("Tree Growth", "example of line plot") legend(xrange[1], yrange[2], 1:ntrees, cex=0.8, col=colors, pch=plotchar, lty=linetype, title="Tree" ) Adds a legend In listing 11.3, the plot() function is used to set up the graph and specify the axis labels and ranges but plots no actual data. The lines() function is then used to add a separate line and set of points for each orange tree. You can see that tree 4 and tree 5 www.it-ebooks.info 271 Corrgrams Tree Growth 100 150 1 2 3 4 5 50 Circumference (mm) 200 Tree Figure 11.16 Line chart displaying the growth of five orange trees 500 1000 1500 Age (days) example of line plot demonstrated the greatest growth across the range of days measured, and that tree 5 overtakes tree 4 at around 664 days. Many of the programming conventions in R that I discussed in chapters 2, 3, and 4 are used in listing 11.3. You may want to test your understanding by working through each line of code and visualizing what it’s doing. If you can, you’re on your way to becoming a serious R programmer (and fame and fortune is near at hand)! In the next section, you’ll explore ways of examining a number of correlation coefficients at once. 11.3 Corrgrams Correlation matrices are a fundamental aspect of multivariate statistics. Which variables under consideration are strongly related to each other, and which aren’t? Are there clusters of variables that relate in specific ways? As the number of variables grows, such questions can be harder to answer. 
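One quick numeric alternative (a sketch of my own, not taken from the text) is to rank the pairwise correlations by absolute size:

r <- cor(mtcars)
r[upper.tri(r, diag=TRUE)] <- NA                     # keep each pair once
idx <- which(!is.na(r), arr.ind=TRUE)
cor_pairs <- data.frame(var1 = rownames(r)[idx[, 1]],
                        var2 = colnames(r)[idx[, 2]],
                        r    = round(r[idx], 2))
head(cor_pairs[order(-abs(cor_pairs$r)), ], 5)       # five strongest pairs

A listing like this identifies the strongest pairs but gives little feel for the overall structure of the matrix; that's where a graphical display earns its keep.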
Corrgrams are a relatively recent tool for visualizing the data in correlation matrices. It’s easier to explain a corrgram once you’ve seen one. Consider the correlations among the variables in the mtcars data frame. Here you have 11 variables, each measuring some aspect of 32 automobiles. You can get the correlations using the following code: > options(digits=2) > cor(mtcars) mpg cyl disp hp drat wt qsec vs am gear carb mpg 1.00 -0.85 -0.85 -0.78 0.681 -0.87 0.419 0.66 0.600 0.48 -0.551 cyl -0.85 1.00 0.90 0.83 -0.700 0.78 -0.591 -0.81 -0.523 -0.49 0.527 www.it-ebooks.info 272 CHAPTER 11 disp hp drat wt qsec vs am gear carb -0.85 -0.78 0.68 -0.87 0.42 0.66 0.60 0.48 -0.55 0.90 0.83 -0.70 0.78 -0.59 -0.81 -0.52 -0.49 0.53 1.00 0.79 -0.71 0.89 -0.43 -0.71 -0.59 -0.56 0.39 0.79 1.00 -0.45 0.66 -0.71 -0.72 -0.24 -0.13 0.75 Intermediate graphs -0.710 -0.449 1.000 -0.712 0.091 0.440 0.713 0.700 -0.091 0.89 0.66 -0.71 1.00 -0.17 -0.55 -0.69 -0.58 0.43 -0.434 -0.708 0.091 -0.175 1.000 0.745 -0.230 -0.213 -0.656 -0.71 -0.72 0.44 -0.55 0.74 1.00 0.17 0.21 -0.57 -0.591 -0.243 0.713 -0.692 -0.230 0.168 1.000 0.794 0.058 -0.56 0.395 -0.13 0.750 0.70 -0.091 -0.58 0.428 -0.21 -0.656 0.21 -0.570 0.79 0.058 1.00 0.274 0.27 1.000 Which variables are most related? Which variables are relatively independent? Are there any patterns? It isn’t that easy to tell from the correlation matrix without significant time and effort (and probably a set of colored pens to make notations). You can display that same correlation matrix using the corrgram() function in the corrgram package (see figure 11.17). The code is library(corrgram) corrgram(mtcars, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main="Corrgram of mtcars intercorrelations") Corrgram of mtcars intercorrelations gear am drat mpg vs qsec wt disp cyl hp carb Figure 11.17 Corrgram of the correlations among the variables in the mtcars data frame. Rows and columns have been reordered using principal components analysis. www.it-ebooks.info 273 Corrgrams To interpret this graph, start with the lower triangle of cells (the cells below the principal diagonal). By default, a blue color and hashing that goes from lower left to upper right represent a positive correlation between the two variables that meet at that cell. Conversely, a red color and hashing that goes from the upper left to lower right represent a negative correlation. The darker and more saturated the color, the greater the magnitude of the correlation. Weak correlations, near zero, appear washed out. In the current graph, the rows and columns have been reordered (using principal components analysis) to cluster variables together that have similar correlation patterns. You can see from the shaded cells that gear, am, drat, and mpg are positively correlated with one another. You can also see that wt, disp, cyl, hp, and carb are positively correlated with one another. But the first group of variables is negatively correlated with the second group of variables. You can also see that the correlation between carb and am is weak, as is the correlation between vs and gear, vs and am, and drat and qsec. The upper triangle of cells displays the same information using pies. Here, color plays the same role, but the strength of the correlation is displayed by the size of the filled pie slice. Positive correlations fill the pie starting at 12 o’clock and moving in a clockwise direction. Negative correlations fill the pie by moving in a counterclockwise direction. 
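You can spot-check the weak relationships mentioned above numerically; the following line (my own check, not part of the example) pulls out the four correlations in question:

# Correlations for the pairs described as weak in figure 11.17
round(with(mtcars, c(carb_am   = cor(carb, am),
                     vs_gear   = cor(vs, gear),
                     vs_am     = cor(vs, am),
                     drat_qsec = cor(drat, qsec))), 2)

All four are small in absolute value, which is why the corresponding cells look washed out.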
The format of the corrgram() function is corrgram(x, order=, panel=, text.panel=, diag.panel=) where x is a data frame with one observation per row. When order=TRUE, the variables are reordered using a principal component analysis of the correlation matrix. Reordering can help make patterns of bivariate relationships more obvious. The option panel specifies the type of off-diagonal panels to use. Alternatively, you can use the options lower.panel and upper.panel to choose different options below and above the main diagonal. The text.panel and diag.panel options refer to the main diagonal. Allowable values for panel are described in table 11.2. Table 11.2 Panel options for the corrgram() function Placement Off diagonal Main diagonal Panel Option Description panel.pie The filled portion of the pie indicates the magnitude of the correlation. panel.shade The depth of the shading indicates the magnitude of the correlation. panel.ellipse Plots a confidence ellipse and smoothed line. panel.pts Plots a scatter plot. panel.conf Prints correlations and their confidence intervals. panel.txt Prints the variable name. panel.minmax Prints the minimum and maximum value and variable name. panel.density Prints the kernel density plot and variable name. www.it-ebooks.info 274 CHAPTER 11 Intermediate graphs Corrgram of mtcars data using scatter plots and ellipses 5 gear 3 1 am 0 4.93 drat 2.76 33.9 mpg 10.4 1 vs 0 22.9 qsec 14.5 5.42 wt 1.51 472 disp 71.1 8 cyl 4 335 hp 52 8 carb 1 Figure 11.18 Corrgram of the correlations among the variables in the mtcars data frame. The lower triangle contains smoothed bestfit lines and confidence ellipses, and the upper triangle contains scatter plots. The diagonal panel contains minimum and maximum values. Rows and columns have been reordered using principal components analysis. Let’s try a second example. The code library(corrgram) corrgram(mtcars, order=TRUE, lower.panel=panel.ellipse, upper.panel=panel.pts, text.panel=panel.txt, diag.panel=panel.minmax, main="Corrgram of mtcars data using scatter plots and ellipses") produces the graph in figure 11.18. Here you’re using smoothed fit lines and confidence ellipses in the lower triangle and scatter plots in the upper triangle. Why do the scatter plots look odd? Several of the variables that are plotted in figure 11.18 have limited allowable values. For example, the number of gears is 3, 4, or 5. The number of cylinders is 4, 6, or 8. Both am (transmission type) and vs (V/S) are dichotomous. This explains the oddlooking scatter plots in the upper diagonal. Always be careful that the statistical methods you choose are appropriate to the form of the data. Specifying these variables as ordered or unordered factors can serve as a useful check. When R knows that a variable is categorical or ordinal, it attempts to apply statistical methods that are appropriate to that level of measurement. www.it-ebooks.info 275 Corrgrams We’ll finish with one more example. The code library(corrgram) corrgram(mtcars, lower.panel=panel.shade, upper.panel=NULL, text.panel=panel.txt, main="Car Mileage Data (unsorted)") produces the graph in figure 11.19. Here you’re using shading in the lower triangle, keeping the original variable order, and leaving the upper triangle blank. Before moving on, I should point out that you can control the colors used by the corrgram() function. To do so, specify four colors in the colorRampPalette() function, and include the results using the col.regions option. 
Here’s an example: library(corrgram) cols <- colorRampPalette(c("darkgoldenrod4", "burlywood1", "darkkhaki", "darkgreen")) corrgram(mtcars, order=TRUE, col.regions=cols, lower.panel=panel.shade, upper.panel=panel.conf, text.panel=panel.txt, main="A Corrgram (or Horse) of a Different Color") Try it and see what you get. Car Mileage Data (unsorted) mpg cyl disp hp drat wt qsec vs am gear carb Figure 11.19 Corrgram of the correlations among the variables in the mtcars data frame. The lower triangle is shaded to represent the magnitude and direction of the correlations. The variables are plotted in their original order. www.it-ebooks.info 276 CHAPTER 11 Intermediate graphs Corrgrams can be a useful way to examine large numbers of bivariate relationships among quantitative variables. Because they’re relatively new, the greatest challenge is to educate the recipient on how to interpret them. To learn more, see Michael Friendly’s article “Corrgrams: Exploratory Displays for Correlation Matrices,” available at www.math.yorku.ca/SCS/Papers/corrgram.pdf. 11.4 Mosaic plots Up to this point, we’ve been exploring methods of visualizing relationships among quantitative/continuous variables. But what if your variables are categorical? When you’re looking at a single categorical variable, you can use a bar or pie chart. If there are two categorical variables, you can look at a 3D bar chart (which, by the way, is not easy to do in R). But what do you do if there are more than two categorical variables? One approach is to use mosaic plots. In a mosaic plot, the frequencies in a multidimensional contingency table are represented by nested rectangular regions that are proportional to their cell frequency. Color and/or shading can be used to represent residuals from a fitted model. For details, see Meyer, Zeileis, and Hornick (2006), or Michael Friendly’s excellent tutorial (http://mng.bz/3p0d). Mosaic plots can be created with the mosaic() function from the vcd library (there’s a mosaicplot() function in the basic installation of R, but I recommend you use the vcd package for its more extensive features). As an example, consider the Titanic dataset available in the base installation. It describes the number of passengers who survived or died, cross-classified by their class (1st, 2nd, 3rd, Crew), sex (Male, Female), and age (Child, Adult). This is a well-studied dataset. You can see the crossclassification using the following code: > ftable(Titanic) Survived Class Sex 1st Male Female 2nd Male Female 3rd Male Female Crew Male Female Age Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult Child Adult No Yes 0 5 118 57 0 1 4 140 0 11 154 14 0 13 13 80 35 13 387 75 17 14 89 76 0 0 670 192 0 0 3 20 The mosaic() function can be invoked as mosaic(table) www.it-ebooks.info 277 Mosaic plots where table is a contingency table in array form, or mosaic(formula, data=) where formula is a standard R formula, and data specifies either a data frame or a table. Adding the option shade=TRUE colors the figure based on Pearson residuals from a fitted model (independence by default), and the option legend=TRUE displays a legend for these residuals. For example, both library(vcd) mosaic(Titanic, shade=TRUE, legend=TRUE) and library(vcd) mosaic(~Class+Sex+Age+Survived, data=Titanic, shade=TRUE, legend=TRUE) will produce the graph shown in figure 11.20. The formula version gives you greater control over the selection and placement of variables in the graph. 
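As an illustrative variation of my own (not shown in the text), you could collapse over Age and examine class, sex, and survival alone, or reorder the terms to change how the rectangles are nested:

library(vcd)
tab <- margin.table(Titanic, c(1, 2, 4))             # keep Class, Sex, Survived
mosaic(tab, shade=TRUE, legend=TRUE)                 # collapsed over Age
mosaic(~ Survived + Class + Sex + Age,               # same variables, new nesting
       data=Titanic, shade=TRUE, legend=TRUE)

Reordering the variables changes which splits appear first in the plot and can make particular comparisons easier to see.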
Sex Adult Child Female Pearson residuals: 26 Adult Age Class 3rd Child 2nd AdultChild 1st Male Child 4 2 0 −2 Adult Crew −4 −11 p−value = <2e−16 No Yes No Survived Figure 11.20 Mosaic plot describing Titanic survivors by class, sex, and age www.it-ebooks.info 278 CHAPTER 11 Intermediate graphs A great deal of information is packed into this one picture. For example, as a person moves from crew to first class, the survival rate increases precipitously. Most children were in third and second class. Most females in first class survived, whereas only about half the females in third class survived. There were few females in the crew, causing the Survived labels (No, Yes at the bottom of the chart) to overlap for this group. Keep looking, and you’ll see many more interesting facts. Remember to look at the relative widths and heights of the rectangles. What else can you learn about that night? Extended mosaic plots add color and shading to represent the residuals from a fitted model. In this example, the blue shading indicates cross-classifications that occur more often than expected, assuming that survival is unrelated to class, gender, and age. Red shading indicates cross-classifications that occur less often than expected under the independence model. Be sure to run the example so that you can see the results in color. The graph indicates that more first-class women survived and more male crew members died than would be expected under an independence model. Fewer third-class men survived than would be expected if survival was independent of class, gender, and age. If you’d like to explore mosaic plots in greater detail, try running example(mosaic). 11.5 Summary In this chapter, we considered a wide range of techniques for displaying relationships among two or more variables. These included the use of 2D and 3D scatter plots, scatter-plot matrices, bubble plots, line plots, corrgrams, and mosaic plots. Some of these methods are standard techniques, where others are relatively new. Taken together with methods that allow you to customize graphs (chapter 3), display univariate distributions (chapter 6), explore regression models (chapter 8), and visualize group differences (chapter 9), you now have a comprehensive toolbox for visualizing and extracting meaning from your data. In later chapters, you’ll expand your skills with additional specialized techniques, including graphics for latent variable models (chapter 14), time series (chapter 15), clustered data (chapter 16), and techniques for creating graphs that are conditioned on one or more variables (chapter 18). In the next chapter, we’ll explore resampling statistics and bootstrapping. These are computer-intensive methods that allow you to analyze data in new and unique ways. www.it-ebooks.info Resampling statistics and bootstrapping This chapter covers ■ Understanding the logic of permutation tests ■ Applying permutation tests to linear models ■ Using bootstrapping to obtain confidence intervals In chapters 7, 8, and 9, we reviewed statistical methods that test hypotheses and estimate confidence intervals for population parameters by assuming that the observed data is sampled from a normal distribution or some other well-known theoretical distribution. But there will be many cases in which this assumption is unwarranted. 
Statistical approaches based on randomization and resampling can be used in cases where the data is sampled from unknown or mixed distributions, where sample sizes are small, where outliers are a problem, or where devising an appropriate test based on a theoretical distribution is too complex and mathematically intractable. In this chapter, we’ll explore two broad statistical approaches that use randomization: permutation tests and bootstrapping. Historically, these methods were only available to experienced programmers and expert statisticians. Contributed packages in R now make them readily available to a wider audience of data analysts. 279 www.it-ebooks.info 280 CHAPTER 12 Resampling statistics and bootstrapping We’ll also revisit problems that were initially analyzed using traditional methods (for example, t-tests, chi-square tests, ANOVA, and regression) and see how they can be approached using these robust, computer-intensive methods. To get the most out of section 12.2, be sure to read chapter 7 first. Chapters 8 and 9 serve as prerequisites for section 12.3. Other sections can be read on their own. 12.1 Permutation tests Treatment Permutation tests, also called randomization or re-randomization tests, have been around for decades, but it took the advent of high-speed computers to make them practically available. To understand the logic of a permutation test, consider the following hypothetical problem. Ten subjects have been randomly assigned to one of two treatment conditions (A or B), and an outcome variable (score) has been recorded. The results of the experiment are presented in table 12.1. The data are also displayed in the strip Table 12.1 Hypothetical two-group problem chart in figure 12.1. Is there enough evidence to conclude that the treatments difTreatment A Treatment B fer in their impact? 40 57 In a parametric approach, you might 57 64 assume that the data are sampled from normal populations with equal variances 45 55 and apply a two-tailed independent-groups 55 62 t-test. The null hypothesis is that the popu58 65 lation mean for Treatment A is equal to the population mean for Treatment B. You’d calculate a t-statistic from the data and compare it to the theoretical distribution. If the observed t-statistic is sufficiently extreme, say outside the middle 95% of values in the theoretical distribution, you’d reject the null hypothesis and declare that the population means for the two groups are unequal at the 0.05 level of significance. A permutation test takes a different approach. If the two treatments are truly equivalent, the label (Treatment A or Treatment B) assigned to an observed score is B A 40 45 50 55 60 score Figure 12.1 Strip chart of the hypothetical treatment data in table 12.1 www.it-ebooks.info 65 Permutation tests 281 arbitrary. To test for differences between the two treatments, you could follow these steps: 1 2 3 4 5 6 7 Calculate the observed t-statistic, as in the parametric approach; call this t0. Place all 10 scores in a single group. Randomly assign five scores to Treatment A and five scores to Treatment B. Calculate and record the new observed t-statistic. Repeat steps 3–4 for every possible way of assigning five scores to Treatment A and five scores to Treatment B. There are 252 such possible arrangements. Arrange the 252 t-statistics in ascending order. This is the empirical distribution, based on (or conditioned on) the sample data. 
If t0 falls outside the middle 95% of the empirical distribution, reject the null hypothesis that the population means for the two treatment groups are equal at the 0.05 level of significance. Notice that the same t-statistic is calculated in both the permutation and parametric approaches. But instead of comparing the statistic to a theoretical distribution in order to determine if it was extreme enough to reject the null hypothesis, it’s compared to an empirical distribution created from permutations of the observed data. This logic can be extended to most classical statistical tests and linear models. In the previous example, the empirical distribution was based on all possible permutations of the data. In such cases, the permutation test is called an exact test. As the sample sizes increase, the time required to form all possible permutations can become prohibitive. In such cases, you can use Monte Carlo simulation to sample from all possible permutations. Doing so provides an approximate test. If you’re uncomfortable assuming that the data is normally distributed, concerned about the impact of outliers, or feel that the dataset is too small for standard parametric approaches, a permutation test provides an excellent alternative. R has some of the most comprehensive and sophisticated packages for performing permutation tests currently available. The remainder of this section focuses on two contributed packages: the coin package and the lmPerm package. The coin package provides a comprehensive framework for permutation tests applied to independence problems, whereas the lmPerm package provides permutation tests for ANOVA and regression designs. We’ll consider each in turn and end the section with a quick review of other permutation packages available in R. To install the coin package, use install.packages(c("coin") Sadly, Bob Wheeler, the author of the lmPerm package, passed away in 2012, and the source code has been moved into the CRAN archive for unsupported packages. Therefore, installation of the package is a bit more complicated than usual: 1 Download the file lmPerm_1.1-2.tar.gz from http://cran.r-project.org/src/ contrib/Archive/lmPerm/, and save it on your hard drive. www.it-ebooks.info 282 CHAPTER 12 2 3 Resampling statistics and bootstrapping MS Windows users: install RTools from http://cran.r-project.org/bin/ windows/Rtools/. Mac and Linux users can skip this step. Execute the function install.packages(file.choose(), repos=NULL, type="source") from within R. When a dialog box pops up, find and choose the lmPerm_1.1-2.tar.gz file. This will install the package on your machine. Setting the random number seed Before moving on, it’s important to remember that permutation tests use pseudorandom numbers to sample from all possible permutations (when performing an approximate test). Therefore, the results will change each time the test is performed. Setting the random-number seed in R allows you to fix the random numbers generated. This is particularly useful when you want to share your examples with others, because results will always be the same if the calls are made with the same seed. Setting the random number seed to 1234 (that is, set.seed(1234)) will allow you to replicate the results presented in this chapter. 12.2 Permutation tests with the coin package The coin package provides a general framework for applying permutation tests to independence problems. With this package, you can answer such questions as ■ ■ ■ Are responses independent of group assignment? 
Are two numeric variables independent? Are two categorical variables independent? Using convenience functions provided in the package (see table 12.2), you can perform permutation test equivalents for most of the traditional statistical tests covered in chapter 7. Table 12.2 coin functions providing permutation test alternatives to traditional tests coin function Test Two- and K-sample permutation test oneway_test( y ~ A) Wilcoxon–Mann–Whitney rank-sum test wilcox_test( y ~ A ) Kruskal–Wallis test kruskal_test( y ~ A ) Pearson’s chi-square test chisq_test( A ~ B ) Cochran–Mantel–Haenszel test cmh_test( A ~ B | C ) Linear-by-linear association test lbl_test( D ~ E) Spearman’s test spearman_test( y ~ x ) www.it-ebooks.info Permutation tests with the coin package Table 12.2 283 coin functions providing permutation test alternatives to traditional tests coin function Test Friedman test friedman_test( y ~ A | C ) Wilcoxon signed-rank test wilcoxsign_test( y1 ~ y2 ) In the coin function column, y and x are numeric variables, A and B are categorical factors, C is a categorical blocking variable, D and E are ordered factors, and y1 and y2 are matched numeric variables. Each of the functions listed in table 12.2 takes the form function_name( formula, data, distribution= ) where ■ ■ ■ formula describes the relationship among variables to be tested. Examples are given in the table. data identifies a data frame. distribution specifies how the empirical distribution under the null hypothesis should be derived. Possible values are exact, asymptotic, and approximate. If distribution="exact", the distribution under the null hypothesis is computed exactly (that is, from all possible permutations). The distribution can also be approximated by its asymptotic distribution (distribution="asymptotic") or via Monte Carlo resampling (distribution="approximate(B=#)"), where # indicates the number of replications used to approximate the exact distribution. At present, distribution="exact" is only available for two-sample problems. In the coin package, categorical variables and ordinal variables must be coded as factors and ordered factors, respectively. Additionally, the data must be stored in a data frame. NOTE In the remainder of this section, you’ll apply several of the permutation tests described in table 12.2 to problems from previous chapters. This will allow you to compare the results to more traditional parametric and nonparametric approaches. We’ll end this discussion of the coin package by considering advanced extensions. 12.2.1 Independent two-sample and k-sample tests To begin, let’s compare an independent samples t-test with a one-way exact test applied to the hypothetical data in table 12.2. The results are given in the following listing. Listing 12.1 t-test vs. 
one-way permutation test for the hypothetical data > > > > > library(coin) score <- c(40, 57, 45, 55, 58, 57, 64, 55, 62, 65) treatment <- factor(c(rep("A",5), rep("B",5))) mydata <- data.frame(treatment, score) t.test(score~treatment, data=mydata, var.equal=TRUE) www.it-ebooks.info 284 CHAPTER 12 Resampling statistics and bootstrapping Two Sample t-test data: score by treatment t = -2.3, df = 8, p-value = 0.04705 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -19.04 -0.16 sample estimates: mean in group A mean in group B 51 61 > oneway_test(score~treatment, data=mydata, distribution="exact") Exact 2-Sample Permutation Test data: score by treatment (A, B) Z = -1.9, p-value = 0.07143 alternative hypothesis: true mu is not equal to 0 The traditional t-test indicates a significant group difference (p < .05), whereas the exact test doesn’t (p > 0.072). With only 10 observations, l’d be more inclined to trust the results of the permutation test and attempt to collect more data before reaching a final conclusion. Next, consider the Wilcoxon–Mann–Whitney U test. In chapter 7, we examined the difference in the probability of imprisonment in Southern versus non-Southern US states using the wilcox.test() function. Using an exact Wilcoxon rank-sum test, you’d get > library(MASS) > UScrime <- transform(UScrime, So = factor(So)) > wilcox_test(Prob ~ So, data=UScrime, distribution="exact") Exact Wilcoxon Mann-Whitney Rank Sum Test data: Prob by So (0, 1) Z = -3.7, p-value = 8.488e-05 alternative hypothesis: true mu is not equal to 0 suggesting that incarceration is more likely in Southern states. Note that in the previous code, the numeric variable So was transformed into a factor. This is because the coin package requires that all categorical variables be coded as factors. Additionally, you may have noted that these results agree exactly with the results of the wilcox.test() function in chapter 7. This is because wilcox.test() also computes an exact distribution by default. Finally, consider a k-sample test. In chapter 9, you used a one-way ANOVA to evaluate the impact of five drug regimens on cholesterol reduction in a sample of 50 patients. An approximate k-sample permutation test can be performed instead, using this code: > library(multcomp) > set.seed(1234) > oneway_test(response~trt, data=cholesterol, distribution=approximate(B=9999)) www.it-ebooks.info Permutation tests with the coin package 285 Approximative K-Sample Permutation Test data: response by trt (1time, 2times, 4times, drugD, drugE) maxT = 4.7623, p-value < 2.2e-16 Here, the reference distribution is based on 9,999 permutations of the data. The random-number seed is set so that your results will be the same as mine. There’s clearly a difference in response among patients in the various groups. 12.2.2 Independence in contingency tables You can use permutation tests to assess the independence of two categorical variables using either the chisq_test() or cmh_test() function. The latter function is used when data is stratified on a third categorical variable. If both variables are ordinal, you can use the lbl_test() function to test for a linear trend. In chapter 7, you applied a chi-square test to assess the relationship between arthritis treatment and improvement. Treatment had two levels (Placebo and Treated), and Improved had three levels (None, Some, and Marked). The Improved variable was encoded as an ordered factor. 
If you want to perform a permutation version of the chi-square test, you can use the following code: > library(coin) > library(vcd) > Arthritis <- transform(Arthritis, Improved=as.factor(as.numeric(Improved))) > set.seed(1234) > chisq_test(Treatment~Improved, data=Arthritis, distribution=approximate(B=9999)) Approximative Pearson's Chi-Squared Test data: Treatment by Improved (1, 2, 3) chi-squared = 13.055, p-value = 0.0018 This gives you an approximate chi-square test based on 9,999 replications. You might ask why you transformed the variable Improved from an ordered factor to a categorical factor. (Good question!) If you’d left it an ordered factor, coin() would have generated a linear × linear trend test instead of a chi-square test. Although a trend test would be a good choice in this situation, keeping it a chi-square test allows you to compare the results with those reported in chapter 7. 12.2.3 Independence between numeric variables The spearman_test() function provides a permutation test of the independence of two numeric variables. In chapter 7, we examined the correlation between illiteracy rates and murder rates for US states. You can test the association via permutation, using the following code: > states <- as.data.frame(state.x77) > set.seed(1234) www.it-ebooks.info 286 CHAPTER 12 Resampling statistics and bootstrapping > spearman_test(Illiteracy~Murder, data=states, distribution=approximate(B=9999)) Approximative Spearman Correlation Test data: Illiteracy by Murder Z = 4.7065, p-value < 2.2e-16 alternative hypothesis: true mu is not equal to 0 Based on an approximate permutation test with 9,999 replications, the hypothesis of independence can be rejected. Note that state.x77 is a matrix. It had to be converted into a data frame for use in the coin package. 12.2.4 Dependent two-sample and k-sample tests Dependent sample tests are used when observations in different groups have been matched or when repeated measures are used. For permutation tests with two paired groups, the wilcoxsign_test() function can be used. For more than two groups, use the friedman_test() function. In chapter 7, we compared the unemployment rate for urban males age 14–24 (U1) with urban males age 35–39 (U2). Because the two variables are reported for each of the 50 US states, you have a two-dependent groups design (state is the matching variable). You can use an exact Wilcoxon signed-rank test to see if unemployment rates for the two age groups are equal: > library(coin) > library(MASS) > wilcoxsign_test(U1~U2, data=UScrime, distribution="exact") Exact Wilcoxon-Signed-Rank Test data: y by x (neg, pos) stratified by block Z = 5.9691, p-value = 1.421e-14 alternative hypothesis: true mu is not equal to 0 Based on the results, you’d conclude that the unemployment rates differ. 12.2.5 Going further The coin package provides a general framework for testing that one group of variables is independent of a second group of variables (with optional stratification on a blocking variable) against arbitrary alternatives, via approximate permutation tests. In particular, the independence_test() function lets you approach most traditional tests from a permutation perspective and create new and novel statistical tests for situations not covered by traditional methods. This flexibility comes at a price: a high level of statistical knowledge is required to use the function appropriately. See the vignettes that accompany the package (accessed via vignette("coin")) for further details. 
In the next section, you’ll learn about the lmPerm package. This package provides a permutation approach to linear models, including regression and analysis of variance. www.it-ebooks.info Permutation tests with the lmPerm package 287 12.3 Permutation tests with the lmPerm package The lmPerm package provides support for a permutation approach to linear models. In particular, the lmp() and aovp() functions are the lm() and aov() functions modified to perform permutation tests rather than normal theory tests. The parameters in the lmp() and aovp() functions are similar to those in the lm() and aov() functions, with the addition of a perm= parameter. The perm= option can take the value Exact, Prob, or SPR. Exact produces an exact test, based on all possible permutations. Prob samples from all possible permutations. Sampling continues until the estimated standard deviation falls below 0.1 of the estimated p-value. The stopping rule is controlled by an optional Ca parameter. Finally, SPR uses a sequential probability ratio test to decide when to stop sampling. Note that if the number of observations is greater than 10, perm="Exact" will automatically default to perm="Prob"; exact tests are only available for small problems. To see how this works, you’ll apply a permutation approach to simple regression, polynomial regression, multiple regression, one-way analysis of variance, one-way analysis of covariance, and a two-way factorial design. 12.3.1 Simple and polynomial regression In chapter 8, you used linear regression to study the relationship between weight and height for a group of 15 women. Using lmp() instead of lm() generates the permutation test results shown in the following listing. Listing 12.2 Permutation tests for simple linear regression > library(lmPerm) > set.seed(1234) > fit <- lmp(weight~height, data=women, perm="Prob") [1] "Settings: unique SS : numeric variables centered" > summary(fit) Call: lmp(formula = weight ~ height, data = women, perm = "Prob") Residuals: Min 1Q Median -1.733 -1.133 -0.383 3Q 0.742 Max 3.117 Coefficients: Estimate Iter Pr(Prob) height 3.45 5000 <2e-16 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1.5 on 13 degrees of freedom Multiple R-Squared: 0.991, Adjusted R-squared: 0.99 F-statistic: 1.43e+03 on 1 and 13 DF, p-value: 1.09e-14 To fit a quadratic equation, you could use the code in this next listing. www.it-ebooks.info 288 CHAPTER 12 Resampling statistics and bootstrapping Listing 12.3 Permutation tests for polynomial regression > library(lmPerm) > set.seed(1234) > fit <- lmp(weight~height + I(height^2), data=women, perm="Prob") [1] "Settings: unique SS : numeric variables centered" > summary(fit) Call: lmp(formula = weight ~ height + I(height^2), data = women, perm = "Prob") Residuals: Min 1Q Median -0.5094 -0.2961 -0.0094 3Q 0.2862 Max 0.5971 Coefficients: Estimate Iter Pr(Prob) height -7.3483 5000 <2e-16 *** I(height^2) 0.0831 5000 <2e-16 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 0.38 on 12 degrees of freedom Multiple R-Squared: 0.999, Adjusted R-squared: 0.999 F-statistic: 1.14e+04 on 2 and 12 DF, p-value: <2e-16 As you can see, it’s a simple matter to test these regressions using permutation tests and requires little change in the underlying code. The output is also similar to that produced by the lm() function. Note that an Iter column is added, indicating how many iterations were required to reach the stopping rule. 
12.3.2 Multiple regression In chapter 8, multiple regression was used to predict the murder rate based on population, illiteracy, income, and frost for 50 US states. Applying the lmp() function to this problem results in the following output. Listing 12.4 Permutation tests for multiple regression > > > > library(lmPerm) set.seed(1234) states <- as.data.frame(state.x77) fit <- lmp(Murder~Population + Illiteracy+Income+Frost, data=states, perm="Prob") [1] "Settings: unique SS : numeric variables centered" > summary(fit) Call: lmp(formula = Murder ~ Population + Illiteracy + Income + Frost, data = states, perm = "Prob") Residuals: Min 1Q Median -4.79597 -1.64946 -0.08112 3Q 1.48150 Max 7.62104 www.it-ebooks.info Permutation tests with the lmPerm package 289 Coefficients: Estimate Iter Pr(Prob) Population 2.237e-04 51 1.0000 Illiteracy 4.143e+00 5000 0.0004 *** Income 6.442e-05 51 1.0000 Frost 5.813e-04 51 0.8627 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '. ' 0.1 ' ' 1 Residual standard error: 2.535 on 45 degrees of freedom Multiple R-Squared: 0.567, Adjusted R-squared: 0.5285 F-statistic: 14.73 on 4 and 45 DF, p-value: 9.133e-08 Looking back to chapter 8, both Population and Illiteracy are significant (p < 0.05) when normal theory is used. Based on the permutation tests, the Population variable is no longer significant. When the two approaches don’t agree, you should look at your data more carefully. It may be that the assumption of normality is untenable or that outliers are present. 12.3.3 One-way ANOVA and ANCOVA Each of the analysis of variance designs discussed in chapter 9 can be performed via permutation tests. First, let’s look at the one-way ANOVA problem considered in section 9.1 on the impact of treatment regimens on cholesterol reduction. The code and results are given in the next listing. Listing 12.5 Permutation test for one-way ANOVA > library(lmPerm) > library(multcomp) > set.seed(1234) > fit <- aovp(response~trt, data=cholesterol, perm="Prob") [1] "Settings: unique SS " > anova(fit) Component 1 : Df R Sum Sq R Mean Sq Iter Pr(Prob) trt 4 1351.37 337.84 5000 < 2.2e-16 *** Residuals 45 468.75 10.42 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '. ' 0.1 ' ' 1 The results suggest that the treatment effects are not all equal. This second example in this section applies a permutation test to a one-way analysis of covariance. The problem is from chapter 9, where you investigated the impact of four drug doses on the litter weights of rats, controlling for gestation times. The next listing shows the permutation test and results. Listing 12.6 Permutation test for one-way ANCOVA > library(lmPerm) > set.seed(1234) > fit <- aovp(weight ~ gesttime + dose, data=litter, perm="Prob") www.it-ebooks.info 290 CHAPTER 12 Resampling statistics and bootstrapping [1] "Settings: unique SS : numeric variables centered" > anova(fit) Component 1 : Df R Sum Sq R Mean Sq Iter Pr(Prob) gesttime 1 161.49 161.493 5000 0.0006 *** dose 3 137.12 45.708 5000 0.0392 * Residuals 69 1151.27 16.685 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Based on the p-values, the four drug doses don’t equally impact litter weights, controlling for gestation time. 12.3.4 Two-way ANOVA You’ll end this section by applying permutation tests to a factorial design. In chapter 9, you examined the impact of vitamin C on the tooth growth in guinea pigs. The two manipulated factors were dose (three levels) and delivery method (two levels). 
Ten guinea pigs were placed in each treatment combination, resulting in a balanced 3 × 2 factorial design. The permutation tests are provided in the next listing. Listing 12.7 Permutation test for two-way ANOVA > library(lmPerm) > set.seed(1234) > fit <- aovp(len~supp*dose, data=ToothGrowth, perm="Prob") [1] "Settings: unique SS : numeric variables centered" > anova(fit) Component 1 : Df R Sum Sq R Mean Sq Iter Pr(Prob) supp 1 205.35 205.35 5000 < 2e-16 *** dose 1 2224.30 2224.30 5000 < 2e-16 *** supp:dose 1 88.92 88.92 2032 0.04724 * Residuals 56 933.63 16.67 --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 At the .05 level of significance, all three effects are statistically different from zero. At the .01 level, only the main effects are significant. It’s important to note that when aovp() is applied to ANOVA designs, it defaults to unique sums of squares (also called SAS Type III sums of squares). Each effect is adjusted for every other effect. The default for parametric ANOVA designs in R is sequential sums of squares (SAS Type I sums of squares). Each effect is adjusted for those that appear earlier in the model. For balanced designs, the two approaches will agree, but for unbalanced designs with unequal numbers of observations per cell, they won’t. The greater the imbalance, the greater the disagreement. If desired, specifying seqs=TRUE in the aovp() function will produce sequential sums of squares. For more on Type I and Type III sums of squares, see section 9.2. www.it-ebooks.info Bootstrapping 291 12.4 Additional comments on permutation tests R offers other permutation packages besides coin and lmPerm. The perm package provides some of the same functionality provided by the coin package and can act as an independent validation of that package. The corrperm package provides permutation tests of correlations with repeated measures. The logregperm package offers a permutation test for logistic regression. Perhaps most important, the glmperm package extends permutation tests to generalized linear models. Generalized linear models are described in the next chapter. Permutation tests provide a powerful alternative to tests that rely on a knowledge of the underlying sampling distribution. In each of the permutation tests described, you were able to test statistical hypotheses without recourse to the normal, t, F, or chisquare distributions. You may have noticed how closely the results of the tests based on normal theory agreed with the results of the permutation approach in previous sections. The data in these problems were well behaved, and the agreement between methods is a testament to how well normal-theory methods work in such cases. Permutation tests really shine in cases where the data are clearly non-normal (for example, highly skewed), outliers are present, samples sizes are small, or no parametric tests exist. But if the original sample is a poor representation of the population of interest, no test, including permutation tests, will improve the inferences generated. Permutation tests are primarily useful for generating p-values that can be used to test null hypotheses. They can help answer the question, “Does an effect exist?” It’s more difficult to use permutation methods to obtain confidence intervals and estimates of measurement precision. Fortunately, this is an area in which bootstrapping excels. 
12.5 Bootstrapping Bootstrapping generates an empirical distribution of a test statistic or set of test statistics by repeated random sampling with replacement from the original sample. It allows you to generate confidence intervals and test statistical hypotheses without having to assume a specific underlying theoretical distribution. It’s easiest to demonstrate the logic of bootstrapping with an example. Say that you want to calculate the 95% confidence interval for a sample mean. Your sample has 10 observations, a sample mean of 40, and a sample standard deviation of 5. If you’re willing to assume that the sampling distribution of the mean is normally distributed, the (1-α/2 )% confidence interval can be calculated using X −t s s < < X +t n n where t is the upper 1-α/2 critical value for a t distribution with n – 1 degrees of freedom. For a 95% confidence interval, you have 40 – 2.262(5/3.163) < µ < 40 + 2.262 (5/3.162) or 36.424 < µ < 43.577. You’d expect 95% of confidence intervals created in this way to surround the true population mean. www.it-ebooks.info 292 CHAPTER 12 Resampling statistics and bootstrapping But what if you aren’t willing to assume that the sampling distribution of the mean is normally distributed? You can use a bootstrapping approach instead: 1 2 3 4 5 Randomly select 10 observations from the sample, with replacement after each selection. Some observations may be selected more than once, and some may not be selected at all. Calculate and record the sample mean. Repeat the first two steps 1,000 times. Order the 1,000 sample means from smallest to largest. Find the sample means representing the 2.5th and 97.5th percentiles. In this case, it’s the 25th number from the bottom and top. These are your 95% confidence limits. In the present case, where the sample mean is likely to be normally distributed, you gain little from the bootstrap approach. Yet there are many cases where the bootstrap approach is advantageous. What if you wanted confidence intervals for the sample median, or the difference between two sample medians? There are no simple normaltheory formulas here, and bootstrapping is the approach of choice. If the underlying distributions are unknown, if outliers are a problem, if sample sizes are small, or if parametric approaches don’t exist, bootstrapping can often provide a useful method of generating confidence intervals and testing hypotheses. 12.6 Bootstrapping with the boot package The boot package provides extensive facilities for bootstrapping and related resampling methods. You can bootstrap a single statistic (for example, a median) or a vector of statistics (for example, a set of regression coefficients). Be sure to download and install the boot package before first use: install.packages("boot") The bootstrapping process will seem complicated, but once you review the examples it should make sense. In general, bootstrapping involves three main steps: 1 2 3 Write a function that returns the statistic or statistics of interest. If there is a single statistic (for example, a median), the function should return a number. If there is a set of statistics (for example, a set of regression coefficients), the function should return a vector. Process this function through the boot() function in order to generate R bootstrap replications of the statistic(s). Use the boot.ci() function to obtain confidence intervals for the statistic(s) generated in step 2. Now to the specifics. The main bootstrapping function is boot(). 
It has the format bootobject <- boot(data=, statistic=, R=, ...) www.it-ebooks.info Bootstrapping with the boot package 293 The parameters are described in table 12.3. Table 12.3 Parameters of the boot() function Parameter Description data A vector, matrix, or data frame. statistic A function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic). The function should include an indices parameter that the boot() function can use to select cases for each replication (see the examples in the text). R Number of bootstrap replicates. ... Additional parameters to be passed to the function that produces the statistic of interest. The boot() function calls the statistic function R times. Each time, it generates a set of random indices, with replacement, from the integers 1:nrow(data). These indices are used in the statistic function to select a sample. The statistics are calculated on the sample, and the results are accumulated in bootobject. The bootobject structure is described in table 12.4. Table 12.4 Elements of the object returned by the boot() function Element Description t0 The observed values of k statistics applied to the original data t An R × k matrix, where each row is a bootstrap replicate of the k statistics You can access these elements as bootobject$t0 and bootobject$t. Once you generate the bootstrap samples, you can use print() and plot() to examine the results. If the results look reasonable, you can use the boot.ci() function to obtain confidence intervals for the statistic(s). The format is boot.ci(bootobject, conf=, type= ) The parameters are given in table 12.5. Table 12.5 Parameters of the boot.ci() function Parameter Description bootobject The object returned by the boot() function. conf The desired confidence interval (default: conf=0.95). type The type of confidence interval returned. Possible values are norm, basic, stud, perc, bca, and all (default: type="all") www.it-ebooks.info 294 CHAPTER 12 Resampling statistics and bootstrapping The type parameter specifies the method for obtaining the confidence limits. The perc method (percentile) was demonstrated in the sample mean example. bca provides an interval that makes simple adjustments for bias. I find bca preferable in most circumstances. See Mooney and Duval (1993) for an introduction to these methods. In the remaining sections, we’ll look at bootstrapping a single statistic and a vector of statistics. 12.6.1 Bootstrapping a single statistic The mtcars dataset contains information on 32 automobiles reported in the 1974 Motor Trend magazine. Suppose you’re using multiple regression to predict miles per gallon from a car’s weight (lb/1,000) and engine displacement (cu. in.). In addition to the standard regression statistics, you’d like to obtain a 95% confidence interval for the R-squared value (the percent of variance in the response variable explained by the predictors). The confidence interval can be obtained using nonparametric bootstrapping. The first task is to write a function for obtaining the R-squared value: rsq <- function(formula, data, indices) { d <- data[indices,] fit <- lm(formula, data=d) return(summary(fit)$r.square) } The function returns the R-squared value from a regression. The d <- data[indices,] statement is required for boot() to be able to select samples. 
You can then draw a large number of bootstrap replications (say, 1,000) with the following code: library(boot) set.seed(1234) results <- boot(data=mtcars, statistic=rsq, R=1000, formula=mpg~wt+disp) The boot object can be printed using > print(results) ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp) Bootstrap Statistics : original bias t1* 0.7809306 0.01333670 std. error 0.05068926 and plotted using plot(results). The resulting graph is shown in figure 12.2. www.it-ebooks.info 295 Bootstrapping with the boot package 0 0.60 0.65 2 0.70 0.75 t* 4 Density 0.80 6 0.85 8 0.90 Histogram of t 0.6 0.7 0.8 t* Figure 12.2 0.9 −3 −2 −1 0 1 2 3 Quantiles of Standard Normal Distribution of bootstrapped R-squared values In figure 12.2, you can see that the distribution of bootstrapped R-squared values isn’t normally distributed. A 95% confidence interval for the R-squared values can be obtained using > boot.ci(results, type=c("perc", "bca")) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 1000 bootstrap replicates CALL : boot.ci(boot.out = results, type = c("perc", "bca")) Intervals : Level Percentile BCa 95% ( 0.6838, 0.8833 ) ( 0.6344, 0.8549 ) Calculations and Intervals on Original Scale Some BCa intervals may be unstable You can see from this example that different approaches to generating the confidence intervals can lead to different intervals. In this case, the bias-adjusted interval is moderately different from the percentile method. In either case, the null hypothesis H0: R-square = 0 would be rejected, because zero is outside the confidence limits. In this section, you estimated the confidence limits of a single statistic. In the next section, you’ll estimate confidence intervals for several statistics. www.it-ebooks.info 296 CHAPTER 12 Resampling statistics and bootstrapping 12.6.2 Bootstrapping several statistics In the previous example, bootstrapping was used to estimate the confidence interval for a single statistic (R-squared). Continuing the example, let’s obtain the 95% confidence intervals for a vector of statistics. Specifically, let’s get confidence intervals for the three model regression coefficients (intercept, car weight, and engine displacement). First, create a function that returns the vector of regression coefficients: bs <- function(formula, data, indices) { d <- data[indices,] fit <- lm(formula, data=d) return(coef(fit)) } Then use this function to bootstrap 1,000 replications: library(boot) set.seed(1234) results <- boot(data=mtcars, statistic=bs, R=1000, formula=mpg~wt+disp) > print(results) ORDINARY NONPARAMETRIC BOOTSTRAP Call: boot(data = mtcars, statistic = bs, R = 1000, formula = mpg ~ wt + disp) Bootstrap Statistics : original bias std. error t1* 34.9606 0.137873 2.48576 t2* -3.3508 -0.053904 1.17043 t3* -0.0177 -0.000121 0.00879 When bootstrapping multiple statistics, add an index parameter to the plot() and boot.ci() functions to indicate which column of bootobject$t to analyze. In this example, index 1 refers to the intercept, index 2 is car weight, and index 3 is the engine displacement. To plot the results for car weight, use plot(results, index=2) The graph is given in figure 12.3. 
To get the 95% confidence intervals for car weight and engine displacement, use > boot.ci(results, type="bca", index=2) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 1000 bootstrap replicates CALL : boot.ci(boot.out = results, type = "bca", index = 2) Intervals : Level BCa 95% (-5.66, -1.19 ) Calculations and Intervals on Original Scale www.it-ebooks.info 297 Bootstrapping with the boot package −3 0.0 −6 −5 0.1 −4 t* 0.2 Density −2 0.3 −1 0 0.4 Histogram of t −6 −4 −2 t* Figure 12.3 0 −3 −2 −1 0 1 2 3 Quantiles of Standard Normal Distribution of bootstrapping regression coefficients for car weight > boot.ci(results, type="bca", index=3) BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS Based on 1000 bootstrap replicates CALL : boot.ci(boot.out = results, type = "bca", index = 3) Intervals : Level BCa 95% (-0.0331, 0.0010 ) Calculations and Intervals on Original Scale NOTE The previous example resamples the entire sample of data each time. If you can assume that the predictor variables have fixed levels (typical in planned experiments), you’d do better to only resample residual terms. See Mooney and Duval (1993, pp. 16–17) for a simple explanation and algorithm. Before we leave bootstrapping, it’s worth addressing two questions that come up often: ■ ■ How large does the original sample need to be? How many replications are needed? www.it-ebooks.info 298 CHAPTER 12 Resampling statistics and bootstrapping There’s no simple answer to the first question. Some say that an original sample size of 20–30 is sufficient for good results, as long as the sample is representative of the population. Random sampling from the population of interest is the most trusted method for assuring the original sample’s representativeness. With regard to the second question, I find that 1,000 replications are more than adequate in most cases. Computer power is cheap, and you can always increase the number of replications if desired. There are many helpful sources of information about permutation tests and bootstrapping. An excellent starting place is an online article by Yu (2003). Good (2006) provides a comprehensive overview of resampling in general and includes R code. A good, accessible introduction to bootstrapping is provided by Mooney and Duval (1993). The definitive source on bootstrapping is Efron and Tibshirani (1998). Finally, there are a number of great online resources, including Simon (1997), Canty (2002), Shah (2005), and Fox (2002). 12.7 Summary This chapter introduced a set of computer-intensive methods based on randomization and resampling that allow you to test hypotheses and form confidence intervals without reference to a known theoretical distribution. They’re particularly valuable when your data comes from unknown population distributions, when there are serious outliers, when your sample sizes are small, and when there are no existing parametric methods to answer the hypotheses of interest. The methods in this chapter are particularly exciting because they provide an avenue for answering questions when your standard data assumptions are clearly untenable or when you have no other idea how to approach the problem. Permutation tests and bootstrapping aren’t panaceas, though. They can’t turn bad data into good data. If your original samples aren’t representative of the population of interest or are too small to accurately reflect it, then these techniques won’t help. In the next chapter, we’ll consider data models for variables that follow known, but not necessarily normal, distributions. 
www.it-ebooks.info Part 4 Advanced methods I n this part of the book, we’ll consider advanced methods of statistical analysis to round out your data analysis toolkit. The methods in this part play a key role in the growing field of data mining and predictive analytics. Chapter 13 expands on the regression methods in chapter 8 to cover parametric approaches to data that are not normally distributed. The chapter starts with a discussion of the generalized linear model and then focuses on cases where you’re trying to predict an outcome variable that is either categorical (logistic regression) or a count (Poisson regression). Dealing with a large number of variables can be challenging, due to the complexity inherent in multivariate data. Chapter 14 describes two popular methods for exploring and simplifying multivariate data. Principal components analysis can be used to transform a large number of correlated variables into a smaller set of composite variables. Factor analysis consists of a set of techniques for uncovering the latent structure underlying a given set of variables. Chapter 14 provides step-by-step instructions for carrying out each. Chapter 15 explores time-dependent data. Analysts are frequently faced with the need to understand trends and predict future events. Chapter 15 provides a thorough introduction to the analysis of time-series data and forecasting. After describing the general characteristics of time series data, two of the most popular forecasting approaches (Exponential and ARIMA) are illustrated. Cluster analysis is the subject of chapter 16. While principal components and factor analysis simplify multivariate data by combining individual variables into composite variables, cluster analysis attempts to simplify multivariate data by combining individual observations into subgroups called clusters. Clusters contain cases that are similar to each other and different from the cases in other www.it-ebooks.info 300 Advanced CHAPTERmethods clusters. The chapter considers methods for determining the number of clusters present in a data set and combining observations into these clusters. Chapter 17 addresses the important topic of classification. In classification problems, the analyst attempts to develop a model for predicting the group membership of new cases (for example, good credit/bad credit risk, benign/malignant, pass/fail) from a (potentially large) set of predictor variables. A wide variety of methods are considered, including logistic regression, decision trees, random forests, and support-vector machines. Methods for assessing the efficacy of the resulting classification models are also described. In practice, researchers must often deal with incomplete datasets. Chapter 18 considers modern approaches to the ubiquitous problem of missing data values. R supports a number of elegant approaches for analyzing datasets that are incomplete for various reasons. Several of the best are described here, along with guidance about which ones to use and which ones to avoid. After completing part 4, you’ll have the tools to manage a wide range of complex data-analysis problems. This includes modeling non-normal outcome variables, dealing with large numbers of correlated variables, reducing a large number of cases to a smaller number of homogeneous clusters, developing models to predict future values or categorical outcomes, and handling messy and incomplete data. 
www.it-ebooks.info Generalized linear models This chapter covers ■ Formulating a generalized linear model ■ Predicting categorical outcomes ■ Modeling count data In chapters 8 (regression) and 9 (ANOVA), we explored linear models that can be used to predict a normally distributed response variable from a set of continuous and/or categorical predictor variables. But there are many situations in which it’s unreasonable to assume that the dependent variable is normally distributed (or even continuous). For example: ■ ■ The outcome variable may be categorical. Binary variables (for example, yes/ no, passed/failed, lived/died) and polytomous variables (for example, poor/good/excellent, republican/democrat/independent) clearly aren’t normally distributed. The outcome variable may be a count (for example, number of traffic accidents in a week, number of drinks per day). Such variables take on a limited number of values and are never negative. Additionally, their mean and variance are often related (which isn’t true for normally distributed variables). 301 www.it-ebooks.info 302 CHAPTER 13 Generalized linear models Generalized linear models extend the linear-model framework to include dependent variables that are decidedly non-normal. In this chapter, we’ll start with a brief overview of generalized linear models and the glm() function used to estimate them. Then we’ll focus on two popular models in this framework: logistic regression (where the dependent variable is categorical) and Poisson regression (where the dependent variable is a count variable). To motivate the discussion, you’ll apply generalized linear models to two research questions that aren’t easily addressed with standard linear models: ■ ■ What personal, demographic, and relationship variables predict marital infidelity? In this case, the outcome variable is binary (affair/no affair). What impact does a drug treatment for seizures have on the number of seizures experienced over an eight-week period? In this case, the outcome variable is a count (number of seizures). You’ll apply logistic regression to address the first question and Poison regression to address the second. Along the way, we’ll consider extensions of each technique. 13.1 Generalized linear models and the glm() function A wide range of popular data-analytic methods are subsumed within the framework of the generalized linear model. In this section, we’ll briefly explore some of the theory behind this approach. You can safely skip this section if you like and come back to it later. Let’s say that you want to model the relationship between a response variable Y and a set of p predictor variables X1 ...Xp. In the standard linear model, you assume that Y is normally distributed and that the form of the relationship is μY = β0 + p j =1 βj X j This equation states that the conditional mean of the response variable is a linear combination of the predictor variables. The βj are the parameters specifying the expected change in Y for a unit change in Xj, and β0 is the expected value of Y when all the predictor variables are 0. You’re saying that you can predict the mean of the Y distribution for observations with a given set of X values by applying the proper weights to the X variables and adding them up. Note that you’ve made no distributional assumptions about the predictor variables, Xj. Unlike Y, there’s no requirement that they be normally distributed. In fact, they’re often categorical (for example, ANOVA designs). 
Additionally, nonlinear functions of the predictors are allowed. You often include such predictors as X2 or X1 × X2. What is important is that the equation is linear in the parameters (β0, β1,… βp). In generalized linear models, you fit models of the form g(μY) = β0 + p j =1 βj X j www.it-ebooks.info Generalized linear models and the glm() function 303 where g(µY) is a function of the conditional mean (called the link function). Additionally, you relax the assumption that Y is normally distributed. Instead, you assume that Y follows a distribution that’s a member of the exponential family. You specify the link function and the probability distribution, and the parameters are derived through an iterative maximum-likelihood-estimation procedure. 13.1.1 The glm() function Generalized linear models are typically fit in R through the glm() function (although other specialized functions are available). The form of the function is similar to lm() but includes additional parameters. The basic format of the function is glm(formula, family=family(link=function), data=) where the probability distribution (family) and corresponding default link function (function) are given in table 13.1. Table 13.1 glm() parameters Family Default link function binomial (link = "logit") gaussian (link = "identity") gamma (link = "inverse") inverse.gaussian (link = "1/mu^2") poisson (link = "log") quasi (link = "identity", variance = "constant") quasibinomial (link = "logit") quasipoisson (link = "log") The glm() function allows you to fit a number of popular models, including logistic regression, Poisson regression, and survival analysis (not considered here). You can demonstrate this for the first two models as follows. Assume that you have a single response variable (Y), three predictor variables (X1, X2, X3), and a data frame (mydata) containing the data. Logistic regression is applied to situations in which the response variable is dichotomous (0 or 1). The model assumes that Y follows a binomial distribution and that you can fit a linear model of the form loge π = β0 + 1− π p j =1 βj X j where π = μ Y is the conditional mean of Y (that is, the probability that Y = 1 given a set of X values), (π/1 – π) is the odds that Y = 1, and log(π/1 – π) is the log odds, or logit. www.it-ebooks.info 304 CHAPTER 13 Generalized linear models In this case, log(π/1 – π) is the link function, the probability distribution is binomial, and the logistic regression model can be fit using glm(Y~X1+X2+X3, family=binomial(link="logit"), data=mydata) Logistic regression is described more fully in section 13.2. Poisson regression is applied to situations in which the response variable is the number of events to occur in a given period of time. The Poisson regression model assumes that Y follows a Poisson distribution and that you can fit a linear model of the form loge (λ) = β0 + p j =1 βj X j where λ is the mean (and variance) of Y. In this case, the link function is log(λ), the probability distribution is Poisson, and the Poisson regression model can be fit using glm(Y~X1+X2+X3, family=poisson(link="log"), data=mydata) Poisson regression is described in section 13.3. It’s worth noting that the standard linear model is also a special case of the generalized linear model. 
If you let the link function g(μY ) = μY or the identity function and specify that the probability distribution is normal (Gaussian), then glm(Y~X1+X2+X3, family=gaussian(link="identity"), data=mydata) would produce the same results as lm(Y~X1+X2+X3, data=mydata) To summarize, generalized linear models extend the standard linear model by fitting a function of the conditional mean response (rather than the conditional mean response) and assuming that the response variable follows a member of the exponential family of distributions (rather than being limited to the normal distribution). The parameter estimates are derived via maximum likelihood rather than least squares. 13.1.2 Supporting functions Many of the functions that you used in conjunction with lm() when analyzing standard linear models have corresponding versions for glm(). Some commonly used functions are given in table 13.2. Functions that support glm() Table 13.2 Function Description summary() Displays detailed results for the fitted model coefficients(), coef() Lists the model parameters (intercept and slopes) for the fitted model confint() Provides confidence intervals for the model parameters (95% by default) residuals() Lists the residual values in a fitted model anova() Generates an ANOVA table comparing two fitted models www.it-ebooks.info Generalized linear models and the glm() function Table 13.2 305 Functions that support glm() Function Description plot() Generates diagnostic plots for evaluating the fit of a model predict() Uses a fitted model to predict response values for a new dataset deviance() Deviance for the fitted model df.residual() Residual degrees of freedom for the fitted model We’ll explore examples of these functions in later sections. In the next section, we’ll briefly consider the assessment of model adequacy. 13.1.3 Model fit and regression diagnostics The assessment of model adequacy is as important for generalized linear models as it is for standard (OLS) linear models. Unfortunately, there’s less agreement in the statistical community regarding appropriate assessment procedures. In general, you can use the techniques described in chapter 8, with the following caveats. When assessing model adequacy, you’ll typically want to plot predicted values expressed in the metric of the original response variable against residuals of the deviance type. For example, a common diagnostic plot would be plot(predict(model, type="response"), residuals(model, type= "deviance")) where model is the object returned by the glm() function. The hat values, studentized residuals, and Cook’s D statistics that R provides will be approximate values. Additionally, there’s no general consensus on cutoff values for identifying problematic observations. Values have to be judged relative to each other. One approach is to create index plots for each statistic and look for unusually large values. For example, you could use the following code to create three diagnostic plots: plot(hatvalues(model)) plot(rstudent(model)) plot(cooks.distance(model)) Alternatively, you could use the code library(car) influencePlot(model) to create one omnibus plot. In the latter graph, the horizontal axis is the leverage, the vertical axis is the studentized residual, and the plotted symbol is proportional to the Cook’s distance. Diagnostic plots tend to be most helpful when the response variable takes on many values. 
When the response variable can only take on a limited number of values (for example, logistic regression), the utility of these plots is decreased. For more on regression diagnostics for generalized linear models, see Fox (2008) and Faraway (2006). In the remaining portion of this chapter, we’ll consider two of www.it-ebooks.info 306 CHAPTER 13 Generalized linear models the most popular forms of the generalized linear model in detail: logistic regression and Poisson regression. 13.2 Logistic regression Logistic regression is useful when you’re predicting a binary outcome from a set of continuous and/or categorical predictor variables. To demonstrate this, let’s explore the data on infidelity contained in the data frame Affairs, provided with the AER package. Be sure to download and install the package (using install.packages("AER")) before first use. The infidelity data, known as Fair’s Affairs, is based on a cross-sectional survey conducted by Psychology Today in 1969 and is described in Greene (2003) and Fair (1978). It contains 9 variables collected on 601 participants and includes how often the respondent engaged in extramarital sexual intercourse during the past year, as well as their gender, age, years married, whether they had children, their religiousness (on a 5-point scale from 1=anti to 5=very), education, occupation (Hollingshead 7-point classification with reverse numbering), and a numeric self-rating of their marriage (from 1=very unhappy to 5=very happy). Let’s look at some descriptive statistics: > data(Affairs, package="AER") > summary(Affairs) affairs gender Min. : 0.000 female:315 1st Qu.: 0.000 male :286 Median : 0.000 Mean : 1.456 3rd Qu.: 0.000 Max. :12.000 religiousness education Min. :1.000 Min. : 9.00 1st Qu.:2.000 1st Qu.:14.00 Median :3.000 Median :16.00 Mean :3.116 Mean :16.17 3rd Qu.:4.000 3rd Qu.:18.00 Max. :5.000 Max. :20.00 age Min. :17.50 1st Qu.:27.00 Median :32.00 Mean :32.49 3rd Qu.:37.00 Max. :57.00 occupation Min. :1.000 1st Qu.:3.000 Median :5.000 Mean :4.195 3rd Qu.:6.000 Max. :7.000 yearsmarried Min. : 0.125 1st Qu.: 4.000 Median : 7.000 Mean : 8.178 3rd Qu.:15.000 Max. :15.000 rating Min. :1.000 1st Qu.:3.000 Median :4.000 Mean :3.932 3rd Qu.:5.000 Max. :5.000 children no :171 yes:430 > table(Affairs$affairs) 0 1 2 3 7 12 451 34 17 19 42 38 From these statistics, you can see that that 52% of respondents were female, that 72% had children, and that the median age for the sample was 32 years. With regard to the response variable, 75% of respondents reported not engaging in an infidelity in the past year (451/601). The largest number of encounters reported was 12 (6%). Although the number of indiscretions was recorded, your interest here is in the binary outcome (had an affair/didn’t have an affair). You can transform affairs into a dichotomous factor called ynaffair with the following code. 
> Affairs$ynaffair[Affairs$affairs > 0] <- 1 > Affairs$ynaffair[Affairs$affairs == 0] <- 0 www.it-ebooks.info Logistic regression 307 > Affairs$ynaffair <- factor(Affairs$ynaffair, levels=c(0,1), labels=c("No","Yes")) > table(Affairs$ynaffair) No Yes 451 150 This dichotomous factor can now be used as the outcome variable in a logistic regression model: > fit.full <- glm(ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation +rating, data=Affairs, family=binomial()) > summary(fit.full) Call: glm(formula = ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + rating, family = binomial(), data = Affairs) Deviance Residuals: Min 1Q Median -1.571 -0.750 -0.569 3Q -0.254 Max 2.519 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.3773 0.8878 1.55 0.12081 gendermale 0.2803 0.2391 1.17 0.24108 age -0.0443 0.0182 -2.43 0.01530 * yearsmarried 0.0948 0.0322 2.94 0.00326 ** childrenyes 0.3977 0.2915 1.36 0.17251 religiousness -0.3247 0.0898 -3.62 0.00030 *** education 0.0211 0.0505 0.42 0.67685 occupation 0.0309 0.0718 0.43 0.66663 rating -0.4685 0.0909 -5.15 2.6e-07 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 675.38 Residual deviance: 609.51 AIC: 627.5 on 600 on 592 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 4 From the p-values for the regression coefficients (last column), you can see that gender, presence of children, education, and occupation may not make a significant contribution to the equation (you can’t reject the hypothesis that the parameters are 0). Let’s fit a second equation without them and test whether this reduced model fits the data as well: > fit.reduced <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, data=Affairs, family=binomial()) > summary(fit.reduced) www.it-ebooks.info 308 Generalized linear models CHAPTER 13 Call: glm(formula = ynaffair ~ age + yearsmarried + religiousness + rating, family = binomial(), data = Affairs) Deviance Residuals: Min 1Q Median -1.628 -0.755 -0.570 3Q -0.262 Max 2.400 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.9308 0.6103 3.16 0.00156 ** age -0.0353 0.0174 -2.03 0.04213 * yearsmarried 0.1006 0.0292 3.44 0.00057 *** religiousness -0.3290 0.0895 -3.68 0.00023 *** rating -0.4614 0.0888 -5.19 2.1e-07 *** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 675.38 Residual deviance: 615.36 AIC: 625.4 on 600 on 596 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 4 Each regression coefficient in the reduced model is statistically significant (p < .05). Because the two models are nested (fit.reduced is a subset of fit.full), you can use the anova() function to compare them. For generalized linear models, you’ll want a chi-square version of this test: > anova(fit.reduced, fit.full, test="Chisq") Analysis of Deviance Table Model 1: ynaffair ~ age + yearsmarried + religiousness + rating Model 2: ynaffair ~ gender + age + yearsmarried + children + religiousness + education + occupation + rating Resid. Df Resid. 
Dev Df Deviance P(>|Chi|) 1 596 615 2 592 610 4 5.85 0.21 The nonsignificant chi-square value (p = 0.21) suggests that the reduced model with four predictors fits as well as the full model with nine predictors, reinforcing your belief that gender, children, education, and occupation don’t add significantly to the prediction above and beyond the other variables in the equation. Therefore, you can base your interpretations on the simpler model. 13.2.1 Interpreting the model parameters Let’s look at the regression coefficients: > coef(fit.reduced) (Intercept) 1.931 age -0.035 yearsmarried religiousness 0.101 -0.329 www.it-ebooks.info rating -0.461 309 Logistic regression In a logistic regression, the response being modeled is the log(odds) that Y = 1. The regression coefficients give the change in log(odds) in the response for a unit change in the predictor variable, holding all other predictor variables constant. Because log(odds) are difficult to interpret, you can exponentiate them to put the results on an odds scale: > exp(coef(fit.reduced)) (Intercept) age 6.895 0.965 yearsmarried religiousness 1.106 0.720 rating 0.630 Now you can see that the odds of an extramarital encounter are increased by a factor of 1.106 for a one-year increase in years married (holding age, religiousness, and marital rating constant). Conversely, the odds of an extramarital affair are multiplied by a factor of 0.965 for every year increase in age. The odds of an extramarital affair increase with years married and decrease with age, religiousness, and marital rating. Because the predictor variables can’t equal 0, the intercept isn’t meaningful in this case. If desired, you can use the confint() function to obtain confidence intervals for the coefficients. For example, exp(confint(fit.reduced)) would print 95% confidence intervals for each of the coefficients on an odds scale. Finally, a one-unit change in a predictor variable may not be inherently interesting. For binary logistic regression, the change in the odds of the higher value on the response variable for an n unit change in a predictor variable is exp(βj)^n. If a oneyear increase in years married multiplies the odds of an affair by 1.106, a 10-year increase would increase the odds by a factor of 1.106^10, or 2.7, holding the other predictor variables constant. 13.2.2 Assessing the impact of predictors on the probability of an outcome For many of us, it’s easier to think in terms of probabilities than odds. You can use the predict() function to observe the impact of varying the levels of a predictor variable on the probability of the outcome. The first step is to create an artificial dataset containing the values of the predictor variables you’re interested in. Then you can use this artificial dataset with the predict() function to predict the probabilities of the outcome event occurring for these values. Let’s apply this strategy to assess the impact of marital ratings on the probability of having an extramarital affair. 
First, create an artificial dataset where age, years married, and religiousness are set to their means, and marital rating varies from 1 to 5: > testdata <- data.frame(rating=c(1, 2, 3, 4, 5), age=mean(Affairs$age), yearsmarried=mean(Affairs$yearsmarried), religiousness=mean(Affairs$religiousness)) > testdata rating age yearsmarried religiousness 1 1 32.5 8.18 3.12 2 2 32.5 8.18 3.12 3 3 32.5 8.18 3.12 4 4 32.5 8.18 3.12 5 5 32.5 8.18 3.12 www.it-ebooks.info 310 CHAPTER 13 Generalized linear models Next, use the test dataset and prediction equation to obtain probabilities: > testdata$prob <- predict(fit.reduced, newdata=testdata, type="response") testdata rating age yearsmarried religiousness prob 1 1 32.5 8.18 3.12 0.530 2 2 32.5 8.18 3.12 0.416 3 3 32.5 8.18 3.12 0.310 4 4 32.5 8.18 3.12 0.220 5 5 32.5 8.18 3.12 0.151 From these results, you see that the probability of an extramarital affair decreases from 0.53 when the marriage is rated 1=very unhappy to 0.15 when the marriage is rated 5=very happy (holding age, years married, and religiousness constant). Now look at the impact of age: > testdata <- data.frame(rating=mean(Affairs$rating), age=seq(17, 57, 10), yearsmarried=mean(Affairs$yearsmarried), religiousness=mean(Affairs$religiousness)) > testdata rating age yearsmarried religiousness 1 3.93 17 8.18 3.12 2 3.93 27 8.18 3.12 3 3.93 37 8.18 3.12 4 3.93 47 8.18 3.12 5 3.93 57 8.18 3.12 > testdata$prob <- predict(fit.reduced, newdata=testdata, type="response") > testdata rating age yearsmarried religiousness prob 1 3.93 17 8.18 3.12 0.335 2 3.93 27 8.18 3.12 0.262 3 3.93 37 8.18 3.12 0.199 4 3.93 47 8.18 3.12 0.149 5 3.93 57 8.18 3.12 0.109 Here, you see that as age increases from 17 to 57, the probability of an extramarital encounter decreases from 0.34 to 0.11, holding the other variables constant. Using this approach, you can explore the impact of each predictor variable on the outcome. 13.2.3 Overdispersion The expected variance for data drawn from a binomial distribution is σ2 = nπ(1 − π), where n is the number of observations and π is the probability of belonging to the Y = 1 group. Overdispersion occurs when the observed variance of the response variable is larger than what would be expected from a binomial distribution. Overdispersion can lead to distorted test standard errors and inaccurate tests of significance. When overdispersion is present, you can still fit a logistic regression using the glm() function, but in this case, you should use the quasibinomial distribution rather than the binomial distribution. www.it-ebooks.info Logistic regression 311 One way to detect overdispersion is to compare the residual deviance with the residual degrees of freedom in your binomial model. If the ratio φ = Residual deviance Residual df is considerably larger than 1, you have evidence of overdispersion. Applying this to the Affairs example, you have > deviance(fit.reduced)/df.residual(fit.reduced) [1] 1.032 which is close to 1, suggesting no overdispersion. You can also test for overdispersion. To do this, you fit the model twice, but in the first instance you use family="binomial" and in the second instance you use family="quasibinomial". If the glm() object returned in the first case is called fit and the object returned in the second case is called fit.od, then pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F) provides the p-value for testing the null hypothesis H0: φ = 1 versus the alternative hypothesis H1: φ ≠ 1. 
If p is small (say, less than 0.05), you’d reject the null hypothesis. Applying this to the Affairs dataset, you have > fit <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = binomial(), data = Affairs) > fit.od <- glm(ynaffair ~ age + yearsmarried + religiousness + rating, family = quasibinomial(), data = Affairs) > pchisq(summary(fit.od)$dispersion * fit$df.residual, fit$df.residual, lower = F) [1] 0.34 The resulting p-value (0.34) is clearly not significant (p > 0.05), strengthening your belief that overdispersion isn’t a problem. We’ll return to the issue of overdispersion when we discuss Poisson regression. 13.2.4 Extensions Several logistic regression extensions and variations are available in R: ■ ■ ■ Robust logistic regression—The glmRob() function in the robust package can be used to fit a robust generalized linear model, including robust logistic regression. Robust logistic regression can be helpful when fitting logistic regression models to data containing outliers and influential observations. Multinomial logistic regression—If the response variable has more than two unordered categories (for example, married/widowed/divorced), you can fit a polytomous logistic regression using the mlogit() function in the mlogit package. Ordinal logistic regression—If the response variable is a set of ordered categories (for example, credit risk as poor/good/excellent), you can fit an ordinal logistic regression using the lrm() function in the rms package. www.it-ebooks.info 312 CHAPTER 13 Generalized linear models The ability to model a response variable with multiple categories (both ordered and unordered) is an important extension, but it comes at the expense of greater interpretive complexity. Assessing model fit and regression diagnostics in these cases will also be more complex. In the Affairs example, the number of extramarital contacts was dichotomized into a yes/no response variable because our interest centered on whether respondents had an affair in the past year. If our interest had been centered on magnitude— the number of encounters in the past year—we would have analyzed the count data directly. One popular approach to analyzing count data is Poisson regression, the next topic we’ll address. 13.3 Poisson regression Poisson regression is useful when you’re predicting an outcome variable representing counts from a set of continuous and/or categorical predictor variables. A comprehensive yet accessible introduction to Poisson regression is provided by Coxe, West, and Aiken (2009). To illustrate the fitting of a Poisson regression model, along with some issues that can come up in the analysis, we’ll use the Breslow seizure data (Breslow, 1993) provided in the robust package. Specifically, we’ll consider the impact of an antiepileptic drug treatment on the number of seizures occurring over an eight-week period following the initiation of therapy. Be sure to install the robust package before continuing. Data were collected on the age and number of seizures reported by patients suffering from simple or complex partial seizures during an eight-week period before, and eight-week period after, randomization into a drug or placebo condition. SumY (the number of seizures in the eight-week period post-randomization) is the response variable. Treatment condition (Trt), age in years (Age), and number of seizures reported in the baseline eight-week period (Base) are the predictor variables. 
The baseline number of seizures and age are included because of their potential effect on the response variable. We're interested in whether or not evidence exists that the drug treatment decreases the number of seizures after accounting for these covariates. First, let’s look at summary statistics for the dataset: > data(breslow.dat, package="robust") > names(breslow.dat) [1] "ID" "Y1" "Y2" "Y3" "Y4" [10] "sumY" "Age10" "Base4" > summary(breslow.dat[c(6,7,8,10)]) Base Age Trt Min. : 6.0 Min. :18.0 placebo :28 1st Qu.: 12.0 1st Qu.:23.0 progabide:31 Median : 22.0 Median :28.0 Mean : 31.2 Mean :28.3 3rd Qu.: 41.0 3rd Qu.:32.0 Max. :151.0 Max. :42.0 www.it-ebooks.info "Base" "Age" sumY Min. : 0.0 1st Qu.: 11.5 Median : 16.0 Mean : 33.1 3rd Qu.: 36.0 Max. :302.0 "Trt" "Ysum" 313 Poisson regression Group Comparisons 150 100 15 0 0 5 50 10 Frequency 20 200 25 250 30 300 Distribution of Seizures 0 50 150 250 placebo Seizure Count Figure 13.1 progabide Treatment Distribution of post-treatment seizure counts (source: Breslow seizure data) Note that although there are 12 variables in the dataset, we’re limiting our attention to the 4 described earlier. Both the baseline and post-randomization number of seizures are highly skewed. Let’s look at the response variable in more detail. The following code produces the graphs in figure 13.1: opar <- par(no.readonly=TRUE) par(mfrow=c(1,2)) attach(breslow.dat) hist(sumY, breaks=20, xlab="Seizure Count", main="Distribution of Seizures") boxplot(sumY ~ Trt, xlab="Treatment", main="Group Comparisons") par(opar) You can clearly see the skewed nature of the dependent variable and the possible presence of outliers. At first glance, the number of seizures in the drug condition appears to be smaller and has a smaller variance. (You’d expect a smaller variance to accompany a smaller mean with Poisson distributed data.) Unlike standard OLS regression, this heterogeneity of variance isn’t a problem in Poisson regression. The next step is to fit the Poisson regression: > fit <- glm(sumY ~ Base + Age + Trt, data=breslow.dat, family=poisson()) > summary(fit) www.it-ebooks.info 314 CHAPTER 13 Generalized linear models Call: glm(formula = sumY ~ Base + Age + Trt, family = poisson(), data = breslow.dat) Deviance Residuals: Min 1Q Median -6.057 -2.043 -0.940 3Q 0.793 Max 11.006 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.948826 0.135619 14.37 < 2e-16 *** Base 0.022652 0.000509 44.48 < 2e-16 *** Age 0.022740 0.004024 5.65 1.6e-08 *** Trtprogabide -0.152701 0.047805 -3.19 0.0014 ** --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 2122.73 Residual deviance: 559.44 AIC: 850.7 on 58 on 55 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 5 The output provides the deviances, regression parameters, and standard errors, and tests that these parameters are 0. Note that each of the predictor variables is significant at the p < 0.05 level. 13.3.1 Interpreting the model parameters The model coefficients are obtained using the coef() function or by examining the Coefficients table in the summary() function output: > coef(fit) (Intercept) 1.9488 Base 0.0227 Age Trtprogabide 0.0227 -0.1527 In a Poisson regression, the dependent variable being modeled is the log of the conditional mean loge(λ). 
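Written out with the estimates shown above, the fitted model is

loge(λ) = 1.9488 + 0.0227 × Base + 0.0227 × Age - 0.1527 × Trtprogabide

where Trtprogabide is a dummy variable equal to 1 for patients in the progabide condition and 0 for patients receiving the placebo.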
The regression parameter 0.0227 for Age indicates that a one-year increase in age is associated with a 0.023 increase in the log mean number of seizures, holding baseline seizures and treatment condition constant. The intercept is the log mean number of seizures when each of the predictors equals 0. Because you can't have a zero age and none of the participants had a zero number of baseline seizures, the intercept isn't meaningful in this case.
It's usually much easier to interpret the regression coefficients in the original scale of the dependent variable (number of seizures, rather than log number of seizures). To accomplish this, exponentiate the coefficients:

> exp(coef(fit))
 (Intercept)         Base          Age Trtprogabide
       7.020        1.023        1.023        0.858

Now you see that a one-year increase in age multiplies the expected number of seizures by 1.023, holding the other variables constant. This means that increased age is associated with higher numbers of seizures. More important, a one-unit change in Trt (that is, moving from placebo to progabide) multiplies the expected number of seizures by 0.86. You'd expect a 20% decrease in the number of seizures for the drug group compared with the placebo group, holding baseline number of seizures and age constant.
It's important to remember that, like the exponentiated parameters in logistic regression, the exponentiated parameters in the Poisson model have a multiplicative rather than an additive effect on the response variable. Also, as with logistic regression, you must evaluate your model for overdispersion.
13.3.2 Overdispersion
In a Poisson distribution, the variance and mean are equal. Overdispersion occurs in Poisson regression when the observed variance of the response variable is larger than would be predicted by the Poisson distribution. Because overdispersion is often encountered when dealing with count data and can have a negative impact on the interpretation of the results, we'll spend some time discussing it.
There are several reasons why overdispersion may occur (Coxe et al., 2009):
■ The omission of an important predictor variable can lead to overdispersion.
■ Overdispersion can also be caused by a phenomenon known as state dependence. Within observations, each event in a count is assumed to be independent. For the seizure data, this would imply that for any patient, the probability of a seizure is independent of each other seizure. But this assumption is often untenable. For a given individual, the probability of having a first seizure is unlikely to be the same as the probability of having a 40th seizure, given that they've already had 39.
■ In longitudinal studies, overdispersion can be caused by the clustering inherent in repeated-measures data. We won't discuss longitudinal Poisson models here.
If overdispersion is present and you don't account for it in your model, you'll get standard errors and confidence intervals that are too small, and significance tests that are too liberal (that is, you'll find effects that aren't really there).
As with logistic regression, overdispersion is suggested if the ratio of the residual deviance to the residual degrees of freedom is much larger than 1. For the seizure data, the ratio is

> deviance(fit)/df.residual(fit)
[1] 10.17

which is clearly much larger than 1. The qcc package provides a test for overdispersion in the Poisson case. (Be sure to download and install this package before first use.)
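If the package isn't already installed on your system, you can install it with

install.packages("qcc")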
You can test for overdispersion in the seizure data using the following code: > library(qcc) > qcc.overdispersion.test(breslow.dat$sumY, type="poisson") www.it-ebooks.info 316 CHAPTER 13 Generalized linear models Overdispersion test Obs.Var/Theor.Var Statistic p-value poisson data 62.9 3646 0 Not surprisingly, the significance test has a p-value less than 0.05, strongly suggesting the presence of overdispersion. You can still fit a model to your data using the glm() function, by replacing family="poisson" with family="quasipoisson". Doing so is analogous to the approach to logistic regression when overdispersion is present: > fit.od <- glm(sumY ~ Base + Age + Trt, data=breslow.dat, family=quasipoisson()) > summary(fit.od) Call: glm(formula = sumY ~ Base + Age + Trt, family = quasipoisson(), data = breslow.dat) Deviance Residuals: Min 1Q Median -6.057 -2.043 -0.940 3Q 0.793 Max 11.006 Coefficients: Estimate Std. Error t (Intercept) 1.94883 0.46509 Base 0.02265 0.00175 Age 0.02274 0.01380 Trtprogabide -0.15270 0.16394 --Signif. codes: 0 '***' 0.001 '**' value Pr(>|t|) 4.19 0.00010 *** 12.97 < 2e-16 *** 1.65 0.10509 -0.93 0.35570 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for quasipoisson family taken to be 11.8) Null deviance: 2122.73 Residual deviance: 559.44 AIC: NA on 58 on 55 degrees of freedom degrees of freedom Number of Fisher Scoring iterations: 5 Notice that the parameter estimates in the quasi-Poisson approach are identical to those produced by the Poisson approach. The standard errors are much larger, though. In this case, the larger standard errors have led to p-values for Trt (and Age) that are greater than 0.05. When you take overdispersion into account, there’s insufficient evidence to declare that the drug regimen reduces seizure counts more than receiving a placebo, after controlling for baseline seizure rate and age. Please remember that this example is used for demonstration purposes only. The results shouldn’t be taken to imply anything about the efficacy of progabide in the real world. I’m not a doctor—at least not a medical doctor—and I don’t even play one on TV. We’ll finish this exploration of Poisson regression with a discussion of some important variants and extensions. www.it-ebooks.info Poisson regression 317 13.3.3 Extensions R provides several useful extensions to the basic Poisson regression model, including models that allow varying time periods, models that correct for too many zeros, and robust models that are useful when data includes outliers and influential observations. I’ll describe each separately. POISSON REGRESSION WITH VARYING TIME PERIODS Our discussion of Poisson regression has been limited to response variables that measure a count over a fixed length of time (for example, number of seizures in an eightweek period, number of traffic accidents in the past year, or number of pro-social behaviors in a day). The length of time is constant across observations. But you can fit Poisson regression models that allow the time period to vary for each observation. In this case, the outcome variable is a rate. To analyze rates, you must include a variable (for example, time) that records the length of time over which the count occurs for each observation. You then change the model from loge (λ) = β0 + p j =1 βj X j to loge λ = β0 + time p j =1 βj X j or equivalently loge(λ) = loge(time) + β0 + p j =1 βj X j To fit this new model, you use the offset option in the glm() function. 
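In other words, the model for the rate is loge(λ/time) = β0 + β1X1 + … + βpXp, or equivalently loge(λ) = loge(time) + β0 + β1X1 + … + βpXp, where the loge(time) term is the offset supplied to glm().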
For example, assume that the length of time that patients participated post-randomization in the Breslow study varied from 14 days to 60 days. You could use the rate of seizures as the dependent variable (assuming you had recorded time for each patient in days) and fit the model fit <- glm(sumY ~ Base + Age + Trt, data=breslow.dat, offset= log(time), family=poisson) where sumY is the number of seizures that occurred post-randomization for a patient during the time the patient was studied. In this case, you’re assuming that rate doesn’t vary over time (for example, 2 seizures in 4 days is equivalent to 10 seizures in 20 days). ZERO-INFLATED POISSON REGRESSION There are times when the number of zero counts in a dataset is larger than would be predicted by the Poisson model. This can occur when there’s a subgroup of the population that would never engage in the behavior being counted. For example, in the Affairs dataset described in the section on logistic regression, the original outcome www.it-ebooks.info 318 CHAPTER 13 Generalized linear models variable (affairs) counted the number of extramarital sexual intercourse experiences participants had in the past year. It’s likely that there’s a subgroup of faithful marital partners who would never have an affair, no matter how long the period of time studied. These are called structural zeros (primarily by the swingers in the group). In such cases, you can analyze the data using an approach called zero-inflated Poisson regression. The approach fits two models simultaneously—one that predicts who would or would not have an affair, and the second that predicts how many affairs a participant would have if you excluded the permanently faithful. Think of this as a model that combines a logistic regression (for predicting structural zeros) and a Poisson regression model (that predicts counts for observations that aren’t structural zeros). Zero-inflated Poisson regression can be fit using the zeroinfl() function in the pscl package. ROBUST POISSON REGRESSION Finally, the glmRob() function in the robust package can be used to fit a robust generalized linear model, including robust Poisson regression. As mentioned previously, this can be helpful in the presence of outliers and influential observations. Going further Generalized linear models are a complex and mathematically sophisticated subject, but many fine resources are available for learning about them. A good, short introduction to the topic is Dunteman and Ho (2006). The classic (and advanced) text on generalized linear models is provided by McCullagh and Nelder (1989). Comprehensive and accessible presentations are provided by Dobson and Barnett (2008) and Fox (2008). Faraway (2006) and Fox (2002) provide excellent introductions within the context of R. 13.4 Summary In this chapter, we used generalized linear models to expand the range of approaches available for helping you to understand your data. In particular, the framework allows you to analyze response variables that are decidedly non-normal, including categorical outcomes and discrete counts. After briefly describing the general approach, we focused on logistic regression (for analyzing a dichotomous outcome) and Poisson regression (for analyzing outcomes measured as counts or rates). We also discussed the important topic of overdispersion, including how to detect it and how to adjust for it. Finally, we looked at some of the extensions and variations that are available in R. 
Each of the statistical approaches covered so far has dealt with directly observed and recorded variables. In the next chapter, we’ll look at statistical models that deal with latent variables—unobserved, theoretical variables that you believe underlie and account for the behavior of the variables you do observe. In particular, you’ll see how you can use factor analytic methods to detect and test hypotheses about these unobserved variables. www.it-ebooks.info Principal components and factor analysis This chapter covers ■ Principal components analysis ■ Exploratory factor analysis ■ Understanding other latent variable models One of the most challenging aspects of multivariate data is the sheer complexity of the information. If you have a dataset with 100 variables, how do you make sense of all the interrelationships present? Even with 20 variables, there are 190 pairwise correlations to consider when you’re trying to understand how the individual variables relate to one another. Two related but distinct methodologies for exploring and simplifying complex multivariate data are principal components and exploratory factor analysis. Principal components analysis (PCA) is a data-reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components. For example, you might use PCA to transform 30 correlated (and possibly redundant) environmental variables into 5 uncorrelated composite variables that retain as much information from the original set of variables as possible. 319 www.it-ebooks.info 320 CHAPTER 14 Principal components and factor analysis In contrast, exploratory factor analysis (EFA) is a collection of methods designed to uncover the latent structure in a given set of variables. It looks for a smaller set of underlying or latent constructs that can explain the relationships among the observed or manifest variables. For example, the dataset Harman74.cor contains the correlations among 24 psychological tests given to 145 seventh- and eighth-grade children. If you apply EFA to this data, the results suggest that the 276 test intercorrelations can be explained by the children’s abilities on 4 underlying factors (verbal ability, processing speed, deduction, and memory). The differences between the PCA and EFA models can be seen in figure 14.1. Principal components (PC1 and PC2) are linear combinations of the observed variables (X1 to X5). The weights used to form the linear composites are chosen to maximize the variance each principal component accounts for, while keeping the components uncorrelated. In contrast, factors (F1 and F2) are assumed to underlie or “cause” the observed variables, rather than being linear combinations of them. The errors (e1 to e5) represent the variance in the observed variables unexplained by the factors. The circles indicate that the factors and errors aren’t directly observable but are inferred from the correlations among the variables. In this example, the curved arrow between the factors indicates that they’re correlated. Correlated factors are common, but not required, in the EFA model. The methods described in this chapter require large samples to derive stable solutions. What constitutes an adequate sample size is somewhat complicated. 
Until recently, analysts used rules of thumb like “factor analysis requires 5–10 times as many subjects as variables.” Recent studies suggest that the required sample size depends on the number of factors, the number of variables associated with each factor, and how well the set of factors explains the variance in the variables (Bandalos and BoehmKaufman, 2009). I’ll go out on a limb and say that if you have several hundred observations, you’re probably safe. In this chapter, we’ll look at artificially small problems in order to keep the output (and page count) manageable. We’ll start by reviewing the functions in R that can be used to perform PCA or EFA and give a brief overview of the steps involved. Then we’ll work carefully through two PCA examples, followed by an extended EFA example. A brief overview of other packages in R that can be used for fitting latent variable models is provided at the end of X1 X2 PC1 e1 X2 e2 X3 e3 X4 e4 X5 e5 F1 X3 F1 X4 X1 PC2 X5 (a) Principal Components Model (b) Factor Analysis Model www.it-ebooks.info Figure 14.1 Comparing principal components and factor analysis models. The diagrams show the observed variables (X1 to X5), the principal components (PC1, PC2), factors (F1, F2), and errors (e1 to e5). Principal components and factor analysis in R 321 the chapter. This discussion includes packages for confirmatory factor analysis, structural equation modeling, correspondence analysis, and latent class analysis. 14.1 Principal components and factor analysis in R In the base installation of R, the functions for PCA and EFA are princomp() and factanal(), respectively. In this chapter, we’ll focus on functions provided in the psych package. They offer many more useful options than their base counterparts. Additionally, the results are reported in a metric that will be more familiar to social scientists and more likely to match the output provided by corresponding programs in other statistical packages such as SAS and SPSS. The psych package functions that are most relevant here are listed in table 14.1. Be sure to install the package before trying the examples in this chapter. Table 14.1 Useful factor analytic functions in the psych package Function Description principal() Principal components analysis with optional rotation fa() Factor analysis by principal axis, minimum residual, weighted least squares, or maximum likelihood fa.parallel() Scree plots with parallel analyses factor.plot() Plot the results of a factor or principal components analysis fa.diagram() Graph factor or principal components loading matrices scree() Scree plot for factor and principal components analysis EFA (and to a lesser degree PCA) are often confusing to new users. The reason is that they describe a wide range of approaches, and each approach requires several steps (and decisions) to achieve a final result. The most common steps are as follows: 1 2 3 4 5 6 7 Prepare the data. Both PCA and EFA derive their solutions from the correlations among the observed variables. You can input either the raw data matrix or the correlation matrix to the principal() and fa() functions. If raw data is input, the correlation matrix is automatically calculated. Be sure to screen the data for missing values before proceeding. Select a factor model. Decide whether PCA (data reduction) or EFA (uncovering latent structure) is a better fit for your research goals. If you select an EFA approach, you’ll also need to choose a specific factoring method (for example, maximum likelihood). 
Decide how many components/factors to extract. Extract the components/factors. Rotate the components/factors. Interpret the results. Compute component or factor scores. www.it-ebooks.info 322 CHAPTER 14 Principal components and factor analysis In the remainder of this chapter, we’ll carefully consider each of the steps, starting with PCA. At the end of the chapter, you’ll find a detailed flow chart of the possible steps in PCA/EFA (figure 14.7). The chart will make more sense once you’ve read through the intervening material. 14.2 Principal components The goal of PCA is to replace a large number of correlated variables with a smaller number of uncorrelated variables while capturing as much information in the original variables as possible. These derived variables, called principal components, are linear combinations of the observed variables. Specifically, the first principal component PC1 = a1X1 + a2X2 + ... + akXk is the weighted combination of the k observed variables that accounts for the most variance in the original set of variables. The second principal component is the linear combination that accounts for the most variance in the original variables, under the constraint that it’s orthogonal (uncorrelated) to the first principal component. Each subsequent component maximizes the variance accounted for, while at the same time remaining uncorrelated with all previous components. Theoretically, you can extract as many principal components as there are variables. But from a practical viewpoint, you hope that you can approximate the full set of variables with a much smaller set of components. Let’s look at a simple example. The dataset USJudgeRatings contains lawyers’ ratings of state judges in the US Superior Court. The data frame contains 43 observations on 12 numeric variables. The variables are listed in table 14.2. Table 14.2 Variables in the USJudgeRatings dataset Variable Description Variable Description CONT Number of contacts of lawyer with judge PREP Preparation for trial INTG Judicial integrity FAMI Familiarity with law DMNR Demeanor ORAL Sound oral rulings DILG Diligence WRIT Sound written rulings CFMG Case flow managing PHYS Physical ability DECI Prompt decisions RTEN Worthy of retention From a practical point of view, can you summarize the 11 evaluative ratings (INTG to RTEN) with a smaller number of composite variables? If so, how many will you need, and how will they be defined? Because the goal is to simplify the data, you’ll approach this problem using PCA. The data are in raw score format, and there are no missing values. Therefore, your next step is deciding how many principal components you’ll need. www.it-ebooks.info 323 Principal components 14.2.1 Selecting the number of components to extract Several criteria are available for deciding how many components to retain in a PCA. They include ■ ■ ■ Basing the number of components on prior experience and theory Selecting the number of components needed to account for some threshold cumulative amount of variance in the variables (for example, 80%) Selecting the number of components to retain by examining the eigenvalues of the k × k correlation matrix among the variables The most common approach is based on the eigenvalues. Each component is associated with an eigenvalue of the correlation matrix. The first PC is associated with the largest eigenvalue, the second PC with the second-largest eigenvalue, and so on. The Kaiser–Harris criterion suggests retaining components with eigenvalues greater than 1. 
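The eigenvalues themselves are easy to inspect. For example, the following line (a quick check of my own, not code from the text) returns the 11 eigenvalues of the correlation matrix among the ratings:

eigen(cor(USJudgeRatings[,-1]))$values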
Components with eigenvalues less than 1 explain less variance than contained in a single variable. In the Cattell Scree test, the eigenvalues are plotted against their component numbers. Such plots typically demonstrate a bend or elbow, and the components above this sharp break are retained. Finally, you can run simulations, extracting eigenvalues from random data matrices of the same size as the original matrix. If an eigenvalue based on real data is larger than the average corresponding eigenvalues from a set of random data matrices, that component is retained. The approach is called parallel analysis (see Hayton, Allen, and Scarpello, 2004, for more details). You can assess all three eigenvalue criteria at the same time via the fa.parallel() function. For the 11 ratings (dropping the CONT variable), the necessary code is as follows: library(psych) fa.parallel(USJudgeRatings[,-1], fa="pc", n.iter=100, show.legend=FALSE, main="Scree plot with parallel analysis") 10 8 6 4 2 eigen values of principal components Figure 14.2 Assessing the number of principal components to retain for the USJudgeRatings example. A scree plot (the line with x’s), eigenvalues greater than 1 criteria (horizontal line), and parallel analysis with 100 simulations (dashed line) suggest retaining a single component. Scree plot with parallel analysis 0 This code produces the graph shown in figure 14.2. The plot displays the scree test based on the observed eigenvalues (as straightline segments and x’s), the mean eigenvalues derived from 100 random data matrices (as dashed lines), and the eigenvalues greater than 1 criteria (as a horizontal line at y=1). www.it-ebooks.info 2 4 6 Factor Number 8 10 324 CHAPTER 14 Principal components and factor analysis All three criteria suggest that a single component is appropriate for summarizing this dataset. Your next step is to extract the principal component using the principal() function. 14.2.2 Extracting principal components As indicated earlier, the principal() function performs a principal components analysis starting with either a raw data matrix or a correlation matrix. The format is principal(r, nfactors=, rotate=, scores=) where ■ ■ ■ r is a correlation matrix or a raw data matrix. nfactors specifies the number of principal components to extract (1 by default). rotate indicates the rotation to be applied (varimax by default; see section 14.2.3). ■ scores specifies whether to calculate principal-component scores (false by default). To extract the first principal component, you can use the code in the following listing. Listing 14.1 Principal components analysis of USJudgeRatings > library(psych) > pc <- principal(USJudgeRatings[,-1], nfactors=1) > pc Principal Components Analysis Call: principal(r = USJudgeRatings[, -1], nfactors=1) Standardized loadings based upon correlation matrix PC1 h2 u2 INTG 0.92 0.84 0.157 DMNR 0.91 0.83 0.166 DILG 0.97 0.94 0.061 CFMG 0.96 0.93 0.072 DECI 0.96 0.92 0.076 PREP 0.98 0.97 0.030 FAMI 0.98 0.95 0.047 ORAL 1.00 0.99 0.009 WRIT 0.99 0.98 0.020 PHYS 0.89 0.80 0.201 RTEN 0.99 0.97 0.028 PC1 SS loadings 10.13 Proportion Var 0.92 [... additional output omitted ...] Here, you’re inputting the raw data without the CONT variable and specifying that one unrotated component should be extracted. (Rotation is explained in section 14.3.3.) Because PCA is performed on a correlation matrix, the raw data is automatically converted to a correlation matrix before the components are extracted. 
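Because principal() accepts either form of input, you could equivalently compute the correlation matrix yourself and pass it in. The sketch below is a minimal illustration; note that component scores can't be obtained this way because the raw data aren't supplied:

library(psych)
r <- cor(USJudgeRatings[,-1])      # correlation matrix for the 11 ratings
pc <- principal(r, nfactors=1)     # same loadings as the raw-data call above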
www.it-ebooks.info 325 Principal components The column labeled PC1 contains the component loadings, which are the correlations of the observed variables with the principal component(s). If you extracted more than one principal component, there would be columns for PC2, PC3, and so on. Component loadings are used to interpret the meaning of components. You can see that each variable correlates highly with the first component (PC1). It therefore appears to be a general evaluative dimension. The column labeled h2 contains the component communalities—the amount of variance in each variable explained by the components. The u2 column contains the component uniquenesses—the amount of variance not accounted for by the components (or 1 – h2). For example, 80% of the variance in physical ability (PHYS) ratings is accounted for by the first PC, and 20% isn’t. PHYS is the variable least well represented by a one-component solution. The row labeled SS Loadings contains the eigenvalues associated with the components. The eigenvalues are the standardized variance associated with a particular component (in this case, the value for the first component is 10). Finally, the row labeled Proportion Var represents the amount of variance accounted for by each component. Here you see that the first principal component accounts for 92% of the variance in the 11 variables. Let’s consider a second example, one that results in a solution with more than one principal component. The dataset Harman23.cor contains data on 8 body measurements for 305 girls. In this case, the dataset consists of the correlations among the variables rather than the original data (see table 14.3). Table 14.3 Correlations among body measurements for 305 girls (Harman23.cor) Height Arm span Forearm Lower leg Weight Bitro diameter Chest girth Chest width Height 1.00 0.85 0.80 0.86 0.47 0.40 0.30 0.38 Arm span 0.85 1.00 0.88 0.83 0.38 0.33 0.28 0.41 Forearm 0.80 0.88 1.00 0.80 0.38 0.32 0.24 0.34 Lower leg 0.86 0.83 0.8 1.00 0.44 0.33 0.33 0.36 Weight 0.47 0.38 0.38 0.44 1.00 0.76 0.73 0.63 Bitro diameter 0.40 0.33 0.32 0.33 0.76 1.00 0.58 0.58 Chest girth 0.30 0.28 0.24 0.33 0.73 0.58 1.00 0.54 Chest width 0.38 0.41 0.34 0.36 0.63 0.58 0.54 1.00 Source: H. H. Harman, Modern Factor Analysis, Third Edition Revised, University of Chicago Press, 1976, Table 2.3. Again, you wish to replace the original physical measurements with a smaller number of derived variables. You can determine the number of components to extract using www.it-ebooks.info 326 Principal components and factor analysis CHAPTER 14 4 3 2 1 Figure 14.3 Assessing the number of principal components to retain for the body measurements example. The scree plot (line with x’s), eigenvalues greater than 1 criteria (horizontal line), and parallel analysis with 100 simulations (dashed line) suggest retaining two components. 0 eigen values of principal components Scree plot with parallel analysis 1 2 3 4 5 6 7 8 Factor Number the following code. In this case, you need to identify the correlation matrix (the cov component of the Harman23.cor object) and specify the sample size (n.obs): library(psych) fa.parallel(Harman23.cor$cov, n.obs=302, fa="pc", n.iter=100, show.legend=FALSE, main="Scree plot with parallel analysis") The resulting graph is displayed in figure 14.3. You can see from the plot that a two-component solution is suggested. As in the first example, the Kaiser–Harris criteria, scree test, and parallel analysis agree. 
This won’t always be the case, and you may need to extract different numbers of components and select the solution that appears most useful. The next listing extracts the first two principal components from the correlation matrix. Listing 14.2 Principal components analysis of body measurements > library(psych) > pc <- principal(Harman23.cor$cov, nfactors=2, rotate="none") > pc Principal Components Analysis Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "none") Standardized loadings based upon correlation matrix PC1 PC2 h2 u2 height 0.86 -0.37 0.88 0.123 arm.span 0.84 -0.44 0.90 0.097 forearm 0.81 -0.46 0.87 0.128 lower.leg 0.84 -0.40 0.86 0.139 weight 0.76 0.52 0.85 0.150 bitro.diameter 0.67 0.53 0.74 0.261 www.it-ebooks.info Principal components chest.girth chest.width 0.62 0.67 327 0.58 0.72 0.283 0.42 0.62 0.375 PC1 PC2 SS loadings 4.67 1.77 Proportion Var 0.58 0.22 Cumulative Var 0.58 0.81 [... additional output omitted ...] If you examine the PC1 and PC2 columns in listing 14.2, you see that the first component accounts for 58% of the variance in the physical measurements, whereas the second component accounts for 22%. Together, the two components account for 81% of the variance. The two components together account for 88% of the variance in the height variable. Components and factors are interpreted by examining their loadings. The first component correlates positively with each physical measure and appears to be a general size factor. The second component contrasts the first four variables (height, arm span, forearm, and lower leg), with the second four variables (weight, bitro diameter, chest girth, and chest width). It therefore appears to be a length-versus-volume factor. Conceptually, this isn’t an easy construct to work with. Whenever two or more components have been extracted, you can rotate the solution to make it more interpretable. This is the topic we’ll turn to next. 14.2.3 Rotating principal components Rotations are a set of mathematical techniques for transforming the component loading matrix into one that’s more interpretable. They do this by “purifying” the components as much as possible. Rotation methods differ with regard to whether the resulting components remain uncorrelated (orthogonal rotation) or are allowed to correlate (oblique rotation). They also differ in their definition of purifying. The most popular orthogonal rotation is the varimax rotation, which attempts to purify the columns of the loading matrix, so that each component is defined by a limited set of variables (that is, each column has a few large loadings and many very small loadings). Applying a varimax rotation to the body measurement data, you get the results provided in the next listing. You’ll see an example of an oblique rotation in section 14.4. Listing 14.3 Principal components analysis with varimax rotation > rc <- principal(Harman23.cor$cov, nfactors=2, rotate="varimax") > rc Principal Components Analysis Call: principal(r = Harman23.cor$cov, nfactors = 2, rotate = "varimax") Standardized loadings based upon correlation matrix RC1 RC2 h2 u2 height 0.90 0.25 0.88 0.123 arm.span 0.93 0.19 0.90 0.097 forearm 0.92 0.16 0.87 0.128 lower.leg 0.90 0.22 0.86 0.139 www.it-ebooks.info 328 weight bitro.diameter chest.girth chest.width CHAPTER 14 Principal components and factor analysis 0.26 0.19 0.11 0.26 0.85 0.74 0.72 0.62 0.88 0.84 0.84 0.75 0.150 0.261 0.283 0.375 RC1 RC2 SS loadings 3.52 2.92 Proportion Var 0.44 0.37 Cumulative Var 0.44 0.81 [... additional output omitted ...] 
The column names change from PC to RC to denote rotated components. Looking at the loadings in column RC1, you see that the first component is primarily defined by the first four variables (length variables). The loadings in the column RC2 indicate that the second component is primarily defined by variables 5 through 8 (volume variables). Note that the two components are still uncorrelated and that together, they still explain the variables equally well. You can see that the rotated solution explains the variables equally well because the variable communalities haven’t changed. Additionally, the cumulative variance accounted for by the two-component rotated solution (81%) hasn’t changed. But the proportion of variance accounted for by each individual component has changed (from 58% to 44% for component 1 and from 22% to 37% for component 2). This spreading out of the variance across components is common, and technically you should now call them components rather than principal components (because the variance-maximizing properties of individual components haven’t been retained). The ultimate goal is to replace a larger set of correlated variables with a smaller set of derived variables. To do this, you need to obtain scores for each observation on the components. 14.2.4 Obtaining principal components scores In the USJudgeRatings example, you extracted a single principal component from the raw data describing lawyers’ ratings on 11 variables. The principal() function makes it easy to obtain scores for each participant on this derived variable (see the next listing). Listing 14.4 Obtaining component scores from raw data > library(psych) > pc <- principal(USJudgeRatings[,-1], nfactors=1, score=TRUE) > head(pc$scores) PC1 AARONSON,L.H. -0.1857981 ALEXANDER,J.M. 0.7469865 ARMENTANO,A.J. 0.0704772 BERDON,R.I. 1.1358765 BRACKEN,J.J. -2.1586211 BURNS,E.B. 0.7669406 www.it-ebooks.info Principal components 329 The principal component scores are saved in the scores element of the object returned by the principal() function when the option scores=TRUE. If you wanted, you could now get the correlation between the number of contacts occurring between a lawyer and a judge and their evaluation of the judge using > cor(USJudgeRatings$CONT, pc$score) PC1 [1,] -0.008815895 Apparently, there’s no relationship between the lawyer’s familiarity and their opinions! When the principal components analysis is based on a correlation matrix and the raw data aren’t available, getting principal component scores for each observation is clearly not possible. But you can get the coefficients used to calculate the principal components. In the body measurement data, you have correlations among body measurements, but you don’t have the individual measurements for these 305 girls. You can get the scoring coefficients using the code in the following listing. 
Listing 14.5 Obtaining principal component scoring coefficients > library(psych) > rc <- principal(Harman23.cor$cov, nfactors=2, rotate="varimax") > round(unclass(rc$weights), 2) RC1 RC2 height 0.28 -0.05 arm.span 0.30 -0.08 forearm 0.30 -0.09 lower.leg 0.28 -0.06 weight -0.06 0.33 bitro.diameter -0.08 0.32 chest.girth -0.10 0.34 chest.width -0.04 0.27 The component scores are obtained using the formulas PC1 = 0.28*height + 0.30*arm.span + 0.30*forearm + 0.29*lower.leg 0.06*weight - 0.08*bitro.diameter - 0.10*chest.girth 0.04*chest.width and PC2 = -0.05*height - 0.08*arm.span - 0.09*forearm - 0.06*lower.leg + 0.33*weight + 0.32*bitro.diameter + 0.34*chest.girth + 0.27*chest.width These equations assume that the physical measurements have been standardized (mean = 0, sd = 1). Note that the weights for PC1 tend to be around 0.3 or 0. The same is true for PC2. As a practical matter, you could simplify your approach further by taking the first composite variable as the mean of the standardized scores for the first four variables. Similarly, you could define the second composite variable as the mean of the standardized scores for the second four variables. This is typically what I’d do in practice. www.it-ebooks.info 330 CHAPTER 14 Principal components and factor analysis Little Jiffy conquers the world There’s quite a bit of confusion among data analysts regarding PCA and EFA. One reason for this is historical and can be traced back to a program called Little Jiffy (no kidding). Little Jiffy was one of the most popular early programs for factor analysis, and it defaulted to a principal components analysis, extracting components with eigenvalues greater than 1 and rotating them to a varimax solution. The program was so widely used that many social scientists came to think of this default behavior as synonymous with EFA. Many later statistical packages also incorporated these defaults in their EFA programs. As I hope you’ll see in the next section, there are important and fundamental differences between PCA and EFA. To learn more about the PCA/EFA confusion, see Hayton, Allen, and Scarpello, 2004. If your goal is to look for latent underlying variables that explain your observed variables, you can turn to factor analysis. This is the topic of the next section. 14.3 Exploratory factor analysis The goal of EFA is to explain the correlations among a set of observed variables by uncovering a smaller set of more fundamental unobserved variables underlying the data. These hypothetical, unobserved variables are called factors. (Each factor is assumed to explain the variance shared among two or more observed variables, so technically, they’re called common factors.) The model can be represented as X i = a1F1 + a2F2 + ... + ap Fp + Ui where X i is the ith observed variable (i = 1…k), Fj are the common factors (j = 1…p), and p < k. Ui is the portion of variable X i unique to that variable (not explained by the common factors). The ai can be thought of as the degree to which each factor contributes to the composition of an observed variable. If we go back to the Harman74.cor example at the beginning of this chapter, we’d say that an individual’s scores on each of the 24 observed psychological tests is due to a weighted combination of their ability on 4 underlying psychological constructs. Although the PCA and EFA models differ, many of the steps appear similar. To illustrate the process, you’ll apply EFA to the correlations among six psychological tests. 
One hundred twelve individuals were given six tests, including a nonverbal measure of general intelligence (general), a picture-completion test (picture), a block design test (blocks), a maze test (maze), a reading comprehension test (reading), and a vocabulary test (vocab). Can you explain the participants’ scores on these tests with a smaller number of underlying or latent psychological constructs? The covariance matrix among the variables is provided in the dataset ability.cov. You can transform this into a correlation matrix using the cov2cor() function: www.it-ebooks.info 331 Exploratory factor analysis > > > > options(digits=2) covariances <- ability.cov$cov correlations <- cov2cor(covariances) correlations general picture blocks maze reading vocab general 1.00 0.47 0.55 0.34 0.58 0.51 picture 0.47 1.00 0.57 0.19 0.26 0.24 blocks 0.55 0.57 1.00 0.45 0.35 0.36 maze 0.34 0.19 0.45 1.00 0.18 0.22 reading 0.58 0.26 0.35 0.18 1.00 0.79 vocab 0.51 0.24 0.36 0.22 0.79 1.00 Because you’re looking for hypothetical constructs that explain the data, you’ll use an EFA approach. As in PCA, the next task is to decide how many factors to extract. 14.3.1 Deciding how many common factors to extract To decide on the number of factors to extract, turn to the fa.parallel() function: > > > > library(psych) covariances <- ability.cov$cov correlations <- cov2cor(covariances) fa.parallel(correlations, n.obs=112, fa="both", n.iter=100, main="Scree plots with parallel analysis") The resulting plot is shown in figure 14.4. Notice you’ve requested that the function display results for both a principal-components and common-factor approach, so that you can compare them (fa = "both"). There are several things to notice in this graph. If you’d taken a PCA approach, you might have chosen one component (scree test, parallel analysis) or two components Actual Data Simulated Data Actual Data Simulated Data 1.5 2.0 2.5 3.0 PC PC FA FA 0.5 1.0 Figure 14.4 Assessing the number of factors to retain for the psychological tests example. Results for both PCA and EFA are present. The PCA results suggest one or two components. The EFA results suggest two factors. 0.0 eigenvalues of principal components and factor analysis Scree plots with parallel analysis 1 2 3 4 5 Factor Number www.it-ebooks.info 6 332 CHAPTER 14 Principal components and factor analysis (eigenvalues greater than 1). When in doubt, it’s usually a better idea to overfactor than to underfactor. Overfactoring tends to lead to less distortion of the “true” solution. Looking at the EFA results, a two-factor solution is clearly indicated. The first two eigenvalues (triangles) are above the bend in the scree test and also above the mean eigenvalues based on 100 simulated data matrices. For EFA, the Kaiser–Harris criterion is number of eigenvalues above 0, rather than 1. (Most people don’t realize this, so it’s a good way to win bets at parties.) In the present case the Kaiser–Harris criteria also suggest two factors. 14.3.2 Extracting common factors Now that you’ve decided to extract two factors, you can use the fa() function to obtain your solution. The format of the fa() function is fa(r, nfactors=, n.obs=, rotate=, scores=, fm=) where ■ ■ ■ ■ ■ ■ r is a correlation matrix or a raw data matrix. nfactors specifies the number of factors to extract (1 by default). n.obs is the number of observations (if a correlation matrix is input). rotate indicates the rotation to be applied (oblimin by default). 
scores specifies whether or not to calculate factor scores (false by default). fm specifies the factoring method (minres by default). Unlike PCA, there are many methods of extracting common factors. They include maximum likelihood (ml), iterated principal axis (pa), weighted least square (wls), generalized weighted least squares (gls), and minimum residual (minres). Statisticians tend to prefer the maximum likelihood approach because of its well-defined statistical model. Sometimes, this approach fails to converge, in which case the iterated principal axis option often works well. To learn more about the different approaches, see Mulaik (2009) and Gorsuch (1983). For this example, you’ll extract the unrotated factors using the iterated principal axis (fm = "pa") approach. The results are given in the next listing. Listing 14.6 Principal axis factoring without rotation > fa <- fa(correlations, nfactors=2, rotate="none", fm="pa") > fa Factor Analysis using method = pa Call: fa(r = correlations, nfactors = 2, rotate = "none", fm = "pa") Standardized loadings based upon correlation matrix PA1 PA2 h2 u2 general 0.75 0.07 0.57 0.43 picture 0.52 0.32 0.38 0.62 blocks 0.75 0.52 0.83 0.17 maze 0.39 0.22 0.20 0.80 reading 0.81 -0.51 0.91 0.09 vocab 0.73 -0.39 0.69 0.31 www.it-ebooks.info Exploratory factor analysis 333 PA1 PA2 SS loadings 2.75 0.83 Proportion Var 0.46 0.14 Cumulative Var 0.46 0.60 [... additional output deleted ...] You can see that the two factors account for 60% of the variance in the six psychological tests. When you examine the loadings, though, they aren’t easy to interpret. Rotating them should help. 14.3.3 Rotating factors You can rotate the two-factor solution from section 14.3.4 using either an orthogonal rotation or an oblique rotation. Let’s try both so you can see how they differ. First try an orthogonal rotation (in the next listing). Listing 14.7 Factor extraction with orthogonal rotation > fa.varimax <- fa(correlations, nfactors=2, rotate="varimax", fm="pa") > fa.varimax Factor Analysis using method = pa Call: fa(r = correlations, nfactors = 2, rotate = "varimax", fm = "pa") Standardized loadings based upon correlation matrix PA1 PA2 h2 u2 general 0.49 0.57 0.57 0.43 picture 0.16 0.59 0.38 0.62 blocks 0.18 0.89 0.83 0.17 maze 0.13 0.43 0.20 0.80 reading 0.93 0.20 0.91 0.09 vocab 0.80 0.23 0.69 0.31 PA1 PA2 SS loadings 1.83 1.75 Proportion Var 0.30 0.29 Cumulative Var 0.30 0.60 [... additional output omitted ...] Looking at the factor loadings, the factors are certainly easier to interpret. Reading and vocabulary load on the first factor; and picture completion, block design, and mazes load on the second factor. The general nonverbal intelligence measure loads on both factors. This may indicate a verbal intelligence factor and a nonverbal intelligence factor. By using an orthogonal rotation, you artificially force the two factors to be uncorrelated. What would you find if you allowed the two factors to correlate? You can try an oblique rotation such as promax (see the next listing). 
Listing 14.8 Factor extraction with oblique rotation > fa.promax <- fa(correlations, nfactors=2, rotate="promax", fm="pa") > fa.promax Factor Analysis using method = pa www.it-ebooks.info 334 CHAPTER 14 Principal components and factor analysis Call: fa(r = correlations, nfactors = 2, rotate = "promax", fm = "pa") Standardized loadings based upon correlation matrix PA1 PA2 h2 u2 general 0.36 0.49 0.57 0.43 picture -0.04 0.64 0.38 0.62 blocks -0.12 0.98 0.83 0.17 maze -0.01 0.45 0.20 0.80 reading 1.01 -0.11 0.91 0.09 vocab 0.84 -0.02 0.69 0.31 PA1 PA2 SS loadings 1.82 1.76 Proportion Var 0.30 0.29 Cumulative Var 0.30 0.60 With factor correlations of PA1 PA2 PA1 1.00 0.57 PA2 0.57 1.00 [... additional output omitted ...] Several differences exist between the orthogonal and oblique solutions. In an orthogonal solution, attention focuses on the factor structure matrix (the correlations of the variables with the factors). In an oblique solution, there are three matrices to consider: the factor structure matrix, the factor pattern matrix, and the factor intercorrelation matrix. The factor pattern matrix is a matrix of standardized regression coefficients. They give the weights for predicting the variables from the factors. The factor intercorrelation matrix gives the correlations among the factors. In listing 14.8, the values in the PA1 and PA2 columns constitute the factor pattern matrix. They’re standardized regression coefficients rather than correlations. Examination of the columns of this matrix is still used to name the factors (although there’s some controversy here). Again, you’d find a verbal and nonverbal factor. The factor intercorrelation matrix indicates that the correlation between the two factors is 0.57. This is a hefty correlation. If the factor intercorrelations had been low, you might have gone back to an orthogonal solution to keep things simple. The factor structure matrix (or factor loading matrix) isn’t provided. But you can easily calculate it using the formula F = P*Phi, where F is the factor loading matrix, P is the factor pattern matrix, and Phi is the factor intercorrelation matrix. A simple function for carrying out the multiplication is as follows: fsm <- function(oblique) { if (class(oblique)[2]=="fa" & is.null(oblique$Phi)) { warning("Object doesn't look like oblique EFA") } else { P <- unclass(oblique$loading) F <- P %*% oblique$Phi colnames(F) <- c("PA1", "PA2") return(F) } } www.it-ebooks.info 335 Exploratory factor analysis 1.0 Factor Analysis PA2 picture general maze 0.0 0.2 0.4 0.6 0.8 blocks vocab reading 0.0 0.2 0.4 0.6 0.8 Figure 14.5 Two-factor plot for the psychological tests in ability.cov. vocab and reading load on the first factor (PA1), and blocks, picture, and maze load on the second factor (PA2). The general intelligence test loads on both. 1.0 PA1 Applying this to the example, you get > fsm(fa.promax) PA1 PA2 general 0.64 0.69 picture 0.33 0.61 blocks 0.44 0.91 maze 0.25 0.45 reading 0.95 0.47 vocab 0.83 0.46 Now you can review the correlations between the variables and the factors. Comparing them to the factor loading matrix in the orthogonal solution, you see that these columns aren’t as pure. This is because you’ve allowed the underlying factors to be correlated. Although the oblique approach is more complicated, it’s often a more realistic model of the data. You can graph an orthogonal or oblique solution using the factor.plot() or fa.diagram() function. 
The code factor.plot(fa.promax, labels=rownames(fa.promax$loadings)) produces the graph in figure 14.5. The code fa.diagram(fa.promax, simple=FALSE) produces the diagram in figure 14.6. If you let simple = TRUE, only the largest loading per item is displayed. It shows the largest loadings for each factor, as well as the www.it-ebooks.info 336 CHAPTER 14 Principal components and factor analysis Factor Analysis reading 1 vocab 0.8 PA1 blocks picture 0.6 0.4 1 0.6 general 0.5 PA2 0.5 maze Figure 14.6 Diagram of the oblique two-factor solution for the psychological test data in ability.cov correlations between the factors. This type of diagram is helpful when there are several factors. When you’re dealing with data in real life, it’s unlikely that you’d apply factor analysis to a dataset with so few variables. We’ve done it here to keep things manageable. If you’d like to test your skills, try factor-analyzing the 24 psychological tests contained in Harman74.cor. The code library(psych) fa.24tests <- fa(Harman74.cor$cov, nfactors=4, rotate="promax") should get you started! 14.3.4 Factor scores Compared with PCA, the goal of EFA is much less likely to be the calculation of factor scores. But these scores are easily obtained from the fa() function by including the score = TRUE option (when raw data are available). Additionally, the scoring coefficients (standardized regression weights) are available in the weights element of the object returned. For the ability.cov dataset, you can obtain the beta weights for calculating the factor score estimates for the two-factor oblique solution using > fa.promax$weights [,1] [,2] general 0.080 0.210 picture 0.021 0.090 blocks 0.044 0.695 maze 0.027 0.035 reading 0.739 0.044 vocab 0.176 0.039 Unlike component scores, which are calculated exactly, factor scores can only be estimated. Several methods exist. The fa() function uses the regression approach. To learn more about factor scores, see DiStefano, Zhu, and Mîndrila, (2009). ∩ www.it-ebooks.info Other latent variable models 337 Before moving on, let’s briefly review other R packages that are useful for exploratory factor analysis. 14.3.5 Other EFA-related packages R contains a number of other contributed packages that are useful for conducting factor analyses. The FactoMineR package provides methods for PCA and EFA, as well as other latent variable models. It provides many options that we haven’t considered here, including the use of both numeric and categorical variables. The FAiR package estimates factor analysis models using a genetic algorithm that permits the ability to impose inequality restrictions on model parameters. The GPArotation package offers many additional factor rotation methods. Finally, the nFactors package offers sophisticated techniques for determining the number of factors underlying data. 14.4 Other latent variable models EFA is only one of a wide range of latent variable models used in statistics. We’ll end this chapter with a brief description of other models that can be fit within R. These include models that test a priori theories, that can handle mixed data types (numeric and categorical), or that are based solely on categorical multiway tables. In EFA, you allow the data to determine the number of factors to be extracted and their meaning. But you could start with a theory about how many factors underlie a set of variables, how the variables load on those factors, and how the factors correlate with one another. You could then test this theory against a set of collected data. 
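For instance, the verbal/nonverbal structure suggested by the EFA above could be tested against the ability.cov data. The following is only a sketch, using the lavaan package (one of the packages discussed in the next paragraph); the factor names and model syntax are my own, not taken from the text:

library(lavaan)
model <- '
    verbal    =~ reading + vocab + general
    nonverbal =~ picture + blocks + maze + general
'
fit.cfa <- cfa(model, sample.cov=ability.cov$cov, sample.nobs=112)
summary(fit.cfa, fit.measures=TRUE)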
The approach is called confirmatory factor analysis (CFA). CFA is a subset of a methodology called structural equation modeling (SEM). SEM allows you to posit not only the number and composition of underlying factors but also how these factors impact one another. You can think of SEM as a combination of confirmatory factor analyses (for the variables) and regression analyses (for the factors). The resulting output includes statistical tests and fit indices. There are several excellent packages for CFA and SEM in R. They include sem, OpenMx, and lavaan. The ltm package can be used to fit latent models to the items contained in tests and questionnaires. The methodology is often used to create large-scale standardized tests. Examples include the Scholastic Aptitude Test (SAT) and the Graduate Record Exam (GRE). Latent class models (where the underlying factors are assumed to be categorical rather than continuous) can be fit with the FlexMix, lcmm, randomLCA, and poLCA packages. The lcda package performs latent class discriminant analysis, and the lsa package performs latent semantic analysis, a methodology used in natural language processing. The ca package provides functions for simple and multiple correspondence analysis. These methods allow you to explore the structure of categorical variables in twoway and multiway tables, respectively. Finally, R contains numerous methods for multidimensional scaling (MDS). MDS is designed to detect underlying dimensions that explain the similarities and distances www.it-ebooks.info 338 CHAPTER 14 Principal components and factor analysis between a set of measured objects (for example, countries). The cmdscale() function in the base installation performs a classical MDS, whereas the isoMDS() function in the MASS package performs a nonmetric MDS. The vegan package also contains functions for classical and nonmetric MDS. 14.5 Summary In this chapter, we reviewed methods for principal components (PCA) analysis and exploratory factor analysis (EFA). PCA is a useful data-reduction method that can replace a large number of correlated variables with a smaller number of uncorrelated variables, simplifying the analyses. EFA contains a broad range of methods for identifying latent or unobserved constructs (factors) that may underlie a set of observed or manifest variables. Whereas the goal of PCA is typically to summarize the data and reduce its dimensionality, EFA can be used as a hypothesis-generating tool, useful when you’re trying to Select Factor Model Common Factor Components Principal Components Maximum Likelihood Principal Axis Weighted Least Squares Minimum Residual Select Number of Components/Factors Scree Test Kaiser/Harris Parallel Analysis Theory Interpretability Variance Accounted Rotate Components/Factors Oblique Orthogonal Varimax Promax Other Orthogonal Rotation Other Oblique Rotation Interpret Components/Factors Calculate Factor Scores www.it-ebooks.info Figure 14.7 A principal components/ exploratory factor analysis decision chart Summary 339 understand the relationships between a large number of variables. It’s often used in the social sciences for theory development. Although there are many superficial similarities between the two approaches, important differences exist as well. 
In this chapter, we considered the models underlying each, methods for selecting the number of components/factors to extract, methods for extracting components/factors and rotating (transforming) them to enhance interpretability, and techniques for obtaining component or factor scores. The steps in a PCA or EFA are summarized in figure 14.7. We ended the chapter with a brief discussion of other latent variable methods available in R. In the next chapter, we’ll consider methods for working with time-series data. www.it-ebooks.info Time series This chapter covers ■ Creating a time series ■ Decomposing a time series into components ■ Developing predictive models ■ Forecasting future values How fast is global warming occurring, and what will the impact be in 10 years? With the exception of repeated measures ANOVA in section 9.6, each of the preceding chapters has focused on cross-sectional data. In a cross-sectional dataset, variables are measured at a single point in time. In contrast, longitudinal data involves measuring variables repeatedly over time. By following a phenomenon over time, it’s possible to learn a great deal about it. In this chapter, we’ll examine observations that have been recorded at regularly spaced time intervals for a given span of time. We can arrange observations such as these into a time series of the form Y1, Y2, Y3, … , Yt, …, YT, where Yt represents the value of Y at time t and T is the total number of observations in the series. Consider two very different time series displayed in figure 15.1. The series on the left contains the quarterly earnings (dollars) per Johnson & Johnson share between 1960 and 1980. There are 84 observations: one for each quarter over 21 340 www.it-ebooks.info 341 Time series 200 150 100 50 0 5 10 Mean monthly frequency 15 250 Sunspots 0 Quarterly earnings per share (dollars) Johnson & Johnson 1960 1970 1980 1750 Time 1850 1950 Time Figure 15.1 Time series plots for (a) Johnson & Johnson quarterly earnings per share (in dollars) from 1960 to 1980, and (b) the monthly mean relative sunspot numbers recorded from 1749 to 1983 years. The series on the right describes the monthly mean relative sunspot numbers from 1749 to 1983 recorded by the Swiss Federal Observatory and the Tokyo Astronomical Observatory. The sunspots time series is much longer, with 2,820 observations—1 per month for 235 years. Studies of time-series data involve two fundamental questions: what happened (description), and what will happen next (forecasting)? For the Johnson & Johnson data, you might ask ■ ■ ■ Is the price of Johnson & Johnson shares changing over time? Are there quarterly effects, with share prices rising and falling in a regular fashion throughout the year? Can you forecast what future share prices will be and, if so, to what degree of accuracy? For the sunspot data, you might ask ■ ■ ■ What statistical models best describe sunspot activity? Do some models fit the data better than others? Are the number of sunspots at a given time predictable and, if so, to what degree? The ability to accurately predict stock prices has relevance for my (hopefully) early retirement to a tropical island, whereas the ability to predict sunspot activity has relevance for my cell phone reception on said island. Predicting future values of a time series, or forecasting, is a fundamental human activity, and studies of time series data have important real-world applications. 
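Both series come with the base R installation, so you can take a first look at them before doing any modeling. A quick sketch along these lines (the axis labels are chosen here only for illustration) produces side-by-side plots similar to the two panels of figure 15.1:

opar <- par(no.readonly=TRUE)
par(mfrow=c(1, 2))                                            # two plots side by side
plot(JohnsonJohnson, ylab="Quarterly earnings per share ($)") # left panel
plot(sunspots, ylab="Mean monthly sunspot numbers")           # right panel
par(opar)                                                     # restore the layout

Plots like these are usually the first step in any applied time-series work.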
Economists use time-series data in an attempt to understand and predict what will happen in www.it-ebooks.info 342 CHAPTER 15 Time series financial markets. City planners use time-series data to predict future transportation demands. Climate scientists use time-series data to study global climate change. Corporations use time series to predict product demand and future sales. Healthcare officials use time-series data to study the spread of disease and to predict the number of future cases in a given region. Seismologists study times-series data in order to predict earthquakes. In each case, the study of historical time series is an indispensable part of the process. Because different approaches may work best with different types of time series, we’ll investigate many examples in this chapter. There is a wide range of methods for describing time-series data and forecasting future values. If you work with time-series data, you’ll find that R has some of the most comprehensive analytic capabilities available anywhere. This chapter explores some of the most common descriptive and forecasting approaches and the R functions used to perform them. The functions are listed in table 15.1 in their order of appearance in the chapter. Table 15.1 Functions for time-series analysis Function Package Use ts() stats Creates a time-series object. plot() graphics Plots a time series. start() stats Returns the starting time of a time series. end() stats Returns the ending time of a time series. frequency() stats Returns the period of a time series. window() stats Subsets a time-series object. ma() forecast Fits a simple moving-average model. stl() stats Decomposes a time series into seasonal, trend, and irregular components using loess. monthplot() stats Plots the seasonal components of a time series. seasonplot() forecast Generates a season plot. HoltWinters() stats Fits an exponential smoothing model. forecast() forecast Forecasts future values of a time series. accuracy() forecast Reports fit measures for a time-series model. ets() forecast Fits an exponential smoothing model. Includes the ability to automate the selection of a model. lag() stats Returns a lagged version of a time series. Acf() forecast Estimates the autocorrelation function. Pacf() forecast Estimates the partial autocorrelation function. diff() base Returns lagged and iterated differences. www.it-ebooks.info 343 Creating a time-series object in R Table 15.1 Functions for time-series analysis Function Package Use ndiffs() forecast Determines the level of differencing needed to remove trends in a time series. adf.test() tseries Computes an Augmented Dickey–Fuller test that a time series is stationary. arima() stats Fits autoregressive integrated moving-average models. Box.test() stats Computes a Ljung–Box test that the residuals of a time series are independent. bds.test() tseries Computes the BDS test that a series consists of independent, identically distributed random variables. auto.arima() forecast Automates the selection of an ARIMA model. Table 15.2 lists the time-series data that you’ll analyze. They’re available with the base installation of R. The datasets vary greatly in their characteristics and the models that fit them best. 
Table 15.2 Datasets used in this chapter Time series Description AirPassengers Monthly airline passenger numbers from 1949–1960 JohnsonJohnson Quarterly earnings per Johnson & Johnson share nhtemp Average yearly temperatures in New Haven, Connecticut, from 1912–1971 Nile Flow of the river Nile sunspots Monthly sunspot numbers from 1749–1983 We’ll start with methods for creating and manipulating time series, describing and plotting them, and decomposing them into level, trend, seasonal, and irregular (error) components. Then we’ll turn to forecasting, starting with popular exponential modeling approaches that use weighted averages of time-series values to predict future values. Next we’ll consider a set of forecasting techniques called autoregressive integrated moving averages (ARIMA) models that use correlations among recent data points and among recent prediction errors to make future forecasts. Throughout, we’ll consider methods of evaluating the fit of models and the accuracy of their predictions. The chapter ends with a description of resources available for learning more about these topics. 15.1 Creating a time-series object in R In order to work with a time series in R, you have to place it into a time-series object—an R structure that contains the observations, the starting and ending time of the series, www.it-ebooks.info 344 CHAPTER 15 Time series and information about its periodicity (for example, monthly, quarterly, or annual data). Once the data are in a time-series object, you can use numerous functions to manipulate, model, and plot the data. A vector of numbers, or a column in a data frame, can be saved as a time-series object using the ts() function. The format is myseries <- ts(data, start=, end=, frequency=) where myseries is the time-series object, data is a numeric vector containing the observations, start specifies the series start time, end specifies the end time (optional), and frequency indicates the number of observations per unit time (for example, frequency=1 for annual data, frequency=12 for monthly data, and frequency=4 for quarterly data). An example is given in the following listing. The data consist of monthly sales figures for two years, starting in January 2003. Listing 15.1 Creating a time-series object > sales <- c(18, 33, 41, 7, 34, 35, 24, 25, 24, 21, 25, 20, 22, 31, 40, 29, 25, 21, 22, 54, 31, 25, 26, 35) > tsales <- ts(sales, start=c(2003, 1), frequency=12) > tsales 2003 2004 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 18 33 41 7 34 35 24 25 24 21 25 20 22 31 40 29 25 21 22 54 31 25 26 35 > plot(tsales) > start(tsales) [1] 2003 1 c b Creates a time-series object Gets information about the object > end(tsales) [1] 2004 12 > frequency(tsales) Subsets the object [1] 12 d > tsales.subset <- window(tsales, start=c(2003, 5), end=c(2004, 6)) > tsales.subset 2003 2004 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 34 35 24 25 24 21 25 20 22 31 40 29 25 21 In this listing, the ts() function is used to create the time-series object b. Once it’s created, you can print and plot it; the plot is given in figure 15.2. You can modify the plot using the techniques described in chapter 3. For example, plot(tsales, type="o", pch=19) would create a time-series plot with connected, solid-filled circles. www.it-ebooks.info 345 30 Figure 15.2 Time-series plot for the sales data in listing 15.1. The decimal notation on the time dimension is used to represent the portion of a year. For example, 2003.5 represents July 1 (halfway through 2003). 
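The ts() function handles other periodicities in the same way. For instance, quarterly data are specified with frequency=4; the numbers below are invented purely for illustration:

qsales <- ts(c(110, 135, 150, 98, 120, 142, 160, 105),
             start=c(2003, 1), frequency=4)           # 8 quarters starting in Q1 2003
qsales
window(qsales, start=c(2003, 3), end=c(2004, 2))      # subset by quarter

Printing qsales labels the observations by quarter (Qtr1 through Qtr4) rather than by month.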
10 20 tsales 40 50 Smoothing and seasonal decomposition 2003.0 2003.5 2004.0 2004.5 Time Once you’ve created the time-series object, you can use functions like start(), end(), and frequency() to return its properties c. You can also use the window() function to create a new time series that’s a subset of the original d. 15.2 Smoothing and seasonal decomposition Just as analysts explore a dataset with descriptive statistics and graphs before attempting to model the data, describing a time series numerically and visually should be the first step before attempting to build complex models. In this section, we’ll look at smoothing a time series to clarify its general trend, and decomposing a time series in order to observe any seasonal effects. 15.2.1 Smoothing with simple moving averages The first step when investigating a time series is to plot it, as in listing 15.1. Consider the Nile time series. It records the annual flow of the river Nile at Ashwan from 1871– 1970. A plot of the series can be seen in the upper-left panel of figure 15.3. The time series appears to be decreasing, but there is a great deal of variation from year to year. Time series typically have a significant irregular or error component. In order to discern any patterns in the data, you’ll frequently want to plot a smoothed curve that damps down these fluctuations. One of the simplest methods of smoothing a time series is to use simple moving averages. For example, each data point can be replaced with the mean of that observation and one observation before and after it. This is called a centered moving average. A centered moving average is defined as St = (Yt-q + … + Yt + … + Yt+q) / (2q + 1) where St is the smoothed value at time t and k = 2q + 1 is the number of observations that are averaged. The k value is usually chosen to be an odd number (3 in this example). By necessity, when using a centered moving average, you lose the (k – 1) / 2 observations at each end of the series. www.it-ebooks.info 346 Time series CHAPTER 15 Several functions in R can provide a simple moving average, including SMA() in the TTR package, rollmean() in the zoo package, and ma() in the forecast package. Here, you’ll use the ma() function to smooth the Nile time series that comes with the base R installation. The code in the next listing plots the raw time series and smoothed versions using k equal to 3, 7, and 15. The plots are given in figure 15.3. Listing 15.2 Simple moving averages library(forecast) opar <- par(no.readonly=TRUE) par(mfrow=c(2,2)) ylim <- c(min(Nile), max(Nile)) plot(Nile, main="Raw time series") plot(ma(Nile, 3), main="Simple Moving Averages (k=3)", ylim=ylim) plot(ma(Nile, 7), main="Simple Moving Averages (k=7)", ylim=ylim) plot(ma(Nile, 15), main="Simple Moving Averages (k=15)", ylim=ylim) par(opar) As k increases, the plot becomes increasingly smoothed. The challenge is to find the value of k that highlights the major patterns in the data, without under- or oversmoothing. This is more art than science, and you’ll probably want to try several values of k before settling on one. From the plots in figure 15.3, there certainly appears to have been a drop in river flow between 1892 and 1900. Other changes are open to interpretation. For example, there may have been a small increasing trend between 1941 and 1961, but this could also have been a random variation. For time-series data with a periodicity greater than one (that is, with a seasonal component), you’ll want to go beyond a description of the overall trend. 
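Before moving on to seasonal decomposition, note that the centered moving average defined earlier is easy to verify by hand. The filter() function in the stats package (not otherwise used in this chapter) applies exactly this kind of weighted average, and for an odd k its result should agree with ma():

k <- 3
sma1 <- stats::filter(Nile, rep(1/k, k), sides=2)   # centered moving average by hand
library(forecast)
sma2 <- ma(Nile, k)                                 # the same smoother via ma()
head(cbind(filter=sma1, ma=sma2), 4)                # the two columns agree

The NA in the first (and last) position reflects the (k – 1)/2 observations lost at each end of the series.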
Seasonal decomposition can be used to examine both seasonal and general trends. Simple Moving Averages (k=3) 1400 1880 1920 1960 1880 1920 1960 Simple Moving Averages (k=7) Simple Moving Averages (k=15) 600 1000 ma(Nile, 15) 1000 1400 Time 1400 Time 600 ma(Nile, 7) 1000 ma(Nile, 3) 600 1000 600 Nile 1400 Raw time series 1880 1920 Time 1960 1880 1920 Time www.it-ebooks.info 1960 Figure 15.3 The Nile time series measuring annual river flow at Ashwan from 1871–1970 (upper left). The other plots are smoothed versions using simple moving averages at three smoothing levels (k=3, 7, and 15). 347 Smoothing and seasonal decomposition 15.2.2 Seasonal decomposition Time-series data that have a seasonal aspect (such as monthly or quarterly data) can be decomposed into a trend component, a seasonal component, and an irregular component. The trend component captures changes in level over time. The seasonal component captures cyclical effects due to the time of year. The irregular (or error) component captures those influences not described by the trend and seasonal effects. The decomposition can be additive or multiplicative. In an additive model, the components sum to give the values of the time series. Specifically, Yt = Trendt + Seasonalt + Irregulart where the observation at time t is the sum of the contributions of the trend at time t, the seasonal effect at time t, and an irregular effect at time t. In a multiplicative model, given by the equation Yt = Trendt * Seasonalt * Irregulart the trend, seasonal, and irregular influences are multiplied. Examples are given in figure 15.4. (a) Stationary 700 Y 300 500 600 550 Y 650 900 (b) Additive Trend and Irregular Components 2000 2002 2004 2006 2008 2010 2000 2002 Time 2004 2006 2008 2010 Time (d) Additive Trend, Seasonal, and Irregular Components Y 600 200 500 Y 600 700 (c) Additive Seasonal and Irregular Components 2000 2002 2004 2006 2008 2010 2000 Time 2002 2004 2006 2008 2010 Time Figure 15.4 Time-series examples consisting of different combinations of trend, seasonal, and irregular components 200 Y 600 1000 (e) Multiplicative Trend, Seasonal, and Irregular Components 2000 2002 2004 2006 2008 2010 Time www.it-ebooks.info 348 CHAPTER 15 Time series In the first plot (a), there is neither a trend nor a seasonal component. The only influence is a random fluctuation around a given level. In the second plot (b), there is an upward trend over time, as well as random fluctuations. In the third plot (c), there are seasonal effects and random fluctuations, but no overall trend away from a horizontal line. In the fourth plot (d), all three components are present: an upward trend, seasonal effects, and random fluctuations. You also see all three components in the final plot (e), but here they combine in a multiplicative way. Notice how the variability is proportional to the level: as the level increases, so does the variability. This amplification (or possible damping) based on the current level of the series strongly suggests a multiplicative model. An example may make the difference between additive and multiplicative models clearer. Consider a time series that records the monthly sales of motorcycles over a 10year period. In a model with an additive seasonal effect, the number of motorcycles sold tends to increase by 500 in November and December (due to the Christmas rush) and decrease by 200 in January (when sales tend to be down). The seasonal increase or decrease is independent of the current sales volume. 
In a model with a multiplicative seasonal effect, motorcycle sales in November and December tend to increase by 20% and decrease in January by 10%. In the multiplicative case, the impact of the seasonal effect is proportional to the current sales volume. This isn’t the case in an additive model. In many instances, the multiplicative model is more realistic. A popular method for decomposing a time series into trend, seasonal, and irregular components is seasonal decomposition by loess smoothing. In R, this can be accomplished with the stl() function. The format is stl(ts, s.window=, t.window=) where ts is the time series to be decomposed, s.window controls how fast the seasonal effects can change over time, and t.window controls how fast the trend can change over time. Smaller values allow more rapid change. Setting s.window="periodic" forces seasonal effects to be identical across years. Only the ts and s.window parameters are required. See help(stl) for details. The stl() function can only handle additive models, but this isn’t a serious limitation. Multiplicative models can be transformed into additive models using a log transformation: log(Yt) = log(Trendt * Seasonalt * Irregulart) = log(Trendt) + log(Seasonalt) + log(Irregulart) After fitting the additive model to the log transformed series, the results can be backtransformed to the original scale. Let’s look at an example. The time series AirPassengers comes with a base R installation and describes the monthly totals (in thousands) of international airline passengers between 1949 and 1960. A plot of the data is given in the top of figure 15.5. From the graph, it appears that variability of the series increases with the level, suggesting a multiplicative model. www.it-ebooks.info 349 Figure 15.5 1950 1952 1954 1956 1958 1960 1950 1952 1954 1956 1958 1960 5.0 5.5 6.0 log(AirPassengers) 6.5 AirPassengers 100 200 300 400 500 600 Smoothing and seasonal decomposition Plot of the AirPassengers time series (top). The time series contains the monthly totals (in thousands) of international airline passengers between 1949 and 1960. The logtransformed time series (bottom) stabilizes the variance and fits an additive seasonal decomposition model better. Time The plot in the lower portion of figure 15.5 displays the time series created by taking the log of each observation. The variance has stabilized, and the logged series looks like an appropriate candidate for an additive decomposition. This is carried out using the stl() function in the following listing. Listing 15.3 Seasonal decomposition using stl() > plot(AirPassengers) > lAirPassengers <- log(AirPassengers) > plot(lAirPassengers, ylab="log(AirPassengers)") > fit <- stl(lAirPassengers, s.window="period") > plot(fit) > fit$time.series Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov 1949 1949 1949 1949 1949 1949 1949 1949 1949 1949 1949 seasonal -0.09164 -0.11403 0.01587 -0.01403 -0.01502 0.10979 0.21640 0.20961 0.06747 -0.07025 -0.21353 trend 4.829 4.830 4.831 4.833 4.835 4.838 4.841 4.843 4.846 4.851 4.856 remainder -0.0192494 0.0543448 0.0355884 0.0404633 -0.0245905 -0.0426814 -0.0601152 -0.0558625 -0.0008274 -0.0015113 0.0021631 d www.it-ebooks.info b Plots the time series c Decomposes the time series Components for each observation 350 CHAPTER 15 Dec 1949 -0.10064 4.865 ... output omitted ... Time series 0.0067347 > exp(fit$time.series) Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ... 
seasonal trend remainder 1949 0.9124 125.1 0.9809 1949 0.8922 125.3 1.0558 1949 1.0160 125.4 1.0362 1949 0.9861 125.6 1.0413 1949 0.9851 125.9 0.9757 1949 1.1160 126.2 0.9582 1949 1.2416 126.6 0.9417 1949 1.2332 126.9 0.9457 1949 1.0698 127.2 0.9992 1949 0.9322 127.9 0.9985 1949 0.8077 128.5 1.0022 1949 0.9043 129.6 1.0068 output omitted ... 5.5 5.6 −0.10 0.00 irregular 4.8 5.2 trend 6.0 −0.2 0.0 seasonal 0.1 0.2 5.0 data 6.0 6.5 First, the time series is plotted and transformed b. A seasonal decomposition is performed and saved in an object called fit c. Plotting the results gives the graph in figure 15.6. The graph shows the time series, seasonal, trend, and irregular components from 1949 to 1960. Note that the seasonal components have been constrained to 1950 1952 1954 1956 1958 1960 time Figure 15.6 A seasonal decomposition of the logged AirPassengers time series using the stl() function. The time series (data) is decomposed into seasonal, trend, and irregular components. www.it-ebooks.info 351 Smoothing and seasonal decomposition remain the same across each year (using the s.window="period" option). The trend is monotonically increasing, and the seasonal effect suggests more passengers in the summer (perhaps during vacations). The grey bars on the right are magnitude guides—each bar represents the same magnitude. This is useful because the y-axes are different for each graph. The object returned by the stl() function contains a component called time.series that contains the trend, season, and irregular portion of each observation d. In this case, fit$time.series is based on the logged time series. exp(fit$time.series) converts the decomposition back to the original metric. Examining the seasonal effects suggests that the number of passengers increased by 24% in July (a multiplier of 1.24) and decreased by 20% in November (with a multiplier of .80). Two additional graphs can help to visualize a seasonal decomposition. They’re created by the monthplot() function that comes with base R and the seasonplot() function provided in the forecast package. The code par(mfrow=c(2,1)) library(forecast) monthplot(AirPassengers, xlab="", ylab="") seasonplot(AirPassengers, year.labels="TRUE", main="") 100 200 300 400 500 600 produces the graphs in figure 15.7. F M A M J J A S O N D 500 600 J 400 1960 1959 100 200 300 1958 1957 1956 1955 1954 1953 1952 1951 1950 1949 Jan Feb Mar Apr May Jun Jul Aug Sep Oct www.it-ebooks.info Nov Dec Figure 15.7 A month plot (top) and season plot (bottom) for the AirPassengers time series. Each shows an increasing trend and similar seasonal pattern year to year. 352 CHAPTER 15 Time series The month plot (top figure) displays the subseries for each month (all January values connected, all February values connected, and so on), along with the average of each subseries. From this graph, it appears that the trend is increasing for each month in a roughly uniform way. Additionally, the greatest number of passengers occurs in July and August. The season plot (lower figure) displays the subseries by year. Again you see a similar pattern, with increases in passengers each year, and the same seasonal pattern. Note that although you’ve described the time series, you haven’t predicted any future values. In the next section, we’ll consider the use of exponential models for forecasting beyond the available data. 15.3 Exponential forecasting models Exponential models are some of the most popular approaches to forecasting the future values of a time series. 
They’re simpler than many other types of models, but they can yield good short-term predictions in a wide range of applications. They differ from each other in the components of the time series that are modeled. A simple exponential model (also called a single exponential model) fits a time series that has a constant level and an irregular component at time i but has neither a trend nor a seasonal component. A double exponential model (also called a Holt exponential smoothing) fits a time series with both a level and a trend. Finally, a triple exponential model (also called a Holt-Winters exponential smoothing) fits a time series with level, trend, and seasonal components. Exponential models can be fit with either the HoltWinters() function in the base installation or the ets() function that comes with the forecast package. The ets() function has more options and is generally more powerful. We’ll focus on the ets() function in this section. The format of the ets() function is ets(ts, model="ZZZ") where ts is a time series and the model is specified by three letters. The first letter denotes the error type, the second letter denotes the trend type, and the third letter denotes the seasonal type. Allowable letters are A for additive, M for multiplicative, N for none, and Z for automatically selected. Examples of common models are given in table 15.3. Table 15.3 Type Functions for fitting simple, double, and triple exponential forecasting models Parameters fit Functions simple level ets(ts, model="ANN") ses(ts) double level, slope ets(ts, model="AAN") holt(ts) triple level, slope, seasonal ets(ts, model="AAA") hw(ts) www.it-ebooks.info 353 Exponential forecasting models The ses(), holt(), and hw() functions are convenience wrappers to the ets() function with prespecified defaults. First we’ll look at the most basic exponential model: simple exponential smoothing. Be sure to install the forecast package (install.packages("forecast")) before proceeding. 15.3.1 Simple exponential smoothing Simple exponential smoothing uses a weighted average of existing time-series values to make a short-term prediction of future values. The weights are chosen so that observations have an exponentially decreasing impact on the average as you go back in time. The simple exponential smoothing model assumes that an observation in the time series can be described by Yt = level + irregulart The prediction at time Yt+1 (called the 1-step ahead forecast) is written as Yt+1 = c 0Yt + c 1Yt−1 + c 2Yt−2 + c 2Yt−2 + ... where ci = α(1−α)i, i = 0, 1, 2, ... and 0 ≤ α ≤ 1. The ci weights sum to one, and the 1-step ahead forecast can be seen to be a weighted average of the current value and all past values of the time series. The alpha (α) parameter controls the rate of decay for the weights. The closer alpha is to 1, the more weight is given to recent observations. The closer alpha is to 0, the more weight is given to past observations. The actual value of alpha is usually chosen by computer in order to optimize a fit criterion. A common fit criterion is the sum of squared errors between the actual and predicted values. An example will help clarify these ideas. The nhtemp time series contains the mean annual temperature in degrees Fahrenheit in New Haven, Connecticut, from 1912 to 1971. A plot of the time series can be seen as the line in figure 15.8. There is no obvious trend, and the yearly data lack a seasonal component, so the simple exponential model is a reasonable place to start. 
The code for making a 1-step ahead forecast using the ses() function is given next. Listing 15.4 Simple exponential smoothing > library(forecast) > fit <- ets(nhtemp, model="ANN") > fit ETS(A,N,N) Call: ets(y = nhtemp, model = "ANN") Smoothing parameters: alpha = 0.182 Initial states: l = 50.2759 www.it-ebooks.info b Fits the model 354 CHAPTER 15 sigma: Time series 1.126 AIC AICc BIC 263.9 264.1 268.1 c 1-step ahead forecast > forecast(fit, 1) 1972 Point Forecast Lo 80 Hi 80 Lo 95 Hi 95 51.87 50.43 53.31 49.66 54.08 > plot(forecast(fit, 1), xlab="Year", ylab=expression(paste("Temperature (", degree*F,")",)), main="New Haven Annual Mean Temperature") > accuracy(fit) d ME RMSE MAE MPE MAPE MASE Training set 0.146 1.126 0.8951 0.2419 1.749 0.9228 Prints accuracy measures The ets(mode="ANN") statement fits the simple exponential model to the nhtemp time series b. The A indicates that the errors are additive, and the NN indicates that there is no trend and no seasonal component. The relatively low value of alpha (0.18) indicates that distant as well as recent observations are being considered in the forecast. This value is automatically chosen to maximize the fit of the model to the given dataset. The forecast() function is used to predict the time series k steps into the future. The format is forecast(fit, k). The 1-step ahead forecast for this series is 51.9°F with a 95% confidence interval (49.7°F to 54.1°F) c. The time series, the forecasted value, and the 80% and 95% confidence intervals are plotted in figure 15.8 d. 52 51 50 48 49 Temperature (°F) 53 54 New Haven Annual Mean Temperature 1910 1920 1930 1940 1950 1960 1970 Year Figure 15.8 Average yearly temperatures in New Haven, Connecticut; and a 1-step ahead prediction from a simple exponential forecast using the ets() function www.it-ebooks.info 355 Exponential forecasting models The forecast package also provides an accuracy() function that displays the most popular predictive accuracy measures for time-series forecasts d. A description of each is given in table 15.4. The et represent the error or irregular component of each observation (Yt − Yi ). Table 15.4 Predictive accuracy measures Measure Abbreviation Definition Mean error ME mean( et ) Root mean squared error RMSE sqrt( mean( e2t ) ) Mean absolute error MAE mean( | et | ) Mean percentage error MPE mean( 100 * et / Yt ) Mean absolute percentage error MAPE mean( | 100 * et / Yt | ) Mean absolute scaled error MASE mean( | qt | ) where qt = et / ( 1/(T-1) * sum( | yt – yt-1| ) ), T is the number of observations, and the sum goes from t=2 to t=T The mean error and mean percentage error may not be that useful, because positive and negative errors can cancel out. The RMSE gives the square root of the mean square error, which in this case is 1.13°F. The mean absolute percentage error reports the error as a percentage of the time-series values. It’s unit-less and can be used to compare prediction accuracy across time series. But it assumes a measurement scale with a true zero point (for example, number of passengers per day). Because the Fahrenheit scale has no true zero, you can’t use it here. The mean absolute scaled error is the most recent accuracy measure and is used to compare the forecast accuracy across time series on different scales. There is no one best measure of predictive accuracy. The RMSE is certainly the best known and often cited. Simple exponential smoothing assumes the absence of trend or seasonal components. 
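As a quick check on what accuracy() reports, the first several measures in table 15.4 can be reproduced directly from the model's residuals. This is only a sketch, and it assumes that residuals() returns the forecast errors et for this additive-error model; the results should match the accuracy() output above up to rounding:

library(forecast)
fit <- ets(nhtemp, model="ANN")
e <- residuals(fit)                        # the errors et (observed minus fitted)
c(ME   = mean(e),
  RMSE = sqrt(mean(e^2)),
  MAE  = mean(abs(e)),
  MPE  = mean(100 * e / nhtemp),
  MAPE = mean(abs(100 * e / nhtemp)))

None of this changes the basic limitation just noted: the simple model ignores trend and seasonal structure.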
The next section considers exponential models that can accommodate both. 15.3.2 Holt and Holt-Winters exponential smoothing The Holt exponential smoothing approach can fit a time series that has an overall level and a trend (slope). The model for an observation at time t is Yt = level + slope*t + irregulart An alpha smoothing parameter controls the exponential decay for the level, and a beta smoothing parameter controls the exponential decay for the slope. Again, each parameter ranges from 0 to 1, with larger values giving more weight to recent observations. The Holt-Winters exponential smoothing approach can be used to fit a time series that has an overall level, a trend, and a seasonal component. Here, the model is Yt = level + slope*t + st + irregulart www.it-ebooks.info 356 CHAPTER 15 Time series where st represents the seasonal influence at time t. In addition to alpha and beta parameters, a gamma smoothing parameter controls the exponential decay of the seasonal component. Like the others, it ranges from 0 to 1, and larger values give more weight to recent observations in calculating the seasonal effect. In section 15.2, you decomposed a time series describing the monthly totals (in log thousands) of international airline passengers into additive trend, seasonal, and irregular components. Let’s use an exponential model to predict future travel. Again, you’ll use log values so that an additive model fits the data. The code in the following listing applies the Holt-Winters exponential smoothing approach to predicting the next five values of the AirPassengers time series. Listing 15.5 Exponential smoothing with level, slope, and seasonal components > library(forecast) > fit <- ets(log(AirPassengers), model="AAA") > fit ETS(A,A,A) Call: ets(y = log(AirPassengers), model = "AAA") Smoothing alpha = beta = gamma = parameters: 0.8528 4e-04 0.0121 b Smoothing parameters Initial states: l = 4.8362 b = 0.0097 s=-0.1137 -0.2251 -0.0756 0.0623 0.2079 0.2222 0.1235 -0.009 0 0.0203 -0.1203 -0.0925 sigma: 0.0367 AIC AICc BIC -204.1 -199.8 -156.5 >accuracy(fit) ME RMSE MAE MPE MAPE MASE Training set -0.0003695 0.03672 0.02835 -0.007882 0.5206 0.07532 > pred <- forecast(fit, > pred Point Forecast Jan 1961 6.101 Feb 1961 6.084 Mar 1961 6.233 Apr 1961 6.222 May 1961 6.225 5) Lo 80 6.054 6.022 6.159 6.138 6.131 Hi 80 6.148 6.146 6.307 6.306 6.318 Lo 95 6.029 5.989 6.120 6.093 6.082 Hi 95 6.173 6.179 6.346 6.350 6.367 > plot(pred, main="Forecast for Air Travel", ylab="Log(AirPassengers)", xlab="Time") www.it-ebooks.info c Future forecasts 357 Exponential forecasting models > > > > > > pred$mean <- exp(pred$mean) Makes forecasts in pred$lower <- exp(pred$lower) the original scale pred$upper <- exp(pred$upper) p <- cbind(pred$mean, pred$lower, pred$upper) dimnames(p)[[2]] <- c("mean", "Lo 80", "Lo 95", "Hi 80", "Hi 95") p Jan Feb Mar Apr May d 1961 1961 1961 1961 1961 mean 446.3 438.8 509.2 503.6 505.0 Lo 80 425.8 412.5 473.0 463.0 460.1 Lo 95 415.3 399.2 454.9 442.9 437.9 Hi 80 467.8 466.8 548.2 547.7 554.3 Hi 95 479.6 482.3 570.0 572.6 582.3 The smoothing parameters for the level (.82), trend (.0004), and seasonal components (.012) are given in b. The low value for the trend (.0004) doesn’t mean there is no slope; it indicates that the slope estimated from early observations didn’t need to be updated. The forecast() function produces forecasts for the next five months c and is plotted in figure 15.9. 
Because the predictions are on a log scale, exponentiation is used to get the predictions in the original metric: numbers (in thousands) of passengers d. The matrix pred$mean contains the point forecasts, and the matrices pred$lower and pred$upper contain the 80% and 95% lower and upper confidence limits, respectively. The exp() function is used to return the predictions to the original scale, and cbind() creates a single table. Thus the model predicts 509,200 passengers in March, with a 95% confidence band ranging from 454,900 to 570,000. 6.0 5.5 5.0 Log(AirPassengers) 6.5 Forecast for Air Travel 1950 1952 1954 1956 1958 1960 Time Figure 15.9 Five-year forecast of log(number of international airline passengers in thousands) based on a Holt-Winters exponential smoothing model. Data are from the AirPassengers time series. www.it-ebooks.info 358 CHAPTER 15 Time series 15.3.3 The ets() function and automated forecasting The ets() function has additional capabilities. You can use it to fit exponential models that have multiplicative components, add a dampening component, and perform automated forecasts. Let’s consider each in turn. In the previous section, you fit an additive exponential model to the log of the AirPassengers time series. Alternatively, you could fit a multiplicative model to the original data. The function call would be either ets(AirPassengers, model="MAM") or the equivalent hw(AirPassengers, seasonal="multiplicative"). The trend remains additive, but the seasonal and irregular components are assumed to be multiplicative. By using a multiplicative model in this case, the accuracy statistics and forecasted values are reported in the original metric (thousands of passengers)—a decided advantage. The ets() function can also fit a damping component. Time-series predictions often assume that a trend will continue up forever (housing market, anyone?). A damping component forces the trend to a horizontal asymptote over a period of time. In many cases, a damped model makes more realistic predictions. Finally, you can invoke the ets() function to automatically select a best-fitting model for the data. Let’s fit an automated exponential model to the Johnson & Johnson data described in the introduction to this chapter. The following code allows the software to select a best-fitting model. Listing 15.6 Automatic exponential forecasting with ets() > library(forecast) > fit <- ets(JohnsonJohnson) > fit ETS(M,M,M) Call: ets(y = JohnsonJohnson) Smoothing alpha = beta = gamma = parameters: 0.2328 0.0367 0.5261 Initial states: l = 0.625 b = 1.0286 s=0.6916 1.2639 0.9724 1.0721 sigma: 0.0863 AIC AICc BIC 162.4737 164.3937 181.9203 > plot(forecast(fit), main="Johnson & Johnson Forecasts", ylab="Quarterly Earnings (Dollars)", xlab="Time", flty=2) www.it-ebooks.info ARIMA forecasting models 359 20 15 10 Figure 15.10 Multiplicative exponential smoothing forecast with trend and seasonal components. The forecasts are a dashed line, and the 80% and 95% confidence intervals are provided in light and dark gray, respectively. 0 5 Quarterly Earnings (Dollars) 25 Johnson & Johnson Forecasts 1960 1965 1970 1975 1980 Time Because no model is specified, the software performs a search over a wide array of models to find one that minimizes the fit criterion (log-likelihood by default). The selected model is one that has multiplicative trend, seasonal, and error components. The plot, along with forecasts for the next eight quarters (the default in this case), is given in figure 15.10. 
The flty parameter sets the line type for the forecast line (dashed in this case). As stated earlier, exponential time-series modeling is popular because it can give good short-term forecasts in many situations. A second approach that is also popular is the Box-Jenkins methodology, commonly referred to as ARIMA models. These are described in the next section. 15.4 ARIMA forecasting models In the autoregressive integrated moving average (ARIMA) approach to forecasting, predicted values are a linear function of recent actual values and recent errors of prediction (residuals). ARIMA is a complex approach to forecasting. In this section, we’ll limit discussion to ARIMA models for non-seasonal time series. Before describing ARIMA models, a number of terms need to be defined, including lags, autocorrelation, partial autocorrelation, differencing, and stationarity. Each is considered in the next section. 15.4.1 Prerequisite concepts When you lag a time series, you shift it back by a given number of observations. Consider the first few observations from the Nile time series, displayed in table 15.5. Lag 0 www.it-ebooks.info 360 CHAPTER 15 Time series is the unshifted time series. Lag 1 is the time series shifted one position to the left. Lag 2 shifts the time series two positions to the left, and so on. Time series can be lagged using the function lag(ts,k), where ts is the time series and k is the number of lags. Table 15.5 Lag The Nile time series at various lags 1869 1870 1871 1872 1873 1874 1875 … 1120 1160 963 1210 1160 … 1120 1160 963 1210 1160 1160 … 1160 963 1210 1160 1160 813 … 0 1 2 1120 Autocorrelation measures the way observations in a time series relate to each other. ACk is the correlation between a set of observations (Yt) and observations k periods earlier (Yt-k). So AC1 is the correlation between the Lag 1 and Lag 0 time series, AC2 is the correlation between the Lag 2 and Lag 0 time series, and so on. Plotting these correlations (AC1, AC2, …, ACk) produces an autocorrelation function (ACF) plot. The ACF plot is used to select appropriate parameters for the ARIMA model and to assess the fit of the final model. An ACF plot can be produced with the acf() function in the stats package or the Acf() function in the forecast package. Here, the Acf() function is used because it produces a plot that is somewhat easier to read. The format is Acf(ts), where ts is the original time series. The ACF plot for the Nile time series, with k=1 to 18, is provided a little later, in the top half of figure 15.12. A partial autocorrelation is the correlation between Yt and Yt-k with the effects of all Y values between the two (Yt-1, Yt-2, …, Yt-k+1) removed. Partial autocorrelations can also be plotted for multiple values of k. The PACF plot can be generated with either the pacf() function in the stats package or the Pacf() function in the forecast package. Again, the Pacf() function is preferred due to its formatting. The function call is Pacf(ts), where ts is the time series to be assessed. The PACF plot is also used to determine the most appropriate parameters for the ARIMA model. The results for the Nile time series are given in the bottom half of figure 15.12. ARIMA models are designed to fit stationary time series (or time series that can be made stationary). In a stationary time series, the statistical properties of the series don’t change over time. For example, the mean and variance of Yt are constant. Additionally, the autocorrelations for any lag k don’t change with time. 
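The lag and autocorrelation ideas are easy to verify directly. For example, AC1 for the Nile series is simply the correlation between the lag 0 and lag 1 versions of the series; the sketch below compares a hand calculation with the value reported by acf(). The two differ slightly because acf() centers both sets of values on the overall mean and divides by n rather than n – 1:

y <- as.numeric(Nile)
n <- length(y)
r1_by_hand <- cor(y[-1], y[-n])                      # correlate Yt with Yt-1
r1_acf <- acf(Nile, lag.max=1, plot=FALSE)$acf[2]    # lag-1 autocorrelation from acf()
round(c(by_hand = r1_by_hand, acf = r1_acf), 3)

In a stationary series, this kind of autocorrelation pattern, along with the mean and variance, stays constant over time.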
It may be necessary to transform the values of a time series in order to achieve constant variance before proceeding to fitting an ARIMA model. The log transformation is often useful here, as you saw in section 15.1.3. Other transformations, such as the BoxCox transformation described in section 8.5.2, may also be helpful. Because stationary time series are assumed to have constant means, they can’t have a trend component. Many non-stationary time series can be made stationary through www.it-ebooks.info ARIMA forecasting models 361 differencing. In differencing, each value of a time series Yt is replaced with Yt-1 – Yt. Differencing a time series once removes a linear trend. Differencing it a second time removes a quadratic trend. A third time removes a cubic trend. It’s rarely necessary to difference more than twice. You can difference a time series with the diff() function. The format is diff(ts, differences=d), where d indicates the number of times the time series ts is differenced. The default is d=1. The ndiffs() function in the forecast package can be used to help determine the best value of d. The format is ndiffs(ts). Stationarity is often evaluated with a visual inspection of a time-series plot. If the variance isn’t constant, the data are transformed. If there are trends, the data are differenced. You can also use a statistical procedure called the Augmented Dickey-Fuller (ADF) test to evaluate the assumption of stationarity. In R, the function adf.test() in the tseries package performs the test. The format is adf.test(ts), where ts is the time series to be evaluated. A significant result suggests stationarity. To summarize, ACF and PCF plots are used to determine the parameters of ARIMA models. Stationarity is an important assumption, and transformations and differencing are used to help achieve stationarity. With these concepts in hand, we can now turn to fitting models with an autoregressive (AR) component, a moving averages (MA) component, or both components (ARMA). Finally, we’ll examine ARIMA models that include ARMA components and differencing to achieve stationarity (Integration). 15.4.2 ARMA and ARIMA models In an autoregressive model of order p, each value in a time series is predicted from a linear combination of the previous p values AR(p):Yt = μ + β 1Yt−1 + β 2Yt−2 + ... + β pYt−p + ε t where Yt is a given value of the series, µ is the mean of the series, the β s are the weights, and ε t is the irregular component. In a moving average model of order q, each value in the time series is predicted from a linear combination of q previous errors. In this case MA(q):Yt = μ − θ 1ε t−1 − θ2ε t−2 ... − θqε t−q + ε t where the ε s are the errors of prediction and the θ s are the weights. (It’s important to note that the moving averages described here aren’t the simple moving averages described in section 15.1.2.) Combining the two approaches yields an ARMA(p, q) model of the form Yt = μ + β 1Yt−1 + β 2Yt−2 + ... + β pYt−p − θ 1ε t−1 − θ2ε t−2 ... − θqε t−q + ε t that predicts each value of the time series from the past p values and q residuals. An ARIMA(p, d, q) model is a model in which the time series has been differenced d times, and the resulting values are predicted from the previous p actual values and q www.it-ebooks.info 362 CHAPTER 15 Time series previous errors. The predictions are “un-differenced” or integrated to achieve the final prediction. The steps in ARIMA modeling are as follows: 1 2 3 4 5 Ensure that the time series is stationary. 
Identify a reasonable model or models (possible values of p and q). Fit the model. Evaluate the model’s fit, including statistical assumptions and predictive accuracy. Make forecasts. Let’s apply each step in turn to fit an ARIMA model to the Nile time series. ENSURING THAT THE TIME SERIES IS STATIONARY 1000 600 800 Nile 1400 First you plot the time series and assess its stationarity (see listing 15.7 and the top half of figure 15.11). The variance appears to be stable across the years observed, so there’s no need for a transformation. There may be a trend, which is supported by the results of the ndiffs() function. 1900 1920 1940 1960 1900 1920 1940 1960 0 −400 −200 diff(Nile) 200 400 1880 1880 Time Figure 15.11 Time series displaying the annual flow of the river Nile at Ashwan from 1871 to 1970 (top) along with the times series differenced once (bottom). The differencing removes the decreasing trend evident in the original plot. www.it-ebooks.info 363 ARIMA forecasting models Listing 15.7 Transforming the time series and assessing stationarity > > > > library(forecast) library(tseries) plot(Nile) ndiffs(Nile) [1] 1 > dNile <- diff(Nile) > plot(dNile) > adf.test(dNile) Augmented Dickey-Fuller Test data: dNile Dickey-Fuller = -6.5924, Lag order = 4, p-value = 0.01 alternative hypothesis: stationary The series is differenced once (lag=1 is the default) and saved as dNile. The differenced time series is plotted in the bottom half of figure 15.11 and certainly looks more stationary. Applying the ADF test to the differenced series suggest that it’s now stationary, so you can proceed to the next step. IDENTIFYING ONE OR MORE REASONABLE MODELS Possible models are selected based on the ACF and PACF plots: Acf(dNile) Pacf(dNile) 0.0 −0.4 −0.2 ACF 0.2 The resulting plots are given in figure 15.12. 2 3 4 5 6 7 8 9 10 12 14 16 18 1 2 3 4 5 6 7 8 9 10 12 14 16 18 0.0 −0.2 −0.4 Figure 15.12 Autocorrelation and partial autocorrelation plots for the differenced Nile time series Partial ACF 0.2 1 Lag www.it-ebooks.info 364 CHAPTER 15 Time series The goal is to identify the parameters p, d, and q. You already know that d=1 from the previous section. You get p and q by comparing the ACF and PACF plots with the guidelines given in table 15.6. Table 15.6 Guidelines for selecting an ARIMA model Model ACF PACF ARIMA(p, d, 0) Trails off to zero Zero after lag p ARIMA(0, d, q) Zero after lag q Trails off to zero ARIMA(p, d, q) Trails off to zero Trails off to zero The results in table 15.6 are theoretical, and the actual ACF and PACF may not match this exactly. But they can be used to give a rough guide of reasonable models to try. For the Nile time series in figure 15.12, there appears to be one large autocorrelation at lag 1, and the partial autocorrelations trail off to zero as the lags get bigger. This suggests trying an ARIMA(0, 1, 1) model. FITTING THE MODEL(S) The ARIMA model is fit with the arima() function. The format is arima(ts, order=c(q, d, q)). The result of fitting an ARIMA(0, 1, 1) model to the Nile time series is given in the following listing. Listing 15.8 Fitting an ARIMA model > library(forecast) > fit <- arima(Nile, order=c(0,1,1)) > fit Series: Nile ARIMA(0,1,1) Coefficients: ma1 -0.7329 s.e. 0.1143 sigma^2 estimated as 20600: AIC=1269.09 AICc=1269.22 log likelihood=-632.55 BIC=1274.28 > accuracy(fit) ME RMSE MAE MPE MAPE MASE Training set -11.94 142.8 112.2 -3.575 12.94 0.8089 Note that you apply the model to the original time series. 
By specifying d=1, it calculates first differences for you. The coefficient for the moving averages (-0.73) is provided along with the AIC. If you fit other models, the AIC can help you choose which one is most reasonable. Smaller AIC values suggest better models. The accuracy www.it-ebooks.info ARIMA forecasting models 365 measures can help you determine whether the model fits with sufficient accuracy. Here the mean absolute percent error is 13% of the river level. EVALUATING MODEL FIT If the model is appropriate, the residuals should be normally distributed with mean zero, and the autocorrelations should be zero for every possible lag. In other words, the residuals should be normally and independently distributed (no relationship between them). The assumptions can be evaluated with the following code. Listing 15.9 Evaluating the model fit > qqnorm(fit$residuals) > qqline(fit$residuals) > Box.test(fit$residuals, type="Ljung-Box") Box-Ljung test data: fit$residuals X-squared = 1.3711, df = 1, p-value = 0.2416 The qqnorm() and qqline() functions produce the plot in figure 15.13. Normally distributed data should fall along the line. In this case, the results look good. The Box.test() function provides a test that the autocorrelations are all zero. The results aren’t significant, suggesting that the autocorrelations don’t differ from zero. This ARIMA model appears to fit the data well. MAKING FORECASTS If the model hadn’t met the assumptions of normal residuals and zero autocorrelations, it would have been necessary to alter the model, add parameters, or try a different approach. Once a final model has been chosen, it can be used to make predictions of future values. In the next listing, the forecast() function from the forecast package is used to predict three years ahead. 0 −200 Figure 15.13 Normal Q-Q plot for determining the normality of the time-series residuals −400 Sample Quantiles 200 Normal Q−Q Plot −2 −1 0 1 Theoretical Quantiles www.it-ebooks.info 2 366 CHAPTER 15 Time series Figure 15.14 Three-year forecast for the Nile time series from a fitted ARIMA(0,1,1) model. Blue dots represent point estimates, and the light and dark gray bands represent the 80% and 95% confidence bands limits, respectively. 600 800 Annual Flow 1000 1200 1400 Forecasts from ARIMA(0,1,1) 1880 1900 1920 1940 1960 Year Listing 15.10 Forecasting with an ARIMA model > forecast(fit, 3) 1971 1972 1973 Point Forecast Lo 80 798.3673 614.4307 798.3673 607.9845 798.3673 601.7495 Hi 80 Lo 95 Hi 95 982.3040 517.0605 1079.674 988.7502 507.2019 1089.533 994.9851 497.6663 1099.068 > plot(forecast(fit, 3), xlab="Year", ylab="Annual Flow") The plot() function is used to plot the forecast in figure 15.14. Point estimates are given by the blue dots, and 80% and 95% confidence bands are represented by dark and light bands, respectively. 15.4.3 Automated ARIMA forecasting In section 15.2.3, you used the ets() function in the forecast package to automate the selection of a best exponential model. The package also provides an auto.arima() function to select a best ARIMA model. The next listing applies this approach to the sunspots time series described in the chapter introduction. Listing 15.11 Automated ARIMA forecasting > library(forecast) > fit <- auto.arima(sunspots) > fit Series: sunspots ARIMA(2,1,2) www.it-ebooks.info 367 Summary Coefficients: ar1 ar2 1.35 -0.396 s.e. 
0.03 0.029 ma1 -1.77 0.02 ma2 0.810 0.019 sigma^2 estimated as 243: log likelihood=-11746 AIC=23501 AICc=23501 BIC=23531 > forecast(fit, 3) Jan 1984 Feb 1984 Mar 1984 Point Forecast 40.437722 41.352897 39.796425 Lo 80 Hi 80 20.4412613 60.43418 18.2795867 64.42621 15.2537785 64.33907 Lo 95 Hi 95 9.855774 71.01967 6.065314 76.64048 2.261686 77.33116 > accuracy(fit) ME RMSE MAE MPE MAPE MASE Training set -0.02673 15.6 11.03 NaN Inf 0.32 The function selects an ARIMA model with p=2, d=1, and q=2. These are values that minimize the AIC criterion over a large number of possible models. The MPE and MAPE accuracy blow up because there are zero values in the series (a drawback of these two statistics). Plotting the results and evaluating the fit are left for you as an exercise. 15.5 Going further There are many good books on time-series analysis and forecasting. If you’re new to the subject, I suggest starting with the book Time Series (Open University, 2006). Although it doesn’t include R code, it provides a very understandable and intuitive introduction. A Little Book of R for Time Series by Avril Coghlan (http://mng.bz/8fz0, 2010) pairs well with the Open University text and includes R code and examples. Forecasting: Principles and Practice (http://otexts.com/fpp, 2013) is a clear and concise online textbook written by Rob Hyndman and George Athanasopoulos; it includes R code throughout. I highly recommend it. Additionally, Cowpertwait & Metcalfe (2009) have written an excellent text on analyzing time series with R. A more advanced treatment that also includes R code can be found in Shumway & Stoffer (2010). Finally, you can consult the CRAN Task View on Time Series Analysis (http:// cran.r-project.org/web/views/TimeSeries.html). It contains a comprehensive summary of all of R’s time-series capabilities. 15.6 Summary Forecasting has a long and varied history, from early shamans predicting the weather to modern data scientists predicting the results of recent elections. Prediction is fundamental to both science and human nature. In this chapter, we’ve looked at how to create time series in R, assess trends, and examine seasonal effects. Then we www.it-ebooks.info 368 CHAPTER 15 Time series considered two of the most popular approaches to forecasting: exponential models and ARIMA models. Although these methodologies can be crucial in understanding and predicting a wide variety of phenomena, it’s important to remember that they each entail extrapolation—going beyond the data. They assume that future conditions mirror current conditions. Financial predictions made in 2007 assumed continued economic growth in 2008 and beyond. As we all know now, that isn’t exactly how things turned out. Significant events can change the trend and pattern in a time series, and the farther out you try to predict, the greater the uncertainty. In the next chapter, we’ll shift gears and look at methodologies that are important to anyone trying to classify individuals or observations into discrete groups. www.it-ebooks.info Cluster analysis This chapter covers ■ Identifying cohesive subgroups (clusters) of observations ■ Determining the number of clusters present ■ Obtaining a nested hierarchy of clusters ■ Obtaining discrete clusters Cluster analysis is a data-reduction technique designed to uncover subgroups of observations within a dataset. It allows you to reduce a large number of observations to a much smaller number of clusters or types. 
A cluster is defined as a group of observations that are more similar to each other than they are to the observations in other groups. This isn’t a precise definition, and that fact has led to an enormous variety of clustering methods. Cluster analysis is widely used in the biological and behavioral sciences, marketing, and medical research. For example, a psychological researcher might cluster data on the symptoms and demographics of depressed patients, seeking to uncover subtypes of depression. The hope would be that finding such subtypes might lead to more targeted and effective treatments and a better understanding of the disorder. Marketing researchers use cluster analysis as a customer-segmentation strategy. 369 www.it-ebooks.info 370 CHAPTER 16 Cluster analysis Customers are arranged into clusters based on the similarity of their demographics and buying behaviors. Marketing campaigns are then tailored to appeal to one or more of these subgroups. Medical researchers use cluster analysis to help catalog gene-expression patterns obtained from DNA microarray data. This can help them to understand normal growth and development and the underlying causes of many human diseases. The two most popular clustering approaches are hierarchical agglomerative clustering and partitioning clustering. In agglomerative hierarchical clustering, each observation starts as its own cluster. Clusters are then combined, two at a time, until all clusters are merged into a single cluster. In the partitioning approach, you specify K: the number of clusters sought. Observations are then randomly divided into K groups and reshuffled to form cohesive clusters. Within each of these broad approaches, there are many clustering algorithms to choose from. For hierarchical clustering, the most popular are single linkage, complete linkage, average linkage, centroid, and Ward’s method. For partitioning, the two most popular are k-means and partitioning around medoids (PAM). Each clustering method has advantages and disadvantages, which we’ll discuss. The examples in this chapter focus on food and wine (I suspect my friends aren’t surprised). Hierarchical clustering is applied to the nutrient dataset contained in the flexclust package to answer the following questions: ■ ■ What are the similarities and differences among 27 types of fish, fowl, and meat, based on 5 nutrient measures? Is there a smaller number of groups into which these foods can be meaningfully clustered? Partitioning methods will be used to evaluate 13 chemical analyses of 178 Italian wine samples. The data are contained in the wine dataset available with the rattle package. Here, the questions are as follows: ■ ■ Are there subtypes of wine in the data? If so, how many subtypes are there, and what are their characteristics? In fact, the wine samples represent three varietals (recorded as Type). This will allow you to evaluate how well the cluster analysis recovers the underlying structure. Although there are many approaches to cluster analysis, they usually follow a similar set of steps. These common steps are described in section 16.1. Hierarchical agglomerative clustering is described in section 16.2, and partitioning methods are covered in section 16.3. Some final advice and cautionary statements are provided in section 16.4. In order to run the examples in this chapter, be sure to install the cluster, NbClust, flexclust, fMultivar, ggplot2, and rattle packages. The rattle package will also be used in chapter 17. 
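If some of these packages aren't already installed on your machine, a one-time setup along the following lines will fetch whichever are missing (a convenience sketch only):

pkgs <- c("cluster", "NbClust", "flexclust", "fMultivar", "ggplot2", "rattle")
missing <- pkgs[!pkgs %in% rownames(installed.packages())]   # packages not yet present
if (length(missing) > 0) install.packages(missing)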
16.1 Common steps in cluster analysis

Like factor analysis (chapter 14), an effective cluster analysis is a multistep process with numerous decision points. Each decision can affect the quality and usefulness of the results. This section describes the 11 typical steps in a comprehensive cluster analysis:

1. Choose appropriate attributes. The first (and perhaps most important) step is to select variables that you feel may be important for identifying and understanding differences among groups of observations within the data. For example, in a study of depression, you might want to assess one or more of the following: psychological symptoms; physical symptoms; age at onset; number, duration, and timing of episodes; number of hospitalizations; functional status with regard to self-care; social and work history; current age; gender; ethnicity; socioeconomic status; marital status; family medical history; and response to previous treatments. A sophisticated cluster analysis can't compensate for a poor choice of variables.

2. Scale the data. If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable, and analysts scale the data before continuing. The most popular approach is to standardize each variable to a mean of 0 and a standard deviation of 1. Other alternatives include dividing each variable by its maximum value or subtracting the variable's mean and dividing by the variable's median absolute deviation. The three approaches are illustrated with the following code snippets:

df1 <- apply(mydata, 2, function(x){(x - mean(x))/sd(x)})
df2 <- apply(mydata, 2, function(x){x/max(x)})
df3 <- apply(mydata, 2, function(x){(x - mean(x))/mad(x)})

In this chapter, you'll use the scale() function to standardize the variables to a mean of 0 and a standard deviation of 1. This is equivalent to the first code snippet (df1).

3. Screen for outliers. Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained. You can screen for (and remove) univariate outliers using functions from the outliers package. The mvoutlier package contains functions that can be used to identify multivariate outliers. An alternative is to use a clustering method that is robust to the presence of outliers. Partitioning around medoids (section 16.4.2) is an example of the latter approach.

4. Calculate distances. Although clustering algorithms vary widely, they typically require a measure of the distance among the entities to be clustered. The most popular measure of the distance between two observations is the Euclidean distance, but the Manhattan, Canberra, asymmetric binary, maximum, and Minkowski distance measures are also available (see ?dist for details). In this chapter, the Euclidean distance is used throughout. Calculating Euclidean distances is covered in section 16.2.

5. Select a clustering algorithm. Next, you select a method of clustering the data. Hierarchical clustering is useful for smaller problems (say, 150 observations or less) and where a nested hierarchy of groupings is desired. The partitioning method can handle much larger problems but requires that the number of clusters be specified in advance. Once you've chosen the hierarchical or partitioning approach, you must select a specific clustering algorithm. Again, each has advantages and disadvantages. The most popular methods are described in sections 16.3 and 16.4. You may wish to try more than one algorithm to see how robust the results are to the choice of methods.

6. Obtain one or more cluster solutions. This step uses the method(s) selected in step 5.

7. Determine the number of clusters present. In order to obtain a final cluster solution, you must decide how many clusters are present in the data. This is a thorny problem, and many approaches have been proposed. It usually involves extracting various numbers of clusters (say, 2 to K) and comparing the quality of the solutions. The NbClust() function in the NbClust package provides 30 different indices to help you make this decision (elegantly demonstrating how unresolved this issue is). NbClust is used throughout this chapter.

8. Obtain a final clustering solution. Once the number of clusters has been determined, a final clustering is performed to extract that number of subgroups.

9. Visualize the results. Visualization can help you determine the meaning and usefulness of the cluster solution. The results of a hierarchical clustering are usually presented as a dendrogram. Partitioning results are typically visualized using a bivariate cluster plot.

10. Interpret the clusters. Once a cluster solution has been obtained, you must interpret (and possibly name) the clusters. What do the observations in a cluster have in common? How do they differ from the observations in other clusters? This step is typically accomplished by obtaining summary statistics for each variable by cluster. For continuous data, the mean or median for each variable within each cluster is calculated. For mixed data (data that contain categorical variables), the summary statistics will also include modes or category distributions.

11. Validate the results. Validating the cluster solution involves asking the question, "Are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or statistical technique?" If a different cluster method or different sample is employed, would the same clusters be obtained? The fpc, clv, and clValid packages each contain functions for evaluating the stability of a clustering solution.

Because the calculation of distances between observations is such an integral part of cluster analysis, it's described next and in some detail.

16.2 Calculating distances

Every cluster analysis begins with the calculation of a distance, dissimilarity, or proximity between each entity to be clustered. The Euclidean distance between two observations is given by

d_{ij} = \sqrt{\sum_{p=1}^{P} (x_{ip} - x_{jp})^2}

where i and j are observations and P is the number of variables.

Consider the nutrient dataset provided with the flexclust package. The dataset contains measurements on the nutrients of 27 types of meat, fish, and fowl. The first few observations are given by

> data(nutrient, package="flexclust")
> head(nutrient, 4)
             energy protein fat calcium iron
BEEF BRAISED    340      20  28       9  2.6
HAMBURGER       245      21  17       9  2.7
BEEF ROAST      420      15  39       7  2.0
BEEF STEAK      375      19  32       9  2.6

and the Euclidean distance between the first two (beef braised and hamburger) is

d = \sqrt{(340 - 245)^2 + (20 - 21)^2 + (28 - 17)^2 + (9 - 9)^2 + (2.6 - 2.7)^2} = 95.64

The dist() function in the base R installation can be used to calculate the distances between all rows (observations) of a matrix or data frame. The format is dist(x, method=), where x is the input data and method="euclidean" by default.
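As a quick check on the hand calculation above, you can compute the same distance directly (a minimal sketch, assuming the nutrient data frame has been loaded as shown):

# Euclidean distance between the first two rows (beef braised and hamburger)
sqrt(sum((nutrient[1, ] - nutrient[2, ])^2))    # approximately 95.64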
The function returns a lower triangle matrix by default, but the as.matrix() function can be used to access the distances using standard bracket notation. For the nutrient data frame, > d <- dist(nutrient) > as.matrix(d)[1:4,1:4] BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK BEEF BRAISED HAMBURGER BEEF ROAST BEEF STEAK 0.0 95.6 80.9 35.2 95.6 0.0 176.5 130.9 80.9 176.5 0.0 45.8 35.2 130.9 45.8 0.0 Larger distances indicate larger dissimilarities between observations. The distance between an observation and itself is 0. As expected, the dist() function provides the same distance between beef braised and hamburger as the hand calculations. Cluster analysis with mixed data types Euclidean distances are usually the distance measure of choice for continuous data. But if other variable types are present, alternative dissimilarity measures are required. You can use the daisy() function in the cluster package to obtain a dissimilarity matrix among observations that have any combination of binary, nominal, ordinal, and continuous attributes. Other functions in the cluster package can use these dissimilarities to carry out a cluster analysis. For example, agnes() offers agglomerative hierarchical clustering, and pam() provides partitioning around medoids. Note that distances in the nutrient data frame are heavily dominated by the contribution of the energy variable, which has a much larger range. Scaling the data will help www.it-ebooks.info 374 CHAPTER 16 Cluster analysis to equalize the impact of each variable. In the next section, you’ll apply hierarchical cluster analysis to this dataset. 16.3 Hierarchical cluster analysis As stated previously, in agglomerative hierarchical clustering, each case or observation starts as its own cluster. Clusters are then combined two at a time until all clusters are merged into a single cluster. The algorithm is as follows: 1 2 3 4 Define each observation (row, case) as a cluster. Calculate the distances between every cluster and every other cluster. Combine the two clusters that have the smallest distance. This reduces the number of clusters by one. Repeat steps 2 and 3 until all clusters have been merged into a single cluster containing all observations. The primary difference among hierarchical clustering algorithms is their definitions of cluster distances (step 2). Five of the most common hierarchical clustering methods and their definitions of the distance between two clusters are given in table 16.1. Table 16.1 Hierarchical clustering methods Cluster method Definition of the distance between two clusters Single linkage Shortest distance between a point in one cluster and a point in the other cluster. Complete linkage Longest distance between a point in one cluster and a point in the other cluster. Average linkage Average distance between each point in one cluster and each point in the other cluster (also called UPGMA [unweighted pair group mean averaging]). Centroid Distance between the centroids (vector of variable means) of the two clusters. For a single observation, the centroid is the variable’s values. Ward The ANOVA sum of squares between the two clusters added up over all the variables. Single-linkage clustering tends to find elongated, cigar-shaped clusters. It also commonly displays a phenomenon called chaining—dissimilar observations are joined into the same cluster because they’re similar to intermediate observations between them. Complete-linkage clustering tends to find compact clusters of approximately equal diameter. 
It can also be sensitive to outliers. Average-linkage clustering offers a compromise between the two. It’s less likely to chain and is less susceptible to outliers. It also has a tendency to join clusters with small variances. Ward’s method tends to join clusters with small numbers of observations and tends to produce clusters with roughly equal numbers of observations. It can also be sensitive to outliers. The centroid method offers an attractive alternative due to its simple and easily understood definition of cluster distances. It’s also less sensitive to outliers than other hierarchical methods. But it may not perform as well as the averagelinkage or Ward method. www.it-ebooks.info 375 Hierarchical cluster analysis Hierarchical clustering can be accomplished with the hclust() function. The format is hclust(d, method=), where d is a distance matrix produced by the dist() function and methods include "single", "complete", "average", "centroid", and "ward". In this section, you’ll apply average-linkage clustering to the nutrient data introduced from section 16.1.1. The goal is to identify similarities, differences, and groupings among 27 food types based on nutritional information. The code for carrying out the clustering is provided in the following listing. Listing 16.1 Average-linkage clustering of the nutrient data data(nutrient, package="flexclust") row.names(nutrient) <- tolower(row.names(nutrient)) nutrient.scaled <- scale(nutrient) d <- dist(nutrient.scaled) fit.average <- hclust(d, method="average") plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering") First the data are imported, and the row names are set to lowercase (because I hate UPPERCASE LABELS). Because the variables differ widely in range, they’re standardized to a mean of 0 and a standard deviation of 1. Euclidean distances between each of the 27 food types are calculated, and an average-linkage clustering is performed. Finally, the results are plotted as a dendrogram (see figure 16.1). The hang option in the plot() function justifies the observation labels (causing them to hang down from 0). 2 1 sardines canned clams raw clams canned beef heart beef roast lamb shoulder roast beef steak beef braised smoked ham pork roast pork simmered mackerel canned salmon canned mackerel broiled perch fried crabmeat canned haddock fried beef canned veal cutlet beef tongue hamburger lamb leg roast shrimp canned chicken canned tuna canned chicken broiled bluefish baked 0 Height 3 4 5 Average-Linkage Clustering d hclust (*, "average") www.it-ebooks.info Figure 16.1 Average-linkage clustering of nutrient data 376 CHAPTER 16 Cluster analysis The dendrogram displays how items are combined into clusters and is read from the bottom up. Each observation starts as its own cluster. Then the two observations that are closest (beef braised and smoked ham) are combined. Next, pork roast and pork simmered are combined, followed by chicken canned and tuna canned. In the fourth step, the beef braised/smoked ham cluster and the pork roast/pork simmered clusters are combined (and the cluster now contains four food items). This continues until all observations are combined into a single cluster. The height dimension indicates the criterion value at which clusters are joined. For average-linkage clustering, this criterion is the average distance between each point in one cluster and each point in the other cluster. 
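If you'd like to see those joining heights as numbers rather than reading them off the dendrogram, they're stored in the fitted object. A small sketch using the fit.average object from listing 16.1:

# The merge history and the height (average distance) at which each join occurred
head(fit.average$merge)              # which clusters were combined at each step
round(head(fit.average$height), 3)   # the criterion value for each of those joins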
If your goal is to understand how food types are similar or different with regard to their nutrients, then figure 16.1 may be sufficient. It creates a hierarchical view of the similarity/dissimilarity among the 27 items. Canned tuna and chicken are similar, and both differ greatly from canned clams. But if the end goal is to assign these foods to a smaller number of (hopefully meaningful) groups, additional analyses are required to select an appropriate number of clusters. The NbClust package offers numerous indices for determining the best number of clusters in a cluster analysis. There is no guarantee that they will agree with each other. In fact, they probably won’t. But the results can be used as a guide for selecting possible candidate values for K, the number of clusters. Input to the NbClust() function includes the matrix or data frame to be clustered, the distance measure and clustering method to employ, and the minimum and maximum number of clusters to consider. It returns each of the clustering indices, along with the best number of clusters proposed by each. The next listing applies this approach to the average-linkage clustering of the nutrient data. Listing 16.2 Selecting the number of clusters > library(NbClust) > devAskNewPage(ask=TRUE) > nc <- NbClust(nutrient.scaled, distance="euclidean", min.nc=2, max.nc=15, method="average") > table(nc$Best.n[1,]) 0 2 2 4 3 4 4 3 5 4 9 10 13 14 15 1 1 2 1 4 > barplot(table(nc$Best.n[1,]), xlab="Numer of Clusters", ylab="Number of Criteria", main="Number of Clusters Chosen by 26 Criteria") Here, four criteria each favor two clusters, four criteria favor three clusters, and so on. The results are plotted in figure 16.2. You could try the number of clusters (2, 3, 5, and 15) with the most “votes” and select the one that makes the most interpretive sense. The following listing explores the five-cluster solution. www.it-ebooks.info 377 Hierarchical cluster analysis 3 2 1 Figure 16.2 Recommended number of clusters using 26 criteria provided by the NbClust package 0 Number of Criteria 4 Number of Clusters Chosen by 26 Criteria 0 1 2 3 4 5 9 10 13 14 15 Number of Clusters Listing 16.3 Obtaining the final cluster solution > clusters <- cutree(fit.average, k=5) > table(clusters) clusters 1 2 3 7 16 1 4 2 b Assigns cases 5 1 > aggregate(nutrient, by=list(cluster=clusters), median) 1 2 3 4 5 cluster energy protein fat calcium iron 1 340.0 19 29 9 2.50 2 170.0 20 8 13 1.45 3 160.0 26 5 14 5.90 4 57.5 9 1 78 5.70 5 180.0 22 9 367 2.50 c Describes clusters > aggregate(as.data.frame(nutrient.scaled), by=list(cluster=clusters), median) 1 2 3 4 5 cluster 1 2 3 4 5 energy protein fat calcium iron 1.310 0.000 1.379 -0.448 0.0811 -0.370 0.235 -0.487 -0.397 -0.6374 -0.468 1.646 -0.753 -0.384 2.4078 -1.481 -2.352 -1.109 0.436 2.2709 -0.271 0.706 -0.398 4.140 0.0811 > plot(fit.average, hang=-1, cex=.8, main="Average Linkage Clustering\n5 Cluster Solution") > rect.hclust(fit.average, k=5) d Plots results The cutree() function is used to cut the tree into five clusters b. The first cluster has 7 observations, the second cluster has 16 observations, and so on. The aggregate() function is then used to obtain the median profile for each cluster c. The results are reported in both the original metric and in standardized form. 
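If you also want to see exactly which foods fall into each of the five clusters, the cluster membership vector can be split by cluster; a brief sketch using the objects created in listings 16.1 and 16.3:

# List the food items assigned to each cluster
split(row.names(nutrient), clusters)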
Finally, the dendrogram is replotted, and the rect.hclust() function is used to superimpose the five-cluster solution d. The results are displayed in figure 16.3.

[Figure 16.3 Average-linkage clustering of the nutrient data with a five-cluster solution]

Sardines form their own cluster and are much higher in calcium than the other food groups. Beef heart is also a singleton and is high in protein and iron. The clam cluster is low in protein and high in iron. The items in the cluster containing beef roast to pork simmered are high in energy and fat. Finally, the largest group (mackerel to bluefish) is relatively low in iron.

Hierarchical clustering can be particularly useful when you expect nested clustering and a meaningful hierarchy. This is often the case in the biological sciences. But the hierarchical algorithms are greedy in the sense that once an observation is assigned to a cluster, it can't be reassigned later in the process. Additionally, hierarchical clustering is difficult to apply in large samples, where there may be hundreds or even thousands of observations. Partitioning methods can work well in these situations.

16.4 Partitioning cluster analysis

In the partitioning approach, observations are divided into K groups and reshuffled to form the most cohesive clusters possible according to a given criterion. This section considers two methods: k-means and partitioning around medoids (PAM).

16.4.1 K-means clustering

The most common partitioning method is the k-means cluster analysis. Conceptually, the k-means algorithm is as follows:

1. Select K centroids (K rows chosen at random).
2. Assign each data point to its closest centroid.
3. Recalculate the centroids as the average of all data points in a cluster (that is, the centroids are p-length mean vectors, where p is the number of variables).
4. Assign data points to their closest centroids.
5. Continue steps 3 and 4 until the observations aren't reassigned or the maximum number of iterations (R uses 10 as a default) is reached.

Implementation details for this approach can vary. R uses an efficient algorithm by Hartigan and Wong (1979) that partitions the observations into k groups such that the sum of squares of the observations to their assigned cluster centers is a minimum. This means, in steps 2 and 4, each observation is assigned to the cluster with the smallest value of

ss(k) = \sum_{i=1}^{n} \sum_{j=1}^{p} (x_{ij} - \bar{x}_{kj})^2

where k is the cluster, x_{ij} is the value of the jth variable for the ith observation, \bar{x}_{kj} is the mean of the jth variable for the kth cluster, and p is the number of variables.

K-means clustering can handle larger datasets than hierarchical cluster approaches. Additionally, observations aren't permanently committed to a cluster. They're moved when doing so improves the overall solution. But the use of means implies that all variables must be continuous, and the approach can be severely affected by outliers.
It also performs poorly in the presence of non-convex (for example, U-shaped) clusters. The format of the k-means function in R is kmeans(x, centers), where x is a numeric dataset (matrix or data frame) and centers is the number of clusters to extract. The function returns the cluster memberships, centroids, sums of squares (within, between, total), and cluster sizes. Because k-means cluster analysis starts with k randomly chosen centroids, a different solution can be obtained each time the function is invoked. Use the set.seed() function to guarantee that the results are reproducible. Additionally, this clustering approach can be sensitive to the initial selection of centroids. The kmeans() function has an nstart option that attempts multiple initial configurations and reports on the best one. For example, adding nstart=25 generates 25 initial configurations. This approach is often recommended. Unlike hierarchical clustering, k-means clustering requires that you specify in advance the number of clusters to extract. Again, the NbClust package can be used as a guide. Additionally, a plot of the total within-groups sums of squares against the number of clusters in a k-means solution can be helpful. A bend in the graph (similar to the bend in the Scree test described in section 14.2.1) can suggest the appropriate number of clusters. The graph can be produced with the following function: wssplot <- function(data, nc=15, seed=1234){ wss <- (nrow(data)-1)*sum(apply(data,2,var)) www.it-ebooks.info 380 CHAPTER 16 Cluster analysis for (i in 2:nc){ set.seed(seed) wss[i] <- sum(kmeans(data, centers=i)$withinss)} plot(1:nc, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")} The data parameter is the numeric dataset to be analyzed, nc is the maximum number of clusters to consider, and seed is a random-number seed. Let’s apply k-means clustering to a dataset containing 13 chemical measurements on 178 Italian wine samples. The data originally come from the UCI Machine Learning Repository (www.ics.uci.edu/~mlearn/MLRepository.html), but you’ll access them here via the rattle package. In this dataset, the observations represent three wine varietals, as indicated by the first variable (Type). You’ll drop this variable, perform the cluster analysis, and see if you can recover the known structure. 
Listing 16.4 K-means clustering of wine data > data(wine, package="rattle") > head(wine) Type Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids 1 14.23 1.71 2.43 15.6 127 2.80 3.06 1 13.20 1.78 2.14 11.2 100 2.65 2.76 1 13.16 2.36 2.67 18.6 101 2.80 3.24 1 14.37 1.95 2.50 16.8 113 3.85 3.49 1 13.24 2.59 2.87 21.0 118 2.80 2.69 1 14.20 1.76 2.45 15.2 112 3.27 3.39 1 2 3 4 5 6 > df <- scale(wine[-1]) 1 2 3 4 5 6 Nonflavanoids Proanthocyanins Color 0.28 2.29 5.64 0.26 1.28 4.38 0.30 2.81 5.68 0.24 2.18 7.80 0.39 1.82 4.32 0.34 1.97 6.75 > > > > > > Hue Dilution Proline 1.04 3.92 1065 1.05 3.40 1050 1.03 3.17 1185 0.86 3.45 1480 1.04 2.93 735 1.05 2.85 1450 wssplot(df) library(NbClust) set.seed(1234) devAskNewPage(ask=TRUE) nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans") table(nc$Best.n[1,]) 0 2 2 3 3 14 b c 8 13 14 15 1 2 1 1 > barplot(table(nc$Best.n[1,]), xlab="Number of Clusters", ylab="Number of Criteria", main="Number of Clusters Chosen by 26 Criteria") > set.seed(1234) www.it-ebooks.info Standardizes the data Determines the number of clusters 381 Partitioning cluster analysis > fit.km <- kmeans(df, 3, nstart=25) > fit.km$size d Performs the k-means cluster analysis [1] 62 65 51 > fit.km$centers Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids 0.83 -0.30 0.36 -0.61 0.576 0.883 0.975 -0.561 -0.92 -0.39 -0.49 0.17 -0.490 -0.076 0.021 -0.033 0.16 0.87 0.19 0.52 -0.075 -0.977 -1.212 0.724 Proanthocyanins Color Hue Dilution Proline 1 0.579 0.17 0.47 0.78 1.12 2 0.058 -0.90 0.46 0.27 -0.75 3 -0.778 0.94 -1.16 -1.29 -0.41 1 2 3 > aggregate(wine[-1], by=list(cluster=fit.km$cluster), mean) cluster Alcohol Malic Ash Alcalinity Magnesium Phenols Flavanoids 1 14 1.8 2.4 17 106 2.8 3.0 2 12 1.6 2.2 20 88 2.2 2.0 3 13 3.3 2.4 21 97 1.6 0.7 Nonflavanoids Proanthocyanins Color Hue Dilution Proline 1 0.29 1.9 5.4 1.07 3.2 1072 2 0.35 1.6 2.9 1.04 2.8 495 3 0.47 1.1 7.3 0.67 1.7 620 1 2 3 2000 1500 Figure 16.4 Plotting the within-groups sums of squares vs. the number of clusters extracted. The sharp decreases from one to three clusters (with little decrease after) suggests a three-cluster solution. 1000 Within groups sum of squares Because the variables vary in range, they’re standardized prior to clustering b. Next, the number of clusters is determined using the wssplot() and NbClust() functions c. Figure 16.4 indicates that there is a distinct drop in the within-groups sum of squares when moving from one to three clusters. After three clusters, this decrease drops off, suggesting that a three-cluster solution may be a good fit to the data. In figure 16.5, 14 of 24 criteria provided by the NbClust package suggest a three-cluster solution. Note that not all 30 criteria can be calculated for every dataset. A final cluster solution is obtained with the kmeans() function, and the cluster centroids are printed d. Because the centroids provided by the function are based on 2 4 6 8 10 12 Number of Clusters www.it-ebooks.info 14 382 CHAPTER 16 Cluster analysis 10 8 6 4 Figure 16.5 Recommended number of clusters using 26 criteria provided by the NbClust package 0 2 Number of Criteria 12 14 Number of Clusters Chosen by 26 Criteria 0 1 2 3 10 12 14 15 Number of Clusters standardized data, the aggregate() function is used along with the cluster memberships to determine variable means for each cluster in the original metric. How well did k-means clustering uncover the actual structure of the data contained in the Type variable? 
A cross-tabulation of Type (wine varietal) and cluster membership is given by > ct.km <- table(wine$Type, fit.km$cluster) > ct.km 1 2 3 1 59 0 0 2 3 65 3 3 0 0 48 You can quantify the agreement between type and cluster using an adjusted Rand index, provided by the flexclust package: > library(flexclust) > randIndex(ct.km) [1] 0.897 The adjusted Rand index provides a measure of the agreement between two partitions, adjusted for chance. It ranges from -1 (no agreement) to 1 (perfect agreement). Agreement between the wine varietal type and the cluster solution is 0.9. Not bad—shall we have some wine? 16.4.2 Partitioning around medoids Because it’s based on means, the k-means clustering approach can be sensitive to outliers. A more robust solution is provided by partitioning around medoids (PAM). Rather than representing each cluster using a centroid (a vector of variable means), each cluster is identified by its most representative observation (called a medoid). Whereas k-means uses Euclidean distances, PAM can be based on any distance measure. It can therefore accommodate mixed data types and isn’t limited to continuous variables. www.it-ebooks.info 383 Partitioning cluster analysis The PAM algorithm is as follows: 1 2 3 4 5 6 7 8 9 Randomly select K observations (call each a medoid). Calculate the distance/dissimilarity of every observation to each medoid. Assign each observation to its closest medoid. Calculate the sum of the distances of each observation from its medoid (total cost). Select a point that isn’t a medoid, and swap it with its medoid. Reassign every point to its closest medoid. Calculate the total cost. If this total cost is smaller, keep the new point as a medoid. Repeat steps 5–8 until the medoids don’t change. A good worked example of the underlying math in the PAM approach can be found at http://en.wikipedia.org/wiki/k-medoids (I don’t usually cite Wikipedia, but this is a great example). You can use the pam() function in the cluster package to partition around medoids. The format is pam(x, k, metric="euclidean", stand=FALSE), where x is a data matrix or data frame, k is the number of clusters, metric is the type of distance/ dissimilarity measure to use, and stand is a logical value indicating whether the variables should be standardized before calculating this metric. PAM is applied to the wine data in the following listing; see figure 16.6. 0 −2 Component 2 2 4 Bivariate Cluster Plot Figure 16.6 Cluster plot for the three-group PAM clustering of the Italian wine data −4 −2 0 2 4 Component 1 These two components explain 55.41 % of the point variability. www.it-ebooks.info 384 CHAPTER 16 Cluster analysis Listing 16.5 Partitioning around medoids for the wine data > > > > library(cluster) set.seed(1234) fit.pam <- pam(wine[-1], k=3, stand=TRUE) fit.pam$medoids [1,] [2,] [3,] [1,] [2,] [3,] Alcohol Malic 13.5 1.81 12.2 1.73 13.4 3.91 Nonflavanoids 0.26 0.37 0.43 Clusters standardized data Ash Alcalinity Magnesium Phenols Flavanoids 2.41 20.5 100 2.70 2.98 2.12 19.0 80 1.65 2.03 2.48 23.0 102 1.80 0.75 Proanthocyanins Color Hue Dilution Proline 1.86 5.1 1.04 3.47 920 1.63 3.4 1.00 3.17 510 1.41 7.3 0.70 1.56 750 Prints the medoids Plots the cluster solution > clusplot(fit.pam, main="Bivariate Cluster Plot") Note that the medoids are actual observations contained in the wine dataset. In this case, they’re observations 36, 107, and 175, and they have been chosen to represent the three clusters. 
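Those row numbers aren't something you have to read off by hand; the fitted pam object records them. A quick sketch using fit.pam from listing 16.5:

# Row numbers of the medoid observations
fit.pam$id.med    # returns 36, 107, and 175 for this solution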
The bivariate plot is created by plotting the coordinates of each observation on the first two principal components (see chapter 14) derived from the 13 assay variables. Each cluster is represented by an ellipse with the smallest area containing all its points. Also note that PAM didn’t perform as well as k-means in this instance: > ct.pam <- table(wine$Type, fit.pam$clustering) 1 2 3 1 59 0 0 2 16 53 2 3 0 1 47 > randIndex(ct.pam) [1] 0.699 The adjusted Rand index has decreased from 0.9 (for k-means) to 0.7. 16.5 Avoiding nonexistent clusters Before I finish this discussion, a word of caution is in order. Cluster analysis is a methodology designed to identify cohesive subgroups in a dataset. It’s very good at doing this. In fact, it’s so good, it can find clusters where none exist. Consider the following code: library(fMultivar) set.seed(1234) df <- rnorm2d(1000, rho=.5) df <- as.data.frame(df) plot(df, main="Bivariate Normal Distribution with rho=0.5") The rnorm2d() function in the fMultivar package is used to sample 1,000 observations from a bivariate normal distribution with a correlation of 0.5. The resulting graph is displayed in figure 16.7. Clearly there are no clusters in this data. www.it-ebooks.info 385 Avoiding nonexistent clusters −2 −1 0 V2 1 2 3 Bivariate Normal Distribution with rho=0.5 −3 Figure 16.7 Bivariate normal data (n = 1000). There are no clusters in this data. −3 −2 −1 0 1 2 3 V1 The wssplot() and NbClust() functions are then used to determine the number of clusters present: wssplot(df) library(NbClust) nc <- NbClust(df, min.nc=2, max.nc=15, method="kmeans") dev.new() barplot(table(nc$Best.n[1,]), xlab="Number of Clusters", ylab="Number of Criteria", main="Number of Clusters Chosen by 26 Criteria") 2000 1500 1000 Figure 16.8 Plot of within-groups sums of squares vs. number of k-means clusters for bivariate normal data 500 Within groups sum of squares The results are plotted in figures 16.8 and 16.9. 2 4 6 8 10 Number of Clusters www.it-ebooks.info 12 14 386 Cluster analysis CHAPTER 16 6 2 4 Figure 16.9 Number of clusters recommended for bivariate normal data by criteria in the NbClust package. Two or three clusters are suggested. 0 Number of Criteria 8 Number of Clusters Chosen by 26 Criteria 0 1 2 3 4 5 8 10 12 13 Number of Clusters The wssplot() function suggest that there are three clusters, whereas many of the criteria returned by NbClust() suggest between two and three clusters. If you carry out a two-cluster analysis with PAM, library(ggplot2) library(cluster) fit <- pam(df, k=2) df$clustering <- factor(fit$clustering) ggplot(data=df, aes(x=V1, y=V2, color=clustering, shape=clustering)) + geom_point() + ggtitle("Clustering of Bivariate Normal Data") you get the two-cluster plot shown in figure 16.10. (The ggplot() statement is part of the comprehensive graphics package ggplot2. Chapter 19 covers ggplot2 in detail.) Clustering of Bivariate Normal Data 2 Clustering V2 1 2 0 Figure 16.10 PAM cluster analysis of bivariate normal data, extracting two clusters. Note that the clusters are an arbitrary division of the data. −2 −2 0 2 V1 www.it-ebooks.info 387 −12 −14 −18 −16 CCC −10 −8 Summary 2 4 6 8 10 12 14 Number of clusters Figure 16.11 CCC plot for bivariate normal data. It correctly suggests that no clusters are present. Clearly the partitioning is artificial. There are no real clusters here. How can you avoid this mistake? 
Although it isn’t foolproof, I have found that the Cubic Cluster Criteria (CCC) reported by NbClust can often help to uncover situations where no structure exists. The code is plot(nc$All.index[,4], type="o", ylab="CCC", xlab="Number of clusters", col="blue") and the resulting graph is displayed in figure 16.11. When the CCC values are all negative and decreasing for two or more clusters, the distribution is typically unimodal. The ability of cluster analysis (or your interpretation of it) to find erroneous clusters makes the validation step of cluster analysis important. If you’re trying to identify clusters that are “real” in some sense (rather than a convenient partitioning), be sure the results are robust and repeatable. Try different clustering methods, and replicate the findings with new samples. If the same clusters are consistently recovered, you can be more confident in the results. 16.6 Summary In this chapter, we reviewed some of the most common approaches to clustering observations into cohesive groups. First we reviewed the general steps for a comprehensive cluster analysis. Next, common methods for hierarchical and partitioning clustering were described. Finally, I reinforced the need to validate the resulting clusters in situations where you seek more than convenient partitioning. Cluster analysis is a broad topic, and R has some of the most comprehensive facilities for applying this methodology currently available. To learn more about these capabilities, see the CRAN Task View for Cluster Analysis & Finite Mixture Models (http://cran.r-project.org/web/views/Cluster.html). Additionally, Tan, Steinbach, & Kumar (2006) have an excellent book on data-mining techniques. It contains a lucid www.it-ebooks.info 388 CHAPTER 16 Cluster analysis chapter on cluster analysis that you can freely downloaded (www-users.cs.umn.edu/ ~kumar/dmbook/ch8.pdf). Finally, Everitt, Landau, Leese, & Stahl (2011) have written a practical and highly regarded textbook on this subject. Cluster analysis is a methodology for discovering cohesive subgroups of observations in a dataset. In the next chapter, we’ll consider situations where the groups have already been defined and your goal is to find an accurate method of classifying observations into them. www.it-ebooks.info Classification This chapter covers ■ Classifying with decision trees ■ Ensemble classification with random forests ■ Creating a support vector machine ■ Evaluating classification accuracy Data analysts are frequently faced with the need to predict a categorical outcome from a set of predictor variables. Some examples include ■ ■ ■ Predicting whether an individual will repay a loan, given their demographics and financial history Determining whether an ER patient is having a heart attack, based on their symptoms and vital signs Deciding whether an email is spam, given the presence of key words, images, hypertext, header information, and origin Each of these cases involves the prediction of a binary categorical outcome (good credit risk/bad credit risk, heart attack/no heart attack, spam/not spam) from a set of predictors (also called features). The goal is to find an accurate method of classifying new cases into one of the two groups. 389 www.it-ebooks.info 390 CHAPTER 17 Classification The field of supervised machine learning offers numerous classification methods that can be used to predict categorical outcomes, including logistic regression, decision trees, random forests, support vector machines, and neural networks. 
The first four are discussed in this chapter. Neural networks are beyond the scope of this book. Supervised learning starts with a set of observations containing values for both the predictor variables and the outcome. The dataset is then divided into a training sample and a validation sample. A predictive model is developed using the data in the training sample and tested for accuracy using the data in the validation sample. Both samples are needed because classification techniques maximize prediction for a given set of data. Estimates of their effectiveness will be overly optimistic if they’re evaluated using the same data that generated the model. By applying the classification rules developed on a training sample to a separate validation sample, you can obtain a more realistic accuracy estimate. Once you’ve created an effective predictive model, you can use it to predict outcomes in situations where only the predictor variables are known. In this chapter, you’ll use the rpart, rpart.plot, and party packages to create and visualize decision trees; the randomForest package to fit random forests; and the e1071 package to build support vector machines. Logistic regression will be fit with the glm() function in the base R installation. Before starting, be sure to install the necessary packages: pkgs <- c("rpart", "rpart.plot", "party", "randomForest", "e1071") install.packages(pkgs, depend=TRUE) The primary example used in this chapter comes from the Wisconsin Breast Cancer data originally posted to the UCI Machine Learning Repository. The goal will be to develop a model for predicting whether a patient has breast cancer from the characteristics of a fine-needle tissue aspiration (a tissue sample taken with a thin hollow needle from a lump or mass just under the skin). 17.1 Preparing the data The Wisconsin Breast Cancer dataset is available as a comma-delimited text file on the UCI Machine Learning Server (http://archive.ics.uci.edu/ml). The dataset contains 699 fine-needle aspirate samples, where 458 (65.5%) are benign and 241 (34.5%) are malignant. The dataset contains a total of 11 variables and doesn’t include the variable names in the file. Sixteen samples have missing data and are coded in the text file with a question mark (?). The variables are as follows: ■ ■ ■ ■ ■ ID Clump thickness Uniformity of cell size Uniformity of cell shape Marginal adhesion www.it-ebooks.info Preparing the data ■ ■ ■ ■ ■ ■ 391 Single epithelial cell size Bare nuclei Bland chromatin Normal nucleoli Mitoses Class The first variable is an ID variable (which you’ll drop), and the last variable (class) contains the outcome (coded 2=benign, 4=malignant). For each sample, nine cytological characteristics previously found to correlate with malignancy are also recorded. These variables are each scored from 1 (closest to benign) to 10 (most anaplastic). But no one predictor alone can distinguish between benign and malignant samples. The challenge is to find a set of classification rules that can be used to accurately predict malignancy from some combination of these nine cell characteristics. See Mangasarian and Wolberg (1990) for details. In the following listing, the comma-delimited text file containing the data is downloaded from the UCI repository and randomly divided into a training sample (70%) and a validation sample (30%). 
Listing 17.1 Preparing the breast cancer data loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/" ds <- "breast-cancer-wisconsin/breast-cancer-wisconsin.data" url <- paste(loc, ds, sep="") breast <- read.table(url, sep=",", header=FALSE, na.strings="?") names(breast) <- c("ID", "clumpThickness", "sizeUniformity", "shapeUniformity", "maginalAdhesion", "singleEpithelialCellSize", "bareNuclei", "blandChromatin", "normalNucleoli", "mitosis", "class") df <- breast[-1] df$class <- factor(df$class, levels=c(2,4), labels=c("benign", "malignant")) set.seed(1234) train <- sample(nrow(df), 0.7*nrow(df)) df.train <- df[train,] df.validate <- df[-train,] table(df.train$class) table(df.validate$class) The training sample has 499 cases (329 benign, 160 malignant), and the validation sample has 210 cases (129 benign, 81 malignant). The training sample will be used to create classification schemes using logistic regression, a decision tree, a conditional decision tree, a random forest, and a support vector machine. The validation sample will be used to evaluate the effectiveness of these schemes. By using the same example throughout the chapter, you can compare the results of each approach. www.it-ebooks.info 392 CHAPTER 17 Classification 17.2 Logistic regression Logistic regression is a type of generalized linear model that is often used to predict a binary outcome from a set of numeric variables (see section 13.2 for details). The glm() function in the base R installation is used for fitting the model. Categorical predictors (factors) are automatically replaced with a set of dummy coded variables. All the predictors in the Wisconsin Breast Cancer data are numeric, so dummy coding is unnecessary. The next listing provides a logistic regression analysis of the data. Listing 17.2 Logistic regression with glm() > fit.logit <- glm(class~., data=df.train, family=binomial()) > summary(fit.logit) c b Fits the logistic regression Examines the model Call: glm(formula = class ~ ., family = binomial(), data = df.train) Deviance Residuals: Min 1Q Median -2.7581 -0.1060 -0.0568 3Q 0.0124 Max 2.6432 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -10.4276 1.4760 -7.06 1.6e-12 *** clumpThickness 0.5243 0.1595 3.29 0.0010 ** sizeUniformity -0.0481 0.2571 -0.19 0.8517 shapeUniformity 0.4231 0.2677 1.58 0.1141 maginalAdhesion 0.2924 0.1469 1.99 0.0465 * singleEpithelialCellSize 0.1105 0.1798 0.61 0.5387 bareNuclei 0.3357 0.1072 3.13 0.0017 ** blandChromatin 0.4235 0.2067 2.05 0.0405 * normalNucleoli 0.2889 0.1399 2.06 0.0390 * mitosis 0.6906 0.3983 1.73 0.0829 . --Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 > prob <- predict(fit.logit, df.validate, type="response") > logit.pred <- factor(prob > .5, levels=c(FALSE, TRUE), labels=c("benign", "malignant")) > logit.perf <- table(df.validate$class, logit.pred, dnn=c("Actual", "Predicted")) > logit.perf e d Classifies new cases Evaluates the predictive accuracy Predicted Actual benign malignant benign 118 2 malignant 4 76 First, a logistic regression model is fit using class as the dependent variable and the remaining variables as predictors b. The model is based on the cases in the df.train data frame. The coefficients for the model are displayed next c. Section 13.2 provides guidelines for interpreting logistic model coefficients. www.it-ebooks.info Decision trees 393 Next, the prediction equation developed on the df.train dataset is used to classify cases in the df.validate dataset. 
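Before fitting any models, it can be worth confirming that the random split kept the benign/malignant mix roughly the same in both samples; a quick check using the data frames created in listing 17.1:

# Class proportions in the training and validation samples
prop.table(table(df.train$class))
prop.table(table(df.validate$class))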
By default, the predict() function predicts the log odds of having a malignant outcome. By using the type="response" option, the probability of obtaining a malignant classification is returned instead d. In the next line, cases with probabilities greater than 0.5 are classified into the malignant group and cases with probabilities less than or equal to 0.5 are classified as benign. Finally, a cross-tabulation of actual status and predicted status (called a confusion matrix) is printed e. It shows that 118 cases that were benign were classified as benign, and 76 cases that were malignant were classified as malignant. Ten cases in the df.validate data frame had missing predictor data and could not be included in the evaluation. The total number of cases correctly classified (also called the accuracy) was (76 + 118) / 200 or 97% in the validation sample. Statistics for evaluating the accuracy of a classification scheme are discussed more fully in section 17.4. Before moving on, note that three of the predictor variables (sizeUniformity, shapeUniformity, and singleEpithelialCellSize) have coefficients that don’t differ from zero at the p < .10 level. What, if anything, should you do with predictor variables that have nonsignificant coefficients? In a prediction context, it’s often useful to remove such variables from the final model. This is especially important in situations where a large number of non-informative predictor variables are adding what is essentially noise to the system. In this case, stepwise logistic regression can be used to generate a smaller model with fewer variables. Predictor variables are added or removed in order to obtain a model with a smaller AIC value. In the current context, you could use logit.fit.reduced <- step(fit.logit) to obtain a more parsimonious model. The reduced model excludes the three variables mentioned previously. When used to predict outcomes in the validation dataset, this reduced model makes fewer errors. Try it out. The next approach we’ll consider involves the creation of decision or classification trees. 17.3 Decision trees Decision trees are popular in data-mining contexts. They involve creating a set of binary splits on the predictor variables in order to create a tree that can be used to classify new observations into one of two groups. In this section, we’ll look at two types of decision trees: classical trees and conditional inference trees. 17.3.1 Classical decision trees The process of building a classical decision tree starts with a binary outcome variable (benign/malignant in this case) and a set of predictor variables (the nine cytology measurements). The algorithm is as follows: 1 Choose the predictor variable that best splits the data into two groups such that the purity (homogeneity) of the outcome in the two groups is maximized (that www.it-ebooks.info 394 CHAPTER 17 2 3 4 Classification is, as many benign cases in one group and malignant cases in the other as possible). If the predictor is continuous, choose a cut-point that maximizes purity for the two groups created. If the predictor variable is categorical (not applicable in this case), combine the categories to obtain two groups with maximum purity. Separate the data into these two groups, and continue the process for each subgroup. Repeat steps 1 and 2 until a subgroup contains fewer than a minimum number of observations or no splits decrease the impurity beyond a specified threshold. The subgroups in the final set are called terminal nodes. 
Each terminal node is classified as one category of the outcome or the other based on the most frequent value of the outcome for the sample in that node. To classify a case, run it down the tree to a terminal node, and assign it the modal outcome value assigned in step 3. Unfortunately, this process tends to produce a tree that is too large and suffers from overfitting. As a result, new cases aren’t classified well. To compensate, you can prune back the tree by choosing the tree with the lowest 10-fold cross-validated prediction error. This pruned tree is then used for future predictions. In R, decision trees can be grown and pruned using the rpart() and prune() functions in the rpart package. The following listing creates a decision tree for classifying the cell data as benign or malignant. Listing 17.3 Creating a classical decision tree with rpart() > library(rpart) > set.seed(1234) > dtree <- rpart(class ~ ., data=df.train, method="class", parms=list(split="information")) > dtree$cptable 1 2 3 4 CP nsplit rel error 0.800000 0 1.00000 0.046875 1 0.20000 0.012500 3 0.10625 0.010000 4 0.09375 xerror 1.00000 0.30625 0.20625 0.18125 b Grows the tree xstd 0.06484605 0.04150018 0.03467089 0.03264401 > plotcp(dtree) > dtree.pruned <- prune(dtree, cp=.0125) > library(rpart.plot) > prp(dtree.pruned, type = 2, extra = 104, fallen.leaves = TRUE, main="Decision Tree") c Prunes the tree > dtree.pred <- predict(dtree.pruned, df.validate, type="class") > dtree.perf <- table(df.validate$class, dtree.pred, dnn=c("Actual", "Predicted")) > dtree.perf www.it-ebooks.info d Classifies new cases 395 Decision trees Predicted Actual benign malignant benign 122 7 malignant 2 79 First the tree is grown using the rpart() function b. You can use print(dtree) and summary(dtree) to examine the fitted model (not shown here). The tree may be too large and need to be pruned. In order to choose a final tree size, examine the cptable component of the list returned by rpart(). It contains data about the prediction error for various tree sizes. The complexity parameter (cp) is used to penalize larger trees. Tree size is defined by the number of branch splits (nsplit). A tree with n splits has n + 1 terminal nodes. The rel error column contains the error rate for a tree of a given size in the training sample. The cross-validated error (xerror) is based on 10-fold cross validation (also using the training sample). The xstd column contains the standard error of the crossvalidation error. The plotcp() function plots the cross-validated error against the complexity parameter (see figure 17.1). A good choice for the final tree size is the smallest tree whose cross-validated error is within one standard error of the minimum crossvalidated error value. The minimum cross-validated error is 0.18 with a standard error of 0.0326. In this case, the smallest tree with a cross-validated error within 0.18 ± 0.0326 (that is, between 0.15 and 0.21) is selected. Looking at the cptable table in listing 17.3, a tree with three splits (cross-validated error = 0.20625) fits this requirement. Equivalently, size of tree 2 Inf 0.19 4 5 0.024 0.011 0.8 0.6 0.4 0.2 X−val Relative Error 1.0 1.2 1 cp Figure 17.1 Complexity parameter vs. cross-validated error. The dotted line is the upper limit of the one standard deviation rule (0.18 + 1 * 0.0326 = .21). The plot suggests selecting the tree with the leftmost cp value below the line. 
www.it-ebooks.info 396 CHAPTER 17 Classification you can select the tree size associated with the largest complexity parameter below the line in figure 17.1. Results again suggest a tree with three splits (four terminal nodes). The prune() function uses the complexity parameter to cut back a tree to the desired size. It takes the full tree and snips off the least important splits based on the desired complexity parameter. From the cptable in listing 17.3, a tree with three splits has a complexity parameter of 0.0125, so the statement prune(dtree, cp=0.0125) returns a tree with the desired size c. The prp() function in the rpart.plot package is used to draw an attractive plot of the final decision tree (see figure 17.2). The prp() function has many options (see ?prp for details). The type=2 option draws the split labels below each node. The extra=104 parameter includes the probabilities for each class, along with the percentage of observations in each node. The fallen.leaves=TRUE option displays the terminal nodes at the bottom of the graph. To classify an observation, start at the top of the tree, moving to the left branch if a condition is true or to the right otherwise. Continue moving down the tree until you hit a terminal node. Classify the observation using the label of the node. Decision Tree benign .67 .33 100% yes sizeUnif < 3.5 no benign .93 .07 71% bareNucl < 2.5 malignant .48 .52 9% shapeUni < 2.5 benign .99 .01 62% benign .78 .22 5% malignant .14 .86 4% malignant .05 .95 29% Figure 17.2 Traditional (pruned) decision tree for predicting cancer status. Start at the top of the tree, moving left if a condition is true or right otherwise. When an observation hits a terminal node, it’s classified. Each node contains the probability of the classes in that node, along with the percentage of the sample. www.it-ebooks.info Decision trees 397 Finally, the predict() function is used to classify each observation in the validation sample d. A cross-tabulation of the actual status against the predicted status is provided. The overall accuracy was 96% in the validation sample. Unlike the logistic regression example, all 210 cases in the validation sample could be classified by the final tree. Note that decision trees can be biased toward selecting predictors that have many levels or many missing values. 17.3.2 Conditional inference trees Before moving on to random forests, let’s look at an important variant of the traditional decision tree called a conditional inference tree. Conditional inference trees are similar to traditional trees, but variables and splits are selected based on significance tests rather than purity/homogeneity measures. The significance tests are permutation tests (discussed in chapter 12). In this case, the algorithm is as follows: 1 2 3 4 5 Calculate p-values for the relationship between each predictor and the outcome variable. Select the predictor with the lowest p-value. Explore all possible binary splits on the chosen predictor and dependent variable (using permutation tests), and pick the most significant split. Separate the data into these two groups, and continue the process for each subgroup. Continue until splits are no longer significant or the minimum node size is reached. Conditional inference trees are provided by the ctree() function in the party package. In the next listing, a conditional inference tree is grown for the breast cancer data. 
Listing 17.4 Creating a conditional inference tree with ctree() library(party) fit.ctree <- ctree(class~., data=df.train) plot(fit.ctree, main="Conditional Inference Tree") > ctree.pred <- predict(fit.ctree, df.validate, type="response") > ctree.perf <- table(df.validate$class, ctree.pred, dnn=c("Actual", "Predicted")) > ctree.perf Predicted Actual benign malignant benign 122 7 malignant 3 78 Note that pruning isn’t required for conditional inference trees, and the process is somewhat more automated. Additionally, the party package has attractive plotting www.it-ebooks.info 398 CHAPTER 17 Classification Conditional Inference Tree 1 sizeUniformity p < 0.001 ≤3 >3 2 bareNuclei p < 0.001 ≤5 >5 3 normalNucleoli p < 0.001 ≤3 >3 4 bareNuclei p < 0.001 0 Figure 17.3 0.4 0.2 0 0.4 0.2 0 benign benign 0.6 0.8 Node 9 (n = 142) 1 0.6 0.4 0.2 0 0.8 0.6 malignant 0.2 0.6 0.8 Node 8 (n = 17) 1 malignant 0.4 malignant malignant 0.6 0.8 Node 7 (n = 13) 1 malignant 0.8 Node 6 (n = 20) 1 benign benign Node 5 (n = 297) 1 >2 benign ≤2 0.4 0.2 0 Conditional inference tree for the breast cancer data options. The conditional inference tree is plotted in figure 17.3. The shaded area of each node represents the proportion of malignant cases in that node. Displaying an rpart() tree with a ctree()-like graph If you create a classical decision tree using rpart(), but you’d like to display the resulting tree using a plot like the one in figure 17.3, the partykit package can help. After installing and loading the package, you can use the statement plot(as.party(an.rpart.tree)) to create the desired graph. For example, try creating a graph like figure 17.3 using the dtree.pruned object created in listing 17.3, and compare the results to the plot presented in figure 17.2. The decision trees grown by the traditional and conditional methods can differ substantially. In the current example, the accuracy of each is similar. In the next section, a large number of decision trees are grown and combined in order to classify cases into groups. www.it-ebooks.info 399 Random forests 17.4 Random forests A random forest is an ensemble learning approach to supervised learning. Multiple predictive models are developed, and the results are aggregated to improve classification rates. You can find a comprehensive introduction to random forests, written by Leo Breiman and Adele Cutler, at http://mng.bz/7Nul. The algorithm for a random forest involves sampling cases and variables to create a large number of decision trees. Each case is classified by each decision tree. The most common classification for that case is then used as the outcome. Assume that N is the number of cases in the training sample and M is the number of variables. Then the algorithm is as follows: 1 2 3 4 5 Grow a large number of decision trees by sampling N cases with replacement from the training set. Sample m < M variables at each node. These variables are considered candidates for splitting in that node. The value m is the same for each node. Grow each tree fully without pruning (the minimum node size is set to 1). Terminal nodes are assigned to a class based on the mode of cases in that node. Classify new cases by sending them down all the trees and taking a vote—majority rules. An out-of-bag (OOB) error estimate is obtained by classifying the cases that aren’t selected when building a tree, using that tree. This is an advantage when a validation sample is unavailable. Random forests also provide a natural measure of variable importance, as you’ll see. 
Random forests are grown using the randomForest() function in the randomForest package. The default number of trees is 500, the default number of variables sampled at each node is sqrt(M), and the minimum node size is 1. The following listing provides the code and results for predicting malignancy status in the breast cancer data. Listing 17.5 Random forest > library(randomForest) > set.seed(1234) > fit.forest <- randomForest(class~., data=df.train, na.action=na.roughfix, importance=TRUE) > fit.forest b Call: randomForest(formula = class ~ ., data = df.train, importance = TRUE, na.action = na.roughfix) Type of random forest: classification Number of trees: 500 No. of variables tried at each split: 3 OOB estimate of error rate: 3.68% www.it-ebooks.info Grows the forest 400 CHAPTER 17 Classification Confusion matrix: benign malignant class.error benign 319 10 0.0304 malignant 8 152 0.0500 > importance(fit.forest, type=2) MeanDecreaseGini clumpThickness 12.50 sizeUniformity 54.77 shapeUniformity 48.66 maginalAdhesion 5.97 singleEpithelialCellSize 14.30 bareNuclei 34.02 blandChromatin 16.24 normalNucleoli 26.34 mitosis 1.81 c Determines variable importance > forest.pred <- predict(fit.forest, df.validate) > forest.perf <- table(df.validate$class, forest.pred, dnn=c("Actual", "Predicted")) > forest.perf d Classifies new cases Predicted Actual benign malignant benign 117 3 malignant 1 79 First, the randomForest() function is used to grow 500 traditional decision trees by sampling 489 observations with replacement from the training sample and sampling 3 variables at each node of each tree b. The na.action=na.roughfix option replaces missing values on numeric variables with column medians, and missing values on categorical variables with the modal category for that variable (breaking ties at random). Random forests can provide a natural measure of variable importance, requested with the information=TRUE option, and printed with the importance() function c. The relative importance measure specified by the type=2 option is the total decrease in node impurities (heterogeneity) from splitting on that variable, averaged over all trees. Node impurity is measured with the Gini coefficient. sizeUniformity is the most important variable and mitosis is the least important. Finally, the validation sample is classified using the random forest and the predictive accuracy is calculated d. Note that cases with missing values in the validation sample aren’t classified. The prediction accuracy (98% overall) is good. Whereas the randomForest package provides forests based on traditional decision trees, the cforest() function in the party package can be used to generate random forests based on conditional inference trees. If predictor variables are highly correlated, a random forest using conditional inference trees may provide better predictions. Random forests tend to be very accurate compared with other classification methods. Additionally, they can handle large problems (many observations and variables), can handle large amounts of missing data in the training set, and can handle cases in www.it-ebooks.info 401 Support vector machines which the number of variables is much greater than the number of observations. The provision of OOB error rates and measures of variable importance are also significant advantages. A significant disadvantage is that it’s difficult to understand the classification rules (there are 500 trees!) and communicate them to others. 
Additionally, you need to store the entire forest in order to classify new cases. The final classification model we’ll consider here is the support vector machine, described next. 17.5 Support vector machines Support vector machines (SVMs) are a group of supervised machine-learning models that can be used for classification and regression. They’re popular at present, in part because of their success in developing accurate prediction models, and in part because of the elegant mathematics that underlie the approach. We’ll focus on the use of SVMs for binary classification. SVMs seek an optimal hyperplane for separating two classes in a multidimensional space. The hyperplane is chosen to maximize the margin between the two classes’ closest points. The points on the boundary of the margin are called support vectors (they help define the margin), and the middle of the margin is the separating hyperplane. For an N-dimensional space (that is, with N predictor variables), the optimal hyperplane (also called a linear decision surface) has N – 1 dimensions. If there are two variables, the surface is a line. For three variables, the surface is a plane. For 10 variables, the surface is a 9-dimensional hyperplane. Trying to picture it will give you headache. Consider the two-dimensional example shown in figure 17.4. Circles and triangles represent the two groups. The margin is the gap, represented by the distance between 0 −4 −2 x2 2 4 Linear Separable Features −2 −1 0 1 2 x1 Figure 17.4 Two-group classification problem where the two groups are linearly separable. The separating hyperplane is indicated by the solid black line. The margin is the distance from the line to the dashed line on either side. The filled circles and triangles are the support vectors. www.it-ebooks.info 402 CHAPTER 17 Classification the two dashed lines. The points on the dashed lines (filled circles and triangles) are the support vectors. In the two-dimensional case, the optimal hyperplane is the black line in the middle of the gap. In this idealized example, the two groups are linearly separable—the line can completely separate the two groups without errors. The optimal hyperplane is identified using quadratic programming to optimize the margin under the constraint that the data points on one side have an outcome value of +1 and the data on the other side has an outcome value of -1. If the data points are “almost” separable (not all the points are on one side or the other), a penalizing term is added to the optimization in order to account for errors, and “soft” margins are produced. But the data may be fundamentally nonlinear. Consider the example in figure 17.5. There is no line that can correctly separate the circles and triangles. SVMs use kernel functions to transform the data into higher dimensions, in the hope that they will become more linearly separable. Imagine transforming the data in figure 17.5 in such a way that the circles lift off the page. One way to do this is to transform the twodimensional data into three dimensions using (X,Y) → (X 2, 2 XY,Y 2 ) → (Z1, Z2, Z2) Then you can separate the triangles from the circles using a rigid sheet of paper (that is, a two-dimensional plane in what is now a three-dimensional space). The mathematics of SVMs is complex and well beyond the scope of this book. Statnikov, Aliferis, Hardin, & Guyon (2011) offer a lucid and intuitive presentation of SVMs that goes into quite a bit of conceptual detail without getting bogged down in higher math. 
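You can see the effect of this transformation with a few lines of code. The sketch below uses simulated data (not the breast cancer data) arranged as an inner circle and an outer ring; after mapping (x, y) to (x^2, sqrt(2)*x*y, y^2) = (z1, z2, z3), the two groups separate cleanly because z1 + z3 is the squared distance from the origin:

set.seed(42)
r <- c(runif(50, 0, 1), runif(50, 2, 3))    # inner circle vs. outer ring
theta <- runif(100, 0, 2*pi)
x <- r*cos(theta); y <- r*sin(theta)
group <- rep(c("inner", "outer"), each=50)
z1 <- x^2; z2 <- sqrt(2)*x*y; z3 <- y^2     # the three new features
tapply(z1 + z3, group, range)               # the two ranges no longer overlap

In the original two dimensions no straight line separates the groups, but in the new three-dimensional space a flat plane (here, any cutoff on z1 + z3 between 1 and 4) does.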
−3 −2 −1 Y 0 1 2 3 Features are not Linearly Separable −3 −2 −1 0 1 2 3 X Figure 17.5 Two-group classification problem where the two groups aren’t linearly separable. The groups can’t be separated with a hyperplane (line). www.it-ebooks.info Support vector machines 403 SVMs are available in R using the ksvm() function in the kernlab package and the svm() function in the e1071 package. The former is more powerful, but the latter is a bit easier to use. The example in the next listing uses the latter (easy is good) to develop an SVM for the Wisconsin breast cancer data. Listing 17.6 A support vector machine > > > > library(e1071) set.seed(1234) fit.svm <- svm(class~., data=df.train) fit.svm Call: svm(formula = class ~ ., data = df.train) Parameters: SVM-Type: SVM-Kernel: cost: gamma: C-classification radial 1 0.1111 Number of Support Vectors: 76 > svm.pred <- predict(fit.svm, na.omit(df.validate)) > svm.perf <- table(na.omit(df.validate)$class, svm.pred, dnn=c("Actual", "Predicted")) > svm.perf Predicted Actual benign malignant benign 116 4 malignant 3 77 Because predictor variables with larger variances typically have a greater influence on the development of SVMs, the svm() function scales each variable to a mean of 0 and standard deviation of 1 before fitting the model by default. As you can see, the predictive accuracy is good, but not quite as good as that found for the random forest approach in section 17.2. Unlike the random forest approach, the SVM is also unable to accommodate missing predictor values when classifying new cases. 17.5.1 Tuning an SVM By default, the svm() function uses a radial basis function (RBF) to map samples into a higher-dimensional space (the kernel trick). The RBF kernel is often a good choice because it’s a nonlinear mapping that can handle relations between class labels and predictors that are nonlinear. When fitting an SVM with the RBF kernel, two parameters can affect the results: gamma and cost. Gamma is a kernel parameter that controls the shape of the separating hyperplane. Larger values of gamma typically result in a larger number of support vectors. Gamma can also be thought of as a parameter that controls how widely a www.it-ebooks.info 404 CHAPTER 17 Classification training sample “reaches,” with larger values meaning far and smaller values meaning close. Gamma must be greater than zero. The cost parameter represents the cost of making errors. A large value severely penalizes errors and leads to a more complex classification boundary. There will be less misclassifications in the training sample, but over-fitting may result in poor predictive ability in new samples. Smaller values lead to a flatter classification boundary but may result in under-fitting. Like gamma, cost is always positive. By default, the svm() function sets gamma to 1 / (number of predictors) and cost to 1. But a different combination of gamma and cost may lead to a more effective model. You can try fitting SVMs by varying parameter values one at a time, but a grid search is more efficient. You can specify a range of values for each parameter using the tune.svm() function. tune.svm() fits every combination of values and reports on the performance of each. An example is given next. 
Listing 17.7 Tuning an RBF support vector machine > set.seed(1234) > tuned <- tune.svm(class~., data=df.train, gamma=10^(-6:1), cost=10^(-10:10)) > tuned - sampling method: 10-fold cross validation b c Varies the parameters Prints the best model - best parameters: gamma cost 0.01 1 Fits the model with these parameters - best performance: 0.02904 > fit.svm <- svm(class~., data=df.train, gamma=.01, cost=1) > svm.pred <- predict(fit.svm, na.omit(df.validate)) > svm.perf <- table(na.omit(df.validate)$class, svm.pred, dnn=c("Actual", "Predicted")) > svm.perf e d Evaluates the cross-validation performance Predicted Actual benign malignant benign 117 3 malignant 3 77 First, an SVM model is fit with an RBF kernel and varying values of gamma and cost b. Eight values of gamma (ranging from 0.000001 to 10) and 21 values of cost (ranging from .01 to 10000000000) are specified. In all, 168 models (8 × 21) are fit and compared. The model with the fewest 10-fold cross validated errors in the training sample has gamma = 0.01 and cost = 1. Using these parameter values, a new SVM is fit to the training sample d. The model is then used to predict outcomes in the validation sample e, and the number of errors is displayed. Tuning the model c decreased the number of errors slightly (from seven to six). In many cases, tuning the SVM parameters will lead to greater gains. www.it-ebooks.info Choosing a best predictive solution 405 As stated previously, SVMs are popular because they work well in many situations. They can also handle situations in which the number of variables is much larger than the number of observations. This has made them popular in the field of biomedicine, where the number of variables collected in a typical DNA microarray study of gene expressions may be one or two orders of magnitude larger than the number of cases available. One drawback of SVMs is that, like random forests, the resulting classification rules are difficult to understand and communicate. They’re essentially a black box. Additionally, SVMs don’t scale as well as random forests when building models from large training samples. But once a successful model is built, classifying new observations does scale well. 17.6 Choosing a best predictive solution In sections 17.1 through 17.3, fine-needle aspiration samples were classified as malignant or benign using several supervised machine-learning techniques. Which approach was most accurate? To answer this question, we need to define the term accurate in a binary classification context. The most commonly reported statistic is the accuracy, or how often the classifier is correct. Although informative, the accuracy is insufficient by itself. Additional information is also needed to evaluate the utility of a classification scheme. Consider a set of rules for classifying individuals as schizophrenic or non-schizophrenic. Schizophrenia is a rare disorder, with a prevalence of roughly 1% in the general population. If you classify everyone as non-schizophrenic, you’ll be right 99% of time. But this isn’t a good classifier because it will also misclassify every schizophrenic as non-schizophrenic. In addition to the accuracy, you should ask these questions: ■ ■ ■ ■ What percentage of schizophrenics are correctly identified? What percentage of non-schizophrenics are correctly identified? If a person is classified as schizophrenic, how likely is it that this classification will be correct? If a person is classified as non-schizophrenic, how likely is it that this classification is correct? 
These are questions pertaining to a classifier’s sensitivity, specificity, positive predictive power, and negative predictive power. Each is defined in table 17.1. Table 17.1 Measures of predictive accuracy Statistic Interpretation Sensitivity Probability of getting a positive classification when the true outcome is positive (also called true positive rate or recall) Specificity Probability of getting a negative classification when the true outcome is negative (also called true negative rate) www.it-ebooks.info 406 CHAPTER 17 Table 17.1 Classification Measures of predictive accuracy (continued) Statistic Interpretation Positive predictive value Probability that an observation with a positive classification is correctly identified as positive (also called precision) Negative predictive value Probability that an observation with a negative classification is correctly identified as negative Accuracy Proportion of observations correctly identified (also called ACC) A function for calculating these statistics is provided next. Listing 17.8 Function for assessing binary classification accuracy performance <- function(table, n=2){ if(!all(dim(table) == c(2,2))) stop("Must be a 2 x 2 table") tn = table[1,1] fp = table[1,2] Extracts frequencies fn = table[2,1] tp = table[2,2] sensitivity = tp/(tp+fn) specificity = tn/(tn+fp) ppp = tp/(tp+fp) Calculates statistics npp = tn/(tn+fn) hitrate = (tp+tn)/(tp+tn+fp+fn) result <- paste("Sensitivity = ", round(sensitivity, n) , "\nSpecificity = ", round(specificity, n), "\nPositive Predictive Value = ", round(ppp, n), "\nNegative Predictive Value = ", round(npp, n), "\nAccuracy = ", round(hitrate, n), "\n", sep="") cat(result) } b c d Prints results The performance() function takes a table containing the true outcome (rows) and predicted outcome (columns) and returns the five accuracy measures. First, the number of true negatives (benign tissue identified as benign), false positives (benign tissue identified as malignant), false negatives (malignant tissue identified as benign), and true positives (malignant tissue identified as malignant) are extracted b. Next, these counts are used to calculate the sensitivity, specificity, positive and negative predictive values, and accuracy c. Finally, the results are formatted and printed d. In the following listing, the performance() function is applied to each of the five classifiers developed in this chapter. Listing 17.9 Performance of breast cancer data classifiers > performance(logit.perf) Sensitivity = 0.95 Specificity = 0.98 Positive Predictive Value = 0.97 www.it-ebooks.info Choosing a best predictive solution 407 Negative Predictive Value = 0.97 Accuracy = 0.97 > performance(dtree.perf) Sensitivity = 0.98 Specificity = 0.95 Positive Predictive Power = 0.92 Negative Predictive Power = 0.98 Accuracy = 0.96 > performance(ctree.perf) Sensitivity = 0.96 Specificity = 0.95 Positive Predictive Value = 0.92 Negative Predictive Value = 0.98 Accuracy = 0.95 > performance(forest.perf) Sensitivity = 0.99 Specificity = 0.98 Positive Predictive Value = 0.96 Negative Predictive Value = 0.99 Accuracy = 0.98 > performance(svm.perf) Sensitivity = 0.96 Specificity = 0.98 Positive Predictive Value = 0.96 Negative Predictive Value = 0.98 Accuracy = 0.97 Each of these classifiers (logistic regression, traditional decision tree, conditional inference tree, random forest, and support vector machine) performed exceedingly well on each of the accuracy measures. This won’t always be the case! 
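As a quick check, the measures reported for the random forest can be reproduced by hand from the forest.perf table in listing 17.5 (117 true negatives, 3 false positives, 1 false negative, and 79 true positives):

tn <- 117; fp <- 3; fn <- 1; tp <- 79
tp/(tp + fn)                     # sensitivity: 79/80 = 0.9875
tn/(tn + fp)                     # specificity: 117/120 = 0.975
tp/(tp + fp)                     # positive predictive value: 79/82 = 0.963
tn/(tn + fn)                     # negative predictive value: 117/118 = 0.992
(tp + tn)/(tp + tn + fp + fn)    # accuracy: 196/200 = 0.98

These agree, after rounding, with the performance(forest.perf) output above.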
In this particular instance, the award appears to go to the random forest model (although the differences are so small, they may be due to chance). For the random forest model, 99% of malignancies were correctly identified, 98% of benign samples were correctly identified, and the overall percent of correct classifications is 99%. A diagnosis of malignancy was correct 96% of the time (for a 4% false positive rate), and a benign diagnosis was correct 99% of the time (for a 1% false negative rate). For diagnoses of cancer, the specificity (proportion of malignant samples correctly identified as malignant) is particularly important. Although it’s beyond the scope of this chapter, you can often improve a classification system by trading specificity for sensitivity and vice versa. In the logistic regression model, predict() was used to estimate the probability that a case belonged in the malignant group. If the probability was greater than 0.5, the case was assigned to that group. The 0.5 value is called the threshold or cutoff value. If you vary this threshold, you can increase the sensitivity of the classification model at the expense of its specificity. predict() can generate probabilities for decision trees, random forests, and SVMs as well (although the syntax varies by method). www.it-ebooks.info 408 CHAPTER 17 Classification The impact of varying the threshold value is typically assessed using a receiver operating characteristic (ROC) curve. A ROC curve plots sensitivity versus specificity for a range of threshold values. You can then select a threshold with the best balance of sensitivity and specificity for a given problem. Many R packages generate ROC curves, including ROCR and pROC. Analytic functions in these packages can help you to select the best threshold values for a given scenario or to compare the ROC curves produced by different classification algorithms in order to choose the most useful approach. To learn more, see Kuhn & Johnson (2013). A more advanced discussion is offered by Fawcett (2005). Until now, each classification technique has been applied to data by writing and executing code. In the next section, we’ll look at a graphical user interface that lets you develop and deploy predictive models using a visual interface. 17.7 Using the rattle package for data mining Rattle (R Analytic Tool to Learn Easily) offers a graphic user interface (GUI) for data mining in R. It gives the user point-and-click access to many of the R functions you’ve been using in this chapter, as well as other unsupervised and supervised data models not covered here. Rattle also supports the ability to transform and score data, and it offers a number of data-visualization tools for evaluating models. You can install the rattle package from CRAN using install.packages("rattle") This installs the rattle package, along with several additional packages. A full installation of Rattle and all the packages it can access would require downloading and installing hundreds of packages. To save time and space, a basic set of packages is installed by default. Other packages are installed when you first request an analysis that requires them. In this case, you’ll be prompted to install the missing package(s), and if you reply Yes, the required package will be downloaded and installed from CRAN. Depending on your operating system and current software, you may have to install additional software. In particular, Rattle requires access to the GTK+ Toolkit. 
If you have difficulty, follow the OS-specific installation directions and troubleshooting suggestions offered at http://rattle.togaware.com. Once rattle is installed, launch the interface using library(rattle) rattle() The GUI (see figure 17.6) should open on top of the R console. In this section, you’ll use Rattle to develop a conditional inference tree for predicting diabetes. The data also comes from the UCI Machine Learning Repository. The Pima Indians Diabetes dataset contains 768 cases originally collected by the National Institute of Diabetes and Digestive and Kidney Disease. The variables are as follows: ■ ■ Number of times pregnant Plasma glucose concentration at 2 hours in an oral glucose tolerance test www.it-ebooks.info Using the rattle package for data mining ■ ■ ■ ■ ■ ■ ■ 409 Diastolic blood pressure (mm Hg) Triceps skin fold thickness (mm) 2-hour serum insulin (mu U/ml) Body mass index (weight in kg/(height in m)^2) Diabetes pedigree function Age (years) Class variable (0 = non-diabetic or 1 = diabetic) Thirty-four percent of the sample was diagnosed with diabetes. To access this data in Rattle, use the following code: loc <- "http://archive.ics.uci.edu/ml/machine-learning-databases/" ds <- "pima-indians-diabetes/pima-indians-diabetes.data" url <- paste(loc, ds, sep="") diabetes <- read.table(url, sep=",", header=FALSE) names(diabetes) <- c("npregant", "plasma", "bp", "triceps", "insulin", "bmi", "pedigree", "age", "class") diabetes$class <- factor(diabetes$class, levels=c(0,1), labels=c("normal", "diabetic")) library(rattle) rattle() This downloads the data from the UCI repository, names the variables, adds labels to the outcome variable, and opens Rattle. You should be presented with the tabbed dialog box in figure 17.6. Figure 17.6 Opening Rattle screen www.it-ebooks.info 410 CHAPTER 17 Figure 17.7 Classification Data tab with options to specify the role of each variable To access the diabetes dataset, click the R Dataset radio button, and select Diabetes from the drop-down box that appears. Then click the Execute button in the upper-left corner. This opens the window shown in figure 17.7. This window provides a description of each variable and allows you to specify the role each will play in the analyses. Here, variables 1–9 are input (predictor) variables, and class is the target (or predicted) outcome, so no changes are necessary. You can also specify the percentage of cases to be used as a training sample, validation sample, and testing sample. Analysts frequently build models with a training sample, fine-tune parameters with a validation sample, and evaluate the results with a testing sample. By default, Rattle uses a 70/15/15 split and a seed value of 42. You’ll divide the data into training and validation samples, skipping the test sample. Therefore, enter 70/30/0 in the Partition text box and 1234 in the Seed text box, and click Execute again. Now let’s fit a prediction model. To generate a conditional inference tree, select the Model tab. Be sure the Tree radio button is selected (the default); and for Algorithm, choose the Conditional radio button. Clicking Execute builds the model using the ctree() function in the party package and displays the results in the bottom of the window (see figure 17.8). Clicking the Draw button produces an attractive graph (see figure 17.9). (Hint: specifying Use Cairo Graphics in the Settings menu before clicking Draw often produces a more attractive plot.) 
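If you prefer to work from the command line, roughly the same model can be fit directly. The sketch below mirrors the point-and-click steps (a 70/30 training/validation split with seed 1234, followed by a conditional inference tree); the exact code Rattle generates is recorded in its Log tab and may differ in detail, and the object names (d.train, fit.diab) are arbitrary:

set.seed(1234)
train <- sample(nrow(diabetes), 0.7*nrow(diabetes))
d.train <- diabetes[train, ]
d.validate <- diabetes[-train, ]
library(party)
fit.diab <- ctree(class ~ ., data=d.train)
plot(fit.diab, main="Conditional Inference Tree")
ctree.pred <- predict(fit.diab, d.validate, type="response")
table(d.validate$class, ctree.pred, dnn=c("Actual", "Predicted"))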
www.it-ebooks.info 411 Using the rattle package for data mining Figure 17.8 Model tab with options to build decision trees, random forests, support vector machines, and more. Here, a conditional inference tree has been fitted to the training data. 1 plasma p < 0.001 ≤ 123 > 123 2 npregant p < 0.001 ≤6 7 plasma p < 0.001 >6 ≤ 157 3 age p = 0.001 ≤ 34 > 157 9 age p = 0.012 > 34 ≤ 59 > 59 Figure 17.9 sample 0.2 0 0.4 0.2 0 0.6 0.4 0.2 0 0.8 0.6 0.4 0.2 0 normal 0.8 normal normal normal 0.6 diabetic 0 0.4 0.8 diabetic 0.2 0.6 diabetic 0.4 0.8 diabetic 0.6 normal 0.8 diabetic diabetic normal Node 4 (n = 216) Node 5 (n = 46) Node 6 (n = 50) Node 8 (n = 148) Node 10 (n = 68) Node 11 (n = 9) 1 1 1 1 1 1 0.8 0.6 0.4 0.2 0 Tree diagram for the conditional inference tree using the diabetes training www.it-ebooks.info 412 CHAPTER 17 Classification Figure 17.10 Evaluation tab with the error matrix for the conditional inference tree calculated on the validation sample To evaluate the fitted model, select the Evaluate tab. Here you can specify a number of evaluative criteria and the sample (training, validation) to use. By default, the error matrix (also called a confusion matrix in this chapter) is selected. Clicking Execute produces the results shown in figure 17.10. You can import the error matrix into the performance() function to obtain the accuracy statistics: > cv <- matrix(c(145, 50, 8, 27), nrow=2) > performance(as.table(cv)) Sensitivity = 0.35 Specificity = 0.95 Positive Predictive Value = 0.77 Negative Predictive Value = 0.74 Accuracy = 0.75 Although the overall accuracy (75%) isn’t terrible, only 35% of diabetics were correctly identified. Try to develop a better classification scheme using random forests or support vector machines—it can be done. A significant advantage of using Rattle is the ability to fit multiple models to the same dataset and compare each model directly on the Evaluate tab. Check each method on this tab that you want to compare, and click Execute. Additionally, all the www.it-ebooks.info Summary 413 R code executed during the data-mining session can be viewed in the Log tab and exported to a text file for reuse. To learn more, visit the Rattle homepage (http://rattle.togaware.com/), and see Graham J. Williams’ overview article in the R journal (http://mng.bz/D16Q). Data Mining with Rattle and R, also by Williams (2011), is the definitive book on Rattle. 17.8 Summary This chapter presented a number of machine-learning techniques for classifying observations into one of two groups. First, the use of logistic regression as a classification tool was described. Next, traditional decision trees were described, followed by conditional inference trees. The ensemble random forest approach was considered next. Finally, the increasingly popular support vector machine approach was described. The last section introduced Rattle, a graphic user interface for data mining, which allows the user point-and-click access to these functions. Rattle can be particularly useful for comparing the results of various classification techniques. Because it generates reusable R code in a log file, it can also be a useful tool for learning the syntax of many of R’s predictive analytics functions. The techniques described in this chapter vary in complexity. Data miners typically try some of the simpler approaches (logistic regression, decision trees) and more complex, black-box approaches (random forests, support vector machines). 
If the black-box approaches don’t provide a significant improvement over the simpler methods, the simpler methods are usually selected for deployment. The examples in this chapter (cancer and diabetes diagnosis) both came from the field of medicine, but classification techniques are used widely in other disciplines, including computer science, marketing, finance, economics, and the behavioral sciences. Although the examples involved a binary classification (malignant/benign, diabetic/non-diabetic), modifications are available that allow these techniques to be used with multigroup classification problems. To learn more about the functions in R that support classification, look in the CRAN Task View for Machine Learning and Statistical Learning (http://mng.bz/ I1Lm). Other good resources include books by Kuhn & Johnson (2013) and Torgo (2010). www.it-ebooks.info Advanced methods for missing data This chapter covers ■ Identifying missing data ■ Visualizing missing data patterns ■ Complete-case analysis ■ Multiple imputation of missing data In previous chapters, we focused on analyzing complete datasets (that is, datasets without missing values). Although doing so helps simplify the presentation of statistical and graphical methods, in the real world, missing data are ubiquitous. In some ways, the impact of missing data is a subject that most of us want to avoid. Statistics books may not mention it or may limit discussion to a few paragraphs. Statistical packages offer automatic handling of missing data using methods that may not be optimal. Even though most data analyses (at least in social sciences) involve missing data, this topic is rarely mentioned in the methods and results sections of journal articles. Given how often missing values occur, and the degree to which their presence can invalidate study results, it’s fair to say that the subject has received insufficient attention outside of specialized books and courses. 414 www.it-ebooks.info Steps in dealing with missing data 415 Data can be missing for many reasons. Survey participants may forget to answer one or more questions, refuse to answer sensitive questions, or grow fatigued and fail to complete a long questionnaire. Study participants may miss appointments or drop out of a study prematurely. Recording equipment may fail, internet connections may be lost, or data may be miscoded. Analysts may even plan for some data to be missing. For example, to increase study efficiency or reduce costs, you may choose not to collect all data from all participants. Finally, data may be lost for reasons that you’re never able to ascertain. Unfortunately, most statistical methods assume that you’re working with complete matrices, vectors, and data frames. In most cases, you have to eliminate missing data before you address the substantive questions that led you to collect the data. You can eliminate missing data by (1) removing cases with missing data or (2) replacing missing data with reasonable substitute values. In either case, the end result is a dataset without missing values. In this chapter, we’ll look at both traditional and modern approaches for dealing with missing data. We’ll primarily use the VIM and mice packages. The command install.packages(c("VIM", "mice")) will download and install both. To motivate the discussion, we’ll look at the mammal sleep dataset (sleep) provided in the VIM package (not to be confused with the sleep dataset describing the impact of drugs on sleep provided in the base installation). 
The data come from a study by Allison and Chichetti (1976) that examined the relationship between sleep and ecological and constitutional variables for 62 mammal species. The authors were interested in why animals’ sleep requirements vary from species to species. The sleep variables served as the dependent variables, whereas the ecological and constitutional variables served as the independent or predictor variables. Sleep variables included length of dreaming sleep (Dream), nondreaming sleep (NonD), and their sum (Sleep). The constitutional variables included body weight in kilograms (BodyWgt), brain weight in grams (BrainWgt), life span in years (Span), and gestation time in days (Gest). The ecological variables included degree to which species were preyed upon (Pred), degree of their exposure while sleeping (Exp), and overall danger (Danger) faced. The ecological variables were measured on 5-point rating scales that ranged from 1 (low) to 5 (high). In their original article, Allison and Chichetti limited their analyses to the species that had complete data. We’ll go further, analyzing all 62 cases using a multiple imputation approach. 18.1 Steps in dealing with missing data If you’re new to the study of missing data, you’ll find a bewildering array of approaches, critiques, and methodologies. The classic text in this area is Little and Rubin (2002). Excellent, accessible reviews can be found in Allison (2001); Schafer www.it-ebooks.info 416 CHAPTER 18 Advanced methods for missing data and Graham (2002); and Schlomer, Bauman, and Card (2010). A comprehensive approach usually includes the following steps: 1 2 3 Identify the missing data. Examine the causes of the missing data. Delete the cases containing missing data, or replace (impute) the missing values with reasonable alternative data values. Unfortunately, identifying missing data is usually the only unambiguous step. Learning why data are missing depends on your understanding of the processes that generated the data. Deciding how to treat missing values will depend on your estimation of which procedures will produce the most reliable and accurate results. A classification system for missing data Statisticians typically classify missing data into one of three types. These types are usually described in probabilistic terms, but the underlying ideas are straightforward. We’ll use the measurement of dreaming in the sleep study (where 12 animals have missing values) to illustrate each type in turn: ■ ■ ■ Missing completely at random—If the presence of missing data on a variable is unrelated to any other observed or unobserved variable, then the data are missing completely at random (MCAR). If there’s no systematic reason why dream sleep is missing for these 12 animals, the data are said to be MCAR. Note that if every variable with missing data is MCAR, you can consider the complete cases to be a simple random sample from the larger dataset. Missing at random—If the presence of missing data on a variable is related to other observed variables but not to its own unobserved value, the data are missing at random (MAR). For example, if animals with smaller body weights are more likely to have missing values for dream sleep (perhaps because it’s harder to observe smaller animals), and the “missingness” is unrelated to an animal’s time spent dreaming, the data are considered MAR. In this case, the presence or absence of dream sleep data is random, once you control for body weight. 
Not missing at random—If the missing data for a variable are neither MCAR nor MAR, the data are not missing at random (NMAR). For example, if animals that spend less time dreaming are also more likely to have a missing dream value (perhaps because it’s harder to measure shorter events), the data are considered NMAR. Most approaches to missing data assume that the data are either MCAR or MAR. In this case, you can ignore the mechanism producing the missing data and (after replacing or deleting the missing data) model the relationships of interest directly. Data that are NMAR can be difficult to analyze properly. When data are NMAR, you have to model the mechanisms that produced the missing values, as well as the relationships of interest. (Current approaches to analyzing NMAR data include the use of selection models and pattern mixtures. The analysis of NMAR data can be complex and is beyond the scope of this book.) www.it-ebooks.info 417 Identifying missing values Identify Missing Values is.na() !complete.cases() VIM package Delete Missing Values Maximum Likelihood Estimation Impute Missing Values mvmle package Casewise (Listwise) omit.na() Figure 18.1 Available Case (Pairwise) Single (simple) Imputation Option available for some functions Hmisc Package Multiple Imputation mi package mice package amelia package mitools package Methods for handling incomplete data, along with the R packages that support them There are many methods for dealing with missing data—and no guarantee that they’ll produce the same results. Figure 18.1 describes an array of methods used for handling incomplete data and the R packages that support them. A complete review of missing-data methodologies would require a book in itself. In this chapter, we’ll review methods for exploring missing-values patterns and focus on the three most popular methods for dealing with incomplete data (a rational approach, listwise deletion, and multiple imputation). We’ll end the chapter with a brief discussion of other methods, including those that are useful in special circumstances. 18.2 Identifying missing values To begin, let’s review the material introduced in section 4.5, and expand on it. R represents missing values using the symbol NA (not available) and impossible values using the symbol NaN (not a number). In addition, the symbols Inf and -Inf represent positive infinity and negative infinity, respectively. The functions is.na(), is.nan(), and is.infinite() can be used to identify missing, impossible, and infinite values, respectively. Each returns either TRUE or FALSE. Examples are given in table 18.1. Table 18.1 Examples of return values for the is.na(), is.nan(), and is.infinite() functions x is.na(x) is.nan(x) is.infinite(x) x <- NA TRUE FALSE FALSE x <- 0 / 0 TRUE TRUE FALSE x <- 1 / 0 FALSE FALSE TRUE www.it-ebooks.info 418 CHAPTER 18 Advanced methods for missing data These functions return an object that’s the same size as its argument, with each element replaced by TRUE if the element is of the type being tested or FALSE otherwise. For example, let y <- c(1, 2, 3, NA). Then is.na(y) will return the vector c(FALSE, FALSE, FALSE, TRUE). The function complete.cases() can be used to identify the rows in a matrix or data frame that don’t contain missing data. It returns a logical vector with TRUE for every row that contains complete cases and FALSE for every row that has one or more missing values. 
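The values in table 18.1, and the behavior of complete.cases(), are easy to verify at the console. This quick check uses a throwaway vector and data frame (the names are illustrative only):

y <- c(1, 2, 3, NA)
is.na(y)                                  # FALSE FALSE FALSE TRUE
c(is.nan(0/0), is.infinite(1/0))          # TRUE TRUE
toy <- data.frame(a=c(1, NA, 3), b=c("x", "y", NA))
complete.cases(toy)                       # TRUE FALSE FALSE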
Let’s apply this to the sleep dataset: # load the dataset data(sleep, package="VIM") # list the rows that do not have missing values sleep[complete.cases(sleep),] # list the rows that have one or more missing values sleep[!complete.cases(sleep),] Examining the output reveals that 42 cases have complete data and 20 cases have one or more missing values. Because the logical values TRUE and FALSE are equivalent to the numeric values 1 and 0, the sum() and mean() functions can be used to obtain useful information about missing data. Consider the following: > sum(is.na(sleep$Dream)) [1] 12 > mean(is.na(sleep$Dream)) [1] 0.19 > mean(!complete.cases(sleep)) [1] 0.32 The results indicate that 12 values are missing for the variable Dream. Nineteen percent of the cases have a missing value on this variable. In addition, 32% of the cases in the dataset have one or more missing values. There are two things to keep in mind when identifying missing values. First, the complete.cases() function only identifies NA and NaN as missing. Infinite values (Inf and –Inf) are treated as valid values. Second, you must use missing-values functions, like those in this section, to identify the missing values in R data objects. Logical comparisons such as myvar == NA are never true. Now that you know how to identify missing values programmatically, let’s look at tools that help you explore possible patterns in the occurrence of missing data. 18.3 Exploring missing-values patterns Before deciding how to deal with missing data, you’ll find it useful to determine which variables have missing values, in what amounts, and in what combinations. In this section, we’ll review tabular, graphical, and correlational methods for exploring missing www.it-ebooks.info Exploring missing-values patterns 419 values patterns. Ultimately, you want to understand why the data are missing. The answer will have implications for how you proceed with further analyses. 18.3.1 Tabulating missing values You’ve already seen a rudimentary approach to identifying missing values. You can use the complete.cases() function from section 18.2 to list cases that are complete or, conversely, list cases that have one or more missing values. As the size of a dataset grows, though, it becomes a less attractive approach. In this case, you can turn to other R functions. The md.pattern() function in the mice package produces a tabulation of the missing data patterns in a matrix or data frame. Applying this function to the sleep dataset, you get the following: > library(mice) > data(sleep, package="VIM") > md.pattern(sleep) BodyWgt BrainWgt Pred Exp Danger Sleep Span Gest Dream NonD 42 1 1 1 1 1 1 1 1 1 1 0 2 1 1 1 1 1 1 0 1 1 1 1 3 1 1 1 1 1 1 1 0 1 1 1 9 1 1 1 1 1 1 1 1 0 0 2 2 1 1 1 1 1 0 1 1 1 0 2 1 1 1 1 1 1 1 0 0 1 1 2 2 1 1 1 1 1 0 1 1 0 0 3 1 1 1 1 1 1 1 0 1 0 0 3 0 0 0 0 0 4 4 4 12 14 38 The 1s and 0s in the body of the table indicate the missing-values patterns, with a 0 indicating a missing value for a given column variable and a 1 indicating a nonmissing value. The first row describes the pattern of “no missing values” (all elements are 1). The second row describes the pattern “no missing values except for Span.” The first column indicates the number of cases in each missing data pattern, and the last column indicates the number of variables with missing values present in each pattern. Here you can see that there are 42 cases without missing data and 2 cases that are missing Span alone. Nine cases are missing both NonD and Dream values. 
The dataset has a total of (42 × 0) + (2 × 1) + … + (1 × 3) = 38 missing values. The last row gives the total number of missing values on each variable. 18.3.2 Exploring missing data visually Although the tabular output from the md.pattern() function is compact, I often find it easier to discern patterns visually. Luckily, the VIM package provides numerous functions for visualizing missing-values patterns in datasets. In this section, we’ll review several, including aggr(), matrixplot(), and scattMiss(). The aggr() function plots the number of missing values for each variable alone and for each combination of variables. For example, the code library("VIM") aggr(sleep, prop=FALSE, numbers=TRUE) www.it-ebooks.info 420 Advanced methods for missing data 14 CHAPTER 18 12 1 8 Combinations 2 6 Number of Missings 10 1 2 2 4 3 2 9 Figure 18.2 BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger 0 42 aggr()-produced plot of missing-values patterns for the sleep dataset produces the graph in figure 18.2. (The VIM package opens a GUI interface. You can close it; you’ll be using code to accomplish the tasks in this chapter.) You can see that the variable NonD has the largest number of missing values (14), and that two mammals are missing NonD, Dream, and Sleep scores. Forty-two mammals have no missing data. The statement aggr(sleep, prop=TRUE, numbers=TRUE) produces the same plot, but proportions are displayed instead of counts. The option numbers=FALSE (the default) suppresses the numeric labels. The matrixplot() function produces a plot displaying the data for each case. A graph created using matrixplot(sleep) is displayed in figure 18.3. Here, the numeric data are rescaled to the interval [0, 1] and represented by grayscale colors, with lighter colors representing lower values and darker colors representing larger values. By default, missing values are represented in red. Note that in figure 18.3, red has been replaced with crosshatching by hand, so that the missing values are viewable in grayscale. It will look different when you create the graph yourself. The graph is interactive: clicking a column re-sorts the matrix by that variable. The rows in figure 18.3 are sorted in descending order by BodyWgt. A matrix plot allows www.it-ebooks.info Exploring missing-values patterns 421 Figure 18.3 Matrix plot of actual and missing values by case (row) for the sleep dataset. The matrix is sorted by BodyWgt. you to see if the fact that values are missing on one or more variables is related to the actual values of other variables. Here, you can see that there are no missing values on sleep variables (Dream, NonD, Sleep) for low values of body or brain weight (BodyWgt, BrainWgt). The marginplot() function produces a scatter plot between two variables with information about missing values shown in the plot’s margins. Consider the relationship between the amount of dream sleep and the length of a mammal’s gestation. The statement marginplot(sleep[c("Gest","Dream")], pch=c(20), col=c("darkgray", "red", "blue")) produces the graph in figure 18.4. The pch and col parameters are optional and provide control over the plotting symbols and colors used. The body of the graph displays the scatter plot between Gest and Dream (based on complete cases for the two variables). In the left margin, box plots display the distribution of Dream for mammals with (dark gray) and without (red) Gest values. (Note that in grayscale, red is the darker shade.) 
Four red dots represent the values of Dream for mammals missing Gest scores. In the bottom margin, the roles of Gest and Dream are www.it-ebooks.info 422 Advanced methods for missing data 3 1 2 Dream 4 5 6 CHAPTER 18 0 Figure 18.4 Scatter plot between amount of dream sleep and length of gestation, with information about missing data in the margins 12 0 4 0 100 200 300 400 500 600 Gest reversed. You can see that a negative relationship exists between length of gestation and dream sleep and that dream sleep tends to be higher for mammals that are missing a gestation score. The number of observations missing values on both variables at the same time is printed in blue at the intersection of both margins (bottom left). The VIM package has many graphs that can help you understand the role of missing data in a dataset and is well worth exploring. There are functions to produce scatter plots, box plots, histograms, scatter plot matrices, parallel plots, rug plots, and bubble plots that incorporate information about missing values. 18.3.3 Using correlations to explore missing values Before moving on, there’s one more approach worth noting. You can replace the data in a dataset with indicator variables, coded 1 for missing and 0 for present. The resulting matrix is sometimes called a shadow matrix. Correlating these indicator variables with each other and with the original (observed) variables can help you to see which variables tend to be missing together, as well as relationships between a variable’s “missingness” and the values of the other variables. Consider the following code: x <- as.data.frame(abs(is.na(sleep))) The elements of data frame x are 1 if the corresponding element of sleep is missing and 0 otherwise. You can see this by viewing the first few rows of each: > head(sleep, n=5) BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger 1 6654.000 5712.0 NA NA 3.3 38.6 645 3 5 3 www.it-ebooks.info 423 Exploring missing-values patterns 2 1.000 3 3.385 4 0.920 5 2547.000 6.6 44.5 5.7 4603.0 6.3 NA NA 2.1 2.0 NA NA 1.8 8.3 4.5 12.5 14.0 16.5 NA 3.9 69.0 42 60 25 624 3 1 5 3 1 1 2 5 3 1 3 4 > head(x, n=5) BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger 1 0 0 1 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 3 0 0 1 1 0 0 0 0 0 0 4 0 0 1 1 0 1 0 0 0 0 5 0 0 0 0 0 0 0 0 0 0 The statement y <- x[which(apply(x,2,sum)>0)] extracts the variables that have some (but not all) missing values, and cor(y) gives you the correlations among these indicator variables: NonD Dream Sleep Span Gest NonD 1.000 0.907 0.486 0.015 -0.142 Dream 0.907 1.000 0.204 0.038 -0.129 Sleep 0.486 0.204 1.000 -0.069 -0.069 Span 0.015 0.038 -0.069 1.000 0.198 Gest -0.142 -0.129 -0.069 0.198 1.000 Here, you can see that Dream and NonD tend to be missing together (r = 0.91). To a lesser extent, Sleep and NonD tend to be missing together (r = 0.49) and Sleep and Dream tend to be missing together (r = 0.20). 
Finally, you can look at the relationship between missing values in a variable and the observed values on other variables: > cor(sleep, y, use="pairwise.complete.obs") NonD Dream Sleep Span Gest BodyWgt 0.227 0.223 0.0017 -0.058 -0.054 BrainWgt 0.179 0.163 0.0079 -0.079 -0.073 NonD NA NA NA -0.043 -0.046 Dream -0.189 NA -0.1890 0.117 0.228 Sleep -0.080 -0.080 NA 0.096 0.040 Span 0.083 0.060 0.0052 NA -0.065 Gest 0.202 0.051 0.1597 -0.175 NA Pred 0.048 -0.068 0.2025 0.023 -0.201 Exp 0.245 0.127 0.2608 -0.193 -0.193 Danger 0.065 -0.067 0.2089 -0.067 -0.204 Warning message: In cor(sleep, y, use = "pairwise.complete.obs") : the standard deviation is zero In this correlation matrix, the rows are observed variables, and the columns are indicator variables representing missingness. You can ignore the warning message and NA values in the correlation matrix; they’re artifacts of our approach. www.it-ebooks.info 424 CHAPTER 18 Advanced methods for missing data From the first column of the correlation matrix, you can see that nondreaming sleep scores are more likely to be missing for mammals with higher body weight (r = 0.227), gestation period (r = 0.202), and sleeping exposure (r = 0.245). Other columns are read in a similar fashion. None of the correlations in this table are particularly large or striking, which suggests that the data deviate minimally from MCAR and may be MAR. Note that you can never rule out the possibility that the data are NMAR, because you don’t know what the values would have been for data that are missing. For example, you don’t know if there’s a relationship between the amount of dreaming a mammal engages in and the probability of a value being missing on this variable. In the absence of strong external evidence to the contrary, we typically assume that data are either MCAR or MAR. 18.4 Understanding the sources and impact of missing data You can identify the amount, distribution, and pattern of missing data in order to evaluate the potential mechanisms leading to the missing data and the impact of the missing data on your ability to answer substantive questions. In particular, you want to answer the following questions: ■ ■ ■ ■ What percentage of the data is missing? Are the missing data concentrated in a few variables or widely distributed? Do the missing values appear to be random? Does the covariation of missing data with each other or with observed data suggest a possible mechanism that’s producing the missing values? Answers to these questions help determine which statistical methods are most appropriate for analyzing your data. For example, if the missing data are concentrated in a few relatively unimportant variables, you may be able to delete these variables and continue your analyses normally. If a small amount of data (say, less than 10%) is randomly distributed throughout the dataset (MCAR), you may be able to limit your analyses to cases with complete data and still get reliable and valid results. If you can assume that the data are either MCAR or MAR, you may be able to apply multiple imputation methods to arrive at valid conclusions. If the data are NMAR, you can turn to specialized methods, collect new data, or go into an easier and more rewarding profession. Here are some examples: ■ ■ In a recent survey employing paper questionnaires, I found that several items tended to be missing together. 
It became apparent that these items clustered together because participants didn’t realize that the third page of the questionnaire had a reverse side—which contained the items. In this case, the data could be considered MCAR. In another study, an education variable was frequently missing in a global survey of leadership styles. Investigation revealed that European participants were more likely to leave this item blank. It turned out that the categories didn’t www.it-ebooks.info Rational approaches for dealing with incomplete data ■ 425 make sense for participants in certain countries. In this case, the data were most likely MAR. Finally, I was involved in a study of depression in which older patients were more likely to omit items describing depressed mood when compared with younger patients. Interviews revealed that older patients were loath to admit to such symptoms because doing so violated their values about keeping a “stiff upper lip.” Unfortunately, it was also determined that severely depressed patients were more likely to omit these items due to a sense of hopelessness and difficulties with concentration. In this case, the data had to be considered NMAR. As you can see, identifying patterns is only the first step. You need to bring your understanding of the research subject matter and the data collection process to bear in order to determine the source of the missing values. Now that we’ve considered the source and impact of missing data, let’s see how standard statistical approaches can be altered to accommodate them. We’ll focus on three popular approaches: a rational approach for recovering data, a traditional approach that involves deleting missing data, and a modern approach that uses simulation. Along the way, we’ll briefly look at methods for specialized situations and methods that have become obsolete and should be retired. The goal will remain constant: to answer, as accurately as possible, the substantive questions that led you to collect the data, given the absence of complete information. 18.5 Rational approaches for dealing with incomplete data In a rational approach, you use mathematical or logical relationships among variables to attempt to fill in or recover missing values. A few examples will help clarify this approach. In the sleep dataset, the variable Sleep is the sum of the Dream and NonD variables. If you know a mammal’s scores on any two, you can derive the third. Thus, if some observations were missing only one of the three variables, you could recover the missing information through addition or subtraction. As a second example, consider research that focuses on work/life balance differences between generational cohorts (for example, Silents, Early Boomers, Late Boomers, Xers, Millennials), where cohorts are defined by their birth year. Participants are asked both their date of birth and their age. If date of birth is missing, you can recover their birth year (and therefore their generational cohort) by knowing their age and the date they completed the survey. An example that uses logical relationships to recover missing data comes from a set of leadership studies in which participants were asked if they were a manager (yes/ no) and the number of their direct reports (integer). If they left the manager question blank but indicated that they had one or more direct reports, it would be reasonable to infer that they were a manager. As a final example, I frequently engage in gender research that compares the leadership styles and effectiveness of men and women. 
Participants complete surveys that www.it-ebooks.info 426 CHAPTER 18 Advanced methods for missing data include their name (first and last), their gender, and a detailed assessment of their leadership approach and impact. If participants leave the gender question blank, I have to impute the value in order to include them in the research. In one recent study of 66,000 managers, 11,000 (17%) were missing a value for gender. To remedy the situation, I employed the following rational process. First, I crosstabulated first name with gender. Some first names were associated with males, some with females, and some with both. For example, “William” appeared 417 times and was always a male. Conversely, the name “Chris” appeared 237 times but was sometimes a male (86%) and sometimes a female (14%). If a first name appeared more than 20 times in the dataset and was always associated with males or with females (but never both), I assumed that the name represented a single gender. I used this assumption to create a gender-lookup table for gender-specific first names. Using this lookup table for participants with missing gender values, I was able to recover 7,000 cases (63% of the missing responses). A rational approach typically requires creativity and thoughtfulness, along with a degree of data-management skill. Data recovery may be exact (as in the sleep example) or approximate (as in the gender example). In the next section, we’ll explore an approach that creates complete datasets by removing observations. 18.6 Complete-case analysis (listwise deletion) In complete-case analysis, only observations containing valid data values on every variable are retained for further analysis. Practically, this involves deleting any row with one or more missing values, and is also known as listwise, or case-wise, deletion. Most popular statistical packages employ listwise deletion as the default approach for handling missing data. In fact, it’s so common that many analysts carrying out analyses like regression or ANOVA may not even realize that there’s a “missing-values problem” to be dealt with! The function complete.cases() can be used to save the cases (rows) of a matrix or data frame without missing data: newdata <- mydata[complete.cases(mydata),] The same result can be accomplished with the na.omit function: newdata <- na.omit(mydata) In both statements, any rows that are missing data are deleted from mydata before the results are saved to newdata. Suppose you’re interested in the correlations among the variables in the sleep study. Applying listwise deletion, you’d delete all mammals with missing data prior to calculating the correlations: > options(digits=1) > cor(na.omit(sleep)) BodyWgt BrainWgt NonD Dream Sleep BodyWgt 1.00 0.96 -0.4 -0.07 -0.3 BrainWgt 0.96 1.00 -0.4 -0.07 -0.3 Span 0.47 0.63 www.it-ebooks.info Gest Pred 0.71 0.10 0.73 -0.02 Exp Danger 0.4 0.26 0.3 0.15 427 Complete-case analysis (listwise deletion) NonD Dream Sleep Span Gest Pred Exp Danger -0.39 -0.07 -0.34 0.47 0.71 0.10 0.41 0.26 -0.39 -0.07 -0.34 0.63 0.73 -0.02 0.32 0.15 1.0 0.5 1.0 -0.4 -0.6 -0.4 -0.6 -0.5 0.52 1.00 0.72 -0.27 -0.41 -0.40 -0.50 -0.57 1.0 0.7 1.0 -0.4 -0.6 -0.4 -0.6 -0.6 -0.37 -0.61 -0.35 -0.6 -0.27 -0.41 -0.40 -0.5 -0.38 -0.61 -0.40 -0.6 1.00 0.65 -0.17 0.3 0.65 1.00 0.09 0.6 -0.17 0.09 1.00 0.6 0.32 0.57 0.63 1.0 0.01 0.31 0.93 0.8 -0.53 -0.57 -0.60 0.01 0.31 0.93 0.79 1.00 The correlations in this table are based solely on the 42 mammals that have complete data on all variables. 
(Note that the statement cor(sleep, use="complete.obs") would have produced the same results.) If you wanted to study the impact of life span and length of gestation on the amount of dream sleep, you could employ linear regression with listwise deletion: > fit <- lm(Dream ~ Span + Gest, data=na.omit(sleep)) > summary(fit) Call: lm(formula = Dream ~ Span + Gest, data = na.omit(sleep)) Residuals: Min 1Q Median -2.333 -0.915 -0.221 3Q 0.382 Max 4.183 Coefficients: Estimate Std. Error t (Intercept) 2.480122 0.298476 Span -0.000472 0.013130 Gest -0.004394 0.002081 --Signif. codes: 0 ‘***’ 0.001 ‘**’ value Pr(>|t|) 8.31 3.7e-10 *** -0.04 0.971 -2.11 0.041 * 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1 Residual standard error: 1 on 39 degrees of freedom Multiple R-squared: 0.167, Adjusted R-squared: 0.125 F-statistic: 3.92 on 2 and 39 DF, p-value: 0.0282 Here you see that mammals with shorter gestation periods have more dream sleep (controlling for life span) and that life span is unrelated to dream sleep when controlling for gestation period. The analysis is based on 42 cases with complete data. In the previous example, what would have happened if data=na.omit(sleep) had been replaced with data=sleep? Like many R functions, lm() uses a limited definition of listwise deletion. Cases with any missing data on the variables fitted by the function (Dream, Span, and Gest in this case) would have been deleted. The analysis would have been based on 44 cases. Listwise deletion assumes that the data are MCAR (that is, the complete observations are a random subsample of the full dataset). In the current example, we’ve assumed that the 42 mammals used are a random subsample of the 62 mammals collected. To the degree that the MCAR assumption is violated, the resulting regression parameters will be biased. Deleting all observations with missing data can also reduce statistical power by reducing the available sample size. In the current example, listwise www.it-ebooks.info 428 CHAPTER 18 Advanced methods for missing data deletion reduced the sample size by 32%. Next, we’ll consider an approach that employs the entire dataset (including cases with missing data). 18.7 Multiple imputation Multiple imputation (MI) provides an approach to missing values that’s based on repeated simulations. MI is frequently the method of choice for complex missing-values problems. In MI, a set of complete datasets (typically 3 to 10) is generated from an existing dataset that’s missing values. Monte Carlo methods are used to fill in the missing data in each of the simulated datasets. Standard statistical methods are applied to each of the simulated datasets, and the outcomes are combined to provide estimated results and confidence intervals that take into account the uncertainty introduced by the missing values. Good implementations are available in R through the Amelia, mice, and mi packages. In this section, we’ll focus with() on the approach provided by mice() pool() the mice (multivariate imputation by chained equations) package. To understand how Final result the mice package operates, Data frame consider the diagram in figure 18.5. Imputed datasets Analysis results The function mice() starts with a data frame that’s missFigure 18.5 Steps in applying multiple imputation to missing data via the mice approach ing data and returns an object containing several complete datasets (the default is five). Each complete dataset is created by imputing values for the missing data in the original data frame. 
There’s a random component to the imputations, so each complete dataset is slightly different. The with() function is then used to apply a statistical model (for example, a linear or generalized linear model) to each complete dataset in turn. Finally, the pool() function combines the results of these separate analyses into a single set of results. The standard errors and p-values in this final model correctly reflect the uncertainty produced by both the missing values and the multiple imputations. How does the mice() function impute missing values? Missing values are imputed by Gibbs sampling. By default, each variable with missing values is predicted from all other variables in the dataset. These prediction equations are used to impute plausible values for the missing data. The process iterates until convergence over the missing values is achieved. For each variable, you can choose the form of the prediction model (called an elementary imputation method) and the variables entered into it. www.it-ebooks.info Multiple imputation 429 By default, predictive mean matching is used to replace missing data on continuous variables, whereas logistic or polytomous logistic regression is used for target variables that are dichotomous (factors with two levels) or polytomous (factors with more than two levels), respectively. Other elementary imputation methods include Bayesian linear regression, discriminant function analysis, two-level normal imputation, and random sampling from observed values. You can supply your own methods as well. An analysis based on the mice package typically conforms to the following structure library(mice) imp <- mice(data, m) fit <- with(imp, analysis) pooled <- pool(fit) summary(pooled) where ■ ■ ■ ■ ■ data is a matrix or data frame containing missing values. imp is a list object containing the m imputed datasets, along with information on how the imputations were accomplished. By default, m = 5. analysis is a formula object specifying the statistical analysis to be applied to each of the m imputed datasets. Examples include lm() for linear regression models, glm() for generalized linear models, gam() for generalized additive models, and nbrm() for negative binomial models. Formulas within the parentheses give the response variables on the left of the ~ and the predictor variables (separated by + signs) on the right. fit is a list object containing the results of the m separate statistical analyses. pooled is a list object containing the averaged results of these m statistical analyses. Let’s apply multiple imputation to the sleep dataset. You’ll repeat the analysis from section 18.6, but this time use all 62 mammals. Set the seed value for the random number generator to 1,234 so that your results will match the following: > library(mice) > data(sleep, package="VIM") > imp <- mice(sleep, seed=1234) [...output deleted to save space...] > fit <- with(imp, lm(Dream ~ Span + Gest)) > pooled <- pool(fit) > summary(pooled) est se t df Pr(>|t|) lo 95 (Intercept) 2.58858 0.27552 9.395 52.1 8.34e-13 2.03576 Span -0.00276 0.01295 -0.213 52.9 8.32e-01 -0.02874 Gest -0.00421 0.00157 -2.671 55.6 9.91e-03 -0.00736 hi 95 nmis fmi (Intercept) 3.14141 NA 0.0870 Span 0.02322 4 0.0806 Gest -0.00105 4 0.0537 www.it-ebooks.info 430 CHAPTER 18 Advanced methods for missing data Here, you see that the regression coefficient for Span isn’t significant (p ≅ 0.08), and the coefficient for Gest is significant at the p < 0.01 level. 
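It can help to know what pool() is doing behind the scenes. For each coefficient, the pooled estimate is the average of the m per-imputation estimates, and its variance combines the average within-imputation variance with the between-imputation variance of the estimates (the combining rules usually attributed to Rubin). The short sketch below walks through that arithmetic for a single coefficient; the numbers are made up for illustration and aren't taken from the sleep analysis:

# Rubin's combining rules for one coefficient, using hypothetical inputs
est <- c(-0.0040, -0.0043, -0.0041, -0.0044, -0.0042)   # estimate from each of m = 5 fits
se  <- c( 0.0016,  0.0015,  0.0016,  0.0015,  0.0016)   # its standard error in each fit

m     <- length(est)
qbar  <- mean(est)                # pooled estimate: the average of the m estimates
ubar  <- mean(se^2)               # average within-imputation variance
b     <- var(est)                 # between-imputation variance
total <- ubar + (1 + 1/m) * b     # total variance of the pooled estimate
c(estimate = qbar, se = sqrt(total))

The (1 + 1/m) * b term inflates the variance beyond the average within-imputation variance, which is how the pooled standard errors come to reflect the extra uncertainty introduced by the missing values.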
If you compare these results with those produced by a complete case analysis (section 18.6), you see that you’d come to the same conclusions in this instance. Length of gestation has a (statistically) significant, negative relationship with amount of dream sleep, controlling for life span. Although the complete-case analysis was based on the 42 mammals with complete data, the current analysis is based on information gathered from the full set of 62 mammals. By the way, the fmi column reports the fraction of missing information (that is, the proportion of variability that is attributable to the uncertainty introduced by the missing data). You can access more information about the imputation by examining the objects created in the analysis. For example, let’s view a summary of the imp object: > imp Multiply imputed data set Call: mice(data = sleep, seed = 1234) Number of multiple imputations: 5 Missing cells per column: BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred 0 0 14 12 4 4 4 0 Exp Danger 0 0 Imputation methods: BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred "" "" "pmm" "pmm" "pmm" "pmm" "pmm" "" Exp Danger "" "" VisitSequence: NonD Dream Sleep Span Gest 3 4 5 6 7 PredictorMatrix: BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger BodyWgt 0 0 0 0 0 0 0 0 0 0 BrainWgt 0 0 0 0 0 0 0 0 0 0 NonD 1 1 0 1 1 1 1 1 1 1 Dream 1 1 1 0 1 1 1 1 1 1 Sleep 1 1 1 1 0 1 1 1 1 1 Span 1 1 1 1 1 0 1 1 1 1 Gest 1 1 1 1 1 1 0 1 1 1 Pred 0 0 0 0 0 0 0 0 0 0 Exp 0 0 0 0 0 0 0 0 0 0 Danger 0 0 0 0 0 0 0 0 0 0 Random generator seed value: 1234 From the resulting output, you can see that five synthetic datasets were created and that the predictive mean matching (pmm) method was used for each variable with missing data. No imputation ("") was needed for BodyWgt, BrainWgt, Pred, Exp, or Danger, because they had no missing values. The visit sequence tells you that variables www.it-ebooks.info Multiple imputation 431 were imputed from right to left, starting with NonD and ending with Gest. Finally, the predictor matrix indicates that each variable with missing data was imputed using all the other variables in the dataset. (In this matrix, the rows represent the variables being imputed, the columns represent the variables used for the imputation, and 1s/0s indicate used/not used). You can view the imputations by looking at subcomponents of the imp object. For example, > imp$imp$Dream 1 2 3 4 1 0.5 0.5 0.5 0.5 3 2.3 2.4 1.9 1.5 4 1.2 1.3 5.6 2.3 14 0.6 1.0 0.0 0.3 24 1.2 1.0 5.6 1.0 26 1.9 6.6 0.9 2.2 30 1.0 1.2 2.6 2.3 31 5.6 0.5 1.2 0.5 47 0.7 0.6 1.4 1.8 53 0.7 0.5 0.7 0.5 55 0.5 2.4 0.7 2.6 62 1.9 1.4 3.6 5.6 5 0.0 2.4 1.3 0.5 6.6 2.0 1.4 1.4 3.6 0.5 2.6 6.6 displays the 5 imputed values for each of the 12 mammals with missing data on the Dream variable. A review of these matrices helps you determine whether the imputed values are reasonable. A negative value for length of sleep might give you pause (or nightmares). You can view each of the m imputed datasets via the complete() function. The format is complete(imp, action=#) where # specifies one of the m synthetically complete datasets. For example, > dataset3 <- complete(imp, action=3) > dataset3 BodyWgt BrainWgt NonD Dream Sleep Span Gest Pred Exp Danger 1 6654.00 5712.0 2.1 0.5 3.3 38.6 645 3 5 3 2 1.00 6.6 6.3 2.0 8.3 4.5 42 3 1 3 3 3.38 44.5 10.6 1.9 12.5 14.0 60 1 1 1 4 0.92 5.7 11.0 5.6 16.5 4.7 25 5 2 3 5 2547.00 4603.0 2.1 1.8 3.9 69.0 624 3 5 4 6 10.55 179.5 9.1 0.7 9.8 27.0 180 4 4 4 [...output deleted to save space...] 
displays the third (out of five) complete dataset created by the multiple imputation process. Due to space limitations, we’ve only briefly considered the MI implementation provided in the mice package. The mi and Amelia packages also contain valuable www.it-ebooks.info 432 CHAPTER 18 Advanced methods for missing data approaches. If you’re interested in the multiple imputation approach to missing data, I recommend the following resources: ■ ■ ■ The multiple imputation FAQ page (www.stat.psu.edu/~jls/mifaq.html) Articles by Van Buuren and Groothuis-Oudshoorn (2010) and Yu-Sung, Gelman, Hill, and Yajima (2010) Amelia II: A Program for Missing Data (http://gking.harvard.edu/amelia) Each can help to reinforce and extend your understanding of this important, but underutilized, methodology. 18.8 Other approaches to missing data R supports several other approaches for dealing with missing data. Although not as broadly applicable as the methods described thus far, the packages described in table 18.2 offer functions that can be useful in specialized circumstances. Table 18.2 Specialized methods for dealing with missing data Package Description mvnmle Maximum-likelihood estimation for multivariate normal data with missing values cat Analysis of categorical-variable datasets with missing values arrayImpute, arrayMissPattern, and SeqKnn Useful functions for dealing with missing microarray data longitudinalData Utility functions, including interpolation routines for imputing missing time-series values kmi Kaplan-Meier multiple imputation for survival analysis with missing data mix Multiple imputation for mixed categorical and continuous data pan Multiple imputation for multivariate panel or clustered data Finally, there are two methods for dealing with missing data that are still in use but should be considered obsolete: pairwise deletion and simple imputation. 18.8.1 Pairwise deletion Pairwise deletion is often considered an alternative to listwise deletion when working with datasets that are missing values. In pairwise deletion, observations are deleted only if they’re missing data for the variables involved in a specific analysis. Consider the following code: > cor(sleep, use="pairwise.complete.obs") BodyWgt BrainWgt NonD Dream Sleep BodyWgt 1.00 0.93 -0.4 -0.1 -0.3 BrainWgt 0.93 1.00 -0.4 -0.1 -0.4 Span Gest 0.30 0.7 0.51 0.7 www.it-ebooks.info Pred 0.06 0.03 Exp Danger 0.3 0.13 0.4 0.15 433 Summary NonD Dream Sleep Span Gest Pred Exp Danger -0.38 -0.11 -0.31 0.30 0.65 0.06 0.34 0.13 -0.37 -0.11 -0.36 0.51 0.75 0.03 0.37 0.15 1.0 0.5 1.0 -0.4 -0.6 -0.3 -0.5 -0.5 0.5 1.0 0.7 -0.3 -0.5 -0.4 -0.5 -0.6 1.0 0.7 1.0 -0.4 -0.6 -0.4 -0.6 -0.6 -0.38 -0.6 -0.32 -0.5 -0.30 -0.5 -0.45 -0.5 -0.41 -0.6 -0.40 -0.6 1.00 0.6 -0.10 0.4 0.61 1.0 0.20 0.6 -0.10 0.2 1.00 0.6 0.36 0.6 0.62 1.0 0.06 0.4 0.92 0.8 -0.48 -0.58 -0.59 0.06 0.38 0.92 0.79 1.00 In this example, correlations between any two variables use all available observations for those two variables (ignoring the other variables). The correlation between Kaplan-Meier multiple is based on all 62 mammals (the number of mammals with data on both variables). The correlation between Kaplan-Meier multiple is based on 42 mammals, and the correlation between Kaplan-Meier multiple is based on 46 mammals. Although pairwise deletion appears to use all available data, in fact each calculation is based on a different subset of the data. This can lead to distorted and difficultto-interpret results. I recommend staying away from this approach. 
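One way to see just how uneven those subsets are is to count, for each pair of variables, the number of cases observed on both. This is a small diagnostic sketch (not part of the original analysis) that again uses the sleep data from the VIM package:

data(sleep, package="VIM")
observed   <- 1 * !is.na(sleep)     # 1 = value present, 0 = value missing
pairwise_n <- crossprod(observed)   # t(observed) %*% observed: joint counts for every pair
pairwise_n["BodyWgt", "Span"]       # 58 mammals; only Span is missing here (4 values)
pairwise_n["NonD", "Dream"]         # well under 62, since NonD has 14 and Dream has 12 missing values

Each entry of pairwise_n is the sample size behind the corresponding entry in the pairwise correlation matrix, which makes it easy to see that the correlations aren't all estimated from the same animals.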
18.8.2 Simple (nonstochastic) imputation
In simple imputation, the missing values in a variable are replaced with a single value (for example, the mean, median, or mode). Using mean substitution, you could replace the missing values on Dream with the value 1.97 and the missing values on NonD with the value 8.67 (the means of Dream and NonD, respectively). Note that the substitution is nonstochastic, meaning that random error isn’t introduced (unlike with multiple imputation).
An advantage of simple imputation is that it solves the missing-values problem without reducing the sample size available for analysis. Simple imputation is, well, simple, but it produces biased results for data that isn’t MCAR. If moderate to large amounts of data are missing, simple imputation is likely to underestimate standard errors, distort correlations among variables, and produce incorrect p-values in statistical tests. As with pairwise deletion, I recommend avoiding this approach for most missing-data problems.
18.9 Summary
Most statistical methods assume that the input data are complete and don’t include missing values (such as NA, NaN, or Inf). But most datasets in real-world settings contain missing values. Therefore, you must either delete the missing values or replace them with reasonable substitute values before continuing with the desired analyses. Statistical packages often provide default methods for handling missing data, but these approaches may not be optimal. It’s therefore important that you understand the various approaches available and the ramifications of using each.
In this chapter, we examined methods for identifying missing values and exploring patterns of missing data. The goal was to understand the mechanisms that led to the missing data and their possible impact on subsequent analyses. We then reviewed three popular methods for dealing with missing data: a rational approach, listwise deletion, and multiple imputation. Rational approaches can be used to recover missing values when there are redundancies in the data or when external information can be brought to bear on the problem. The listwise deletion of missing data is useful if the data are MCAR and the resulting reduction in sample size doesn’t seriously impact the power of statistical tests. Multiple imputation is rapidly becoming the method of choice for complex missing-data problems when you can assume that the data are MCAR or MAR. Although many analysts may be unfamiliar with multiple imputation strategies, user-contributed packages (mice, mi, and Amelia) make them readily accessible. I believe we’ll see rapid growth in their use over the next few years. I ended the chapter by briefly mentioning R packages that provide specialized approaches for dealing with missing data and by singling out two general approaches (pairwise deletion and simple imputation) that should be avoided. In the next chapter, we’ll explore advanced graphical methods, using the ggplot2 package to create innovative multivariate plots.
Part 5 Expanding your skills
In this final section, we consider advanced topics that will enhance your skills as an R programmer. Chapter 19 completes our discussion of graphics with a presentation of one of R’s most powerful approaches to visualizing data.
Based on a comprehensive grammar of graphics, the ggplot2 package provides a set of tools that allow you visualize complex data sets in new and creative ways. You’ll be able to easily create attractive and informative graphs that would be difficult or impossible to create using R’s base graphics system. Chapter 20 provides a review of the R language at a deeper level. This includes a discussion of R’s object-oriented programming features, working with environments, and advanced function writing. Tips for writing efficient code and debugging programs are also described. Although chapter 20 is more technical than the other chapters in this book, it provides extensive practical advice for developing more useful programs. Throughout this book, you’ve used packages to get work done. In chapter 21, you’ll learn to write your own packages. This can help you organize and document your work, create more complex and comprehensive software solutions, and share your creations with others. Sharing a useful package of functions with others can also be a wonderful way to give back to the R community (while spreading your fame far and wide). Chapter 22 is all about report writing. R provides compressive facilities for generating attractive reports dynamically from data. In this last chapter, you’ll learn how to create reports as web pages, PDF documents, and word processor documents (including Microsoft Word documents). After completing part 5, you’ll have a much deeper appreciation of how R works and the tools it offers for creating more sophisticated graphics, software, and reports. www.it-ebooks.info 436 CHAPTER www.it-ebooks.info Advanced graphics with ggplot2 This chapter covers ■ ■ An introduction to the ggplot2 package Using shape, color, and size to visualize multivariate data ■ Comparing groups with faceted graphs ■ Customizing ggplot2 plots In previous chapters, you created a wide variety of general and specialized graphs (and had lots of fun in the process). Most were produced using R’s base graphics system. Given the diversity of methods available in R, it may not surprise you to learn that four separate and complete graphics systems are currently available. In addition to base graphics, graphics systems are provided by the grid, lattice, and ggplot2 packages. Each is designed to expand on the capabilities of, and correct for deficiencies in, R’s base graphics system. The grid graphics system provides low-level access to graphic primitives, giving programmers a great deal of flexibility in the creation of graphic output. The lattice package provides an intuitive approach for examining multivariate 437 www.it-ebooks.info 438 CHAPTER 19 Advanced graphics with ggplot2 relationships through conditional one-, two-, or three-dimensional graphs called trellis graphs. The ggplot2 package provides a method of creating innovative graphs based on a comprehensive graphical “grammar.” In this chapter, we’ll start with a brief overview of the four graphic systems. Then we’ll focus on graphs that can be generated with the ggplot2 package. ggplot2 greatly expands the range and quality of the graphs you can produce in R. It allows you to visualize datasets with many variables using a comprehensive and consistent syntax, and easily generate graphs that would be difficult to create using base R graphics. 19.1 The four graphics systems in R As stated earlier, there are currently four graphical systems available in R. The base graphics system, written by Ross Ihaka, is included in every R installation. 
Most of the graphs produced in previous chapters rely on base graphics functions. The grid graphics system, written by Paul Murrell (2011), is implemented through the grid package. grid graphics offer a lower-level alternative to the standard graphics system. The user can create arbitrary rectangular regions on graphics devices, define coordinate systems for each region, and use a rich set of drawing primitives to control the arrangement and appearance of graphic elements. This flexibility makes grid graphics a valuable tool for software developers. But the grid package doesn’t provide functions for producing statistical graphics or complete plots. Because of this, the package is rarely used directly by data analysts and won’t be discussed further. If you’re interested in learning more about grid, visit Dr. Murrell’s Grid website (http://mng.bz/C86p) for details. The lattice package, written by Deepayan Sarkar (2008), implements trellis graphs, as outlined by Cleveland (1985, 1993). Basically, trellis graphs display the distribution of a variable or the relationship between variables, separately for each level of one or more other variables. Built using the grid package, the lattice package has grown beyond Cleveland’s original approach to visualizing multivariate data and now provides a comprehensive alternative system for creating statistical graphics in R. A number of packages described in this book (effects, flexclust, Hmisc, mice, and odfWeave) use functions in the lattice package to produce graphs. Finally, the ggplot2 package, written by Hadley Wickham (2009a), provides a system for creating graphs based on the grammar of graphics described by Wilkinson (2005) and expanded by Wickham (2009b). The intention of the ggplot2 package is to provide a comprehensive, grammar-based system for generating graphs in a unified and coherent manner, allowing users to create new and innovative data visualizations. The power of this approach has led ggplot2 to become an important tool for visualizing data using R. Access to the four systems differs, as outlined in table 19.1. Base graphic functions are automatically available. To access grid and lattice functions, you must load the appropriate package explicitly (for example, library(lattice)). To access ggplot2 functions, you have to download and install the package (install.packages ("ggplot2")) before first use and then load it (library(ggplot2)). www.it-ebooks.info An introduction to the ggplot2 package Table 19.1 System 439 Access to graphic systems Included in base installation? Must be explicitly loaded? base Yes No grid Yes Yes lattice Yes Yes ggplot2 No Yes The lattice and ggplot2 packages overlap in functionality but approach the creation of graphs differently. Analysts tend to rely on one package or the other when plotting multivariate data. Given its power and popularity, the remainder of this chapter will focus on ggplot2. If you would like to learn more about the lattice package, I’ve created a supplementary chapter that you can download (www.statmethods .net/RiA/lattice.pdf) or from the publisher’s website at www.manning.com/ RinActionSecondEdition. This chapter uses three datasets to illustrate the use of ggplot2. The first is a data frame called singer that comes from the lattice package; it contains the heights and voice parts of singers in the New York Choral Society. The second is the mtcars data frame that you’ve used throughout this book; it contains automotive details on 32 automobiles. 
The final data frame is called Salaries and is included with the car package described in chapter 8. Salaries contains salary information for university professors and was collected to explore gender discrepancies in income. Together, these datasets offer a variety of visualization challenges. Before continuing, be sure the ggplot2 and car packages are installed. You’ll also want to install the gridExtra package. It allows you to place multiple ggplot2 graphs into a single plot (see section 19.7.4). 19.2 An introduction to the ggplot2 package The ggplot2 package implements a system for creating graphics in R based on a comprehensive and coherent grammar. This provides a consistency to graph creation that’s often lacking in R and allows you to create graph types that are innovative and novel. In this section, we’ll start with an overview of ggplot2 grammar; subsequent sections dive into the details. In ggplot2, plots are created by chaining together functions using the plus (+) sign. Each function modifies the plot created up to that point. It’s easiest to see with an example (the graph is given in figure 19.1): library(ggplot2) ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point() + labs(title="Automobile Data", x="Weight", y="Miles Per Gallon") www.it-ebooks.info 440 CHAPTER 19 Advanced graphics with ggplot2 Automobile Data 35 Miles Per Gallon 30 25 20 15 Figure 19.1 Scatterplot of automobile weight by mileage 10 2 3 4 5 Weight Let’s break down how the plot was produced. The ggplot() function initializes the plot and specifies the data source (mtcars) and variables (wt, mpg) to be used. The options in the aes() function specify what role each variable will play. (aes stands for aesthetics, or how information is represented visually.) Here, the wt values are mapped to distances along the x-axis, and mpg values are mapped to distances along the y-axis. The ggplot() function sets up the graph but produces no visual output on its own. Geometric objects (called geoms for short), which include points, lines, bars, box plots, and shaded regions, are added to the graph using one or more geom functions. In this example, the geom_point() function draws points on the graph, creating a scatterplot. The labs() function is optional and adds annotations (axis labels and a title). There are many functions in ggplot2, and most can include optional parameters. Expanding on the previous example, the code library(ggplot2) ggplot(data=mtcars, aes(x=wt, y=mpg)) + geom_point(pch=17, color="blue", size=2) + geom_smooth(method="lm", color="red", linetype=2) + labs(title="Automobile Data", x="Weight", y="Miles Per Gallon") produces the graph in figure 19.2. Options to geom_point() set the point shape to triangles (pch=17), double the points’ size (size=2), and render them in blue (color="blue"). The geom_smooth() function adds a “smoothed” line. Here a linear fit is requested (method="lm") and a red www.it-ebooks.info 441 An introduction to the ggplot2 package Automobile Data Miles Per Gallon 30 20 Figure 19.2 Scatterplot of automobile weight by gas mileage, with a superimposed line of best fit and 95% confidence region 10 2 3 4 5 Weight (color="red") dashed (linetype=2) line of size 1 (size=1) is produced. By default, the line includes 95% confidence intervals (the darker band). We’ll go into more detail about modeling relationships with linear and nonlinear fits in section 19.6. The ggplot2 package provides methods for grouping and faceting. Grouping displays two or more groups of observations in a single plot. 
Groups are usually differentiated by color, shape, or shading. Faceting displays groups of observations in separate, side-by-side plots. The ggplot2 package uses factors when defining groups or facets. You can see both grouping and faceting with the mtcars data frame. First, transform the am, vs, and cyl variables into factors: mtcars$am <- factor(mtcars$am, levels=c(0,1), labels=c("Automatic", "Manual")) mtcars$vs <- factor(mtcars$vs, levels=c(0,1), labels=c("V-Engine", "Straight Engine")) mtcars$cyl <- factor(mtcars$cyl) Next, generate a plot using the following code: library(ggplot2) ggplot(data=mtcars, aes(x=hp, y=mpg, shape=cyl, color=cyl)) + geom_point(size=3) + facet_grid(am~vs) + labs(title="Automobile Data by Engine Type", x="Horsepower", y="Miles Per Gallon") www.it-ebooks.info 442 CHAPTER 19 Advanced graphics with ggplot2 Automobile Data by Engine Type V−Engine 35 Straight Engine 30 Automatic 25 Miles Per Gallon 20 15 cyl 4 10 35 6 8 30 25 Manual 20 15 10 100 200 300 100 200 300 Horsepower Figure 19.3 A scatterplot showing the relationship between horsepower and gas mileage separately for transmission and engine type. The number of cylinders in each automobile engine is represented by both shape and color. The resulting graph (see figure 19.3) contains separate scatterplots for each combination of transmission type (automatic vs. manual) and engine arrangement (V-engine vs. straight engine). The color and shape of each point indicates the number of engine cylinders in that car. In this case, am and vs are the faceting variables, and cyl is the grouping variable. The ggplot2 package is powerful and can be used to create a wide array of informative graphs. It’s popular among seasoned R analysts and programmers; and, based on postings in R blogs and discussion groups, that popularity is growing. Unfortunately, with power comes complexity. Unlike other R packages, ggplot2 can be thought of as a comprehensive graphical programming language in its own right. It has its own learning curve, and at times that curve can be steep. Hang in there—the effort is worth it. Luckily, there are function defaults and language simplifications designed to make your introduction to this package easier. With practice, www.it-ebooks.info 443 Specifying the plot type with geoms you’ll be able to create a wide variety of interesting and useful plots with just a few lines of code. Let’s start with a description of geom functions and the type of graphs they can create. Then we’ll look at the aes() function in more detail and how you can use it to group data. Next, we’ll consider faceting and the creation of trellis graphs. Finally, we’ll look at ways to tweak the appearance of ggplot2 graphs, including modifying axes and legends, changing color schemes, and adding annotations. The chapter will end with pointers to resources that can help you master the ggplot2 approach more fully. 19.3 Specifying the plot type with geoms Whereas the ggplot() function specifies the data source and variables to be plotted, the geom functions specify how these variables are to be visually represented (using points, bars, lines, and shaded regions). Currently, 37 geoms are available. Table 19.2 lists the more common ones, along with frequently used options. The options are described more fully in table 19.3. 
Table 19.2 Geom functions Function Adds Options geom_bar() Bar chart color, fill, alpha geom_boxplot() Box plot color, fill, alpha, notch, width geom_density() Density plot color, fill, alpha, linetype geom_histogram() Histogram color, fill, alpha, linetype, binwidth geom_hline() Horizontal lines color, alpha, linetype, size geom_jitter() Jittered points color, size, alpha, shape geom_line() Line graph colorvalpha, linetype, size geom_point() Scatterplot color, alpha, shape, size geom_rug() Rug plot color, side geom_smooth() Fitted line method, formula, color, fill, linetype, size geom_text() Text annotations Many; see the help for this function geom_violin() Violin plot color, fill, alpha, linetype geom_vline() Vertical lines color, alpha, linetype, size Most of the graphs described in this book can be created using the geoms in table 19.2. For example, the code data(singer, package="lattice") ggplot(singer, aes(x=height)) + geom_histogram() www.it-ebooks.info 444 CHAPTER 19 Advanced graphics with ggplot2 30 count 20 10 0 60 65 70 75 height Figure 19.4 Histogram of singer heights produces the histogram in figure 19.4, and ggplot(singer, aes(x=voice.part, y=height)) + geom_boxplot() produces the box plot in figure 19.5. From figure 19.5, it appears that basses tend to be taller and sopranos tend to be shorter. Although gender wasn’t measured, it probably accounts for much of the variation you see. 75 height 70 65 60 Bass 2 Bass 1 Tenor 2 Tenor 1 Alto 2 Alto 1 voice.part Figure 19.5 Box plot of singer heights by voice part www.it-ebooks.info Soprano 2 Soprano 1 Specifying the plot type with geoms 445 Note that only the x variable was specified when creating a histogram, but both an x and a y variable were specified for the box plot. The geom_histogram() function defaults to counts on the y-axis when no y variable is specified. See the documentation for each function for details and additional examples. Each geom function has a set of options that can be used to modify its representation. Common options are listed in table 19.3. Table 19.3 Common options for geom functions Option Specifies color Color of points, lines, and borders around filled regions. fill Color of filled areas such as bars and density regions. alpha Transparency of colors, ranging from 0 (fully transparent) to 1 (opaque). linetype Pattern for lines (1 = solid, 2 = dashed, 3 = dotted, 4 = dotdash, 5 = longdash, 6 = twodash). size Point size and line width. shape Point shapes (same as pch, with 0 = open square, 1 = open circle, 2 = open triangle, and so on). See figure 3.4 for examples. position Position of plotted objects such as bars and points. For bars, "dodge" places grouped bar charts side by side, "stacked" vertically stacks grouped bar charts, and "fill" vertically stacks grouped bar charts and standardizes their heights to be equal. For points, "jitter" reduces point overlap. binwidth Bin width for histograms. notch Indicates whether box plots should be notched (TRUE/FALSE). sides Placement of rug plots on the graph ("b" = bottom, "l" = left, "t" = top, "r" = right, "bl" = both bottom and left, and so on). width Width of box plots. You can examine the use of many of these options using the Salaries dataset. The code data(Salaries, package="car") library(ggplot2) ggplot(Salaries, aes(x=rank, y=salary)) + geom_boxplot(fill="cornflowerblue", color="black", notch=TRUE)+ geom_point(position="jitter", color="blue", alpha=.5)+ geom_rug(side="l", color="black") produces the plot in figure 19.6. 
The figure displays notched box plots of salary by academic rank. The actual observations (teachers) are overlaid and given some transparency so they don’t obscure the box plots. They’re also jittered to reduce their overlap. Finally, a rug plot is provided on the left to indicate the general spread of salaries. www.it-ebooks.info 446 CHAPTER 19 Advanced graphics with ggplot2 ● ● ● salary 200000 150000 100000 50000 AsstProf AssocProf Prof rank Figure 19.6 Notched box plots with superimposed points describing the salaries of college professors by rank. A rug plot is provided on the vertical axis. From figure 19.6, you can see that the salaries of assistant, associate, and full professors differ significantly from each other (there is no overlap in the box plot notches). Additionally, the variance in salaries increases with greater rank, with a large range of salaries for full professors. In fact, at least one full professor earns less than assistant professors. There are also three full professors whose salaries are so large as to make them outliers (as indicated by the black dots in the Prof box plot). Having been a full professor earlier in my career, the data suggests to me that I was clearly underpaid. The real power of the ggplot2 package is realized when geoms are combined to form new types of plots. Returning to the singer dataset, the code library(ggplot2) data(singer, package="lattice") ggplot(singer, aes(x=voice.part, y=height)) + geom_violin(fill="lightblue") + geom_boxplot(fill="lightgreen", width=.2) combines box plots with violin plots to create a new type of graph (displayed in figure 19.7). The box plots show the 25th, 50th, and 75th percentile scores for each voice part in the singer dataframe, along with any outliers. The violin plots provide more visual cues as to the distribution of scores over the range of heights for each voice part. www.it-ebooks.info 447 Grouping 75 height 70 65 Figure 19.7 A combined violin and box plot graph of singer heights by voice part 60 Bass 2 Bass 1 Tenor 2 Tenor 1 Alto 2 Alto 1 Soprano 2 Soprano 1 voice.part In the remainder of this chapter, you’ll use geoms to create a wide range of graph types. Let’s start with grouping—the representation of more than one group of observations in a single graph. 19.4 Grouping In order to understand data, it’s often helpful to plot two or more groups of observations on the same graph. In R, the groups are usually defined as the levels of a categorical variable (factor). Grouping is accomplished in ggplot2 graphs by associating one or more grouping variables with visual characteristics such as shape, color, fill, size, and line type. The aes() function in the ggplot() statement assigns variables to roles (visual characteristics of the plot), so this is a natural place to assign grouping variables. Let’s use grouping to explore the Salaries dataset. The dataframe contains information on the salaries of university professors collected during the 2008–2009 academic year. Variables include rank (AsstProf, AssocProf, Prof), sex (Female, Male), yrs.since.phd (years since Ph.D.), yrs.service (years of service), and salary (nine-month salary in dollars). First, you can ask how salaries vary by academic rank. 
The code data(Salaries, package="car") library(ggplot2) ggplot(data=Salaries, aes(x=salary, fill=rank)) + geom_density(alpha=.3) www.it-ebooks.info 448 CHAPTER 19 Advanced graphics with ggplot2 4e−05 3e−05 density rank AsstProf AssocProf Prof 2e−05 1e−05 Figure 19.8 Density plots of university salaries, grouped by academic rank 0e+00 50000 100000 150000 200000 salary plots three density curves in the same graph (one for each level of academic rank) and distinguishes them by fill color. The fills are set to be somewhat transparent (alpha) so that the overlapping curves don’t obscure each other. The colors also combine to improve visualization in join areas. The plot is given is figure 19.8. Note that a legend is produced automatically. In section 19.7.2, you’ll learn how to customize the legend generated for grouped data. Salary increases by rank, but there is significant overlap, with some associate and full professors earning the same as assistant professors. As rank increases, so does the range of salaries. This is especially true for full professors, who have wide variation in their incomes. Placing all three distributions in the same graph facilitates comparisons among the groups. Next, let’s plot the relationship between years since Ph.D. and salary, grouping by sex and rank: ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank, shape=sex)) + geom_point() In the resulting graph (figure 19.9), academic rank is represented by point color (assistant professors in red, associate professors in green, and full professors in blue). Additionally, sex is indicated by point shape (circles are females and triangles are men). If you’re looking at a greyscale image, the color differences can be difficult to see; try running the code yourself. Note that reasonable legends are again produced www.it-ebooks.info 449 Grouping 200000 rank AsstProf salary AssocProf 150000 Prof sex Female Male 100000 Figure 19.9 Scatterplot of years since graduation and salary. Academic rank is represented by color, and sex is represented by shape. 50000 0 20 40 yrs.since.phd automatically. Here you can see that income increases with years since graduation, but the relationship is by no means linear. Finally, you can visualize the number of professors by rank and sex using a grouped bar chart. The following code provides three bar-chart variations, displayed in figure 19.10: ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar(position="stack") + labs(title='position="stack"') ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar(position="dodge") + labs(title='position="dodge"') ggplot(Salaries, aes(x=rank, fill=sex)) + geom_bar(position="fill") + labs(title='position="fill"') Each of the plots in figure 19.10 emphasizes different aspects of the data. It’s clear from the first two graphs that there are many more full professors than members of other ranks. Additionally, there are more female full professors than female assistant or associate professors. The third graph indicates that the relative percentage of women to men in the full-professor group is less than in the other two groups, even though the total number of women is greater. www.it-ebooks.info 450 Advanced graphics with ggplot2 CHAPTER 19 position="stack" position="dodge" position="fill" 250 1.00 200 200 count count count 0.75 150 sex 0.50 Female 100 100 Male 0.25 50 0 0 AsstProf AssocProf 0.00 Prof AsstProf rank AssocProf Prof AsstProf rank AssocProf Prof rank Figure 19.10 Three versions of a grouped bar chart. 
Each displays the number of professors by academic rank and sex. Note that the label on the y-axis for the third graph isn’t correct. It should say Proportion rather than count. You can correct this by adding y="Proportion" to the labs() function. Options can be used in different ways, depending on whether they occur inside or outside the aes() function. Look at the following examples and try to guess what they do: ggplot(Salaries, aes(x=rank, fill=sex))+ geom_bar() ggplot(Salaries, aes(x=rank)) + geom_bar(fill="red") ggplot(Salaries, aes(x=rank, fill="red")) + geom_bar() In the first example, sex is a variable represented by fill color in the bar graph. In the second example, each bar is filled with the color red. In the third example, ggplot2 assumes that "red" is the name of a variable, and you get unexpected (and undesirable) results. In general, variables should go inside aes(), and assigned constants should go outside aes(). 19.5 Faceting Sometimes relationships are clearer if groups appear in side-by-side graphs rather than overlapping in a single graph. You can create trellis graphs (called faceted graphs in ggplot2) using the facet_wrap() and facet_grid() functions. The syntax is given in table 19.4, where var, rowvar, and colvar are factors. Table 19.4 ggplot2 facet functions Syntax Results facet_wrap(~var, ncol=n) Separate plots for each level of var arranged into n columns facet_wrap(~var, nrow=n) Separate plots for each level of var arranged into n rows www.it-ebooks.info 451 Faceting Table 19.4 ggplot2 facet functions Syntax Results facet_grid(rowvar~colvar) Separate plots for each combination of rowvar and colvar, where rowvar represents rows and colvar represents columns facet_grid(rowvar~.) Separate plots for each level of rowvar, arranged as a single column facet_grid(.~colvar) Separate plots for each level of colvar, arranged as a single row Going back to the choral example, you can a faceted graph using the following code: data(singer, package="lattice") library(ggplot2) ggplot(data=singer, aes(x=height)) + geom_histogram() + facet_wrap(~voice.part, nrow=4) The resulting plot (figure 19.11) displays the distribution of singer heights by voice part. Separating the eight distributions into their own small, side-by-side plots makes them easier to compare. As a second example, let’s create a graph that has faceting and grouping: library(ggplot2) ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank, shape=rank)) + geom_point() + facet_grid(.~sex) Bass 2 Bass 1 Tenor 2 Tenor 1 Alto 2 Alto 1 Soprano 2 Soprano 1 15 10 5 0 15 10 count 5 0 15 10 5 0 15 Figure 19.11 Faceted graph showing the distribution (histogram) of singer heights by voice part 10 5 0 60 65 70 75 60 65 70 height www.it-ebooks.info 75 452 CHAPTER 19 Advanced graphics with ggplot2 Female Male 200000 salary rank AsstProf 150000 AssocProf Prof Figure 19.12 Scatterplot of years since graduation and salary. Academic rank is represented by color and shape, and sex is faceted. 100000 50000 0 20 40 0 20 40 yrs.since.phd The resulting graph is presented in 19.12. It contains the same information, but separating the plot into facets makes it somewhat easier to read. Finally, try displaying the height distribution of choral members in the singer dataset separately for each voice part, using kernel-density plots arranged horizontally. Give each a different color. 
One solution is as follows: data(singer, package="lattice") library(ggplot2) ggplot(data=singer, aes(x=height, fill=voice.part)) + geom_density() + facet_grid(voice.part~.) 0.2 Bass 2 0.1 0.0 0.2 Bass 1 0.1 0.0 Tenor 2 0.2 0.1 0.0 Tenor 1 density 0.2 0.1 Bass 2 Bass 1 Tenor 2 0.2 Tenor 1 0.1 0.0 Alto 2 Alto 1 Alto 1 0.2 0.1 0.0 Soprano 2 0.2 0.1 0.0 Soprano 1 Figure 19.13 Faceted density plots for singer heights by voice part voice.part 0.0 Alto 2 The result is displayed in figure 19.13. Note that the horizontal arrangement facilitates comparisons among the groups. The colors aren’t strictly necessary, but they can aid in distinguishing the plots. (If you’re viewing this in greyscale, be sure to try the example yourself.) 0.2 0.1 0.0 60 65 70 height www.it-ebooks.info 75 Soprano 2 Soprano 1 Adding smoothed lines 453 NOTE You may wonder why the legend for the density plots includes a black diagonal line through each box. Because you can control both the fill color of the density plots and their border colors (black by default), the legend displays both. 19.6 Adding smoothed lines The ggplot2 package contains a wide range of functions for calculating statistical summaries that can be added to graphs. These include functions for binning data and calculating densities, contours, and quantiles. This section looks at methods for adding smoothed lines (linear, nonlinear, and nonparametric) to scatter plots. You can use the geom_smooth() function to add a variety of smoothed lines and confidence regions. An example of a linear regression with confidence limits was given in figure 19.2. The parameters for the function are given in table 19.5. Table 19.5 geom_smooth() options Option Description method= Smoothing function to use. Allowable values include lm, glm, smooth, rlm, and gam, for linear, generalized linear, loess, robust linear, and generalized additive modeling, respectively. smooth is the default. formula= Formula to use in the smoothing function. Examples include y~x (the default), y~log(x), y~poly(x,n) for an nth degree polynomial fit, and y~ns(x,n) for a spline fit with n degrees of freedom. se Plots confidence intervals (TRUE/FALSE). TRUE is the default. level Level of confidence interval to use (95% by default). fullrange Specifies whether the fit should span the full range of the plot (TRUE) or just the data (FALSE). FALSE is the default. Using the Salaries dataset, let’s first examine the relationship between years since obtaining a Ph.D. and faculty salaries. In this example, you’ll use a nonparametric smoothed line (loess) with 95% confidence limits. Ignore sex and rank for now. The code is as follows, and the graph is shown in figure 19.14: data(Salaries, package="car") library(ggplot2) ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) + geom_smooth() + geom_point() The plot suggests that the relationship between experience and salary isn’t linear, at least when considering faculty who graduated many years ago. 
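The options in table 19.5 let you swap in a different smoother without touching the rest of the plot. As an illustration (a sketch rather than one of the chapter's figures), the following replaces the default loess curve with a natural-spline regression and a 90% confidence band; it uses ns() from the splines package, which ships with R:

library(ggplot2)
library(splines)    # provides ns() for the spline formula
data(Salaries, package="car")
ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) +
    geom_smooth(method="lm",          # linear model in place of the default loess
                formula=y~ns(x,3),    # natural spline with 3 degrees of freedom
                level=.90) +          # 90% rather than 95% confidence band
    geom_point()

Three degrees of freedom is an arbitrary but common starting point; increasing it makes the fitted curve more flexible.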
www.it-ebooks.info 454 Advanced graphics with ggplot2 CHAPTER 19 ● ● ● 200000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● salary ● ● ● ● ● ● ● 150000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 100000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 50000 0 20 Figure 19.14 Scatterplot of years since doctorate and current faculty salary. A fitted loess smoothed line with 95% confidence limits has been added. 40 yrs.since.phd Next, let’s fit a quadratic polynomial regression (one bend) separately by gender: ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary, linetype=sex, shape=sex, color=sex)) + geom_smooth(method=lm, formula=y~poly(x,2), se=FALSE, size=1) + geom_point(size=2) 200000 salary The confidence limits are suppressed to simplify the graph (se=FALSE). Genders are differentiated by color, symbol shape, and line type. The plot is displayed in figure 19.15. 150000 sex Female Male 100000 Figure 19.15 Scatterplot of years since graduation vs. salary with separate fitted quadratic regression lines for men and women 50000 0 20 40 yrs.since.phd www.it-ebooks.info Modifying the appearance of ggplot2 graphs 455 The curve for males appears to increase from 0 to about 30 years and then decrease. The curve for women rises from 0 to 40 years. No women in the dataset received their degree more than 40 years ago. For most of the range where both genders have data, men have received higher salaries. Stat functions In this section, you’ve added smoothed lines to scatter plots. The ggplot2 package contains a wide range of statistical functions (called stat functions) for calculating the quantities necessary to produce a variety of data visualizations. Typically, geom functions call the stat functions implicitly, and you won’t need to deal with them directly. But it’s useful to know they exist. Each stat function has help pages that can aid you in understanding how the geoms work. For example, the geom_smooth() function relies on the stat_smooth() function to calculate the quantities needed to plot a fitted line and its confidence limits. The help page for geom_smooth() is sparse, but the help page for stat_smooth() contains a wealth of useful information. When exploring how a geom works and what options are available, be sure to check out both the geom function and its related stat function(s). 19.7 Modifying the appearance of ggplot2 graphs In chapter 3, you saw how to customize base graphics using graphical parameters placed in the par() function or specific plotting functions. Unfortunately, changing base graphics parameters has no effect on ggplot2 graphs. Instead, the ggplot2 package offers specific functions for changing the appearance of its graphs. In this section, we’ll look at several functions that allow you to customize the appearance of ggplot2 graphs. 
You’ll learn how to customize the appearance of axes (limits, tick marks, and tick mark labels), the placement and content of legends, and the colors used to represent variable values. You’ll also learn how to create custom themes (allowing you to add a consistent look and feel to your graphs) and arrange several plots into a single graph. 19.7.1 Axes The ggplot2 package automatically creates plot axes with tick marks, tick mark labels, and axis labels. Often they look fine, but occasionally you’ll want to take greater control over their appearance. You’ve already seen how to use the labs() function to add a title and change the axis labels. In this section, you’ll customize the axes themselves. Table 19.6 contains functions that are useful for customizing axes. Table 19.6 Functions that control the appearance of axes and tick marks Function Options scale_x_continuous(), scale_y_continuous() breaks= specifies tick marks, labels= specifies labels for tick marks, and limits= controls the range of the values displayed. www.it-ebooks.info 456 Advanced graphics with ggplot2 CHAPTER 19 Table 19.6 Functions that control the appearance of axes and tick marks (continued) Function Options scale_x_discrete(), scale_y_discrete() breaks= places and orders the levels of a factor, labels= specifies the labels for these levels, and limits= indicates which levels should be displayed. coord_flip() Reverses the x and y axes. As you can see, ggplot2 functions distinguish between the x- and y-axes and whether an axis represents a continuous or discrete (factor) variable. Let’s apply these functions to a graph with grouped box plots for faculty salaries by rank and sex. The code is as follows: data(Salaries,package="car") library(ggplot2) ggplot(data=Salaries, aes(x=rank, y=salary, fill=sex)) + geom_boxplot() + scale_x_discrete(breaks=c("AsstProf", "AssocProf", "Prof"), labels=c("Assistant\nProfessor", "Associate\nProfessor", "Full\nProfessor")) + scale_y_continuous(breaks=c(50000, 100000, 150000, 200000), labels=c("$50K", "$100K", "$150K", "$200K")) + labs(title="Faculty Salary by Rank and Sex", x="", y="") The resulting graph is provided in figure 19.16. Clearly, average income goes up with rank, and men make more than women within each teaching rank. (For a more complete picture, try controlling for years since Ph.D.) Faculty Salary by Rank and Sex $200K sex $150K Female Male $100K Figure 19.16 Box plots of faculty salaries grouped by academic rank and sex. The axis text has been customized. $50K Assistant Professor Associate Professor Full Professor www.it-ebooks.info 457 Modifying the appearance of ggplot2 graphs 19.7.2 Legends Legends are guides that indicate how visual characteristics like color, shape, and size represent qualities of the data. The ggplot2 package generates legends automatically, and in many cases they suffice quite well. At other times, you may want to customize them. The title and placement are the most commonly customized characteristics. When modifying a legend’s title, you have to take into account whether the legend is based on color, fill, size, shape, or a combination. In figure 19.16, the legend represents the fill aesthetic (as you can see in the aes() function), so you can change the title by adding fill="mytitle" to the labs() function. The placement of the legend is controlled by the legend.position option in the theme() function. Possible values include "left", "top", "right" (the default), and "bottom". 
Alternatively, you can specify a two-element vector that gives the position within the graph. Let’s modify the graph in figure 19.16 so that the legend appears in the upper-left corner and the title is changed from sex to Gender. This can be accomplished with the following code: data(Salaries,package="car") library(ggplot2) ggplot(data=Salaries, aes(x=rank, y=salary, fill=sex)) + geom_boxplot() + scale_x_discrete(breaks=c("AsstProf", "AssocProf", "Prof"), labels=c("Assistant\nProfessor", "Associate\nProfessor", "Full\nProfessor")) + scale_y_continuous(breaks=c(50000, 100000, 150000, 200000), labels=c("$50K", "$100K", "$150K", "$200K")) + labs(title="Faculty Salary by Rank and Gender", x="", y="", fill="Gender") + theme(legend.position=c(.1,.8)) The results are shown in figure 19.17. Faculty Salary by Rank and Gender Gender $200K Female Male $150K Figure 19.17 Box plots of faculty salaries grouped by academic rank. The axis text has been customized, along with the legend title and position. $100K $50K Assistant Professor www.it-ebooks.info Associate Professor Full Professor 458 CHAPTER 19 Advanced graphics with ggplot2 In this example, the upper-left corner of the legend was placed 10% from the left edge and 80% from the bottom edge of the graph. If you want to omit the legend, use legend.position="none". The theme() function can change many aspects of a ggplot2 graph’s appearance; other examples are given in section 19.7.4. 19.7.3 Scales The ggplot2 package uses scales to map observations from the data space to the visual space. Scales apply to both continuous and discrete variables. In figure 19.15, a continuous scale was used to map the numeric values of the yrs.since.phd variable to distances along the x-axis and map the numeric values of the salary variable to distances along the y-axis. Continuous scales can map numeric variables to other characteristics of the plot. Consider the following code: ggplot(mtcars, aes(x=wt, y=mpg, size=disp)) + geom_point(shape=21, color="black", fill="cornsilk") + labs(x="Weight", y="Miles Per Gallon", title="Bubble Chart", size="Engine\nDisplacement") The aes() parameter size=disp generates a scale for the continuous variable disp (engine displacement) and uses it to control the size of the points. The result is the bubble chart presented in figure 19.18. The graph shows that auto mileage decreases with both weight and engine displacement. Bubble Chart 35 Miles Per Gallon 30 Engine Displacement 25 100 200 300 20 400 15 Figure 19.18 Bubble chart of auto weight by mileage, with point size representing engine displacement 10 2 3 4 5 Weight www.it-ebooks.info 459 Modifying the appearance of ggplot2 graphs 200000 salary rank 150000 AsstProf AssocProf Prof 100000 Figure 19.19 Scatterplot of salary vs. experience for assistant, associate, and full professors. Point colors have been specified manually. 50000 0 20 40 yrs.since.phd In the discrete case, you can use a scale to associate visual cues (for example, color, shape, line type, size, and transparency) with the levels of a factor. The code data(Salaries, package="car") ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary, color=rank)) + scale_color_manual(values=c("orange", "olivedrab", "navy")) + geom_point(size=2) uses the scale_color_manual() function to set the point colors for the three academic ranks. The results are displayed in figure 19.19. 
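The same manual-mapping idea extends to other aesthetics. The sketch below (an extra example, not one of the chapter's figures) assigns both colors and plotting symbols to the three ranks; scale_shape_manual() follows the same values= convention as scale_color_manual():

library(ggplot2)
data(Salaries, package="car")
ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary, color=rank, shape=rank)) +
    scale_color_manual(values=c("orange", "olivedrab", "navy")) +
    scale_shape_manual(values=c(16, 17, 15)) +    # solid circle, triangle, and square
    geom_point(size=2)

Because rank drives both aesthetics, ggplot2 merges them into a single combined legend.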
If you're color challenged like I am (does purple go with orange?), you can use color presets via the scale_color_brewer() and scale_fill_brewer() functions to specify attractive color sets. For example, try the code

ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary, color=rank)) +
  scale_color_brewer(palette="Set1") +
  geom_point(size=2)

and see what you get. Replacing palette="Set1" with another value (such as "Set2", "Set3", "Pastel1", "Pastel2", "Paired", "Dark2", or "Accent") will result in a different color scheme. To see the available color sets, use

library(RColorBrewer)
display.brewer.all()

to generate a display. For more information, see help(scale_color_brewer) and the ColorBrewer homepage (http://colorbrewer2.org).

The concept of scales is general in ggplot2. Although we won't cover it further here, you can control many other characteristics of scales. See the functions that have scale_ in their names for more details.

19.7.4 Themes

You've seen several methods for modifying specific visual elements of ggplot2 graphs. Themes allow you to control the overall appearance of these graphs. Options in the theme() function let you change fonts, backgrounds, colors, gridlines, and more. Themes can be used once or saved and applied to many graphs. Consider the following:

data(Salaries, package="car")
library(ggplot2)
mytheme <- theme(plot.title=element_text(face="bold.italic",
                                         size=14, color="brown"),
                 axis.title=element_text(face="bold.italic",
                                         size=10, color="brown"),
                 axis.text=element_text(face="bold", size=9,
                                        color="darkblue"),
                 panel.background=element_rect(fill="white",
                                               color="darkblue"),
                 panel.grid.major.y=element_line(color="grey",
                                                 linetype=1),
                 panel.grid.minor.y=element_line(color="grey",
                                                 linetype=2),
                 panel.grid.minor.x=element_blank(),
                 legend.position="top")
ggplot(Salaries, aes(x=rank, y=salary, fill=sex)) +
  geom_boxplot() +
  labs(title="Salary by Rank and Sex", x="Rank", y="Salary") +
  mytheme

Adding + mytheme to the plotting statements generates the graph shown in figure 19.20.

Figure 19.20 Box plots with a customized theme

The theme, mytheme, specifies that plot titles should be printed in brown, 14-point, bold italics; axis titles should be printed in brown, 10-point, bold italics; axis labels should be printed in dark blue, 9-point bold; the plot area should have a white fill and dark blue borders; major horizontal grid lines should be solid gray; minor horizontal grid lines should be dashed gray; vertical grid lines should be suppressed; and the legend should appear at the top of the graph. The theme() function gives you great control over the look of the finished product. See help(theme) to learn more about these options.
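Because mytheme is an ordinary R object, the same saved theme can be added to any other plot to give it a consistent look. A minimal sketch (assuming mytheme from the listing above is still defined; the scatterplot itself is just an example):

library(ggplot2)
data(Salaries, package="car")
# Reuse the saved theme on a completely different plot
ggplot(Salaries, aes(x=yrs.since.phd, y=salary, color=rank)) +
  geom_point() +
  labs(title="Salary by Years Since Ph.D.",
       x="Years Since Ph.D.", y="Salary") +
  mytheme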
19.7.5 Multiple graphs per page

In section 3.5, you used the graphic parameter mfrow and the base function layout() to combine two or more base graphs into a single plot. Again, this approach won't work with plots created with the ggplot2 package. The easiest way to place multiple ggplot2 graphs in a single figure is to use the grid.arrange() function in the gridExtra package. You'll need to install it (install.packages("gridExtra")) before first use.

Let's create three ggplot2 graphs and place them in a single graph. The code is given in the following listing:

data(Salaries, package="car")
library(ggplot2)
p1 <- ggplot(data=Salaries, aes(x=rank)) + geom_bar()
p2 <- ggplot(data=Salaries, aes(x=sex)) + geom_bar()
p3 <- ggplot(data=Salaries, aes(x=yrs.since.phd, y=salary)) + geom_point()
library(gridExtra)
grid.arrange(p1, p2, p3, ncol=3)

Each graph is saved as an object and then arranged into a single plot with the grid.arrange() function. The resulting graph is shown in figure 19.21.

Figure 19.21 Placing three ggplot2 plots in a single graph

Note the difference between faceting and multiple graphs. Faceting creates an array of plots based on one or more categorical variables. In this section, you're arranging completely independent plots into a single graph.

19.8 Saving graphs

You can save the graphs created by ggplot2 using the standard methods discussed in section 1.3.4. But a convenience function named ggsave() can be particularly useful. Options include which plot to save, where to save it, and in what format. For example,

myplot <- ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(file="mygraph.png", plot=myplot, width=5, height=4)

saves myplot as a 5-inch by 4-inch PNG file named mygraph.png in the current working directory. You can save the graph in a different format by setting the file extension to ps, tex, jpeg, pdf, tiff, png, bmp, svg, or wmf. The wmf format is only available on Windows machines.

If you omit the plot= option, the most recently created graph is saved. The code

ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
ggsave(file="mygraph.pdf")

is valid and saves the graph to disk. See help(ggsave) for additional details.
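If you need higher-resolution output for print, ggsave() also accepts a dpi= argument. A minimal sketch (the file name mygraph300.png is just an example):

library(ggplot2)
myplot <- ggplot(data=mtcars, aes(x=mpg)) + geom_histogram()
# Save a 5 x 4 inch PNG at 300 dots per inch
ggsave(file="mygraph300.png", plot=myplot, width=5, height=4, dpi=300)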
19.9 Summary

This chapter reviewed the ggplot2 package, which provides advanced graphical methods based on a comprehensive grammar of graphics. The package is designed to provide you with a complete and comprehensive alternative to the native graphics provided with R. It offers methods for creating attractive and meaningful visualizations of data that are difficult to generate in other ways.

The ggplot2 package can be difficult to learn, but a wealth of material is available to help you on your journey (I promised myself that I would never use that word, but learning ggplot2 can certainly feel like one). A list of all ggplot2 functions, along with examples, can be found at http://docs.ggplot2.org. To learn about the theory underlying ggplot2, see the original book by Wickham (2009). Finally, Chang (2013) has written a very practical book, chock full of useful examples. Chang's book is definitely where I would start.

You should now have a firm grasp of the many ways that R allows you to create visual representations of data. If a picture is worth a thousand words, and R provides a thousand ways to create a picture, then R must be worth a million words (or something to that effect). In the next two chapters, you'll delve deeper into R as a programming language.

Advanced programming

This chapter covers
■ A deeper dive into the R language
■ Using R's OOP features to create generic functions
■ Tweaking code to run more efficiently
■ Finding and correcting programming errors

Previous chapters introduced various topics that are important for application development, including data types (section 2.2), control flow (section 5.4), and function creation (section 5.5). This chapter will review these aspects of R as a programming language, but from a more advanced and detailed perspective. By the end of this chapter, you'll have a better idea of how the R language works.

We'll start with a review of objects, data types, and control flow before moving on to the details of function creation, including the role of scope and environments. The chapter then introduces R's approach to object-oriented programming and discusses the creation of generic functions. Finally, we'll go over tips for writing efficient code and debugging applications. A mastery of these topics will help you understand the code in other people's applications and aid you in creating new applications. In chapter 21, you'll have an opportunity to put these skills into practice by creating a useful package from start to finish.

20.1 A review of the language

R is an object-oriented, functional, array programming language in which objects are specialized data structures, stored in RAM, and accessed via names or symbols. Names of objects consist of uppercase and lowercase letters, the digits 0–9, the period, and the underscore. Names are case-sensitive and can't start with a digit, and a period is treated as a simple character without special meaning. Unlike in languages such as C and C++, you can't access memory locations directly.

Data, functions, and just about everything else that can be stored and named are objects. Additionally, the names and symbols themselves are objects that can be manipulated. All objects are stored in RAM during program execution, which has significant implications for the analysis of massive datasets.

Every object has attributes: meta-information describing the characteristics of the object. Attributes can be listed with the attributes() function and set with the attr() function. A key attribute is an object's class. R functions use information about an object's class in order to determine how the object should be handled. The class of an object can be read and set with the class() function. Examples will be given throughout this chapter and the next.
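As a small preview, here is a minimal sketch (the vector x and the "units" attribute are arbitrary examples, not from the text):

x <- c(10, 20, 30)
class(x)                        # "numeric"
attributes(x)                   # NULL; no attributes have been set yet
attr(x, "units") <- "widgets"   # attach an arbitrary custom attribute
attributes(x)                   # now lists $units
class(x) <- "myclass"           # the class attribute can be set directly
class(x)                        # "myclass"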
20.1.1 Data types

There are two fundamental data types: atomic vectors and generic vectors. Atomic vectors are arrays that contain a single data type. Generic vectors, also called lists, are collections of atomic vectors. Lists are recursive in that they can also contain other lists. This section considers both types in some detail.

Unlike in many languages, you don't have to declare an object's data type or allocate space for it. The type is determined implicitly from the object's contents, and the size grows or shrinks automatically depending on the type and number of elements the object contains.

ATOMIC VECTORS
Atomic vectors are arrays that contain a single data type (logical, real, complex, character, or raw). For example, each of the following is a one-dimensional atomic vector:

passed <- c(TRUE, TRUE, FALSE, TRUE)
ages <- c(15, 18, 25, 14, 19)
cmplxNums <- c(1+2i, 0+1i, 39+3i, 12+2i)
names <- c("Bob", "Ted", "Carol", "Alice")

Vectors of type "raw" hold raw bytes and aren't discussed here.

Many R data types are atomic vectors with special attributes. For example, R doesn't have a scalar type. A scalar is an atomic vector with a single element. So k <- 2 is a shortcut for k <- c(2).

A matrix is an atomic vector that has a dimension attribute, dim, containing two elements (the number of rows and the number of columns). For example, start with a one-dimensional numeric vector x:

> x <- c(1,2,3,4,5,6,7,8)
> class(x)
[1] "numeric"
> print(x)
[1] 1 2 3 4 5 6 7 8

Then add a dim attribute:

> attr(x, "dim") <- c(2,4)

The object x is now a 2 × 4 matrix of class matrix:

> print(x)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> class(x)
[1] "matrix"
> attributes(x)
$dim
[1] 2 4

Row and column names can be attached by adding a dimnames attribute:

> attr(x, "dimnames") <- list(c("A1", "A2"),
                              c("B1", "B2", "B3", "B4"))
> print(x)
   B1 B2 B3 B4
A1  1  3  5  7
A2  2  4  6  8

Finally, the matrix can be returned to a one-dimensional vector by removing the dim attribute:

> attr(x, "dim") <- NULL
> class(x)
[1] "numeric"
> print(x)
[1] 1 2 3 4 5 6 7 8

An array is an atomic vector with a dim attribute that has three or more elements. Again, you set the dimensions with the dim attribute, and you can attach labels with the dimnames attribute. Like one-dimensional vectors, matrices and arrays can be of type logical, numeric, character, complex, or raw. But you can't mix types in a single matrix or array.

The attr() function allows you to create arbitrary attributes and associate them with an object. Attributes store additional information about an object and can be used by functions to determine how they're processed.

There are a number of special functions for setting attributes, including dim(), dimnames(), names(), row.names(), class(), and tsp(). The latter is used to create time series objects. These special functions have restrictions on the values that can be set. Unless you're creating custom attributes, it's always a good idea to use these special functions. Their restrictions and the error messages they produce make coding errors less likely and more obvious.

GENERIC VECTORS OR LISTS
Lists are collections of atomic vectors and/or other lists. Data frames are a special type of list, where each atomic vector in the collection has the same length. Consider the iris data frame that comes with the base R installation. It describes four physical measures taken on each of 150 plants, along with their species (setosa, versicolor, or virginica):

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

This data frame is actually a list containing five atomic vectors. It has a names attribute (a character vector of variable names), a row.names attribute (a numeric vector identifying individual plants), and a class attribute with the value "data.frame". Each vector represents a column (variable) in the data frame.
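The same structure is easy to see on a small scale. A minimal sketch with a hypothetical two-column data frame d (the names and values are arbitrary):

d <- data.frame(x=1:3, y=c("a", "b", "c"))
typeof(d)        # "list": a data frame is stored as a generic vector
length(d)        # 2: one atomic vector per column
names(d)         # "x" "y"
class(d)         # "data.frame"
attributes(d)    # names, class, and row.names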
You can see the same thing for iris by printing the data frame with the unclass() function and obtaining its attributes with the attributes() function:

unclass(iris)
attributes(iris)

The output is omitted here to save space.

It's important to understand lists because R functions frequently return them as values. Let's look at an example using a cluster-analysis technique from chapter 16. Cluster analysis uses a family of methods to identify naturally occurring groupings of observations. You'll apply k-means cluster analysis (section 16.3.1) to the iris data. Assume that there are three clusters present in the data, and observe how the observations (rows) become grouped. You'll ignore the species variable and use only the physical measures of each plant to form the clusters. The required code is

set.seed(1234)
fit <- kmeans(iris[1:4], 3)

What information is contained in the object fit? The help page for kmeans() indicates that the function returns a list with nine components. The str() function displays the object's structure, and the unclass() function can be used to examine the object's contents directly. The length() function indicates how many components the object contains, and the names() function provides the names of these components. You can use the attributes() function to examine the attributes of the object. The contents of the object returned by kmeans() are explored here:

> names(fit)
[1] "cluster"      "centers"      "totss"        "withinss"
[5] "tot.withinss" "betweenss"    "size"         "iter"
[9] "ifault"

> unclass(fit)
$cluster
  [1] 1 1 1 1 1 1 ...
(the cluster memberships for all 150 observations; output truncated here to save space)

$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

$totss
[1] 681.4

$withinss
[1] 15.15 39.82 23.88

$tot.withinss
[1] 78.85

$betweenss
[1] 602.5

$size
[1] 50 62 38

$iter
[1] 2

$ifault
[1] 0

Executing sapply(fit, class) returns the class of each component in the object:

> sapply(fit, class)
     cluster      centers        totss     withinss tot.withinss
   "integer"     "matrix"    "numeric"    "numeric"    "numeric"
   betweenss         size         iter       ifault
   "numeric"    "integer"    "integer"    "integer"

In this example, cluster is an integer vector containing the cluster memberships, and centers is a matrix containing the cluster centroids (means on each variable for each cluster). The size component is an integer vector containing the number of plants in each of the three clusters. To learn about the other components, see the Value section of help(kmeans).
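Before unpacking fit piece by piece, it can help to get a compact overview with str(). A quick sketch using the fit object created above (output not shown):

# Display the structure of every component returned by kmeans()
str(fit)
# Confirm how many components the list contains
length(fit)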
INDEXING
Learning to unpack the information in a list is a critical R programming skill. The elements of any data object can be extracted via indexing. Before diving into a list, let's look at extracting elements from an atomic vector.

Elements are extracted using object[index], where object is the vector and index is an integer vector. If the elements of the atomic vector have been named, index can also be a character vector with these names. Note that in R, indices start with 1, not 0 as in many other languages.

Here is an example, using this approach for an atomic vector without named elements:

> x <- c(20, 30, 40)
> x[3]
[1] 40
> x[c(2,3)]
[1] 30 40

For an atomic vector with named elements, you could use

> x <- c(A=20, B=30, C=40)
> x[c(2,3)]
 B  C
30 40
> x[c("B", "C")]
 B  C
30 40

For lists, components (atomic vectors or other lists) can be extracted using object[index], where index is an integer vector. The following uses the fit object from the kmeans example above (it also appears a little later, in listing 20.1):

> fit[c(2,7)]
$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

$size
[1] 50 62 38

Note that components are returned as a list. To get just the elements in the component, use object[[integer]]:

> fit[2]
$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

> fit[[2]]
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

In the first case, a list is returned. In the second case, a matrix is returned. The difference can be important, depending on what you do with the results. If you want to pass the results to a function that requires a matrix as input, you would want to use the double-bracket notation.

To extract a single named component, you can use the $ notation. In this case, object[[integer]] and object$name are equivalent:

> fit$centers
  Sepal.Length Sepal.Width Petal.Length Petal.Width
1        5.006       3.428        1.462       0.246
2        5.902       2.748        4.394       1.434
3        6.850       3.074        5.742       2.071

This also explains why the $ notation works with data frames. Consider the iris data frame. The data frame is a special case of a list, where each variable is represented as a component. This is why iris$Sepal.Length returns the 150-element vector of sepal lengths.

Notations can be combined to obtain the elements within components. For example,

> fit[[2]][1,]
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
       5.006        3.428        1.462        0.246

extracts the second component of fit (a matrix of means) and returns the first row (the means for the first cluster on each of the four variables).

By extracting the components and elements of lists returned by functions, you can take the results and go further. For example, to plot the cluster centroids with a line graph, you can use the following code.

Listing 20.1 Plotting the centroids from a k-means cluster analysis

> set.seed(1234)
> fit <- kmeans(iris[1:4], 3)
> means <- fit$centers                  # Obtains the cluster means
> library(reshape2)
> dfm <- melt(means)                    # Reshapes the data to long form
> names(dfm) <- c("Cluster", "Measurement", "Centimeters")
> dfm$Cluster <- factor(dfm$Cluster)
> head(dfm)
  Cluster  Measurement Centimeters
1       1 Sepal.Length       5.006
2       2 Sepal.Length       5.902
3       3 Sepal.Length       6.850
4       1  Sepal.Width       3.428
5       2  Sepal.Width       2.748
6       3  Sepal.Width       3.074
> library(ggplot2)
> ggplot(data=dfm,                      # Plots a line graph
         aes(x=Measurement, y=Centimeters, group=Cluster)) +
    geom_point(size=3, aes(shape=Cluster, color=Cluster)) +
    geom_line(size=1, aes(color=Cluster)) +
    ggtitle("Profiles for Iris Clusters")

First, the matrix of cluster centroids is extracted (rows are clusters, and columns are variable means). The matrix is then reshaped into long format using the reshape2 package (see section 5.6.2). Finally, the data is plotted using the ggplot2 package (see section 18.3). The resulting graph is displayed in figure 20.1.

Figure 20.1 A plot of the centroids (means) for three clusters extracted from the Iris dataset using k-means clustering

This type of graph is possible because all the variables plotted use the same units of measurement (centimeters). If the cluster analysis involved variables on different scales, you would need to standardize the data before plotting and label the y-axis something like Standardized Scores. See section 16.1 for details.

Now that you can represent data in structures and unpack the results, let's look at flow control.

20.1.2 Control structures

When the R interpreter processes code, it reads sequentially, line by line. If a line isn't a complete statement, it reads additional lines until a fully formed statement can be constructed. For example, if you wanted to add 3 + 2 + 5,

> 3 + 2 + 5
[1] 10

will work. So will

> 3 + 2 +
  5
[1] 10

The + sign at the end of the first line indicates that the statement isn't complete. But

> 3 + 2
[1] 5
> + 5
[1] 5

obviously doesn't work, because 3 + 2 is interpreted as a complete statement.

Sometimes you need to process code nonsequentially. You may want to execute code conditionally or repeat one or more statements multiple times. This section describes three control-flow functions that are particularly useful in writing functions: for(), if(), and ifelse().

FOR LOOPS
The for() function allows you to execute a statement repeatedly. The syntax is

for(var in seq){
  statements
}

where var is a variable name and seq is an expression that evaluates to a vector. If there is only one statement, the curly braces are optional:

> for(i in 1:5) print(1:i)
[1] 1
[1] 1 2
[1] 1 2 3
[1] 1 2 3 4
[1] 1 2 3 4 5
> for(i in 5:1) print(1:i)
[1] 1 2 3 4 5
[1] 1 2 3 4
[1] 1 2 3
[1] 1 2
[1] 1

Note that var continues to exist after the loop exits. Here, i equals 1.

IF() AND ELSE
The if() function allows you to execute statements conditionally. The syntax for the if() construct is

if(condition){
  statements
} else {
  statements
}

The condition should be a one-element logical vector (TRUE or FALSE) and can't be missing (NA). The else portion is optional. If there is only one statement, the curly braces are also optional.

As an example, consider the following code fragment:

if(interactive()){
  plot(x, y)
} else {
  png("myplot.png")
  plot(x, y)
  dev.off()
}

If the code is being run interactively, the interactive() function returns TRUE and a plot is sent to the screen. Otherwise, the plot is saved to disk. You'll use the if() function extensively in chapter 21.

IFELSE()
The ifelse() function is a vectorized version of if(). Vectorization allows a function to process objects without explicit looping. The format of ifelse() is

ifelse(test, yes, no)

where test is an object that has been coerced to logical mode, yes returns values for true elements of test, and no returns values for false elements of test.

Let's say that you have a vector of p-values that you have extracted from a statistical analysis that involved six statistical tests, and you want to flag the tests that are significant at the p < .05 level.
This can be accomplished with the following code:

> pvalues <- c(.0867, .0018, .0054, .1572, .0183, .5386)
> results <- ifelse(pvalues < .05, "Significant", "Not Significant")
> results
[1] "Not Significant" "Significant"     "Significant"
[4] "Not Significant" "Significant"     "Not Significant"

The ifelse() function loops through the vector pvalues and returns a character vector containing the value "Significant" or "Not Significant", depending on whether the corresponding element of pvalues is less than .05. The same result can be accomplished with explicit loops using

pvalues <- c(.0867, .0018, .0054, .1572, .0183, .5386)
results <- vector(mode="character", length=length(pvalues))
for(i in 1:length(pvalues)){
  if (pvalues[i] < .05) results[i] <- "Significant"
  else results[i] <- "Not Significant"
}

The vectorized version is faster and more efficient.

There are other control structures, including while(), repeat, and switch(), but the ones presented here are the most commonly used. Now that you have data structures and control structures, we can talk about creating functions.

20.1.3 Creating functions

Almost everything in R is a function. Even arithmetic operators like +, -, /, and * are actually functions. For example, 2 + 2 is