On Regression: Part I

Everyone loves doing regressions, but what are they? In the most general sense, a regression model says that the response \( y \) is a noisy observation of a function \( f: \mathbb{R}^p \to \mathbb{R} \) whose argument is the vector \( x \). So if we have \( n \) observations of \( y \) and the \( p \)-vector \( x \), then the \( i^{th} \) entry of \( y \) can be written

\[ y_i = f(x_i) + \epsilon_i, \]

with \( \epsilon_i \sim G \). Here \( G \) is just some probability law and \( f \) is some function; I've introduced no other conditions. \( G \) might not have a density and \( f \) may not be continuous. In practice, one usually assumes that the \( \epsilon \)'s are \( iid \) from \( G \), but they need not be, as in time series models, for example. For now, we'll stick with \( iid \) errors, a pretty mild assumption unless we're working with spatial or longitudinal data or something like that. Really, I just want to impress upon you that if I'm interested in understanding how a variable \( y \) depends on variables \( x_1,\ldots,x_p \), a regression model can be very flexible and allow for most of the behavior one could reasonably expect, depending on how we model \( f \) and \( \epsilon \).
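
Just to make that concrete, here's a throwaway toy simulation in R (my own illustration, not used anywhere below): one draw from such a model with a deliberately non-linear \( f \) and heavy-tailed, non-Gaussian errors.

f0 <- function(x) sin(2 * x) + 0.5 * x^2  # some arbitrary non-linear f
x0 <- runif(200, -3, 3)                   # 200 observations of a scalar x
eps0 <- rt(200, df = 3)                   # iid errors, with G a t distribution on 3 df
y0 <- f0(x0) + eps0                       # the general recipe: y = f(x) + error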

That's a much more general setting than the one most of us have in mind when we think of regression, and for the first couple of years I spent becoming a data nerd, it isn't what I meant when I talked about regression either; I just meant a linear model. So let's start with the specific and familiar and work our way toward the general and the weird. Our old friend the Gaussian linear model is just

\[ y_i = x_i \beta + \epsilon_i \]

with \( \epsilon_i \sim N(0,\sigma^2) \). This is the simplest regression model, and it's what a lot of software packages refer to as "regression," so no wonder the term has been abused a bit. Notice though that we've made two important assumptions: (1) the function \( f \) is a linear map and (2) the noise is \( iid \) Gaussian. These assumptions are much stronger: we've picked a specific distribution for \( \epsilon \), and we've decided that among the myriad functions that exist, ours is going to be linear. This is great if it's a good approximation of the actual relationship between \( x \) and \( y \), and may be really terrible if it isn't. Let's have a look at some data generated from this model (so they fit the model exactly).

library(ggplot2)  # needed for ggplot()
n <- 1000
x <- rnorm(n)
alpha <- 0.4
bet <- 1.2
# simulate straight from the model: y = alpha + bet * x + Gaussian noise with sd 0.81
y <- alpha + x * bet + rnorm(n, 0, 0.81)
dat <- data.frame(y = y, x = x)
fit <- lm(y ~ x, data = dat)
dat$yhat <- fit$fitted.values
ggplot(dat, aes(y = y, x = x)) + geom_point(size = 1) +
    geom_line(aes(y = yhat), colour = 2)

[Plot: simulated data with the fitted regression line overlaid.]

The data certainly seem to lie on a line, but it's a little bit of a messy line, and that's because the noise scale (a standard deviation of 0.81) is substantial relative to the magnitude of the effect (a slope of 1.2). What if I make the line wiggly?

# add a small periodic bump; multiplot() is a helper for arranging ggplots side by side
# (e.g. the one from the Cookbook for R); gridExtra::grid.arrange() also works
dat$y2 <- dat$y + 0.1 * cos(pi * dat$x)
p1 <- ggplot(dat, aes(x = x, y = y)) + geom_point(size = 1) + ggtitle("straight line")
p2 <- ggplot(dat, aes(x = x, y = y2)) + geom_point(size = 1) + ggtitle("wiggly line")
multiplot(p1, p2, cols = 2)

[Plot: side-by-side scatterplots of the straight-line data and the wiggly-line data.]

Can you tell the difference? No? Good, neither can I. But why not? The line isn't straight anymore; I've added a periodic component, so we know the model is wrong. The reason is that the magnitude of the periodic component is small relative to the error variance, so it's simply lost in the noise. The linear model is a good approximation of what's actually going on in the data. This is a nice, very simple illustration of why I find the very common objection "you're crazy if you think the data actually fit the assumptions of your model" funny. The question is not whether the data were generated by the model. It is, after all, just a model, and there's a reason we use the word model instead of "candidate true state of the world." Rather, the question is whether the model is reasonably close to what's going on in the data. One of the most famous lines in statistics (a fine distinction, if I do say so myself) is George Box's quip that "all models are wrong, but some are useful." That is, I think, the only reasonable way to think about it.
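
To see the same point numerically (this check is my own addition, not part of the original code), fit the straight-line model to the wiggly data; the estimates should land close to the truth because the small cosine bump essentially averages out.

fit2 <- lm(y2 ~ x, data = dat)  # straight-line fit to the wiggly data
coef(fit2)                      # intercept and slope should be close to 0.4 and 1.2
summary(fit2)$sigma             # residual sd should be close to the true 0.81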

Now sometimes we can see very clearly that the model is wrong. Let's just make that periodic component a bit bigger.

dat$y3 <- dat$y + cos(pi * dat$x)
p1 <- ggplot(dat, aes(x = x, y = y)) + geom_point(size = 1) + ggtitle("straight line")
p2 <- ggplot(dat, aes(x = x, y = y3)) + geom_point(size = 1) + ggtitle("wiggly-er line")
multiplot(p1, p2, cols = 2)

[Plot: side-by-side scatterplots of the straight-line data and the wiggly-er data.]
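
A quick way to see the misfit without squinting at scatterplots (again, an illustrative check of my own, not from the original code) is to fit the straight line anyway and plot the residuals against \( x \); the leftover cosine is hard to miss.

fit3 <- lm(y3 ~ x, data = dat)
dat$resid3 <- resid(fit3)       # the part the straight line can't explain
ggplot(dat, aes(x = x, y = resid3)) + geom_point(size = 1) +
    ggtitle("residuals from the straight-line fit")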

In upcoming posts I'll go beyond linear models and describe ways to model (1) more general regression functions, like the periodic one above; and (2) non-Gaussian error distributions.

A Julia Installation Diary

As a statistician I often feel conflicted about software. On the one hand, R is great because there are so many packages available. On the other hand, R is not a good environment for methodological development, particularly for those interested in high-dimensional statistics and big data. This is mainly because R is slow, particularly in loops, and loops are something that Bayesian statisticians using MCMC for computation cannot avoid. Moreover, it tends to hog memory. As a result, most Bayesians I know do one of two things:

  1. Use R essentially as a wrapper for compiled C or Fortran code.
  2. Do computation for novel methods in Matlab.

The former has the obvious issue of being more time-consuming than R programming, often by a lot. Personally, I don't like debugging segfaults, and with the advent of parallel computing, sparse data structures, et cetera, efficient C programming is getting harder. Matlab is often as fast as or faster than C because it natively parallelizes things like matrix multiplication, makes use of a JIT to deal with the looping problem in interpreted languages, and uses optimized linear algebra libraries. I know a number of statisticians who insist that Matlab could not possibly be as fast as C, but this evidences a misunderstanding of how these languages actually work. Good Matlab code in a parallel environment will often do as well as or a bit better than someone's home-baked C code. Unfortunately, Matlab is terribly expensive, and it sucks for anything besides matrix programming. MathWorks' basic approach is to get you hooked on Matlab while you're in grad school. But commercial licenses cost about 3500 bucks apiece (2000 for Matlab, which is worthless on its own, and another 1500 for the Statistics Toolbox, which contains all the random number generators and some mostly crappy functions for fitting standard models). I don't care to think about how much Duke spends on the academic site license. It must be several million dollars a year.

Julia

During one of my recent rants on the topic, a buddy of mine suggested I try Julia. If you've never heard of Julia, you should go to the website and look at the performance benchmarks. It's impressively fast, and it isn't slow in loops the way most interpreted languages are. It's free and open source, and the syntax is very similar to Matlab. The startup costs in terms of writing code were pretty minimal for me, and I've found it performs as well as or better than Matlab for large-scale Bayesian computation. What was relatively difficult was getting both my Linux box and my Mac laptop set up with a fully functioning Julia and a modified version of ESS for writing and debugging code in Emacs. Since I'm becoming something of an apostle for Julia, I'll describe the process of getting it to work. Not all statisticians are hackers, and as a breed we often hesitate to dedicate much time to mucking around with Linux. Hopefully the information here will ease the process enough that more folks will give it a try.
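
To give a flavor of the syntax (a made-up toy of mine, not one of the official benchmarks): a naive double loop like the one below is perfectly idiomatic Julia, looks almost exactly like the Matlab version, and runs at close-to-compiled speed rather than crawling the way it would in R.

# naive row sums written with explicit loops, Matlab-style
function rowsums(A)
    n, p = size(A)
    out = zeros(n)
    for i in 1:n
        for j in 1:p
            out[i] += A[i, j]
        end
    end
    return out
end

rowsums(randn(1000, 50))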

Linux Installation

I have a desktop system with two quad-core Intel i7s and 22 GB of memory. I'm running Ubuntu 12.04 LTS. On Linux, the installation process is relatively simple. Unless you run Ubuntu 13.04, you'll have to build from source. I highly recommend this anyway: Julia is under heavy development, and if you want it to work well you basically need to pull the latest git repository every day. Thus you should clone the git repository and install it that way. The instructions here explain how to clone and build from source. Of course you might not have git, so you'll need to install that first. If you're running Ubuntu, you can do this with sudo apt-get install git. Then just follow the instructions from the Julia git repository. Don't forget to read the part near the bottom instructing you to make sure gcc is up to date.
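
In case the linked instructions move around, the basic sequence on Ubuntu looks roughly like this (the clone URL is the standard JuliaLang GitHub repository; check the repo's README for any extra build dependencies):

sudo apt-get install git                          # if you don't already have git
git clone https://github.com/JuliaLang/julia.git
cd julia
make                                              # builds from source; this takes a while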

You'll now have a command-line version of Julia installed, but it won't work with the usual versions of ESS, including the one you get with apt-get. The easiest thing to do is to purge your current version of ESS (sudo apt-get --purge remove ess) and then clone the Julia fork of the ESS git repository (see the instructions here).

You'll need to update Julia and Julia-ESS frequently. To update a cloned git repository, cd into the folder where you installed it and type git pull. After the pull finishes, you'll need to run make again with the same flags you used when you first built Julia/ESS from source.
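
So a typical update session is just a couple of commands (assuming, for the sake of the example, that you cloned Julia into ~/julia; substitute your own paths and make flags):

cd ~/julia   # or wherever you cloned the repository
git pull     # grab the latest source
make         # rebuild with the same flags you used originally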

Now you'll undoubtedly want to do more than base Julia will allow. The Distributions package is now included in base, but for graphics you'll need to choose one of the packages; see here for some options. I'm a ggplot user in R, so I like Gadfly for Julia, which has a very ggplot-like feel, both in appearance and syntax. Gadfly installed without much of a problem for me on Linux. A bizarre quirk of the package installation utilities is that if a package fails to install for whatever reason, it will throw an error when you try to reinstall it. In this case, you pretty much have to wipe out ~/.julia (rm -rf ~/.julia) and then reinstall your packages. However, once you have the packages installed, you should be able to keep them current using Pkg.update() in Julia.
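
For reference, the package commands mentioned above look like this at the Julia prompt (with Gadfly as the example package):

Pkg.add("Gadfly")   # install a package into ~/.julia
Pkg.update()        # keep installed packages current
using Gadfly        # load it into the session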

Mac Installation

Installing Julia on the Mac was much harder. You might wonder why anyone would do this when a Mac binary exists. The Mac binary never worked for me; it kept dying when I tried to multiply matrices. I think the Mac binary is probably not updated frequently, and as I said before, daily git pulls are fairly essential to keep everything running smoothly. If you don't clone the repository and build from source, you can't do that. So I recommend struggling through the steps outlined here. FYI, I have a 2010-vintage MacBook Pro with 8 GB of RAM and a dual-core Intel Core i7, running Mountain Lion. If your system is different, some of the steps may differ, but this should provide a rough guide.

  1. Install Xcode. You can do this from the Apple App Store: just search for Xcode and install it. Make sure that you choose to install the optional command line tools during the installation process; this is NOT the default option, so read the screens carefully.
  2. Install the Homebrew package manager. This is optional but highly recommended. You can get Homebrew by following the instructions here. After you install, it will recommend you run brew doctor. Follow that advice, and then do what the brew doctor tells you to do, unless you are enough of a Linux hacker to know better.
  3. Go to the Julia source repo and read the specific instructions for Mac OS. Do not clone the git repository yet (you may not have git, and you will install it next). The Julia instructions will tell you that you need a 64-bit Fortran compiler. If you have Homebrew, you should be good to go; just install/update gfortran using Homebrew (brew install gfortran).
  4. Install git. Just follow the instructions here, but if you choose to use the installer rather than the terminal, make sure you select the optional command line tools for git.
  5. Now clone the Julia git repo and make from source (see instructions).
  6. You should now be able to run Julia from the command line. Make sure it's working.
  7. Now get the Julia fork of ESS. Again, clone the git repo here and install from source, following the instructions on the website. You may get an error when you run make; however, ESS seems to be working fine for me despite it, so apparently the make error is not fatal.
  8. Modify your .emacs file according to the instructions provided at the Julia-ESS repo site. If you don't have a .emacs file, you'll need to create one in your home folder (/Users/your-username-here).
  9. Open emacs and verify that you can start julia from ESS (M-x julia). If that's working, you'll now want to install some packages.
  10. Run Pkg.update() to get the latest.
  11. As noted above, I like Gadfly because I'm used to ggplot. Getting Gadfly to work was not that easy. First, install XQuartz from SourceForge. Then, at the julia prompt, type Pkg.add("Gadfly"). It will eventually prompt you about how you want to install; choose installation with Homebrew. It should work. Run some of the examples in the tutorial to check (there's also a quick smoke test right after this list).
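
For that quick smoke test of Gadfly, something like the following should do (a minimal sketch in the spirit of the tutorial examples, not copied from them):

using Gadfly
plot(x = rand(100), y = rand(100), Geom.point)   # a random scatterplot; if it renders, Gadfly is working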

OK, you should now have a solid Julia installation. Happy computing!