I mentioned at the outset that R syntax is a bit quirky, especially if your frame of reference is, well, pretty much any other programming language. Here are some unusual traits of the language you may find useful to understand as you embark on your journey to learn R.

[This story is part of Computerworld's "Beginner's guide to R." To read from the beginning, check out the introduction; there are links on that page to the other pieces in the series.]

Assigning values to variables

In pretty much every other programming language I know, the equals sign assigns a certain value to a variable. You know, x = 3 means that x now holds the value of 3.

Not in R. At least, not necessarily.

In R, the primary assignment operator is <- as in:

x <- 3

But not:

x = 3

To add to the potential confusion, the equals sign actually can be used as an assignment operator in R -- but not all the time. When can you use it and when can you not?

The best way for a beginner to deal with this is to use the preferred assignment operator <- and forget that equals is ever allowed. Hey, if it's good enough for Google's R style guide -- they advise not using equals to assign values to variables -- it's good enough for me.

(If this isn't a good enough explanation for you, however, and you really really want to know the ins and outs of R's 5 -- yes, count 'em, 5 -- assignment options, check out the R manual's Assignment Operators page.)

One more note about variables: R is a case-sensitive language. So, variable x is not the same as X. That applies to pretty much everything in R; for example, the function subset() is not the same as Subset().
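
Here is a minimal sketch of one place where the two operators behave differently, plus the case-sensitivity point. The variable names (middle, middle_again, y) are made up for illustration, and this is not a full tour of R's assignment rules.

```r
# Preferred assignment operator
x <- 3

# Inside a function call, = names an argument; it does not create a variable.
# Here x is only median()'s argument name, so the x defined above is untouched.
middle <- median(x = c(10, 20, 30))

# <- inside a call assigns in the calling environment as a side effect
# (legal, but easy to misread).
middle_again <- median(y <- c(10, 20, 30))

# Names are case sensitive: X and x are two different variables.
X <- 99
```

After this runs, x is still 3, y holds the three numbers, and X is 99.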

c is for combine (or concatenate, and sometimes convert/coerce.)

When you create an array in most programming languages, the syntax goes something like this:

myArray = array(1, 1, 2, 3, 5, 8);

Or:

int myArray[] = {1, 1, 2, 3, 5, 8};

Or maybe:

myArray = [1, 1, 2, 3, 5, 8]

In R, though, there's an extra piece: To put multiple values into a single variable, you need the c() function, such as:

my_vector <- c(1, 1, 2, 3, 5, 8)

If you forget that c, you'll get an error. When you're starting out in R, you'll probably see errors relating to leaving out that c() a lot. (At least I certainly did.)

And now that I've stressed the importance of that c() function, I (reluctantly) will tell you that there's a case when you can leave it out -- if you're referring to consecutive values in a range with a colon between minimum and maximum, like this:

my_vector <- (1:10)

I bring up this exception because I've run into that style quite a bit in R tutorials and texts, and it can be confusing to see the c required for some multiple values but not others. Note that it won't hurt anything to use the c with a colon-separated range, though, even if it's not required, such as:

my_vector <- c(1:10)
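
A quick sketch comparing the forms side by side. The variable names here are hypothetical; the point is only that a colon range is already a vector, while separate values need c().

```r
fib        <- c(1, 1, 2, 3, 5, 8)  # c() combines individual values
one_to_ten <- 1:10                 # the colon operator builds the vector itself
wrapped    <- c(1:10)              # harmless, but c() adds nothing here

same <- identical(one_to_ten, wrapped)  # TRUE

# c() is still needed to join a range with other values
mixed <- c(1:3, 10, 20)
```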

One more very important point about the c() function: It assumes that everything in your vector is of the same data type -- that is, all numbers or all characters. If you create a vector such as:

my_vector <- c(1, 4, "hello", TRUE)

You will not have a vector with two numeric objects, one character object and one logical object. Instead, c() will do what it can to convert them all into the same object type, in this case character objects. So my_vector will contain "1", "4", "hello" and "TRUE". In other words, c() is also for "convert" or "coerce."

To create a collection with multiple object types, you need a list, not a vector. You create a list with the list() function, not c(), such as:

my_list <- list(1, 4, "hello", TRUE)

Now you've got a variable that holds the number 1, the number 4, the character object "hello" and the logical object TRUE.
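
You can check the difference yourself with class(). This is a small sketch using the same values; element_classes and num_and_logical are names I've made up for the example.

```r
my_vector <- c(1, 4, "hello", TRUE)
class(my_vector)   # "character": every element was coerced to a string

my_list <- list(1, 4, "hello", TRUE)
element_classes <- sapply(my_list, class)
element_classes    # "numeric" "numeric" "character" "logical"

# Coercion also happens between numbers and logicals: TRUE becomes 1
num_and_logical <- c(5, TRUE)
```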

Loopless loops

Iterating through a collection of data with loops like "for" and "while" is a cornerstone of many programming languages. That's not the R way, though. While R does have for, while and repeat loops, you'll more likely see operations applied to a data collection using the apply() family of functions or functions from the plyr add-on package.

But first, some basics.

If you've got a vector of numbers such as:

my_vector <- c(7,9,23,5)

and, say, you want to multiply each by 0.01 to turn them into percentages, how would you do that? You don't need a for, foreach or while loop. Instead, you can create a new vector called my_pct_vector like this:

my_pct_vector <- my_vector * 0.01

Performing a mathematical operation on a vector variable will automatically loop through each item in the vector.
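
To make the "automatic loop" concrete, here is a sketch that runs the vectorized version next to the explicit for loop it replaces. The looped variable is hypothetical, added only for the comparison.

```r
my_vector <- c(7, 9, 23, 5)

# Vectorized: one expression, applied to every element
my_pct_vector <- my_vector * 0.01

# The explicit loop that the expression above replaces
looped <- numeric(length(my_vector))
for (i in seq_along(my_vector)) {
  looped[i] <- my_vector[i] * 0.01
}

identical(my_pct_vector, looped)  # TRUE
```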

Typically in data analysis, though, you want to apply functions to subsets of data: Finding the mean salary by job title or the standard deviation of property values by community. The apply() function group and plyr add-on package are designed for that.

There are more than half a dozen functions in the apply family, depending on what type of data object is being acted upon and what sort of data object is returned. "These functions can sometimes be frustratingly difficult to get working exactly as you intended, especially for newcomers to R," says a blog post at Revolution Analytics, which focuses on enterprise-class R.

Plain old apply() runs a function on either every row or every column of a 2-dimensional matrix where all columns are the same data type. For a 2-D matrix, you also need to tell the function whether you're applying by rows or by columns: Add the argument 1 to apply by row or 2 to apply by column. For example:

apply(my_matrix, 1, median)

returns the median of every row in my_matrix and

apply(my_matrix, 2, median)

calculates the median of every column.
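
If you don't have a matrix handy, this sketch builds a small one to try both calls on. The contents of my_matrix are an assumption for illustration.

```r
# A hypothetical 2 x 3 matrix; matrix() fills column by column by default
my_matrix <- matrix(1:6, nrow = 2)
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

row_medians <- apply(my_matrix, 1, median)  # 3 4
col_medians <- apply(my_matrix, 2, median)  # 1.5 3.5 5.5
```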

Other functions in the apply() family such as lapply() or tapply() deal with different input/output data types. Australian statistical bioinformatician Neal F.W. Saunders has a nice brief introduction to apply in R in a blog post if you'd like to find out more and see some examples. (In case you're wondering, bioinformatics involves issues around storing, retrieving and organizing biological data, not just analyzing it.)
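
As a rough sketch of how three of the family members differ, here is tapply() computing a mean salary by job title, followed by lapply() and sapply() on a list. All of the data and variable names are made up for the example.

```r
# Hypothetical data: salaries and the matching job titles
salary <- c(50000, 60000, 70000, 80000)
title  <- c("analyst", "analyst", "engineer", "engineer")

# tapply(): apply a function to a vector, grouped by a second vector
mean_by_title <- tapply(salary, title, mean)  # analyst 55000, engineer 75000

# lapply(): apply a function to each element of a list; always returns a list
scores <- list(a = 1:3, b = 4:6)
totals_list <- lapply(scores, sum)

# sapply(): like lapply(), but simplifies the result to a vector when it can
totals_vec <- sapply(scores, sum)             # a = 6, b = 15
```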

Many R users who dislike the apply functions don't turn to for-loops, but instead install the plyr package created by Hadley Wickham. He uses what he calls the "split-apply-combine" model of dealing with data: Split up a collection of data the way you want to operate on it, apply whatever function you want to each of your data group(s) and then combine them all back together again.
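
The split-apply-combine idea itself can be sketched in base R, without installing anything. To be clear, this is not plyr's own syntax; it only walks through the three steps by hand, using a made-up staff data frame.

```r
# Hypothetical employee data
staff <- data.frame(
  department = c("sales", "it", "sales", "it"),
  salary     = c(40000, 65000, 50000, 75000)
)

# Split: one vector of salaries per department
by_dept <- split(staff$salary, staff$department)

# Apply: run the same function on each piece
dept_means <- lapply(by_dept, mean)

# Combine: collapse the pieces back into one named vector
combined <- unlist(dept_means)   # it = 70000, sales = 45000
```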

The plyr package is probably a step beyond this basic beginner's guide; but if you'd like to find out more about plyr, you can head to Wickham's plyr website. There's also a useful slide presentation on plyr in PDF format from Cosma Shalizi, an associate professor of statistics at Carnegie Mellon University, and Vincent Vu. Another PDF presentation on plyr is from an introduction to R workshop at Iowa State University.

R data types in brief (very brief)

Should you learn about all of R's data types and how they behave right off the bat, as a beginner? If your goal is to be an R ninja then, yes, you've got to know the ins and outs of data types. But my assumption is that you're here to try generating quick plots and stats before diving in to create complex code.

So, to start off with the basics, here's what I'd suggest you keep in mind for now: R has multiple data types. Some of them are especially important when doing basic data work. And some functions that are quite useful for doing your basic data work require your data to be in a particular type and structure.

More specifically, R has the "Is it an integer or character or true/false?" data types, the basic building blocks. There are several of these, including integer, numeric, character and logical. Missing or undefined values are represented by NA (missing or unavailable) or NaN ("not a number," the result of a mathematical operation that has no defined answer, such as 0/0).
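
Here's a short sketch of how NA and NaN show up in practice and how to test for them. The variable names are hypothetical.

```r
undefined <- 0 / 0   # NaN: the arithmetic has no defined answer
missing   <- NA      # NA: the value is simply not available

is.nan(undefined)    # TRUE
is.na(undefined)     # TRUE: is.na() also flags NaN
is.nan(missing)      # FALSE

readings <- c(1, NA, 3)
mean(readings)                 # NA, because one value is unknown
mean(readings, na.rm = TRUE)   # 2, after dropping the missing value
```

Many summary functions accept na.rm = TRUE, which is usually what you want when a few values are missing.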

As mentioned in the prior section, you can have a vector with multiple elements of the same type, such as:

1, 5, 7

or

"Bill", "Bob", "Sue"

A single number or character string is also a vector -- a vector of 1. When you access the value of a variable that's got just one value, such as 73 or "Learn more about R at Computerworld.com," you'll also see this in your console before the value:

[1]

That's telling you that your screen printout is starting at vector item number one. If you've got a vector with lots of values so the printout runs across multiple lines, each line will start with a number in brackets, telling you which vector item number that particular line is starting with. (See the screen shot, below.)
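
A small sketch of the same idea in code. Where the line breaks fall when you print a long vector depends on the width of your console, so I'm only showing what a bracketed number refers to; the variable names are made up.

```r
single <- 73
length(single)      # 1: a lone number is a vector with one element
is.vector(single)   # TRUE

many <- 101:150
length(many)        # 50
many[26]            # 126: the item a line prefixed with [26] would start on
```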

If you want to mix numbers and strings or numbers and TRUE/FALSE types, you need a list. (If you don't create a list, you may be unpleasantly surprised that your variable containing (3, 8, "small") was turned into a vector of characters ("3", "8", "small").)

And by the way, R assumes that 3 is the same class as 3.0 -- numeric (i.e., with a decimal point). If you want the integer 3, you need to signify it as 3L or with the as.integer() function. In a situation where this matters to you, you can check what type of number you've got by using the class() function:

class(3)

class(3.0)

class(3L)

class(as.integer(3))

There are several functions whose names begin with as. for converting one data type to another, including as.character(), as.list() and as.data.frame().
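
A few of these conversions in action, as a sketch. The as_list and as_df names are hypothetical, and the truncation on the fourth line is worth remembering.

```r
as.character(3)      # "3"
as.integer("42")     # 42L
as.numeric("3.14")   # 3.14
as.integer(3.9)      # 3: the decimal part is dropped, not rounded

as_list <- as.list(1:3)                          # a list of three integers
as_df   <- as.data.frame(matrix(1:4, nrow = 2))  # 2 x 2 data frame, columns V1 and V2
```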

R also has special vector and list types that are of special interest when analyzing data, such as matrices and data frames. A matrix has rows and columns; you can find a matrix dimension with dim() such as

dim(my_matrix)

A matrix needs to have all the same data type in every column, such as numbers everywhere.
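
To see both points, here's a sketch that builds a small numeric matrix and then one with mixed values. The matrix contents are invented for the example.

```r
my_matrix <- matrix(1:6, nrow = 2, ncol = 3)
dim(my_matrix)   # 2 3 (rows first, then columns)

# Mixing types in a matrix coerces everything, just as c() does
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
typeof(mixed)    # "character"
mixed[1, 1]      # "1", now a string rather than a number
```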

Data frames are like matrices except one column can have a different data type from another column, and each column must have a name. If you've got data in a format that might work well as a database table (or well-formed spreadsheet table), it will also probably work well as an R data frame.

In a data frame, you can think of each row as similar to a database record and each column like a database field. There are lots of useful functions you can apply to data frames, some of which I've gone over in earlier sections, such as summary() and the psych package's describe().

And speaking of quirks: There are several ways to find an object's underlying data type, but not all of them give the same answer. For example, class() returns data.frame on a data frame object and str() describes it as a 'data.frame', but mode() returns the more generic list.
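
Here's a minimal sketch of a data frame with columns of different types, along with the type checks. The staff data is made up, and I've set stringsAsFactors = FALSE so the text columns stay as characters regardless of your R version.

```r
staff <- data.frame(
  employee   = c("Bill", "Bob", "Sue"),
  department = c("sales", "it", "sales"),
  salary     = c(40000, 65000, 50000),
  stringsAsFactors = FALSE
)

class(staff)   # "data.frame"
mode(staff)    # "list"
str(staff)     # prints a one-line description, then one line per column

column_types <- sapply(staff, class)   # the class of each column
```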

If you'd like to learn more details about data types in R, Roger Peng, associate professor of biostatistics at the Johns Hopkins Bloomberg School of Public Health, explains them in a video lecture.

One more useful concept to wrap up this section -- hang in there, we're almost done: factors. These represent categories in your data. So, if you've got a data frame with employees, their department and their salaries, salaries would be numerical data and employees would be characters (strings in many other languages); but you'd likely want department to be a factor -- in other words, a category you may want to group or model your data by. Factors can be unordered, such as department, or ordered, such as "poor", "fair", "good" and "excellent."
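
A sketch of both kinds of factor follows. The department and rating values are hypothetical; the levels argument is what gives the ordered factor its ranking.

```r
# Unordered factor: categories with no ranking
department <- factor(c("sales", "it", "sales", "hr"))
levels(department)   # "hr" "it" "sales"
table(department)    # how many rows fall in each category

# Ordered factor: levels listed from lowest to highest
rating <- factor(
  c("good", "poor", "excellent", "fair"),
  levels  = c("poor", "fair", "good", "excellent"),
  ordered = TRUE
)
rating[1] > rating[2]   # TRUE: "good" ranks above "poor"
best <- max(rating)     # excellent
```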

R command line differs from the Unix shell

When you start working in the R environment, it looks quite similar to a Unix shell. In fact, some R command-line actions behave as you'd expect if you come from a Unix environment, but others don't.

Want to cycle through your last few commands? The up arrow works in R just as it does in Unix -- keep hitting it to see prior commands.

The list function, ls(), will give you a list, but not of files as in Unix. Rather, it will provide a list of objects in your current R session.

Want to see your current working directory? pwd just throws an error; what you want is getwd().

rm(my_variable) will delete a variable from your current session.
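
Put together, those commands look something like this sketch. The had_it, still_there, where_am_i and files_here names are made up; list.files() is the function to reach for if you actually want a Unix-style file listing.

```r
my_variable <- 42

had_it <- "my_variable" %in% ls()        # TRUE: ls() lists objects, not files
where_am_i <- getwd()                    # current working directory, as a string
files_here <- list.files()               # file names in that directory

rm(my_variable)                          # delete the object from the session
still_there <- "my_variable" %in% ls()   # FALSE
```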

R does include a Unix-like grep() function. For more on using grep in R, see this brief writeup on Regular Expressions with The R Language at regular-expressions.info.
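
Unlike the Unix command, R's grep() searches the elements of a vector rather than the lines of a file. Here's a brief sketch with a made-up fruits vector.

```r
fruits <- c("banana", "apple", "mango")

positions <- grep("an", fruits)                # 1 3: which elements match
matches   <- grep("an", fruits, value = TRUE)  # "banana" "mango"
starts_a  <- grepl("^a", fruits)               # FALSE TRUE FALSE
```

grep() returns positions by default, value = TRUE returns the matching strings, and grepl() returns one TRUE or FALSE per element.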

R doesn't need semicolons to end a line of code (although it's possible to put multiple commands on a single line separated by semicolons, you don't see that very often). Instead, R uses line breaks (i.e., new line characters) to determine when an expression has ended.

What if you want one expression to go across multiple lines? The R interpreter tries to guess if you mean for it to continue to the next line: If you obviously haven't finished a command on one line, it will assume you want to continue instead of throwing an error. Open some parentheses without closing them, use an open quote without a closing one or end a line with an operator like + or - and R will wait to execute your command until it comes across the expected closing character and the command otherwise looks finished.
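
The rules above can be sketched as follows, including the one case that tends to bite people: an operator at the start of the next line instead of the end of the current one. The variable names are hypothetical.

```r
a <- 1; b <- 2          # two statements on one line, separated by a semicolon

total <- a +            # trailing operator: R keeps reading
  b

total_sum <- sum(1, 2,  # unclosed parenthesis: R keeps reading
                 3)

partial <- a            # looks finished, so the expression ends here
+ b                     # evaluated as a separate expression; partial is still 1
```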

Syntax cheating: Run SQL queries in R

If you've got SQL experience and R syntax starts giving you a headache -- especially when you're trying to figure out how to get a subset of data with proper R syntax -- you might start longing for the ability to run a quick SQL SELECT command to query your data set.

You can.

The add-on package sqldf lets you run SQL queries on an R data frame (there are separate packages allowing you to connect R with a local database). Install and load sqldf, and then you can issue commands such as:

sqldf("select * from mtcars where mpg > 20 order by mpg desc")

This will find all rows in the mtcars sample data frame that have an mpg greater than 20, ordered from highest to lowest mpg.

Most R experts will discourage newbies from "cheating" this way: Falling back on SQL makes it less likely you'll power through learning R syntax. However, it's there for you in a pinch -- or as a useful way to double-check whether you're getting back the expected results from an R expression.
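
For that kind of double-checking, here is one way the same query might be written in base R. This is a sketch, not the only way to do it; high_mpg is a name I've made up, and the code doesn't need sqldf installed.

```r
# Base R equivalent of:
#   select * from mtcars where mpg > 20 order by mpg desc

# Keep only the rows where mpg is above 20
high_mpg <- subset(mtcars, mpg > 20)

# Sort those rows from highest to lowest mpg
high_mpg <- high_mpg[order(high_mpg$mpg, decreasing = TRUE), ]

head(high_mpg[, c("mpg", "cyl")], 3)   # peek at the top three rows
```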

Examine and edit data with a GUI

And speaking of cheating, if you don't want to use the command line to examine and edit your data, R has a couple of options. The edit() function brings up an editor where you can look at and edit an R object, such as

edit(mtcars)

Invoking R's data editing window with the edit() function.

This can be useful if you've got a data set with a lot of columns that are wrapping in the small command-line window. However, there's no way to save your work as you go along: edit() returns the changed object only when you close the editing window, and the changes are kept only if you assign that result to a variable. There's also no command-history record of what you've done, so the edit window probably isn't your best choice for editing data in a project where it's important to repeat/reproduce your work.

In RStudio you can also examine a data object (although not edit it) by clicking on it in the workspace tab in the upper right window.

In addition to saving your entire R workspace with the save.image() function and various ways to save plots to image files, you can save individual objects for use in other software. For example, if you've got a data frame just so and would like to share it with colleagues as a tab- or comma-delimited file, say for importing into a spreadsheet, you can use the command:

write.table(myData, "testfile.txt", sep="\t")

This will export all the data from an R object called myData to a tab-separated file called testfile.txt in the current working directory. Changing sep="\t" to sep="," will generate a comma-separated file and so on.
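
Here's a sketch of both export formats, with a read-back to confirm the file is usable. The myData contents are invented, and I've used tempfile() and row.names = FALSE as my own choices so the example doesn't write into your working directory or add a column of row numbers.

```r
# Hypothetical data frame standing in for myData
myData <- data.frame(name = c("Bill", "Sue"), score = c(90, 85))

# Temporary file paths, so nothing lands in your working directory
tab_file <- tempfile(fileext = ".txt")
csv_file <- tempfile(fileext = ".csv")

write.table(myData, tab_file, sep = "\t", row.names = FALSE)  # tab-separated
write.table(myData, csv_file, sep = ",",  row.names = FALSE)  # comma-separated

# Read the comma-separated file back in to check the round trip
round_trip <- read.csv(csv_file)
```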

This article, Beginner's guide to R: Syntax quirks you'll want to know, was originally published at Computerworld.com.

Sharon Machlis is online managing editor at Computerworld. You can follow her on Twitter @sharon000.