In this lecture, I'm going to give a little overview and a very brief history of the R statistical programming environment. So the very first question, and I think the most obvious one, is: what is R? And the answer is actually quite simple. Basically, R is a dialect of S. Okay, so that leads to the next logical question, which is: what is S? S was a language, or is a language, that was developed by John Chambers and others at the now-defunct Bell Labs. It was initiated in 1976 as an internal statistical analysis environment, so an environment that people at Bell Labs could use to analyze data. Initially it was implemented as a series of FORTRAN libraries, to implement statistical routines that were tedious to have to do over and over again. Early versions of the language did not contain functions for statistical modelling. That did not come until roughly version three of the language. In 1988, the system was rewritten in the C language to make it more portable across systems, and it began to resemble the system that we have today. So this was version three. There was a seminal book called Statistical Models in S, written by John Chambers and Trevor Hastie, sometimes referred to as the white book. It documents all the statistical analysis functionality that came into that version of the language. Version four of the S language was released in 1998, and it's the version we more or less use today. The book Programming with Data, which is a reference for this course, was written by John Chambers and is sometimes called the green book. It documents version four of the S language. So, R is an implementation of the S language, which was originally developed at Bell Labs.
So, just a little bit more history here. In 1993, Bell Labs gave a corporation called StatSci, which later became Insightful Corporation, an exclusive license to develop and sell the S language. In 2004, Insightful purchased the S language completely from Lucent (Bell Labs had by then become part of Lucent Technologies) for $2 million, and became the current owner. In 2006, Alcatel purchased Lucent Technologies, and it's now called Alcatel-Lucent. Insightful developed a product which was an implementation of the S language, under the product name S-PLUS. They built a number of fancy features into it, for example graphical user interfaces and all kinds of nice tools, so that's where the plus comes from in S-PLUS. In 2008, Insightful Corporation was acquired by a company called TIBCO for $25 million, and that's more or less where it stands. TIBCO still develops S-PLUS, although as part of a variety of business-analytics products, and it continues to this day. So you can see the history of the language is a little bit tortured because of the various corporate acquisitions, but it still survives to this day. The basic fundamentals of the S language have not really changed since 1998, and the language that existed in 1998 looks more or less like what we use today, at least superficially. And it's worth noting that in 1998 the S language won the Association for Computing Machinery's Software System Award, a very prestigious honor. In a document called Stages in the Evolution of S, John Chambers, who was the original creator of the S language, laid out his key principle in designing the S language. And I think it's very important to see this, which is that basically they wanted to create an interactive environment where users did not have to think of themselves as programming.
He goes on to say that, as users' needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important. So the basic idea behind the S language, and then later the R language, is that people would enter the language in an interactive environment, where they could use the environment without knowing anything about programming or having to know very detailed aspects of the language. They could use the environment to look at data and do basic analyses. And then when they outgrew that environment, they could get into programming. They could get into learning the language aspects and learning to develop their own tools, and the system would promote that transition from user to programmer. So that was the basic philosophy of the S language. That's enough about S; let's go back to R. So what is R about? Basically, R is a relatively recent development. In 1991, it was created in New Zealand by two gentlemen named Ross Ihaka and Robert Gentleman. They talked about their experience developing R in a paper published in 1996 in the Journal of Computational and Graphical Statistics. In 1993, the first announcement of R was made to the public. In 1995, Martin Mächler convinced Ross and Robert to license R under the GNU General Public License, and that made R what we call free software. We'll talk a little bit more about that in a second. In 1996, public mailing lists were created, and there are two main ones. One is called R-help, which is a general mailing list for questions, and the other is R-devel, which is a more specific mailing list for people who are doing development work in R. In 1997, what's called the R Core Group was formed, and this contained a lot of the same people who had worked on S-PLUS.
The core group basically controls the source code for R. So the primary source code for R can only be modified by members of the R Core Group. However, a number of people who are not in the core group have suggested changes to R, and those changes have been accepted by the core group. So, some of the features of R. The first one, which was important back in the old days when people were still using S-PLUS, is that the syntax is very similar to S, which made it easy for S-PLUS users to switch over. This feature isn't quite so relevant today, when most people go to R directly. The semantics are superficially similar to S, in that R looks like S, but in reality they are quite different, and we'll talk more about this in a future lecture. One of the main benefits of R is that it runs on any standard computing platform or operating system: Mac, Windows, Linux, whatever you want, even on your PlayStation 3. There are frequent releases, so there are annual major releases, and often there are bug-fix releases in between. There is very active development going on, so things are happening. The core software of R is actually quite lean. Its functionality is divided into modular packages, so you don't have to download and install a massive piece of software. Instead, you can download a very small core of fundamental functions and then add things on as you need them. Its graphics capabilities are very sophisticated and give the user a lot of control over how graphics are created, and in my opinion they are better than those of most stat packages. It might even be the best for a general-purpose statistical package. It's very useful for interactive work, as I said before, but it also contains a powerful programming language for developing new tools, so it eases the transition from user to programmer. And fundamental, actually, for a language like this is that there is a very active and vibrant user community.
So the R-help and R-devel mailing lists are very active. There are many posts per day, and there's also Stack Overflow, where questions can be asked and answered. So the user community is one of the most interesting aspects of R. It's where all the R packages come from, and it creates a lot of interesting features. Of course, probably the most critical feature of R is that it's free, both in the sense of free beer and in the sense of free speech. What I mean by that is that it doesn't cost any money, so you can download the entire software from the web. And it's also free software, so I'm going to divert for a second to talk a little bit about free software. With free software there are four basic principles, four basic freedoms that you have. Freedom zero is the freedom to run the program for any purpose. So there are no restrictions on how you can run the program, or when you can run the program, or what you can or cannot do with it. Freedom one is the freedom to study how the program works and adapt it to your needs. This happens almost every day: you can look at the source code for R itself. You can make changes to it if you want. You may improve it or make a better version of it. You can sell changes to it if you want. You can modify the program any way you want and adapt it to your needs. Of course, access to the source code is a precondition for freedom one. Freedom two is the freedom to redistribute copies so you can help your neighbor, and the idea is that you can give copies to other people. You can sell copies. You can do whatever you want with it. Lastly, freedom three is the freedom to improve the program and release your improvements to the public, so that the whole community benefits. The idea is that when people make changes to the program, they can release them to the public so that everyone gets those changes.
These basic freedoms are outlined by the Free Software Foundation, and you can see more about them at their website. So, there are a couple of drawbacks of R. I won't go through all of them, and other people probably have many other complaints. But there are some basic drawbacks. One is that it's essentially based on 40-year-old technology. The original S language developed in the 70s was based on a couple of principles, and the basic ideas have not changed too much since then. One of the results of that, for example, is that there is little built-in support for dynamic or 3D graphics. But things have improved greatly on that front since the old days, and there are now a lot of interesting tools and packages for doing dynamic or 3D graphics. Another drawback of R that I hear a lot about is that the functionality is based on consumer demand and, basically, user contributions. So if no one feels like implementing your favorite method, then that's your job to do. There is no corporation, no company that you can complain to. There's no helpline that you can call to demand a specific implementation or a specific feature. If the feature's not there, then you have to build it, or at least pay someone to build it. Another drawback, which is a little bit more technical, is that the objects that you manipulate in R have to be stored in the physical memory of the computer. So if an object is bigger than the physical memory of the computer, then you can't load it into memory, and therefore you can't do anything in R with that object. There have been a lot of advancements to deal with this too, both in the R language and on the hardware side, where there are now computers you can buy with tremendous amounts of memory. So some of those problems have been resolved just by improvements in technology.
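To make the memory constraint concrete, here is a rough sketch in R of the kind of back-of-the-envelope check you might do before loading a data set. It assumes every value is stored as an 8-byte double and ignores R's per-object overhead; the helper name `estimate_numeric_gb` and the 1.5 million by 120 table are made up purely for illustration.

```r
# Rough estimate of the memory needed to hold an all-numeric table.
# Assumption: each value is a double (8 bytes); object overhead is ignored.
# `estimate_numeric_gb` is a hypothetical helper, not part of R.
estimate_numeric_gb <- function(n_rows, n_cols) {
  n_rows * n_cols * 8 / 2^30
}

# Hypothetical data set: 1.5 million rows by 120 numeric columns.
needed_gb <- estimate_numeric_gb(1500000, 120)
print(round(needed_gb, 2))

# For an object that is already in memory, object.size() reports how much
# space it actually occupies.
x <- numeric(1e6)
print(object.size(x), units = "Mb")
```

If the estimate is close to, or larger than, the memory installed on the machine, the object is unlikely to load comfortably.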
Nevertheless, as we enter the big data era, where you have larger and larger data sets, the model of loading objects into physical memory can be a limitation. And finally, I'll just say that R is not ideal for all possible situations. Many people have high expectations for R, which I think in many ways is a good thing. They expect it to be able to do everything. But it doesn't do everything, and you should go into this knowing that fact. So the basic R system is divided into what you can think of as two conceptual parts. There is the base R system that you download from CRAN, which is the Comprehensive R Archive Network and is the go-to place for all things R. And then there's everything else. The base system contains what's called the base package, which has all the low-level fundamental functions that you need to run the R system. Then there are other packages contained in the base system, which include for example utils, stats, datasets, graphics, and a bunch of other fundamental packages that more or less everyone might use. And then there is a series of recommended packages: boot for bootstrapping, class for classification, cluster, codetools, foreign, and a variety of others. These are commonly used packages. They may not be critical packages, but they're used by many people. All of these packages come with the base R system that you download from CRAN. But there's obviously much more than this. On CRAN right now there are about 4,000 packages that have been developed by users and programmers all around the world. These packages are user contributed. They're not controlled by R Core, and they are uploaded to CRAN on a regular basis. CRAN has a number of restrictions and standards that have to be met in order to get a package onto CRAN.
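As a small illustration of the split just described between base, recommended, and contributed packages, here is a sketch you could run in an R session. It relies on `installed.packages()` from the utils package and its `priority` argument; the package name `somepackage` in the commented lines is a placeholder, not a real package, and those lines are left commented out because they need a network connection.

```r
# Packages that ship with the R distribution, grouped by priority.
base_pkgs <- rownames(installed.packages(priority = "base"))
recommended_pkgs <- rownames(installed.packages(priority = "recommended"))

print(base_pkgs)          # low-level packages such as base, utils, stats
print(recommended_pkgs)   # may be empty on a minimal installation

# Contributed packages live on CRAN and are added on demand.
# "somepackage" is a placeholder name used only for illustration.
# nrow(available.packages())        # how many packages CRAN offers right now
# install.packages("somepackage")   # download and install from CRAN
# library(somepackage)              # load it into the current session
```

The exact lists depend on your R version and on how R was installed, so treat the output as a snapshot of your own system rather than a definitive inventory.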
So, one of the nice things about CRAN is that the packages that you download have to meet a certain level of quality. For example, there has to be documentation for all the functions that are in the package, and the package has to pass a certain number of tests. So CRAN has a lot of different packages written by users, and the number is increasing every day. It's very exciting to see all these packages on CRAN and to see new ones come up every day. There are also packages associated with the Bioconductor project, which is a project designed to implement R software for genomic and biological data analysis. And, of course, there are also other packages that people make available on their personal websites, and there's really no reliable way to keep track of how many packages are available in this fashion. So there are really thousands of packages out there, written by people, that you can discover and use to analyze data. There are a couple of documents that you can find on the R website, and as you're learning to use R, you may want to flip through some of these. One is An Introduction to R, which is a relatively long PDF document that goes through the basics of how to use R and how to use the language. There's the Writing R Extensions manual, which is really only useful to read if you're thinking of developing R packages, which are these R extensions to the main system. There's the R Data Import/Export manual, which is useful for getting data into R in various different ways. The R Installation and Administration manual is most useful if you want to build R from the source code, and I'll talk about that in another video. And then there's the R Internals manual, which is a really technical document about how R is designed and how R is implemented at a very low level. It's not really for the faint of heart.
But if you're the kind of person who wants to know how R works at a very, very low level, this is the document for you. So, I'm just going to end here with a couple of texts that are standard or classic texts in this area. Of course, there are the books by John Chambers, Software for Data Analysis and Programming with Data, both published by Springer. Then there are two books by Bill Venables and Brian Ripley. One is called Modern Applied Statistics with S, and the other is called S Programming. Although they talk about S in the title, both of these books are very relevant for R programming too. There's a book by Pinheiro and Bates, Mixed-Effects Models in S and S-PLUS, which is also quite useful for R programmers. And finally, Paul Murrell, who designed the R graphics system, has written a book called R Graphics, which is currently in its second edition. A couple of other resources. One is that the publisher Springer has a series of books called Use R!, which is a set of relatively short books on how to use R for different topics and different application areas. This is quite a nice series of books that you may be interested in, and there may be a book written for your particular area of application. There's also a longer list of books on the R website. So, that was a brief overview of R and the history of how it came to be. Starting with the next video, I'll start talking about the details of the R programming language and how we can use it to analyze data.