0:00
This lecture is about reading XML.
So if you don't know about XML, it's
Extensible Markup Language, is what it stands for.
So it's frequently used to store structured data.
It's particularly widely used in internet applications.
So you'll see it a lot when you're doing
things like web scraping or trying to get data from
an internet API or trying to download data from, say,
open data websites, like an open government website.
So extracting HTML and XML is actually the basis
for most of the web scraping that you'll see.
There are two components to an XML file.
There's the markup.
That's the labels that give the text structure.
So you can imagine if you just started typing, you
would end up with sort of an unstructured text file.
The markup is the way that you add labels so that the file ends up being structured.
And then the content is the actual text that you type in,
in between sort of the labels that give the structure to the text.
So to be a little bit more concrete, we'll talk about tags, elements and attributes.
0:58
So tags correspond to the labels that are going to
be applied to particular parts of the text so that it will be structured.
So, there are start tags.
And so, start tags look like this: <section>.
So, they start with an open angle bracket on one side and then they
have a phrase and then a closed angle bracket on the other side.
So, you'll have that at the beginning of a certain part of the text.
So, that will start that text as a labelled section called section.
And then it ends with an end tag, which looks like this: </section>.
It has the same phrase as before, but it
starts with a forward slash inside of the two angle brackets.
1:37
And so the empty tags are tags that
correspond to lines where you don't necessarily need
both an open and a close tag to delimit a certain part of the text.
You might just want to denote one specific line, say, as a line break.
And so what you do is, you have the same sort of structure,
only you have the forward slash over at the right end, just before the closing bracket.
So elements are specific examples of tags.
So, for example, one element of the XML
markup could be the start tag greeting, with the end
tag greeting over on the right-hand side,
and in between them the content, hello world.
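As a rough illustration of the start tag, end tag, empty tag and element just
described, here is a small sketch in R using the XML package that this lecture
introduces later. The markup string is made up for illustration and is an
assumption, not the text from the slide.

```r
library(XML)

# Hypothetical markup: <section> is a start tag, </section> is the matching
# end tag, <line-break /> is an empty tag, and <Greeting>...</Greeting> is
# an element whose content is the text "Hello, world".
txt <- "<section><Greeting>Hello, world</Greeting><line-break /></section>"

# asText = TRUE says that txt holds the markup itself, not a file name.
doc  <- xmlParse(txt, asText = TRUE)
root <- xmlRoot(doc)

xmlName(root)                   # name of the outermost tag
xmlValue(root[[1]])             # content of the Greeting element
xmlName(root[[2]])              # name of the empty tag
length(xmlChildren(root[[2]]))  # the empty tag has no children
```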
2:13
So, attributes are components of the label.
So, you can actually add components to the tags.
And so, for example, you might have an image tag.
So, what you might want to have corresponding to
that image is an attribute that's the source.
That's where the image is actually coming from.
And then an alternate text for that image, and that could be instructor, for example.
2:36
So another example here is, suppose you have a tag that's
a step tag, and you might have multiple steps. So step
number one might be tagged with number equal to 1,
and then two, and then this one is step number three.
And then it needs to end with an end tag, just like it did before.
If you want to know a lot more about XML, and
it's probably a good idea before doing a whole bunch
more with this lecture, you could go and see the
Wikipedia entry on XML, which is actually quite comprehensive and nice.
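To make the attribute idea concrete, here is a minimal sketch in R with the
XML package. The wrapper tag, the file name "jeff.jpg" and the step text are
hypothetical values chosen for illustration; only the attribute names (source,
alternate text, number) come from the description above.

```r
library(XML)

# Hypothetical markup: an image tag with src and alt attributes, and a
# step tag with a number attribute and some content.
txt <- paste0(
  '<doc>',
  '<img src="jeff.jpg" alt="instructor"/>',
  '<step number="3">Connect A to B.</step>',
  '</doc>')
root <- xmlRoot(xmlParse(txt, asText = TRUE))

img  <- root[[1]]
step <- root[[2]]

xmlGetAttr(img, "src")  # where the image comes from
xmlGetAttr(img, "alt")  # the alternate text
xmlAttrs(step)          # all attributes of the step tag, as a named vector
xmlValue(step)          # the content between the start and end tags
```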
3:04
So this is an example XML file.
It's a little bit hard to see, but you can see it okay if
you go actually to the web link here at the bottom of the page.
And so the idea here is you have a text file,
and that text file has lots of tags and content in it.
So for example here you have a tag that says food, and then
here is the close tag for that food tag down here.
So you can see they're indented at the same level within this file.
And then for this particular food element you actually have a name, so there's
the open tag for name and there's the close tag for name, and this particular
element is named Belgian Waffle. Delicious.
4:03
So you can read the file into R. You can do this
with the XML package, so we're loading the XML package, all caps XML.
And then you give it the URL.
So in this case, we're going to give it the URL from
the previous XML file we were looking at on that previous screen.
And then we can use the function xmlTreeParse to parse out that XML file.
So what this does is
it loads the document into R memory in a way
that you can then parse it and get access to different parts of it.
So, within R, it's still a structured object, so we have to
be able to use different functions to access different parts of that object.
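The loading step might look roughly like the sketch below. To keep it
self-contained it parses a short inline string instead of the URL from the
slide; the root tag name, item names and prices are stand-ins chosen for
illustration, and `menu` is a hypothetical variable name.

```r
library(XML)

# Stand-in for the menu file; names and prices are illustrative only.
menu <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>")

# With a real file you would pass its URL or path as the first argument and
# drop asText. useInternalNodes = TRUE keeps the parsed tree as internal
# nodes, which the XPath functions in the XML package require.
doc <- xmlTreeParse(menu, asText = TRUE, useInternalNodes = TRUE)

class(doc)  # the structured object that the accessor functions work on
```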
So, the first thing that we want to look at is the root node.
So, the root node you can get with xmlRoot.
And so what that is, is it's sort of the wrapper,
you can think of it as the wrapper for the entire document.
And so if you execute this xmlRoot command, you'll
have access to that particular element of that XML file.
And then if you want to get the name out, you can actually use xmlName
applied to that node to get the name out. In this case, it's the breakfast menu.
5:12
You can also look at the names of the root node.
And so, when you do names of the root node what it's actually
telling you is what all the nested elements within that root node are.
So, the root node wraps the whole document
and the whole document is a breakfast menu.
And then there are five different breakfast items on this
menu, and each one is wrapped within a food element.
So, you get five food elements, if you look at names of the root node.
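A small sketch of these root node calls, assuming a two-item stand-in for the
menu (the real file has five items; the tag name `breakfast_menu` and the item
data here are assumptions for illustration):

```r
library(XML)

# Stand-in menu document with two food elements instead of five.
menu <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>")
doc <- xmlTreeParse(menu, asText = TRUE, useInternalNodes = TRUE)

rootNode <- xmlRoot(doc)  # the wrapper element for the whole document
xmlName(rootNode)         # the name of that wrapper element
names(rootNode)           # one entry per nested element, each named "food"
```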
So,
5:39
the next thing that you could do is
actually directly access parts of the XML document.
And you can do it a little bit in the same way that you access a list in R.
And so, for example,
you have this root node element that we've extracted.
And we can look at the first component of that, or the
first element of that, and we do it by using double brackets here.
That's how you access the first element, say, in a list.
And so, what you get out, actually, is the first food element, like this.
So it contains the food tags, and it also
contains all the information about the first food element.
Then, if you want to keep drilling down into smaller and smaller parts of
the XML document, you can do subsetting, just like you do with lists.
So first, subsetting rootNode with the first bracket one will again give you this.
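The list-style subsetting described above can be sketched as follows, again
on a hypothetical two-item stand-in for the menu document:

```r
library(XML)

# Stand-in menu document; names and prices are illustrative only.
menu <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>")
rootNode <- xmlRoot(xmlTreeParse(menu, asText = TRUE, useInternalNodes = TRUE))

firstFood <- rootNode[[1]]      # the first food element, tags and all
xmlName(firstFood)

firstName <- rootNode[[1]][[1]] # drill down: first child of the first food
xmlName(firstName)
xmlValue(firstName)             # the text inside that name element
```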
6:39
You can actually programmatically extract different parts of the file.
So you can do that with the xmlSApply command,
and so what you do is you pass that a
parsed XML object and then you tell it what function
you'd like to apply; so in this case xmlValue.
So what that's going to do is, it's going to
loop through all of the elements of the
root node and get the xmlValue. And
by default it's going to do this recursively.
So, if you apply it in this way,
since rootNode contains the entire document, it's going to
go through and get every single value of
every single tagged element in the entire document.
And so you just get a bunch of text all
strung together that's all the text that was in that document.
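Here is a minimal sketch of that call on a hypothetical stand-in document.
In this sketch xmlSApply visits each child of the root node, and xmlValue
collects the text of everything nested inside that child, so the result is
one run-together string per food element.

```r
library(XML)

# Stand-in menu document; names and prices are illustrative only.
menu <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>")
rootNode <- xmlRoot(xmlTreeParse(menu, asText = TRUE, useInternalNodes = TRUE))

# Apply xmlValue to every child of the root node.
vals <- xmlSApply(rootNode, xmlValue)
vals  # a character vector with the name and price text strung together
```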
7:52
So the first thing that you're going to be looking at is the top
level node of each element, and that you get with /node, a single forward slash.
The node at any level is //node, with a double slash.
And then you can actually extract specific nodes with specific attributes.
And we'll talk about how to do that in the next couple of slides.
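As a rough sketch of these three kinds of XPath expression, the example below
runs each one against a small made-up document. The `type` attribute and its
values are hypothetical and exist only to show attribute matching.

```r
library(XML)

# Stand-in document; the type attribute is invented for this example.
menu <- paste0(
  "<breakfast_menu>",
  "<food type='sweet'><name>Belgian Waffles</name></food>",
  "<food type='savory'><name>Homestyle Breakfast</name></food>",
  "</breakfast_menu>")
doc <- xmlParse(menu, asText = TRUE)

# /node: the top-level node
top <- getNodeSet(doc, "/breakfast_menu")
length(top)

# //node: matching nodes at any level of the document
allNames <- xpathSApply(doc, "//name", xmlValue)

# node[@attr='value']: only nodes whose attribute has a given value
sweet <- xpathSApply(doc, "//food[@type='sweet']/name", xmlValue)
```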
[SOUND] So the first thing that you're going to want to be able to do is
get the items on the menu and their prices and so the way that
we're going to do that is we're again
going to use this xpathSApply but now we're
going to be a little bit more targeted
about which data we're going to be extracting.
So we're giving it, again, the rootNode.
That's the entire document, and so then what we're going to be doing
is we're going to be using this XPath expression here, which is //name.
And so, what that's going to do is it's
going to go through and get all of the nodes
that correspond to an element with the title name, and
then it's going to get the xmlValue of those nodes.
So, what it's going to do is it's going to take
out basically all of the elements of that XML file that
are tagged with name and so you can see, you end
up with the names of all of the items on the menu.
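A minimal sketch of pulling out the item names, and the prices in the same
way, assuming a stand-in menu whose elements are tagged `name` and `price`
(the values are illustrative, not the real menu):

```r
library(XML)

# Stand-in menu document; names and prices are illustrative only.
menu <- paste0(
  "<breakfast_menu>",
  "<food><name>Belgian Waffles</name><price>$5.95</price></food>",
  "<food><name>French Toast</name><price>$4.50</price></food>",
  "</breakfast_menu>")
rootNode <- xmlRoot(xmlTreeParse(menu, asText = TRUE, useInternalNodes = TRUE))

# Every node tagged name, at any level, reduced to its text.
items  <- xpathSApply(rootNode, "//name", xmlValue)
# Same idea for every node tagged price.
prices <- xpathSApply(rootNode, "//price", xmlValue)

# Line the two vectors up side by side.
menuTable <- data.frame(item = items, price = prices)
menuTable
```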
9:22
To give you an example that's a
little more complicated, I'm actually going to use
a website of the Baltimore Ravens, which is an
American football team that is based here in Baltimore.
So, this is the homepage of that team on ESPN,
which is a sports channel in the U.S., and so
what we're going to do is actually look at the source code.
So, if you right click on the page and say view
source, what you'll see is a source document that looks like this.
It's actually quite complicated.
That's the source code that's actually
processed to show you the
website that you see when you navigate your browser to that website.
So we're going to actually drill into this website
source code and see if we can extract some information.
So again I'm going to pass the file URL.
This is the URL of that website, if you go back to the previous page.
So now, since I'm parsing an HTML file instead of an XML
file, I'm going to use htmlTreeParse instead of xmlTreeParse.
The difference is, it's different enough that you want to
use htmlTreeParse when you're parsing an HTML file.
And then I'm going to pass the argument useInternal equals TRUE, so
that I can get all the different nodes inside of that file.
So now what I'm going to do is, I'm again going
to use xpathSApply to programmatically extract some components of this document.
So I'm going to start with the whole document again.
And now what I'm going to try to do is, I'm going to
again extract the xmlValue, the value inside of certain elements.
But I want to find very specific kinds of elements.
So here's what I'm going to do.
I'm going to look for elements
that are list items, li, and that have a particular class.
So they have class equal to score.
And so, what this is going to do is it's
going to go through the entire document, and any time it
sees a tag that is a list item,
it's going to check and see if its class is score.
And if its class is score, it'll return the xmlValue.
So it turns out if you go back to this website and you look very, very
carefully, you can see for example there are these list items, and
in this case the class is equal to team name, and so the
next element that I'm going to be extracting
on this page is actually the team name.
So it's the same thing.
I look for a list item with class equal
to team name and it will extract those team names.
So the way this works is you go to the website and
you find the tag and any attributes that you want to extract,
and then you need to write them into this XPath language to
extract only the data for those specific elements with xpathSApply.
So, if I do this, I end up with the scores for each of the games and the teams.
So, I've scraped information directly from that website.
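The scraping steps can be sketched as below. To stay self-contained, this
parses a tiny made-up HTML string rather than the live ESPN page, whose markup
may well differ. The class values `score` and `team-name`, the team labels and
the scores are all assumptions for illustration.

```r
library(XML)

# Hypothetical HTML standing in for the page source.
html <- paste0(
  "<html><body><ul>",
  "<li class='team-name'>Team A</li><li class='score'>24-17</li>",
  "<li class='team-name'>Team B</li><li class='score'>10-13</li>",
  "<li class='other'>not a score</li>",
  "</ul></body></html>")

# htmlTreeParse is the HTML counterpart of xmlTreeParse. For a real page,
# pass the URL as the first argument and drop asText.
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# List items whose class attribute is exactly "score".
scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)
# List items whose class attribute is exactly "team-name".
teams  <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
```

The list item with a different class is skipped by both expressions, which is
the point of filtering on the attribute.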
12:09
This is a very brief introduction to
the XML package in R and XML in general.
If you go and check out the XML tutorials for the
XML package, there's a short one that's really good and a long
one that's extremely extensive and should only be read after reading the
short one, so you get an idea of what is going on.
And then there's a really outstanding guide to the XML
package that's linked to here as well that will
give you a lot of information about how you
can use XML to programmatically extract information from websites.