agaskar.com

BeautifulStoneSoup: Getting XML tag properties

BeautifulSoup is pretty easy to use, but I find the documentation confusing at times -- some of the simpler applications aren't covered in sufficient detail, or at all. One thing that's missing is a clear example of fetching XML tag attributes, a pretty common task. Getting a tag attribute is covered in the documentation here:


The attributes of Tags

Tag and NavigableString objects have lots of useful members, most of which are covered in Navigating the Parse Tree and Searching the Parse Tree. However, there's one aspect of Tag objects we'll cover here: the attributes.

SGML tags have attributes: for instance, each of the <P> tags in the example HTML above has an "id" attribute and an "align" attribute. You can access a tag's attributes by treating the Tag object as though it were a dictionary...
[goes on to show an HTML example with ID attributes]

Although it's not made explicit, the same sort of syntax can be used to grab attributes from XML tags. This means that getting an XML attribute from BeautifulStoneSoup is as easy as soup.tag['attributename']. The one gotcha is that BeautifulStoneSoup lowercases tag names and attribute names (attribute values are left alone) -- if the actual XML looks like <tag AttributeName="foo">, then soup.tag['AttributeName'] raises a KeyError, and you have to ask for soup.tag['attributename'] instead. This part isn't so well-documented.
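To see the lowercasing in isolation, here is a minimal sketch. BeautifulStoneSoup (BeautifulSoup 3) is built on Python 2's sgmllib, which no longer exists in Python 3, so this uses the standard library's html.parser instead -- a different parser that, as far as I know, normalizes names the same way. The StartTagCollector class is just a made-up helper for illustration, not part of any library.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Hypothetical helper: records each start tag with its attributes as a dict."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names already lowercased
        self.tags.append((tag, dict(attrs)))


collector = StartTagCollector()
collector.feed('<tag AttributeName="Foo"></tag>')
name, attributes = collector.tags[0]

print(name)                            # tag
print(attributes["attributename"])     # Foo (the value keeps its case)
print("AttributeName" in attributes)   # False: the mixed-case key is gone
```

Looking up attributes["AttributeName"] here would raise a KeyError, which is the same trap you fall into with soup.tag['AttributeName'].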

Let's look at a brief real-world example.

Say you get this XML file back from the Google HTTP geocoder:*

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Response>
<name>1510 Polk Street @ Sacramento, San Francisco, CA 94109</name>
<Status><code>200</code><request>geocode</request></Status>
<Placemark id="p1">
<address>San Francisco Blvd, Sacramento, CA 95820, USA</address>
<AddressDetails Accuracy="6" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0">
<Country><CountryNameCode>US</CountryNameCode>
<AdministrativeArea><AdministrativeAreaName>CA</AdministrativeAreaName>
<Locality><LocalityName>Sacramento</LocalityName>
<Thoroughfare><ThoroughfareName>San Francisco Blvd</ThoroughfareName></Thoroughfare>
<PostalCode><PostalCodeNumber>95820</PostalCodeNumber></PostalCode></Locality></AdministrativeArea></Country></AddressDetails>
<Point><coordinates>-121.443214,38.537360,0</coordinates></Point>
</Placemark>
</Response>
</kml>

You want the value of the Accuracy attribute, which tells you whether Google was able to return a street address. It lives on the <AddressDetails> tag. Here's how you would get it:

from BeautifulSoup import BeautifulStoneSoup
import urllib2

geocodeResults = {}
# YOUR_GOOGLE_GEOCODE_QUERY is a placeholder for your own query string
geocodeData = urllib2.urlopen('http://maps.google.com/maps/geo?' + YOUR_GOOGLE_GEOCODE_QUERY)
geocodeXML = BeautifulStoneSoup(geocodeData)
# tag names are lowercased too: <Response> is .response, <Status> is .status
if geocodeXML.kml.response.status.code.string == '200':
    # note the casing below! addressdetails['Accuracy'] will raise a KeyError
    geocodeResults['accuracy'] = geocodeXML.kml.response.placemark.addressdetails['accuracy']
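For comparison, here is a rough sketch of the same lookup using Python 3's built-in xml.etree.ElementTree rather than BeautifulStoneSoup. This is a swapped-in technique, not what I used above. It is a strict XML parser, so names keep their original case, but you have to spell out the namespaces yourself. The SAMPLE bytes are a trimmed copy of the response above, and the "kml" and "xal" prefixes are labels I picked for this example.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the geocoder response shown above (assumption: the
# elements left out do not matter for this lookup).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://earth.google.com/kml/2.0">
<Response>
<Status><code>200</code><request>geocode</request></Status>
<Placemark id="p1">
<AddressDetails Accuracy="6" xmlns="urn:oasis:names:tc:ciq:xsdschema:xAL:2.0">
</AddressDetails>
</Placemark>
</Response>
</kml>"""

# Prefixes are arbitrary local labels; the URIs come from the document itself.
NS = {
    "kml": "http://earth.google.com/kml/2.0",
    "xal": "urn:oasis:names:tc:ciq:xsdschema:xAL:2.0",
}

root = ET.fromstring(SAMPLE)  # root is the <kml> element
status = root.findtext("kml:Response/kml:Status/kml:code", namespaces=NS)
details = root.find("kml:Response/kml:Placemark/xal:AddressDetails", NS)

results = {}
if status == "200" and details is not None:
    # Case is preserved here, so the key is 'Accuracy', not 'accuracy'
    results["accuracy"] = details.get("Accuracy")

print(results)                   # {'accuracy': '6'}
print(details.get("accuracy"))   # None: the lowercase name does not exist
```

So the casing rule flips depending on the parser: BeautifulStoneSoup wants the lowercase name, ElementTree wants the name exactly as written in the file.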

* yeah; you're right -- I *should* use JSON -- but I'm already loading BeautifulSoup to do some web scraping, so I figured that it wouldn't hurt to use XML.