Beginning Python 2: bs4

Nov 26, 2015


We learned how to fetch an HTML document in the previous post. Now we need a tool to parse that document.

Beautiful Soup is a third-party Python module for parsing web pages. Download beautifulsoup4 first, then unzip it and run python setup.py install from the command line (or install it with pip).
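
A quick sanity check that the install worked (this just prints whichever bs4 release you installed):

# verify the installation
import bs4
print bs4.__version__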

Here is some basic usage.


Create an instance

Import BS4 first.

from bs4 import BeautifulSoup

For example, given the following HTML, we create a BeautifulSoup instance:

html = '''
<html>
<head>
	<title>test</title>
</head>
<body>
	<img src="http://xxx.com/xxxx.jpg" alt="blablabla" />
	<p>Halo, 3Q</p>
	<div id="t">
    	<!--comment here-->
		<p>R U O K</p>
	</div>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')  # name a parser explicitly, or bs4 picks one and may warn

Of course, we can also build a soup from a local HTML file.

soup = BeautifulSoup(open('/var/www/index.html'), 'html.parser')
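
Beautiful Soup supports several parsers. html.parser ships with the standard library; lxml is a faster third-party alternative if you have it installed:

soup = BeautifulSoup(html, 'lxml')  # requires the lxml package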

Get the specified tag

formatted output

print soup.prettify()

get img tag

print soup.img

get the attributes of img

print soup.img.attrs

get the name of the tag

print soup.img.name
# output: img

get the specified attribute

print soup.img.get('src')
# or
print soup.img['src']
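
The two forms differ on a missing attribute: .get() behaves like dict.get and returns None, while the subscript form raises KeyError.

print soup.img.get('title')  # None -- no such attribute
print soup.img['alt']        # blablabla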

get the string inside the tag. Be careful: .string returns a comment's text too, stripped of the comment markers, so comments are not ignored automatically.

print soup.p.string

Attention! print soup.div.string will print None, because the div has more than one child and .string cannot tell which one to return. And since a comment comes back looking like ordinary text, we should check the type first.

import bs4

if isinstance(soup.div.string, bs4.element.Comment):
	pass  # it's a comment, not real text

get the second node of body's contents (note: .contents includes whitespace text nodes, so index 1 is actually the first tag here)

print soup.body.contents[1]  # the img tag; contents[0] is a newline

list all the children of body

list(soup.body.children)  # .children is a generator; .contents is a list
for child in soup.body.children:
	print child

get all strings in the document

print soup.strings           # a generator over every string
print soup.stripped_strings  # the same, with whitespace stripped
for string in soup.strings:
	print repr(string)

Node operations

get the parent node

p = soup.p
print p.parent.name  # body
for parent in p.parents:
	print parent.name  # body, html, [document]

get sibling nodes

p.next_sibling     # the next node at the same level
p.next_siblings    # a generator over all following siblings

p.previous_sibling
p.previous_siblings
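
On the sample HTML above, the next sibling of a tag is usually a whitespace text node rather than the next tag:

print repr(soup.p.next_sibling)              # u'\n\t' -- a whitespace text node
print soup.p.next_sibling.next_sibling.name  # div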

get adjacent nodes in parse order (children included)

p.next_element
p.next_elements

p.previous_element
p.previous_elements
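
Unlike .next_sibling, .next_element follows document parse order and descends into a tag's children. For the first p of the sample page:

p = soup.p
print p.next_element  # Halo, 3Q -- the string inside the tag itself
print p.next_sibling  # the whitespace text node after the tag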

find_all

soup.find_all(name, attrs, recursive, text, limit, **kwargs)
import re

soup.find_all(re.compile(''))  # match tag names with a regular expression (an empty pattern matches every tag)
soup.find_all(True)  # return all tags

def has_class_but_no_id(tag):
	return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)  # filter with a function
soup.find_all(id='link')
soup.find_all(href=re.compile("elsie"))
soup.find_all("a", class_="sister")
soup.find_all(text=re.compile("include"))
soup.find_all(text=re.compile("include"), limit=2)  # stop after two matches
soup.find_all("title", recursive=False)  # search only direct children

find

find is used just like find_all, but it returns only the first matching node, or None when nothing matches.
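
For example:

print soup.find('p')      # <p>Halo, 3Q</p>
print soup.find('table')  # None, not an exception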

In addition, there are other related functions:

find_next & find_all_next
find_parents & find_parent
find_previous & find_all_previous
find_next_siblings & find_next_sibling
find_previous_siblings & find_previous_sibling


CSS selector

soup.select('title')            # by tag name
soup.select('#link')            # by id
soup.select('head > title')     # direct child
soup.select('.sister')          # by class
soup.select('p a')              # descendant
soup.select('p a[href="xxx"]')  # by attribute value
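
Note that select() always returns a list, even when there is a single match:

titles = soup.select('title')
print titles[0].string  # test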

Everything we need is now ready. In the next post, I'll start my project.