Retrieving data or content from a web page is called web scraping (perhaps more accurately, screen scraping). This lesson introduces the major concepts of web page structure, content retrieval, and web scraping best practices and tools.
Adapted from Rebecca Weiss's VAM tutorial by Mark Stacy
This lesson will scrape the political party platforms from the American Presidency Project. The main page contains links to each political party platform, organized by party and year. The main objective is to retrieve the appropriate platform links, download the text of each platform, and save the content to a file.
from IPython.display import HTML
presidency_platforms_url = 'http://www.presidency.ucsb.edu/platforms.php'
HTML('<iframe src="{0}" width="100%" height="400px"></iframe>'.format(presidency_platforms_url))
Investigating how the page is set up determines which tool is most appropriate: find where the content lives and the HTML structure that encapsulates it.
Most modern browsers come with developer tools for inspecting a page.
HyperText Markup Language, commonly referred to as HTML, is the standard markup language used to create web pages.
from IPython.display import Image
Image('http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png')
Most modern browsers have a parser that reads in the HTML document, parses it into a DOM structure, and renders the DOM structure.
Much like HTTP, the DOM is an agreed-upon standard.
The DOM is much more than what I've described, but we don't have the time to go into it.
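To make the tree structure concrete, here is a minimal sketch using Python's built-in `html.parser`, which fires an event for each tag as it walks the markup. The HTML snippet is made up for the example; indentation in the output mirrors depth in the DOM tree.

```python
from html.parser import HTMLParser

class TreePrinter(HTMLParser):
    """Record each element, indented to match its depth in the DOM tree."""
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append('  ' * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

parser = TreePrinter()
parser.feed('<html><head><title>Hi</title></head><body><p>Text</p></body></html>')
print('\n'.join(parser.lines))
```

Real parsers build an in-memory tree you can traverse and query; this sketch only prints the shape, but the parent/child nesting it reveals is exactly the structure Beautiful Soup will let us search later.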
Image('http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png')
BeautifulSoup, lxml, and Pandas
- BeautifulSoup
- lxml (XPath, CSS Selectors)
- Pandas
The structure of the page determines which tool to use. We're going to do an example with BeautifulSoup.
Request/response is a messaging protocol.
It is the underlying architectural model for the Hypertext Transfer Protocol, which is the agreed-upon standard for the way the Web works.
The very general, grossly oversimplified idea:
Servers sit around waiting to respond to requests. If a server doesn't respond, something is wrong.
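An HTTP exchange is just text over the wire: the client sends a request line plus headers, and the server answers with a status line, headers, and a body. A small sketch that parses the status line of a canned response (the response text below is invented for the example, not fetched from anywhere):

```python
# A canned HTTP response, as the raw text a server would send back.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)

# The status line is the first line: protocol version, numeric code, reason phrase.
status_line = raw_response.split("\r\n", 1)[0]
version, code, reason = status_line.split(" ", 2)
print(code, reason)
```

Libraries like `requests` do this parsing for us and expose the pieces as attributes such as `r.status_code`.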
#Python Library for retrieving page
# urllib is another library
import requests
r = requests.get(presidency_platforms_url)
r.text[:1000]
r.status_code
What's a status code?
The Web only works because everybody agreed to honor HTTP.
All HTTP clients (e.g. a web browser) must recognize status codes.
Generally: a status code in the 200s means success, the 300s mean redirection, the 400s a client error, and the 500s a server error.
If you write a script to automate scraping, check for a status code of 200. Otherwise, you might get junk data!
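The check described above can be wrapped in a small helper; this is a sketch (the `requests.get` call is shown commented out because it needs the network):

```python
def is_success(status_code):
    """True for 2xx responses; anything else should be treated as suspect."""
    return 200 <= status_code < 300

# In a scraper loop you might guard each page like:
# r = requests.get(url)
# if not is_success(r.status_code):
#     continue  # skip rather than save junk data

print(is_success(200), is_success(404))
```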
Beautiful Soup is a Python library for pulling content out of HTML and XML files.
from bs4 import BeautifulSoup
presidency_platforms_url = 'http://www.presidency.ucsb.edu/platforms.php'
#load page in variable
r = requests.get(presidency_platforms_url)
r.text[:1000]
r.headers
# Load Beautiful Soup with requests text
soup = BeautifulSoup(r.text, 'html.parser')
soup.prettify()[0:1000]
soup.title
soup.meta
soup.title.text
soup.a
soup.p
Beautiful Soup provides convenience attributes that are helpful for working with HTML. They are essentially shortcuts for retrieving very common HTML elements.
The next step is to get all the links on the main page!
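For example, the attribute-style access used above (`soup.title`, `soup.p`) is shorthand for calling `find()` with the tag name, and both return only the *first* match. A quick sketch on a made-up snippet (named `demo_soup` so it doesn't clobber the `soup` variable from the real page):

```python
from bs4 import BeautifulSoup

snippet = '<html><body><p>First</p><p>Second</p></body></html>'
demo_soup = BeautifulSoup(snippet, 'html.parser')

# demo_soup.p is shorthand for demo_soup.find('p') -- first match only.
print(demo_soup.p.get_text())          # First
print(demo_soup.find('p').get_text())  # First
print(len(demo_soup.find_all('p')))    # 2
```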
all_links = []
for link in soup.findAll('a'):
    all_links.append(link.get('href'))
print("Collected %s href values from the page's <a> tags." % len(all_links))
all_links[40:60]
for link in all_links[40:60]:
    print('href #{0} = {1}'.format(str(all_links.index(link)), link.split('/')[-1]))
# Get all Valid Links
valid_links = []
for link in all_links:
    if link is None:  # some <a> tags have no href attribute
        continue
    final_url_element = link.split('/')[-1]
    if final_url_element.startswith('index.php?'):
        valid_links.append(link)
print("There are {0} valid links.".format(len(valid_links)))
valid_links[:10]
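The links collected here happen to be absolute URLs, so they can be requested directly. If a site used relative hrefs instead, they would need to be resolved against the page URL first; `urllib.parse.urljoin` handles both cases (the example URLs below are illustrative):

```python
from urllib.parse import urljoin

base = 'http://www.presidency.ucsb.edu/platforms.php'

# urljoin leaves absolute URLs untouched and resolves relative ones against base.
print(urljoin(base, 'http://www.presidency.ucsb.edu/ws/index.php?pid=101962'))
print(urljoin(base, 'ws/index.php?pid=101962'))
```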
r = requests.get('http://www.presidency.ucsb.edu/ws/index.php?pid=101962')
soup = BeautifulSoup(r.text,'html.parser')
soup.title
soup.title.text.replace(' ','_').replace(':','')
soup.p.get_text()[:2000]
soup.select('.displaytext')[0].get_text()[:1000]
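The chained `.replace()` calls used above to turn the page title into a filename only cover a couple of characters. A single regular-expression substitution covers anything awkward in a filename; this helper is a sketch, not part of the original tutorial:

```python
import re

def safe_filename(title):
    """Collapse any run of characters unsafe in filenames into a single '_'."""
    return re.sub(r'[^\w\-.]+', '_', title.strip())

print(safe_filename('Republican Party Platform of 1860: A Title / Subtitle'))
```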
def replace_with_newlines(element):
    text = ''
    for elem in element.recursiveChildGenerator():
        # Python 2 users will need to replace 'str' with 'basestring'
        # This is not fully tested with Python 2
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'
    return text
data = soup.select('.displaytext')
for item in data:
    for itm in item.findChildren(['p', 'h2', 'h3', 'h4', 'b'])[:10]:
        text = replace_with_newlines(itm)
        print("{0} \n".format(text[:1000]))
        # print("%s \n" % itm.get_text()[:1000])
import os
from datetime import datetime

# Create output directory
if not os.path.exists('output'):
    os.makedirs('output')

# Create log file and set header row
log_file = open('output/president_scraping.log', 'w')
log_file.write("Timestamp\tURL\tStatus Code\n")

print("Start Scraping")
for link in valid_links:
    # Load page
    r = requests.get(link)
    # Log row setup
    tmpl = "{time}\t{link}\t{status}\n"
    log_string = tmpl.format(time=datetime.isoformat(datetime.now()),
                             link=link,
                             status=r.status_code)
    log_file.write(log_string)
    # Beautiful Soup
    soup = BeautifulSoup(r.text, 'html.parser')
    # Set up filename and path
    filename = "{0}.txt".format(soup.title.text.replace(' ', '_')
                                               .replace(':', '')
                                               .replace('/', '-'))
    filename_path = os.path.join('output', filename)
    # Write data to file
    with open(filename_path, 'w') as scraped_text:
        data = soup.select('.displaytext')
        for item in data:
            for itm in item.findChildren(['p', 'h2', 'h3', 'h4', 'b']):
                text = replace_with_newlines(itm)
                scraped_text.write("{0} \n".format(text))
                # ascii errors: comment out the line above and use the one below
                # scraped_text.write("%s \n" % text.encode('utf-8').strip())
log_file.close()
print("Finished scraping!")
Now let's get Oklahoma City Thunder team statistics! This requires the html5lib Python library:
$ pip install html5lib
Installing it requires a restart of the IPython notebook!
import pandas as pd
espn_okc_thunder = "http://espn.go.com/nba/team/stats/_/name/okc/oklahoma-city-thunder"
data = pd.read_html(espn_okc_thunder)
# read_html returns a list of DataFrames, one per HTML table found on the page.
data[0]
pd.read_html?
data = pd.read_html(espn_okc_thunder,skiprows=1,header=0)
data[0]
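`read_html` isn't limited to URLs; it accepts any HTML source. A self-contained sketch with a made-up inline table (wrapped in `StringIO`, as recent pandas versions expect for literal HTML):

```python
import pandas as pd
from io import StringIO

html = """
<table>
  <tr><th>PLAYER</th><th>PPG</th></tr>
  <tr><td>Player A</td><td>28.1</td></tr>
  <tr><td>Player B</td><td>25.4</td></tr>
</table>
"""

# read_html returns a list of DataFrames, one per <table> found.
tables = pd.read_html(StringIO(html))
print(tables[0])
```

The `<th>` cells are picked up as the header row automatically, which is why the real ESPN page above needed `skiprows` and `header` arguments to clean up its extra header rows.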
%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')
data[0][:10].plot(x='PLAYER',y='PPG',kind='bar')
data[1]