Web Scraping

The process of retrieving data or content from a web page is called web scraping, although screen scraping is probably the more appropriate name. This lesson covers the major concepts of web page structure and content retrieval, along with web scraping tools and best practices.

Adapted from Rebecca Weiss's VAM tutorial by Mark Stacy

Topics

  1. Analyze web pages
    1. HTML (HyperText Markup Language)
    2. Web Browser Tools
  2. Python Web Scraping Tools
    1. Beautiful Soup
    2. Lxml
    3. Pandas
  3. Beautiful Soup Example
  4. Pandas Example
  5. Web Scraping Best Practices

Lesson

This lesson will scrape the political party platforms from the American Presidency Project. The main page contains links to each political party platform, organized by party and year. The main objective is to retrieve the appropriate platform links, download the text of each platform, and save the content to a file.

In [1]:
from IPython.display import HTML

presidency_platforms_url = 'http://www.presidency.ucsb.edu/platforms.php'

HTML("<iframe src=" + presidency_platforms_url + " width=100% height=400px></iframe>")
Out[1]:

Analyze Web Pages

Investigating how the page is set up will determine which tool is most appropriate. Determine the location of the content and the HTML structure that encapsulates it.

Web Browser Tools

Most modern browsers come with tools to investigate a page.

  • Chrome, Inspect Element
  • Firefox, Inspect Element
  • Internet Explorer

HTML - HyperText Markup Language

HyperText Markup Language, commonly referred to as HTML, is the standard markup language used to create web pages.

In [2]:
from IPython.display import Image
Image('http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png')
Out[2]:

The HTML Document versus the DOM

Most modern browsers have a parser that reads in the HTML document, parses it into a DOM structure, and renders the DOM structure.

Much like HTTP, the DOM is an agreed-upon standard.

The DOM is much more than what I've described, but we don't have the time to go into it.

In [3]:
Image('http://www.cs.toronto.edu/~shiva/cscb07/img/dom/treeStructure.png')
Out[3]:
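
To make the tree idea concrete, here is a minimal sketch that uses only Python's standard library html.parser to print one indented line per element. The OutlineParser class, the VOID_TAGS set, and the sample_html string are hypothetical names and data made up for this illustration; they are not part of the lesson's scraper, and a real browser DOM holds far more than this outline.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not deepen the tree.
VOID_TAGS = {'br', 'img', 'meta', 'link', 'hr', 'input'}


class OutlineParser(HTMLParser):
    """Collect one indented line per element to show the tree structure."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # Record the tag at the current nesting depth
        self.lines.append('  ' * self.depth + tag)
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


# Hypothetical sample document, not taken from the scraped site.
sample_html = (
    '<html><head><title>Sample</title></head>'
    '<body><h1>Heading</h1><p>Text with a <a href="#">link</a>.</p></body></html>'
)

parser = OutlineParser()
parser.feed(sample_html)
print('\n'.join(parser.lines))
```

Each level of indentation corresponds to one parent/child step in the tree: title sits under head, and the a element sits under p, which sits under body.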

Python libraries: BeautifulSoup, lxml, and Pandas

BeautifulSoup

lxml

  • More powerful parsing capabilities: XPath, CSS Selectors
  • Has C dependencies (can be hard to install if you don't feel comfortable building software from source)
  • Can work with more than HTML (e.g. XML).

Pandas

  • Powerful tool to pull data from multiple formats
  • HTML table tags
  • CSV (comma-separated values)
  • Fixed-width data

Which tool to use depends on the page. We're going to do an example with BeautifulSoup.

Request/Response model

Request/response is a messaging protocol.

It is the underlying architectural model for the Hypertext Transfer Protocol, which is the agreed-upon standard for the way the Web works.

The very general, grossly oversimplified idea:

  1. Clients (like you!) issue requests to servers
  2. Servers issue responses if they receive a request

Servers sit around waiting to respond to requests. If a server doesn't respond, something is wrong.

In [4]:
# requests is a Python library for retrieving pages
# urllib (in the standard library) is another option
import requests

r = requests.get(presidency_platforms_url)

r.text[:1000]
Out[4]:
'<html>\r\n<head>\r\n<title>Political Party Platforms</title>\r\n<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">\r\n<meta name="keywords" content="President of the United States, presidency, American Presidency, American President, Public Papers of the Presidents, State of the Union Address, Inaugural Address, Presidents, American Presidents, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Gerald Ford, Richard Nixon, Lyndon Johnson, John F. Kennedy. John Kennedy, Dwight Eisenhower, Harry Truman, FDR, Franklin Roosevelt, Presidential Elections, Presidential Rhetoric">\r\n<meta name="description" content="The American Presidency Project contains the most comprehensive collection of resources pertaining to the study of the President of the United States.  Compiled by John Woolley and Gerhard Peters">\r\n<link href="http://www.presidency.ucsb.edu/styles/main.css" rel="stylesheet" type="text/css">\r\n<!-- BEGIN Tynt Script -->\r\n<!-- <script type="text/jav'
In [5]:
r.status_code
Out[5]:
200

What's a status code?

The Web only works because everybody agreed to honor HTTP.

All HTTP clients (e.g. a web browser) must recognize status codes.

Generally:

  • 2XX is good
  • 4XX and 5XX are bad

If you write a script to automate scraping, check that the status code is 200. Otherwise, you might get junk data!
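
As a rough sketch of that check, the helpers below group a status code by its first digit and test for 200. The names classify_status and is_ok are made up for this illustration and are not part of requests; the labels follow the standard HTTP status classes.

```python
def classify_status(status_code):
    """Return a coarse label for an HTTP status code based on its first digit."""
    classes = {
        1: 'informational',
        2: 'success',
        3: 'redirection',
        4: 'client error',
        5: 'server error',
    }
    return classes.get(status_code // 100, 'unknown')


def is_ok(status_code):
    """True only for 200, the check recommended for scraping scripts."""
    return status_code == 200


for code in (200, 404, 503):
    print(code, classify_status(code), is_ok(code))
```

With requests, the value to pass in is r.status_code. The library also provides r.raise_for_status(), which raises an exception for 4XX and 5XX responses.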

Beautiful Soup

Beautiful Soup is a Python library for pulling content out of HTML and XML files.

In [6]:
from bs4 import BeautifulSoup

presidency_platforms_url = 'http://www.presidency.ucsb.edu/platforms.php'

#load page in variable
r = requests.get(presidency_platforms_url)
In [8]:
r.headers
Out[8]:
{'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Connection': 'Keep-Alive', 'Date': 'Mon, 01 Feb 2016 20:20:12 GMT', 'Server': 'Apache', 'Content-Type': 'text/html', 'Keep-Alive': 'timeout=15, max=100', 'Content-Length': '4126'}
In [9]:
# Load the text of the response into Beautiful Soup

soup = BeautifulSoup(r.text, 'html.parser')
In [10]:
soup.prettify()[0:1000]
Out[10]:
'<html>\n <head>\n  <title>\n   Political Party Platforms\n  </title>\n  <meta content="text/html; charset=utf-8" http-equiv="Content-Type">\n   <meta content="President of the United States, presidency, American Presidency, American President, Public Papers of the Presidents, State of the Union Address, Inaugural Address, Presidents, American Presidents, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Gerald Ford, Richard Nixon, Lyndon Johnson, John F. Kennedy. John Kennedy, Dwight Eisenhower, Harry Truman, FDR, Franklin Roosevelt, Presidential Elections, Presidential Rhetoric" name="keywords">\n    <meta content="The American Presidency Project contains the most comprehensive collection of resources pertaining to the study of the President of the United States.  Compiled by John Woolley and Gerhard Peters" name="description">\n     <link href="http://www.presidency.ucsb.edu/styles/main.css" rel="stylesheet" type="text/css">\n      <!-- BEGIN Tynt Script -->\n      <!-- <'
In [11]:
soup.title
Out[11]:
<title>Political Party Platforms</title>
In [12]:
soup.meta
Out[12]:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
<meta content="President of the United States, presidency, American Presidency, American President, Public Papers of the Presidents, State of the Union Address, Inaugural Address, Presidents, American Presidents, George W. Bush, Bill Clinton, George Bush, Ronald Reagan, Jimmy Carter, Gerald Ford, Richard Nixon, Lyndon Johnson, John F. Kennedy. John Kennedy, Dwight Eisenhower, Harry Truman, FDR, Franklin Roosevelt, Presidential Elections, Presidential Rhetoric" name="keywords">
<meta content="The American Presidency Project contains the most comprehensive collection of resources pertaining to the study of the President of the United States.  Compiled by John Woolley and Gerhard Peters" name="description">
<link href="http://www.presidency.ucsb.edu/styles/main.css" rel="stylesheet" type="text/css">
<!-- BEGIN Tynt Script -->
<!-- <script type="text/javascript">
if(document.location.protocol=='http:'){
 var Tynt=Tynt||[];Tynt.push('cJ3hqqsgCr4i0xadbi-bpO');
 (function(){var s=document.createElement('script');s.async="async";s.type="text/javascript";s.src='http://tcr.tynt.com/ti.js';var h=document.getElementsByTagName('script')[0];h.parentNode.insertBefore(s,h);})();
}
</script>-->
<!-- END Tynt Script --><link href="styles/main.css" rel="stylesheet" type="text/css">
<script language="JavaScript">
<!--
function MM_jumpMenu(targ,selObj,restore){ //v3.0
 eval(targ+".location='"+selObj.options[selObj.selectedIndex].value+"'");
 if (restore) selObj.selectedIndex=0;
}
//-->
</script>
</link></link></meta></meta></meta>
In [13]:
soup.title.text
Out[13]:
'Political Party Platforms'
In [14]:
soup.a
Out[14]:
<a href="../index.php"><img alt="Home" border="0" height="29" src="http://www.presidency.ucsb.edu/images/l1.gif" width="26"/></a>
In [15]:
soup.p
Out[15]:
<p><span class="datatitle">Political Party Platforms of Parties Receiving Electoral Votes: </span><span class="datadates">1840 - 2012</span></p>

What are these attributes?

Beautiful Soup exposes tag names as attributes that are helpful for working with HTML. They are essentially shortcuts that return the first matching element: soup.title, soup.a, and soup.p return the first <title>, <a>, and <p> tags on the page.

The next step is to get all links on the main page!

In [13]:
all_links = []
# find_all is the current name for the older findAll alias
for link in soup.find_all('a'):
    all_links.append(link.get('href'))
print("All links href in a list from a for loop: %s" % (len(all_links)))
All links href in a list from a for loop: 144
In [14]:
all_links[40:60]
Out[14]:
['http://www.presidency.ucsb.edu/ws/index.php?pid=101962',
 'papers_pdf/101962.pdf',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=78283',
 'http://www.presidency.ucsb.edu/papers_pdf/78283.pdf',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29613',
 'http://www.presidency.ucsb.edu/papers_pdf/29613.pdf',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29612',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29611',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29610',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29609',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29608',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29607',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29606',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29605',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29604',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29603',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29602',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29601',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29600',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29599']
In [15]:
# enumerate gives the true position even if the same href appears twice
for i, link in enumerate(all_links[40:60], start=40):
    print('href #{0} = {1}'.format(i, link.split('/')[-1]))
href #40 = index.php?pid=101962
href #41 = 101962.pdf
href #42 = index.php?pid=78283
href #43 = 78283.pdf
href #44 = index.php?pid=29613
href #45 = 29613.pdf
href #46 = index.php?pid=29612
href #47 = index.php?pid=29611
href #48 = index.php?pid=29610
href #49 = index.php?pid=29609
href #50 = index.php?pid=29608
href #51 = index.php?pid=29607
href #52 = index.php?pid=29606
href #53 = index.php?pid=29605
href #54 = index.php?pid=29604
href #55 = index.php?pid=29603
href #56 = index.php?pid=29602
href #57 = index.php?pid=29601
href #58 = index.php?pid=29600
href #59 = index.php?pid=29599
In [43]:
# Get all valid links (those that point to a platform page)
valid_links = []
for link in all_links:
    final_url_element = link.split('/')[-1]
    if final_url_element.startswith('index.php?'):
        valid_links.append(link)

print("There are {0} valid links.".format(len(valid_links)))
There are 96 valid links.
In [17]:
valid_links[:10]
Out[17]:
['http://www.presidency.ucsb.edu/ws/index.php?pid=101962',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=78283',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29613',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29612',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29611',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29610',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29609',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29608',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29607',
 'http://www.presidency.ucsb.edu/ws/index.php?pid=29606']
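
The all_links output above mixes absolute URLs with a relative one ('papers_pdf/101962.pdf'). As an alternative sketch of the filtering step, the standard library's urllib.parse can resolve relative links against the page URL and inspect the query string. The platform_links function and the sample_links list are illustrative names; the hrefs are copied from the output above, and the rule "path ends in index.php and has a pid parameter" is an assumption based on those links.

```python
from urllib.parse import parse_qs, urljoin, urlparse

base_url = 'http://www.presidency.ucsb.edu/platforms.php'

# A few hrefs copied from the all_links output above.
sample_links = [
    'http://www.presidency.ucsb.edu/ws/index.php?pid=101962',
    'papers_pdf/101962.pdf',
    'http://www.presidency.ucsb.edu/ws/index.php?pid=78283',
    'http://www.presidency.ucsb.edu/papers_pdf/78283.pdf',
]


def platform_links(links, base):
    """Return absolute URLs whose path ends in index.php and carry a pid."""
    kept = []
    for href in links:
        if not href:  # link.get('href') returns None for anchors without href
            continue
        absolute = urljoin(base, href)
        parts = urlparse(absolute)
        if parts.path.endswith('/index.php') and 'pid' in parse_qs(parts.query):
            kept.append(absolute)
    return kept


print(platform_links(sample_links, base_url))
print(urljoin(base_url, 'papers_pdf/101962.pdf'))
```

Skipping empty hrefs also guards against anchors that have no href attribute, which would otherwise break a call to split.
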
In [45]:
r = requests.get('http://www.presidency.ucsb.edu/ws/index.php?pid=101962')
soup = BeautifulSoup(r.text, 'html.parser')
In [46]:
soup.title
Out[46]:
<title>Democratic Party Platforms: 2012 Democratic Party Platform</title>
In [47]:
soup.title.text.replace(' ','_').replace(':','')
Out[47]:
'Democratic_Party_Platforms_2012_Democratic_Party_Platform'
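
The title is turned into a filename here, so it can help to wrap the replacements in one helper. This is only a sketch: title_to_filename is a hypothetical function, and the extra step that strips any remaining unusual characters is an assumption about what makes a safe filename, not something the page titles are known to need.

```python
import re


def title_to_filename(title, extension='.txt'):
    """Turn a page title into a filename using only safe characters."""
    name = title.strip().replace(':', '').replace('/', '-')
    # Collapse runs of whitespace into single underscores.
    name = re.sub(r'\s+', '_', name)
    # Drop anything that is not a letter, digit, underscore, hyphen, or dot.
    name = re.sub(r'[^\w.-]', '', name)
    return name + extension


print(title_to_filename('Democratic Party Platforms: 2012 Democratic Party Platform'))
```
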
In [48]:
soup.p.get_text()[:2000]
Out[48]:
"Four years ago, Democrats, independents, and many Republicans came together as Americans to move our country forward. We were in the midst of the greatest economic crisis since the Great Depression, the previous administration had put two wars on our nation's credit card, and the American Dream had slipped out of reach for too many. Today, our economy is growing again, al-Qaeda is weaker than at any point since 9/11, and our manufacturing sector is growing for the first time in more than a decade. But there is more we need to do, and so we come together again to continue what we started. We gather to reclaim the basic bargain that built the largest middle class and the most prosperous nation on Earth - the simple principle that in America, hard work should pay off, responsibility should be rewarded, and each one of us should be able to go as far as our talent and drive take us. This election is not simply a choice between two candidates or two political parties, but between two fundamentally different paths for our country and our families. We Democrats offer America the opportunity to move our country forward by creating an economy built to last and built from the middle out. Mitt Romney and the Republican Party have a drastically different vision. They still believe the best way to grow the economy is from the top down - the same approach that benefited the wealthy few but crashed the economy and crushed the middle class. Democrats see a young country continually made stronger by the greatest diversity of talent and ingenuity in the world, and a nation of people drawn to our shores from every corner of the globe. We believe America can succeed because the American people have never failed and there is nothing that together we cannot accomplish. Reclaiming the economic security of the middle class is the challenge we must overcome today. 
That begins by restoring the basic values that made our country great, and restoring for everyone who works hard and plays by the"
In [49]:
soup.select('.displaytext')[0].get_text()[:1000]
Out[49]:
"Moving America Forward2012 Democratic National PlatformMoving America ForwardFour years ago, Democrats, independents, and many Republicans came together as Americans to move our country forward. We were in the midst of the greatest economic crisis since the Great Depression, the previous administration had put two wars on our nation's credit card, and the American Dream had slipped out of reach for too many. Today, our economy is growing again, al-Qaeda is weaker than at any point since 9/11, and our manufacturing sector is growing for the first time in more than a decade. But there is more we need to do, and so we come together again to continue what we started. We gather to reclaim the basic bargain that built the largest middle class and the most prosperous nation on Earth - the simple principle that in America, hard work should pay off, responsibility should be rewarded, and each one of us should be able to go as far as our talent and drive take us. This election is not simply a ch"
In [53]:
def replace_with_newlines(element):
    """Return the text inside element, turning each <br> tag into a newline."""
    text = ''
    # .descendants walks every nested node (tags and strings) in document order
    for elem in element.descendants:
        #Python 2 users will need to replace 'str' with 'basestring'
        #This is not fully tested with Python 2
        if isinstance(elem, str):
            text += elem.strip()
        elif elem.name == 'br':
            text += '\n'

    return text

data = soup.select('.displaytext')
for item in data:
    for itm in item.findChildren(['p','h2','h3','h4','b'])[:10]:
        text = replace_with_newlines(itm)
        print("{0} \n".format(text[:1000]))
        #print("%s \n" % itm.get_text()[:1000])
        
Moving America Forward
2012 Democratic National Platform 

Moving America Forward 

Four years ago, Democrats, independents, and many Republicans came together as Americans to move our country forward. We were in the midst of the greatest economic crisis since the Great Depression, the previous administration had put two wars on our nation's credit card, and the American Dream had slipped out of reach for too many.Today, our economy is growing again, al-Qaeda is weaker than at any point since 9/11, and our manufacturing sector is growing for the first time in more than a decade. But there is more we need to do, and so we come together again to continue what we started. We gather to reclaim the basic bargain that built the largest middle class and the most prosperous nation on Earth - the simple principle that in America, hard work should pay off, responsibility should be rewarded, and each one of us should be able to go as far as our talent and drive take us.This election is not simply a choice between two candidates or two political parties, but between two fundament 

Today, our economy is growing again, al-Qaeda is weaker than at any point since 9/11, and our manufacturing sector is growing for the first time in more than a decade. But there is more we need to do, and so we come together again to continue what we started. We gather to reclaim the basic bargain that built the largest middle class and the most prosperous nation on Earth - the simple principle that in America, hard work should pay off, responsibility should be rewarded, and each one of us should be able to go as far as our talent and drive take us.This election is not simply a choice between two candidates or two political parties, but between two fundamentally different paths for our country and our families.We Democrats offer America the opportunity to move our country forward by creating an economy built to last and built from the middle out. Mitt Romney and the Republican Party have a drastically different vision. They still believe the best way to grow the economy is from the top 

This election is not simply a choice between two candidates or two political parties, but between two fundamentally different paths for our country and our families.We Democrats offer America the opportunity to move our country forward by creating an economy built to last and built from the middle out. Mitt Romney and the Republican Party have a drastically different vision. They still believe the best way to grow the economy is from the top down - the same approach that benefited the wealthy few but crashed the economy and crushed the middle class.Democrats see a young country continually made stronger by the greatest diversity of talent and ingenuity in the world, and a nation of people drawn to our shores from every corner of the globe. We believe America can succeed because the American people have never failed and there is nothing that together we cannot accomplish.Reclaiming the economic security of the middle class is the challenge we must overcome today. That begins by restorin 

We Democrats offer America the opportunity to move our country forward by creating an economy built to last and built from the middle out. Mitt Romney and the Republican Party have a drastically different vision. They still believe the best way to grow the economy is from the top down - the same approach that benefited the wealthy few but crashed the economy and crushed the middle class.Democrats see a young country continually made stronger by the greatest diversity of talent and ingenuity in the world, and a nation of people drawn to our shores from every corner of the globe. We believe America can succeed because the American people have never failed and there is nothing that together we cannot accomplish.Reclaiming the economic security of the middle class is the challenge we must overcome today. That begins by restoring the basic values that made our country great, and restoring for everyone who works hard and plays by the rules the opportunity to find a job that pays the bills, t 

Democrats see a young country continually made stronger by the greatest diversity of talent and ingenuity in the world, and a nation of people drawn to our shores from every corner of the globe. We believe America can succeed because the American people have never failed and there is nothing that together we cannot accomplish.Reclaiming the economic security of the middle class is the challenge we must overcome today. That begins by restoring the basic values that made our country great, and restoring for everyone who works hard and plays by the rules the opportunity to find a job that pays the bills, turn an idea into a profitable business, care for your family, afford a home you call your own and health care you can count on, retire with dignity and respect, and, most of all, give your children the kind of education that allows them to dream even bigger and go even further than you ever imagined.This has to be our North Star - an economy that's built not from the top down, but from a 

Reclaiming the economic security of the middle class is the challenge we must overcome today. That begins by restoring the basic values that made our country great, and restoring for everyone who works hard and plays by the rules the opportunity to find a job that pays the bills, turn an idea into a profitable business, care for your family, afford a home you call your own and health care you can count on, retire with dignity and respect, and, most of all, give your children the kind of education that allows them to dream even bigger and go even further than you ever imagined.This has to be our North Star - an economy that's built not from the top down, but from a growing middle class, and that provides ladders of opportunity for those working hard to join the middle class.This is not another trivial political argument. It's the defining issue of our time and at the core of the American Dream. And now we stand at a make-or-break moment, and are faced with a choice between moving forwar 

This has to be our North Star - an economy that's built not from the top down, but from a growing middle class, and that provides ladders of opportunity for those working hard to join the middle class.This is not another trivial political argument. It's the defining issue of our time and at the core of the American Dream. And now we stand at a make-or-break moment, and are faced with a choice between moving forward and falling back.The Republican Party has turned its back on the middle class Americans who built this country. Our opponents believe we should go back to the top-down economic policies of the last decade. They think that if we simply eliminate protections for families and consumers, let Wall Street write its own rules again, and cut taxes for the wealthiest, the market will solve all our problems on its own. They argue that if we help corporations and wealthy investors maximize their profits by whatever means necessary, whether through layoffs or outsourcing, it will automa 

This is not another trivial political argument. It's the defining issue of our time and at the core of the American Dream. And now we stand at a make-or-break moment, and are faced with a choice between moving forward and falling back.The Republican Party has turned its back on the middle class Americans who built this country. Our opponents believe we should go back to the top-down economic policies of the last decade. They think that if we simply eliminate protections for families and consumers, let Wall Street write its own rules again, and cut taxes for the wealthiest, the market will solve all our problems on its own. They argue that if we help corporations and wealthy investors maximize their profits by whatever means necessary, whether through layoffs or outsourcing, it will automatically translate into jobs and prosperity that benefits us all. They would repeal health reform, turn Medicare into a voucher program, and follow the same path of fiscal irresponsibility of the past a 

Python Anaconda Versions

  1. Python 3.5: the code below works without errors.
  2. Python 3.0 to 3.4: not tested!
  3. Python 2: raises an error when a non-ASCII character cannot be encoded.
    • Add .encode('utf-8').strip() to the end of the text returned.
    • The cell below includes both lines; the alternative is commented out.
    • "{0}".format("string") also works in Python 2.6 and later; it is not Python 3 only.
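
One way to sidestep the encoding error on either Python version is to open the output file with an explicit encoding. The sketch below uses io.open, which accepts an encoding argument on both Python 2 and Python 3. The sample text and the temporary path are made up for illustration and are not part of the scraper.

```python
import io
import os
import tempfile

# Sample text containing a non-ASCII character (an accented e).
text = u'Caf\u00e9 platform text'

# Write to a temporary directory so the sketch does not touch real output.
path = os.path.join(tempfile.mkdtemp(), 'example.txt')

# io.open accepts an encoding argument on both Python 2 and Python 3.
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(text + u'\n')

# Read the file back with the same encoding to confirm the round trip.
with io.open(path, 'r', encoding='utf-8') as handle:
    round_trip = handle.read()

print(round_trip == text + u'\n')
```
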
In [42]:
import os
from datetime import datetime

#Create output directory
if not os.path.exists('output'):
    os.makedirs('output')

#Create log file and set header row
log_file = open('output/president_scraping.log','w')
log_file.write("Timestamp\tURL\tStatus Code \n")

print("Start Scraping")
for link in valid_links:
    #Load page
    r = requests.get(link)
    #Log Row setup
    tmpl = "{time}\t{link}\t{status}\n"
    log_string = tmpl.format(time=datetime.isoformat(datetime.now()),
                             link=link,
                             status=r.status_code)
    log_file.write(log_string)
    #Parse the page with Beautiful Soup
    soup = BeautifulSoup(r.text, 'html.parser')
    #setup filename and path
    filename = "{0}.txt".format(soup.title.text.replace(' ','_')
                                .replace(':','')
                                .replace('/','-'))
    filename_path = os.path.join('output',filename)
    #Write data to file
    with open(filename_path,'w') as scraped_text:
        data = soup.select('.displaytext')
        for item in data:
            for itm in item.findChildren(['p','h2','h3','h4','b']):
                text = replace_with_newlines(itm)
                scraped_text.write("{0} \n".format(text))
                #ascii errors, comment out line above with error and use the one below
                #scraped_text.write("%s \n" % text.encode('utf-8').strip())

log_file.close()
print("Finished scraping!")
Start Scraping
Finished scraping!

Pandas Quick Example

The goal is to get Oklahoma City Thunder team statistics.

This example requires the Python library html5lib:

$ pip install html5lib

Restart the IPython notebook after installing it!

In [60]:
import pandas as pd

espn_okc_thunder = "http://espn.go.com/nba/team/stats/_/name/okc/oklahoma-city-thunder"

data = pd.read_html(espn_okc_thunder)
# read_html returns a list of DataFrames, one for every HTML table with data.

data[0]
Out[60]:
    0                      1    2    3     4      5     6     7     8     9     10    11    12    13    14
0   GAME STATISTICS        NaN  NaN  NaN   NaN    NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
1   PLAYER                 GP   GS   MIN   PPG    OFFR  DEFR  RPG   APG   SPG   BPG   TPG   FPG   A/TO  PER
2   Kevin Durant, SF       42   42   36.0  27.2   0.5   7.5   8.0   4.5   0.98  1.17  3.0   1.7   1.5   28.2
3   Russell Westbrook, PG  49   49   34.2  24.1   1.7   5.6   7.3   9.9   2.39  0.27  4.3   2.4   2.3   28.6
4   Serge Ibaka, PF        49   49   32.6  12.8   1.9   4.9   6.8   0.9   0.39  2.14  1.6   2.7   0.5   14.4
5   Enes Kanter, C         49   0    20.8  11.8   2.9   4.7   7.6   0.5   0.22  0.47  1.3   1.9   0.4   23.1
6   Dion Waiters, SG       49   5    27.6  10.1   0.5   2.3   2.8   1.9   0.90  0.14  1.6   1.9   1.2   9.7
7   Steven Adams, C        47   47   24.4  7.1    2.5   3.8   6.4   0.7   0.43  1.17  0.9   2.8   0.7   14.8
8   Anthony Morrow, SG     43   5    14.9  5.8    0.2   1.0   1.2   0.3   0.42  0.07  0.2   1.0   1.4   11.6
9   Cameron Payne, PG      32   0    11.4  5.2    0.1   1.5   1.6   1.8   0.72  0.13  0.6   1.3   3.1   17.5
10  Andre Roberson, SG     45   45   21.7  4.9    1.1   2.2   3.3   0.8   0.89  0.67  0.6   1.8   1.3   10.9
11  D.J. Augustin, PG      34   0    15.3  4.2    0.1   1.1   1.3   1.9   0.38  0.06  0.9   1.2   2.2   8.9
12  Kyle Singler, SF       38   1    12.7  2.9    0.5   1.3   1.8   0.2   0.39  0.13  0.5   1.7   0.4   5.4
13  Nick Collison, PF      40   2    12.5  2.3    1.3   1.8   3.1   0.8   0.28  0.33  0.7   2.0   1.1   8.8
14  Steve Novak, SF        5    0    4.2   2.2    0.0   0.6   0.6   0.0   0.00  0.00  0.0   0.2   0.0   15.2
15  Mitch McGary, PF       13   0    4.2   1.2    0.3   0.7   1.0   0.2   0.00  0.15  0.5   0.6   0.3   4.9
16  Totals                 49   --   --    109.4  12.7  35.0  47.7  22.1  7.59  6.35  15.0  20.4  1.5   --
In [56]:
pd.read_html?
In [69]:
data = pd.read_html(espn_okc_thunder,skiprows=1,header=0)
data[0]
Out[69]:
    PLAYER                 GP  GS  MIN   PPG    OFFR  DEFR  RPG   APG   SPG   BPG   TPG   FPG   A/TO  PER
0   Kevin Durant, SF       42  42  36.0  27.2   0.5   7.5   8.0   4.5   0.98  1.17  3.0   1.7   1.5   28.2
1   Russell Westbrook, PG  49  49  34.2  24.1   1.7   5.6   7.3   9.9   2.39  0.27  4.3   2.4   2.3   28.6
2   Serge Ibaka, PF        49  49  32.6  12.8   1.9   4.9   6.8   0.9   0.39  2.14  1.6   2.7   0.5   14.4
3   Enes Kanter, C         49  0   20.8  11.8   2.9   4.7   7.6   0.5   0.22  0.47  1.3   1.9   0.4   23.1
4   Dion Waiters, SG       49  5   27.6  10.1   0.5   2.3   2.8   1.9   0.90  0.14  1.6   1.9   1.2   9.7
5   Steven Adams, C        47  47  24.4  7.1    2.5   3.8   6.4   0.7   0.43  1.17  0.9   2.8   0.7   14.8
6   Anthony Morrow, SG     43  5   14.9  5.8    0.2   1.0   1.2   0.3   0.42  0.07  0.2   1.0   1.4   11.6
7   Cameron Payne, PG      32  0   11.4  5.2    0.1   1.5   1.6   1.8   0.72  0.13  0.6   1.3   3.1   17.5
8   Andre Roberson, SG     45  45  21.7  4.9    1.1   2.2   3.3   0.8   0.89  0.67  0.6   1.8   1.3   10.9
9   D.J. Augustin, PG      34  0   15.3  4.2    0.1   1.1   1.3   1.9   0.38  0.06  0.9   1.2   2.2   8.9
10  Kyle Singler, SF       38  1   12.7  2.9    0.5   1.3   1.8   0.2   0.39  0.13  0.5   1.7   0.4   5.4
11  Nick Collison, PF      40  2   12.5  2.3    1.3   1.8   3.1   0.8   0.28  0.33  0.7   2.0   1.1   8.8
12  Steve Novak, SF        5   0   4.2   2.2    0.0   0.6   0.6   0.0   0.00  0.00  0.0   0.2   0.0   15.2
13  Mitch McGary, PF       13  0   4.2   1.2    0.3   0.7   1.0   0.2   0.00  0.15  0.5   0.6   0.3   4.9
14  Totals                 49  --  --    109.4  12.7  35.0  47.7  22.1  7.59  6.35  15.0  20.4  1.5   --
In [70]:
%matplotlib inline
In [71]:
import matplotlib
matplotlib.style.use('ggplot')
In [77]:
data[0][:10].plot(x='PLAYER',y='PPG',kind='bar')
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x10da5a940>
In [120]:
data[1]
Out[120]:
    PLAYER                 FGM   FGA   FG%    3PM  3PA   3P%    FTM   FTA   FT%    2PM   2PA   2P%    PPS    AFG%
0   Kevin Durant, SF       9.5   18.4  0.516  2.7  6.2   0.437  5.9   6.6   0.880  6.8   12.2  0.556  1.496  0.59
1   Russell Westbrook, PG  9.3   19.8  0.470  1.5  5.0   0.300  7.0   8.3   0.840  7.8   14.8  0.527  1.366  0.51
2   Serge Ibaka, PF        5.8   12.0  0.483  0.5  1.3   0.360  0.9   1.3   0.680  5.3   10.7  0.498  1.075  0.50
3   Enes Kanter, C         4.7   8.4   0.563  0.1  0.1   1.000  2.5   3.3   0.770  4.6   8.3   0.560  1.431  0.57
4   Dion Waiters, SG       3.9   9.9   0.394  1.1  2.9   0.368  1.9   2.2   0.840  2.8   7.0   0.404  1.081  0.45
5   Steven Adams, C        2.4   4.2   0.571  0.0  0.0   0.000  1.2   2.0   0.580  2.4   4.2   0.571  1.417  0.57
6   Anthony Morrow, SG     2.0   5.3   0.376  1.2  3.2   0.383  0.3   0.3   0.830  0.8   2.1   0.366  1.030  0.49
7   D.J. Augustin, PG      1.8   4.4   0.402  0.9  2.1   0.415  0.8   1.2   0.700  0.9   2.3   0.391  1.184  0.50
8   Andre Roberson, SG     1.8   4.4   0.422  0.5  1.9   0.278  0.4   0.7   0.570  1.3   2.5   0.532  1.060  0.48
9   Steve Novak, SF        1.0   1.0   1.000  1.0  1.0   1.000  0.0   0.0   0.000  0.0   0.0   0.000  3.000  1.50
10  Nick Collison, PF      1.2   2.4   0.487  0.0  0.1   0.000  0.4   0.4   1.000  1.2   2.3   0.514  1.128  0.49
11  Kyle Singler, SF       0.8   3.2   0.235  0.4  1.9   0.200  0.3   0.6   0.500  0.4   1.3   0.286  0.686  0.29
12  Mitch McGary, PF       0.5   1.0   0.500  0.0  0.0   0.000  0.5   1.0   0.500  0.5   1.0   0.500  1.500  0.50
13  Cameron Payne, PG      0.4   1.7   0.267  0.2  0.8   0.286  0.0   0.0   0.000  0.2   0.9   0.250  0.667  0.33
14  Totals                 40.1  86.3  0.465  7.9  22.4  0.353  19.6  24.9  0.787  32.2  64.0  0.504  1.250  0.51

Web Scraping Best Practices

  1. Generate a log file to provide a record of when each page was scraped.
  2. Be kind to your fellow web server administrators! Limit how often you scrape a site.
  3. Log and check the status of every web request.
  4. Test web scraping scripts regularly to ensure the website has not changed!
  5. Websites that load their content with JavaScript will need a different technique.
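
The first three practices can be combined in one loop. The sketch below is one possible arrangement, not the lesson's scraper: polite_scrape, fake_fetch, and the example.com URLs are hypothetical, and the one-second default pause is an arbitrary assumption. The fetch argument is any callable that returns a (status_code, text) pair, so the loop can be tried offline before pointing it at a real site.

```python
import time
from datetime import datetime


def polite_scrape(urls, fetch, delay_seconds=1.0, log=None):
    """Fetch each URL in turn, pausing between requests and logging status.

    `fetch` is any callable that takes a URL and returns (status_code, text).
    Returns a dict mapping URL to text for responses with status 200.
    """
    log = [] if log is None else log
    pages = {}
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)  # limit the load on the server
        status, text = fetch(url)
        # One tab-separated log row per request: timestamp, URL, status
        log.append('{0}\t{1}\t{2}'.format(datetime.now().isoformat(), url, status))
        if status == 200:
            pages[url] = text
    return pages


def fake_fetch(url):
    """Stand-in for a real HTTP call so the sketch runs offline."""
    if url.endswith('missing'):
        return 404, ''
    return 200, 'content of ' + url


log_rows = []
result = polite_scrape(
    ['http://example.com/a', 'http://example.com/missing'],
    fake_fetch,
    delay_seconds=0.1,
    log=log_rows,
)
print(sorted(result))
print(len(log_rows))
```

With requests, fetch could be a small wrapper that calls requests.get(url) and returns (r.status_code, r.text).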