Tuesday, August 4, 2015

Python: Scraping the Internet using re and pandas

Hi, I am trying to scrape the PGA website to build a CSV of information on golf courses. I tried something new and used the re module and pandas instead of Beautiful Soup to extract the data. I am having a problem writing the CSV file. I tried using a pandas DataFrame, but I kept getting an AttributeError. With my current approach, I get an AttributeError at the UTF-8 encoding step, and I was wondering whether I should break my scraper into try/except blocks. Lastly, how can I create a progress bar for my sanity while I wait for the scrape to finish? Any ideas or thoughts would be greatly appreciated.

Code cited below:

import re
import requests
import pandas as pd
import csv


L = []
for i in range(1):      # Number of pages plus one 
    url = "http://ift.tt/1TSyPTR".format(i)
    r = requests.get(url)

    name     = re.findall('(?<=<div class="views-field-title"><span class="field-content">)([^<]+)', r.text)
    print (name)

    address1 = re.findall('(?<=<div class="views-field-address"><span class="field-content">)([^<]+)', r.text)

    address2 = re.findall('(?<=<div class="views-field-city-state-zip"><span class="field-content">)([^<]+)', r.text)

    ownership = re.findall('(?<=<div class="views-field-course-type"><span class="field-content">)([^<]+)',r.text)

    website   = re.findall('(?<=<div class="class":"views-field-website"><span class="field-content">)([^<]+)',r.text)

    phone     = re.findall('(?<=<div class="class":"views-field-work-phone"><span class="field-content">)([^<]+)',r.text)

    #L.extend(zip(name,address1,address2,ownership,website,phone))

    course=[name,address1,address2,ownership,website,phone]
    L.append(course)

    with open ('Testing.csv','a') as file:
        writer=csv.writer(file)
        for row in L:
            writer.writerow([s.encode("utf-8") for s in row])
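
A likely cause of the AttributeError in the code above: `course` is a list of six lists, so each `s` in `row` is itself a list, and lists have no `.encode` method. The two patterns containing `class="class":"..."` also look unlikely to match anything. The sketch below is one possible rearrangement, not a confirmed fix. It assumes Python 3, assumes all six fields share the markup of the first four, and uses hypothetical names (`URL_TEMPLATE`, `extract_rows`, `fetch_page`, `scrape`) plus a placeholder URL, since the post only shows a shortened link. It zips the per-field lists into per-course rows, lets `open()` handle UTF-8, wraps only the network call in try/except, and prints a simple page counter as a stand-in for a progress bar.

```python
import csv
import re
import sys

# Hypothetical placeholder: the post only shows a shortened link, so the
# real paginated URL is unknown. It must contain "{}" for the page number.
URL_TEMPLATE = "http://example.com/courses?page={}"

# Suffixes of the views-field-* classes, in CSV column order. The last two
# are assumed to follow the same markup as the first four.
FIELDS = ["title", "address", "city-state-zip", "course-type",
          "website", "work-phone"]


def field_pattern(suffix):
    """Compile a regex capturing the text of one views-field-* block."""
    prefix = '<div class="views-field-{}"><span class="field-content">'.format(suffix)
    return re.compile(re.escape(prefix) + "([^<]+)")


PATTERNS = [field_pattern(suffix) for suffix in FIELDS]


def extract_rows(html):
    """Return one tuple per course: (name, address1, address2, ...)."""
    columns = [pattern.findall(html) for pattern in PATTERNS]
    # zip(*columns) turns six per-field lists into per-course rows.
    # It stops at the shortest list, so a missing field drops trailing
    # rows and can misalign the remaining ones.
    return list(zip(*columns))


def fetch_page(url):
    """Download one page with requests and return its decoded text."""
    import requests  # imported here so the parsing code works without it
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


def scrape(page_count, out_path, fetch=fetch_page):
    """Scrape page_count pages into out_path, reporting progress."""
    # Python 3: let open() handle UTF-8 instead of calling s.encode().
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for page in range(page_count):
            try:
                html = fetch(URL_TEMPLATE.format(page))
            except OSError as exc:  # requests' exceptions derive from OSError
                print("\npage {} failed: {}".format(page, exc), file=sys.stderr)
                continue
            writer.writerows(extract_rows(html))
            # "\r" rewrites the same terminal line: a minimal progress display.
            sys.stderr.write("\rscraped {}/{} pages".format(page + 1, page_count))
    sys.stderr.write("\n")
```

Once `URL_TEMPLATE` points at the real listing, a call such as `scrape(5, "Testing.csv")` would write one row per course. Opening the file once, outside the page loop, also avoids re-appending earlier rows on every iteration. If a graphical bar is preferred, wrapping the loop iterable with the third-party `tqdm` package, as in `for page in tqdm(range(page_count))`, is a common alternative to the manual counter.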
