Beautifulsoup Doesn't Reach A Child Element

I wrote the following code trying to scrape a google scholar page import requests as req from bs4 import BeautifulSoup as soup url = r'https://scholar.google.com/scholar?hl=en&

Solution 1:

OK, so I figured it out. I used the selenium module for Python, which drives a browser and lets you perform actions such as clicking links and then read the resulting HTML. I ran into one more issue along the way: the pop-up's content has to finish loading, otherwise the pop-up div just contains "Loading...", so I used the time module to time.sleep(2) and give the content two seconds to load. Then I parsed the resulting HTML with BeautifulSoup to find the anchor tag with the class "gs_citi", pulled the href from that anchor, and fetched it with the requests module. Finally, I wrote the decoded response to a local file, scholar.bib.
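The fixed two-second sleep works, but it either waits longer than needed or not long enough on a slow connection. One alternative, sketched here rather than taken from my original script, is to poll until the "gs_citi" anchor shows up in the page source. The wait_until helper and the FakePage stand-in are hypothetical names used only for illustration, and the example.com URL is a placeholder; with a real driver the predicate would read driver.page_source instead.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


class FakePage:
    """Stand-in for a browser: reports "Loading..." on the first two reads."""

    def __init__(self):
        self.reads = 0

    def source(self):
        self.reads += 1
        if self.reads < 3:
            return "<div>Loading...</div>"
        # Placeholder URL, not a real Scholar link
        return '<a class="gs_citi" href="https://example.com/scholar.bib">BibTeX</a>'


page = FakePage()


def loaded_source():
    # Return the HTML only once the citation anchor is present
    html = page.source()
    return html if "gs_citi" in html else None


source = wait_until(loaded_source, timeout=2.0, interval=0.01)
```

Selenium also ships its own explicit-wait helper, WebDriverWait in selenium.webdriver.support.ui, which may be the better fit when a live driver is available.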

I installed chromedriver and selenium on my Mac using these instructions: https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f

Then I signed my Python file to stop firewall issues, following these instructions: Add Python to OS X Firewall Options?

The following is the code I used to produce the output file "scholar.bib":

import time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from bs4 import BeautifulSoup as soup
import requests as req

# Set up the Selenium Chrome web driver. In Selenium 4 the chromedriver
# path is passed through a Service object rather than positionally.
chromedriver = "/usr/local/bin/chromedriver"
driver = webdriver.Chrome(service=Service(chromedriver))

# Navigate in Chrome to specified page.
driver.get("https://scholar.google.com/scholar?hl=en&q=Sustainability and the measurement of wealth: further reflections")

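# Find the "Cite" links by looking for anchors whose text contains "Cite",
# then take the second match ([1]). find_elements("xpath", ...) replaces
# find_elements_by_xpath, which newer Selenium releases no longer provide.
link = driver.find_elements("xpath", '//a[contains(text(), "Cite")]')[1]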
# Click the link
link.click()

print("Waiting for page to load...")
time.sleep(2) # Sleep for 2 seconds

# Get the current page source from Chrome after the 2-second wait
source = driver.page_source

# We are done with the driver so quit.
driver.quit()

# Use BeautifulSoup to parse the html source and use "html.parser" as the Parser
soupify = soup(source, 'html.parser')

# Find the first anchor with the class "gs_citi"
gs_citi = soupify.find('a', {"class": "gs_citi"})

# Get the href attribute of that anchor
href = gs_citi['href']

print("Fetching: ", href)

# Instantiate a new requests session
session = req.Session()

# Fetch the BibTeX export that the href points to
response = session.get(href)

# Decode the response body to text
bibtex = response.content.decode()

# Write the decoded text to a file named scholar.bib
with open("scholar.bib", "w") as file:
    file.write(bibtex)

Hope this helps anyone looking for a solution to this.
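One small aside on the code above: the search URL passed to driver.get contains raw spaces in the query. The browser tolerated that for me, but a safer habit is to percent-encode the query first. This is a minimal sketch using the standard library; build_search_url is a hypothetical helper name, not part of my original script.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def build_search_url(title, language="en"):
    """Return the Scholar search URL for `title` with the query percent-encoded."""
    return BASE_URL + "?" + urlencode({"hl": language, "q": title})


url = build_search_url(
    "Sustainability and the measurement of wealth: further reflections"
)
print(url)
```

The resulting string can be passed to driver.get in place of the hand-written URL.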

The resulting scholar.bib file:

@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  author={Arrow, Kenneth J and Dasgupta, Partha and Goulder, Lawrence H and Mumford, Kevin J and Oleson, Kirsten},
  journal={Environment and Development Economics},
  volume={18},
  number={4},
  pages={504--516},
  year={2013},
  publisher={Cambridge University Press}
}
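Because the script writes whatever the final request returns, it can be worth checking that the text really is a BibTeX entry (and not, say, an HTML page) before trusting the file. The following is only a rough sketch under the assumption that each field sits on its own line, as in the entry above; parse_bibtex_entry is a hypothetical helper and not a full BibTeX parser. The sample string is an abbreviated copy of the entry shown above.

```python
import re

# Matches the "@type{key," header of an entry
ENTRY_HEADER = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches one "name={value}," field per line
FIELD = re.compile(r"^\s*(\w+)\s*=\s*\{(.*)\}\s*,?\s*$", re.MULTILINE)


def parse_bibtex_entry(text):
    """Return (entry_type, key, fields) for the first entry, or None if none is found."""
    header = ENTRY_HEADER.search(text)
    if header is None:
        return None
    fields = {name.lower(): value for name, value in FIELD.findall(text)}
    return header.group(1).lower(), header.group(2), fields


sample = """@article{arrow2013sustainability,
  title={Sustainability and the measurement of wealth: further reflections},
  pages={504--516},
  year={2013}
}"""

entry = parse_bibtex_entry(sample)
if entry is None:
    print("Response does not look like BibTeX")
else:
    entry_type, key, fields = entry
    print(entry_type, key, fields["year"])
```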
