
Use BeautifulSoup to Loop Through and Retrieve Specific URLs

I want to use BeautifulSoup to repeatedly retrieve the URL at a specific position on a page. Imagine there are 4 different URL lists, each containing 100 different URL links.

Solution 1:

This is a good problem for recursion. Try calling a recursive function to do this:

import requests
from bs4 import BeautifulSoup

def retrieve_urls_recur(url, position, index, deepness):
    # Stop once we have followed the link chain to the desired depth.
    if index >= deepness:
        return True
    else:
        plain_text = requests.get(url).text
        soup = BeautifulSoup(plain_text, 'html.parser')
        links = soup.find_all('a')
        # Pick the anchor tag at the requested position and follow its href.
        desired_link = links[position].get('href')
        print(desired_link)
        return retrieve_urls_recur(desired_link, position, index + 1, deepness)

and then call it with the desired parameters, in your case:

retrieve_urls_recur(url, 2, 0, 4)

2 is the position of the URL in the list of links on each page, 0 is the starting counter, and 4 is how deep you want to go recursively.

PS: I am using requests instead of urllib, and I didn't test this, although I recently used a very similar function with success.
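For example, starting from a hypothetical first page (the URL below is a placeholder, not from the original question), the call follows the link at position 2 across four pages:

# Hypothetical starting URL; substitute the actual first page.
start_url = 'http://example.com/list1.html'

# Follow the link at position 2 on each page, four pages deep.
retrieve_urls_recur(start_url, 2, 0, 4)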

Solution 2:

Just get the link from find_all() by index:

import ssl
import urllib.request
from bs4 import BeautifulSoup

while count < num:
    # Skip SSL certificate verification (handy for exercises,
    # not recommended in production code).
    context = ssl._create_unverified_context()
    htm = urllib.request.urlopen(url, context=context).read()

    # Re-parse the new page and follow the link at the same position.
    soup = BeautifulSoup(htm, 'html.parser')
    url = soup.find_all('a')[position].get('href')

    count += 1
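The loop assumes count, num, position, and url are defined beforehand. A minimal setup sketch (the starting URL is a placeholder, not from the original question):

url = 'http://example.com/list1.html'  # hypothetical first page
position = 2  # index of the link to follow on each page
num = 4       # how many links to follow in sequence
count = 0

After the loop finishes, url holds the last link retrieved in the chain.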
