Can't Modify A Function To Work Independently Instead Of Depending On A Returned Result
Solution 1:
If you don't need multithreading support (your edits suggest you don't), you can make it work with the following minor changes. proxyVault holds the entire proxy pool and, once the list has been shuffled, its last element serves as the active proxy (your code used both shuffle and choice, but one of them is enough). pop()-ing from the list changes the active proxy until there are none left.
import random
import requests
from random import choice
from urllib.parse import urljoin
from bs4 import BeautifulSoup
linklist = [
    'https://www.amazon.com/dp/B00OI0RGGO',
    'https://www.amazon.com/dp/B00TPKOPWA',
    'https://www.amazon.com/dp/B00TH42HWE'
]
proxyVault = ['103.110.37.244:36022', '180.254.218.229:8080', '110.74.197.207:50632', '1.20.101.95:49001', '200.10.193.90:8080', '173.164.26.117:3128', '103.228.118.66:43002', '178.128.231.201:3128', '1.2.169.54:55312', '181.52.85.249:31487', '97.64.135.4:8080', '190.96.214.123:53251', '52.144.107.142:31923', '45.5.224.145:52035', '89.218.22.178:8080', '192.241.143.186:80', '113.53.29.218:38310', '36.78.131.182:39243']
random.shuffle(proxyVault)
class NoMoreProxies(Exception):
    pass

def skip_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxyVault.pop()

def get_proxy():
    global proxyVault
    if len(proxyVault) == 0:
        raise NoMoreProxies()
    proxy_url = proxyVault[-1]
    proxy = {'https': f'http://{proxy_url}'}
    return proxy
def parse_product(link):
    try:
        proxy = get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return product_name
    except Exception:
        # The request failed: discard the bad proxy and retry with the next one.
        # skip_proxy() raises NoMoreProxies once the pool is exhausted.
        skip_proxy()
        return parse_product(link)

if __name__ == '__main__':
    for url in linklist:
        result = parse_product(url)
        print(result)
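I would also suggest changing the last try/except clause to catch requests.exceptions.RequestException instead of Exception, so that unrelated errors (a bug in the parsing code, for instance) are not mistaken for a bad proxy.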
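To illustrate why the narrower exception type matters, here is a minimal sketch of the same rotation written as a loop. The names fetch_with_rotation, fetch, retry_on and fake_fetch are hypothetical and not part of the answer above; ConnectionError stands in for the network error so the example runs offline. With requests, you would presumably pass a small wrapper around requests.get as fetch and requests.exceptions.RequestException as retry_on.

```python
class NoMoreProxies(Exception):
    """Raised when every proxy in the pool has been tried and discarded."""


def fetch_with_rotation(link, pool, fetch, retry_on):
    """Try proxies from the end of `pool` until `fetch` succeeds.

    Only exceptions of type `retry_on` discard the active proxy; anything
    else (a programming error, for example) propagates to the caller.
    """
    while pool:
        proxy = pool[-1]  # the active proxy is the last element
        try:
            return fetch(link, proxy)
        except retry_on:
            pool.pop()  # drop the failing proxy and try the next one
    raise NoMoreProxies()


def fake_fetch(link, proxy):
    # Hypothetical stand-in for a real HTTP call: only one proxy "works".
    if proxy != "good:8080":
        raise ConnectionError(f"{proxy} is unreachable")
    return f"fetched {link} via {proxy}"


pool = ["bad-1:8080", "good:8080", "bad-2:8080"]  # placeholder proxy names
result = fetch_with_rotation("https://example.com", pool, fake_fetch, ConnectionError)
print(result)
```

Because only retry_on is caught, an unexpected exception surfaces immediately instead of silently consuming the whole proxy pool.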
Solution 2:
Perhaps you can put the proxy handling logic inside a class, and pass an instance to parse_product(). Then, parse_product() will invoke the necessary methods of the instance to get and/or reset the proxy. The class can look something like this:
import random
from random import choice

class ProxyHandler:
    proxyVault = [
        "103.110.37.244:36022",
        "180.254.218.229:8080"  # and so on
    ]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Initialize proxy
        proxy_url = choice(self.proxyVault)
        self.proxy = {"https": f"http://{proxy_url}"}

    def get_proxy(self):
        return self.proxy

    def renew_proxy(self):
        # Remove current proxy from the vault
        proxy_pattern = self.proxy.get("https").split("//")[-1]
        if proxy_pattern in self.proxyVault:
            self.proxyVault.remove(proxy_pattern)
        # Set new proxy (choice() raises IndexError once the vault is empty)
        random.shuffle(self.proxyVault)
        proxy_url = choice(self.proxyVault)
        self.proxy = {"https": f"http://{proxy_url}"}
Then, parse_product() might look something like this:
def parse_product(link, proxy_handler):
    if not proxy_handler:
        # Checked outside the try block: a bare `raise` with no active
        # exception fails, and the handler below would then call
        # renew_proxy() on a missing object.
        raise ValueError("proxy_handler is required")
    try:
        proxy = proxy_handler.get_proxy()
        print("checking the proxy:", proxy)
        res = requests.get(link, proxies=proxy, timeout=5)
        soup = BeautifulSoup(res.text, "html5lib")
        try:
            product_name = soup.select_one("#productTitle").get_text(strip=True)
        except Exception:
            product_name = ""
        return product_name
    except Exception:
        # The request failed: drop the bad proxy, pick a new one and retry.
        proxy_handler.renew_proxy()
        return parse_product(link, proxy_handler)
I think you can pass the same ProxyHandler instance to all threads and parallelize too, although renew_proxy() modifies shared state, so concurrent calls would need a lock around it.
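As a rough sketch of what sharing one handler between threads could look like, the variant below guards the shared state with a threading.Lock. LockedProxyHandler, its failed parameter and the 10.0.0.x addresses are assumptions made for illustration, not part of the answer above; no real requests are sent.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedProxyHandler:
    """Hypothetical thread-safe variant of the ProxyHandler idea."""

    def __init__(self, proxies):
        self._lock = threading.Lock()
        self._vault = list(proxies)  # per-instance copy, not a class attribute
        self._current = random.choice(self._vault)

    def get_proxy(self):
        with self._lock:
            return self._current

    def renew_proxy(self, failed):
        """Discard `failed`; replace the active proxy only if it is the failed one."""
        with self._lock:
            if failed in self._vault:
                self._vault.remove(failed)
            if self._current == failed:
                if not self._vault:
                    raise RuntimeError("no proxies left")
                self._current = random.choice(self._vault)
            return self._current

    def remaining(self):
        with self._lock:
            return len(self._vault)


handler = LockedProxyHandler(["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"])
first = handler.get_proxy()

# Eight workers report the same failure concurrently; only one proxy is dropped.
with ThreadPoolExecutor(max_workers=8) as executor:
    replacements = list(executor.map(lambda _: handler.renew_proxy(first), range(8)))

print(handler.remaining())
```

Passing the failed proxy to renew_proxy() is a design choice: it keeps several threads that hit the same bad proxy from each discarding a different, possibly healthy, one.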
Solution 3:
I might be missing something crucial here (as it's pretty late), but this seems like a simple problem that was extremely overcomplicated. It almost looks like an XY Problem. I'm going to post some thoughts, questions (wonderings of mine), observations and suggestions:
- The end goal is, for each link, to access it (once, or as many times as possible? If it's the latter, it looks like a DoS attempt, so I'll assume it's the former :) ) trying each of the proxies in turn (when a proxy fails, move on to the next). If that works, get the product name (the product seems to be some kind of electric motor)
- Why the recursion? It's limited by the stack (in Python, by the limit that sys.getrecursionlimit() reports)
- No need to declare variables as global if not assigning values to them (there are exceptions, but I don't think it's the case here)
- process_proxy (question variant) doesn't behave well when proxyVault becomes empty
- global proxy (from the answer) is just ugly
- Why random instead of simply picking the next proxy from the list?
- parse_product_info (parse_product) behavior is not consistent: in some cases it returns something, in others it doesn't
- Parallelization occurs only at the target URL level. It could be improved further (at the cost of more logic in the code) by also parallelizing at the proxy level
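Two of the points above (global declarations and the recursion limit) can be checked with a small experiment. The snippet is only an illustration: the names pool, drop_last, replace_pool and count_down_recursive are made up for it and do not come from the question.

```python
import sys

pool = ["a", "b", "c"]  # placeholder proxy names


def drop_last():
    # Mutating the list in place needs no `global` declaration.
    pool.pop()


def replace_pool():
    # Rebinding the name does need it; without `global`, the assignment
    # would only create a local variable and leave the module-level list alone.
    global pool
    pool = ["x", "y"]


def count_down_recursive(n):
    # One stack frame per step, so the depth is bounded by the recursion limit.
    return 0 if n == 0 else count_down_recursive(n - 1)


drop_last()
after_drop = list(pool)
replace_pool()
after_replace = list(pool)

hit_limit = False
try:
    count_down_recursive(sys.getrecursionlimit() + 100)
except RecursionError:
    hit_limit = True  # a plain loop over the proxies has no such bound

print(after_drop, after_replace, hit_limit)
```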
Below is a simplified (and cleaner) version.
code00.py:
#!/usr/bin/env python3
import sys
import random
import requests
from bs4 import BeautifulSoup
urls = [
    "https://www.amazon.com/dp/B00OI0RGGO",
    "https://www.amazon.com/dp/B00TPKOPWA",
    "https://www.amazon.com/dp/B00TH42HWE",
    "https://www.amazon.com/dp/B00TPKNREM",
]

proxies = [
    "103.110.37.244:36022",
    "180.254.218.229:8080",
    "110.74.197.207:50632",
    "1.20.101.95:49001",
    "200.10.193.90:8080",
    "173.164.26.117:3128",
    "103.228.118.66:43002",
    "178.128.231.201:3128",
    "1.2.169.54:55312",
    "181.52.85.249:31487",
    "97.64.135.4:8080",
    "190.96.214.123:53251",
    "52.144.107.142:31923",
    "45.5.224.145:52035",
    "89.218.22.178:8080",
    "192.241.143.186:80",
    "113.53.29.218:38310",
    "36.78.131.182:39243"
]
def parse_product_info(link):  # Can also pass proxies as argument
    local_proxies = proxies[:]  # Make own copy of the global proxies (in case you want to shuffle them and not affect other parallel processing workers)
    #random.shuffle(local_proxies)  # Makes no difference, but if you really want to shuffle it, uncomment this line
    for proxy in local_proxies:
        try:
            proxy_dict = {"https": f"http://{proxy}"}  # http or https?
            print(f" Proxy to be used: {proxy_dict['https']}")
            response = requests.get(link, proxies=proxy_dict, timeout=5)
            if not response:
                print(f" HTTP request returned {response.status_code} code")
                continue  # Move to next proxy
            soup = BeautifulSoup(response.text, "html5lib")
            try:
                product_name = soup.select_one("#productTitle").get_text(strip=True)
                return product_name  # Information retrieved, return it.
            except Exception as e:  # Might want to use specific exceptions
                print(f"ERROR: {e}")
                # URL was accessible, but the info couldn't be parsed.
                # return, as probably it will be the same using any other proxies.
                return None  # Replace by `continue` if you want to try the other proxies
        except Exception as e:
            #print(f" {e}")
            continue  # Some exception occurred, move to next proxy
    return None  # Every proxy failed
def main():
    for url in urls:
        print(f"\nAttempting url: {url}...")
        product_name = parse_product_info(url)
        if product_name:
            print(f"{url} yielded product name:\n[{product_name}]\n")

if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main()
    print("\nDone.")
Output (partial, as I didn't let it go through all proxies / URLs):
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q058796837]> "e:\Work\Dev\VEnvs\py_064_03.07.03_test0\Scripts\python.exe" code00.py
Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 22:22:05) [MSC v.1916 64 bit (AMD64)] 64bit on win32

Attempting url: https://www.amazon.com/dp/B00OI0RGGO...
 Proxy to be used: http://103.110.37.244:36022
 Proxy to be used: http://180.254.218.229:8080
 Proxy to be used: http://110.74.197.207:50632
 Proxy to be used: http://1.20.101.95:49001
 Proxy to be used: http://200.10.193.90:8080
 Proxy to be used: http://173.164.26.117:3128
...