Automatically Retrieving Large Files Via Public HTTP Into Google Cloud Storage
Solution 1:
3/ Workaround with a Compute Engine instance
Since it was not possible to retrieve large files from external HTTP with App Engine or directly with Cloud Storage, I have used a workaround with an always-running Compute Engine instance.
This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.
For scalability, maintenance and cost reasons, I would have preferred to use only serverless services, but fortunately:
- It works well on a fresh f1-micro Compute Engine instance (no extra packages required and only $4/month if running 24/7)
- The network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month)
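The download-then-upload loop described above could be sketched roughly as follows. This is not the original script, only a minimal sketch under assumptions: the function names, the chunk size, the bucket layout and the `already_done` bookkeeping are all hypothetical, and the upload step assumes the `gsutil` command is available on the instance.

```python
import shutil
import subprocess
import urllib.request
from pathlib import Path


def download_to_file(url: str, dest: Path, chunk_size: int = 1024 * 1024) -> int:
    """Stream url to dest in fixed-size chunks and return the bytes written."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # copyfileobj reads chunk_size bytes at a time, so memory use stays flat
        shutil.copyfileobj(response, out, chunk_size)
    return dest.stat().st_size


def upload_with_gsutil(local_path: Path, bucket: str) -> None:
    """Copy a local file to gs://<bucket>/<file name> (hypothetical layout)."""
    target = f"gs://{bucket}/{local_path.name}"
    subprocess.run(["gsutil", "cp", str(local_path), target], check=True)


def fetch_new_files(urls, workdir: Path, bucket: str, already_done: set) -> list:
    """Download and upload every URL whose file name is not in already_done."""
    transferred = []
    for url in urls:
        name = url.rsplit("/", 1)[-1]
        if name in already_done:
            continue  # this file was handled in an earlier run
        local_path = workdir / name
        download_to_file(url, local_path)
        upload_with_gsutil(local_path, bucket)
        local_path.unlink()  # free disk space on the small instance
        already_done.add(name)
        transferred.append(name)
    return transferred
```

A script like this could be run periodically, for example from cron, with the list of candidate URLs rebuilt on each run.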
Solution 2:
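The MD5 and size of the file can be retrieved quickly with the curl -I command, which sends a HEAD request and prints only the response headers, as mentioned in this link: https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests. This only works if the server actually reports them in its headers (for example Content-Length and Content-MD5).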
The Storage Transfer Service can then be configured to use that information.
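For reference, the same header lookup can be done from Python. The sketch below is only an illustration under assumptions: the function names are hypothetical, and it assumes the server exposes the size in Content-Length and a base64 MD5 in Content-MD5, which many servers do not send.

```python
import urllib.request


def extract_size_and_md5(headers) -> tuple:
    """Return (size_in_bytes, md5_base64) from response headers, None if absent."""
    length = headers.get("Content-Length")
    size = int(length) if length is not None else None
    md5 = headers.get("Content-MD5")  # base64 MD5, only sent by some servers
    return size, md5


def head_metadata(url: str) -> tuple:
    """Send a HEAD request (like curl -I) and read size and MD5 from the headers."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return extract_size_and_md5(response.headers)
```

If either value comes back as None, the server does not advertise it and the file has to be downloaded to compute it.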
Another option would be to use a serverless Cloud Function. It could look something like the following in Python.
import requests

def download_url_file(url):
    """Download url into /tmp and return the file name, or None on failure."""
    output_filename = url.split('/')[-1]
    output_filepath = '/tmp/{}'.format(output_filename)
    try:
        print('[ INFO ] Downloading {}'.format(url))
        # Stream the body in chunks instead of loading it all with req.content
        with requests.get(url, stream=True, timeout=60) as req:
            if req.status_code != 200:
                print('[ ERROR ] Status Code: {}'.format(req.status_code))
                return None
            # Download and save to /tmp
            with open(output_filepath, 'wb') as f:
                for chunk in req.iter_content(chunk_size=1024 * 1024):
                    f.write(chunk)
        print('[ INFO ] Successfully downloaded to output_filepath: {} & output_filename: {}'.format(output_filepath, output_filename))
        return output_filename
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
        return None
Solution 3:
Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.
Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.
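Computing the size and MD5 yourself does not require keeping the file on disk: the body can be hashed while it streams. The sketch below is a hedged illustration, not an official tool. The function names are hypothetical, and the output assumes the tab-separated URL list format (a TsvHttpData-1.0 header line followed by URL, size in bytes and base64 MD5) documented for the Transfer Service; check the current documentation before relying on it.

```python
import base64
import hashlib
import urllib.request


def size_and_md5(url: str, chunk_size: int = 1024 * 1024) -> tuple:
    """Stream url once and return (size_in_bytes, base64-encoded MD5)."""
    digest = hashlib.md5()
    size = 0
    with urllib.request.urlopen(url) as response:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
            size += len(chunk)
    return size, base64.b64encode(digest.digest()).decode("ascii")


def url_list(urls) -> str:
    """Build a URL list: a header line, then one tab-separated line per URL."""
    lines = ["TsvHttpData-1.0"]
    for url in urls:
        size, md5_b64 = size_and_md5(url)
        lines.append(f"{url}\t{size}\t{md5_b64}")
    return "\n".join(lines) + "\n"
```

Note that this still transfers every file once in full, so it saves disk space but not bandwidth.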