
Automatically Retrieving Large Files Via Public HTTP Into Google Cloud Storage

For weather processing purposes, I would like to automatically retrieve daily weather forecast data into Google Cloud Storage. The files are available at a public HTTP URL (http://dcpc-

Solution 1:

3/ Workaround with a Compute Engine instance

Since it was not possible to retrieve large files from external HTTP with App Engine or directly with Cloud Storage, I have used a workaround with an always-running Compute Engine instance.

This instance regularly checks if new weather files are available, downloads them and uploads them to a Cloud Storage bucket.

For scalability, maintenance, and cost reasons, I would have preferred to use only serverless services, but on the bright side:

  • It works well on a fresh f1-micro Compute Engine instance (no extra packages required, and only about $4/month when running 24/7)
  • The network traffic from Compute Engine to Google Cloud Storage is free if the instance and the bucket are in the same region ($0/month)
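The download-and-upload loop on the instance can be sketched in Python using only the standard library plus the preinstalled gsutil tool. The source URL, bucket name, and file name below are hypothetical placeholders, not the actual weather endpoint:

```python
import subprocess
import urllib.request

BASE_URL = 'http://example.com/forecasts'  # hypothetical source URL
BUCKET = 'gs://my-weather-bucket'          # hypothetical bucket name

def gsutil_copy_command(local_path, bucket):
    # Build the gsutil invocation that copies a local file into the bucket.
    return ['gsutil', 'cp', local_path, bucket + '/']

def fetch(url, local_path):
    # Stream the remote file to disk in 1 MiB chunks; urllib is in the
    # standard library, so nothing extra is needed on a fresh instance.
    with urllib.request.urlopen(url) as resp, open(local_path, 'wb') as out:
        while True:
            chunk = resp.read(1 << 20)
            if not chunk:
                break
            out.write(chunk)

if __name__ == '__main__':
    name = 'forecast.grib'  # hypothetical file name
    fetch('{}/{}'.format(BASE_URL, name), '/tmp/{}'.format(name))
    subprocess.run(gsutil_copy_command('/tmp/{}'.format(name), BUCKET), check=True)
```

Running this from a cron entry on the instance covers the "regularly checks for new files" part; the streaming read keeps memory usage low even for large files.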

Solution 2:

The MD5 and size of the file can be retrieved quickly with a curl -I (HEAD) request, as described in https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests.
The Storage Transfer Service can then be configured with that information.
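The Storage Transfer Service consumes this information as a tab-separated URL list whose first line is the TsvHttpData-1.0 header, with each data row giving the URL, the size in bytes, and the base64-encoded MD5 digest. A minimal sketch of building such a list (the URL below is a placeholder):

```python
import base64

def tsv_row(url, size, md5_bytes):
    # One data row of a Storage Transfer Service URL list:
    # tab-separated URL, size in bytes, and base64-encoded MD5 digest.
    return '\t'.join([url, str(size), base64.b64encode(md5_bytes).decode('ascii')])

def make_url_list(entries):
    # entries: iterable of (url, size, md5_bytes) tuples.
    lines = ['TsvHttpData-1.0']  # required header for the URL list format
    lines += [tsv_row(*e) for e in entries]
    return '\n'.join(lines) + '\n'
```

The resulting file is hosted somewhere the service can reach (e.g. a public URL or a bucket) and referenced when creating the transfer job.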

Another option would be to use a serverless Cloud Function. In Python, it could look something like the code below.

import requests

def download_url_file(url):
    """Download the file at url into /tmp and return its file name,
    or None if the download failed."""
    try:
        print('[ INFO ] Downloading {}'.format(url))
        req = requests.get(url)
        if req.status_code == 200:
            # Save to /tmp, the only writable path in a Cloud Function
            output_filename = url.split('/')[-1]
            output_filepath = '/tmp/{}'.format(output_filename)
            with open(output_filepath, 'wb') as f:
                f.write(req.content)
            print('[ INFO ] Successfully downloaded to {} as {}'.format(output_filepath, output_filename))
            return output_filename
        print('[ ERROR ] Status Code: {}'.format(req.status_code))
    except Exception as e:
        print('[ ERROR ] {}'.format(e))
    return None
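The function above only saves the file to the function's /tmp directory; the object still has to be written to the bucket. A minimal sketch of that upload step, assuming the google-cloud-storage client library is installed and default credentials are available in the runtime (the bucket name is a placeholder):

```python
def object_name_for(path_or_url):
    # Derive the Cloud Storage object name from the source path or URL
    # (assumption: the last path segment is a usable file name).
    return path_or_url.rsplit('/', 1)[-1]

def upload_to_bucket(local_path, bucket_name):
    # Deferred import so the module loads even where the client
    # library is absent; in a Cloud Function it is a listed dependency.
    from google.cloud import storage
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(object_name_for(local_path))
    blob.upload_from_filename(local_path)
```

Keep in mind that /tmp in a Cloud Function is backed by memory, so very large files may require raising the function's memory limit or falling back to the Compute Engine approach from Solution 1.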

Solution 3:

Currently, the MD5 and size are required for Google's Transfer Service; we understand that in cases like yours, this can be difficult to work with, but unfortunately we don't have a great solution today.

Unless you're able to get the size and MD5 by downloading the files yourself (temporarily), I think that's the best you can do.
