crawler

find_paths(url, regex, node_type='a', tag='href', filter=None, session=None)

Search for node and tag values on a webpage.

Parameters

url (str) – URL of page to search.
regex (str) – Regex to filter data.
node_type (str, optional) – Node type to search for, by default “a”.
tag (str, optional) – Tag within node to search for, by default “href”.
filter (function, optional) – Function applied to each tag found, by default None.
session (requests.Session, optional) – Session to use to open the URL, by default None.

Returns

Array with all tags found.

Return type

np.ndarray
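
The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the module's implementation: find_paths_sketch and _AttributeCollector are hypothetical names, the sketch parses an HTML string instead of fetching a URL, it returns a plain list rather than an np.ndarray, and it assumes the regex is applied with re.search before the filter function is applied.

```python
import re
from html.parser import HTMLParser


class _AttributeCollector(HTMLParser):
    """Collect the value of one attribute from every node of one type."""

    def __init__(self, node_type, tag):
        super().__init__()
        self.node_type = node_type
        self.tag = tag
        self.values = []

    def handle_starttag(self, name, attrs):
        # HTMLParser lowercases node and attribute names before this call.
        if name != self.node_type:
            return
        for key, value in attrs:
            if key == self.tag and value is not None:
                self.values.append(value)


def find_paths_sketch(html, regex, node_type="a", tag="href", filter=None):
    """Hypothetical stand-in for find_paths that works on an HTML string."""
    collector = _AttributeCollector(node_type, tag)
    collector.feed(html)
    collector.close()
    pattern = re.compile(regex)
    # Assumption: keep only the values in which the regex matches somewhere.
    matches = [value for value in collector.values if pattern.search(value)]
    # Assumption: the optional function is applied to each remaining value.
    if filter is not None:
        matches = [filter(value) for value in matches]
    return matches


page = """
<a href="data/2020_01.tif">January</a>
<a href="data/2020_02.tif">February</a>
<a href="about.html">About</a>
<img src="logo.png">
"""
found = find_paths_sketch(page, r"\.tif$", filter=lambda value: value.upper())
print(found)
```

With the defaults, only the href values of a nodes are inspected; passing node_type="img" and tag="src" would collect image sources instead.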

crawl(urls, regex, filter_regex, session, label_filter=None, list_out=False)

Search for tags on multiple pages.

Parameters

urls (dict) – Dictionary whose values are the URLs to search.
regex (str) – Regex to filter data.
session (requests.Session) – Session to use to open each URL.

Returns

URLs found.

Return type

dict

download_urls(urls, folder, session=None, fps=None, parallel=0, headers=None, checker_function=None, max_tries=10, wait_sec=15, meta_max_tries=2)

Download URLs to a folder.

Parameters

urls (list) – URLs to download.
folder (str) – Path to folder in which to download.
session (requests.Session, optional) – Session to use, by default None.
fps (list, optional) – Filepaths to store each URL, should be same length as urls, ignores folder, by default None.
parallel (int, optional) – Download URLs in parallel (currently not implemented), by default 0.
headers (dict, optional) – Headers to include with requests, by default None.
checker_function (function, optional) – Function to apply to downloaded files to see if they are valid, by default None.
max_tries (int, optional) – Max number of tries to download a single file (in case of server side errors), by default 10.
wait_sec (int, optional) – How many seconds to wait between retries, by default 15.
meta_max_tries (int, optional) – Max number of tries to download all URLs before giving up, by default 2.

Returns

Paths to downloaded files.

Return type

list
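
The interaction between folder and fps can be illustrated as follows. This is a sketch under stated assumptions, not the module's code: resolve_targets is a hypothetical helper, and deriving the filename from the last segment of the URL path when fps is None is an assumption, since the documentation does not say how filenames are chosen.

```python
import os
from urllib.parse import urlparse


def resolve_targets(urls, folder, fps=None):
    """Hypothetical helper: decide where each URL would be stored."""
    if fps is not None:
        # Documented behaviour: fps should match urls in length and,
        # when given, takes precedence over folder.
        if len(fps) != len(urls):
            raise ValueError("fps must have the same length as urls")
        return list(fps)
    # Assumption: name each file after the last segment of the URL path.
    return [
        os.path.join(folder, os.path.basename(urlparse(url).path))
        for url in urls
    ]


example_urls = [
    "https://example.com/data/2020_01.tif",
    "https://example.com/data/2020_02.tif?token=abc",
]
print(resolve_targets(example_urls, "downloads"))
print(resolve_targets(example_urls, "downloads", fps=["jan.tif", "feb.tif"]))
```

The query string of the second URL is dropped because only the path component is used to build the filename.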

download_url(url, fp, session=None, waitbar=True, headers=None, max_tries=3, wait_sec=15)

Download a URL to a file.

Parameters

url (str) – URL to download.
fp (str) – Path to file to download into.
session (requests.Session, optional) – Session to use, by default None.
waitbar (bool, optional) – Display a download progress bar, by default True.
headers (dict, optional) – Headers to include with requests, by default None.
max_tries (int, optional) – Max number of tries to download a single file (in case of server side errors), by default 3.
wait_sec (int, optional) – How many seconds to wait between retries, by default 15.

Returns

Path to downloaded file.

Return type

str

Raises

NameError – Raised when max_tries is exceeded.
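
The retry behaviour controlled by max_tries and wait_sec can be sketched as below. This is an illustration only: download_with_retries and the injected fetch callable are hypothetical, the choice to retry on OSError is an assumption, and the sketch interprets "exceeded" as "all max_tries attempts failed". Only the NameError on failure is taken from the documentation above.

```python
import time


def download_with_retries(fetch, url, fp, max_tries=3, wait_sec=15):
    """Hypothetical retry loop: call fetch(url, fp) until it succeeds."""
    for attempt in range(1, max_tries + 1):
        try:
            fetch(url, fp)
            return fp
        except OSError:
            # Assumption: only I/O-style errors are retried. Wait before
            # the next attempt, but not after the final one.
            if attempt < max_tries:
                time.sleep(wait_sec)
    # Documented behaviour: NameError signals that the tries ran out.
    raise NameError(f"Could not download {url} after {max_tries} tries.")


calls = []


def flaky_fetch(url, fp):
    """Simulated download that fails twice before succeeding."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("simulated server-side error")


result = download_with_retries(
    flaky_fetch, "https://example.com/file.tif", "file.tif", wait_sec=0
)
print(result, len(calls))
```

Passing the fetch step in as a callable keeps the sketch runnable without network access; in practice it would wrap a call made through the requests.Session passed as session.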