crawler
- find_paths(url, regex, node_type='a', tag='href', filter=None, session=None)
Search for node and tag values on a webpage.
- Parameters
url (str) – URL of page to search.
regex (str) – Regex to filter data.
node_type (str, optional) – Node type to search for, by default “a”.
tag (str, optional) – Tag within node to search for, by default “href”.
filter (function, optional) – Function applied to each tag found, by default None.
session (requests.Session, optional) – session to use to open url, by default None.
- Returns
Array with all tags found.
- Return type
np.ndarray
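The idea behind find_paths can be illustrated without network access. The following is only a minimal sketch, not the library's implementation: find_attr_values is a hypothetical helper that parses an HTML string with the standard library instead of fetching a URL, returns a plain list rather than an np.ndarray, and assumes the regex is matched before filter is applied.

```python
import re
from html.parser import HTMLParser


class _AttrCollector(HTMLParser):
    """Collect the values of one attribute from one kind of element."""

    def __init__(self, node_type, tag):
        super().__init__()
        self.node_type = node_type
        self.tag = tag
        self.values = []

    def handle_starttag(self, name, attrs):
        # HTMLParser lowercases element and attribute names for us.
        if name != self.node_type:
            return
        for key, value in attrs:
            if key == self.tag and value is not None:
                self.values.append(value)


def find_attr_values(html, regex, node_type="a", tag="href", filter=None):
    """Hypothetical stand-in for find_paths that works on an HTML string."""
    collector = _AttrCollector(node_type, tag)
    collector.feed(html)
    pattern = re.compile(regex)
    # Assumption: keep values matching the regex, then apply the filter.
    matches = [v for v in collector.values if pattern.search(v)]
    if filter is not None:
        matches = [filter(v) for v in matches]
    return matches


PAGE = """
<html><body>
  <a href="data/tile_001.tif">tile 1</a>
  <a href="data/tile_002.tif">tile 2</a>
  <a href="readme.txt">readme</a>
  <img src="logo.png">
</body></html>
"""

tifs = find_attr_values(PAGE, r"\.tif$")
upper = find_attr_values(PAGE, r"\.tif$", filter=str.upper)
logos = find_attr_values(PAGE, r"\.png$", node_type="img", tag="src")
```

Here node_type and tag play the same roles as in the signature above: the element name and the attribute whose values are collected.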
- crawl(urls, regex, filter_regex, session, label_filter=None, list_out=False)
Search for tags on multiple pages.
- Parameters
urls (dict) – Values are urls to search.
regex (str) – Regex to filter data.
session (requests.Session) – session to use to open url.
- Returns
URLs found.
- Return type
dict
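Since crawl takes a dict of URLs and returns a dict, one plausible reading is that each page is searched in turn and the results are stored under the same key. The sketch below assumes exactly that. crawl_pages and list_links are hypothetical names, the page lookup is faked with an in-memory dict, and the undocumented filter_regex, label_filter and list_out arguments are left out because their behaviour is not described here.

```python
import re


def crawl_pages(urls, regex, list_links):
    """Hypothetical sketch: search every page in `urls` for matching links.

    `list_links` is an assumed callable that returns the links on a page.
    """
    pattern = re.compile(regex)
    found = {}
    for label, url in urls.items():
        # Assumption: results are stored under the same key as the input URL.
        found[label] = [link for link in list_links(url) if pattern.search(link)]
    return found


# In-memory stand-in for a website, so the example runs offline.
FAKE_SITE = {
    "https://example.com/2020/": ["a_2020.nc", "index.html"],
    "https://example.com/2021/": ["a_2021.nc", "b_2021.nc"],
}

pages = {
    "2020": "https://example.com/2020/",
    "2021": "https://example.com/2021/",
}
result = crawl_pages(pages, r"\.nc$", FAKE_SITE.__getitem__)
```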
- download_urls(urls, folder, session=None, fps=None, parallel=0, headers=None, checker_function=None, max_tries=10, wait_sec=15, meta_max_tries=2)
Download URLs to a folder.
- Parameters
urls (list) – URLs to download.
folder (str) – Path to folder in which to download.
session (requests.Session, optional) – Session to use, by default None.
fps (list, optional) – Filepaths to store each URL, should be same length as urls, ignores folder, by default None.
parallel (int, optional) – Download URLs in parallel (currently not implemented), by default 0.
headers (dict, optional) – Headers to include with requests, by default None.
checker_function (function, optional) – Function to apply to downloaded files to see if they are valid, by default None.
max_tries (int, optional) – Max number of tries to download a single file (in case of server side errors), by default 10.
wait_sec (int, optional) – How many seconds to wait between retries, by default 15.
meta_max_tries (int, optional) – Max number of tries to download all URLs before giving up, by default 2.
- Returns
Paths to downloaded files.
- Return type
list
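The interplay of folder, fps, checker_function and meta_max_tries can be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: plan_targets, download_all and download_one are hypothetical names, target filenames are assumed to come from the last segment of the URL path, and a failed download is assumed to surface as an OSError.

```python
import os
from urllib.parse import urlparse


def plan_targets(urls, folder, fps=None):
    """Pick a target path per URL; explicit `fps` takes precedence over `folder`."""
    if fps is not None:
        if len(fps) != len(urls):
            raise ValueError("fps must have the same length as urls")
        return list(fps)
    # Assumption: the filename is the last segment of the URL path.
    return [os.path.join(folder, os.path.basename(urlparse(u).path)) for u in urls]


def download_all(urls, folder, download_one, fps=None,
                 checker_function=None, meta_max_tries=2):
    """Hypothetical sketch: retry the failed subset up to `meta_max_tries` passes."""
    pending = list(zip(urls, plan_targets(urls, folder, fps)))
    done = []
    for _ in range(meta_max_tries):
        failed = []
        for url, fp in pending:
            try:
                download_one(url, fp)
            except OSError:
                failed.append((url, fp))
                continue
            # A file that fails validation is treated like a failed download.
            if checker_function is not None and not checker_function(fp):
                failed.append((url, fp))
                continue
            done.append(fp)
        pending = failed
        if not pending:
            break
    # Paths are listed in completion order, which may differ from `urls`.
    return done


attempts = {}


def flaky(url, fp):
    """Fake downloader: the first request for b.nc fails, everything else works."""
    attempts[url] = attempts.get(url, 0) + 1
    if url.endswith("b.nc") and attempts[url] == 1:
        raise OSError("simulated server error")


urls = ["https://example.com/a.nc", "https://example.com/b.nc"]
paths = download_all(urls, "out", flaky)
```

With the default meta_max_tries=2, the second pass picks up the file that failed in the first pass; with meta_max_tries=1 it would be missing from the returned list.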
- download_url(url, fp, session=None, waitbar=True, headers=None, max_tries=3, wait_sec=15)
Download a URL to a file.
- Parameters
url (str) – URL to download.
fp (str) – Path to file to download into.
session (requests.Session, optional) – Session to use, by default None.
waitbar (bool, optional) – Display a download progress bar, by default True.
headers (dict, optional) – Headers to include with requests, by default None.
max_tries (int, optional) – Max number of tries to download a single file (in case of server side errors), by default 3.
wait_sec (int, optional) – How many seconds to wait between retries, by default 15.
- Returns
Path to downloaded file.
- Return type
str
- Raises
NameError – Raised when max_tries is exceeded.
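The retry behaviour described by max_tries and wait_sec can be sketched like this. It is a simplified illustration, not the library's code: download_with_retries, fetch and sleep are hypothetical names, the network call is replaced by an injected callable so the example runs offline, and a server-side error is assumed to surface as an OSError. The NameError follows the Raises entry above.

```python
import os
import tempfile
import time


def download_with_retries(url, fp, fetch, max_tries=3, wait_sec=15,
                          sleep=time.sleep):
    """Hypothetical sketch: try `fetch(url)` up to `max_tries` times."""
    for attempt in range(1, max_tries + 1):
        try:
            data = fetch(url)
        except OSError:
            # Wait before the next attempt, but not after the last one.
            if attempt < max_tries:
                sleep(wait_sec)
            continue
        with open(fp, "wb") as f:
            f.write(data)
        return fp
    raise NameError(f"Could not download {url} after {max_tries} tries.")


calls = []


def flaky_fetch(url):
    """Fake fetcher: fails twice, then returns some bytes."""
    calls.append(url)
    if len(calls) < 3:
        raise OSError("simulated server error")
    return b"payload"


target = os.path.join(tempfile.mkdtemp(), "file.bin")
result = download_with_retries(
    "https://example.com/file.bin", target, flaky_fetch,
    wait_sec=0, sleep=lambda seconds: None,  # skip real waiting in the demo
)
```

Passing sleep as an argument is a design choice for this sketch only; it keeps the example fast and makes the waiting behaviour easy to test.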