Web Loader

Loaders

class webloader.loader.Loader(outdir='.', num_trials=1, http2=False, timeout=30, disable_local_cache=True, disable_network_cache=False, full_page=True, user_agent=None, headless=True, restart_on_fail=False, restart_each_time=False, proxy=None, save_har=False, save_screenshot=False, save_content='never', retries_per_trial=0, stdout_filename=None, check_protocol_availability=True, save_packet_capture=False, disable_quic=False, disable_spdy=False, log_ssl_keys=False, ignore_certificate_errors=False, delay_after_onload=0, delay_first_trial_only=False, primer_load_first=False, configs=[{'tag': 'default', 'settings': {}}])

Superclass for URL loaders. Subclasses implement the actual page load functionality (e.g., using Chrome, PhantomJS, etc.).

Parameters:
  • outdir – directory for HAR files, screenshots, etc.
  • num_trials – number of times to load each URL
  • http2 – use HTTP/2 (not all subclasses support this)
  • timeout – timeout in seconds
  • disable_local_cache – disable the local browser cache (RAM and disk)
  • disable_network_cache – send “Cache-Control: max-age=0” header
  • full_page – load page’s subresources and render; if False, only the object is fetched
  • user_agent – use custom user agent; if None, use browser’s default
  • headless – don’t use GUI (if there normally is one – e.g., browsers)
  • restart_on_fail – if a load fails, set up the loader again (e.g., restart Chrome)
  • restart_each_time – tear down and set up the loader before each page load (e.g., restart Chrome to close open connections)
  • save_har – save a HAR file to the output directory
  • save_screenshot – save a screenshot to the output directory
  • save_content – save HTTP message bodies (options: ‘always’, ‘first’, ‘never’)
  • retries_per_trial – if a trial fails, retry this many times (beyond first)
  • stdout_filename – if the loader launches other processes (e.g., a browser), send their stdout and stderr to this file. If None, use the parent process’s stdout and stderr.
  • check_protocol_availability – before loading the page, check whether the specified protocol (HTTP or HTTPS) is supported. (Otherwise, the loader might silently fall back to a different protocol.)
  • save_packet_capture – save a pcap trace for each load (separate files)
  • disable_quic – disable use of the QUIC transport protocol
  • disable_spdy – disable use of SPDY/HTTP2
  • log_ssl_keys – instruct browser to save SSL session keys (by setting SSLKEYLOGFILE environment variable)
  • ignore_certificate_errors – continue loading page even if certificate check fails
  • delay_after_onload – continue recording objects after onLoad fires (ms)
  • delay_first_trial_only – if fetching a URL multiple times, only delay after onLoad on the first trial. (The delay is useful to count how many objects are loaded after onLoad, and this is less likely to change from trial to trial than load time.)
  • primer_load_first – load the page once before beginning normal trials (e.g., to prime DNS caches)
  • configs – TODO: document
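
As a rough illustration of how num_trials, retries_per_trial, and primer_load_first interact, the sketch below computes an upper bound on the number of loads attempted per URL. The helper max_loads_per_url is hypothetical and not part of webloader. It assumes each trial is attempted once plus up to retries_per_trial retries, with a single extra load when primer_load_first is set; the loader itself may count differently.

```python
def max_loads_per_url(num_trials=1, retries_per_trial=0, primer_load_first=False):
    """Upper bound on loads per URL (assumed semantics, not webloader code)."""
    # Each trial: one initial attempt plus up to retries_per_trial retries.
    attempts_per_trial = 1 + retries_per_trial
    # Assumption: the primer load is one extra load before the normal trials.
    primer = 1 if primer_load_first else 0
    return primer + num_trials * attempts_per_trial


print(max_loads_per_url(num_trials=3, retries_per_trial=2, primer_load_first=True))  # prints 10
```
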
load_pages(urls)

Load each URL in urls num_trials times and collect stats.

Parameters: urls – list of URLs to load
load_results

A dict mapping URLs to a list of LoadResult.

num_restarts

Number of times the loader was restarted (e.g., rebooted browser process) due to failures if restart_on_fail is True.

page_results

A dict mapping URLs to a PageResult.

urls

A cumulative list of the URLs this instance has loaded, in the order they were loaded. Each trial is listed separately.
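
A minimal usage sketch based on the attributes documented above. The helper summarize_loads is hypothetical; it relies only on load_pages, page_results, and the PageResult attributes status and mean_time described in this reference. The commented-out lines show how it might be called with ChromeLoader; the output directory and URL are placeholders, and for loaders that do not time page loads the mean time may not be meaningful.

```python
def summarize_loads(loader, urls):
    """Load each URL and return {url: (overall status, mean load time)}."""
    loader.load_pages(urls)
    summary = {}
    # page_results maps each URL to a PageResult covering all of its trials.
    for url, page_result in loader.page_results.items():
        summary[url] = (page_result.status, page_result.mean_time)
    return summary


# Possible usage (requires webloader and a Chrome installation):
# from webloader.chrome_loader import ChromeLoader
# loader = ChromeLoader(outdir='/tmp/loads', num_trials=3, save_har=True)
# print(summarize_loads(loader, ['https://example.com']))
```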

class webloader.phantomjs_loader.PhantomJSLoader(**kwargs)

Subclass of Loader that loads pages using PhantomJS.

Note

The PhantomJSLoader currently does not support HTTP2.

Note

The PhantomJSLoader currently does not support local caching.

Note

The PhantomJSLoader currently does not support disabling network caching.

Note

The PhantomJSLoader currently does not support single-object loading (i.e., it always loads the full page).

Note

The PhantomJSLoader currently does not support saving content.

class webloader.chrome_loader.ChromeLoader(**kwargs)

Subclass of Loader that loads pages using Chrome.

Note

The ChromeLoader currently does not time page load.

Note

The ChromeLoader currently does not save screenshots.

Note

The ChromeLoader currently does not support single-object loading (i.e., it always loads the full page).

class webloader.firefox_loader.FirefoxLoader(selenium=True, **kwargs)

Subclass of Loader that loads pages using Firefox.

Note

The FirefoxLoader currently does not extract HARs.

Note

The FirefoxLoader currently does not save screenshots.

Note

The FirefoxLoader currently does not support single-object loading (i.e., it always loads the full page).

Note

The FirefoxLoader currently does not support disabling network caches.

Note

The FirefoxLoader currently does not support saving content.

class webloader.pythonrequests_loader.PythonRequestsLoader(**kwargs)

Subclass of Loader that loads pages using Python requests.

Note

The PythonRequestsLoader currently does not support HTTP2.

Note

The PythonRequestsLoader currently does not support local caching.

Note

The PythonRequestsLoader currently does not support disabling network caching.

Note

The PythonRequestsLoader currently does not support full page loading (i.e., fetching a page’s subresources).

Note

The PythonRequestsLoader currently does not support saving HARs.

Note

The PythonRequestsLoader currently does not support saving screenshots.

Note

The PythonRequestsLoader currently does not support saving content.

class webloader.curl_loader.CurlLoader(**kwargs)

Subclass of Loader that loads pages using curl.

Note

The CurlLoader currently does not support HTTP2.

Note

The CurlLoader currently does not support caching.

Note

The CurlLoader currently does not support full page loading (i.e., fetching a page’s subresources).

Note

The CurlLoader currently does not support saving HARs.

Note

The CurlLoader currently does not support saving screenshots.

Note

The CurlLoader currently does not support saving content.

class webloader.nodejs_loader.NodeJsLoader(**kwargs)

Subclass of Loader that loads pages using Node.js.

Note

The NodeJsLoader currently does not support caching.

Note

The NodeJsLoader currently does not support full page loading (i.e., fetching a page’s subresources).

Note

The NodeJsLoader currently does not support disabling network caches.

Note

The NodeJsLoader currently does not support saving HARs.

Note

The NodeJsLoader currently does not support saving screenshots.

Note

The NodeJsLoader currently does not support saving content.

class webloader.tcp_loader.TCPLoader(**kwargs)

Subclass of Loader that loads pages using a custom executable so that TCP settings can be changed.

Note

The TCPLoader currently does not support HTTP2.

Note

The TCPLoader currently does not support local caching.

Note

The TCPLoader currently does not support disabling network caching.

Note

The TCPLoader currently does not support single-object loading (i.e., it always loads the full page).

Note

The TCPLoader currently does not support saving content.
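
The limitations noted for each loader can be collected into a lookup table to help pick a loader for a given experiment. The sketch below transcribes the notes above; the feature labels (such as 'save_har' or 'single_object') and the helper loaders_without_limitation are illustrative names chosen here, not webloader identifiers. A loader appearing in the result only means its notes do not list the feature as unsupported.

```python
# Limitations transcribed from the notes above. The feature labels are
# illustrative names for this sketch, not webloader identifiers.
UNSUPPORTED = {
    'PhantomJSLoader': {'http2', 'local_cache', 'disable_network_cache',
                        'single_object', 'save_content'},
    'ChromeLoader': {'timing', 'save_screenshot', 'single_object'},
    'FirefoxLoader': {'save_har', 'save_screenshot', 'single_object',
                      'disable_network_cache', 'save_content'},
    'PythonRequestsLoader': {'http2', 'local_cache', 'disable_network_cache',
                             'full_page', 'save_har', 'save_screenshot',
                             'save_content'},
    'CurlLoader': {'http2', 'caching', 'full_page', 'save_har',
                   'save_screenshot', 'save_content'},
    'NodeJsLoader': {'caching', 'full_page', 'disable_network_cache',
                     'save_har', 'save_screenshot', 'save_content'},
    'TCPLoader': {'http2', 'local_cache', 'disable_network_cache',
                  'single_object', 'save_content'},
}


def loaders_without_limitation(*features):
    """Names of loaders whose notes list none of the given features."""
    wanted = set(features)
    return sorted(name for name, missing in UNSUPPORTED.items()
                  if not (wanted & missing))


print(loaders_without_limitation('save_har', 'http2'))  # prints ['ChromeLoader']
```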

Results

class webloader.loader.LoadResult(status, url, final_url=None, time=None, size=None, har=None, img=None, raw=None, server=None, tcp_fast_open_supported=False, tls_false_start_supported=False, tls_session_resumption_supported=False)

Status and stats for a single URL load (i.e., one trial).

Parameters:
  • status – The status of the page load.
  • url – The original URL.
  • final_url – The final URL (may be different if we were redirected).
  • time – The page load time (in seconds).
  • size – Size of object if loading a single object; total size if loading a full page.
  • har – Path to the HAR file.
  • img – Path to a screenshot of the loaded page.
  • tcp_fast_open_supported – True if TCP fast open was used successfully; False otherwise or if unknown
FAILURE_NO_200 = 'FAILURE_NO_200'

HTTP status code was not 200

FAILURE_TIMEOUT = 'FAILURE_TIMEOUT'

Page load timed out

FAILURE_UNKNOWN = 'FAILURE_UNKNOWN'

An unknown failure occurred

FAILURE_UNSET = 'FAILURE_UNSET'

Status has not been set

SUCCESS = 'SUCCESS'

Page load was successful

final_url

The final URL (could be different if we were redirected).

har_path

Path to the HAR captured during this page load.

image_path

Path to a screenshot of the loaded page.

raw

Raw output from the underlying command.

server

Web server software name.

size

Size of the object if a single object was loaded; total size if a full page was loaded.

status

The status of this page load.

tcp_fast_open_supported

Bool indicating whether or not TCP fast open succeeded for this connection.

time

The page load time in seconds.

tls_false_start_supported

Bool indicating whether or not TLS false start succeeded for this connection.

tls_session_resumption_supported

Bool indicating whether or not TLS session resumption succeeded for this connection.

url

The original URL requested.
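
The status constants above are plain strings, so per-trial results can be tallied with standard tools. The following sketch assumes only the status and time attributes documented for LoadResult; the helpers count_statuses and successful_times are hypothetical names, not part of webloader.

```python
from collections import Counter


def count_statuses(load_results):
    """Count how many trials ended in each status string."""
    return Counter(result.status for result in load_results)


def successful_times(load_results):
    """Load times (in seconds) of trials whose status is 'SUCCESS'."""
    return [r.time for r in load_results
            if r.status == 'SUCCESS' and r.time is not None]


# Possible usage with a loader that has already run load_pages:
# for url, results in loader.load_results.items():
#     print(url, count_statuses(results), successful_times(results))
```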

class webloader.loader.PageResult(url, status=None, load_results=None)

Status and stats for one URL (all trials).

Parameters:
  • url – The original URL.
  • status – The overall status of all trials.
  • load_results – List of individual LoadResult objects
FAILURE_NOT_ACCESSIBLE = 'FAILURE_NOT_ACCESSIBLE'

The page could not be loaded with the specified protocol

FAILURE_UNKNOWN = 'FAILURE_UNKNOWN'

An unknown failure occurred

FAILURE_UNSET = 'FAILURE_UNSET'

Status has not been set

PARTIAL_SUCCESS = 'PARTIAL_SUCCESS'

Some trials were successful

SUCCESS = 'SUCCESS'

All trials were successful
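
One plausible reading of how the overall status relates to the per-trial statuses is sketched below. The function overall_status is hypothetical. Returning FAILURE_UNKNOWN when no trial succeeded and FAILURE_UNSET for an empty list are assumptions made for this sketch; it does not model FAILURE_NOT_ACCESSIBLE, and webloader's own rule may differ.

```python
def overall_status(load_statuses):
    """Derive a PageResult-style status from per-trial status strings."""
    if not load_statuses:
        # Assumption: no trials recorded yet.
        return 'FAILURE_UNSET'
    successes = sum(1 for status in load_statuses if status == 'SUCCESS')
    if successes == len(load_statuses):
        return 'SUCCESS'
    if successes > 0:
        return 'PARTIAL_SUCCESS'
    # Assumption: no trial succeeded; the real library may be more specific.
    return 'FAILURE_UNKNOWN'


print(overall_status(['SUCCESS', 'FAILURE_TIMEOUT']))  # prints PARTIAL_SUCCESS
```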

load_statuses

A list of statuses from individual trials.

mean_time

Mean load time across all trials.

median_time

Median load time across all trials.

server

Web server software name.

sizes

A list of the page sizes from individual trials.

status

The overall status across all trials.

stddev_time

Standard deviation of load time across all trials.

tcp_fast_open_support_statuses

A list of bools indicating whether or not TCP fast open succeeded for each load.

times

A list of the load times from individual trials.

tls_false_start_support_statuses

A list of bools indicating whether or not TLS false start succeeded for each load.

tls_session_resumption_support_statuses

A list of bools indicating whether or not TLS session resumption succeeded for each load.

url

The URL.
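
The timing statistics above can be reproduced from the times list with the standard library, which is useful when recomputing them over a filtered subset of trials. In this sketch the helper summarize_times is hypothetical, None entries are skipped on the assumption that failed trials have no time, and the sample standard deviation is used; whether webloader uses the sample or population form is not stated here.

```python
import statistics


def summarize_times(times):
    """Mean, median, and sample standard deviation of the non-None times."""
    valid = [t for t in times if t is not None]
    if not valid:
        return None
    return {
        'mean': statistics.mean(valid),
        'median': statistics.median(valid),
        # stdev needs at least two data points; report 0.0 for a single trial.
        'stddev': statistics.stdev(valid) if len(valid) > 1 else 0.0,
    }


print(summarize_times([1.0, 2.0, 3.0]))
# prints {'mean': 2.0, 'median': 2.0, 'stddev': 1.0}
```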
