Web Loader¶

Loaders¶

class webloader.loader.Loader(outdir='.', num_trials=1, http2=False, timeout=30, disable_local_cache=True, disable_network_cache=False, full_page=True, user_agent=None, headless=True, restart_on_fail=False, restart_each_time=False, proxy=None, save_har=False, save_screenshot=False, save_content='never', retries_per_trial=0, stdout_filename=None, check_protocol_availability=True, save_packet_capture=False, disable_quic=False, disable_spdy=False, log_ssl_keys=False, ignore_certificate_errors=False, delay_after_onload=0, delay_first_trial_only=False, primer_load_first=False, configs=[{'tag': 'default', 'settings': {}}])¶

Superclass for URL loaders. Subclasses implement the actual page load functionality (e.g., using Chrome, PhantomJS, etc.).

Parameters:

outdir – directory for HAR files, screenshots, etc.
num_trials – number of times to load each URL
http2 – use HTTP/2 (not all subclasses support this)
timeout – timeout in seconds
disable_local_cache – disable the local browser cache (RAM and disk)
disable_network_cache – send “Cache-Control: max-age=0” header
full_page – load page’s subresources and render; if False, only the object is fetched
user_agent – use custom user agent; if None, use browser’s default
headless – don’t use GUI (if there normally is one – e.g., browsers)
restart_on_fail – if a load fails, set up the loader again (e.g., restart Chrome)
restart_each_time – tear down and set up the loader before each page load (e.g., restart Chrome to close open connections)
save_har – save a HAR file to the output directory
save_screenshot – save a screenshot to the output directory
save_content – save HTTP message bodies (options: ‘always’, ‘first’, ‘never’)
retries_per_trial – if a trial fails, retry this many times (beyond first)
stdout_filename – if the loader launches other processes (e.g., a browser), send their stdout and stderr to this file. If None, use the parent process’s stdout and stderr.
check_protocol_availability – before loading the page, check whether the specified protocol (HTTP or HTTPS) is supported; otherwise, the loader might silently fall back to a different protocol.
save_packet_capture – save a pcap trace for each load (separate files)
disable_quic – disable use of the QUIC transport protocol
disable_spdy – disable use of SPDY/HTTP2
log_ssl_keys – instruct browser to save SSL session keys (by setting SSLKEYLOGFILE environment variable)
ignore_certificate_errors – continue loading page even if certificate check fails
delay_after_onload – continue recording objects after onLoad fires (ms)
delay_first_trial_only – if fetching a URL multiple times, only delay after onLoad on the first trial. (The delay is useful to count how many objects are loaded after onLoad, and this is less likely to change from trial to trial than load time.)
primer_load_first – load the page once before beginning normal trials (e.g., to prime DNS caches)
configs – TODO: document
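
The way num_trials and retries_per_trial interact can be sketched as below. This is only one reading of the parameter descriptions above, not the library's actual code; run_trials and load_once are hypothetical names introduced for illustration.

```python
def run_trials(load_once, url, num_trials=1, retries_per_trial=0):
    """Return one outcome (True/False) per trial, retrying failed trials.

    load_once is a hypothetical callable that performs a single load and
    returns True on success and False on failure.
    """
    outcomes = []
    for _ in range(num_trials):
        succeeded = False
        # One initial attempt plus up to retries_per_trial further attempts.
        for _ in range(1 + retries_per_trial):
            if load_once(url):
                succeeded = True
                break
        outcomes.append(succeeded)
    return outcomes


attempts = []


def flaky_load(url):
    # Stand-in loader: fails on the very first attempt, succeeds afterwards.
    attempts.append(url)
    return len(attempts) > 1


print(run_trials(flaky_load, "http://example.com", num_trials=2, retries_per_trial=1))
# → [True, True]  (three attempts in total: one retry in the first trial)
```

Under this reading, a trial counts as failed only after all of its attempts fail.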

load_pages(urls)¶

Load each URL in urls num_trials times and collect stats.

Parameters:	urls – list of URLs to load

load_results¶: A dict mapping URLs to a list of LoadResult.

num_restarts¶: Number of times the loader was restarted (e.g., rebooted browser process) due to failures if restart_on_fail is True.

page_results¶: A dict mapping URLs to a PageResult.

urls¶: A cumulative list of the URLs this instance has loaded, in the order they were loaded. Each trial is listed separately.
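
A minimal usage sketch, assuming only the documented load_pages method and page_results attribute. load_and_report and StubLoader are hypothetical names; the stub stands in for a real loader so the example runs without a browser.

```python
from types import SimpleNamespace


def load_and_report(loader, urls):
    """Load urls, then return {url: (status, mean_time)} from page_results."""
    loader.load_pages(urls)
    report = {}
    for url, page in loader.page_results.items():
        report[url] = (page.status, page.mean_time)
    return report


class StubLoader:
    """Hypothetical stand-in exposing the documented Loader interface."""

    def __init__(self):
        self.page_results = {}

    def load_pages(self, urls):
        # Pretend every URL loaded successfully in half a second.
        for url in urls:
            self.page_results[url] = SimpleNamespace(status='SUCCESS', mean_time=0.5)


# With the real package, a Loader subclass instance (for example a
# ChromeLoader constructed with outdir and num_trials) would replace the stub.
print(load_and_report(StubLoader(), ['http://example.com']))
```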

class webloader.phantomjs_loader.PhantomJSLoader(**kwargs)¶: Subclass of Loader that loads pages using PhantomJS.

Note

The PhantomJSLoader currently does not support HTTP2.

Note

The PhantomJSLoader currently does not support local caching.

Note

The PhantomJSLoader currently does not support disabling network caching.

Note

The PhantomJSLoader currently does not support single-object loading (i.e., it always loads the full page).

Note

The PhantomJSLoader currently does not support saving content.

class webloader.chrome_loader.ChromeLoader(**kwargs)¶: Subclass of Loader that loads pages using Chrome.

Note

The ChromeLoader currently does not time page load.

Note

The ChromeLoader currently does not save screenshots.

Note

The ChromeLoader currently does not support single-object loading (i.e., it always loads the full page).

class webloader.firefox_loader.FirefoxLoader(selenium=True, **kwargs)¶: Subclass of Loader that loads pages using Firefox.

Note

The FirefoxLoader currently does not extract HARs.

Note

The FirefoxLoader currently does not save screenshots.

Note

The FirefoxLoader currently does not support single-object loading (i.e., it always loads the full page).

Note

The FirefoxLoader currently does not support disabling network caches.

Note

The FirefoxLoader currently does not support saving content.

class webloader.pythonrequests_loader.PythonRequestsLoader(**kwargs)¶: Subclass of Loader that loads pages using Python requests.

Note

The PythonRequestsLoader currently does not support HTTP2.

Note

The PythonRequestsLoader currently does not support local caching.

Note

The PythonRequestsLoader currently does not support disabling network caching.

Note

The PythonRequestsLoader currently does not support full page loading (i.e., fetching a page’s subresources).

Note

The PythonRequestsLoader currently does not support saving HARs.

Note

The PythonRequestsLoader currently does not support saving screenshots.

Note

The PythonRequestsLoader currently does not support saving content.

class webloader.curl_loader.CurlLoader(**kwargs)¶: Subclass of Loader that loads pages using curl.

Note

The CurlLoader currently does not support HTTP2.

Note

The CurlLoader currently does not support caching.

Note

The CurlLoader currently does not support full page loading (i.e., fetching a page’s subresources).

Note

The CurlLoader currently does not support saving HARs.

Note

The CurlLoader currently does not support saving screenshots.

Note

The CurlLoader currently does not support saving content.

class webloader.nodejs_loader.NodeJsLoader(**kwargs)¶: Subclass of Loader that loads pages using Node.js.

Note

The NodeJsLoader currently does not support caching.

Note

The NodeJsLoader currently does not support full page loading (i.e., fetching a page’s subresources).

Note

The NodeJsLoader currently does not support disabling network caches.

Note

The NodeJsLoader currently does not support saving HARs.

Note

The NodeJsLoader currently does not support saving screenshots.

Note

The NodeJsLoader currently does not support saving content.

class webloader.tcp_loader.TCPLoader(**kwargs)¶: Subclass of Loader that loads pages using a custom executable, which allows TCP settings to be changed.

Note

The TCPLoader currently does not support HTTP2.

Note

The TCPLoader currently does not support local caching.

Note

The TCPLoader currently does not support disabling network caching.

Note

The TCPLoader currently does not support single-object loading (i.e., it always loads the full page).

Note

The TCPLoader currently does not support saving content.
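
The notes above can be collected into a small lookup table for choosing a loader. The sketch below is a transcription of those notes only; the feature names and the loaders_supporting helper are hypothetical, "does not support caching" is read here as local caching, and the absence of a note is not a guarantee that a feature works.

```python
# Features each loader is documented as NOT supporting (from the notes above).
UNSUPPORTED = {
    'PhantomJSLoader': {'http2', 'local_cache', 'disable_network_cache',
                        'single_object', 'save_content'},
    'ChromeLoader': {'timing', 'save_screenshot', 'single_object'},
    'FirefoxLoader': {'save_har', 'save_screenshot', 'single_object',
                      'disable_network_cache', 'save_content'},
    'PythonRequestsLoader': {'http2', 'local_cache', 'disable_network_cache',
                             'full_page', 'save_har', 'save_screenshot',
                             'save_content'},
    'CurlLoader': {'http2', 'local_cache', 'full_page', 'save_har',
                   'save_screenshot', 'save_content'},
    'NodeJsLoader': {'local_cache', 'full_page', 'disable_network_cache',
                     'save_har', 'save_screenshot', 'save_content'},
    'TCPLoader': {'http2', 'local_cache', 'disable_network_cache',
                  'single_object', 'save_content'},
}


def loaders_supporting(*features):
    """Return loader names with no documented limitation for any feature."""
    wanted = set(features)
    return sorted(name for name, missing in UNSUPPORTED.items()
                  if not (wanted & missing))


print(loaders_supporting('save_har', 'full_page'))
# → ['ChromeLoader', 'PhantomJSLoader', 'TCPLoader']
```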

Results¶

class webloader.loader.LoadResult(status, url, final_url=None, time=None, size=None, har=None, img=None, raw=None, server=None, tcp_fast_open_supported=False, tls_false_start_supported=False, tls_session_resumption_supported=False)¶

Status and stats for a single URL load (i.e., one trial).

Parameters:

status – The status of the page load.
url – The original URL.
final_url – The final URL (may be different if we were redirected).
time – The page load time (in seconds).
size – Size of object if loading a single object; total size if loading a full page.
har – Path to the HAR file.
img – Path to a screenshot of the loaded page.
tcp_fast_open_supported – True if TCP fast open was used successfully; False otherwise or if unknown.

FAILURE_NO_200 = 'FAILURE_NO_200'¶: HTTP status code was not 200

FAILURE_TIMEOUT = 'FAILURE_TIMEOUT'¶: Page load timed out

FAILURE_UNKNOWN = 'FAILURE_UNKNOWN'¶: Unknown failure occurred

FAILURE_UNSET = 'FAILURE_UNSET'¶: Status has not been set

SUCCESS = 'SUCCESS'¶: Page load was successful

final_url¶: The final URL (could be different if we were redirected).

har_path¶: Path to the HAR captured during this page load.

image_path¶: Path to a screenshot of the loaded page.

raw¶: Raw output from the underlying command.

server¶: Web server software name.

size¶: Size of the object if loading a single object; total size if loading a full page.

status¶: The status of this page load.

tcp_fast_open_supported¶: Bool indicating whether or not TCP fast open succeeded for this connection.

time¶: The page load time in seconds.

tls_false_start_supported¶: Bool indicating whether or not TLS false start succeeded for this connection.

tls_session_resumption_supported¶: Bool indicating whether or not TLS session resumption succeeded for this connection.

url¶: The original URL requested.
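
As an illustration of how the status constants might be used, the sketch below scans a load_results mapping (URLs to lists of LoadResult, as documented for Loader) and collects the trials that did not succeed. failed_trials is a hypothetical helper, and SimpleNamespace objects stand in for real LoadResult instances.

```python
from types import SimpleNamespace


def failed_trials(load_results):
    """Return {url: [status, ...]} for trials whose status is not SUCCESS."""
    failures = {}
    for url, results in load_results.items():
        bad = [r.status for r in results if r.status != 'SUCCESS']
        if bad:
            failures[url] = bad
    return failures


# Stand-in data shaped like Loader.load_results.
sample = {
    'http://a.example': [SimpleNamespace(status='SUCCESS'),
                         SimpleNamespace(status='FAILURE_TIMEOUT')],
    'http://b.example': [SimpleNamespace(status='SUCCESS')],
}
print(failed_trials(sample))
# → {'http://a.example': ['FAILURE_TIMEOUT']}
```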

class webloader.loader.PageResult(url, status=None, load_results=None)¶

Status and stats for one URL (all trials).

Parameters:

url – The original URL.
status – The overall status of all trials.
load_results – List of individual LoadResult objects.

FAILURE_NOT_ACCESSIBLE = 'FAILURE_NOT_ACCESSIBLE'¶: The page could not be loaded with the specified protocol

FAILURE_UNKNOWN = 'FAILURE_UNKNOWN'¶: An unknown failure occurred

FAILURE_UNSET = 'FAILURE_UNSET'¶: Status has not been set

PARTIAL_SUCCESS = 'PARTIAL_SUCCESS'¶: Some trials were successful

SUCCESS = 'SUCCESS'¶: All trials were successful

load_statuses¶: A list of statuses from individual trials.

mean_time¶: Mean load time across all trials.

median_time¶: Median load time across all trials.

server¶: Web server software name.

sizes¶: A list of the page sizes from individual trials.

status¶: The overall status across all trials.

stddev_time¶: Standard deviation of load time across all trials.

tcp_fast_open_support_statuses¶: A list of bools indicating whether or not TCP fast open succeeded for each load.

times¶: A list of the load times from individual trials.

tls_false_start_support_statuses¶: A list of bools indicating whether or not TLS false start succeeded for each load.

tls_session_resumption_support_statuses¶: A list of bools indicating whether or not TLS session resumption succeeded for each load.

url¶: The URL.
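
The aggregate fields above could be derived from per-trial data roughly as follows. This is a sketch under stated assumptions, not the library's implementation: summarize_trials is a hypothetical helper, failed trials are assumed to report a time of None, the all-failed case is mapped to FAILURE_UNKNOWN arbitrarily, and the sample standard deviation is used because the documentation does not say which variant stddev_time reports.

```python
import statistics


def summarize_trials(statuses, times):
    """Derive an overall status and time statistics from per-trial data."""
    successes = sum(1 for s in statuses if s == 'SUCCESS')
    if statuses and successes == len(statuses):
        overall = 'SUCCESS'
    elif successes > 0:
        overall = 'PARTIAL_SUCCESS'
    else:
        overall = 'FAILURE_UNKNOWN'  # assumption: the real mapping may differ

    # Assumption: trials without a measured time carry None and are skipped.
    valid = [t for t in times if t is not None]
    return {
        'status': overall,
        'mean_time': statistics.mean(valid) if valid else None,
        'median_time': statistics.median(valid) if valid else None,
        # The sample standard deviation needs at least two data points.
        'stddev_time': statistics.stdev(valid) if len(valid) > 1 else None,
    }


summary = summarize_trials(['SUCCESS', 'FAILURE_TIMEOUT', 'SUCCESS'],
                           [1.0, None, 3.0])
print(summary['status'], summary['mean_time'], summary['median_time'])
# → PARTIAL_SUCCESS 2.0 2.0
```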
