Commit bee7026

Merge branch 'develop'
2 parents 238a45b + 09861ad commit bee7026

16 files changed: 160 additions and 58 deletions

CHANGES.rst

Lines changed: 21 additions & 4 deletions
@@ -1,3 +1,20 @@
+pywb 0.6.6 changelist
+~~~~~~~~~~~~~~~~~~~~~
+
+* JS client side improvements: check for double-inits, preserve anchor in wb.js top location redirect
+
+* JS Rewriters: add mixins for link + location (default), link only, location only rewriting by setting ``js_rewrite_location`` to ``all``, ``urls``, ``location``, respectively.
+
+  (New: location only rewriting does not change JS urls)
+
+* Beginning of new rewrite options, settable per collections and stored in UrlRewriter. Available options:
+
+  - ``rewrite_base`` - set to False to disable rewriting ``<base href="...">`` tag
+  - ``rewrite_rel_canon`` - set to false to disable rewriting ``<link rel=canon href="...">``
+
+* JS rewrite: Don't rewrite location if starting with '$'
+
+
 pywb 0.6.5 changelist
 ~~~~~~~~~~~~~~~~~~~~~

@@ -40,17 +57,17 @@ pywb 0.6.3 changelist
 pywb 0.6.2 changelist
 ~~~~~~~~~~~~~~~~~~~~~

-* Invert framed replay paradigm: Canonical page is always without a modifier (instead of with `mp_`), if using frames, the page redirects to `tf_`, and uses replaceState() to change url back to canonical form.
+* Invert framed replay paradigm: Canonical page is always without a modifier (instead of with ``mp_``), if using frames, the page redirects to ``tf_``, and uses replaceState() to change url back to canonical form.

 * Enable Memento support for framed replay, include Memento headers in top frame

-* Easier to customize just the banner html, via `banner_html` setting in the config. Default banner uses ui/banner.html and inserts the script default_banner.js, which creates the banner.
+* Easier to customize just the banner html, via ``banner_html`` setting in the config. Default banner uses ui/banner.html and inserts the script default_banner.js, which creates the banner.

-  Other implementations may create banner via custom JS or directly insert HTML, as needed. Setting `banner_html: False` will disable the banner.
+  Other implementations may create banner via custom JS or directly insert HTML, as needed. Setting ``banner_html: False`` will disable the banner.

 * Small improvements to streaming response, read in fixed chunks to allow better streaming from live.

-* Improved cookie and csrf-token rewriting, including: ability to set `cookie_scope: root` per collection to have all replayed cookies have their Path set to application root.
+* Improved cookie and csrf-token rewriting, including: ability to set ``cookie_scope: root`` per collection to have all replayed cookies have their Path set to application root.

   This is useful for replaying sites which share cookies amongst different pages and across archived time ranges.
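The last 0.6.6 entry ("Don't rewrite location if starting with '$'") corresponds to a one-character change in a regex lookbehind in regex_rewriters.py. A minimal sketch of the difference follows, using only the two patterns from this commit. The `add_prefix` helper here is a simplified stand-in written for illustration, not pywb's `RegexRewriter.add_prefix`.

```python
import re

# Patterns copied from the diff: before and after this commit.
OLD_LOCATION = re.compile(r'(?<!/)\blocation\b(?!\":)')
NEW_LOCATION = re.compile(r'(?<![/$])\blocation\b(?!\":)')


def add_prefix(text, pattern, prefix='WB_wombat_'):
    """Illustrative helper (assumption): prepend prefix to every match."""
    return pattern.sub(lambda m: prefix + m.group(0), text)


plain = 'window.location = "/x"'
dollar = '$location.path()'

# Ordinary references are prefixed by both patterns.
print(add_prefix(plain, NEW_LOCATION))
# Identifiers such as $location are prefixed by the old pattern only.
print(add_prefix(dollar, OLD_LOCATION))
print(add_prefix(dollar, NEW_LOCATION))
```

The `$` case matters for scripts that use names like `$location`, where prefixing the tail of the identifier would change which variable the script refers to.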

README.rst

Lines changed: 5 additions & 5 deletions
@@ -1,4 +1,4 @@
-PyWb 0.6.5
+PyWb 0.6.6
 ==========

 .. image:: https://travis-ci.org/ikreymer/pywb.png?branch=master
@@ -44,7 +44,7 @@ This README contains a basic overview of using pywb. After reading this intro, c
 pywb Tools Overview
 -----------------------------

-In addition to the standard wayback machine (explained further below), pywb tool suite includes a
+In addition to the standard wayback machine (explained further below), pywb tool suite includes a
 number of useful command-line and web server tools. The tools should be available to run after
 running ``python setup.py install``:

@@ -58,10 +58,10 @@ running ``python setup.py install``:
   for all options.


-* ``cdx-server`` -- a CDX API only server which returns a responses about CDX captures in bulk.
+* ``cdx-server`` -- a CDX API only server which returns a responses about CDX captures in bulk.
   Includes most of the features of the `original cdx server implementation <https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server>`_,
   updated documentation coming soon.
-
+
 * ``proxy-cert-auth`` -- a utility to support proxy mode. It can be used in CA root certificate, or per-host certificate with an existing root cert.


@@ -151,7 +151,7 @@ If you would like to use non-SURT ordered .cdx files, simply add this field to t
 ::

     surt_ordered: false
-
+
 UI Customization
 """""""""""""""""""""

pywb/framework/archivalrouter.py

Lines changed: 3 additions & 1 deletion
@@ -62,7 +62,8 @@ def parse_request(self, route, env, matcher, coll, request_uri,
                               use_abs_prefix=use_abs_prefix,
                               wburl_class=route.handler.get_wburl_type(),
                               urlrewriter_class=UrlRewriter,
-                              cookie_scope=route.cookie_scope)
+                              cookie_scope=route.cookie_scope,
+                              rewrite_opts=route.rewrite_opts)

         # Allow for applying of additional filters
         route.apply_filters(wbrequest, matcher)
@@ -101,6 +102,7 @@ def __init__(self, regex, handler, coll_group=0, config={},
         # collection id from regex group (default 0)
         self.coll_group = coll_group
         self.cookie_scope = config.get('cookie_scope')
+        self.rewrite_opts = config.get('rewrite_opts', {})
         self._custom_init(config)

     def is_handling(self, request_uri):
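As a rough illustration of how the new `rewrite_opts` value flows out of a route's config, the sketch below mirrors the added `config.get('rewrite_opts', {})` line and the `opts.get(name, True)` lookups used later in html_rewriter.py. The config dict and both helper functions are hypothetical. Only the key names `rewrite_opts`, `rewrite_base`, `rewrite_rel_canon` and `cookie_scope` come from this commit.

```python
# Hypothetical per-collection config; only the key names come from the diff.
coll_config = {
    'cookie_scope': 'root',
    'rewrite_opts': {
        'rewrite_base': False,
        'rewrite_rel_canon': False,
    },
}


def route_rewrite_opts(config):
    """Mirror of the added Route line: a missing key yields an empty dict."""
    return config.get('rewrite_opts', {})


def opt_enabled(opts, name):
    """Options default to True, matching the opts.get(name, True) calls."""
    return opts.get(name, True)


opts = route_rewrite_opts(coll_config)
print(opt_enabled(opts, 'rewrite_base'))                    # configured off
print(opt_enabled(route_rewrite_opts({}), 'rewrite_base'))  # default on
```

Because every lookup defaults to True, a collection with no `rewrite_opts` entry keeps rewriting both `<base>` and `rel=canonical` links.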

pywb/framework/wbrequestresponse.py

Lines changed: 4 additions & 2 deletions
@@ -38,7 +38,8 @@ def __init__(self, env,
                  wburl_class=None,
                  urlrewriter_class=None,
                  is_proxy=False,
-                 cookie_scope=None):
+                 cookie_scope=None,
+                 rewrite_opts={}):

         self.env = env

@@ -77,7 +78,8 @@ def __init__(self, env,
                                                  host_prefix + rel_prefix,
                                                  rel_prefix,
                                                  env.get('SCRIPT_NAME', '/'),
-                                                 cookie_scope)
+                                                 cookie_scope,
+                                                 rewrite_opts)

             self.urlrewriter.deprefix_url()
         else:
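A side note on the new `rewrite_opts={}` default: Python evaluates a default argument once, so every call that omits the argument shares one dict. The code in this commit only reads the options with `.get`, so nothing here suggests an actual bug. The sketch below just shows the general behaviour and a common `None`-based alternative. Both function names are made up for illustration.

```python
def opts_shared_default(rewrite_opts={}):
    """Same style as the diff: one dict object is reused across calls."""
    return rewrite_opts


def opts_fresh_default(rewrite_opts=None):
    """Alternative style: build a new dict whenever none is passed."""
    return {} if rewrite_opts is None else rewrite_opts


first = opts_shared_default()
first['rewrite_base'] = False          # mutating the shared default
print(opts_shared_default())           # the mutation is visible here too

fresh = opts_fresh_default()
fresh['rewrite_base'] = False
print(opts_fresh_default())            # still empty
```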

pywb/rewrite/html_rewriter.py

Lines changed: 16 additions & 7 deletions
@@ -92,6 +92,9 @@ def __init__(self, url_rewriter,

         self.rewrite_tags = self._init_rewrite_tags(defmod)

+        # get opts from urlrewriter
+        self.opts = url_rewriter.rewrite_opts
+
     # ===========================
     META_REFRESH_REGEX = re.compile('^[\\d.]+\\s*;\\s*url\\s*=\\s*(.+?)\\s*$',
                                     re.IGNORECASE | re.MULTILINE)
@@ -174,9 +177,11 @@ def _rewrite_tag_attrs(self, tag, tag_attrs):
             elif attr_name == 'crossorigin':
                 attr_name = '_crossorigin'

-            # special case: link don't rewrite canonical
+            # special case: if rewrite_canon not set,
+            # don't rewrite rel=canonical
             elif tag == 'link' and attr_name == 'href':
-                if not self.has_attr(tag_attrs, ('rel', 'canonical')):
+                if (self.opts.get('rewrite_rel_canon', True) or
+                    not self.has_attr(tag_attrs, ('rel', 'canonical'))):
                     rw_mod = handler.get(attr_name)
                     attr_value = self._rewrite_url(attr_value, rw_mod)

@@ -191,17 +196,21 @@ def _rewrite_tag_attrs(self, tag, tag_attrs):
                     rw_mod = 'oe_'
                     attr_value = self._rewrite_url(attr_value, rw_mod)

+            # special case: base tag
+            elif (tag == 'base') and (attr_name == 'href') and attr_value:
+                rw_mod = handler.get(attr_name)
+                base_value = self._rewrite_url(attr_value, rw_mod)
+                if self.opts.get('rewrite_base', True):
+                    attr_value = base_value
+                self.url_rewriter = (self.url_rewriter.
+                                     rebase_rewriter(base_value))
+
             else:
                 # rewrite url using tag handler
                 rw_mod = handler.get(attr_name)
                 if rw_mod is not None:
                     attr_value = self._rewrite_url(attr_value, rw_mod)

-            # special case: base tag
-            if (tag == 'base') and (attr_name == 'href') and attr_value:
-                self.url_rewriter = (self.url_rewriter.
-                                     rebase_rewriter(attr_value))
-
             # write the attr!
             self._write_attr(attr_name, attr_value)
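The two option checks in this file can be read as small decision functions. The sketch below restates them outside the class so the outcomes are easy to see. The function names and the toy `fake_rewrite` callable are assumptions for illustration and are not part of pywb.

```python
def should_rewrite_link_href(opts, is_canonical):
    """Same condition as the new 'link' branch in _rewrite_tag_attrs."""
    return opts.get('rewrite_rel_canon', True) or not is_canonical


def base_href_values(opts, href, rewrite_url):
    """Return (written attribute value, value used for rebasing).

    As in the diff, the rewritten URL is always computed and used for
    rebasing; rewrite_base only decides what is written to the page.
    """
    base_value = rewrite_url(href)
    attr_value = base_value if opts.get('rewrite_base', True) else href
    return attr_value, base_value


def fake_rewrite(url):
    """Toy stand-in for the URL rewriter (assumption)."""
    return '/web/20131226101010/' + url


no_rewrite = {'rewrite_rel_canon': False, 'rewrite_base': False}

print(should_rewrite_link_href({}, is_canonical=True))          # default
print(should_rewrite_link_href(no_rewrite, is_canonical=True))  # opted out
print(base_href_values(no_rewrite, 'http://example.com/a/', fake_rewrite))
```

One consequence visible in the diff and its tests: `rel=canonical` links were previously never rewritten, and after this commit they are rewritten unless `rewrite_rel_canon` is set to False.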

pywb/rewrite/regex_rewriters.py

Lines changed: 26 additions & 9 deletions
@@ -35,7 +35,7 @@ def archival_rewrite(rewriter):

     #DEFAULT_OP = add_prefix

-    def __init__(self, rules):
+    def __init__(self, rewriter, rules):
         #rules = self.create_rules(http_prefix)

         # Build regexstr, concatenating regex list
@@ -106,7 +106,7 @@ def parse_rule(obj):


 #=================================================================
-class JSLinkOnlyRewriter(RegexRewriter):
+class JSLinkRewriterMixin(object):
     """
     JS Rewriter which rewrites absolute http://, https:// and // urls
     at the beginning of a string
@@ -118,19 +118,20 @@ def __init__(self, rewriter, rules=[]):
         rules = rules + [
             (self.JS_HTTPX, RegexRewriter.archival_rewrite(rewriter), 0)
         ]
-        super(JSLinkOnlyRewriter, self).__init__(rules)
+        super(JSLinkRewriterMixin, self).__init__(rewriter, rules)


 #=================================================================
-class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
+class JSLocationRewriterMixin(object):
+#class JSLinkAndLocationRewriter(JSLinkOnlyRewriter):
     """
-    JS Rewriter which also rewrites location and domain to the
+    JS Rewriter mixin which rewrites location and domain to the
     specified prefix (default: 'WB_wombat_')
     """

     def __init__(self, rewriter, rules=[], prefix='WB_wombat_'):
         rules = rules + [
-            (r'(?<!/)\blocation\b(?!\":)', RegexRewriter.add_prefix(prefix), 0),
+            (r'(?<![/$])\blocation\b(?!\":)', RegexRewriter.add_prefix(prefix), 0),
             (r'(?<=document\.)domain', RegexRewriter.add_prefix(prefix), 0),
             (r'(?<=document\.)referrer', RegexRewriter.add_prefix(prefix), 0),
             (r'(?<=document\.)cookie', RegexRewriter.add_prefix(prefix), 0),
@@ -148,7 +149,23 @@ def __init__(self, rewriter, rules=[], prefix='WB_wombat_'):
             #(r'\b(?:self|window)\b[!=\W]+\b(top)\b',
             #RegexRewriter.add_prefix(prefix), 1),
         ]
-        super(JSLinkAndLocationRewriter, self).__init__(rewriter, rules)
+        super(JSLocationRewriterMixin, self).__init__(rewriter, rules)
+
+
+#=================================================================
+class JSLocationOnlyRewriter(JSLocationRewriterMixin, RegexRewriter):
+    pass
+
+
+#=================================================================
+class JSLinkOnlyRewriter(JSLinkRewriterMixin, RegexRewriter):
+    pass
+
+#=================================================================
+class JSLinkAndLocationRewriter(JSLocationRewriterMixin,
+                                JSLinkRewriterMixin,
+                                RegexRewriter):
+    pass


 #=================================================================
@@ -161,7 +178,7 @@ class XMLRewriter(RegexRewriter):
     def __init__(self, rewriter, extra=[]):
         rules = self._create_rules(rewriter)

-        super(XMLRewriter, self).__init__(rules)
+        super(XMLRewriter, self).__init__(rewriter, rules)

     # custom filter to reject 'xmlns' attr
     def filter(self, m):
@@ -189,7 +206,7 @@ class CSSRewriter(RegexRewriter):

     def __init__(self, rewriter):
         rules = self._create_rules(rewriter)
-        super(CSSRewriter, self).__init__(rules)
+        super(CSSRewriter, self).__init__(rewriter, rules)

     def _create_rules(self, rewriter):
         return [
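The refactoring into mixins relies on cooperative `super()` calls: each mixin appends its own rules and passes `(rewriter, rules)` along the method resolution order until the base class stores them. This is also why `RegexRewriter.__init__` gains a `rewriter` parameter in this commit. The toy classes below reproduce only that call pattern. The class names and the string placeholders standing in for rule tuples are illustrative, not the pywb classes themselves.

```python
class BaseRewriter(object):
    """Stand-in for RegexRewriter: end of the cooperative chain."""
    def __init__(self, rewriter, rules):
        self.rewriter = rewriter
        self.rules = rules


class LinkMixin(object):
    def __init__(self, rewriter, rules=[]):
        rules = rules + ['link-rule']          # new list, default not mutated
        super(LinkMixin, self).__init__(rewriter, rules)


class LocationMixin(object):
    def __init__(self, rewriter, rules=[], prefix='WB_wombat_'):
        rules = rules + ['location-rule']
        super(LocationMixin, self).__init__(rewriter, rules)


class LocationOnly(LocationMixin, BaseRewriter):
    pass


class LinkOnly(LinkMixin, BaseRewriter):
    pass


class LinkAndLocation(LocationMixin, LinkMixin, BaseRewriter):
    pass


print([c.__name__ for c in LinkAndLocation.__mro__])
print(LinkAndLocation('rw').rules)
print(LocationOnly('rw').rules)
```

With this base-class order, the location rules are added before the link rules in the combined class, since `LocationMixin` runs first and hands its list to `LinkMixin`.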

pywb/rewrite/rewriterules.py

Lines changed: 5 additions & 4 deletions
@@ -1,7 +1,7 @@
 from pywb.utils.dsrules import BaseRule

 from regex_rewriters import RegexRewriter, CSSRewriter, XMLRewriter
-from regex_rewriters import JSLinkAndLocationRewriter, JSLinkOnlyRewriter
+from regex_rewriters import JSLinkAndLocationRewriter, JSLinkOnlyRewriter, JSLocationOnlyRewriter

 from header_rewriter import HeaderRewriter
 from html_rewriter import HTMLRewriter
@@ -27,12 +27,13 @@ def __init__(self, url_prefix, config={}):
         self.parse_comments = config.get('parse_comments', False)

         # Custom handling for js rewriting, often the most complex
-        self.js_rewrite_location = config.get('js_rewrite_location', True)
-        self.js_rewrite_location = bool(self.js_rewrite_location)
+        self.js_rewrite_location = config.get('js_rewrite_location', 'all')

         # ability to toggle rewriting
-        if self.js_rewrite_location:
+        if self.js_rewrite_location == 'all':
             js_default_class = JSLinkAndLocationRewriter
+        elif self.js_rewrite_location == 'location':
+            js_default_class = JSLocationOnlyRewriter
         else:
             js_default_class = JSLinkOnlyRewriter
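The selection logic in this hunk can be sketched as a standalone function. Class names are returned as strings so the example runs without pywb installed, and the function name `pick_js_rewriter` is made up for illustration. One thing the hunk implies, as far as it can be read from the diff alone: a config that still uses the old boolean `True` no longer compares equal to `'all'`, so it falls through to the link-only branch.

```python
def pick_js_rewriter(config):
    """Mirror of the new branch structure in rewriterules.py."""
    mode = config.get('js_rewrite_location', 'all')
    if mode == 'all':
        return 'JSLinkAndLocationRewriter'
    elif mode == 'location':
        return 'JSLocationOnlyRewriter'
    else:
        # 'urls' and any other value end up here
        return 'JSLinkOnlyRewriter'


print(pick_js_rewriter({}))                                   # default
print(pick_js_rewriter({'js_rewrite_location': 'location'}))
print(pick_js_rewriter({'js_rewrite_location': 'urls'}))
print(pick_js_rewriter({'js_rewrite_location': True}))        # old-style value
```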

pywb/rewrite/test/test_html_rewriter.py

Lines changed: 21 additions & 3 deletions
@@ -20,13 +20,22 @@
 #>>> parse('<input "selected"><img src></div>')
 #<input "selected"=""><img src=""></div>

-# Base Tests
+# Base Tests -- w/ rewrite (default)
 >>> parse('<html><head><base href="http://example.com/diff/path/file.html"/>')
 <html><head><base href="/web/20131226101010/http://example.com/diff/path/file.html"/>

 >>> parse('<base href="static/"/><img src="image.gif"/>')
 <base href="/web/20131226101010/http://example.com/some/path/static/"/><img src="/web/20131226101010im_/http://example.com/some/path/static/image.gif"/>

+# Base Tests -- no rewrite
+>>> parse('<html><head><base href="http://example.com/diff/path/file.html"/>', urlrewriter=no_base_canon_rewriter)
+<html><head><base href="http://example.com/diff/path/file.html"/>
+
+>>> parse('<base href="static/"/><img src="image.gif"/>', urlrewriter=no_base_canon_rewriter)
+<base href="static/"/><img src="/web/20131226101010im_/http://example.com/some/path/static/image.gif"/>
+
+
+
 # HTML Entities
 >>> parse('<a href="">&rsaquo; &nbsp; &#62; &#63</div>')
 <a href="">&rsaquo; &nbsp; &#62; &#63</div>
@@ -102,8 +111,12 @@
 >>> parse('<link href="abc.txt"><div>SomeTest</div>', head_insert = '<script>load_stuff();</script>')
 <link href="/web/20131226101010oe_/http://example.com/some/path/abc.txt"><script>load_stuff();</script><div>SomeTest</div>

-# don't rewrite rel=canonical
+# rel=canonical: rewrite (default)
 >>> parse('<link rel=canonical href="http://example.com/">')
+<link rel="canonical" href="/web/20131226101010oe_/http://example.com/">
+
+# rel=canonical: no_rewrite
+>>> parse('<link rel=canonical href="http://example.com/">', urlrewriter=no_base_canon_rewriter)
 <link rel="canonical" href="http://example.com/">

 # doctype
@@ -143,7 +156,12 @@

 urlrewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html', '/web/')

-def parse(data, head_insert = None):
+no_base_canon_rewriter = UrlRewriter('20131226101010/http://example.com/some/path/index.html',
+                                     '/web/',
+                                     rewrite_opts=dict(rewrite_rel_canon=False,
+                                                       rewrite_base=False))
+
+def parse(data, head_insert=None, urlrewriter=urlrewriter):
     parser = HTMLRewriter(urlrewriter, head_insert = head_insert)
     #data = data.decode('utf-8')
     result = parser.rewrite(data) + parser.close()

pywb/rewrite/test/test_regex_rewriters.py

Lines changed: 1 addition & 1 deletion
@@ -3,7 +3,7 @@
 # Custom Regex
 #=================================================================
 # Test https->http converter (other tests below in subclasses)
->>> RegexRewriter([(RegexRewriter.HTTPX_MATCH_STR, RegexRewriter.remove_https, 0)]).rewrite('a = https://example.com; b = http://example.com; c = https://some-url/path/https://embedded.example.com')
+>>> RegexRewriter(urlrewriter, [(RegexRewriter.HTTPX_MATCH_STR, RegexRewriter.remove_https, 0)]).rewrite('a = https://example.com; b = http://example.com; c = https://some-url/path/https://embedded.example.com')
 'a = http://example.com; b = http://example.com; c = http://some-url/path/http://embedded.example.com'

