A couple of years ago I wrote an HTTP proxy that would save all viewed web pages. It was written in Java and used a relational database as storage. It had a streaming architecture, which meant that complete requests and responses were not kept in memory, but rather flowed through the proxy. This meant that it could handle responses of any size, including large file downloads, without using much memory. You could write filters that would modify requests and responses, as well as block them completely. Storing web pages was implemented as a special filter.
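To make the streaming idea concrete, here is a minimal sketch in Python (not the original Java code, which is not shown in this post). All names in it (`Blocked`, `upper_filter`, `block_if_starts_with`, `stream_through`) are hypothetical and chosen for illustration. Each filter is a generator that consumes an iterator of byte chunks and yields transformed chunks, so only one chunk needs to be in memory at a time; a filter blocks a message by raising an exception.

```python
class Blocked(Exception):
    """Raised by a filter to reject a message entirely (hypothetical name)."""


def upper_filter(chunks):
    """Modify the stream: uppercase each chunk as it passes through."""
    for chunk in chunks:
        yield chunk.upper()


def block_if_starts_with(prefix):
    """Build a filter that rejects a stream whose first chunk starts with prefix."""
    def apply(chunks):
        first = True
        for chunk in chunks:
            if first and chunk.startswith(prefix):
                raise Blocked(prefix)
            first = False
            yield chunk
    return apply


def stream_through(chunks, filters):
    """Chain filters lazily; nothing is read until the result is iterated."""
    for f in filters:
        chunks = f(chunks)
    return chunks


if __name__ == "__main__":
    source = iter([b"hello ", b"world"])
    for chunk in stream_through(source, [upper_filter]):
        print(chunk)
```

Because the chain is built from generators, a large download flows through chunk by chunk instead of being collected first.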
I never got around to implementing all the features I wanted. Hoping to change that, I am taking a shot at implementing an HTTP proxy in Python. Obviously, socket programming is needed on some level. While my original Java proxy used asynchronous sockets directly, I am thinking about using the Twisted framework in Python. Twisted is “an event-driven networking engine written in Python.” It provides lots of different protocol implementations, including HTTP. In fact, there is a simple HTTP proxy, too. Starting a proxy instance is easy:
```python
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy

factory = http.HTTPFactory()
factory.protocol = Proxy
reactor.listenTCP(8000, factory)
reactor.run()
```
However, this proxy isn’t very interesting. If we would like to print every URI that is requested through the proxy, we could subclass it like so:
```python
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest


class VerboseProxyRequest(ProxyRequest):
    def process(self):
        # Log the requested URI, then let ProxyRequest forward the request
        print(self.uri.decode("utf-8", "replace"))
        ProxyRequest.process(self)


class VerboseProxy(Proxy):
    requestFactory = VerboseProxyRequest


factory = http.HTTPFactory()
factory.protocol = VerboseProxy
reactor.listenTCP(8000, factory)
reactor.run()
```
The ProxyRequest implementation of process creates a new HTTP client, an instance of the class ProxyClient. The HTTP client simply forwards everything it receives to the response transport channel of the parent ProxyRequest instance. Creating a simple ad blocker is easy:
```python
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest


class BlockingProxyRequest(ProxyRequest):
    def process(self):
        if b"ads" in self.uri:
            print("Blocked: " + self.uri.decode("utf-8", "replace"))
            self.transport.write(b"HTTP/1.0 404 Not Found\r\n")
            self.transport.write(b"Content-Type: text/html\r\n")
            self.transport.write(b"\r\n")
            self.transport.write(b"<H1>Resource not found</H1>")
            self.transport.loseConnection()
            return  # Do not forward the blocked request upstream
        ProxyRequest.process(self)


class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest


factory = http.HTTPFactory()
factory.protocol = BlockingProxy
reactor.listenTCP(8000, factory)
reactor.run()
```
This code will block any request that has the string “ads” in the URI. This may be overly aggressive for a real ad blocker, though.
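The substring test also matches harmless URIs such as anything containing `downloads` or `uploads`. One less aggressive option is to match on the host name instead. The following is only a sketch under assumptions: `BLOCKED_HOSTS` and `is_blocked` are hypothetical names, the host list is made up for illustration, and the function expects the URI as text (a URI held as bytes would need decoding first).

```python
from urllib.parse import urlsplit

# Hypothetical blocklist, for illustration only.
BLOCKED_HOSTS = {"ads.example.com", "adserver.example.net"}


def is_blocked(uri, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URI's host is a blocked host or a subdomain of one."""
    host = urlsplit(uri).hostname  # lowercased by urlsplit; None if absent
    if host is None:
        return False
    host = host.rstrip(".")
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


if __name__ == "__main__":
    for uri in ("http://ads.example.com/banner.gif",
                "http://example.org/downloads/file.zip"):
        print(uri, is_blocked(uri))
```

Matching whole host names (and their subdomains) avoids false positives from path components, at the cost of having to maintain a list.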
Twisted handles all the details of the HTTP requests and responses, as well as the underlying transport protocol. One drawback is that the provided HTTP client implementation does not support HTTP/1.1 yet. This may affect performance, but will not be an issue for my uses.
To get at the response data, we need to subclass the ProxyClient class:
```python
from io import BytesIO

from PIL.ImageFile import Parser
from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest, ProxyClientFactory, ProxyClient

IMAGE_TYPES = [b"image/jpeg", b"image/gif", b"image/png"]


class InterceptingProxyClient(ProxyClient):
    def __init__(self, *args, **kwargs):
        ProxyClient.__init__(self, *args, **kwargs)
        self.image_parser = None

    def handleHeader(self, key, value):
        # Note: this relies on Content-Type arriving before Content-Length
        if key.lower() == b"content-type" and value in IMAGE_TYPES:
            self.image_parser = Parser()
        if key.lower() == b"content-length" and self.image_parser:
            pass  # Dropped here; the length is recalculated after rotating
        else:
            ProxyClient.handleHeader(self, key, value)

    def handleEndHeaders(self):
        if self.image_parser:
            pass  # Need to calculate and send Content-Length first
        else:
            ProxyClient.handleEndHeaders(self)

    def handleResponsePart(self, buffer):
        if self.image_parser:
            self.image_parser.feed(buffer)
        else:
            ProxyClient.handleResponsePart(self, buffer)

    def handleResponseEnd(self):
        if self.image_parser:
            # This method may be called again when the connection closes,
            # so clear the parser to make sure this branch runs only once
            parser, self.image_parser = self.image_parser, None
            try:
                image = parser.close()
                format = image.format  # Save it; the rotated copy may not carry it
                image = image.rotate(180)
                s = BytesIO()
                image.save(s, format)
                buffer = s.getvalue()
            except Exception:
                buffer = b""  # Undecodable image: send an empty body
            length = str(len(buffer)).encode("ascii")
            ProxyClient.handleHeader(self, b"Content-Length", length)
            ProxyClient.handleEndHeaders(self)
            ProxyClient.handleResponsePart(self, buffer)
        ProxyClient.handleResponseEnd(self)


class InterceptingProxyClientFactory(ProxyClientFactory):
    protocol = InterceptingProxyClient


class InterceptingProxyRequest(ProxyRequest):
    protocols = {b"http": InterceptingProxyClientFactory}


class InterceptingProxy(Proxy):
    requestFactory = InterceptingProxyRequest


factory = http.HTTPFactory()
factory.protocol = InterceptingProxy
reactor.listenTCP(8000, factory)
reactor.run()
```
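Stripped of the Twisted specifics, the pattern in `InterceptingProxyClient` is: hold back the `Content-Length` header, buffer the body, transform it, then send a recomputed length followed by the new body. The sketch below isolates that pattern using only the standard library. `BufferingRewriter`, `transform` and `emit` are hypothetical names and not part of Twisted; unlike the class above, this version holds back all headers until the end, so it does not depend on header order.

```python
class BufferingRewriter:
    """Buffer a response body, transform it, and fix up Content-Length.

    Hypothetical helper, not a Twisted class. `transform` maps the full
    body (bytes) to a new body; `emit` receives each outgoing piece.
    """

    def __init__(self, transform, emit):
        self.transform = transform
        self.emit = emit
        self.headers = []
        self.parts = []

    def handle_header(self, key, value):
        # Drop the upstream length; it will be wrong after the transform.
        if key.lower() != b"content-length":
            self.headers.append((key, value))

    def handle_part(self, data):
        # Collect body chunks instead of forwarding them immediately.
        self.parts.append(data)

    def handle_end(self):
        body = self.transform(b"".join(self.parts))
        length = str(len(body)).encode("ascii")
        self.headers.append((b"Content-Length", length))
        for key, value in self.headers:
            self.emit(key + b": " + value + b"\r\n")
        self.emit(b"\r\n")  # Blank line ends the header section
        self.emit(body)


# Usage: double the body and watch the length change from 5 to 10.
out = []
rewriter = BufferingRewriter(transform=lambda b: b + b, emit=out.append)
rewriter.handle_header(b"Content-Type", b"text/plain")
rewriter.handle_header(b"Content-Length", b"5")
rewriter.handle_part(b"ab")
rewriter.handle_part(b"cde")
rewriter.handle_end()
```

Note the trade-off: buffering the whole body gives up the streaming property, so memory use grows with the size of the response. That is acceptable for images but not for large downloads.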
The intercepting proxy rotates JPEG, GIF and PNG images 180 degrees, so a Google image search comes back with every image upside down.

Fun, but not very useful. It would have been just as easy to store the image, as well as other content, on disk or in a database, but that is a topic for another blog post.




