There’s Something About Code

June 17, 2009

Twisted HTTP proxy

Filed under: Code | Knut Eldhuset @ 20:15

A couple of years ago I wrote an HTTP proxy that would save all viewed web pages. It was written in Java and used a relational database as storage. It had a streaming architecture, which meant that complete requests and responses were not kept in memory, but rather flowed through the proxy. This meant that it could handle responses of any size, including large file downloads, without using much memory. You could write filters that would modify requests and responses, as well as block them completely. Storing web pages was implemented as a special filter.
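The streaming filter idea can be sketched in a few lines of Python. This is only an illustration of the architecture described above, not the original Java code: the names `Blocked`, `block_if_contains`, `replace_bytes`, `record_to` and `run_filters` are all made up for this example. Each filter is a generator that consumes chunks and yields chunks, so only one chunk is in flight at a time.

```python
# Hypothetical sketch of a streaming filter chain. None of these names come
# from the original Java proxy; they only illustrate the idea.


class Blocked(Exception):
    """Raised by a filter to stop a message from being forwarded."""


def block_if_contains(needle):
    """Build a filter that blocks the message when a chunk contains needle."""
    def apply(chunks):
        for chunk in chunks:
            if needle in chunk:
                raise Blocked(needle)
            yield chunk
    return apply


def replace_bytes(old, new):
    """Build a filter that rewrites each chunk as it passes through."""
    def apply(chunks):
        for chunk in chunks:
            yield chunk.replace(old, new)
    return apply


def record_to(sink):
    """Storage as a special filter: copy each chunk to sink, pass it on unchanged."""
    def apply(chunks):
        for chunk in chunks:
            sink.append(chunk)
            yield chunk
    return apply


def run_filters(chunks, filters):
    """Chain the filters lazily; nothing is read until the result is iterated."""
    for make_stream in filters:
        chunks = make_stream(chunks)
    return chunks
```

Because the chain is lazy, a large download never has to fit in memory. The price is that these naive filters only see one chunk at a time, so a pattern split across two chunks would slip through; a real filter would have to carry some state across chunk boundaries.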

I never got around to implementing all the features I wanted. Hoping to change that, I am taking a shot at implementing an HTTP proxy in Python. Obviously, socket programming is needed on some level. While my original Java proxy used asynchronous sockets directly, I am thinking about using the Twisted framework in Python. Twisted is “an event-driven networking engine written in Python.” It provides lots of different protocol implementations, including HTTP. In fact, there is a simple HTTP proxy, too. Starting a proxy instance is easy:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy
 
factory = http.HTTPFactory()
factory.protocol = Proxy
 
reactor.listenTCP(8000, factory)
reactor.run()
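To try the proxy out without reconfiguring a browser, a client can be pointed at port 8000 from Python's standard library. The helper below is a sketch and not part of Twisted; `make_proxy_opener` is a name invented here, and the host and port simply mirror the `listenTCP(8000, ...)` call above.

```python
try:
    from urllib.request import ProxyHandler, build_opener  # Python 3
except ImportError:
    from urllib2 import ProxyHandler, build_opener  # Python 2


def make_proxy_opener(host="localhost", port=8000):
    """Return an opener that sends plain HTTP requests through the proxy."""
    proxy_url = "http://%s:%d" % (host, port)
    return build_opener(ProxyHandler({"http": proxy_url}))


# Usage, with the proxy running in another terminal (needs network access):
#   opener = make_proxy_opener()
#   response = opener.open("http://example.com/")
#   body = response.read()
```

Only the "http" scheme is mapped, since the simple proxy above handles plain HTTP requests.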

However, this proxy isn’t very interesting. To print every URI that is requested through the proxy, we can subclass it like so:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
 
class VerboseProxyRequest(ProxyRequest):
    def process(self):
        print self.uri
        ProxyRequest.process(self)
 
class VerboseProxy(Proxy):
    requestFactory = VerboseProxyRequest
 
factory = http.HTTPFactory()
factory.protocol = VerboseProxy
 
reactor.listenTCP(8000, factory)
reactor.run()

The ProxyRequest implementation of process creates a new HTTP client, an instance of the class ProxyClient (built by a ProxyClientFactory). The HTTP client simply forwards anything it receives to the response transport channel of the parent ProxyRequest instance. Creating a simple ad blocker is easy:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
 
class BlockingProxyRequest(ProxyRequest):
    def process(self):
        if "ads" in self.uri:
            print "Blocked:", self.uri
            # Answer the browser directly instead of contacting the remote server.
            self.transport.write("HTTP/1.0 404 Not found\r\n")
            self.transport.write("Content-Type: text/html\r\n")
            self.transport.write("\r\n")
            self.transport.write('''<H1>Resource not found</H1>''')
            self.transport.loseConnection()
            # Stop here; otherwise the blocked request would still be forwarded.
            return

        ProxyRequest.process(self)
 
class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest
 
factory = http.HTTPFactory()
factory.protocol = BlockingProxy
 
reactor.listenTCP(8000, factory)
reactor.run()

This code will block any request that has the string “ads” in the URI. This may be overly aggressive for a real ad blocker, though.
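A less aggressive check could look at the host name instead of the whole URI. The following is a rough sketch under my own assumptions: `BLOCKED_HOSTS` and `is_blocked` are hypothetical names, the host names are placeholders, and the URI is assumed to be an absolute URI passed as a text string, as a proxy normally receives it.

```python
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

# Hypothetical blocklist; these host names are placeholders.
BLOCKED_HOSTS = frozenset(["ads.example.com", "tracker.example.net"])


def is_blocked(uri, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URI's host, or one of its parent domains, is listed."""
    host = urlsplit(uri).hostname  # lower-cased by urlsplit, None if missing
    if host is None:
        return False
    labels = host.split(".")
    # Check "img.ads.example.com", then "ads.example.com", then "example.com", ...
    for start in range(len(labels)):
        if ".".join(labels[start:]) in blocked_hosts:
            return True
    return False
```

With this check, a URI such as http://example.com/downloads/readme.txt passes, even though "downloads" contains "ads". In the proxy, the substring test in process could be swapped for a call to a function like this one.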

Twisted handles all the details of the HTTP requests and responses, as well as the underlying transport protocol. One drawback is that the provided HTTP client implementation does not support HTTP/1.1 yet. This may affect performance, for example because connections to the remote server are not kept alive between requests, but it will not be an issue for my uses.

To get at the response data, we need to subclass the ProxyClient class:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest, ProxyClientFactory, ProxyClient
from ImageFile import Parser
from StringIO import StringIO
 
class InterceptingProxyClient(ProxyClient):
    def __init__(self, *args, **kwargs):
        ProxyClient.__init__(self, *args, **kwargs)
        self.image_parser = None

    def handleHeader(self, key, value):
        # Header names are case-insensitive, so compare them in lower case.
        name = key.lower()
        if name == "content-type" and value in ["image/jpeg", "image/gif", "image/png"]:
            self.image_parser = Parser()
        if name == "content-length" and self.image_parser:
            # Withhold the original length; the rotated image may differ in size.
            # (This relies on Content-Type arriving before Content-Length.)
            pass
        else:
            ProxyClient.handleHeader(self, key, value)

    def handleEndHeaders(self):
        if self.image_parser:
            pass  # Need to calculate and send Content-Length first
        else:
            ProxyClient.handleEndHeaders(self)

    def handleResponsePart(self, buffer):
        if self.image_parser:
            # Collect image data in the parser instead of forwarding it.
            self.image_parser.feed(buffer)
        else:
            ProxyClient.handleResponsePart(self, buffer)

    def handleResponseEnd(self):
        if self.image_parser:
            try:
                # close() raises if the data is not a complete, valid image.
                image = self.image_parser.close()
                format = image.format
                image = image.rotate(180)
                s = StringIO()
                image.save(s, format)
                buffer = s.getvalue()
            except Exception:
                buffer = ""
            # Header values are strings, so convert the length explicitly.
            ProxyClient.handleHeader(self, "Content-Length", str(len(buffer)))
            ProxyClient.handleEndHeaders(self)
            ProxyClient.handleResponsePart(self, buffer)
        ProxyClient.handleResponseEnd(self)
 
class InterceptingProxyClientFactory(ProxyClientFactory):
    protocol = InterceptingProxyClient
 
class InterceptingProxyRequest(ProxyRequest):
    protocols = {'http': InterceptingProxyClientFactory}
 
class InterceptingProxy(Proxy):
    requestFactory = InterceptingProxyRequest
 
factory = http.HTTPFactory()
factory.protocol = InterceptingProxy
 
reactor.listenTCP(8000, factory)
reactor.run()

This proxy will rotate JPEG, GIF and PNG files 180 degrees, turning a Google image search into this:
[Image: upside-down Google image search]
Fun, but not very useful. It would have been just as easy to store the image, as well as other content, to disk or a database, but that is a topic for another blog post.
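The pattern behind InterceptingProxyClient (withhold Content-Length, buffer the body, transform it, then send the new length followed by the new body) can be isolated from Twisted and the imaging library. The class below is a sketch of that pattern only; `BufferingRewriter` and its callback parameters are names I made up, and the transform is whatever function you pass in.

```python
class BufferingRewriter(object):
    """Collect a response body, transform it once complete, then emit it."""

    def __init__(self, transform, emit_header, emit_body):
        self.transform = transform      # bytes -> bytes
        self.emit_header = emit_header  # called as emit_header(key, value)
        self.emit_body = emit_body      # called as emit_body(data)
        self.chunks = []

    def handle_header(self, key, value):
        # The original Content-Length describes the untransformed body, so
        # drop it here and send a corrected one in handle_end().
        if key.lower() != "content-length":
            self.emit_header(key, value)

    def handle_part(self, data):
        # Nothing is forwarded yet; the whole body is needed first.
        self.chunks.append(data)

    def handle_end(self):
        body = self.transform(b"".join(self.chunks))
        self.emit_header("Content-Length", str(len(body)))
        self.emit_body(body)
```

Note the trade-off: responses handled this way are held in memory in full, so they lose the streaming property that the rest of the proxy keeps. For images that is usually acceptable; for large downloads it would not be.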
