There’s Something About Code

June 17, 2009

Twisted HTTP proxy

Filed under: Code, by Knut Eldhuset @ 20:15

A couple of years ago I wrote an HTTP proxy that would save all viewed web pages. It was written in Java and used a relational database as storage. It had a streaming architecture, which meant that complete requests and responses were not kept in memory, but rather flowed through the proxy. This meant that it could handle responses of any size, including large file downloads, without using much memory. You could write filters that would modify requests and responses, as well as block them completely. Storing web pages was implemented as a special filter.

I never got around to implementing all the features I wanted. Hoping to change that, I am taking a shot at implementing an HTTP proxy in Python. Obviously, socket programming is needed on some level. While my original Java proxy used asynchronous sockets directly, I am thinking about using the Twisted framework in Python. Twisted is “an event-driven networking engine written in Python.” It provides lots of different protocol implementations, including HTTP. In fact, there is a simple HTTP proxy, too. Starting a proxy instance is easy:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy
 
factory = http.HTTPFactory()
factory.protocol = Proxy
 
reactor.listenTCP(8000, factory)
reactor.run()

However, this proxy isn’t very interesting. If we want to print every URI that is requested through the proxy, we can subclass it like so:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
 
class VerboseProxyRequest(ProxyRequest):
    def process(self):
        print self.uri
        ProxyRequest.process(self)
 
class VerboseProxy(Proxy):
    requestFactory = VerboseProxyRequest
 
factory = http.HTTPFactory()
factory.protocol = VerboseProxy
 
reactor.listenTCP(8000, factory)
reactor.run()

The ProxyRequest implementation of process creates a new HTTP client, an instance of the class ProxyClient. This client simply forwards everything it receives from the remote server to the transport of the parent ProxyRequest instance, that is, back to the browser. Creating a simple ad blocker is easy:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest
 
class BlockingProxyRequest(ProxyRequest):
    def process(self):
        if "ads" in self.uri:
            print "Blocked:", self.uri
            # Answer with a minimal 404 response instead of contacting the server
            self.transport.write("HTTP/1.0 404 Not found\r\n")
            self.transport.write("Content-Type: text/html\r\n")
            self.transport.write("\r\n")
            self.transport.write('''<H1>Resource not found</H1>''')
            self.transport.loseConnection()
            # Stop here so that the blocked request is not forwarded as well
            return
 
        ProxyRequest.process(self)
 
class BlockingProxy(Proxy):
    requestFactory = BlockingProxyRequest
 
factory = http.HTTPFactory()
factory.protocol = BlockingProxy
 
reactor.listenTCP(8000, factory)
reactor.run()

This code will block any request that has the string “ads” in the URI. This may be overly aggressive for a real ad blocker, though.
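
To illustrate how a narrower check might look, here is a minimal sketch in Python 3 (the proxy code above targets Python 2). The function is_blocked and the BLOCKED_HOSTS set are hypothetical names with placeholder host names, not part of Twisted; the idea is simply to match on the host name rather than on the whole URI.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; the host names are placeholders for illustration.
BLOCKED_HOSTS = {"ads.example.com", "adserver.example.net"}


def is_blocked(uri, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URI's host is a blocked host or a subdomain of one."""
    host = urlsplit(uri).hostname or ""
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


# The substring test matches innocent URIs too ("downloads" contains "ads"):
print("ads" in "http://example.org/downloads/file.zip")      # True
# The host-based test only matches the listed hosts and their subdomains:
print(is_blocked("http://example.org/downloads/file.zip"))   # False
print(is_blocked("http://img.ads.example.com/banner.gif"))   # True
```

In the proxy, such a function could replace the "ads" in self.uri test inside process, but I have not tried that here.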

Twisted handles all the details of the HTTP requests and responses, as well as the underlying transport protocol. One drawback is that the provided HTTP client implementation does not support HTTP/1.1 yet. This may affect performance, but will not be an issue for my uses.

To get at the response data, we need to subclass the ProxyClient class:

from twisted.internet import reactor
from twisted.web import http
from twisted.web.proxy import Proxy, ProxyRequest, ProxyClientFactory, ProxyClient
from ImageFile import Parser
from StringIO import StringIO
 
class InterceptingProxyClient(ProxyClient):
    def __init__(self, *args, **kwargs):
        ProxyClient.__init__(self, *args, **kwargs)
        self.image_parser = None
 
    def handleHeader(self, key, value):
        if key == "Content-Type" and value in ["image/jpeg", "image/gif", "image/png"]:
            self.image_parser = Parser()
        if key == "Content-Length" and self.image_parser:
            pass
        else:
            ProxyClient.handleHeader(self, key, value)
 
    def handleEndHeaders(self):
        if self.image_parser:
            pass #Need to calculate and send Content-Length first
        else:
            ProxyClient.handleEndHeaders(self)
 
    def handleResponsePart(self, buffer):
        if self.image_parser:
            self.image_parser.feed(buffer)
        else:
            ProxyClient.handleResponsePart(self, buffer)
 
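    def handleResponseEnd(self):
        if self.image_parser:
            try:
                # close() raises IOError if the image data is incomplete or
                # cannot be parsed, so it belongs inside the try block
                image = self.image_parser.close()
                format = image.format
                image = image.rotate(180)
                s = StringIO()
                image.save(s, format)
                buffer = s.getvalue()
            except Exception:
                # Fall back to an empty body if the image cannot be processed
                buffer = ""
            # The headers were held back until the new body length was known
            ProxyClient.handleHeader(self, "Content-Length", str(len(buffer)))
            ProxyClient.handleEndHeaders(self)
            ProxyClient.handleResponsePart(self, buffer)
        ProxyClient.handleResponseEnd(self)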
 
class InterceptingProxyClientFactory(ProxyClientFactory):
    protocol = InterceptingProxyClient
 
class InterceptingProxyRequest(ProxyRequest):
    protocols = {'http': InterceptingProxyClientFactory}
 
class InterceptingProxy(Proxy):
    requestFactory = InterceptingProxyRequest
 
factory = http.HTTPFactory()
factory.protocol = InterceptingProxy
 
reactor.listenTCP(8000, factory)
reactor.run()
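
One caveat about the header check in handleHeader: HTTP header names are case-insensitive, and a Content-Type value may carry parameters after a semicolon, so an exact string comparison can miss some images. A more tolerant test could look like the sketch below (written for Python 3; is_image_header is a hypothetical helper, not part of Twisted).

```python
IMAGE_TYPES = {"image/jpeg", "image/gif", "image/png"}


def is_image_header(key, value):
    """True if the header is a Content-Type naming a supported image type."""
    if key.lower() != "content-type":
        return False
    # Drop any parameters such as "; charset=..." before comparing
    media_type = value.split(";", 1)[0].strip().lower()
    return media_type in IMAGE_TYPES


print(is_image_header("Content-Type", "image/png"))                 # True
print(is_image_header("content-type", "IMAGE/JPEG; q=1"))           # True
print(is_image_header("Content-Type", "text/html; charset=utf-8"))  # False
```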

This proxy will rotate JPEG, GIF and PNG files 180 degrees, turning a Google image search into this:
[Image: upside-down Google image search]
Fun, but not very useful. It would have been just as easy to store the image, as well as other content, to disk or a database, but that is a topic for another blog post.

June 1, 2009

Python audio output

Filed under: Code, by Knut Eldhuset @ 10:01

Playing MOD files requires outputting digital sound at sample rates proportional to the desired pitch. On the Amiga, this was accomplished by setting the sample rate of the hardware channels. High-level audio interfaces may require a fixed sample rate for the lifetime of the audio channel, and converting the MOD file to a WAV file would also require a fixed sample rate. Thus, sample rate conversion is needed.

The most basic MOD files have four sequencer channels. When played on stereo hardware, two sequencer channels are output to each hardware channel. The sound samples need to be converted from mono to stereo, then mixed together with the other channels.

On the Amiga, the sequencer played a new sample each vertical blanking interval. The screen refresh rate in the PAL version of the computer was 50 Hz. I have used the pyglet library to play audio. By subclassing the StreamingSource class and providing an implementation of the _get_audio_data method, the timing of the sequencer takes care of itself. The _get_audio_data method returns an audio chunk equivalent to what the Amiga played per vertical blanking interval. Pyglet will simply request more data when needed.
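
As a rough worked example of the chunk size involved: one tick lasts 1/50 s, so at a fixed output rate each chunk has to hold rate × tick time sample frames. The 44100 Hz rate, the 16-bit sample width and the two output channels below are assumed values for illustration, and bytes_per_tick is a hypothetical helper, not a function from the player.

```python
def bytes_per_tick(rate, tick_hz=50, bytes_per_sample=2, channels=2):
    """Size in bytes of the audio chunk covering one sequencer tick."""
    tick_time = 1.0 / tick_hz               # seconds per vertical blank
    frames = int(round(rate * tick_time))   # sample frames per tick
    return frames * bytes_per_sample * channels


# 882 frames * 2 bytes * 2 channels
print(bytes_per_tick(44100))  # 3528
```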

The code for the MOD player can be found here. The code excerpts below are taken from the file player.py.

The Multimedia Services section of the Python standard library contains functions for the necessary raw audio operations. These are located in the audioop module. The ratecv function takes care of sample rate conversion:

ratecv(fragment, width, nchannels, inrate, outrate, state[, weightA[, weightB]])

It takes an audio fragment as input, and returns the fragment converted to the desired sample rate, as well as the new state. The new state is passed as input the next time the function is invoked. Here is how it looks in the MOD player:

    def _ratecv(self, sounds):
        output = []
        for n, (sound, state) in enumerate(zip(sounds, self.ratecv_state)):
            while True:
                o, state = ratecv(sound, self.bytes, 1, 
                              int(round(len(sound) / self.tick_time) / self.bytes), 
                              self.rate, 
                              state)
                #Length may be off by one, so process until OK
                if len(o) == int(round(self.rate * self.tick_time * self.bytes)):
                    break
            output.append(o)
            self.ratecv_state[n] = state
        return output

Here, self.bytes is the number of bytes per sample. The number of channels is 1, since sound is a mono sample. The inrate will vary according to the pitch at which the sample fragment is played back. This is controlled by the sequencer by varying the length of the fragment. The inrate is then calculated based on the fact that this particular fragment fills self.tick_time seconds. The state is stored in a list for use in the next tick.
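
The input rate calculation can be checked with a little arithmetic. This is a minimal sketch with made-up fragment lengths and an assumed tick time of 1/50 s; input_rate is a hypothetical helper, not code from player.py. A fragment holding N samples that must last tick_time seconds has an input rate of N / tick_time, so a longer fragment means a higher input rate.

```python
def input_rate(fragment_len_bytes, bytes_per_sample, tick_time):
    """Sample rate at which a fragment exactly fills tick_time seconds."""
    n_samples = fragment_len_bytes / bytes_per_sample
    return int(round(n_samples / tick_time))


TICK_TIME = 1.0 / 50  # assumed: one tick per 50 Hz vertical blank

# 332 bytes of 16-bit audio = 166 samples per tick
print(input_rate(332, 2, TICK_TIME))  # 8300
# Twice as many samples in the same time gives twice the rate
print(input_rate(664, 2, TICK_TIME))  # 16600
```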

When mixing several channels into one, one needs to make sure that there will be no clipping of the resulting sound sample. This is done by dividing by the number of channels that are to be mixed:

    def _scale(self, output):
        return [mul(o, self.bytes, 1.0 / (len(output) / self.channels)) for o in output]

The mul function takes care of the scaling. Like all the other audioop functions, it needs to know how many bytes are used per sample.

mul(fragment, width, factor)
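
To see why the scaling prevents clipping, here is a small pure-Python stand-in for what mul does to 16-bit samples. The scale function and the sample values are illustrative only and are not the audioop implementation. With two sequencer channels per stereo channel, each one is multiplied by 0.5, so the sum of two full-scale samples still fits in 16 bits.

```python
from array import array

MAX_16 = 32767  # largest signed 16-bit sample value


def scale(samples, factor):
    """Multiply every 16-bit sample by factor, truncating towards zero."""
    return array("h", (int(s * factor) for s in samples))


a = array("h", [MAX_16, -20000, 1000])
b = array("h", [MAX_16, -20000, 3000])

# Unscaled, the first two sums fall outside the range [-32768, 32767]
print([x + y for x, y in zip(a, b)])                          # [65534, -40000, 4000]
# Scaled by 1/2 (two channels per output), every sum stays in range
print([x + y for x, y in zip(scale(a, 0.5), scale(b, 0.5))])  # [32766, -20000, 2000]
```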

The tostereo function takes a mono sample and returns a stereo sample. One can supply scaling factors for each of the left and right channels.

tostereo(fragment, width, lfactor, rfactor)

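This is used below to put every even-numbered sequencer channel in the left stereo channel and every odd-numbered sequencer channel in the right stereo channel, counting the channels from 1. (enumerate starts at 0, so the first channel gets lfactor 0 and rfactor 1.)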

    def _tostereo(self, output):
        return [tostereo(o, self.bytes, n % 2, (n + 1) % 2) for n, o in enumerate(output)]
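
The effect of the two factors can be illustrated without audioop. The sketch below is a hypothetical pure-Python equivalent working on lists of integer samples (to_stereo is not the library function): each mono sample becomes a left/right pair scaled by lfactor and rfactor, and the n % 2 pattern sends alternating sequencer channels to opposite sides.

```python
def to_stereo(samples, lfactor, rfactor):
    """Interleave a mono sample list into [L0, R0, L1, R1, ...]."""
    out = []
    for s in samples:
        out.append(int(s * lfactor))
        out.append(int(s * rfactor))
    return out


channels = [[10, 11], [20, 21], [30, 31], [40, 41]]  # four mono fragments
for n, mono in enumerate(channels):
    print(n, to_stereo(mono, n % 2, (n + 1) % 2))
# 0 [0, 10, 0, 11]   right only
# 1 [20, 0, 21, 0]   left only
# 2 [0, 30, 0, 31]   right only
# 3 [40, 0, 41, 0]   left only
```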

Putting all this together, the variable sample rate sequencer channels can be transformed into a constant sample rate stereo output:

    def _get_audio_data(self, num_bytes):
        sound = self.sequencer.tick()
        if sound is None:
            pyglet.app.exit()
            return
        self._mute(sound)
        sound = [lin2lin(s, 1, self.bytes) for s in sound]
        output = self._ratecv(sound)
        output = self._scale(output)
        output = self._tostereo(output)
        stereo = mix(output, self.bytes)
        audio = AudioData(stereo, self.audio_length, self.timestamp, self.tick_time)
        self.timestamp += self.tick_time
        return audio

For a standard MOD file, the sequencer returns four sample fragments per tick, one for each channel. These are converted from 8 bit to 16 bit using the lin2lin function:

lin2lin(fragment, width, newwidth)

The samples are then rate converted, scaled and converted to stereo. The mix function is a helper function that mixes a list of samples. The audioop module only provides an add function to mix two fragments, so I wrote the helper to mix an arbitrary number of channels. The method ends by creating an AudioData object with the parameters needed for pyglet to play the sound.
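
The mix helper itself is not shown in the excerpts. A minimal sketch of the idea, assuming 16-bit samples held in array objects, is to fold a pairwise add over the list. Here add16 is a pure-Python stand-in for audioop.add, and both function names are hypothetical rather than the code from player.py.

```python
from array import array
from functools import reduce


def add16(a, b):
    """Add two equal-length 16-bit fragments, clamping to the 16-bit range."""
    return array("h", (max(-32768, min(32767, x + y)) for x, y in zip(a, b)))


def mix_all(fragments):
    """Mix any number of fragments by repeated pairwise addition."""
    return reduce(add16, fragments)


frags = [array("h", [100, -200]), array("h", [50, 50]), array("h", [1, 2])]
print(list(mix_all(frags)))  # [151, -148]
```

Because the fragments have already been scaled down before mixing, the clamping in add16 should rarely come into play.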
