How many ways can I say …

My friend Kitty asked me a simple question the other evening:

What is the total number of unique messages you can fit into 140 characters (i.e. a tweet)?

This seems like potentially quite a simple question. So, let’s say you have a set of valid characters called C, and a maximum message length of l (in this case, 140). For all messages consisting of one character, you have as many possibilities as the size of C. For two characters, you have that size squared – all the way up to the size raised to the power l. Writing C for the size of the set as well, in math-speak:

\displaystyle\sum_{n=1}^l C^n

We know the general formula for the sum of a geometric progression to be:
\displaystyle\sum_{k=0}^{n} ar^k = \frac{a(1-r^{n+1})}{1-r}
Geometric progression (Wikipedia)

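We can substitute a = 1 and r = C, use n as the index in place of k, and take l as the upper limit, giving us
\displaystyle\sum_{n=0}^{l} C^n = \frac{1-C^{l+1}}{1-C}
This sum starts at n = 0, so it includes one extra term, C^0 = 1, for the empty message; subtracting 1 gives the sum from n = 1 that we started with.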

For the sake of argument, let’s say tweets can only be 8-bit ASCII (I know, terrible assumption to make, but let’s just run with this), so the size of C = 255. We shall also let l = 140, since that is the maximum length of a tweet. The answer is …. a very big number. According to Wolfram Alpha, it’s

That’s about 8 followed by over 300 zeros, and that’s being conservative (not including Unicode, for example). However, what we have just calculated is in fact all possible messages. Including “xSADFt5hagarnw”, or “ s sssss akasf”. I can’t think of many people who would tweet that – or in fact be able to logically distinguish that from any other sequence of random letters. What we actually want to know is the proportion of legible tweets.
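As a rough cross-check of the raw count, Python’s arbitrary-precision integers can evaluate the sum directly. This is a minimal sketch in Python 3, taking the 255-character alphabet and the 140-character limit assumed above; the function names are made up for illustration.

```python
def count_messages(alphabet_size, max_length):
    """Number of non-empty messages of up to max_length characters."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


def count_messages_closed_form(alphabet_size, max_length):
    """Same count via the geometric series, minus 1 for the empty message."""
    return (alphabet_size ** (max_length + 1) - 1) // (alphabet_size - 1) - 1


total = count_messages(255, 140)
assert total == count_messages_closed_form(255, 140)
print(len(str(total)))  # number of decimal digits in the total
```

The total has 337 digits and starts with an 8, which agrees with the “8 followed by over 300 zeros” estimate.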

This in turn requires us to define what we mean by legible. We could, naively, create a grammar for messages that they might follow. You might say that they contain words separated by spaces. Sometimes the words can start with a # or a @, they can end with punctuation (like . , ? ! etc). But then, we are eliminating legible (but low quality) messages like “!!!LOL!!!” or “i <3 my v1@gra”. Our grammar starts becoming more complex to accommodate these exceptions.

One possible approach would be to estimate a channel grammar by taking all of the current traffic on the channel and creating an unweighted graph generated using minimal a priori information (i.e. that words are separated by whitespace, and contain non-whitespace characters). The graph would then contain a path for every single “word” on the channel – with each node representing a character and its position in the word. Once this specific graph is created, one could group common nodes together and generalise the graph, reducing the number of redundant nodes.

Once the graph has been generalised sufficiently, one could then use graph theory to calculate all valid routes through the graph (or, as it would be, chain) – imposing the limits on the number of possible characters.
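One way to picture this is the sketch below (Python 3). It is only an illustration under simplifying assumptions, not a worked-out grammar estimator: words are split on whitespace, each node is a (position, character) pair, and “generalising” just means that words sharing a node share it in the graph. The names build_word_graph and count_words are made up for the example.

```python
from collections import defaultdict
from functools import lru_cache

START, END = "^", "$"  # hypothetical marker nodes for word start and end


def build_word_graph(messages):
    """Map each node to the set of nodes that may follow it."""
    edges = defaultdict(set)
    for message in messages:
        for word in message.split():
            previous = START
            for position, char in enumerate(word):
                node = (position, char)
                edges[previous].add(node)
                previous = node
            edges[previous].add(END)
    return edges


def count_words(edges):
    """Count distinct routes from START to END through the graph."""
    @lru_cache(maxsize=None)
    def routes_from(node):
        if node == END:
            return 1
        return sum(routes_from(following) for following in edges.get(node, ()))

    return routes_from(START)


graph = build_word_graph(["cat cot", "dog"])
print(count_words(graph))
```

Three observed words give five routes here (cat, cot, cog, dot, dog), because merging nodes by position lets the graph accept words it has never seen. Whether that much generalisation is desirable is exactly the judgement call described above.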

Chances are, it’ll still be a pretty darn huge number. However, we know that will certainly be a subset of the number we calculated earlier. By continuing to feed in data to the graph, you would then be able to adapt to new words being adopted in various languages. The total number of unique tweets is dependent on the vocabulary of the users.

So, I would say that the answer to the original question is in fact: it depends. And, given the general lack of any real central repository of vocabulary that all users must adhere to (which is a clarphing relief), the only way to determine it is by looking at it!

Unfortunately, not a very mathematically beautiful answer – but then anything to do with language rarely is ….

XML-RPC Benchmarking

Hokay, so, for my report, I needed to do some benchmarking of how well my XML-RPC server could perform. So I threw together a quick little script with Twisted which allows me to make manylots of requests at once. It makes two files: times.csv (which contains the length of time each request takes) and errors.csv (which dumps any errors that are encountered along the way).

This is literally the first revision so it’s incredibly hacky, but it does work fairly nicely.

To create the benchmarker object, you issue the following commands:

benchmark = Benchmark(200, 5000, url_of_server, method_name, ['method', 'params', 'here'])

And then to make it go:

benchmark.setupReqs()
reactor.run() # start the reactor!

Here’s the code that currently makes it work. It’s not at all documented but you should kinda be able to see what it’s doing.

#!/usr/bin/env python

from twisted.web.xmlrpc import Proxy
from twisted.internet import reactor
import sys
import datetime

class Request():
       def __init__(self, url, method, params):
               self.url = url
               self.method = method
               self.params = params
               self.start_time = 0
               self.end_time = 0
               self.value =  ""
               self.error = ""
               self.finished = False
               self.error_file = open('errors.csv','a') # append, so one request does not truncate another's errors

       def addCallback(self, callback):
               self.callback = callback

       def addErrback(self, errback):
               self.errback = errback

       def makeRequest(self):
               proxy = Proxy(self.url)
               self.start_time = datetime.datetime.now() # start timing before the request is issued
               proxy.callRemote(self.method,*self.params).addCallbacks(self.retSuccess, self.retFail)

       def __returned(self):
               self.end_time = datetime.datetime.now()

       def retSuccess(self, value):
               self.__returned()
               self.finished = True
               self.value = value
               self.callback(self,value)

       def retFail(self, error):
               self.__returned()
               self.finished = True
               self.error = error
               self.error_file.write("Error: %s\n" % error) # one error per line
               self.errback(self,error) # report failures through the errback, not the success callback

       def isFinished(self):
               return self.finished

       def getTime(self):
               return (self.end_time - self.start_time) # this should be a timedelta

class Benchmark():
       def __init__(self, concurrent = 10, total = 100, url = 'http://localhost:7080/XMLRPC', method = 'date', params=[]):
               self.url = url
               self.method = method
               self.params = params
               self.concurrent_reqs = concurrent
               self.total_reqs = total
               self.open_reqs = 0
               self.current_reqs = 0
               self.error_file = open('errors.csv','w+')
               self.req_times_file = open('times.csv','w+')

       def makeLog(self, filename):
               self.log_file = open(filename,'w+')

       def makeRequest(self):
               req = Request(self.url, self.method, self.params)
               req.addCallback(self.reqSuccess)
               req.addErrback(self.reqError)
               req.makeRequest()
               self.open_reqs = self.open_reqs + 1

       def printReqDetail(self, req):
               delta = req.getTime() # a datetime.timedelta
               print(delta)
               self.req_times_file.write("%f\n" % delta.total_seconds()) # log the duration in seconds to times.csv

       def reqFinished(self, req):
               self.printReqDetail(req)
               self.open_reqs = self.open_reqs - 1
               self.current_reqs = self.current_reqs + 1 # completed requests
               if ((self.current_reqs + self.open_reqs) < self.total_reqs):
                       self.makeRequest()
               else:
                       if self.open_reqs == 0:
                               reactor.stop() # made as many requests as we wanted to

       def reqSuccess(self,req,value):
               self.reqFinished(req)
               #print repr(value)

       def reqError(self,req, error):
               self.reqFinished(req)
               #print 'error', error

       def setupReqs(self):
               for i in range(0,self.concurrent_reqs): # make the initial pool of requests
                       self.makeRequest()
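The bookkeeping in reqFinished() is the part that is easiest to get wrong, so here is a small stand-alone simulation of it (Python 3, no Twisted). It is a sketch rather than part of the benchmarker: the fake requests simply complete in the order they were started, and simulate_pool is a made-up name.

```python
from collections import deque


def simulate_pool(concurrent, total):
    """Replay the Benchmark counters with fake requests that finish in FIFO order."""
    in_flight = deque()  # stands in for requests waiting on the network
    open_reqs = 0
    completed = 0
    started = 0
    peak_open = 0

    def make_request():
        nonlocal open_reqs, started, peak_open
        started += 1
        open_reqs += 1
        peak_open = max(peak_open, open_reqs)
        in_flight.append(started)

    for _ in range(concurrent):  # same as setupReqs()
        make_request()

    while in_flight:  # each pop plays the role of a callback firing
        in_flight.popleft()
        open_reqs -= 1
        completed += 1
        if completed + open_reqs < total:  # same test as reqFinished()
            make_request()

    return started, peak_open


print(simulate_pool(3, 10))
```

With 3 concurrent and 10 total, exactly 10 requests are started and never more than 3 are open at once. One quirk this makes visible: if the concurrency is larger than the total, the initial pool alone overshoots the total.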

NEBFAP

So, due to it being a UCAS day here at my Higher Education Institution, I have been forced to seek my sandwiches elsewhere.  So I sauntered off to the theater/cinema cafe which normally has fairly decent sandwiches and cappuccinos.  Having not been here in a while, I had forgotten the perplexing problem of NEBFAP. That is to say, Napkin Exists Between Food and Plate. This is what the aftermath of my BLT looked like:

IMAG0027.jpg

What is the purpose of putting a paper napkin underneath your food?  As you can see – the food has been sitting on it and as such has made it dirty – and thus unsuitable for wiping one’s hands or face on.  So that surely can’t be the reason for placing it there.  Is it instead to insulate the food from the plate?  Perhaps to make it simpler to clean?  If that were the case, why would you insulate the sandwich but not the salad – which has loads of (excessively mustardy) dressing drizzled over it?  Equally, if it was to prevent the food from touching the plate: why would the plate not be clean in the first place? And a similar argument regarding the salad again applies – if the plate isn’t clean enough to touch the sandwich, why is it in direct contact with the salad?

Now, thinking about it critically and somewhat laterally, I can see a potential reason for this.  A napkin would soak up any excess oil or fat which may be coming from the food.  This may be desirable, for example, with a cheese toastie – it would absorb the fat and prevent the plate from becoming greasy – and as such prevent the bread from becoming soggy and laden with grease.

Equally, as you may have noticed when making toast in the morning – if you place a hot slice of toast on a cold plate, water tends to condense under the toast, making it soggy if left for any reasonable period of time.  The solutions to this are toast racks or, my personal favorite, a House of Toast (whereby you balance two pieces of toast against each other a la a house of cards – try it!).  Or, if you are in the business of turning out toast-based food items at a high rate, you may well place a clean, absorbent cloth underneath the toast such that any condensation is absorbed without making the toast soggy.

Condensation and excessive moisture are a problem for hot foods, especially during microwave cooking.  For example, MicroChips(tm) (the microwavable potato chips – not the electronic devices) have a “crisping pad” at the bottom of the pack.  This is essentially a thin metallic sheet adhered to a piece of card.  Its purpose in life is to get very hot very fast (as metals tend to do in a microwave) and prevent moisture from collecting at the bottom of the pack, making the chips soggy.

So how does this relate to the problem of NEBFAP?  I can only assume that putting a napkin underneath sandwiches has become standard practice – given that a number of the items served may be served toasted or cold, it’s easier to train staff to put napkins underneath all sandwiches, rather than differentiate.

Mystery solved … ? I think that science may be required to verify this theory …

Multicasting with Twisted Python

Part of my MEng project work is to write some server code which allows alerts to be raised and broadcast to computers in a call center (something along the lines of an alarm monitoring response centre – a bunch of people in a room who respond to personal alarms in people’s homes). This is but a small part of the complete system into which I have been integrating it.

The main bulk of the server code which I have written is one giant XML-RPC server, using the Twisted Python framework.

The XML-RPC server itself was immensely easy to implement – using the Twisted XML-RPC howto. In fact, not a single bit of coding that I have done has had anything to do with XML-RPC. I could have written the application end-to-end without having any idea how XML-RPC works – which is fantastic because I don’t care that much for this application!

The next thing that I had to do was write in a component that would allow me to send out real time alerts to listening stations. I figured, since this will most likely be running in a call centre of some sort, I could take advantage of multicasting to achieve this. Essentially, all of the “listening” clients will join a multicast group (in this case, 224.0.0.1). Then, when the XML-RPC server receives a call to raise an alert, it will then send out a UDP packet saying that an alert has been raised to that multicast group.

Integrating a custom UDP protocol with an XML-RPC server? Whoa, that’s gotta be hard right? Nope. Not in the slightest. I wrote my (not particularly exciting) alert broadcasting code:


from twisted.internet.protocol import DatagramProtocol
from twisted.internet import reactor
from twisted.application.internet import MulticastServer

class AlertBroadcast(DatagramProtocol):
	def sendAlert(self,status,alert_id, alert_msg):
		self.transport.write("%d:%s:%s" % (alert_id, status, alert_msg), ('224.0.0.1', 1337))

	def startProtocol(self):
		self.transport.joinGroup('224.0.0.1') # join a multicast group

Then, in my main server building code, added this:

        r = APIRoot() # my root API class, descended from xmlrpc.XMLRPC
        alert_broadcast = AlertBroadcast()
        alertAPI = AlertsAPI(conn,auth,alert_broadcast) # the class to handle raising, acking and clearing alerts
        r.putSubHandler('alerts',alertAPI)
        # and I create all the required objects for the main XML-RPC server, and add it to 'r' (the root)
        reactor.listenMulticast(0, alert_broadcast)
        reactor.listenTCP(7080, server.Site(r))
        reactor.run()

And that’s it. That’s literally all there was to it. Integrating an HTTP server and a custom multicast UDP protocol – done!
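On the listening side, a client has to undo the "%d:%s:%s" formatting that sendAlert() applies. Here is a minimal sketch of that step in Python 3, kept separate from Twisted so it can be tried on its own. The helper names are made up, and it assumes the status never contains a colon (the message may).

```python
def format_alert(alert_id, status, alert_msg):
    """Build the payload in the same field order that sendAlert() writes."""
    return "%d:%s:%s" % (alert_id, status, alert_msg)


def parse_alert(payload):
    """Split a payload back into (alert_id, status, alert_msg)."""
    # Split on the first two colons only, so colons inside the message survive.
    alert_id, status, alert_msg = payload.split(":", 2)
    return int(alert_id), status, alert_msg


payload = format_alert(42, "raised", "Door open: flat 3")
print(parse_alert(payload))
```

Limiting the split to two separators is the one design choice worth noting: a plain split(":") would break any alert whose message contains a colon.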

I like Twisted - it’s awesome.

Helpdesk Sleuthing

I had an interesting query today on Helpdesk at an institution which will remain anonymous (but will be blatantly obvious to the people who know me).  A student came to the helpdesk and was faced with a perplexing problem.  Their lecturer had given them a website which they were required to use as part of the course (and written in their course guide).  The problem was, they could get to this website from a public PC (which is available for students’ use), and when they were off campus – but not via the wireless service or the network connection in their student accommodation on campus.  Perplexing.

They gave me the website address. We shall say it is called diversebiocultures.org.  I went to the address on my workstation (which is on the staff network) and sure enough, it went through just fine.  Perhaps they’ve got some weird proxy settings on their laptop, I thought.  Checked them and there was no such thing – everything set up correctly for Internet access, and they could access every other site just fine.  So I took one of our laptops which are pre-configured to use the wireless service, and bam – connection didn’t go through.  That’s odd.

So I decided to investigate further.  I looked up the domain name and it resolved to an IP which I knew was in the University’s network.  Except it wasn’t that of our usual web servers – it was on another network which I didn’t recognise.  Knowing that it must be on the local network, I did a traceroute.  The route mapped out went out of the building which I was in, to another building on the campus and down to a distribution switch – from the name I could see what building it was in.  That’s strange, I thought, there are no machine rooms in that building that would be capable of hosting any servers.

I did a reverse DNS on the IP.  It has a hostname on the University’s domain: cromwell.wont.ac.uk (we shall say).  So clearly someone had requested that at some point with Computing – and most likely it was a static IP.  But again, the hostname wasn’t one which I recognised as being a server.
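For anyone wanting to repeat the lookup steps without reaching for command-line tools, they can be sketched with the Python 3 standard library. This is an illustration only: the function names are made up, and the addresses in the usage lines come from the reserved documentation ranges, not from the network described here.

```python
import ipaddress
import socket


def is_on_campus(ip, campus_networks):
    """True if ip falls inside any of the given CIDR networks."""
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(net) for net in campus_networks)


def forward_and_reverse(hostname):
    """Resolve hostname to an IPv4 address, then ask for its reverse DNS name."""
    ip = socket.gethostbyname(hostname)
    try:
        reverse_name = socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        reverse_name = None  # no PTR record, or the lookup failed
    return ip, reverse_name


# Hypothetical campus range, taken from the documentation address space.
print(is_on_campus("192.0.2.17", ["192.0.2.0/24"]))
print(is_on_campus("198.51.100.5", ["192.0.2.0/24"]))
```

forward_and_reverse() needs a working resolver, so it is left uncalled here; the two checks at the bottom run without any network access.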

To the robtex machine!  I discovered by looking it up on robtex.com that it was in fact running a number of domains and websites.  About 12 domains in total.

I then decided to poke it a bit further.  I managed to get an error page up which told me that it was running Apache/1.3.41 on port 80.  Interesting. I decided to give it a portscan.  It was running ssh (OpenSSH), http and https. So most likely a Unix box of some description – makes sense for a server. I did some more poking and looking at the server headers, and discovered it was running PHP 5.2.13, which (as of this writing) is a pretty new version of PHP.  Bit of a strange combination, an old version of Apache but a brand spanking new version of PHP.  And, more interestingly, with the extended headers, it informed me that it was a Darwin build of PHP – so it’s more than likely an OS X machine (i.e. an Apple Mac – probably a desktop machine).
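Reading a software stack out of response headers is mostly a matter of picking apart product/version tokens. The sketch below (Python 3) shows the idea. The sample header is assembled from the versions mentioned above and is not the actual response from that machine; parse_server_header is a made-up helper.

```python
import re

# "Name/version" tokens, e.g. Apache/1.3.41
PRODUCT = re.compile(r"([A-Za-z][\w.-]*)/(\S+)")
# Parenthesised comments, e.g. (Darwin)
COMMENT = re.compile(r"\(([^)]*)\)")


def parse_server_header(value):
    """Return ({product: version}, [comments]) for a Server-style header value."""
    products = dict(PRODUCT.findall(value))
    comments = COMMENT.findall(value)
    return products, comments


# Hypothetical header value, for illustration only.
sample = "Apache/1.3.41 (Darwin) PHP/5.2.13"
products, comments = parse_server_header(sample)
print(products, comments)
```

The parenthesised comment is where the operating system hint tends to turn up, which is how a Darwin build gives the game away.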

After investigating the other websites which the machine was serving, I managed to find a few references to the “creator” of the websites, with a link to his home page.  Surprisingly (or not surprisingly) his home page wasn’t hosted on the same machine.  However, this did give me the full details of who owned this machine, and, presumably, maintained it.

So there’s an OS X machine on the University’s network serving several websites – including one which is required to participate in a module on a course.  And the reason it doesn’t work on the wireless service?  Because the firewall rules were never requested from Computing to allow that kind of access to the network which it was sitting on.  Because it’s a staff network – for staff desktops.

Ended up passing the query back to be escalated to the right people – here’s hoping this person decides to host his website on, shock horror, the University’s web servers.  Which funnily enough have no problems being accessed on the wireless service or in student accommodation.