My friend Kitty asked me a simple question the other evening:
“What is the total number of unique messages you can fit into 140 characters (i.e. a tweet)?”
This seems like potentially quite a simple question. So, let’s say you have a set of valid characters called $C$, and a maximum message length of $n$ (in this case, $n$ equals 140). For all messages consisting of one character, you have $|C|$ (the size of $C$) possibilities. For two characters, you have $|C|^2$ possibilities, and so on all the way up to $|C|^n$. In math-speak:

$$N = \sum_{k=1}^{n} |C|^k$$

The general formula for a geometric progression is:

$$\sum_{k=0}^{n-1} a r^k = a\,\frac{1 - r^n}{1 - r}$$

Geometric progression (Wikipedia)

We can substitute $a = |C|$ and $r = |C|$, keeping $n$ as the maximum length, giving us

$$N = |C|\,\frac{1 - |C|^n}{1 - |C|}$$

For the sake of argument, let’s say tweets can only be 8-bit ASCII (I know, terrible assumption to make, but let’s just run with this), so the size of $C$ is $|C| = 255$. We shall also let $n = 140$, since that is the maximum length of a tweet. The answer is... a very big number. According to Wolfram Alpha, it’s roughly $8.3 \times 10^{336}$.
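The sum is easy to check without Wolfram Alpha, since Python integers have arbitrary precision. The following is a minimal sketch, not the original calculation: the function names are made up for illustration, and taking 255 as the alphabet size is an assumption.

```python
def count_messages(alphabet_size: int, max_length: int) -> int:
    """Number of non-empty strings of length 1..max_length over the alphabet."""
    if alphabet_size == 1:
        return max_length  # the closed form below would divide by zero
    # Closed form of the geometric series |C| + |C|^2 + ... + |C|^n.
    # The division is exact, so integer division loses nothing.
    return alphabet_size * (alphabet_size**max_length - 1) // (alphabet_size - 1)


def count_messages_by_summing(alphabet_size: int, max_length: int) -> int:
    """Same quantity, added up term by term as a cross-check."""
    return sum(alphabet_size**k for k in range(1, max_length + 1))


if __name__ == "__main__":
    # Assumption: 255 usable characters, messages of up to 140 characters.
    total = count_messages(255, 140)
    assert total == count_messages_by_summing(255, 140)
    print(len(str(total)), "digits, leading digit", str(total)[0])
```

With 255 characters and a limit of 140, the total has 337 digits and starts with an 8. Using all 256 byte values instead gives a 338-digit number, so the choice of alphabet barely changes the scale.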

That’s about 8 followed by over 300 zeros, and that’s being conservative (not including Unicode, for example). However, what we have just calculated is in fact all possible messages, including “xSADFt5hagarnw” or “ s sssss akasf”. I can’t think of many people who would tweet that, or in fact be able to logically distinguish it from any other sequence of random letters. What we actually want to know is the proportion of legible tweets.

This in turn requires us to define what we mean by legible. We could, naively, create a grammar that messages might follow. You might say that they contain words separated by spaces. Sometimes the words can start with a # or a @, and they can end with punctuation (like . , ? ! etc). But then we are eliminating legible (but low quality) messages like “!!!LOL!!!” or “i <3 my v1@gra”. Our grammar starts becoming more complex to accommodate these exceptions.
One possible approach would be to estimate a channel grammar by taking all of the current traffic on the channel and creating an unweighted graph generated using minimal a priori information (i.e. that words are separated by whitespace, and contain non-whitespace characters). The graph would then contain a path for every single “word” on the channel, with each node representing a character and its position in the word. Once this specific graph is created, one could group common nodes together and generalise the graph, reducing the number of redundant nodes.
Once the graph has been generalised sufficiently, one could then use graph theory to calculate all valid routes through the graph (or, as it would be, chain), imposing the limit on the total number of characters in a message.
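To make the idea concrete, here is a rough sketch under some loud assumptions: nodes are keyed by (position, character), so identical nodes are merged as words are added (one possible reading of "group common nodes"); a message is one or more words joined by single spaces; and all names (`build_graph`, `words_by_length`, `count_word_messages`) are hypothetical. It is a toy, not a real channel grammar.

```python
from collections import defaultdict

START, END = "^", "$"  # sentinel nodes, assumed not to clash with real nodes


def build_graph(words):
    """Positional character graph: each node is a (position, character) pair."""
    edges = defaultdict(set)
    for word in words:
        prev = START
        for pos, ch in enumerate(word):
            node = (pos, ch)
            edges[prev].add(node)  # shared nodes are merged automatically
            prev = node
        edges[prev].add(END)
    return edges


def words_by_length(edges, max_len):
    """counts[k] = number of START->END routes that spell a k-character word."""
    counts = [0] * (max_len + 1)
    frontier = {START: 1}  # node -> number of routes reaching it
    for length in range(max_len + 1):
        nxt = defaultdict(int)
        for node, ways in frontier.items():
            for succ in edges.get(node, ()):
                if succ == END:
                    counts[length] += ways
                else:
                    nxt[succ] += ways
        frontier = nxt
    return counts


def count_word_messages(word_counts, max_chars):
    """Count messages (words joined by single spaces) of at most max_chars."""
    exact = [0] * (max_chars + 1)  # exact[m] = messages exactly m chars long
    longest_word = len(word_counts) - 1
    for m in range(1, max_chars + 1):
        for k in range(1, min(m, longest_word) + 1):
            if k == m:
                exact[m] += word_counts[k]  # the message is a single word
            elif m - k - 1 >= 1:
                # a shorter message, then a space, then a k-character word
                exact[m] += exact[m - k - 1] * word_counts[k]
    return sum(exact)


if __name__ == "__main__":
    corpus = "cat cog dot".split()  # stand-in for real channel traffic
    counts = words_by_length(build_graph(corpus), max_len=3)
    print(counts[3])                       # routes spelling three-letter words
    print(count_word_messages(counts, 7))  # messages of at most 7 characters
```

Note that the three-word corpus already yields five routes ("cot" and "dog" appear for free), which is the generalisation at work: merging nodes makes the graph accept words it has never seen.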
Chances are, it’ll still be a pretty darn huge number. However, we know that it will certainly be a subset of the number we calculated earlier. By continuing to feed data into the graph, you would then be able to adapt to new words being adopted in various languages. The total number of unique tweets is dependent on the vocabulary of the users.
So, I would say that the answer to the original question is in fact: it depends. And, given the general lack of any real central repository of vocabulary that all users must adhere to (which is a relief), the only way to determine it is by looking at it!
Unfortunately, not a very mathematically beautiful answer – but then anything to do with language rarely is ….