
What does Twitter asymptote to?

December 2020

Consider the following interpretation of Twitter: Twitter is a process that takes a human thought vector $x_{i}$, a time $t$, and a state vector $c_{t-1}$, and emits a value $tweet$ together with a new state vector $c_{t}$. Twitter is effectively a method of sampling text from humans at various times and storing the outputs. The sampled text also carries some extra structure: hashtags, @references, a retweet flag and possibly a retweet comment, a possible reply-relationship to another tweet, and an inherited timestamp.
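As a type signature, the process might look something like this minimal Python sketch. All of the field and function names here are my own illustration, not anything Twitter actually exposes:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Tweet:
    text: str
    timestamp: float                                   # inherited sample time t
    hashtags: list[str] = field(default_factory=list)
    mentions: list[str] = field(default_factory=list)  # @references
    is_retweet: bool = False
    retweet_comment: Optional[str] = None              # quote-tweet text, if any
    in_reply_to: Optional[int] = None                  # parent tweet id, if a reply

def twitter(x_i, c_prev, t) -> tuple[Tweet, object]:
    """Twitter(x_i, c_{t-1}, t) -> (tweet, c_t): sample text from a human
    thought vector and update the shared state. Signature only; the body
    is the thing this post is about."""
    raise NotImplementedError
```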

Over time, as this process runs, we discover things about the cognitive structure of humans. We learn about the syntax and semantics of their languages. We learn about objects and places and events from their lives. But it is very sparse: the informational bottleneck is so severe (far beyond the limits of practical compression) that the output is heavily degraded for most complex topics.

If you start in 2006 and let $Twitter(x_{i}, c_{t-1}, t)$ run for 14 years, you might get something $\{c_{t}, \mathbf{tweets}\}$ that is similar to Twitter.com today, but you might also get something totally different. Being able to run a Twitter-Earth-2006-2020 simulation repeatedly and characterize the variation would tell you tons about humans and Earth.
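If we did have such a simulator, the rerun-and-characterize loop might look like this toy sketch. The simulator body is a stand-in with rich-get-richer dynamics, chosen only so that reruns actually diverge:

```python
import random
from collections import Counter

TOPICS = ["politics", "sports", "tech", "memes"]  # toy world, purely illustrative

def simulate(seed: int, n_steps: int = 10_000) -> Counter:
    """Toy stand-in for a 2006-2020 Twitter-Earth run: topics that get
    tweeted about attract more tweets, so each run locks in its own
    early accidents."""
    rng = random.Random(seed)
    counts = Counter({topic: 1 for topic in TOPICS})
    for _ in range(n_steps):
        topic = rng.choices(TOPICS, weights=[counts[t] for t in TOPICS])[0]
        counts[topic] += 1
    return counts

# Rerun the simulation under many seeds and characterize the variation
# in where each universe's Twitter ends up.
runs = [simulate(seed) for seed in range(100)]
for topic in TOPICS:
    shares = sorted(r[topic] / sum(r.values()) for r in runs)
    print(f"{topic:8s} share ranges from {shares[0]:.2f} to {shares[-1]:.2f}")
```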

Let’s say you can only run it once (you don’t have a simulator), but you can run it for a really long time. $Twitter(x_{i}, c_{t-1}, t)$ is clearly not ergodic: the state $c_{t}$ accumulates history, so early accidents get locked in rather than averaged out, and even with infinite time the outputs in two different universes will be different.
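The non-ergodicity is easy to see in a toy. Reduced to a two-color Pólya urn (my illustration, not the post's construction), each single run's time average converges, but to a random limit that no amount of further running will pull back to the ensemble average:

```python
import random

def polya_limit(seed: int, steps: int = 100_000) -> float:
    """One run of a Polya urn (start: 1 red, 1 blue ball): draw a ball
    with probability proportional to current counts, add another of the
    same color. Returns the final red fraction."""
    rng = random.Random(seed)
    red, blue = 1, 1
    for _ in range(steps):
        if rng.random() < red / (red + blue):
            red += 1
        else:
            blue += 1
    return red / (red + blue)

# One very long run settles on some particular value...
print("single long run:", round(polya_limit(seed=0), 3))
# ...while the ensemble of runs is spread across (0, 1), with mean 1/2:
print("ensemble:", [round(polya_limit(s, steps=10_000), 2) for s in range(20)])
```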

Imagine we had some measure $M(\mathbf{tweets}, WorldState)$ that provided some score (for example, a KL divergence) for how much information about the current world state is embedded in the set of Twitter outputs. What is the shape of that curve over time? Could we take subsets $WorldStateSubset \subseteq WorldState$ and probabilistically sample $M(\mathbf{tweets}, WorldStateSubset)$ to build richer, higher-dimensional maps of the Twitter-associated flows of information? Are there some areas of human knowledge that simply never make it into the outputs even if we run it forever? If so, what are their unifying characteristics?
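One possible concretization of $M$, hedged: if a world-state subset and the tweets about it can both be summarized as distributions over the same support, $M$ could be the KL divergence between them. A sketch in Python, with entirely made-up numbers:

```python
import math

def kl_divergence(p: dict, q: dict, eps: float = 1e-9) -> float:
    """D_KL(p || q): information lost when q is used to approximate p.
    eps smooths zero entries so the log stays finite."""
    support = set(p) | set(q)
    return sum(
        p[k] * math.log((p[k] + eps) / (q.get(k, 0.0) + eps))
        for k in support if p.get(k, 0.0) > 0
    )

def M(tweets_dist: dict, world_subset_dist: dict) -> float:
    """How far the tweet-recoverable picture of a world-state subset
    sits from the subset itself; 0 means perfectly embedded."""
    return kl_divergence(world_subset_dist, tweets_dist)

# Hypothetical example: the world's actual opinion split on some topic
# vs. the split you would estimate from tweets about it.
world  = {"pro": 0.48, "anti": 0.40, "indifferent": 0.12}
tweets = {"pro": 0.61, "anti": 0.36, "indifferent": 0.03}  # loud voices overrepresented
print("M =", round(M(tweets, world), 4))
```

Sampling many such subsets, and tracking each one's $M$ as the process runs, would be one way to build the higher-dimensional map the questions above gesture at.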