Saturday, September 25, 2004

Writing a Social Content Engine with RDF

September 19th 2004 anselm@hook.org

Last rev Sep 23 2004 - continuing to add comments to Javascript section.
No source code is published yet - this is still a work in progress.

Revving our engines



Today we're going to build a social content engine for organizing and sharing
content with our friends.
The service we build will let you:

  1. Publish observations or 'stuff' onto a website.
  2. Categorize it in a variety of ways.
  3. Pivot on your own or others' observations to discover other related topics
    or persons.

Our work will be modelled on newly emerging services including del.icio.us,
Flickr and Webjay. The code itself will be a rewrite of a fairly small subset
of Thingster and BooksWeLike, which I've been developing (and learning to
understand the implications of) over the last 6 months or so. If you haven't
used del.icio.us in particular you need to stop reading this, go there, make an
account and play with it for a while.

People use these services to organize their own content for later recollection.
But since the services are public, other users can peek into the collective
space, and discover similar items, topics or persons.

In this project we're going to look for opportunities to stress the 'synthesis'
aspect of social discovery; to escape from the pattern of curated collections
managed and presented by one person. If there is time it would be fun to play
with generating statistics and views on participants and their recommendations
as well.
The components that we need to write to deliver our service will include:

  1. Our own lightweight RDF triple-store built on top of PERST.
  2. Our own lightweight content management system.
  3. A tag engine used to categorize our observations.
  4. An XML server gateway that we'll build on top of Jetty.
  5. A Javascript client-side User Interface.


One of the specific things we're going to build into our service is a 'tags'
mechanism as popularized by delicious. Users will be able to publish tags to
categorize items of interest and other users will be able to pivot on those
tags to discover items of like interest.

We are going to push RDF quite hard. We will write a lightweight persistent and
embeddable RDF triple store in Java - possibly being the first people to do so.
This will be the cornerstone of our application and represents significant
value even beyond this particular project. We'll also seek to use official RDF
vocabularies as much as possible. We want to have something that is not only
functional for our own use but that can interact with the rich ecology of the
web - publishing data via RSS or RDF/XML to a wide variety of other services.

We are also going to push Javascript quite a bit to express the client side
interface. Again we will seek to build fairly powerful components that will
have significant reuse for other projects.

Overall the pattern of the finished project is to build an XML driven
web-service built on top of industrial strength concepts that can be re-used
for almost any conceivable knowledge management application.
To accomplish all this you will need these third party pieces:

  1. Sun's Java SDK ( 1.3 is ok )
  2. PERST
  3. Jena's ARP
  4. Jetty
  5. Ant (optionally)


The results should be quite fun to drive and fairly industrial. Let's take a
look at some of the ideas next.

Signalling



Here we're just going to muck about with casual observations about what
it means to have a 'social content' system. Ideally we'd like to end up with a
laundry list of constraints that can guide our choices.

There's an old saw that goes "actions speak louder than words". A car can have
its left signal flashing but be travelling blindly down the road not turning at
all... Or oncoming traffic may suddenly and mysteriously slow down suggesting
the presence of a fine officer of the law doing his part to help keep a
community orderly - or even just a kid crossing the street without illuminating
the crosswalk signal.

In vehicular traffic drivers wheel and race making moment-to-moment decisions
on the basis of each other's inputs; signalling to each other in a variety of
both intentional and unintentional ways. As a participant you end up creating a
mental model of the things around you, the situational landscape, and the best
navigation choices.

On the net there is a potential for similar behavior.

If we could just watch what people "do" instead of what they "say" we might
find that the quality of knowledge we're getting from them is actually higher.

People on the net do of course signal to each other with a variety of
intentional and explicit mechanisms. There are countless blogging services,
craigslist, vanilla websites, listservs, email, wikis, p2p networks, irc, sms
and on and on.
But that space has started falling over. There is incessant spam, and almost
everything has become saturated with 'adwords by google'. The language and
phrasing of traditional content has steered sharply towards maximizing ad
revenue. The intentional signals are polluted and noise-ridden.

Watching flocks of humans pinwheel about has up until now been the domain of
web portals. Now we're seeing this become more democratic as new p2p
psychographic behaviour tracking services such as A9 and Ask Jeeves are rolled
out.
The newer services that are emerging seem to have few parallels to existing
services. Wikipedia of course does offer social benefit but it has content
organized and massaged by hand. Orkut, Friendster, Multiply, LinkedIn are social but don't have any particular organizational utility; there is no personal activity that others observe - most behavior is explicit. CraigsList and Meetup and Upcoming do provide community but the signalling is all explicit again.

In automating the synthesis of many people's observations there is perhaps an
immediacy, a lower latency between oneself and one's peers. Perhaps this
satisfies an instinctive need for a sense of connectedness. The best I can say
is that delicious seems more 'human' than say 'google news' or many of the
other sites I look at on a daily basis.
Can we get anything specific from all this? Here's a grabbag of constraints:

  1. Let you publish citations to books, urls, photographs and other digital ephemera.

  2. Categorize your observations using tags

  3. Pivot on tags and observations to find more like tags, observations and people.

  4. The application should be embeddable; running on personal systems, not just
    LAMP / UNIX.

  5. Small

  6. Fast

  7. Clean understandable source

  8. Standards based

  9. A foundation for future projects

  10. Strong social discovery aspects

  11. Recommendation

  12. Tag clustering using LSI techniques (or any technique that comes to mind)

  13. Perhaps even a general purpose content management system.

  14. Run on Java 1.2 or any 'older language' - don't require bleeding edge ( to
    improve embeddability and portability ).

As we find ourselves employing capricious aesthetics to arbitrate between
technology choices we can bolster this list.

Getting our hands dirty


One thing that we do know is our service is simply a web-site.
We don't have to think much about "what kind" of web-site yet. And in fact we'd
prefer not to. We'd like to pluck away all the orthogonal pieces and erase them
from consideration as early as possible.
Since this "serving web content" is a well defined goal we can at least take it
off our list. This will reduce the total number of things that we have to think
about.

  1. In broad strokes our candidate applications for serving web-pages are going to
    be either Apache, Mason, Tomcat, Jetty or even possibly just mod_perl or cgi
    support. My personal experience is that perl, Mason and mod_perl have too many
    dependencies to ever be run in embeddable environments. Admittedly these are
    extremely pleasing and rapid development tools but one of the constraints of
    this project is portability across devices - where those devices are not
    necessarily running full blown LAMP or UNIX capable operating environments.
    Java is the only language that currently has widespread portability (well C# as
    well). Java is slightly more available - C# Mono for example has just started
    running on OpenBSD 3.6 and is not stable.

  2. In Java we have a choice between Tomcat or Jetty. NetKernel and other 1.4 NIO
    driven systems are also mighty appealing but I'd like to stick with a 1.2 or
    1.3 compatible environment. Jetty has become my personal favorite web-server of
    choice because it is quite visible and transparent all the way to the bottom.
    You can run it in an embedded mode and step-trace the logic all the way through
    to see what is going on. Being able to debug the application without having to
    'attach' to a running warfile does tremendously expedite development.

  3. Of course we are going to want to present a dynamic interface to the user.
    Traditionally people use Velocity templates or JSP templates to wrangle the
    user interface. In our case we are going to use Javascript and have the server
    serve static web-pages and dynamically generated XML content. What that means
    is that for now we do not have to think about how we are going to develop a
    server that serves dynamic content. We just have to think about the basic
    server core.

  4. Obviously we need a main() entry point of some kind. Presumably it starts up
    Jetty and does Jetty like things. Such as starting up a Jetty Resource Handler
    that will handle the incoming user web requests.

Overall then we're looking at some kickoff code that goes something like this:

static public void main(String[] args) throws Exception {

    // roughly the standard Jetty 4/5 embedding dance ( org.mortbay.http.* )
    HttpServer server = new HttpServer();          // the jetty server
    SocketListener listener = new SocketListener();
    listener.setPort(8080);
    server.addListener(listener);
    HttpContext context = new HttpContext();       // the jetty context
    context.setContextPath("/");
    server.addContext(context);
    context.addHandler(new Session());             // our subclassed jetty handler
    server.start();

}


Here we're not bothering to package up the system as a servlet. We want this to
be easily accessible to the debugger and we're basically in a hurry overall. We
want to build the whole project in less time than our boredom threshold.
Considering how to package something as a servlet will multiply the total
number of considerations in this project and create spurious complexity.
The Jetty Documentation tells us that we need to subclass a Jetty Resource
Handler to do actual work. In this case we invent a 'Session' concept that will
be responsible for replying to user requests as per our application. In broad
strokes this will look like this:

public class Session extends org.mortbay.jetty.ResourceHandler {

    // the real Jetty handler entry point also receives the request path and the
    // request / response objects; the body here is still broad-strokes pseudocode
    public void handle( /* path, params, request, response */ ) {

        // if the request is for a vanilla web page then just return it
        // if the request is a database query then pass it off to some kind of
        //   query handler we are about to detail out
        // return the query results as an xml graph

    }

}

Our Session Handler above will be shallow. We're going to push most of the work
off to an XML query handler layer.
One complexity that we have to keep in mind is that multiple response handlers
can be active at the same time so we'll have to remember to put semaphores or
synchronized blocks around any code that isn't thread-safe. This will require a
careful audit of the project when it is done.
Now that we can "start up our app" we need to pick another piece to do. Our
choices are the query layer or the user interface. But it really does seem like
we are going to have to do a bit of real work now and deal with our actual
persistent datastore. Since we have a main() entry point we should be able to
do quick tests of anything we now write.
The actual code for the above should be in the tarball at the end of this
project.

Writing the Triple Store


The first piece of real work is to write a lightweight RDF Triple Store. This
section will get the most discussion in fact; there are many details here.

Again here we don't have to think much about "what kind" of application is
going to use the triple store. In a sense we're making a decision that will
enforce design a priori - because of previous experiences I've had with RDF and
influences I've gotten from other people who have used RDF quite successfully.
RDF is a perpetually emerging grammar for expressing the relationships between
objects. It will be the cornerstone of this project and just about every other
project that we walk through. We're actually going to use RDF/XML - one way of
expressing RDF.

Parsing


One of the things we need to do is to load up RDF content off disk. Although
we're interested in writing a datastore we're actually not that terribly
interested in writing an RDF parser. And excellent ones already exist. To load
content into our RDF database we'll use Jena's RDF parser called ARP:

http://www.hpl.hp.com/semweb/jena2.htm
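
In broad strokes, wiring ARP's stream of triples into our store looks something
like the sketch below. I'm quoting the ARP calls from memory and the
handler-registration method has moved around between Jena releases, so treat the
exact signatures here as assumptions to check against the Jena javadocs:

import com.hp.hpl.jena.rdf.arp.*;

public class Loader implements StatementHandler {

    // ARP calls one of these for every triple it parses out of the RDF/XML
    public void statement(AResource subj, AResource pred, AResource obj) {
        // store { subj.getURI(), pred.getURI(), obj.getURI() } as a triple
    }

    public void statement(AResource subj, AResource pred, ALiteral lit) {
        // store { subj.getURI(), pred.getURI(), lit.toString() } as a triple
    }

    public void load(java.io.Reader in) throws Exception {
        ARP arp = new ARP();
        // in some Jena releases this is arp.getHandlers().setStatementHandler(this)
        arp.setStatementHandler(this);
        arp.load(in);
    }
}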

Storing


Another thing we need to do is to store stuff. We are not really keen on writing
a BTree on Disk or some other storage system. Java 1.4 does support NIO -
memory mapped IO and it is somewhat appealing to write our own system based on
that - but this takes us out of Java 1.2 land and breaks a design constraint.
Also there are some rather bizarre systems such as Prevayler which offer
transparent persistence but I'm just not sure about the idea of inhaling
hundreds of thousands of RDF triples every time we start up - regardless of
performance. In this case we're going to go with PERST - which is a very nice
datastore written by some crazy Russian guy:

http://www.garret.ru/~knizhnik/perst.html

Side Note:
We could in fact avoid writing our own triple store if we used
Jena or Kowari:


http://www.hpl.hp.com/semweb/jena2.htm

http://www.kowari.org

And in fact we could just grab an open source blogging tool
off the shelf:


http://www.opensourcecms.org

http://wordpress.org

We're not going to go with the completely off-the-shelf
solutions in this project because:


  1. Part of what we want to do is to build a
    system that we can understand all the way to the bottom. Getting a comfort
    level with RDF and
    a few of its peculiarities will let us make
    better decisions when we want to pick that industrial strength RDF store for
    subsequent projects.

  2. As well we may want to run this project as a mini-server on a
    local home PC or even on a cellphone type device. Our approach should be light
    enough to at least run on a circa 2004 HP IPaq and possibly even on newer
    smartphones.

  3. The fact is that building a triple-store is *not* hard given
    the power of tools such as PERST and ARP which we are going to leverage
    heavily.

  4. The main thing we're going to miss is a real query language.
    In fact even in avoiding a full blown query language one effectively ends up
    writing one's own. Query languages do introduce complexity and unpredictability.
    But mostly what we're trying to do is to understand the landscape of RDF; how
    and why exactly one works with RDF 'all the way down to the bottom' in a sense.

As far as I know nobody else has written an embeddable
persistent Java based RDF Triple store yet. As soon as one comes out we can
chuck all of this code out the window - but to achieve our learning and
portability constraints we are (for now) forced to use a solution that we write
ourselves.

Mapping Object Representation to RDF


Another big question - possibly the biggest question of this entire project - is
what is the best mapping between RDF (say in an XML file) and RDF in memory.
There are a number of excellent W3C sponsored articles on RDF mappings to RDBMS.
(In this case we're looking for a mapping from RDF to an OODB - but the ideas
are the same.) These articles:

http://www.w3.org/2001/sw/Europe/reports/scalable_rdbms_mapping_report/


and

http://www.w3.org/2001/sw/Europe/reports/rdf_scalable_storage_report/


talk about some of the data-type requirements and implementation issues that an
RDF store might have, for example. Some of the completely reasonable
considerations they cite are:

  1. Text Searching
  2. Storing URIs efficiently
  3. Supporting Datatypes ( int, float, string )
  4. Supporting RDF Containers
  5. Supporting RDF Schemas
  6. Inferencing / rules and reasoning hooks.
  7. Triple Provenance ( tracking what website a triple came from )

We're actually going to respectfully ignore quite a bit of this good advice -
but it is worth reading.
Our RDF database is going to have only a single kind of persisted object - an
RDF triple. Where an RDF Triple consists of a:

{ Subject, Predicate, Value }

Each of these parts can be represented in Java:

  1. 'Subject' represents a canonical URI string describing the topic at hand. It
    can be represented by a Java String.
  2. 'Predicate' describes a relationship such as "knows" or "owns" and can also be
    represented by a Java String. There is some argument that for conservation of
    memory one could store the XMLNS encoding of the predicate such as 'geo:long'.
    We'll store the whole unrolled predicate for now and revisit the idea later on
    possibly.
  3. 'Value' is either a literal such as "12" or "Mary" or alternatively it is a
    reference to another Subject. This can be either a literal of type integer,
    float, String or another Subject reference. Another consideration might be
    different language encodings for values. And yet another consideration might be
    providing full text search on the Values as well (probably best done using
    Lucene). We are going to just treat this as a string and not actually
    differentiate except by context of usage.

In Java our simple triple container would look like this:

// Persistent is PERST's base class for stored objects ( org.garret.perst.Persistent )
public class Triple extends Persistent {

public String sub;   // subject   - the canonical URI of the topic at hand
public String pred;  // predicate - the relationship, stored fully unrolled
public String val;   // value     - a literal or a reference to another subject

}

Side Note:
Even if we're not going to be formal we should be at least
aware of the weaknesses of both the data model and the representation of that
data model being used here:


  1. It duplicates the same 'Subject' and 'Predicate' and 'Value'
    Strings over and over in the cache and even on disk. This is quite wasteful.
    Often in fact (say when implementing a PostgreSQL based store) one would index
    all the strings once only in a shared global index. The triple-store can then
    just be integer keys that refer to the String Index. The problem with one
    common bucket of strings is that one doesn't know if wildcard matches are
    returning Subjects, Predicates or Values without also checking the triples -
    and this can be slow. An alternative would be to have three string pools.
    Another alternative could be using a key that is say an md5 or sha1 hash of the
    string in question; thus allowing exact searches without having to go to disk
    to discover the string's key value first. In any case these approaches are easy
    enough to retrofit under a working RDF store later on.

  2. This approach doesn't specify the 'type' of a Value. One could argue that it is the role of an OWL based description to formalize those facts. The system will be able to store OWL terms just as it stores ordinary
    RDF content but at the same time we're not going to be writing any code to validate a collection of RDF triplets against an OWL definition.

  3. Some people would also add 'provenance' here - turning the triple into a quad and tracking the originating site of each triple. For our purposes we will simply treat it as a 'String' for now and revisit the issue
    later on possibly.

  4. Yet more bulky RDF triple stores might specify 'owner' concepts on each triple for fine-grained privacy. I prefer to have concepts of ownership be 'in' the grammar itself rather than going from triples to quads.
  5. Another consideration might be to date-stamp triples as well - again something we're not doing.

Another way to store RDF triples would be to bind all triples associated with a given subject as a single Subject node. Doing this in Java
would look like so:


public class Reference extends Persistent {

public String subject;
public Hashtable values = new Hashtable();

}


Although we're not doing it this way - this second way does
have a subtle advantage. It would allow a query engine to operate across
disjoint database back ends. For example you might have a spatial database and
a vanilla subject-sorted keyword index and you might want to return some
features from each. Since each reference is fully self contained you could
easily emit a stream of blended features - without having to duplicate those
features into each database. This is a significant benefit - but again
something we're not doing.

Yet another way to do this would be to use an IDL to generate your java objects from an OWL definition. This is completely insane but I can see cases where people might do it:

public class MyRDFPerson {

public String uri;
public int age;
public float height;

}

We are going to use the first approach; however we will wrap the triples inside of a Reference class as shown above so that from the
outside you won't really care about the implementation that much - and in fact
it will be very easy to swap implementations even as far as switching to Jena
or directly backing your persistence requirements with PostgreSQL.

Here is what that Reference class is going to look like:

public class Reference {

String uri;
public String get(String predicate);
public String set(String predicate, String value, boolean allowDuplicates );

}

The rules we'd think of normally associating with set() would say that duplicate
predicates are not allowed per subject. In a Java class for example you can't
say "int myvalue; int myvalue;". But in RDF this method can explicitly allow a
given predicate to be declared more than once if allowDuplicates is true. You'd
typically want an rdf:Bag instead, however. Let's say for example that you wanted
to associate several tags with a given subject - you'd declare a child bag that
belongs to that subject and have that child cite all of the tags in question.
At this stage we have a concept of a 'Reference'. This acts as a bag for
predicates and values associated with a given Subject.
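
To make that concrete, here is roughly how application code might end up using
these calls to hang a few tags off a post. The predicate URIs are illustrative
only, and I'm assuming the database layer described further below can hand us
Reference objects for a given subject:

// assume the database layer (next section) hands us Reference objects by subject
Reference post = database.get("http://ourserver/anselm/posts/42");

// quick and dirty: repeat the predicate, relying on allowDuplicates
post.set("http://ourserver/ns/tag", "elephant", true);
post.set("http://ourserver/ns/tag", "africa", true);

// tidier: declare a child rdf:Bag that belongs to the post and cites each tag
Reference bag = database.get("http://ourserver/anselm/posts/42#tags");
bag.set("http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#Bag", false);
bag.set("http://www.w3.org/1999/02/22-rdf-syntax-ns#_1", "elephant", false);
bag.set("http://www.w3.org/1999/02/22-rdf-syntax-ns#_2", "africa", false);
post.set("http://ourserver/ns/tags", bag.uri, false);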

Sticking things together


What we need now is actual persistence and a way to manufacture and store
handles on our Reference objects. Basically now we're going to just glue all of
the pieces into one huge blob called 'Database'.
So this is where we call upon PERST to do the heavy lifting for us:

import org.garret.perst.*;

class DatabaseRoot extends Persistent {

FieldIndex subs;   // triples indexed by subject
FieldIndex preds;  // triples indexed by predicate
FieldIndex vals;   // triples indexed by value

}

This incantation declares 3 persistent field indexes using PERST. Now when we
commit triples into the database we commit them to all 3 indexes. And to query
for any triple we can query any of the indexes.
PERST supports range queries, exact queries, and "subject starts with" string
queries. Queries can be done in forward or reverse index order.
For our needs this will suffice. For example:

  1. If we want to discover all triples whose subject begins with http://playground then we ask PERST to efficiently search the subs FieldIndex for subjects with that term.
  2. If we want to discover say all predicates that are http://www.w3.org/2003/01/geo/long
    then we do something similar.
  3. If we want to discover all values that are say > "118.35" and < "120.35"
    we can do that as well using PERST.
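
To make that a little less abstract, here is roughly what opening the store,
committing a triple and running a "subject starts with" query look like. I'm
quoting the PERST calls from memory, so treat the exact method signatures as
assumptions and check them against the PERST javadocs:

Storage db = StorageFactory.getInstance().createStorage();
db.open("triples.dbs", 32 * 1024 * 1024);          // file name and page pool size

DatabaseRoot root = (DatabaseRoot) db.getRoot();
if (root == null) {
    root = new DatabaseRoot();
    root.subs  = db.createFieldIndex(Triple.class, "sub",  false);
    root.preds = db.createFieldIndex(Triple.class, "pred", false);
    root.vals  = db.createFieldIndex(Triple.class, "val",  false);
    db.setRoot(root);
}

// committing a triple means putting it into all three indexes
Triple t = new Triple();
t.sub  = "http://playground/anselm";
t.pred = "http://purl.org/rss/1.0/title";
t.val  = "Anselm's stream";
root.subs.put(t);
root.preds.put(t);
root.vals.put(t);
db.commit();

// a "subject starts with" query expressed as an inclusive range scan
IPersistent[] hits = root.subs.get(new Key("http://playground"),
                                   new Key("http://playground\uffff"));
for (int i = 0; i < hits.length; i++) {
    Triple hit = (Triple) hits[i];
    // do something with each matching triple
}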

However to do more complex queries such as say find all things that are within a
certain value of predicate "geo:long" and predicate "geo:lat" we have to issue
multiple queries and do explicit joins by hand. Technically speaking however
one can actually avoid fully explicit joins (where one has a full copy of each
set) by using java code to iterate through the second set with the first set in
hand. (In the particular case where we are doing something that looks like a
spatial query - we could use the spatial indexing that PERST provides).
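
For example, a hand-rolled join for that geo:long / geo:lat case might look
something like this - again just a sketch against the indexes declared above,
using the same predicate URIs as the earlier example:

// all triples whose value falls inside the longitude range
IPersistent[] longs = root.vals.get(new Key("118.35"), new Key("120.35"));
for (int i = 0; i < longs.length; i++) {
    Triple lng = (Triple) longs[i];
    if (!lng.pred.equals("http://www.w3.org/2003/01/geo/long")) continue;

    // second half of the join, done in code: walk that subject's other triples
    IPersistent[] same = root.subs.get(new Key(lng.sub), new Key(lng.sub));
    for (int j = 0; j < same.length; j++) {
        Triple lat = (Triple) same[j];
        // string comparison, just like the range query above - fine for a sketch
        if (lat.pred.equals("http://www.w3.org/2003/01/geo/lat")
                && lat.val.compareTo("33.0") > 0
                && lat.val.compareTo("35.0") < 0) {
            // lng.sub is inside the bounding box
        }
    }
}
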
There's one more piece on top of all this that we need to add. We need some
concept of an overall "database" that can yield instances of References that
the application logic can then manipulate. That database layer will wrap PERST
completely; making it invisible to the outside world and will look something
like this:

interface Database {

public Reference get(String key);

}
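
Gluing the two together, an implementation of that interface over the PERST
indexes might look roughly like this - a sketch rather than the exact code in
the tarball:

public class PerstDatabase implements Database {

    DatabaseRoot root;   // the persistent root declared earlier

    public Reference get(String key) {
        // fold every triple with this subject into one Reference object
        Reference ref = new Reference();
        ref.uri = key;
        IPersistent[] triples = root.subs.get(new Key(key), new Key(key));
        for (int i = 0; i < triples.length; i++) {
            Triple t = (Triple) triples[i];
            ref.set(t.pred, t.val, true);
        }
        return ref;
    }
}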


With a little bit of glue this layer is basically done. Please refer to the
associated tar-ball for the exact details.
Now we're done with most of the hard stuff. We just have to think about the user
experience and build out some UI. Actually that will also be quite a bit of
work - but hard in a different way - as we wander a thicket of possible UI
choices next.
Side Note:
A lot of people wonder if RDF is really any kind of
improvement over other ways of expressing objects. People often complain that
RDF/XML is overly verbose and not human editable for example. And people do
wonder if the same content couldn't be packaged under some other schema
altogether. Here are some of my thoughts as a first-time-user from playing with
RDF over the last few months:


  1. These days I find it easiest to think of RDF documents as simply big buckets full of triplets consisting of { subject, predicate, object }. This corresponds fairly well to the structure of a simple English sentence
    being { subject, verb, object }.

  2. In RDF one can talk about 'decorating' any arbitrary subject with any arbitrary fact. It can be a reasonable way to think about and verbally discuss RDF system architecture in general - having something of a 'tools not rules' or 'just do it' flavor that can expedite thought.
  3. Subjects can be extended later on in the development process - meaning less time spent anticipating and pre-planning the system. Pre-planning and discussion time can be exponential with the number of elements
    that need to be considered and RDF can help de-stress that part of the work.

  4. RDF implies an underlying database model (for better or worse). If you're using RDF triplets consisting of { subject, predicate, object } then you're likely to find yourself somewhat coerced into having a database
    implementation that consists of a huge bucket of RDF triplets (rather than say
    one with a lot of specialized schemas that are being translated to RDF
    dynamically).

  5. RDF forces debate up a level of abstraction. Using RDF/XML specifies agreement not only on the transport notation (XML), but now also on
    the database structure. Since RDF specifies a grammar - not simply a syntax -
    it seems to coercively imply how that grammar is stored at least to some
    degree. Agents written to crawl one RDF database can potentially crawl another
    one that they were not originally meant to consider. Where people used to argue
    about how to transport data and how to structure meaning in the data, now they
    argue about what the words mean.

  6. RDF is something of a 'universal solvent'; things tend to be dissolvable in RDF whereas they are not dissolvable in other grammars. Some of the tension with other grammars such as VRML, GML and the like comes out
    of this fact: people want to represent extremely diverse collections of facts.
    Even if you can't succinctly represent a concept as a single RDF triplet you
    can pretty much always find some transformation of your original idea into two
    or more RDF triplets.

  7. Often (in other grammars) facts that were not core
    considerations are attached as a kind of barnacle. Late arriving concepts are
    not considered to be first-class citizens. Because of this classical grammars
    often attempt to re-invent the wheel 'better' - rolling in all the new
    thinking. Grammars such as SVG, Flash, VRML, GML and Avalon 'steal' ideas from
    other grammars - re-implementing and repackaging them - whereas with RDF you
    just 'use' the snippets of the other RDF vocabularies that you like. One
    example of this is the 'Locative Packet' which in and of itself specifies no new
    vocabulary but simply denotes a convenient intersection of already existing
    vocabularies:

    http://locative.net/workshop/index.cgi?Locative_Packets

  8. RDF has an appealing simplicity and formality. It is quite pleasing for example that OWL (a grammar for specifying the legal attributes of any RDF subject) is itself in RDF. In other languages such as Java or any IDL there are separate notations for specifying 'abstract' versus 'instance'. Even XML has the infamous DTD notation which is not itself XML. This lucky happenstance of RDF seems to be more than just accidental thinking - it looks like there were many predecessor ideas that ended up emerging here such as this
    paper on Associative Databases:

    http://www.lazysoft.com/docs/other_docs/AMD.pdf
    and these general comments on database normalization:
    http://en.wikipedia.org/wiki/Database_normalization


  9. A weakness of RDF is that work is pushed over to logic. Instead of having a declarative schema that fully constrains an object one
    tends to ignore constraints and simply use application logic to traverse the
    complex relationships that describe an object's state. The fact that a person
    may belong to an organization for example can be expressed in OWL but is more
    likely - practically speaking - expressed implicitly in the logic that walks
    persons and finds organizations they belong to. This might be as simple as an
    RDQL query or could be as complicated as explicit hard-coded logic in the
    application. Ultimately what is needed is a programmatic model of RDF where
    the OWL schema is itself directly exposed to the procedural logic.
    FABL for example moves
    in that direction.

Navigating via Tags, Streams and Crumpled URLs


At this point we have a way to serve content, and we have a way to store
content. Now we have to consider exactly how the user is going to interact with
content.
Here is where we move into the thinking that specializes the design away from
being any generic web driven database application.
We do know that effectively we're building a CMS - it understands what a user
is, what posts are and how to perform various useful queries, and it enforces a
permissions policy such that users cannot overwrite each other's space. The
kinds of concepts we need to manage include:

  1. Users
  2. User posts
  3. User tags
  4. Perhaps some statistics as well

We also have a list of constraints from our earlier design talk.

Users and Posts



One thing we do know is that there will be users and user accounts. Presumably
users have preferences as well.
As well users will make 'posts'.
These roles seem fairly clear. We can use FOAF to define people. And for posts
we just define some RDF predicates in a vocabulary to capture basic post data.
In fact we don't even have to do any work - we can just use RSS as is with
<title> <link> and <description> being perfectly adequate.
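
To give a feel for what a stored post might look like when it goes back out the
XML gateway, here is a rough RDF/XML sketch mixing a FOAF person with a vanilla
RSS 1.0 item - the resource URIs are made up for illustration:

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:foaf="http://xmlns.com/foaf/0.1/"
         xmlns="http://purl.org/rss/1.0/">

  <foaf:Person rdf:about="http://ourserver/anselm">
    <foaf:nick>anselm</foaf:nick>
  </foaf:Person>

  <item rdf:about="http://ourserver/anselm/posts/42">
    <title>A post about elephants</title>
    <link>http://example.com/elephants.html</link>
    <description>Something I noticed about elephants today.</description>
  </item>

</rdf:RDF>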

Tags


Tags are a new concept here and get a little bit more discussion.
Let's cite a few things that tags do:

  1. Tags are introduced as a mnemonic to help users recall their own posts later
    on. A user can categorize a new post under any arbitrary string that they wish
    such as say 'politics, satire, humor' or say 'politics, art, prague'. The user
    can then see posts under a specific topic or intersection of topics and this
    helps with overall recall.

  2. If enough users have similar ideas about similar tagging systems then
    presumably even in groups you'll begin to see certain tags evolve and become
    representative of certain ideas. In the dating advertisements in the back of
    magazines for example you often see 'm4w' or 'w4m' as examples of tags that
    have evolved to represent certain ideas.

  3. Tag naming can get out of control if we are not careful. We may have to enforce
    some tag naming constraints. In this system all user tags are lower case, may
    start with a number, must not have any symbol in them except '/' and may not
    have spaces in them (even with quotes). Hierarchical tags are allowed although
    their value is low and they are treated as single atomic tags in most cases. As
    well 'sys:' and 'system:' are reserved. (A small sketch of this normalization
    appears just after this list.)

  4. Tags can also be used internally to categorize system concepts. There is a
    somewhat seductive power to a tag engine. Once you have one typing system then
    it becomes increasingly convenient to be able to do all of your filtering
    against that type system. A 'system:subscription' or a 'system:friend' or a
    'system:ignore' tag could be attached to a user post to indicate that the post
    is about another user whom the posting user is subscribed to, friends with or
    ignoring. If we weren't using tags then we might have defined our own RDF
    Vocabulary to explicitly capture concepts such as 'subscription' or
    'friend' and would have a system that was actually less flexible (as the query
    layer will show). At the same time, by migrating system concepts up to the
    level of tags we are in a sense stepping outside of RDF a bit - it means other
    third party consumers of our RDF feeds have to have special logic to understand
    exactly what class of object an object is.
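
Here is that tag normalization sketched as a plain Java helper - the method name
is made up, but the rules are exactly the ones listed above:

// lower-case the tag, drop anything that isn't a letter, digit or '/',
// and refuse the reserved 'sys:' and 'system:' prefixes
public static String normalizeTag(String raw) {
    if (raw == null) return null;
    String tag = raw.trim().toLowerCase();
    // 'sys:' and 'system:' are reserved for internal use
    if (tag.startsWith("sys:") || tag.startsWith("system:")) return null;
    StringBuffer out = new StringBuffer();
    for (int i = 0; i < tag.length(); i++) {
        char c = tag.charAt(i);
        // only letters, digits and '/' survive; spaces and other symbols are dropped
        if (Character.isLetterOrDigit(c) || c == '/') out.append(c);
    }
    return out.length() == 0 ? null : out.toString();
}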

Note that there isn't any particularly deep reasoning as to why we're using
tags - it's just an easy, convenient, brief and memorable concept for users.
At the same time there is quite a bit of formal discussion on voluntary
categorization, prototype theory and the like. You can read some of the
literature in cognitive psychology for more discussion of these topics - in
particular Eleanor Rosch and George Lakoff. But at the same time it's probably
best to think of tags as a simple colloquial concept and not to read too much
into them.
Here is a bit of a ramble about some of the thinking however. One essay that I
like to drag out even now is:

http://citeseer.ist.psu.edu/taivalsaari96classes.html

I like to use the made-up phrase 'platypus effect' to capture a bit of
the ideas expressed by Antero Taivalsaari:


http://www.advogato.org/article/83.html


At the time I was puzzled by finding ways to categorize knowledge -
wanting to build all kinds of complicated virtual file systems and the like. (
I sometimes wonder if Ma Bell didn't invent C++ and OOP abstraction because of
their problem domain - dealing with millions of identical phone records. If Ma
Bell had been say a games developer instead they might have encouraged
something that dealt better with lots of heterogeneous types. )

But Del.icio.us tags pretty much demonstrated that this was actually
trivial - and that thinking about this too much is basically just a waste of
time.

Crumpled URLs


The URL presents a very small text space within which a number of not completely
orthogonal concepts are being 'crumpled'. We are effectively trying to
represent a set of slightly irrational 'human shaped' ideas within a few dozen
bytes. The URL space should be:

  1. Memorable. To have an URL scheme for the site overall that is simple enough and
    clear enough that it can act as a mnemonic for the user. The user should
    ideally be able to type in an URL with various path and parameter qualifiers
    and have their browser retrieve specific content at that path. The user should
    not be required to visit the site and navigate solely by mouse-clicks.
  2. Unique. Each given page of a given type of content should uniquely map
    to an URL and vice versa. Some sites that don't conform to these simple rules
    cannot be bookmarked; the user must manually navigate back to the site and page
    in order to retrieve the content.
  3. Key concepts dominate. In general the most important concepts that the service
    provides should be URL addressable in the URL path itself. Secondary concepts
    can be reached by '?' style parameters.
  4. Ego dominates. Users simply enjoy having their name be visible in the URL
    space.
  5. Tags dominate. Tags are an important concept and should be visible in the URL.

Streams


Del.icio.us uses an especially nice pattern where the url path represents a kind
of 'sum of children streams'.
We're going to do something similar, where the URL is effectively broken into:

[ domain ] / [ username ] / [ tag ] [ ?styles ]

Each parent folder sums up all of the content of all children folders. It's an
intuitive and useful metaphor. It even works with hierarchical tags.
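
On the server side that crumpling can be undone with a few lines of Java - the
helper name here is made up for illustration:

// split an incoming path like "/anselm/politics/satire?rss=true" into its pieces
public static String[] crumple(String path) {
    int q = path.indexOf('?');
    if (q >= 0) path = path.substring(0, q);          // parameters handled elsewhere
    java.util.StringTokenizer parts = new java.util.StringTokenizer(path, "/");
    java.util.Vector pieces = new java.util.Vector();
    while (parts.hasMoreTokens()) pieces.addElement(parts.nextToken());
    // pieces[0] is the username (or a reserved stream such as 'tag' or 'isbn');
    // anything after that is a tag in that user's stream
    String[] result = new String[pieces.size()];
    pieces.copyInto(result);
    return result;
}
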
An alternative pattern could be to do [username].[domainname]/[user tag path].
This is problematic simply for DNS management issues and because it ruins the
opportunity to use the domain name space for other kinds of more appropriate
overloading and precedence order. It is (arguably) more clear to humans to say
"portland.craigslist.org/anselm" than to say "anselm.craigslist.org/portland"
for example. So we won't do this.
Using a streams concept helps us work in RDF. There are some nice things we can
do in the database layer for indexing and discovering collections of facts
under a given stream or stream with a wildcard path.
Streams do create some worries and considerations however:

  1. There does need to be some concept of getting a stream of all tags independent
    of users. To accomplish this we can create a fake user called 'tag' and copy
    all posts to that user. Visiting http://domain/tag/elephant
    would yield all posts with the category elephant of all users.

  2. There is also quite a need to get at information in different 'styles' such as
    /person?rss=true. We are going to avoid as much as possible having reserved
    root path nodes and instead use parameter arguments where appropriate (
    avoiding /rss/person and favoring /person?rss=true ). This isn't quite "REST" [
    http://www.xfront.com/REST-Web-Services.html
    ] in that REST encourages
    using 'nouns' not 'verbs' - but the REST argument in this case isn't quite
    clear to me and we are using tags so we desperately want to minimize use of the
    urlspace for anything else.

  3. Another final consideration regarding streams: there are system folders and
    other internal things that effectively end up becoming reserved users. If for
    example we want to have a folder for all books such as '/isbn/' we would have
    to make sure that user is reserved. There is some argument to put all users
    under '/home/' but that is a terrific waste of root namespace and that
    root-namespace is highly valuable and highly sought by users for their own
    names. So we live with the slight irrationality and just crumple the concepts
    we need into the url space as best maps to human needs.

Now we're done thinking about the way the user "sees" the system.
We're not actually being terribly innovative here - just emulating patterns that
work. Hopefully though we get to play a bit more later on once these
foundations are in place.

The Query Engine


Since we have a model of user interaction - with streams and tags and all that
stuff - we need to figure out how we're going to drive that interaction. We
have to make a bridge between the user and the database engine.
We're going to want a query layer that can be directly queried by the client
application. This is not RDQL (although it could use RDQL or another query
language) but is tailored towards our specific application. It also imposes a
security wall so that users cannot pollute other users content.
Basically we just want a laundry list of the kinds of capabilities we need and
then we can pluck out commonalities and implement something simple that
translates these high level requests into actual indexed query lookups of our
RDF database.
Typical queries are probably:

  1. Return a list of all posts by a particular user
  2. Return a list of all posts by a particular user under a particular topic or
    'tag'
  3. Return a description of a particular user
  4. Return a count of all posts or posts in an area
  5. Return posts within a certain date range
  6. Return posts over a certain size
  7. Return a list of all topics
  8. Return a list of all users
  9. Return a list of all posts
  10. Return all posts with certain content
  11. Return all posts on a particular 'kind' of topic - such as a book, music,
    mime-type or other disambiguatable thing.
  12. Return a thumbnail of an url or a file
  13. Return administrative gateway views of all users and all posts
  14. Login a user
  15. Logout a user
  16. Accept a new user's description
  17. Accept a new post by a user
  18. Accept a subscription by a user to another user (ie accept other kinds of
    things not just posts)
  19. Accept a file
  20. Allow a sysadmin to delete or modify users and or posts
  21. Show statistics
  22. Throttle returned results; return today's or this week's or 10 results only.
  23. Return not individual posts but only 'unique' posts about a given URL. An url
    posted twice should show up once only.

The discussion of the actual implementation of the query engine is probably too
much detail for here. I'll let you look at the code to see the specifics of how
these queries were implemented based on this set of use cases.
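
As a hint of where that lands, the surface of the query layer might look
something like the interface below. The method names are illustrative rather
than the tarball's exact API, but each one maps onto a use case from the list
above:

public interface Queries {

    java.util.Vector getPosts(String user);              // all posts by a user
    java.util.Vector getPosts(String user, String tag);  // posts under a tag
    Reference getUser(String user);                       // description of a user
    int countPosts(String user, String tag);
    java.util.Vector getTags();                           // list of all topics
    java.util.Vector getUsers();

    boolean login(String user, String password);
    void logout(String user);
    void acceptPost(String user, Reference post);
    void acceptSubscription(String user, String otherUser);
}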

Javascript User Interface


This javascript stuff all sounds terribly mundane but actually it's quite
liberating - it means you as an engineer can get more stuff off your shoulders
and get other people to deal with it. That means much more leverage, more
people stirring the pot and more help overall.
What's happening is that a few web services now are starting to use Javascript -
and that's a pattern we're going to use.
Google's Gmail and Amazon's A9 service are good examples of this.
Historically most web services manufactured the user interface on the server
side using Mason, ASP, JSP or other such grammars. These solutions are actually
quite difficult for designers to work with and they create a security liability
in that the pages can express commands that permeate the security wall between
the client and server state.
A cool thing about Javascript is that we're able to ask the server to ship us
pure XML and then we (or our lackeys) can do the layout of that ourselves. We
can even have long complicated dialogues with the server - asking small
questions about users or state and making decisions based on that. We could let
a user try to create a 'shared discussion group' and then advise the user if
their group was made or if the name was already taken for example.
We're able to use the same patterns we would in an ordinary
not-split-over-a-network application.
Here's a general laundry list of the reasons Javascript is appealing:

  1. Javascript runs on the client side and is shipped as static content from the
    server, so there is less computation on the server.
  2. It can be vastly more responsive than any server driven application.
  3. Authoring tools can deal with Javascript much better than with ASP, JSP and the
    like.
  4. It is simply a nice separation between server responsibilities and client
    responsibilities.
  5. Using Javascript creates a practice of building XML gateways between server and
    client; this formalizes the server API.
  6. A well separated server with clearly defined roles can talk to any client - a
    native application or other visualization tools.
  7. Since there is a total separation between the server and the client it becomes
    possible to allow clients to create their own html pages and store them on our
    server. That means users could entirely customize the appearance of their own
    pages and we wouldn't have to worry about security issues. Many web services
    fight over look and feel - this makes that debate totally obsolete and a little
    bit silly.

There are some drawbacks to using Javascript:

  1. Browser portability problems.
  2. HTML and layout inconsistencies across browsers (that can be more easily
    treated on the server side).
  3. Slow client machines can be slow to rebuild display.

In the way we're going to use Javascript there are also a few seemingly bizarre
design choices. We're going to simply have a single html document on the server
that we're going to send to the client over and over. This single document will
change its appearance based on the current URL that the client is on. And what
this means is that we have to 'round-trip' form parameters back to the client
document for it to do work.
In a sense we are shipping an 'application' to the client - and even though HTML
is too stupid to know it - that application persists between pages and doesn't
have to introduce any new pages.
The Javascript application delivers the UI. That UI consists of pieces like
this:

  1. Present a users home page full of their stuff. A huge time sorted list of
    posts, citations, books, music - whatever was logged.

  2. Present alternate views such as calendars, time-maps, geographic maps or
    whatever.

  3. Present other users' pages full of their stuff as well.

  4. Show user tags so user can navigate by tag.

  5. Show all users tags together so users can pivot on tags to discover other
    similar posts and other similar users.

  6. Let users post new stuff.

  7. Let users fiddle with personal settings and profile

  8. Possibly show statistics

  9. Show recommendations; doing some server side computation.

There are going to be many UI pieces - but we can build them as we come across
them. It doesn't require a lot of pre-planning.
The amusing thing about a Javascript based application is that the HTML is
treated as just a launching point. There is almost no HTML at all:

<html>

<body>

<script type="text/javascript">

deal_with_entire_pages_content();

</script>

</body>

</html>

All of the work is done from javascript. It doesn't even make sense to draw
header or footer banners in HTML unless they are absolutely universally
constant.
The client application sits inside of our javascript code and more or less just
fulfills the list of UI pages that we want to have. It's largely a sequence of
functions that we pick between. We look at the users current URL and the
current parameters and then execute the appropriate subroutine to draw that
page.
In the case of this application a fly-over of the code at 10,000 feet might look
something like this:

  1. Detect if IE, Moz, Safari or another browser.

  2. Get and set cookies for tracking current user.

  3. Parse apart the current URL into [ domain ] [ user ] [ tags ] [ ? ] [
    parameters ]

  4. Get current user

  5. Get user being visited

  6. If the user is looking at the home page then manufacture a splash screen. This
    could be a root feed with all content.

  7. If the user is looking at their own page then present their own content with
    edit controls.

  8. If the user is looking at somebody else's page then present that page.

  9. If the user wants to see all posts under a common tag then direct the user to
    some unique url that can represent that concept such as /tag.

  10. If the user wants to see their personal profile then start using parameter
    space to express that such as /user?profile=true

  11. If supplied parameters indicate an 'older' date range then present that date
    range.

  12. Draw the current page full of items

  13. Draw a navigation calendar widget or whatever is used to navigate timewise
    through a pages collection.

  14. Draw any silly mumbo-jumbo header and footer.

All of these aspirations are going to be pinned on a small library of Javascript
functions. We're going to write some XML utilities, some layout utilities and a
few other bits and pieces. Overall the library will be something like this:

  1. Reading XML in Javascript from the server
  2. Writing XML in Javascript back to the server
  3. Drawing XML to the screen
  4. Input forms
  5. Some layout utilities
  6. Determining current user page and reacting appropriately

I don't have time to actually walk through exact code in this discussion. You'll
have to refer to the tarball for now. Later I may add more comments to this.

Conclusion


Here is the tarball. [ Well it's not
up yet but it will be in a week or so when I have a chance to finish it ].
These services are fun to build from a kind of mad scientist perspective. The
tools we have today to architect these large scale social systems are so
powerful and so easy to use that it can be as little as a few days work to
unleash an entirely new social application on an unsuspecting public.
If you're going to use this starting point professionally then there are other
considerations not covered here; such as finding ways to aggregate and or
federate content so that you can take advantage of laws of utility and avoid
walled garden effects. As well if you are deploying a commercial service based
on this code you may want to support some wiki like concepts so that users can
entirely customize their own experience.
What could you do with this?
You could make your own Craigslist such as discussed by Jo: http://frot.org/geo/craigslist.html
Your own personal knowledge tracking system - for tracking your habits or even
your finances.
Effectively this becomes a big bucket that you can pour stuff into. If used
personally it could become a hugely powerful tool for long term stuff
organization and management; from tracking habits, health, phone numbers and
other such often lost things to post-organizing existing collections of
duplicate archives and the like. You could attach an aggregator to this and do
say brute force geo-location of news-articles and project them onto a globe;
and then do peer based review of those articles or additional decoration of
facts from people who are on the ground in that area...
Really the sky is the limit.
In fact I originally started down this path with the hopes of writing a video
game. The idea of managing users, managing content and doing it all in a high
performance way came out of the kinds of demands that a large scale
locative-media multi-player experience would have. I ended up recognizing that
even building this foundation was a chore in itself and made just doing an RDF
based CMS the first goal.
The thing to do is to think about where all of these services are going over the
next 10 years. Clearly many of them are going to go away - and clearly others
will have to find ways to federate and share their knowledge.
Hope you had fun.
Please send me comments if you liked this essay to anselm@hook.org
. I'm also looking for ways to improve my understanding of this space so I'd
like to hear advice about better or more rigorous ways to build an RDF database
and to do embeddable persistence overall.
I'd like to thank Tangra, Maciej, Joshua, Brad Degraf, Dan Brickley and
especially Jo Walsh for getting me interested in RDF in the first place. All
mistakes are my own and many insights belong to these people.
- a

