Wednesday, October 02, 2013

URL structures and hyper media for Web APIs and RESTful services

A recurring theme on various mailing lists is that of choosing the "right" URL structure for a specific kind of web API. In this post I will present my view on this issue, based on input from, for instance, "API-craft" (https://groups.google.com/forum/#!forum/api-craft).

First of all, let me hammer it in: URL structuring has absolutely nothing to do with REST. Period. REST is not concerned about URL structures - in REST a URL is an opaque string of characters with no meaning beyond the fact that it is both an identifier and a resource locator. An on-line web service doesn't become a RESTful service just because it has a nice pretty looking URL structure. There is simply no such thing as "A RESTful URL".

What this means is that a URL like http://geo.com/countries/usa/states/nevada is just as "RESTful" (or non-RESTful) as http://geo.com/states/nevada, http://geo.com/states/321, http://geo.com/states?id=321 and http://geo.com/foo-bar-U7q. The URL structure simply doesn't matter in REST.

But from a human point of view it helps understanding if the API has some kind of meaningful URL structure. Computers may easily ignore URL structures but as humans we tend to look at URLs and try to infer meaning from that. Thus, having pretty and well structured URLs helps us understand what is going on - not only as client developers but certainly also as server developers who often have to navigate from URLs to source code - and having a well defined URL structure helps us with that process.

In order to discuss URL structures we need a domain to model. I think geographical information with countries, states and cities should be easily understood by most, so let's try that. I will ignore the fact that many countries don't have states ;-)

URLs as identifiers for entities

Our geographical domain easily lends itself to some kind of hierarchical structure of countries / states / cities. So intuitively we reach out for a hierarchical URL structure as shown below (for the sake of clarity I will ignore the host name and only show the path element of the URL). Let us try to build the URL for the city of Las Vegas in Nevada, USA:

  /countries/USA/states/Nevada/cities/Las+Vegas

That would work, but think a bit about it: what if the state of Nevada had more than one city called Las Vegas? How would we be able to distinguish between the two cities? The problem here is that we confuse searching for a city named Las Vegas in Nevada, USA with the concept of identifying a specific city.

I believe that it is fair to assume that most geographical systems will have some kind of backend that assigns unique identifiers to all of its entities. These may be integers, GUIDs or strings with composite keys - but in the end it boils down to a sequence of characters that uniquely identifies the entity in the system.

So let us assume that the well known city of Las Vegas is identified by the integer 82137 which is a unique city number. It may happen to be the same number which is used for a country or a state, but in the context of cities it is unique.

The same goes for countries and states: USA has the ID 54 and Nevada is identified by 7334. Now we get the URL:

  /countries/54/states/7334/cities/82137

But what happens if some client decides to look up this URL with mismatching IDs:

  /countries/54/states/8112/cities/82137

Well, that should be considered a non existing resource and the server should return HTTP code 404 Not Found.

But why bother at all with the overhead of checking state, country and city IDs when the city ID alone uniquely identifies the city? It would be easier for all parties if only the city ID was needed in the URL:

  /cities/82137

Now the server can do one single lookup by the ID to see if the referenced city exists. No need for any additional checking for matching state and country.

The same logic can be applied to states (and countries are trivial), so we end up with the following canonical URL structures for countries, states and cities:

  /countries/{country-id}
  /states/{state-id}
  /cities/{city-id}
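
A minimal Python sketch of the single lookup behind such a canonical URL could look like this. The CITIES table and the get_city function are hypothetical stand-ins for whatever backend and web framework the real service uses:

```python
# Hypothetical in-memory table standing in for the real backend.
CITIES = {82137: {"Name": "Las Vegas", "StateID": 7334}}


def get_city(path):
    """Resolve a canonical /cities/{city-id} path to (status, body)."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "cities":
        return 404, None
    try:
        city_id = int(parts[1])
    except ValueError:
        return 404, None
    # One lookup by ID; no cross-checking of state or country is needed.
    city = CITIES.get(city_id)
    return (200, city) if city is not None else (404, None)
```

The point is only that the handler needs one key to decide between 200 and 404.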

Should it happen that the server doesn't assign unique IDs to cities (or states), and really needs the state reference for a city, because two cities in different states may have the same (non-unique) ID, then we must include both in the URL:

  /states/123/cities/77 => Rome in Italy (assuming some state in Italy is identified by 123)
  /states/432/cities/77 => Rome in the state of New York

In the rest of this post I will assume that all cities and states have "globally" unique IDs.

Finding the right ID with UI dropdowns

But how does the client know what ID to use, you may ask? This depends on the application, but let's take the scenario where an end user needs to get information about the city of Las Vegas (while still assuming that Nevada may have two cities named Las Vegas).

The UI could be structured by three dropdowns: one for countries, one for states in the selected country and one for cities in the selected state. To present such a UI for the end user we first need to be able to get the list of all countries. The obvious choice for this resource is /countries. Then, for the selected country we need the list of states. The obvious choice here is /countries/{country-id}/states.

But what about the list of all cities for a specific state in a specific country? Let us avoid the trap of a hierarchical URL with multiple IDs and use the short /states/{state-id}/cities.

So now we have the following resources representing lists of geographical items:

  /countries
  /countries/{country-id}/states
  /states/{state-id}/cities

Each of these resources returns a JSON list as shown below and from this list the UI can easily build a dropdown element for selecting a city:

[
  { Name: "Item name A", ID: xxx },
  { Name: "Item name B", ID: yyy }
]


In this way the client gets the unique city ID by letting the end user select a city and its corresponding ID.

Query by text search

Another approach could be to use textual searching where the end user enters a query text like "Las Vegas, USA" (which is how Google Maps works). This would require a new query resource:

  /cities?query=Las+Vegas,+USA

The result would be a list of matching cities:

[
  { Title: "Las Vegas, County A, Nevada, USA", ID: 16352 },
  { Title: "Las Vegas, County B, Nevada, USA", ID: 82137 }
]

Now the end user can select one of the results and thus get the ID of the city.

Adding hyper media, getting closer to REST

The previously mentioned approaches require the client to create URLs by combining URL templates with IDs. This means the client has to be hard coded with the URL templates - and the consequence is a tight coupling to the URL structure of the web API.

But it is very easy to avoid this kind of URL coupling by using hyper media elements in the returned representations. Take for instance the list of cities matching the text "Las Vegas, USA"; here we can include the actual city URLs in the response instead of requiring the clients to construct the URLs themselves:

[
  {
    Title: "Las Vegas, County A, Nevada, USA",
    ID: 16352,
    CityLink: "http://.../cities/16352"
  },
  {
    Title: "Las Vegas, County B, Nevada, USA",
    ID: 82137,
    CityLink: "http://.../cities/82137"
  }
]


Now we can start talking about a RESTful service instead of a static web API: by including hyper media elements we allow the server to include links to other hosts that might be better suited to represent cities:

[
  {
    Title: "Las Vegas, County A, Nevada, USA",
    CityLink: "http://other-geo-service/jump.aspx?type=city&ID=16352"
  },
  {
    Title: "Las Vegas, County B, Nevada, USA",
    CityLink: "http://geo.com/cities/82137"
  }
]


By including links we have stopped worrying about URL structures and have come one step closer to a RESTful service.

The upside is looser coupling to server URL structures, simpler client logic and enabling the use of different services on different servers. The downside is a larger payload with bigger URLs than simple IDs.
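
To illustrate the simpler client logic, here is a small Python sketch of a client that follows the links it is given instead of building URLs from templates. The link_for helper is a hypothetical name, and the response body is the example from above rewritten as strict JSON with quoted keys:

```python
import json

# Search result as the server might return it (strict JSON, quoted keys).
response_body = """
[
  {"Title": "Las Vegas, County A, Nevada, USA",
   "CityLink": "http://other-geo-service/jump.aspx?type=city&ID=16352"},
  {"Title": "Las Vegas, County B, Nevada, USA",
   "CityLink": "http://geo.com/cities/82137"}
]
"""


def link_for(results, title):
    """Return the server-supplied link of the result the user picked."""
    for item in results:
        if item["Title"] == title:
            return item["CityLink"]
    return None


results = json.loads(response_body)
```

The client never looks at the structure of the link, so it does not care that the two cities live on different hosts with completely different URL styles.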

Filtering

So far we have looked at hierarchical data with some obvious URL structures. But what if we need to get the list of cities with a population of more than 200000 citizens? And what if we only want cities from the state of Massachusetts?

There are many different ways to do this depending on the complexity of the filtering. But it may be fine to start out with simple queries like "All cities in (Massachusetts or New York)"; first we need to use state IDs and thus we get "All cities in states (2321, 2981)". Such simple integer IDs can be separated with commas, so one possible URL structure could be:

  /cities?states=2321,2981

It is also possible to encode an SQL like query language in one single parameter:

  /cities?where=state+in+(2321,2981)+and+population+greater-than+200000

The possibilities are endless, but it usually consists of a path like /cities that identifies the type of query together with some set of URL parameters encoding the query specification.

A common solution is to interpret "&" as AND and "," as OR when possible. So for instance /cities?states=2321,2981&size=large,huge would mean "All cities where state is either (Massachusetts OR New York) AND size is either (large OR huge)".
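
A rough Python sketch of this "& is AND, comma is OR" convention, using only the standard library, might look as follows. The PARAM_TO_FIELD mapping and the field names of the city records are assumptions made for the example:

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical mapping from URL parameter names to fields of a city record.
PARAM_TO_FIELD = {"states": "state_id", "size": "size"}


def parse_filter(url):
    """Return {parameter: set of accepted values} for the URL's query string."""
    query = urlsplit(url).query
    return {name: set(value.split(",")) for name, value in parse_qsl(query)}


def matches(city, criteria):
    """True if the city satisfies every parameter (AND) via any value (OR)."""
    return all(
        str(city[PARAM_TO_FIELD[param]]) in accepted
        for param, accepted in criteria.items()
    )


criteria = parse_filter("/cities?states=2321,2981&size=large,huge")
```

Here matches({"state_id": 2321, "size": "huge"}, criteria) is true, while a city with size "small" or a state outside the list is rejected.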

And no discussion about filtering without mentioning OData's URL conventions: http://www.odata.org/documentation/odata-v3-documentation/url-conventions/

Handling large input filters

URLs for filtering may become rather large, so another recurring question is "How do I handle filter strings too large for a URL"? The recommended solution is to POST the filter to a query resource, for instance like this:

  POST /city-filters
  Content-Type: application/x-www-form-urlencoded

  where=state+in+(2321,2981)+and+population+greater-than+200000


The server then creates a temporary resource for this query and returns its location:

  201 Created
  Location: /city-filters/9638


The client can then GET /city-filters/9638 to get the result of the query.
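
The server side of this exchange could be sketched in Python roughly as below. The two handler functions and the in-memory dictionary are hypothetical, and the sketch only stores the query text where a real server would evaluate it:

```python
import itertools

_filters = {}                  # filter id -> stored query text
_ids = itertools.count(9638)   # starting value chosen to mirror the example


def post_city_filter(body):
    """Handle POST /city-filters: store the query and report its location."""
    filter_id = next(_ids)
    _filters[filter_id] = body
    return 201, {"Location": f"/city-filters/{filter_id}"}


def get_city_filter(path):
    """Handle GET /city-filters/{id}: return (status, stored query)."""
    prefix = "/city-filters/"
    if not path.startswith(prefix):
        return 404, None
    try:
        query = _filters[int(path[len(prefix):])]
    except (ValueError, KeyError):
        return 404, None
    # A real server would run the query here (or return its cached result).
    return 200, query
```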

A nice side effect of this is that the created filter resource can be cached to avoid re-calculating the potentially very slow query on the server.

Natural keys, surrogate keys, URL aliases and resource duplication

A common question relates to the use of natural keys versus surrogate keys in URL construction. It is more or less the same discussion as we see with databases (see for instance http://www.agiledata.org/essays/keys.html). Examples of natural keys could be order numbers, e-mails, postal codes, social security numbers and phone numbers.

When choosing between natural keys and surrogate keys you should consider the lifespan of the key; URLs are supposed to be stable over a very long period of time, so do not choose keys that vary over time. For instance, do not use phone numbers and e-mails to identify people since people tend to change these during their life.

You should also beware of natural keys which can be used by more than one entity. It is for instance (still) common for some members of a family to share a common e-mail, so e-mails are not good candidates for identifying persons. Even social security numbers may sometimes change. In Denmark for instance a person may get a new social security number if they change gender.

A valid natural key could be a sales order number since these are supposed to be both unique and stable.

But if we introduce natural keys, should we then only use natural keys? What if an entity has both a natural key and an internal surrogate key? You can use both but you should decide on one being the canonical ID and avoid duplicating resources by using HTTP redirects for the secondary keys.

Take for instance a sales order with the order number SK324-1 and internal surrogate key 887766 - if we consider the order number as the canonical ID then we can use these URL structures:

  /orders/SK324-1  =>  returns order representation
  /orders/id/887766 => redirects to /orders/SK324-1

Redirects should be done using the HTTP status code 303 See Other with a Location header containing the canonical URL.
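
A small Python sketch of such an alias could look like this, assuming two hypothetical lookup tables and a handler that returns status, headers and body:

```python
# Hypothetical data: orders keyed by canonical order number, plus an index
# from internal surrogate key to order number.
ORDERS = {"SK324-1": {"OrderNumber": "SK324-1"}}
ORDER_NUMBER_BY_SURROGATE = {887766: "SK324-1"}


def get_order(path):
    """Return (status, headers, body) for an order URL or its alias."""
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[:2] == ["orders", "id"] and parts[2].isdecimal():
        number = ORDER_NUMBER_BY_SURROGATE.get(int(parts[2]))
        if number is not None:
            # Secondary key: point the client at the canonical URL.
            return 303, {"Location": f"/orders/{number}"}, None
    elif len(parts) == 2 and parts[0] == "orders" and parts[1] in ORDERS:
        return 200, {}, ORDERS[parts[1]]
    return 404, {}, None
```

Only the canonical URL ever returns a representation, so caches and bookmarks converge on one URL per order.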

As stated earlier on: do not confuse searching with identity. You may want to search for a person with a specific e-mail, but the result should include the canonical URL of the found person.

See also http://www.w3.org/TR/webarch/#uri-aliases for a discussion of URL aliases and duplication.

Relations and back-references

What if we want back-references and other relations to other resources, does that influence the URL structure? For instance, now that we have links to states in a country, we might also want links to the country in which a state belongs. That might lead to something like this:

  /states/{state-id}/country

But, wait a minute, we already have links to countries, right? The canonical version, though, is /countries/{country-id}, so how do we get from /states/{state-id}/country to /countries/{country-id}? The obvious answer is to consider the /states/{state-id}/country URL as an alias for some country and use HTTP redirects to get to the canonical country URL.

But let's step back and take a broader look at relations in general; a back-reference is just one kind of relation from one resource to another - but we could have many other kinds of relations, like "neighbor states", "the country of a city", "statistical information about a state" and so on. The general solution to this concept is to include links in the payloads instead of creating a myriad of small alias resources that only redirect to canonical URLs.

So, instead of using /states/{state-id}/country for the country of a certain state, we include the canonical country link in the representation of the state:

  GET /states/4321

  returns =>

  {
    Name: "State X",
    CountryLink: "http://.../countries/1234",
    NeighborStatesLink: "http://.../states/4321/neighbors"
  }
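
On the server, building such a representation could be sketched like this in Python. The function name, the base_url parameter and the field names of the internal state record are assumptions for the example:

```python
def state_representation(state, base_url):
    """Build the state payload with canonical links instead of raw IDs."""
    return {
        "Name": state["Name"],
        "CountryLink": f"{base_url}/countries/{state['CountryID']}",
        "NeighborStatesLink": f"{base_url}/states/{state['ID']}/neighbors",
    }


# Hypothetical internal record for the state in the example above.
state_x = {"ID": 4321, "Name": "State X", "CountryID": 1234}
```

The URL templates now live in exactly one place, on the server, which is free to change them without breaking clients.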


Static data, volatile data and caching

Sometimes we end up with some sort of "hotspot" resource with tons of requests and very volatile content, making it impossible to cache the result and improve performance in that way. A solution to this may be to split the resource into two (or more) different sub-resources: a cacheable resource and a volatile non-cacheable resource.

Take for instance our state resource at /states/{state-id} - it may contain some very static data like the name of the state, its area and such like plus some volatile data like for instance the number of Tweets tweeted from that state in the last ten minutes. The static information could easily be cached, but we have no way to do it since the complete resource also contains the number of Tweets.

The solution is straightforward: split the resource into two different resources:

  /states/{state-id} => static state information
  /states/{state-id}/tweet-stats => volatile tweet information

I'll admit that the above example is rather contrived, so let's try a more realistic example: a streaming music distribution network publishes information about its songs through an online web API. Each song has its own resource representation with details about the song. The title, lyrics, artist and such like won't change much (if ever), but the company also publishes the number of current listeners which changes all the time. To improve caching characteristics the song data is split into (at least) two different resources:

  /songs/{song-id} => static song details (cacheable)
  /songs/{song-id}/usage => volatile usage information (non cacheable)
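
One way the split could show up in the HTTP responses is through different Cache-Control headers per resource. The Python sketch below assumes numeric song IDs and a one-day lifetime for the static part; both are illustrative choices, not something the API above prescribes:

```python
import re


def cache_headers(path):
    """Pick caching headers based on which of the two song resources is hit."""
    if re.fullmatch(r"/songs/\d+/usage", path):
        # Volatile listener counts: tell caches not to keep them.
        return {"Cache-Control": "no-store"}
    if re.fullmatch(r"/songs/\d+", path):
        # Static song details: cacheable for a day (assumed lifetime).
        return {"Cache-Control": "public, max-age=86400"}
    return {}
```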

But who says the song usage is published by the same API? Some time after the initial release of the web API the company off-loads some of the streaming to another content delivery network which will also deliver the usage statistics. Now suddenly not only the URL structure changes but even the host name changes:

  /songs/{song-id} => static song details (cacheable)
  http://cdn.com/acme/file-usage/{song-id} => volatile usage information (non cacheable)

This is a breaking change and all clients must now be upgraded. Had the API instead contained hyper links then the change would have been transparent to all clients.

Classic song representation:

{
  Id: 1234,
  Name: "My song"
}


Hyper media improved representation:

{
  Id: 1234,
  Name: "My song",
  UsageLink: "http://cdn.com/acme/file-usage/1234"
}


Once again we see how unimportant the actual URL structure is when we start using hyper media elements in the responses.

Formats and content types

If the same resource can be found in different formats (encoded with different media types) then we can ask ourselves, should URLs end in .json, .xml or similar extensions? On one side it makes it easy to explore the different representations using a standard web browser - on the other side it introduces different URL aliases for the same resource.

My recommendation is to implement the extensions as a convenience for the client developers, but avoid using them when interacting with the API "for real". If for instance our geographical API can return both JSON as well as XML and HTML then I would use these URLs for states:

  /states/{state-id} => canonical URL used in all returned hyper media elements
  /states/{state-id}.json => JSON representation of state
  /states/{state-id}.xml => XML representation of state
  /states/{state-id}.html => HTML representation of state

The canonical URL would also support standard HTTP content negotiation for JSON, XML and HTML representations of the exact same resource. The framework I use, OpenRasta, supports this dual type of "content negotiation" right out of the box with no implementation overhead.
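
To show the idea independently of any framework, here is a deliberately simplified Python sketch of this dual selection. It is not how OpenRasta does it; the function name and the default of JSON are assumptions, and the Accept handling ignores q-values and wildcards, which a real implementation should honor:

```python
EXTENSIONS = {
    ".json": "application/json",
    ".xml": "application/xml",
    ".html": "text/html",
}
DEFAULT_MEDIA_TYPE = "application/json"  # assumed default for this sketch


def pick_media_type(path, accept_header=""):
    """Return (canonical path, media type) for a request."""
    # A known extension wins and is stripped to recover the canonical path.
    for extension, media_type in EXTENSIONS.items():
        if path.endswith(extension):
            return path[: -len(extension)], media_type
    # Otherwise take the first supported type listed in the Accept header.
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip()
        if media_type in EXTENSIONS.values():
            return path, media_type
    return path, DEFAULT_MEDIA_TYPE
```

Both /states/4321.xml and /states/4321 with "Accept: application/xml" end up serving the same resource in the same format.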

If our resources have different variations then we can add them as "sub resources" of the primary resource (not that such a thing really exists since URLs are opaque strings). Where I work we have resources for documents in a case management system. These resources contain meta data about the document (title, owner and so on) - and then we have various other (sub) resources for the documents themselves - the raw binary document (image, power point, pdf etc.), a PDF replica of the document and a PDF replica with an added front page containing the document meta data. Thus we get these URLs:

  /documents/{doc-id} => canonical document meta data URL
  /documents/{doc-id}/pdf => PDF replica
  /documents/{doc-id}/meta-pdf => PDF replica with meta data frontpage

Use NOUNS not VERBS

I think most people get this right nowadays: URLs should be NOUNS not VERBS. Avoid URLs like /getOrders and /updateCountry - use all of the HTTP verbs instead when interacting with the resources and use something like /orders/{order-id} and /countries/{country-id} for the URLs. If you run out of HTTP verbs then invent new resources.

This way you avoid the trap of doing something horrible like the following, which I would expect to delete order number 1234 as a side effect of a simple GET:

  GET /orders/1234/delete

You also get the ability to identify all your resources and add caching, which is not possible with this sort of old school SOAP'ish look-up mechanism:

  POST /orders
  Body => { OrderId: 1234, Operation: "read" }


And you get a nice explorable and consistent API that your developers will love to use :-)
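
As a rough illustration of the noun-based style, the Python sketch below dispatches on the HTTP verb for a single /orders/{order-id} resource. The handle function and the in-memory ORDERS table are hypothetical:

```python
# Hypothetical in-memory order store.
ORDERS = {1234: {"OrderId": 1234, "Status": "open"}}


def handle(method, path, body=None):
    """Dispatch on the HTTP verb; the URL only names the order."""
    parts = path.strip("/").split("/")
    if len(parts) != 2 or parts[0] != "orders" or not parts[1].isdecimal():
        return 404, None
    order_id = int(parts[1])
    if method == "GET":
        # Reading never changes anything on the server.
        return (200, ORDERS[order_id]) if order_id in ORDERS else (404, None)
    if method == "PUT":
        ORDERS[order_id] = body
        return 200, body
    if method == "DELETE":
        if order_id in ORDERS:
            del ORDERS[order_id]
            return 204, None
        return 404, None
    return 405, None
```

With this shape a GET on /orders/1234/delete is simply a 404, and deleting takes an explicit DELETE on /orders/1234.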

Versioning

Where should API version numbers go in the URL? Should it be /api/v1/countries, /countries-1 or maybe in the host name http://v1.api.geo.com/countries?

Well, API and URL versioning is a whole story in itself so I suggest you take a look at Mark Nottingham's excellent "API versioning smackdown" (http://www.mnot.net/blog/2011/10/25/web_api_versioning_smackdown) for a good discussion on this subject.


Have fun, hack some code and create beautiful APIs out there :-)

/Jørn