
Monday, November 14, 2011

JSON validation

I spend most of my time working with back-end services where JSON over HTTP is king. JSON in, JSON out. While processing requests, one of the steps is to validate them. The tool I use for validating JSON input is validictory. I am surprised by how little known it is, given how useful it is.


Validictory

It's a simple Python library to validate JSON data. One function, two arguments:

import validictory
validictory.validate(data, schema)

The first argument is the JSON input loaded into Python.

import json
data = json.loads(request.body)

The second argument is a JSON schema, represented in Python.

schema = {"type": "string"}

"A JSON schema? What is this?", you may ask.


JSON schema: sneak-peek

JSON schema is to JSON what DTD is to XML. It is a JSON-based document that describes how JSON data should be structured. Yes, there's an ongoing Internet Draft maintained by json-schema.org. And no, it's not as complicated as it sounds (and way easier than DTD).

I am not going to go through the whole JSON schema definition, but I am going to show you the parts I use the most.

Let's assume we want to update a user's information. In plain JSON, it could look like this:

{"name": "Alex Conrad",
 "address": "University Avenue, Palo Alto, CA",
 "zip": "94301",
 "country": "US",
 "date-of-birth": "1980-10-18",
 "gender": "male",
 "email": "alex@example.com",
 "phones": {"home": "+1 650-555-1234",
            "cell": "+1 650-555-4321"}}

In JSON terms, this is an object (a dictionary in Python) and can be described by the following JSON schema:

{"type": "object"}

This JSON schema is enough to make sure the JSON input is an object -- not a string, nor an integer, Boolean, whatnot... Thus, an empty object {} would validate just fine.
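To make the "type" check concrete, here is a minimal sketch of how JSON type names line up with the Python types that json.loads() produces. It is a hand-rolled illustration, not validictory's implementation; JSON_TYPES and matches_type are names made up for this example.

```python
import json

# JSON schema type names mapped to the Python types json.loads() returns.
JSON_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "null": type(None),
}


def matches_type(value, type_name):
    """Return True if a decoded JSON value has the named JSON type."""
    # bool is a subclass of int in Python, so reject it for numeric types.
    if isinstance(value, bool) and type_name in ("integer", "number"):
        return False
    return isinstance(value, JSON_TYPES[type_name])


print(matches_type(json.loads("{}"), "object"))       # an empty object passes
print(matches_type(json.loads('"hello"'), "object"))  # a string does not
```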

Let's add strictness. We want to make sure the object has the desired properties.

{"type": "object",
 "properties": {"name": {"type": "string"},
                "address": {"type": "string"},
                "zip": {"type": "string"},
                "country": {"type": "string"},
                "date-of-birth": {"type": "string"},
                "gender": {"type": "string"},
                "email": {"type": "string"},
                "phones": {"type": "object"}}}

We added the "properties" key, under which we describe the expected keys and their values. Now {} will not validate, because validictory treats every listed property as required unless told otherwise. This is pretty much what your JSON schema will look like most of the time.

But let's be even more picky about the values... Let's make sure that:
  • "zip" is a 5-to-10-character string (to support ZIP+4 format)
  • "country" is a 2-character string
  • "date-of-birth" follows the YYYY-MM-DD format
  • "gender" is optional and may not be present
  • "email" is a well-formed email address
The "properties" value would look like this:
{"name": {"type": "string"},
 "address": {"type": "string"},
 "zip": {"type": "string", "minLength": 5, "maxLength": 10},
 "country": {"type": "string", "minLength": 2, "maxLength": 2},
 "date-of-birth": {"type": "string", "format": "date"},
 "gender": {"type": "string", "required": false},
 "email": {"type": "string", "format": "email"},
 "phones": {"type": "object"}}

That's it! minLength and maxLength define the minimum and maximum length when the value is a string. The format key is set to "date" and "email" respectively. Format also supports many common patterns, such as "date-time", "utc-millisec", "ip-address", "phone", etc. It even supports "regex" to match your custom regular expression! The required key controls whether or not a property is optional (properties are required by default).
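To get a feel for what these constraints mean, here is a rough standard-library sketch of a length check and a YYYY-MM-DD check. It only approximates the rules described above and is not how validictory implements them; check_length and is_iso_date are hypothetical helper names.

```python
from datetime import datetime


def check_length(value, min_length, max_length):
    """Mirror minLength/maxLength: both bounds are inclusive."""
    return min_length <= len(value) <= max_length


def is_iso_date(value):
    """Return True if value parses as a YYYY-MM-DD calendar date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


print(check_length("94301", 5, 10))   # plain ZIP code
print(check_length("US", 2, 2))       # 2-character country code
print(is_iso_date("1980-10-18"))      # well-formed date
print(is_iso_date("18/10/1980"))      # wrong layout
```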

For "phones", we can just set properties for that nested object:

"phones": {"type": "object",
           "properties": {"home": {"type": "string", "format": "phone"},
                          "cell": {"type": "string", "format": "phone"}}}

Let's validate "gender" more thoroughly. It must be either "male" or "female". For this, we can define "enum", which is an array of possible values.

"gender": {"type": "string", "required": false, "enum": ["male", "female"]}

You can also validate JSON arrays with minItems, maxItems, uniqueItems, etc.
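As an illustration, an array schema could look like the sketch below. The "tags" use case is made up for this example, and check_array is a hand-rolled helper that spells out what the keywords mean; it is not part of validictory.

```python
# Hypothetical "tags" property: one to five distinct strings.
tags_schema = {
    "type": "array",
    "items": {"type": "string"},
    "minItems": 1,
    "maxItems": 5,
    "uniqueItems": True,
}


def check_array(value, schema):
    """Spell out the meaning of the array keywords used in tags_schema."""
    if not isinstance(value, list):
        return False
    if not schema["minItems"] <= len(value) <= schema["maxItems"]:
        return False
    # Every item must be a string (the "items" sub-schema).
    if not all(isinstance(item, str) for item in value):
        return False
    # uniqueItems: no value may appear twice.
    if schema["uniqueItems"] and len(set(value)) != len(value):
        return False
    return True


print(check_array(["python", "json"], tags_schema))
print(check_array(["python", "python"], tags_schema))
```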


Breaking down JSON schemas

The JSON schema can get a tad large depending on the number of input parameters and how strict you want to be about it.

As you may have to validate the same piece of JSON for different inputs, it can be useful to break down the schema into sub-schemas and Lego them back together.

user_id = {"type": "integer", "minimum": 1}
gender = {"type": "string", "enum": ["male", "female"]}

create_user_schema = {"type": "object",
                      "properties": {"name": {"type": "string"},
                                     "gender": gender}}

from copy import deepcopy
update_user_schema = deepcopy(create_user_schema)
update_user_schema["properties"]["user_id"] = user_id

add_pet_schema = {"type": "object",
                  "properties": {"owner_id": user_id,
                                 "gender": gender,
                                 "species": {"type": "string",
                                             "enum": ["dog", "cat", "bird", "worm"]}}}

If you add an extra gender, say "alien", you just need to update the "gender" variable accordingly and it will be propagated to all schemas that use it.
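One caveat is worth a small sketch. Editing the "gender" definition in the source file reaches every schema, since they are all rebuilt on import. Mutating it at run time is different: schemas that hold a reference to the shared dict see the change, while a schema produced by deepcopy() keeps its own snapshot. The trimmed-down schemas below are for illustration only.

```python
from copy import deepcopy

gender = {"type": "string", "enum": ["male", "female"]}

create_user_schema = {"type": "object",
                      "properties": {"name": {"type": "string"},
                                     "gender": gender}}
add_pet_schema = {"type": "object",
                  "properties": {"gender": gender}}

# deepcopy() duplicates the nested "gender" dict as well.
update_user_schema = deepcopy(create_user_schema)

# Run-time mutation of the shared sub-schema.
gender["enum"].append("alien")

# These two hold a reference to the same dict, so they see the new value.
print(create_user_schema["properties"]["gender"]["enum"])
print(add_pet_schema["properties"]["gender"]["enum"])
# The deep copy was taken before the change and is unaffected.
print(update_user_schema["properties"]["gender"]["enum"])
```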


Dynamic typing

What if your application could accept different structures for a single property? Let's assume that these 3 structures are valid JSON inputs to your application.

{"phone": "+1 650-555-1234"}

{"phone": ["+1 650-555-1234", "+1 650-555-4321"]}

{"phone": {"home": "+1 650-555-1234",
           "cell": "+1 650-555-1234"}}

How would we define our JSON schema? Easy. First, you need to define a schema for each possible "phone" value:

phone_str = {"type": "string", "format": "phone"}
phone_array = {"type": "array", "items": phone_str}
phone_dict = {"type": "object", "properties": {"home": phone_str,
                                               "cell": phone_str}}

The key "items" for the array takes a schema, or an array of possible schemas.

Then you define the type of "phone":

{"type": "object",
 "properties": {"phone": {"type": [phone_str, phone_array, phone_dict]}}}

As you can see "type" is not necessarily a data type name but can also be an array of schemas. The input value is valid if it matches any of these schemas. Very flexible!
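The "valid if it matches any of these" rule can be pictured with plain predicates. This is only a sketch of the idea, with the phone format check reduced to "is a string"; the function names are invented for this example and none of this is validictory code.

```python
def is_phone_str(value):
    # Stand-in for the "phone" format check: only the type is tested here.
    return isinstance(value, str)


def is_phone_array(value):
    return isinstance(value, list) and all(is_phone_str(v) for v in value)


def is_phone_dict(value):
    return (isinstance(value, dict)
            and all(is_phone_str(v) for v in value.values()))


def matches_any(value, checks):
    """A value is accepted as soon as one candidate check accepts it."""
    return any(check(value) for check in checks)


phone_checks = [is_phone_str, is_phone_array, is_phone_dict]

print(matches_any("+1 650-555-1234", phone_checks))
print(matches_any(["+1 650-555-1234", "+1 650-555-4321"], phone_checks))
print(matches_any({"home": "+1 650-555-1234"}, phone_checks))
print(matches_any(6505551234, phone_checks))  # an integer matches none
```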


Leverage Python data structures

JSON doesn't offer a large choice of data structures in contrast with Python. But because validictory deals with plain Python objects, we can leverage Python data structures for better validation. The one I often use is frozenset(), which I combine with the "enum" parameter, or anything that does a lookup in an array.

with open("country_codes.txt") as f:
    countries = frozenset(line.strip() for line in f)
country = {"type": "string", "enum": countries}
validictory.validate("FR", country)

The run time of validating "country" is constant O(1) rather than linear O(N). Neat! Of course, you don't want to build that frozenset for every request -- the set has to be constructed once, on application initialization.
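If you want to see the difference for yourself, the sketch below times a membership test against a list and against a frozenset, on the assumption that the "enum" check boils down to Python's `in` operator. The 10,000 made-up codes stand in for a real country file, and the timings will vary from machine to machine.

```python
import timeit

# Stand-in data: made-up codes instead of a real country list.
codes_list = ["C%05d" % i for i in range(10000)]
codes_set = frozenset(codes_list)

# Worst case for the list: the value we look for is the last element.
needle = codes_list[-1]

list_time = timeit.timeit(lambda: needle in codes_list, number=200)
set_time = timeit.timeit(lambda: needle in codes_set, number=200)

print("list:      %.6f s" % list_time)
print("frozenset: %.6f s" % set_time)
```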


Testing

I also use validictory in my tests. As I mentioned, many of the apps I write return JSON, so validating the generated JSON structure is nice.

response = app.post("/users", params=json.dumps(data))
output = json.loads(response.body)
validictory.validate(output, expected_schema)
# ... test the expected values

It definitely adds an extra level of confidence that backward compatibility is maintained during code refactoring.

8 comments:

  1. Hi Alex - you should take a look at DictShield too (https://github.com/j2labs/dictshield). It is built more from the ground up for modelling with JSON over HTTP in mind, and it makes a nice separation between model and persistence.

  2. Ben, thanks for the link! I didn't know about DictShield.

  3. Thanks a lot, this is tremendously useful!

  4. Thanks :) But email validation doesn't work:

    >>> import validictory
    >>> validictory.validate("foo", {"type":"string", "format":"email"})
    >>> validictory.__version__
    '0.9.1'

    Replies
    1. Maxim, thanks for pointing this out! I just submitted a pull request:

      https://github.com/sunlightlabs/validictory/pull/51

  5. Thanks for this--clear and useful. I'm up and running with my test automation sooner rather than later because of this.

  6. Hi I am trying to use validictory.. I have an empty string 'lan_ip': '' .
    Error:
    Value '' for field 'lan_ip' cannot be blank'

    Schema:
    {"lan_ip": { "type":["string","null","any"], "id": "http://jsonschema.net/device/lan_ip", "required":False }

    Replies
    1. Hi, yeah, that's a validictory quirk. If the "lan_ip" key is optional, then you should set {"required": False} as you did. But if you want "lan_ip" to exist and be allowed to hold an empty string, then validictory has something called {"blank": True}. Take a look at their tests for further insight: https://github.com/sunlightlabs/validictory/blob/master/validictory/tests/test_values.py#L463
