And why I don't think there's a better alternative
I was handed a file recently and was told to go through it. It was a JSON file. "Okay, no problem", I thought to myself. It turns out it was a Trello board export. A former employee had left, and all I had to go on was the Trello board he left behind. That's it.
My first thought was as follows: why not import it back into Trello as a new board? Wouldn't that be easy? Well uh... I don't think it's possible by default without some kind of third party help, and I wasn't about to start messing with Trello's API. I needed to fetch some data kept on this board, not spend an afternoon figuring out how to put it back into Trello.
Okay, so let's examine what a Trello board is and how it kind of works.
Trello = {
  "actions": [Listof all Cards],
  "lists": [Listof all Lists],
  ...
}
How Trello works is that you create cards, which could be considered "goals" or "tasks", and you put them into Lists like "To-Do", "Done", and so on. This means that it's better off if action cards are separate from lists, because you can move them around quite a bit.
An action card stores some info about its position.
Action = {
  data: {
    card: {
      name: <String>,
      desc: <String>,
      ...
    },
    list: {
      id: <List ID>,
      name: <List Name>,
      ...
    }
  }
}
I've omitted a lot of details here because the ones I list are currently the only things I care about. There are many other properties like attachments, URLs and other pointers, but really I'm trying to grab all this Trello data that's been left behind.
The data.list object tells us where the action currently lives. Does it reside in the "To-Do" list, or the "Done" list? That distinction will help us search for data. The person who left my company surely organized certain data into different lists.
Left with this daunting task, I did my best to figure out how to navigate a nearly 1MB blob of JSON.
jq
jq is a JSON destructuring and exploration tool that flawlessly traverses JSON structures and gets you what you need. My blob came to me as a one-line file, so the first cool thing I can do is pretty-print it with jq.
$ cat json_blob.txt | jq '.'
{
  "id": "12345",
  "name": "Stevenator",
  ...
  "actions": [
    {
      "id": "456789",
      ...
That's already one thing done to make my brain not hurt. grep now works much better, since grep only operates line-by-line, and a single-line blob gives it nothing useful to match against. Now I can search this structure to find the former employee's email and see what important login data he left behind.
$ cat json_blob.txt | jq '.' | grep [email protected]
"desc": "hey [email protected], i left you a card"
"desc": "login: [email protected] password: 12345"
"desc": "so [email protected] did bla bla bla ..."
...
Oooohkay, well, I guess I wasn't expecting much on a first pass. The desc
field in Trello actions is pretty much a one-stop shop for writing down anything, no rhyme or reason is required. So if I'm left to find this guy's logins, maybe he put them all in one place. The best I could do would be to maybe search by a List ID.
Using jq
I can pretty much print all the List IDs available.
$ cat json_blob.txt | jq '.lists[] | {id: .id, name: .name}'
{
  "id": "12345",
  "name": "My To-Do's"
}
{
  "id": "34567",
  "name": "My Secret Log-ins"
}
...
Ah-ha! That's it! Secret Log-ins! Surely that's useful company info that we're going to need.
Sadly this is where my jq
knowledge ends and my Python hacking begins. Cramming it all into a jq
command string that I might lose somewhere in my shell history is not great, so I'm going to write it into an expansive Python script for better reporting.
I wasn't looking forward to going back into Python, but I didn't have much of a choice.
Let's start with the obvious using the json
package.
from json import load as load_json_file

def open_blob(f):
    with open(f, 'r') as fo:
        return load_json_file(fo)
    return None
>>> DATA = open_blob("json_blob.txt")
Now we have a loaded dictionary (why do I return None
? Because that's my style, I like default return paths). All the properties of the JSON object are now string keys in a dictionary. Easy stuff, right?
>>> all_lists = [(x["id"], x["name"]) for x in DATA["lists"]]
[("12345", "My To-Do's"),
 ("34567", "My Secret Log-ins"), ...]
That looks right. Now what if we filter by a certain List ID to get our cards now?
all_secret_cards = [
    (x["data"]["card"]["name"], x["data"]["card"]["desc"])
    for x in DATA["actions"] if x["data"]["list"]["id"] == "34567"
]
TypeError: 'NoneType' object is not subscriptable
Oh what?? Okay, so as it turns out, sometimes cards are entirely blank. Now we would have to add checks in each time we wanted to grab something. We can technically fix this by doing dict.get/2
, which accesses a dict
's mappings using a key, and provides us a default value in case it fails.
But uh... It won't look pretty.
>>> all_secret_cards = [
...     (x.get("data", {}).get("card", {}).get("name", ""),
...      x.get("data", {}).get("card", {}).get("desc", ""))
...     for x in DATA.get("actions", [])
...     if x.get("data", {}).get("list", {}).get("id", "") == "34567"
... ]
[("Accounting", "Login: [email protected] pass: steveiscool"), ...]
Okay, that actually works, but holy cow that is horrendous. Chaining dict.get/2
together is a nasty and very dirty way of doing this.
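For what it's worth, the chain can be flattened with a small helper. Here's my own sketch (deep_get is an invented name, not anything from the standard library or the original script) that walks a key path and bails out as soon as the path breaks:

```python
from functools import reduce

def deep_get(obj, *keys, default=""):
    """Walk a chain of keys, returning `default` as soon as
    anything along the path isn't a dict or a key is missing."""
    def step(acc, key):
        return acc.get(key, default) if isinstance(acc, dict) else default
    return reduce(step, keys, obj)

# The comprehension then reads much more cleanly:
# all_secret_cards = [
#     (deep_get(x, "data", "card", "name"),
#      deep_get(x, "data", "card", "desc"))
#     for x in DATA.get("actions", [])
#     if deep_get(x, "data", "list", "id") == "34567"
# ]
```

Each lookup is now one call, and the null-safety lives in exactly one place.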
Surely there are better ways to crawl through a JSON file like this. I know Racket has a json
module, so maybe I'll give that a shot?
When Racket loads a JSON object, it converts objects into a hasheq table whose keys are symbols, meaning an object that looks like {"id": 1, "name": "me"} is converted into #hasheq((id . 1) (name . "me")). Our function for looking up keys inside a hash is hash-ref/2.
Except... Racket, like Python, does not natively enforce or check types by default. It takes a naive approach and will burp errors whenever there's a type mismatch. So you can tell, from the fact that I'm writing this, that we're going to bump into problems parsing JSON.
#lang racket
(require json)
; our main opening function
(define file->jsexpr (compose string->jsexpr file->string))
; our data loaded and parsed
(define *data* (file->jsexpr "json_blob.txt"))
hash-ref/2 by default will raise an error on a missing key, so we want to mimic the behavior of dict.get/2 from Python by being able to provide a fail-safe value so we can continue an expression. This exists as the three-argument form hash-ref/3, where you provide said value and let the lookup fail gracefully, yielding the default when a key is not present.
Except, this function, along with hash-ref/2
, has an issue: it accepts only a hash
type, and it fails when the value is not a hash
. It is impossible to chain these functions together in a way that makes sense, because we're not always parsing hashes, we're also parsing lists.
Introducing href/3
, a friendly way of failing on non-hash types and still returning default values.
; href :: Any -> Key -> Value -> Value
(define (href h k default)
  (if (and (hash? h) (hash-has-key? h k))
      (hash-ref h k)
      default))
Now we can use this to seamlessly collect data.
; Crawl through the data using href expressions
; then put it all into a friendly data list
(define (action->data action-obj)
(define data (href action-obj 'data '()))
(define card (href data 'card '()))
(define name (href card 'name ""))
(define desc (href card 'desc ""))
(list name desc))
Using href/3
I'm able to provide default values for when something can fail. This way feels a little bit more organized than repeated dict.get/2
calls at least. Plus, by escaping the Python list comprehension, I'm free to write some cleaner code.
Now comes the issue of how we filter our objects. action->data
converts our Trello action into some data, but how do I add the ability to test if a property is of a value? For that I'm going to write a separate function to keep the logic contained.
First let's recall that we have List IDs to indicate different Trello Lists. Using those IDs I'll convert them into filter functions.
; Generate a function to check an action's List ID
(define (id-checker filter-key)
  (λ (action-obj)
    (define data (href action-obj 'data '()))
    (define lst (href data 'list '()))
    (define lstid (href lst 'id ""))
    (and (string? lstid)
         (string=? filter-key lstid))))

(define is-secret-login? (id-checker "34567"))
Now it's all as simple as:
; display and print all secret login action objects
(for-each (compose displayln action->data)
          (filter is-secret-login? (href *data* 'actions '())))
("Accounting" "Login: [email protected] pass: steveiscool")
...
I copied all of these logins over to my boss and was on my way. I was curious to see if there was a way of currying hash keys into a larger function that would unpack for me gracefully, but I didn't even really want to try risking that.
Side-note: is this really how people store log-ins at companies? What's the sanest way of preventing this from ever happening in an organization? Mono-logins? Shared Word documents?
I think JSON is pretty neat, and when it comes to describing data, it can do so in a way that reduces duplication. As we saw in the Trello example, the data was kept separate from its storage logic, and we were able to infer the link between actions/cards and their respective lists.
JSON should probably be thought of as a multi-functional tool of sorts: it can be as elegant as a binary tree, or as verbose as a CSV data file. The styling of the data is up to the programmer, who has to find a middle ground between beautiful, duplicate-free, and easy to use.
As someone who uses Lisp/Scheme a whole bunch, I'm familiar with the idea behind S-expressions, and JSON is able to follow that idea pretty closely.
But also:
Optional data definitions suck, and they make parsing a lot harder. The whole point of JSON is that it's naturally read by JavaScript and easy to convert into a data format that works between browsers. However, that doesn't mean it's a fun thing to use outside of JavaScript.
When it comes to data in a format of SQL or CSV, I don't have to think very hard about the data. It's laid out in a linear fashion, and querying it is very easy. You write some rules and logic, and it's over. Plain and simple. When I want to parse JSON, it's like scraping a fork against an empty dinner plate (literally, the plate can be null
).
Null-checking JSON data happens far too often. You're parsing a JSON object and then boom, suddenly a nested container is empty. Why? That seems a bit silly. Then you have to go back and add an empty?
check in your code somewhere to gracefully handle it.
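To make that concrete, here's a minimal sketch of the failure mode (the blob and field names are invented for illustration):

```python
import json

blob = json.loads('{"actions": [{"data": null}]}')

# This looks safe, but "data" came back as None, not a dict:
try:
    name = blob["actions"][0]["data"]["card"]["name"]
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable

# So every access grows a guard:
data = blob["actions"][0]["data"]
name = data["card"]["name"] if isinstance(data, dict) else ""
```

One null container, and suddenly a perfectly reasonable access path needs a type check bolted on.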
For every time you have something empty, it results in you going back and adding some kind of optional data handling. If you're using Rust, you have to run back to your struct
definition and add an Option<T>
handler of some sort when doing full JSON parsing.
How are you supposed to avoid this? When you receive a blob like this, with no documentation, it's frustrating to have to dig through it just to figure out what application logic to add. If you received a CSV file instead, it would take you all of three minutes to look at the top row and create a struct record for all the fields.
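That claim is easy to demonstrate. A minimal sketch (the file contents and column names are invented for the sake of example) of how self-describing a CSV header row is:

```python
import csv
import io

# Pretend this is the exported file
csv_blob = io.StringIO("name,login,password\nAccounting,steve,12345\n")

reader = csv.DictReader(csv_blob)
print(reader.fieldnames)  # every field, known up front, no null surprises
for row in reader:
    print(row["name"], row["login"])
```

One glance at fieldnames and you know the complete shape of every row that follows.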
I'm probably ranting too hard, but I don't think JSON is a great data format for consistency. The format is too free-form, and it's never as good as plain old data files. I like CSVs since they're easy to import and export. JSON feels too inconsistent, and most languages require extra tooling to even have JSON support.
Lessons learned: Trello, please add a board import option from JSON, so I don't have to deal with this again.