Pulling IDs out of objects and back again



While transforming JSON data with jq, I quite often encounter two situations:

Pull-Out

This is an example of the “pull-out” situation:

[
  {
    "id": "id-1",
    "data": "123"
  },
  {
    "id": "id-2",
    "data": "456"
  }
]

The id within each nested object is unique, and I want to access the data by that key. The desired output is this:

{
  "id-1": {
    "id": "id-1",
    "data": "123"
  },
  "id-2": {
    "id": "id-2",
    "data": "456"
  }
}

Sometimes I even want to delete the redundant id key in the nested objects.

Using only jq 1.5, there is no suitable builtin function. But a simple solution is this filter:

map({ (.id): . }) | add

It works like this: The first part produces an array of small one-key objects like these:

[
  {
    "id-1": {
      "id": "id-1",
      "data": "123"
    }
  },
  {
    "id-2": {
      "id": "id-2",
      "data": "456"
    }
  }
]

The add in the second part merges them into one object.
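One caveat: when objects are added and both contain the same key, the right one wins. So if the ids are not unique after all, later entries silently overwrite earlier ones:

> jq -n '[{id: "x", data: 1}, {id: "x", data: 2}] | map({ (.id): . }) | add'
{
  "x": {
    "id": "x",
    "data": 2
  }
}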

While the expression is short enough, it might be hard for the casual reader to understand. Another little nitpick: Both parts walk over the same entries.

So the next solution achieves the same output with only one stage:

reduce .[] as $e ( {}; . + { ($e.id): $e } )

The principle is the same: Little objects are created and added to the state.

The nitpick here: The intermediate object is not necessary. The next solution simply sets the new attribute in the state:

reduce .[] as $e ( {}; .[$e.id] = $e )

The text of this filter is a little bit longer than map/add, but it works in only one stage, which directly translates into a traditional imperative for-each loop. Therefore it should be more approachable for the casual reader.
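Annotated with comments, the correspondence to such a loop is direct:

reduce .[] as $e  # for each element $e of the input array ...
( {};             # ... starting from the empty object as the state ...
  .[$e.id] = $e   # ... set state[$e.id] to $e
)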

Pull-Out vs. INDEX

jq 1.6 introduced the functions INDEX(stream; idx_expr) and INDEX(idx_expr) which can be used for the same purpose:

INDEX(.[]; .id)
INDEX(.id)
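For reference, INDEX is defined in jq's builtin library essentially like this; note the tostring applied to the key, which explains the behaviour shown below:

def INDEX(stream; idx_expr): reduce stream as $row ({}; .[$row|idx_expr|tostring] = $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);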

It covers more cases like complex keys and non-string keys, but I have two reservations about it. The first: a complex key is serialized into its JSON text representation, which is rarely what I want:

> jq 'INDEX({id, data})'
{
  "{\"id\":\"id-1\",\"data\":\"123\"}": {
    "id": "id-1",
    "data": "123"
  },
  "{\"id\":\"id-2\",\"data\":\"456\"}": {
    "id": "id-2",
    "data": "456"
  }
}

The second reason why I will probably not use INDEX/1 by default, even once my environments are upgraded to jq 1.6, is this: In most cases I don’t want the exact output shown so far but some variant of it. And most of the required post-processing can be folded easily into the reduce step.
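For example, to map each id directly to its data value, only the update expression has to change:

reduce .[] as $e ( {}; .[$e.id] = $e.data )

which produces:

{
  "id-1": "123",
  "id-2": "456"
}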

Pull-Out with del

One gripe with the output so far: it now contains redundant information. In most cases that might be OK for intermediate steps but not for something final. The clean output looks like this:

{
  "id-1": {
    "data": "123"
  },
  "id-2": {
    "data": "456"
  }
}

The simple solution is – of course – to do some post-processing like | map_values(del(.id)), or even to fold it into the previous map step. This is OKish if the first part is a predefined function like INDEX. But if the first part contains a reduce anyway then this is better and simpler:

reduce .[] as $e ( {}; .[$e.id] = ($e | del(.id)) )

Or as a convenient function:

def pullout_by(k):
    reduce .[] as $e ( {}; .[$e|k] = ($e | del(k)) );
pullout_by(.id)
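Since k is an arbitrary path expression, the function is not limited to .id. A contrived example, just to illustrate (assuming the definition above is loaded, e.g. from ~/.jq):

> jq 'pullout_by(.data)'
{
  "123": {
    "id": "id-1"
  },
  "456": {
    "id": "id-2"
  }
}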

Non-unique keys

Another variant of pull-out is required if the keys are not unique, as in this example:

[
  {
    "id": "id-1",
    "data": "123"
  },
  {
    "id": "id-2",
    "data": "456"
  },
  {
    "id": "id-2",
    "data": "789"
  }
]

Of course the output must be adapted by wrapping an array around the nested objects:

{
  "id-1": [
    {
      "data": "123"
    }
  ],
  "id-2": [
    {
      "data": "456"
    },
    {
      "data": "789"
    }
  ]
}

This is similar to group_by and a naive filter using predefined functions is indeed this:

group_by(.id) | map({ (.[0].id): map(del(.id)) }) | add

But as I’ve explained above, this multi-step transformation is harder to understand than the direct implementation with reduce. In fact, the required changes to pullout_by are quite small:

def pullout_groups_by(k):
    reduce .[] as $e ( {}; .[$e|k] += [$e | del(k)] );
pullout_groups_by(.id)
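The += update works here because a missing key yields null and null + [x] is [x], so the first element of each group creates the array:

> jq -n '{} | .a += [1] | .a += [2]'
{
  "a": [
    1,
    2
  ]
}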

A note about performance: group_by itself is not implemented in C but defined as a builtin jq function; it calls the function _group_by_impl/1, which is implemented in C. So I assume that group_by by itself is faster than reduce. But I guess that any speed advantage is lost by the additional processing steps required for this use case.

multiple levels

The last and most fun variant of pull-out covers multiple levels of pulling. Given this input:

[
  {
    "id1": "foo1",
    "id2": "bar1",
    "id3": "bla1",
    "data": 42
  },
  {
    "id1": "foo1",
    "id2": "bar1",
    "id3": "bla2",
    "data": 43
  }
]

the expected output is:

{
  "foo1": {
    "bar1": {
      "bla1": {
        "data": 42
      },
      "bla2": {
        "data": 43
      }
    }
  }
}

One solution uses the previously defined functions:

pullout_groups_by(.id1)
| map_values(
	pullout_groups_by(.id2)
	| map_values(pullout_by(.id3))
)

Not too bad to understand. The trick is to extract the outermost (“largest”) key first.

But again this can be optimized into one stage using the same principle used so far:

reduce .[] as $e (
	{};
	.[$e.id1][$e.id2][$e.id3] = ($e | del(.id1, .id2, .id3))
)

If data is not only a placeholder for arbitrary additional data but indeed the last key then it gets even simpler:

reduce .[] as {$id1, $id2, $id3, $data} (
	{};
	.[$id1][$id2][$id3] = $data
)
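Note that the output shape changes accordingly: the data values become the leaves directly:

{
  "foo1": {
    "bar1": {
      "bla1": 42,
      "bla2": 43
    }
  }
}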

summing up Pull-Out

To sum up pull-out: Use INDEX/1 if you can and it matches your use case exactly. Use reduce directly or the above functions (pullout_by/1, pullout_groups_by/1) when they match your use case better. But the multi-level case should be handled by using reduce directly.

Now to the reverse operation.

Push-Into

The reverse operation of pull-out is “push-into”: pushing a key into the child objects. This is also a very useful and common operation.

The simplest variant is the reverse operation of INDEX(.id) because the child objects still contain the key/value pair. So only the top-level object must be removed while gathering all values. Easy:

[ .[] ]

But what if the key/value is not inside the child objects?

unique data

If the input does not contain the key/value pair inside the child objects, i.e. it looks like this:

{
  "id-1": {
    "data": "123"
  },
  "id-2": {
    "data": "456"
  }
}

then the required output should be:

[
  {
    "id": "id-1",
    "data": "123"
  },
  {
    "id": "id-2",
    "data": "456"
  }
]

And as with the opposite pull-out, the push-into operation can be implemented in several ways. One naive way using predefined functions is this:

to_entries | map( .value.id = .key | .value )

Here the same nitpick as above applies: to_entries creates intermediate objects, and both to_entries and map iterate essentially over the same list. The following filter doesn’t do that:

[ keys_unsorted[] as $k | .[$k] | .id = $k ]

At first glance it seems that there is no loop at all. The loop is hidden in the variable binding: keys_unsorted[] returns multiple results, so this expression is a stream. Each of these results is bound to the variable in turn, and the following filter is executed with that binding. The filter produces one result for each call, so this is also a stream. All results are gathered by the surrounding array constructor [...].
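This becomes visible when the array constructor is left out; the inner expression alone already produces a stream of results (assuming the input object above on stdin):

> jq -c 'keys_unsorted[] as $k | .[$k] | .id = $k'
{"data":"123","id":"id-1"}
{"data":"456","id":"id-2"}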

So the reverse operation of pullout_by(k) can be conveniently defined as:

def pushinto_by(k):
    [ keys_unsorted[] as $k | .[$k] | k = $k ];
pushinto_by(.id)

non-unique data

If the keys are not unique (as in the output of pullout_groups_by) then pushinto_by will not work, because the assignment would be applied to the intermediate arrays in the input:

{
  "id-1": [
    {
      "data": "123"
    }
  ],
  "id-2": [
    {
      "data": "456"
    },
    {
      "data": "789"
    }
  ]
}

The expected output is this:

[
  {
    "id": "id-1",
    "data": "123"
  },
  {
    "id": "id-2",
    "data": "456"
  },
  {
    "id": "id-2",
    "data": "789"
  }
]

A simple change (an additional []) does the job:

def pushinto_groups_by(k):
    [ keys_unsorted[] as $k | .[$k][] | k = $k ];
pushinto_groups_by(.id)

only add keys

Another variant of push-into pushes the key into the objects but also keeps the outer structure. The expected output for the “unique id” case is:

{
  "id-1": {
    "id": "id-1",
    "data": "123"
  },
  "id-2": {
    "id": "id-2",
    "data": "456"
  }
}

The outer object can still be used as a fast hash map into the data, while at the same time the data objects can be retrieved and handed to other filters that also require the keys.

A naive implementation is:

with_entries(.value.id = .key)
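This is fine, but with_entries is itself defined in terms of to_entries and from_entries, so the intermediate entry objects from the earlier nitpick come back:

def with_entries(f): to_entries | map(f) | from_entries;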

And – as in most cases so far – a more efficient and still comprehensible expression can be found. In this case:

reduce keys_unsorted[] as $k ( .; .[$k].id = $k )
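Note that the accumulator starts with the input . itself instead of {}, so the keys are added to the existing structure in place.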

multiple levels

Similar to pull-out the last variant of push-into handles multiple levels. Given the input:

{
  "foo1": {
    "bar1": {
      "bla1": {
        "data": 42
      },
      "bla2": {
        "data": 43
      }
    }
  }
}

the output should be:

[
  {
    "id1": "foo1",
    "id2": "bar1",
    "id3": "bla1",
    "data": 42
  },
  {
    "id1": "foo1",
    "id2": "bar1",
    "id3": "bla2",
    "data": 43
  }
]

Again the previously defined functions can be used like this:

map_values(
	map_values(
		pushinto_by(.id3)
	)
	| pushinto_groups_by(.id2)
)
| pushinto_groups_by(.id1)

Here the trick is to start with the innermost level and work outwards.

And again: There is a solution using the lessons learned so far:

[
	keys_unsorted[] as $id1
	| .[$id1]
	| keys_unsorted[] as $id2
	| .[$id2]
	| keys_unsorted[] as $id3
	| .[$id3]
	| . + { $id1, $id2, $id3 }
]
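By the way, applied to the output of the multi-level pull-out above, this reconstructs the original array; only the key order inside the objects differs, because the ids are re-added after data.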

Basically this is pushinto_by applied three times. The only difference is that, instead of doing three separate assignments, an intermediate object with the three keys/values is added. I think the intermediate object reads better in this case.

summing up push-into

If some piece of data uses the {"key": { "data": ... }} format then it usually does so in more than one place. Flattening these object hierarchies using predefined functions or pushinto_by is possible but tricky. Therefore the direct implementation is very useful here – although it’s not as terse and concise as I’d like.[1]

Summary

The pull-out and push-into patterns are very useful to know – both for custom and for predefined data formats. But as it is with patterns: some use cases can be implemented as functions, but not all, due to the countless variants. So the direct implementations and the principles behind them are very useful to understand, because they can be adapted more easily without losing the casual reader of the code.


[1] Perhaps I will experiment with a recursive function for that.