Pulling IDs out of objects and back in again
While transforming JSON data with jq I quite often encounter two situations:
- Pulling a key out of nested objects (a.k.a. hashing/indexing).
- Pushing a key back into nested objects (the reverse operation).
Pull-Out
This is an example of the “pull-out” situation:
[
{
"id": "id-1",
"data": "123"
},
{
"id": "id-2",
"data": "456"
}
]
The `id` within each nested object is unique and I want to access the data with that key. The desired output is this:
{
"id-1": {
"id": "id-1",
"data": "123"
},
"id-2": {
"id": "id-2",
"data": "456"
}
}
Sometimes I even want to delete the redundant `id` key from the nested objects.
Using only jq 1.5, there is no suitable builtin function. But a simple solution is this filter:
map({ (.id): . }) | add
It works like this: The first part produces a lot of small objects like these:
[
{
"id-1": {
"id": "id-1",
"data": "123"
}
},
{
"id-2": {
"id": "id-2",
"data": "456"
}
}
]
The `add` in the second part merges them into one object.
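Incidentally, `add` itself is defined in terms of `reduce` in jq's builtin library – roughly like this (paraphrased from builtin.jq; the exact spelling may differ):

def add: reduce .[] as $x (null; . + $x);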
While the expression is short enough, it might be hard to understand for the casual reader. Another little nitpick: both parts walk over the same entries.
So the next solution achieves the same output with only one stage:
reduce .[] as $e ( {}; . + { ($e.id): $e } )
The principle is the same: Little objects are created and added to the state.
The nitpick here: The intermediate object is not necessary. The next solution simply sets the new attribute directly in the state:
reduce .[] as $e ( {}; .[$e.id] = $e )
The text of this filter is a little bit longer than `map`/`add`, but it works in only one stage, which directly translates into a traditional imperative for-each loop. Therefore it should be more approachable for the casual reader.
Pull-Out vs. INDEX
jq 1.6 introduced the functions `INDEX(stream; idx_expr)` and `INDEX(idx_expr)`, which can be used for the same purpose:
INDEX(.[]; .id)
INDEX(.id)
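For reference, the jq 1.6 builtin library defines both essentially as the `reduce` pattern used above (paraphrased from builtin.jq; the exact spelling may differ):

def INDEX(stream; idx_expr): reduce stream as $row ({}; .[$row|idx_expr|tostring] = $row);
def INDEX(idx_expr): INDEX(.[]; idx_expr);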
It covers more cases like complex keys and non-string keys, but I have some points about it:
- “For reasons” I’m currently stuck with jq 1.5 on quite a few machines right now.
- `INDEX/2` is documented, but it requires an understanding of “streams”.
- `INDEX/1` is defined but not documented in jq 1.6, which can confuse the casual reader.
- When handling complex keys and non-string keys the generated key is the output of `tojson`, which looks stupid in most cases (see the example below). And it works correctly only if every lookup uses the same order of attributes when constructing the lookup key. Therefore I think in these cases the key requires post-processing anyway, which can be done easily in an explicit `reduce` filter.
> jq 'INDEX({id, data})'
{
"{\"id\":\"id-1\",\"data\":\"123\"}": {
"id": "id-1",
"data": "123"
},
"{\"id\":\"id-2\",\"data\":\"456\"}": {
"id": "id-2",
"data": "456"
}
}
The last reason why I will probably not use `INDEX/1` by default, even if my environments are upgraded to jq 1.6, is this: In most cases I don’t want the exact output shown so far but some variant of it. But most of the required post-processing can be done easily in the `reduce` step.
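For example, a more readable composite key can be built right in the update step – just a sketch with a made-up `/` separator, assuming both values are strings and never contain that character:

reduce .[] as $e ( {}; .[$e.id + "/" + $e.data] = $e )

For the input above this produces the keys `id-1/123` and `id-2/456` instead of the JSON-encoded ones.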
Pull-Out with del
One gripe with the output so far: it now contains redundant information. In most cases that might be OK for intermediate steps but not for something final. The clean output looks like this:
{
"id-1": {
"data": "123"
},
"id-2": {
"data": "456"
}
}
The simple solution is – of course – to do some post-processing like

| map_values(del(.id))

or even to fold it into the previous `map` step. This is OKish if the first part is a predefined function like `INDEX`. But if the first part contains a `reduce` anyway, then this is better and simpler:
reduce .[] as $e ( {}; .[$e.id] = ($e | del(.id)) )
Or as a convenient function:
def pullout_by(k):
reduce .[] as $e ( {}; .[$e|k] = ($e | del(k)) );
pullout_by(.id)
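If the cleanup varies from case to case, the key filter and the value filter can also be passed separately – a hypothetical generalization that is not used further below:

def pullout_by(kf; vf):
  reduce .[] as $e ( {}; .[$e|kf] = ($e|vf) );
pullout_by(.id; del(.id))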
Non-unique keys
Another variant of pull-out is required if the keys are not unique, like in this example:
[
{
"id": "id-1",
"data": "123"
},
{
"id": "id-2",
"data": "456"
},
{
"id": "id-2",
"data": "789"
}
]
Of course the output must be adapted by wrapping an array around the nested objects:
{
"id-1": [
{
"data": "123"
}
],
"id-2": [
{
"data": "456"
},
{
"data": "789"
}
]
}
This is similar to `group_by`, and a naive filter using predefined functions is indeed this:
group_by(.id) | map({ (.[0].id): map(del(.id)) }) | add
But as I’ve explained above, this multi-step transformation is harder to understand than the direct implementation with `reduce`. In fact the required changes to `pullout_by` are quite small:
def pullout_groups_by(k):
reduce .[] as $e ( {}; .[$e|k] += [$e | del(k)] );
pullout_groups_by(.id)
A note about performance: `group_by` itself is not implemented in C but defined as a builtin jq function. `group_by` calls the function `_group_by_impl/1`, which is implemented in C. So I assume that `group_by` by itself is faster than `reduce`. But I guess that any speed advantage is lost to the additional processing steps required for this use case.
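For the curious, the jq 1.6 builtin library defines it roughly like this (again paraphrased; the exact spelling may differ):

def group_by(f): _group_by_impl(map([f]));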
multiple levels
The last and most fun variant of pull-out covers multiple levels of pulling. Given this input:
[
{
"id1": "foo1",
"id2": "bar1",
"id3": "bla1",
"data": 42
},
{
"id1": "foo1",
"id2": "bar1",
"id3": "bla2",
"data": 43
}
]
the expected output is:
{
"foo1": {
"bar1": {
"bla1": {
"data": 42
},
"bla2": {
"data": 43
}
}
}
}
One solution uses the previously defined functions:
pullout_groups_by(.id1)
| map_values(
pullout_groups_by(.id2)
| map_values(pullout_by(.id3))
)
Not too bad to understand. The trick is to extract the “largest” key first.
But again this can be optimized into one stage using the same principle used so far:
reduce .[] as $e (
{};
.[$e.id1][$e.id2][$e.id3] = ($e | del(.id1, .id2, .id3))
)
If `data` is not only a placeholder for arbitrary additional data but indeed the last key, then it gets even simpler:
reduce .[] as {$id1, $id2, $id3, $data} (
{};
.[$id1][$id2][$id3] = $data
)
summing up Pull-Out
To sum up pull-out: Use `INDEX/1` if you can and it matches your use case exactly. Use `reduce` directly or the above functions (`pullout_by/1`, `pullout_groups_by/1`) when they match your use case better. But the multi-level case should be handled by using `reduce` directly.
Now to the reverse operation.
Push-Into
The reverse operation of pull-out is “push-into”: pushing a key into the child objects. This is also a very useful and common operation.
The simplest variant is the reverse operation of `INDEX(.id)`, because the child objects still contain the key/value pair. So only the top-level object must be removed while gathering all values. Easy:
[ .[] ]
But what if the key/value pair is not inside the child objects?
unique data
If the child objects do not contain the key/value pair, the input looks like this:
{
"id-1": {
"data": "123"
},
"id-2": {
"data": "456"
}
}
then the required output should be:
[
{
"id": "id-1",
"data": "123"
},
{
"id": "id-2",
"data": "456"
}
]
And as with the opposite pull-out, the push-into operation can be implemented in several ways. One naive way using predefined functions is this:
to_entries | map( .value.id = .key | .value )
Here the same nitpick as above applies: `to_entries` creates intermediate objects, and both `to_entries` and `map` iterate essentially over the same list. The following filter doesn’t do that:
[ keys_unsorted[] as $k | .[$k] | .id = $k ]
At first glance it seems that there is no loop at all. The loop is hidden in the variable binding: `keys_unsorted[]` returns multiple results (so this expression is a stream). Each of these results is bound to the variable, and the following filter is executed once per binding. The filter produces one result for each call – so this is also a stream. All results are gathered by the surrounding array constructor `[...]`.
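To make the hidden loop visible, strip the array constructor and run the inner filter on its own – for the input above it then emits two separate results, one per key, instead of a single array:

keys_unsorted[] as $k | .[$k] | .id = $k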
So the reverse operation of `pullout_by(k)` can be conveniently defined as:
def pushinto_by(k):
[ keys_unsorted[] as $k | .[$k] | k = $k ];
pushinto_by(.id)
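A quick sanity check: applied back to back, the two functions invert each other – up to the order of the keys inside the child objects:

pullout_by(.id) | pushinto_by(.id)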
non-unique data
If the keys are not unique (like in the output of `pullout_groups_by`), then `pushinto_by` will not work, because it is confused by the intermediate arrays in the input:
{
"id-1": [
{
"data": "123"
}
],
"id-2": [
{
"data": "456"
},
{
"data": "789"
}
]
}
The expected output is this:
[
{
"id": "id-1",
"data": "123"
},
{
"id": "id-2",
"data": "456"
},
{
"id": "id-2",
"data": "789"
}
]
A simple change (an additional `[]`) does the job:
def pushinto_groups_by(k):
[ keys_unsorted[] as $k | .[$k][] | k = $k ];
pushinto_groups_by(.id)
only add keys
Another variant of push-into pushes the key into the objects but also keeps the outer structure. The expected output for the “unique id” case is:
{
"id-1": {
"id": "id-1",
"data": "123"
},
"id-2": {
"id": "id-2",
"data": "456"
}
}
The outer object can still be used as a fast hash map into the data, while at the same time the data objects can be retrieved and handed to other filters that also require the keys.
A naive implementation is:
with_entries(.value.id = .key)
And – as in most cases so far – a more efficient and still comprehensible expression can be found. Note that the `reduce` below starts from the input itself (`.`) instead of an empty object, which is what keeps the outer structure:
reduce keys_unsorted[] as $k ( .; .[$k].id = $k )
multiple levels
Similar to pull-out, the last variant of push-into handles multiple levels. Given the input:
{
"foo1": {
"bar1": {
"bla1": {
"data": 42
},
"bla2": {
"data": 43
}
}
}
}
the output should be:
[
{
"id1": "foo1",
"id2": "bar1",
"id3": "bla1",
"data": 42
},
{
"id1": "foo1",
"id2": "bar1",
"id3": "bla2",
"data": 43
}
]
Again the previously defined functions can be used like this:
map_values(
map_values(
pushinto_by(.id3)
)
| pushinto_groups_by(.id2)
)
| pushinto_groups_by(.id1)
Here the trick is to start with the innermost level and work outwards.
And again: there is a solution using the lessons learned so far:
[
keys_unsorted[] as $id1
| .[$id1]
| keys_unsorted[] as $id2
| .[$id2]
| keys_unsorted[] as $id3
| .[$id3]
| . + { $id1, $id2, $id3 }
]
Basically this is `pushinto_by` applied three times. The only difference is that instead of doing three separate assignments, an intermediate object with the three keys/values is added. I think the intermediate object reads better in this case.
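For comparison, the final stage written as three separate assignments would read:

| .id1 = $id1 | .id2 = $id2 | .id3 = $id3

Both variants produce the same objects; the merged object simply states the intent in a single expression.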
summing up push-into
If some piece of data uses the `{"key": {"data": ...}}` format, then it usually does so in more than one place. Flattening these object hierarchies using predefined functions or `pushinto_by` is possible but tricky. Therefore the direct implementation is very useful here – although it’s not as terse and concise as I’d like.¹
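The recursive function mentioned in the footnote might, for instance, take this shape – only a sketch, with `pushinto_rec` being a made-up name and the key names passed in explicitly:

def pushinto_rec(ks):
  if ks == [] then [ . ]            # no more key levels: wrap the leaf object
  else ks[0] as $kn
    | [ keys_unsorted[] as $k       # iterate over the keys of the current level
        | .[$k]
        | pushinto_rec(ks[1:])[]    # flatten all levels below
        | { ($kn): $k } + . ]       # prepend this level's key/value pair
  end;
pushinto_rec(["id1", "id2", "id3"])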
Summary
The pull-out and push-into patterns are very useful to know – both for custom and for predefined data formats. But as it is with patterns: some use cases can be implemented as functions, but not all, due to the countless variants. So the direct implementations and the principles behind them are very useful to understand, because they can be adapted more easily without losing the casual reader of the code.
¹ Perhaps I will experiment with a recursive function for that.