Skip to content

alexander-schranz/how-to-flatten-and-unflatten-nested-objects

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

How to flatten and unflatten complex JSON Objects or PHP arrays?

While working on SEAL the Search Engine Abstraction Layer for PHP I have struggled with one issue.

Not all search engines support nested objects in the same way. So for some search engines it is required to flatten the nested objects into a single level array.

So if we have, for example, something like this:

[
    'id' => 1,
    'title' => 'Title',
    'blocks' => [
        [
           'title' => 'Title 1',
           'tags' => ['UI', 'UX'],
        ],
        [
           'title' => 'Title 2',
           'tags' => ['Tech'],
        ],
    ],
]

we require it to flatten it to something like that:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
]

That way it is possible to query on the nested field blocks.tags in search engines which do not support such nested objects.

Time to try to flatten the object

The easiest way to flatten the object is to use a recursive function which will iterate over the array and flatten the keys:

public function doFlatten(array $data): array
{
    $newData = [];
    foreach ($data as $key => $value) {
        if (!is_array($value)) {
            $newData[$prefix . $key] = $value;

            continue;
        }

        $flattened = $this->doFlatten($value, $key . '.');
        foreach ($flattened as $subKey => $subValue) {
            $newData[$prefix . $subKey] = $subValue;
        }
    }

    return $newData;
}

But this ends in our example in the following output:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.0.title' => 'Title 1',
    'blocks.0.tags.0' => 'UI',
    'blocks.0.tags.1' => 'UX',
    'blocks.1.title' => 'Title 2',
    'blocks.1.tags.0' => 'Tech'
 ]

The above is a very common format if you search for "Flatten JSON algorithm" examples. But which isn't what we want, even we add a check on array_is_list and ending up in:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.0.title' => 'Title 1',
    'blocks.0.tags' => [
        'UI',
        'UX',
    ],
    'blocks.1.title' => 'Title 2',
    'blocks.1.tags' => [
        'Tech',
    ],
]

It doesn't allow us to query on our required blocks.tags field as it is split into multiple fields. But it's already a good step forward, but I reverted the array_is_list to have the common flatten by blocks.0.tags.0 as it is easier to move forward from there for very complex objects.

In the next step we will just iterate over the array check if it contains numbers split by using:

$newData = [];
foreach ($flattenData as $key => $value) {
    unset($flattenData[$key]);
    $newKey = preg_replace('/\.(\d+)\./', '.', $key, -1);
    $newKey = preg_replace('/\.(\d+)$/', '', $newKey, -1);
    
    //

If the $newKey still matches $key we can directly go to the next step:

    if ($newKey === $key) {
        $newData[$key] = $value;
    
        continue;
    }

If not, we convert the result into an array and eventually merge the values:

    $newData[$newKey] = [
        ...($newData[$newKey] ?? []),
        ...(is_array($value) ? $value : [$value]),
    ];
}

So the full example for flatten the data look like this:

class FlattenMarshaller
{
    /**
     * @param array<string, mixed> $data
     *
     * @return array<string, mixed>
     */
    public function flatten(array $data, string $prefix = ''): array
    {
        $flattenData = $this->doFlatten($data, $prefix);

        $newData = [];
        foreach ($flattenData as $key => $value) {
            unset($flattenData[$key]);

            $newKey = preg_replace('/\.(\d+)\./', '.', $key, -1);
            $newKey = preg_replace('/\.(\d+)$/', '', $newKey, -1);

            if ($newKey === $key) {
                $newData[$key] = $value;

                continue;
            }

            $newValue = is_array($value) ? $value : [$value];

            $newData[$newKey] = [
                ...($newData[$newKey] ?? []),
                ...$newValue,
            ];
        }

        return $newData;
    }

    /**
     * @param array<string, mixed> $data
     *
     * @return array<string, mixed>
     */
    private function doFlatten(array $data, string $prefix = ''): array
    {
        $newData = [];
        foreach ($data as $key => $value) {
            if (!is_array($value)) {
                $newData[$prefix . $key] = $value;

                continue;
            }

            $flattened = $this->doFlatten($value, $key . '.');
            foreach ($flattened as $subKey => $subValue) {
                $newData[$prefix . $subKey] = $subValue;
            }
        }

        return $newData;
    }
}

and the result is now as expected:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
]

All fine tests pass and we are happy or? We just use the reverse logic to unflatten the data?

Prepare unflatten the object

As we now have flatten the data like this:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
]

And we want again this:

[
    'id' => 1,
    'title' => 'Title',
    'blocks' => [
        [
           'title' => 'Title 1',
           'tags' => ['UI', 'UX'],
        ],
        [
           'title' => 'Title 2',
           'tags' => ['Tech'],
        ],
    ],
]

We will stumble over a small problem.

You already maybe see it. While blocks.title may straight forward the blocks.tags is not. We don't know longer that ['UI', 'UX'] was part of blocks.0 and ['Tech'] was part of blocks.1.

My current workaround in SEAL was to have an additional field lets call it _raw which includes the whole original document JSON encoded. So we can use that to unflatten the data back to the original array.:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
    '_raw' => '{"id":1,"title":"Title","blocks":[{"title":"Title 1","tags":["UI","UX"]},{"title":"Title 2","tags":["Tech"]}]}',
]

This comes with the big disadvantage that the document is now twice as big as it was before. The second issue is that when using the highlighting feature, it is applied to the flatten data not the raw data and so highlighting not works correctly in all cases.

So we need a better solution for this problem.

We need some kind of additional metadata to know how we can map the data back to the original array, which I did come up with something like this:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
    '_metadata' => [
        'blocks.title' => ['blocks.0.title', 'blocks.1.title'],
        'blocks.tags' => ['blocks.0.tags.0', 'blocks.0.tags.1', 'blocks.1.tags.0'],
    ],
]

So we can use the metadata to unflatten the data back to the original array, but first we need to create the metadata with a small adoption:

+      $metadata = [];
       foreach ($flattenData as $key => $value) {
            // ..

            $newData[$newKey] = [
                ...($newData[$newKey] ?? []),
                ...$newValue,
            ];

+           foreach ($newValue as $v) {
+               $metadata[$newKey][] = $key;
+           }
        }

+       $newData['_metadata'] = $metadata;

        return $newData;

So yes, it is a basic case, but yes our metadata still looks longer as our actual document. In real use cases the documents will be a lot bigger with more texts so we could keep it this way. But I would like to compress the metadata by using placeholders of the data we already know by the key and not require again:

    '_metadata' => [
-        'blocks.title' => ['blocks.0.title', 'blocks.1.title'],
+        'blocks.title' => ['*.0.*', '*.1.*'],
-        'blocks.tags' => ['*.0.*.0', '*.0.*.1', '*.1.*.0'],
+        'blocks.tags' => ['*.0.*.0', '*.0.*.1', '*.1.*.0'],
    ],

With a callback regex replacement we can achieve that:

-               $metadata[$newKey][] = $key;
+               $metadata[$newKey][] = \preg_replace_callback('/[^.]+/', function ($matches) {
                    return \is_numeric($matches[0]) ? $matches[0] : '*';
                }, $key)

As we already split by . the is_numeric check is enough to know if it is the list index which place is skipped.

So we flatten the object to:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
    '_metadata' => [
        'blocks.title' => ['*.0.*', '*.1.*'],
        'blocks.tags' => ['*.0.*.0', '*.0.*.1', '*.1.*.0'],
    ],
]

But the metadata is again some unsupported nested object to fix that we call the json_encode method on it:

[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
    '_metadata' => '{"blocks.title":["*.0.*","*.1.*"],"blocks.tags":["*.0.*.0","*.0.*.1","*.1.*.0"]}',
]

So with a single additional field we have enough metadata to unflatten the data back to the original array.

Unflatten the object

Unflatten the data is kind of less todo as flatten the data. First we extract the metadata from the original data and all fields which do not have metadata can be directly used:

    public function unflatten(array $data): array
    {
        $metadata = json_decode($data['_metadata'], true, flags: JSON_THROW_ON_ERROR);
        unset($data['_metadata']);

        foreach ($data as $key => $value) {
            if (!array_key_exists($key, $metadata)) {
                $newData[$key] = $value;

                continue;
            }
            
            // ...
        }
        
        return $newData;
    }

First we need to split the key e.g. blocks.tags into ['blocks', 'tags']. That value is the array of the data which we loop over.

Via the preg_replace_callback we can replace the * so *.0.* becomes blocks.0.tags and so on. Which the keyPath we can use to create the unflatten data structure.

            // ...
            
            $keyParts = explode('.', $key);

            foreach ($value as $subKey => $subValue) {
                $keyPartsReplacements = $keyParts;

                $keyPath = preg_replace_callback('/\*/', function () use (&$keyPartsReplacements) {
                    return array_shift($keyPartsReplacements);
                }, $metadata[$key][$subKey]);
                
                // ...
            }

Via a reference we can get into the deep structure we require to set our value we got from the flatten data:

                // ...

                $newSubData = &$newData;
                foreach (explode('.', $keyPath) as $newKeyPart) {
                    $newSubData = &$newSubData[$newKeyPart];
                }

                $newSubData = $subValue;

So the whole unflatten method look like this:

    public function unflatten(array $data): array
    {
        $newData = [];
        $metadata = $data[$this->metadataKey] ?? [];
        unset($data[$this->metadataKey]);

        foreach ($data as $key => $value) {
            if (!\array_key_exists($key, $metadata)) {
                $newData[$key] = $value;

                continue;
            }

            $keyParts = \explode('.', $key);

            foreach ($value as $subKey => $subValue) {
                $keyPartsReplacements = $keyParts;

                $newKeyPath = \preg_replace_callback('/\*/', function () use (&$keyPartsReplacements) {
                    return \array_shift($keyPartsReplacements);
                }, $metadata[$key][$subKey]);

                $newSubData = &$newData;
                foreach (explode('.', $newKeyPath) as $newKeyPart) {
                    $newSubData = &$newSubData[$newKeyPart];
                }

                $newSubData = $subValue;
            }
        }

        return $newData;
    }

With these two methods we are able to flatten and unflatten JSON objects:

// unflatten data:
[
    'id' => 1,
    'title' => 'Title',
    'blocks' => [
        [
           'title' => 'Title 1',
           'tags' => ['UI', 'UX'],
        ],
        [
           'title' => 'Title 2',
           'tags' => ['Tech'],
        ],
    ],
];

// flatten data:
[
    'id' => 1,
    'title' => 'Title',
    'blocks.title' => ['Title 1', 'Title 2'],
    'blocks.tags' => ['UI', 'UX', 'Tech'],
    '_metadata' => '{"blocks.title":["*.0.*","*.1.*"],"blocks.tags":["*.0.*.0","*.0.*.1","*.1.*.0"]}',
];

Addition after usage in SEAL

For SEAL use case this is enough. I was wrong in this case after try to use that I found out that I used _ in Loupe as field separator which can conflict with SEAL field name.

Example if we rename our blocks field to content_blocks and use _ instead of . as a separator:

// unflatten data:
[
    'id' => 1,
    'title' => 'Title',
    'content_blocks' => [
        [
           'title' => 'Title 1',
           'tags' => ['UI', 'UX'],
        ],
        [
           'title' => 'Title 2',
           'tags' => ['Tech'],
        ],
    ],
];

// flatten data:
[
    'id' => 1,
    'title' => 'Title',
    'content_blocks_title' => ['Title 1', 'Title 2'],
    'content_blocks_tags' => ['UI', 'UX', 'Tech'],
    '_metadata' => '{"content_blocks_title":["*_0_*","*_1_*"],"content_blocks_tags":["*_0_*_0","*_0_*_1","*_1_*_0"]}',
];

I will not go over the whole refactoring, but I am now using two configurable separators, one for the field (_) and one for the metadata (/).

// unflatten data:
[
    'id' => 1,
    'title' => 'Title',
    'content_blocks' => [
        [
           'title' => 'Title 1',
           'tags' => ['UI', 'UX'],
        ],
        [
           'title' => 'Title 2',
           'tags' => ['Tech'],
        ],
    ],
];

// flatten data:
[
    'id' => 1,
    'title' => 'Title',
    'content_blocks_title' => ['Title 1', 'Title 2'],
    'content_blocks_tags' => ['UI', 'UX', 'Tech'],
    '_metadata' => '{"content_blocks/title":["*/0/*","*/1/*"],"content_blocks/tags":["*/0/*/0","*/0/*/1","*/1/*/0"]}',
];

This way even if the search engine is limited to specific field names characters, we can still flatten and unflatten the data.

The whole implementation can be found in src/FlattenMarshaller.php.

Conclusion

For SEAL use case this is enough; the solution comes with a limitation that field names always need to start with a letter as numeric fields are reserved for the list indexes. This is not a problem for SEAL as it already has that strict field name definition.

If you have any other recommendation to tackle such task to flatten and unflatten an object with a better solution, please let me know.

About

POC to flatten nested array und unflatten them

Topics

Resources

Stars

Watchers

Forks

Languages