This tool enables reproducible data shuffling for JSONLines streams.
Here is an example YAML configuration file:
version: 1          # Version of the configuration, only 1 is allowed for now
seed: 42            # Starting seed for the pseudo-random process, ensures consistency between executions
frameSize: 1000     # Frame size is the size of the processing window; should be as large as possible
selectors:          # Each selector in this list triggers a permutation between JSONLines
  - $.name          # A selector is defined by a JSONPath expression
  - $.surname
  - group:          # A group of selectors swaps attributes together
    - $.age
    - $.nationality
Notes:
- The seed parameter is optional. Use it only if you need a reproducible execution (every execution gives the same result). Change the value to obtain different results.
- The frameSize parameter is crucial to the quality of the permutation, because it defines the size of the processing window. Set it as large as possible: a larger window lets more values be permuted with one another and reduces the likelihood that a value ends up back at its original position.
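To make the role of seed and frameSize concrete, here is a minimal Python sketch of a seeded, windowed shuffle. It is an illustration only, written under the assumption that values are permuted within consecutive frames; shuffle_by_frames and its parameters are hypothetical names and do not describe TIPO's actual implementation.

```python
import random
from itertools import islice


def shuffle_by_frames(values, frame_size, seed=None):
    """Shuffle values inside consecutive windows of frame_size items.

    Hypothetical helper for illustration; not TIPO's real algorithm.
    """
    rng = random.Random(seed)  # seed=None gives a different order on each run
    iterator = iter(values)
    while True:
        frame = list(islice(iterator, frame_size))
        if not frame:
            break
        rng.shuffle(frame)  # values never leave their own frame
        yield from frame


data = list(range(10))
small = list(shuffle_by_frames(data, frame_size=3, seed=42))
large = list(shuffle_by_frames(data, frame_size=1000, seed=42))
rerun = list(shuffle_by_frames(data, frame_size=3, seed=42))
print(small == rerun)  # True: the same seed reproduces the same order
```

With frame_size=3 a value can only move within its own group of three, whereas a frame larger than the stream lets any value land anywhere. This is the intuition behind setting frameSize as large as possible.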
Suppose our input stream of type JSONLines is stored in a "stream.jsonl" file:
{"company":"acme","employees":[{"name":"one","children":[{"name":"child 1"},{"name":"child 2"}]},{"name":"two","children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}
{"company":"megacorp","employees":[{"name":"alpha","children":[{"name":"kid 1"}]},{"name":"beta","children":[{"name":"kid 2"},{"name":"kid 3"}]}]}
{"company":"dynatech","employees":[{"name":"first","children":[{"name":"offspring 1"},{"name":"offspring 2"}]},{"name":"second","children":[]}]}
The following configuration file, named swap.yml, is used:
version: 1
seed: 42
frameSize: 1000
selectors:
  - $.employees.*.children
In this example, we want to swap the children of the employees. Siblings will not be separated, as the JSONPath $.employees.*.children selects entire arrays of children. However, children will be redistributed to new parents.
tipo < stream.jsonl
The result will be the following:
{"company":"acme","employees":[{"name":"one","children":[{"name":"kid 2"},{"name":"kid 3"}]},{"name":"two","children":[{"name":"child 1"},{"name":"child 2"}]}]}
{"company":"megacorp","employees":[{"name":"alpha","children":[]},{"name":"beta","children":[{"name":"kid 1"}]}]}
{"company":"dynatech","employees":[{"name":"first","children":[{"name":"offspring 1"},{"name":"offspring 2"}]},{"name":"second","children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}
Note: The tipo command can use the path to the configuration file with the -c flag. If no path is provided, it will look for the swap.yml file by default, which must be located in the project's root directory.
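One way to sanity-check a run like the one above is to confirm that the output contains exactly the same children arrays as the input, only attached to different parents. The following Python sketch does this for the example data; children_arrays is a hypothetical helper written for this illustration and is not part of TIPO.

```python
import json
from collections import Counter


def children_arrays(lines):
    """Count each employee's children array, serialized as canonical JSON."""
    counts = Counter()
    for line in lines:
        record = json.loads(line)
        for employee in record["employees"]:
            counts[json.dumps(employee["children"], sort_keys=True)] += 1
    return counts


# Input and output lines copied from the example above.
before = [
    '{"company":"acme","employees":[{"name":"one","children":[{"name":"child 1"},{"name":"child 2"}]},{"name":"two","children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}',
    '{"company":"megacorp","employees":[{"name":"alpha","children":[{"name":"kid 1"}]},{"name":"beta","children":[{"name":"kid 2"},{"name":"kid 3"}]}]}',
    '{"company":"dynatech","employees":[{"name":"first","children":[{"name":"offspring 1"},{"name":"offspring 2"}]},{"name":"second","children":[]}]}',
]
after = [
    '{"company":"acme","employees":[{"name":"one","children":[{"name":"kid 2"},{"name":"kid 3"}]},{"name":"two","children":[{"name":"child 1"},{"name":"child 2"}]}]}',
    '{"company":"megacorp","employees":[{"name":"alpha","children":[]},{"name":"beta","children":[{"name":"kid 1"}]}]}',
    '{"company":"dynatech","employees":[{"name":"first","children":[{"name":"offspring 1"},{"name":"offspring 2"}]},{"name":"second","children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}',
]

# The same arrays must appear the same number of times on both sides.
same_arrays = children_arrays(before) == children_arrays(after)
print(same_arrays)  # True
```

The check compares multisets, so it passes whenever whole arrays were moved intact, regardless of which parent received them.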
Using the same data file stream.jsonl, consider this TIPO configuration:
version: 1
seed: 42
frameSize: 1000
selectors:
  - $.employees.*.children.*
In this example, we want to mix the children together. Siblings will be separated because the JSONPath $.employees.*.children.* selects each child individually. However, each parent will retain their original number of children.
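The difference between the two selectors can be seen by listing what each one matches on the first line of stream.jsonl. The Python sketch below uses a deliberately simplified path walker that handles only dotted keys and the * wildcard; select is a hypothetical helper, not a full JSONPath implementation and not the resolver TIPO uses.

```python
import json


def select(node, path):
    """Return the values matched by a simplified path such as '$.a.*.b'."""
    steps = path.lstrip("$").strip(".").split(".")
    matches = [node]
    for step in steps:
        next_matches = []
        for current in matches:
            if step == "*":
                # Wildcard: descend into every list item or mapping value.
                if isinstance(current, list):
                    next_matches.extend(current)
                elif isinstance(current, dict):
                    next_matches.extend(current.values())
            elif isinstance(current, dict) and step in current:
                next_matches.append(current[step])
        matches = next_matches
    return matches


line = '{"company":"acme","employees":[{"name":"one","children":[{"name":"child 1"},{"name":"child 2"}]},{"name":"two","children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}'
record = json.loads(line)

arrays = select(record, "$.employees.*.children")      # whole arrays
children = select(record, "$.employees.*.children.*")  # individual children
print(len(arrays))    # 2
print(len(children))  # 5
```

The first selector yields two units to permute (one array per employee), so siblings travel together. The second yields five units (one per child), so siblings can be separated while each array keeps its length.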
cat stream.jsonl | tipo
The result will be the following:
{"company":"acme","employees":[{"name":"one","children":[{"name":"offspring 2"},{"name":"child 2"}]},{"name":"two","children":[{"name":"child 5"},{"name":"kid 2"},{"name":"child 3"}]}]}
{"company":"megacorp","employees":[{"name":"alpha","children":[{"name":"kid 3"}]},{"name":"beta","children":[{"name":"kid 1"},{"name":"child 4"}]}]}
{"company":"dynatech","employees":[{"name":"first","children":[{"name":"child 1"},{"name":"offspring 1"}]},{"name":"second","children":[]}]}
Let's modify our dataset example by adding a new attribute, childnumber.
{"company":"acme","employees":[{"name":"one","childnumber":2,"children":[{"name":"child 1"},{"name":"child 2"}]},{"name":"two","childnumber":3,"children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}
{"company":"megacorp","employees":[{"name":"alpha","childnumber":1,"children":[{"name":"kid 1"}]},{"name":"beta","childnumber":2,"children":[{"name":"kid 2"},{"name":"kid 3"}]}]}
{"company":"dynatech","employees":[{"name":"first","childnumber":2,"children":[{"name":"offspring 1"},{"name":"offspring 2"}]},{"name":"second","childnumber":0,"children":[]}]}
When there is a need to permute a group of attributes coherently, such as ensuring that childnumber and the children list are swapped together, the configuration file named swap.yml should have the following content:
version: 1
seed: 42
frameSize: 1000
selectors:
  - group:
    - $.employees.*.childnumber
    - $.employees.*.children
tipo < stream.jsonl
The result will be the following:
{"company":"acme","employees":[{"name":"one","childnumber":2,"children":[{"name":"kid 2"},{"name":"kid 3"}]},{"name":"two","childnumber":2,"children":[{"name":"child 1"},{"name":"child 2"}]}]}
{"company":"megacorp","employees":[{"name":"alpha","childnumber":0,"children":[]},{"name":"beta","childnumber":1,"children":[{"name":"kid 1"}]}]}
{"company":"dynatech","employees":[{"name":"first","childnumber":2,"children":[{"name":"offspring 1"},{"name":"offspring 2"}]},{"name":"second","childnumber":3,"children":[{"name":"child 3"},{"name":"child 4"},{"name":"child 5"}]}]}
Suppose the following input stream is stored in a file named stream.jsonl:
{"company":"acme","employees":[{"name":"one","surname":"ONE","age":20,"nationality":"Kenyan"},{"name":"two","surname":"TWO","age":30,"nationality":"Icelandic"}]}
{"company":"megacorp","employees":[{"name":"alpha","surname":"ALPHA","age":40,"nationality":"Colombian"},{"name":"beta","surname":"BETA","age":50,"nationality":"Malaysian"}]}
{"company":"dynatech","employees":[{"name":"first","surname":"FIRST","age":60,"nationality":"Belgian"},{"name":"second","surname":"SECOND","age":70,"nationality":"Egyptian"}]}
The corresponding configuration file is named configuration.yml:
version: 1
seed: 42
frameSize: 1000
selectors:
  - group1: # name and surname will be swapped together
    - employees.*.name
    - employees.*.surname
  - group2: # age and nationality will be swapped together
    - employees.*.age
    - employees.*.nationality
  - company
The two groups are permuted independently of each other, and so is the single attribute company. The execution is done as follows:
tipo -c configuration.yml < stream.jsonl
And the result will be the following:
{"company":"dynatech","employees":[{"name":"beta","surname":"BETA","age":50,"nationality":"Malaysian"},{"name":"one","surname":"ONE","age":30,"nationality":"Icelandic"}]}
{"company":"acme","employees":[{"name":"second","surname":"SECOND","age":70,"nationality":"Egyptian"},{"name":"alpha","surname":"ALPHA","age":20,"nationality":"Kenyan"}]}
{"company":"megacorp","employees":[{"name":"first","surname":"FIRST","age":60,"nationality":"Belgian"},{"name":"two","surname":"TWO","age":40,"nationality":"Colombian"}]}
Note that the age and nationality fields have been swapped consistently and independently of the surname and name fields, which have also been swapped consistently.
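The grouping behaviour described in this note can be sketched in Python as follows. This is a simplified illustration that works on a flattened list of the six employees rather than on the nested stream; permute_group is a hypothetical helper, and the shuffle used here is an assumption, not TIPO's actual algorithm.

```python
import random


def permute_group(records, fields, rng):
    """Permute the listed fields across records as one indivisible unit."""
    bundles = [tuple(record[field] for field in fields) for record in records]
    rng.shuffle(bundles)
    for record, bundle in zip(records, bundles):
        for field, value in zip(fields, bundle):
            record[field] = value


# The six employees from stream.jsonl, flattened for simplicity.
employees = [
    {"name": "one", "surname": "ONE", "age": 20, "nationality": "Kenyan"},
    {"name": "two", "surname": "TWO", "age": 30, "nationality": "Icelandic"},
    {"name": "alpha", "surname": "ALPHA", "age": 40, "nationality": "Colombian"},
    {"name": "beta", "surname": "BETA", "age": 50, "nationality": "Malaysian"},
    {"name": "first", "surname": "FIRST", "age": 60, "nationality": "Belgian"},
    {"name": "second", "surname": "SECOND", "age": 70, "nationality": "Egyptian"},
]
original_pairs = {(e["age"], e["nationality"]) for e in employees}

rng = random.Random(42)
# Each group is shuffled on its own, so the two groups move independently.
permute_group(employees, ["name", "surname"], rng)
permute_group(employees, ["age", "nationality"], rng)

# Fields within a group still belong together after the permutation.
print(all(e["surname"] == e["name"].upper() for e in employees))  # True
```

Because each group is bundled into tuples before shuffling, a name always keeps its surname and an age always keeps its nationality, while the pairing between the two groups is free to change.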
In previous examples, the following group configuration has been presented:
selectors:
  - groupname:
    - employees.*.name
    - employees.*.surname
However, there are other ways to configure a group:
1- Inline array
selectors:
  - ["employees.*.name", "employees.*.surname"]
2- Inline array with map
selectors:
  - groupname: ["employees.*.name", "employees.*.surname"]
- CGI France ✉ Contact support
Copyright (C) 2023 CGI France
TIPO is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
TIPO is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with TIPO. If not, see http://www.gnu.org/licenses/.