This project demonstrates how Spark can be used to solve a common problem encountered when importing data from
a relational database into MarkLogic. In order to fit a logical entity - such as a person with many addresses - into a
set of tables, it is necessary to "break apart" the entity - e.g. person data is stored in a "persons" table, and address
data is stored in an "addresses" table. Treating the entity as a whole then requires joining the rows back together,
which can involve a large number of joins depending on the nature of the entity. Additionally, securing access to the
logical entity is far more difficult when the entity is spread across many tables.

In a document database such as MarkLogic, a logical entity doesn't need to be broken apart. The hierarchical nature of
a JSON or XML document allows for [composition relationships](https://www.uml-diagrams.org/composition.html) to be
represented via nested data structures. This approach simplifies searching, viewing, updating, and securing the entity.
Related data can still be stored in other documents when desired, but being able to "reassemble" the entity when
importing the data into MarkLogic is key to leveraging these advantages.
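
For example, the person entity mentioned above could be captured as a single document with its addresses nested
inside (a hypothetical illustration, not part of this project's dataset):

```
{
  "personId": 1,
  "name": "Jane Smith",
  "addresses": [
    { "street": "123 Main St", "city": "Springfield" },
    { "street": "456 Oak Ave", "city": "Shelbyville" }
  ]
}
```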

To try out this example, you will need [Docker](https://www.docker.com/get-started/) installed. Docker is used to
create an instance of MarkLogic, a PostgreSQL (referred to in this document as "Postgres") relational database, and
[pgadmin](https://www.pgadmin.org/). You will also need Java 8 or higher, and if you are already running an instance
of MarkLogic locally, it is recommended to stop that instance first. Note that the pgadmin instance is included solely
for users wishing to use it to explore the Postgres database; using pgadmin is outside the scope of this document.

## Setting up the Docker containers

To begin, clone this repository if you have not already, open a terminal, and run the following commands:

    cd examples/entity-aggregation
    docker-compose up -d --build

Depending on whether you've already downloaded the MarkLogic and Postgres images, this may take a few minutes to complete.
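
You can confirm that the MarkLogic, Postgres, and pgadmin containers are all running via `docker ps`, which lists
each running container along with its name and status.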

## Setting up the Postgres database

Go to [this Postgres tutorial](https://www.postgresqltutorial.com/postgresql-getting-started/postgresql-sample-database/)
in your web browser. Scroll down to the section titled "Download the PostgreSQL sample database". Follow the
instructions in that section for downloading `dvdrental.zip` and extracting it to produce a file named
`dvdrental.tar` (any zip extraction tool should work, including the JDK's `jar` tool via `jar xvf dvdrental.zip`).
Copy the `dvdrental.tar` file to the `./docker/postgres` directory in this example project.

The DVD rental dataset is used as a source of relational data that can be imported into MarkLogic. The dataset contains
sample data about movies, customers, rentals, and other related entities. In this example, customers and their rentals
will be joined together and loaded into MarkLogic as customer documents with their related rentals nested inside.

The DVD rental dataset must be loaded into a database in Postgres. Run the following Docker command to create a
new database named `dvdrental`:

    docker exec -it entity_aggregation-postgres-1 psql -U postgres -c "CREATE DATABASE dvdrental"

This will output "CREATE DATABASE" as an indication that it ran successfully.

Then run the following Docker command to use the Postgres `pg_restore` tool to load the `dvdrental.tar` file into the
`dvdrental` database:

    docker exec -it entity_aggregation-postgres-1 pg_restore -U postgres -d dvdrental /opt/dvdrental.tar

After this completes, the `dvdrental` database will contain 15 tables. You can use the pgadmin application included
in this project's `docker-compose.yml` file to browse the database if you wish.
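
If you would prefer to confirm the restore without pgadmin, psql's `\dt` meta-command lists the tables in the
`dvdrental` database:

    docker exec -it entity_aggregation-postgres-1 psql -U postgres -d dvdrental -c "\dt"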

## Importing customers from Postgres to MarkLogic

This project contains an example Java program that uses Spark and its aggregation capabilities to accomplish the following:

1. Retrieve customers from Postgres, joined with their rentals.
2. Aggregate the rentals for each customer into a "rentals" column containing structs.
3. Write each customer row as a new document to the "Documents" database in MarkLogic.

Each customer document in MarkLogic thus captures the desired data structure, with rentals nested inside their
associated customer, as opposed to having separate customer and rental documents.
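
As a minimal sketch of this read-aggregate-write flow - not the project's actual code, for which see
`ImportCustomers.java` - the following assumes the Postgres JDBC driver and the MarkLogic Spark connector are on the
classpath; the connection details, credentials, and the connector's format name and option keys are assumptions here:

```
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.collect_list;
import static org.apache.spark.sql.functions.struct;

public class ImportCustomersSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("ImportCustomersSketch")
            .master("local[*]")
            .getOrCreate();

        // 1. Retrieve customers joined with their rentals via Spark's JDBC data source.
        Dataset<Row> rows = spark.read()
            .format("jdbc")
            .option("driver", "org.postgresql.Driver")
            .option("url", "jdbc:postgresql://localhost:5432/dvdrental")
            .option("user", "postgres")
            .option("password", "postgres") // hypothetical credentials
            .option("query", "SELECT c.customer_id, c.first_name, c.last_name, c.email, " +
                "r.rental_id, r.rental_date, r.return_date " +
                "FROM customer c INNER JOIN rental r ON c.customer_id = r.customer_id")
            .load();

        // 2. Aggregate each customer's rentals into a "rentals" column of structs.
        Dataset<Row> customers = rows
            .groupBy("customer_id", "first_name", "last_name", "email")
            .agg(collect_list(struct("rental_id", "rental_date", "return_date")).as("rentals"));

        // 3. Write each aggregated row as a document to MarkLogic. The format name and
        // option key below are assumptions; consult the MarkLogic Spark connector
        // documentation and ImportCustomers.java for the actual values used.
        customers.write()
            .format("marklogic")
            .option("spark.marklogic.client.uri", "admin:admin@localhost:8000")
            .mode(SaveMode.Append)
            .save();
    }
}
```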

Run the Spark program via the following command (the use of Gradle is solely for demonstration purposes, as it
simplifies constructing the necessary classpath for the program; in production, a user would use
[spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) or a similar mechanism provided by their Spark environment):

    ./gradlew importCustomers

You can then use [MarkLogic's qconsole application](https://docs.marklogic.com/guide/qconsole/intro) to view the
customer documents written to the Documents database. As there are 599 customer rows in the Postgres `customer` table,
you will see 599 customer documents in the Documents database.

As an example, each customer document will have a "rentals" array (the one shown here is abbreviated for demonstration
purposes):

```
{
  "customer_id": 12,
  "first_name": "Nancy",
  "last_name": "Thomas",
  "email": "nancy.thomas@sakilacustomer.org",
  "rentals": [
    {
      "rental_id": 988,
      "rental_date": "2005-05-31T03:08:03.000Z",
      "return_date": "2005-06-07T04:22:03.000Z"
    },
    {
      "rental_id": 1084,
      "rental_date": "2005-05-31T15:10:17.000Z",
      "return_date": "2005-06-01T15:15:17.000Z"
    },
    {
      "rental_id": 1752,
      "rental_date": "2005-06-16T21:02:55.000Z",
      "return_date": "2005-06-23T23:09:55.000Z"
    }
  ]
}
```

## How the data is aggregated together

The `./src/main/java/org/example/ImportCustomers.java` file in this project contains the source code for the Spark
program. Please see this file for inline comments that explain how Spark is used to aggregate rentals into their
related customers.

Note as well that this example is not intended to be authoritative. Please see
[the Spark programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) for complete information
on writing Spark programs. You may also find
[this reference on Spark aggregate functions](https://sparkbyexamples.com/spark/spark-sql-aggregate-functions/) helpful.