Commit 5db9322

DEVEXP-616 Added entity aggregation example project
1 parent 2618dd2 commit 5db9322

File tree

12 files changed: +623 -1 lines changed
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
docker/MarkLogic
.gradle
build
out
gradle-local.properties

examples/entity-aggregation/README.md

Lines changed: 117 additions & 0 deletions
@@ -0,0 +1,117 @@
This project demonstrates how Spark can be used to solve a common problem encountered when importing data from
a relational database into MarkLogic. In order to fit a logical entity - such as a person with many addresses - into a
set of tables, it is necessary to "break apart" the entity - i.e. person data is stored in a "persons" table, and address
data is stored in an "address" table. Treating the entity as a whole then always requires joining the rows back
together, which, depending on the nature of the entity, can require a large number of joins. Additionally, securing
access to the logical entity is far more difficult when the entity is spread across many tables.

In a document database such as MarkLogic, a logical entity doesn't need to be broken apart. The hierarchical nature of
a JSON or XML document allows for [composition relationships](https://www.uml-diagrams.org/composition.html) to be
represented via nested data structures. This approach simplifies searching, viewing, updating, and securing the entity.
Related data can still be stored in other documents when desired, but being able to "reassemble" the entity when importing
the data into MarkLogic is key to leveraging these advantages.
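
To make the "reassembly" idea concrete, here is a minimal plain-Python sketch (hypothetical person/address data, not anything MarkLogic-specific) showing joined relational rows being folded back into a single nested document:

```python
# Hypothetical rows, as they might come back from joining a "persons" table
# to an "addresses" table; all names and fields here are illustrative only.
joined_rows = [
    {"person_id": 1, "name": "Jane", "street": "100 Main St", "city": "Springfield"},
    {"person_id": 1, "name": "Jane", "street": "200 Oak Ave", "city": "Shelbyville"},
]

# Reassemble the logical entity: one person document with nested addresses,
# instead of one flat row per person/address pair.
person = {
    "person_id": joined_rows[0]["person_id"],
    "name": joined_rows[0]["name"],
    "addresses": [{"street": r["street"], "city": r["city"]} for r in joined_rows],
}

print(person)
```

The resulting dictionary has the same shape as the nested JSON documents this example ultimately writes to MarkLogic.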

To try out this example, you will need [Docker](https://www.docker.com/get-started/) installed. Docker is used to
create an instance of MarkLogic, a PostgreSQL (referred to in this document as "Postgres") relational database, and
[pgadmin](https://www.pgadmin.org/). You will also need Java 8 or higher; if you are already running an
instance of MarkLogic locally, it is recommended to stop that instance first. Note that the pgadmin instance is included
solely for users wishing to use it to explore the Postgres database; using pgadmin is outside the scope of this
document.

## Setting up the Docker containers

To begin, clone this repository if you have not already, open a terminal, and run the following commands:

    cd examples/entity-aggregation
    docker-compose up -d --build

Depending on whether you've already downloaded the MarkLogic and Postgres images, this may take a few minutes to complete.

## Setting up the Postgres database

Go to [this Postgres tutorial](https://www.postgresqltutorial.com/postgresql-getting-started/postgresql-sample-database/)
in your web browser. Scroll down to the section titled "Download the PostgreSQL sample database". Follow the
instructions in that section for downloading `dvdrental.zip` and extracting it to produce a file named
`dvdrental.tar` (any zip extraction tool should work, including the JDK's `jar` tool via `jar xvf dvdrental.zip`). Copy the
`dvdrental.tar` file to the `./docker/postgres` directory in this example project.

The DVD rental dataset is used as a source of relational data that can be imported into MarkLogic. The dataset contains
sample data about movies, customers, rentals, and other related entities. In this example, customers and their rentals
will be joined together and loaded into MarkLogic as customer documents with their related rentals nested inside.

The DVD rental dataset needs a Postgres database to be loaded into. Run the following Docker command to create a
new database named `dvdrental`:

    docker exec -it entity_aggregation-postgres-1 psql -U postgres -c "CREATE DATABASE dvdrental"

This will output "CREATE DATABASE" as an indication that it ran successfully.

Then run the following Docker command to use the Postgres `pg_restore` tool to load the `dvdrental.tar` file into the
`dvdrental` database:

    docker exec -it entity_aggregation-postgres-1 pg_restore -U postgres -d dvdrental /opt/dvdrental.tar

After this completes, the `dvdrental` database will contain 15 tables. You can use the pgadmin application included
in this project's `docker-compose.yml` file to browse the database if you wish.

## Importing customers from Postgres to MarkLogic

This project contains an example Java program that uses Spark and its aggregation capabilities to accomplish the following:

1. Retrieve customers from Postgres, joined with their rentals.
2. Aggregate the rentals for each customer into a "rentals" column containing structs.
3. Write each customer row as a new document to the "Documents" database in MarkLogic.
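
Conceptually, the aggregation in step 2 behaves like the following plain-Python sketch. The rows here are hypothetical, and the real program uses Spark's grouping and struct-collection functions (see `ImportCustomers.java`) rather than this code:

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical joined customer/rental rows, as produced by step 1.
rows = [
    {"customer_id": 12, "first_name": "Nancy", "rental_id": 988},
    {"customer_id": 12, "first_name": "Nancy", "rental_id": 1084},
    {"customer_id": 13, "first_name": "Karen", "rental_id": 1421},
]

# Group the rows by customer and collect each group's rentals into a nested
# list - the same reshaping a Spark groupBy-and-collect aggregation performs.
docs = []
key = itemgetter("customer_id")
for customer_id, group in groupby(sorted(rows, key=key), key=key):
    group = list(group)
    docs.append({
        "customer_id": customer_id,
        "first_name": group[0]["first_name"],
        "rentals": [{"rental_id": r["rental_id"]} for r in group],
    })

print(docs)
```

Each element of `docs` now corresponds to one customer document of the shape shown later in this README, ready for step 3.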

Each customer document in MarkLogic captures the desired data structure, with rentals aggregated into their associated
customer, as opposed to having separate customer and rental documents.

Run the Spark program via the following command (Gradle is used solely for demonstration purposes, as it
simplifies constructing the necessary classpath for the program; in production, a user would use
[spark-submit](https://spark.apache.org/docs/latest/submitting-applications.html) or a similar mechanism provided by their Spark environment):

    ./gradlew importCustomers

You can then use [MarkLogic's qconsole application](https://docs.marklogic.com/guide/qconsole/intro) to view the
customer documents written to the Documents database. As there are 599 customer rows in the Postgres "customers" table,
you will see 599 customer documents in the Documents database.

As an example, each customer document will have a "rentals" array (the one shown here is abbreviated for demonstration
purposes):
```json
{
  "customer_id": 12,
  "first_name": "Nancy",
  "last_name": "Thomas",
  "email": "nancy.thomas@sakilacustomer.org",
  "rentals": [
    {
      "rental_id": 988,
      "rental_date": "2005-05-31T03:08:03.000Z",
      "return_date": "2005-06-07T04:22:03.000Z"
    },
    {
      "rental_id": 1084,
      "rental_date": "2005-05-31T15:10:17.000Z",
      "return_date": "2005-06-01T15:15:17.000Z"
    },
    {
      "rental_id": 1752,
      "rental_date": "2005-06-16T21:02:55.000Z",
      "return_date": "2005-06-23T23:09:55.000Z"
    }
  ]
}
```

## How the data is aggregated together

The `./src/main/java/org/example/ImportCustomers.java` file in this project contains the source code for the Spark
program. Please see this file for inline comments that explain how Spark is used to aggregate rentals into their
related customers.

Note as well that this example is not intended to be authoritative. Please see
[the Spark programming guide](https://spark.apache.org/docs/latest/sql-programming-guide.html) for complete information
on writing Spark programs. You may also find
[this reference on Spark aggregate functions](https://sparkbyexamples.com/spark/spark-sql-aggregate-functions/) helpful.
Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
plugins {
    id 'java'
}

repositories {
    mavenCentral()
}

dependencies {
    implementation 'org.apache.spark:spark-sql_2.12:3.3.2'
    implementation "com.marklogic:marklogic-spark-connector:2.0.0"
    implementation "org.postgresql:postgresql:42.6.0"
}

task importCustomers(type: JavaExec) {
    classpath = sourceSets.main.runtimeClasspath
    mainClass = 'org.example.ImportCustomers'
}
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
version: '3.8'

name: entity_aggregation

services:

  marklogic:
    image: "marklogicdb/marklogic-db:11.1.0-centos-1.1.0"
    platform: linux/amd64
    restart: always
    environment:
      - MARKLOGIC_INIT=true
      - MARKLOGIC_ADMIN_USERNAME=admin
      - MARKLOGIC_ADMIN_PASSWORD=admin
    volumes:
      - ./docker/marklogic/logs:/var/opt/MarkLogic/Logs
    ports:
      - 8000-8002:8000-8002

  postgres:
    image: postgres:latest
    environment:
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    volumes:
      # Simplifies restoring the dvdrental.tar into Postgres.
      - ./docker/postgres:/opt
    ports:
      - "5432:5432"

  # Included solely for users that wish to explore the Postgres database.
  pgadmin:
    image: dpage/pgadmin4:latest
    environment:
      PGADMIN_DEFAULT_EMAIL: postgres@pgadmin.com
      PGADMIN_DEFAULT_PASSWORD: postgres
      PGADMIN_LISTEN_PORT: 80
    ports:
      - 15432:80
    depends_on:
      - postgres
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
dvdrental.tar
Binary file not shown.
Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
distributionBase=GRADLE_USER_HOME
distributionPath=wrapper/dists
distributionUrl=https\://services.gradle.org/distributions/gradle-8.4-bin.zip
networkTimeout=10000
validateDistributionUrl=true
zipStoreBase=GRADLE_USER_HOME
zipStorePath=wrapper/dists
