
Commit 1de1b0e

Merge pull request #70 from marklogic/feature/orderBy-multiple
Added test and docs example of multiple orderBy columns
2 parents 67f2aeb + bd9e12f commit 1de1b0e

File tree

2 files changed: 56 additions & 2 deletions


docs/reading.md

Lines changed: 37 additions & 2 deletions

@@ -118,7 +118,7 @@ down the following operations to MarkLogic:
 - `filter` and `where`
 - `groupBy` when followed by `count`
 - `limit`
-- `orderBy`
+- `orderBy` and `sort`
 
 For each of the above operations, the user's Optic query is enhanced to include the associated Optic function.
 Note that if multiple partitions are used to perform the `read` operation, each
@@ -127,7 +127,42 @@ from each partition and re-apply the function calls as necessary to ensure that
 
 If either `count` or `groupBy` and `count` are pushed down, the connector will make a single request to MarkLogic to
 resolve the query (thus ignoring the number of partitions and batch size that may have been configured; see below
-for more information on these options), ensuring that a single count or set of counts is returned to Spark.
+for more information on these options), ensuring that a single count or set of counts is returned to Spark.
+
+In the following example, every operation after `load()` is pushed down to MarkLogic, thereby resulting in far fewer
+rows being returned to Spark and far less work having to be done by Spark:
+
+```
+spark.read.format("com.marklogic.spark") \
+    .option("spark.marklogic.client.uri", "spark-example-user:password@localhost:8020") \
+    .option("spark.marklogic.read.opticQuery", "op.fromView('example', 'employee', '')") \
+    .load() \
+    .filter("HiredDate < '2020-01-01'") \
+    .groupBy("State", "Department") \
+    .count() \
+    .orderBy("State", "count") \
+    .limit(10) \
+    .show()
+```
+
+The following results are returned:
+
+```
++-----+-----------+-----+
+|State| Department|count|
++-----+-----------+-----+
+|   AL|  Marketing|    1|
+|   AL|   Training|    1|
+|   AL|        R&D|    4|
+|   AL|      Sales|    4|
+|   AR|      Sales|    1|
+|   AR|  Marketing|    3|
+|   AR|        R&D|    9|
+|   AZ|   Training|    1|
+|   AZ|Engineering|    2|
+|   AZ|        R&D|    2|
++-----+-----------+-----+
+```
 
 ## Tuning performance
 
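As an aside on the docs example above: `orderBy("State", "count")` sorts by `State` first, then by `count` within each state. A minimal plain-Python sketch of that multi-column ordering semantics (using made-up in-memory rows, not the connector or a real MarkLogic view):

```python
# Illustrates the multi-column semantics of orderBy("State", "count"):
# rows are ordered by State first, then by count within each State.
# These rows are sample values for illustration, not real connector output.
rows = [
    ("AL", "R&D", 4),
    ("AR", "Sales", 1),
    ("AL", "Marketing", 1),
    ("AZ", "Engineering", 2),
    ("AL", "Sales", 4),
]

# A tuple key sorts lexicographically, so State is the primary sort key
# and count is the secondary sort key, mirroring orderBy("State", "count").
ordered = sorted(rows, key=lambda r: (r[0], r[2]))

for state, dept, count in ordered:
    print(state, dept, count)
```

Within `AL`, the two rows with count 4 keep their relative input order, since Python's `sorted` is stable.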

src/test/java/com/marklogic/spark/reader/PushDownOrderByAndLimitTest.java

Lines changed: 19 additions & 0 deletions

@@ -188,6 +188,25 @@ void sort() {
         verifyRowsAreOrderedByCitationID(rows);
     }
 
+    @Test
+    void sortByMultiple() {
+        List<Row> rows = newDefaultReader()
+            .option(Options.READ_OPTIC_QUERY, QUERY_WITH_NO_QUALIFIER)
+            .load()
+            .sort("CitationID", "LastName")
+            .limit(8)
+            .collectAsList();
+
+        assertEquals(8, rows.size());
+        verifyRowsAreOrderedByCitationID(rows);
+
+        // Verify the first few rows to make sure they're sorted by LastName as well based on known values.
+        final String column = "LastName";
+        assertEquals("Awton", rows.get(0).getAs(column));
+        assertEquals("Bernadzki", rows.get(1).getAs(column));
+        assertEquals("Canham", rows.get(2).getAs(column));
+    }
+
     private void verifyRowsAreOrderedByCitationID(List<Row> rows) {
         // Lowest known CitationID is 1, so start comparisons against that.
         long previousValue = 1;
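The new test verifies the primary sort key via `verifyRowsAreOrderedByCitationID` and then spot-checks the secondary key against a few known `LastName` values. A hedged sketch of a stricter check, in plain Python with made-up `(CitationID, LastName)` pairs (the helper name and sample data are illustrative, not from the test database):

```python
# A stricter check than spot-checking known values: verify the whole result
# is non-decreasing on the (CitationID, LastName) tuple, i.e. sorted by
# CitationID first and LastName second.
def is_sorted_by_citation_then_name(rows):
    keys = [(citation_id, last_name) for citation_id, last_name in rows]
    return all(keys[i] <= keys[i + 1] for i in range(len(keys) - 1))

sample = [(1, "Awton"), (1, "Bernadzki"), (1, "Canham"), (2, "Anderson")]
print(is_sorted_by_citation_then_name(sample))  # True: IDs ascend, names ascend within each ID

unsorted = [(1, "Canham"), (1, "Awton")]
print(is_sorted_by_citation_then_name(unsorted))  # False: names out of order within the same ID
```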
