# Data Movement in the Node.js API
Node.js provides concurrency through multiple waits for IO responses, instead of multiple threads. This strategy avoids the challenges and risks of multi-threaded programming for middle-tier clients that are IO rather than compute intensive.
In the Node.js API, this architectural principle means that better throughput for large data sets requires multiple concurrent requests to each e-node, instead of serial requests to a single e-node.
In other words, while the Node.js client is submitting one request or processing one response, many other pending requests are waiting for the server to respond. Because the network round trip is typically far more expensive than the client-side work (especially in the cloud), the client can submit many requests and/or process many responses during the time required for a single round trip to the server.
The optimal level of concurrency would provide full utilization of both the client and server:
- From the client perspective, a new response becomes available to process at the moment that submitting a new request finishes. In essence, neither the single Node.js thread nor the request-submitting or response-processing routines ever wait.
- From the server perspective, thread and memory consumption is at the sweet spot, with allowance for other requests to the server.

Clients should avoid exceeding the optimum concurrency level for either client or server.
To determine the appropriate level of concurrency, the client must become aware of server capacity, as reflected by the number of hosts for the database, the number of threads available on those hosts, and (for query management) the number of forests. DMSDK gets that information during initialization by calling internal endpoints of the REST API. The Node.js API objects for data movement call the same internal endpoints to inspect server state during initialization.
Node.js provides streams as the standard representation for large data sets. Conforming to this standard, the Node.js API data movement functions are factories that return:
- an object of type `stream.Writable` to the application for sending request input to the server
- an object of type `stream.Readable` to the application for receiving response output from the server

Internally, the Node.js API constructs input streams from the built-in Node.js `stream.PassThrough` class. That way, the application has a writable stream for sending the request data and the client implementation in the Node.js API has a readable stream for receiving the request data.
Typically, the Node.js API reads the input stream repeatedly, accumulating a batch of data in a buffer, and then making a batch request when the buffer is full or the input stream ends.
Internally, the Node.js API also constructs output streams from the stream.PassThrough class. That way, the client implementation in the Node.js API has a writable stream for sending the response data, and the application has a readable stream for receiving the response data.
Typically, the Node.js API writes each item in a response batch separately to the output stream, and ends the output stream when the last response has been processed.
Where the data movement requires both input and output, the factory function returns:
- a duplex stream to the application, for sending request input to the server and receiving response output from the server

Internally, the Node.js API implements duplex streams by adding a new dependency that composes a duplex stream from a writable stream object and a readable stream object. (Two obvious candidates with considerable adoption in the Node.js community are duplexify and duplexer2.) The writable and readable stream objects composed by the duplex stream are each instances of the `stream.PassThrough` class. By doing this, the client implementation in the Node.js API has a readable stream for receiving the request input from the application and a writable stream for sending the response output to the application.
This section highlights the Node.js API equivalents to the DMSDK batcher classes in the Java API.
The order of the subsections reflects a probable order of importance, in case triage is necessary to troubleshoot the initial effort.
Customers perform data movement with idiomatic Node.js streams, with pipelines resembling the usage examples given in each subsection.
When using multiple data movement functions in a pipeline to handle special cases (instead of the provided conveniences), the application has the responsibility for configuring each function to share the available client and server concurrency.
Some examples:
- `documents.queryAll(...).pipe(...copy uris to audit sink...).pipe(documents.transformAll(...))`
- `documents.queryAll(...).pipe(documents.readAll(...)).pipe(...enrich content on client...).pipe(documents.writeAll(...))`

Data movement functions take options such as:

- the batch size
- the number of concurrent requests per forest or host
- success and/or error callbacks for the batch

Each data movement function maintains state for its operations (similar to the use of the Operation object for single-request calls).
When processing data in memory, clients need to work with request or response data as JavaScript in-memory objects. Alternatively, when dispatching request or response data from other sources or sinks (such as other databases), clients can achieve better throughput by working with request or response data as JavaScript strings or buffers.
Internally, each data movement function repeatedly executes a request function. In most cases, the request function already exists in the Node.js API – though it may be necessary to call into the implementation, rather than the public interface.
In particular, most of the existing request functions return a ResultProvider, to let the application choose whether to get the response data as a Node.js Promise or as a Node.js Stream. By contrast, a data movement function must:
- write response data to the output stream
- execute an application callback to determine the disposition of any error on a batch request

Internally, the `lib/requester.js` and `lib/responder.js` modules may need to add a capability to ResultProvider for internal use that:

- takes the pre-existing output stream that receives request result data, similar to `ResultProvider.stream()` in the existing implementation
- includes a callback for request errors, which receives request errors similar to the `ResultProvider.result()` Promise in the existing implementation
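The proposed internal capability can be sketched as a dispatcher that writes result data to a pre-existing output stream and routes request errors to a callback. All names here are illustrative, not the actual `lib/requester.js` / `lib/responder.js` API.

```javascript
// Hypothetical sketch: instead of creating a new stream per request (as
// ResultProvider.stream() does), the requester writes results to a
// pre-existing output stream and routes request errors to a callback so the
// caller can decide the error's disposition.
function dispatchResult(outputStream, onRequestError, requestPromise) {
  return requestPromise
    .then((items) => {
      for (const item of items) {
        outputStream.write(item); // one output item per response item
      }
    })
    .catch((err) => {
      onRequestError(err); // let the caller decide the disposition
    });
}
```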
## node-client-api 2.8.0
The Node.js API documents object adds a writeAll() function equivalent to the DMSDK WriteBatcher with the following signature:
`writeAll(options)`

The parameters:

- `options` (JavaScript literal object): configures the write operation.

The return value:

- a `stream.Writable` in object mode that receives document descriptor input from the application for writing to the database.
The properties of the options object:

- `onBatchSuccess` (`function(progress, documents)`): notifies other systems about batches written successfully. Takes parameters of:
  - a progress object with properties for the running totals of documents written and failed and the elapsed time
  - an array with the document descriptors for the written batch

  Any return value is ignored.

- `onBatchError` (`function(progress, documents, error)`): responds to any error while writing a batch of documents. Takes parameters of:
  - a progress object with properties for the running totals of documents written and failed, the elapsed time, and the number of failed attempts for the current batch
  - an array with the document descriptors for the batch that failed to write
  - the error that occurred

  Controls the resolution of the error by:
  - returning a non-empty array of document descriptors to try to write as this batch
  - returning null or an empty array to skip this batch
  - throwing an error to stop reading the input stream and writing documents; when the last request finishes, the job is finalized

- `onCompletion` (`function(summary)`): provides a summary of the results. Takes a parameter of:
  - a summary object with properties for the total documents written and failed, the elapsed time, and whether processing was cancelled by an error

- `batchSize` (int): the number of documents in each batch. Defaults to the number recommended by Performance Engineering (possibly 200).

- `concurrentRequests` (JavaScript object literal): controls the maximum number of concurrent requests that can be pending at the same time. Specified as a multiple of either forests or hosts with the following properties:
  - `multipleOf`: either `"forests"` or `"hosts"`
  - `multiplier`: an int

  Defaults to the setting recommended by Performance Engineering (possibly 4 per forest).

- `defaultMetadata` (JavaScript object literal): an object with any or all of the metadata properties from the document descriptor, including collections, permissions, properties, quality, and metadataValues. If supplied, the default metadata is passed as the first item in every batch.

- `transform` (transformSpecification): the same as the transform parameter of `documents.write()`.
Example: an example of using the writeAll API has been added to the examples folder of node-client-api: https://github.com/marklogic/node-client-api/blob/develop/examples/writeAll-documents.js

JS docs: https://github.com/marklogic/node-client-api/pull/610/files