Currently in the pipeline for `salesforce_chunker` is Manual Chunking, for objects that do not have Primary Key Chunking enabled. The premise is that it will query all of the Ids for a given object, and then split the query into batches manually (in `salesforce_chunker`, as opposed to in Salesforce).
I have a proof of concept here: https://github.com/Shopify/salesforce_chunker/compare/manual-chunking.
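For illustration, the idea boils down to something like this (`run_bulk_query` and `create_batch` are hypothetical wrappers around the Bulk API, not the gem's actual interface):

```ruby
# Fetch every Id once, then rewrite the original query into Id-bounded
# sub-queries that run as ordinary Bulk API batches.
ids = run_bulk_query("SELECT Id FROM Account")
        .map { |record| record["Id"] }
        .sort

# Split the sorted Ids into 50k slices and bound each sub-query by the
# slice's first and last Id, so every batch scans a disjoint Id range.
queries = ids.each_slice(50_000).map do |slice|
  "SELECT Id, Name FROM Account " \
    "WHERE Id >= '#{slice.first}' AND Id <= '#{slice.last}'"
end

queries.each { |soql| create_batch(soql) }
```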
After experimentation, this approach has several advantages and disadvantages compared to Salesforce PK Chunking.
The main disadvantage is that it is more complicated and a bit slower, especially when extracting a large amount of data (e.g. 10 million rows). Generating batches alone for 10 million rows takes 7-8 minutes.
The advantages appear when using a `where` clause to extract a small percentage of the data from an object with many rows. For example, if you are extracting the 100k rows modified in the last day from an object with 10M rows, the batch generation will only take 10-15 seconds or so. If your batch size is 50k, your data will be returned in two batches of ~50k records with Manual Chunking instead of 200 batches of ~500 records with PK Chunking. This ends up using fewer API calls while taking roughly the same amount of time.
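Spelling out the batch math from that example (the constants are just the example's assumptions, not measurements):

```ruby
total_rows    = 10_000_000  # rows in the object
matching_rows = 100_000     # rows matching the where clause
batch_size    = 50_000

# PK Chunking splits the whole object's Id space, then applies the
# where clause inside each chunk:
pk_batches     = (total_rows.to_f / batch_size).ceil  # => 200
records_per_pk = matching_rows / pk_batches           # => 500

# Manual Chunking queries the matching Ids first and splits only those:
manual_batches     = (matching_rows.to_f / batch_size).ceil  # => 2
records_per_manual = matching_rows / manual_batches          # => 50000
```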
Another disadvantage of Manual Chunking is memory usage. First, we need to discuss what is happening in memory in this gem.
The JSON records are returned in "pages" of maximum size 1GB. This 1GB refers to the uncompressed text size: something like 50-500MB of compressed data is transferred over the internet, then inflated to a <= 1GB string. When this string gets deserialized and converted to Ruby objects, it becomes even larger - 2 to 10 GB.
This is somewhat inefficient and causes memory issues. With PK Chunking, we have a lever to reduce this: just decrease the batch size. This takes more API calls, but solves the issue fairly easily. However, with Manual Chunking, we may need to retrieve all of the Ids from 10M+ records at once, which can introduce memory issues of its own.
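For reference, turning that lever looks roughly like this; the keyword arguments are based on the gem's README from memory and may not match the current interface exactly:

```ruby
require "salesforce_chunker"

client = SalesforceChunker::Client.new(
  username:       ENV["SALESFORCE_USERNAME"],
  password:       ENV["SALESFORCE_PASSWORD"],
  security_token: ENV["SALESFORCE_TOKEN"],
)

# Shrinking batch_size shrinks each page that has to be inflated and
# deserialized, at the cost of proportionally more API calls.
client.query(
  query:      "SELECT Id, Name FROM Account",
  object:     "Account",
  batch_size: 25_000, # down from the 100k default
) do |record|
  process(record) # hypothetical downstream handler
end
```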
(Salesforce Design Flaw 1) One of the main reasons why this is happening is that Salesforce attaches an "attributes" field to every record in the JSON response, like this:

```json
{
  "attributes" : {
    "type" : "Account",
    "url" : "/services/data/v40.0/sobjects/Account/<accountId>"
  },
  "Id" : "<accountId>"
}
```
This is not useful and makes the uncompressed Id list massively larger: the "attributes" object adds on the order of 100 bytes per record, which for 10M records is roughly 1 GB of pure overhead.
There are several ways to get around this issue: throw more memory and resources at our jobs, parse the JSON in a more memory-efficient streaming manner, or use CSV or XML instead of JSON.
- The `oj` library can be used for streaming JSON parsing (see the sketch after this list). It uses C code, is actually ~10% faster than the current approach, and takes far less memory (~5x less). However, it is not available on `jruby`, which we need. Supposedly, a version of `oj` for `jruby` is being worked on.
- There are several issues with CSV parsing. The compressed and uncompressed files are smaller, meaning less data sent over the internet. However, there are two main disadvantages, both Salesforce design related:
  - (Salesforce Design Flaw 2) When you choose CSV (or XML) as your contentType when creating a job, all of the batch status requests (`create_batch`, `get_batch_statuses`, `retrieve_batch_results`) return XML instead of JSON, regardless of your "Content-Type" header. Since XML does not have a 1:1 relationship with JSON due to its handling of single vs. multiple elements (a list with one element deserializes the same as a single nested object), this is challenging to handle and makes the code significantly more complicated.
  - (Salesforce Design Flaw 3) The CSV data is not returned in the same format as the JSON data. For example, empty values are returned as empty strings (`""`) in CSV but as `null` in JSON. Also, some timestamps are returned as integers in JSON but as strings in CSV. This makes things hard to reconcile.
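As for the `oj` option, a minimal sketch of what the streaming parse could look like, using Oj's documented simple-callback (`ScHandler`) API. It assumes the batch result is a flat top-level JSON array of records with no subquery arrays; the class name and `process` are ours, not the gem's:

```ruby
require "oj"

# Streams a batch result and yields one record at a time, instead of
# materializing the whole 2-10 GB Ruby array.
class RecordStreamHandler < Oj::ScHandler
  def initialize(&block)
    @block = block
  end

  def hash_start
    {}
  end

  def hash_set(hash, key, value)
    hash[key] = value unless key == "attributes" # drop the useless metadata
  end

  def array_start
    []
  end

  def array_append(_array, record)
    @block.call(record) # yield each record; nothing is retained
  end
end

File.open("batch_result.json") do |io|
  Oj.sc_parse(RecordStreamHandler.new { |record| process(record) }, io)
end
```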
I think the plan going forward is as follows:
- Implement Manual Chunking
- Implement `oj` streaming parsing for `ruby` only and default to regular parsing for `jruby` (see the sketch below)
- Throw more resources at the problem in `jruby`
- Wait and see if `oj` is implemented in `jruby`
- Possibly add CSV/XML support in the future
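For the second plan item, a minimal sketch of the engine split, reusing `RecordStreamHandler` from the earlier sketch; everything else here uses documented Ruby/Oj entry points:

```ruby
require "json"

# C-backed streaming on MRI, stdlib fallback on jruby where oj is unavailable.
if RUBY_ENGINE == "jruby"
  def parse_records(json, &block)
    JSON.parse(json).each(&block) # regular parsing: whole array in memory
  end
else
  require "oj"

  def parse_records(json, &block)
    Oj.sc_parse(RecordStreamHandler.new(&block), json) # streaming
  end
end
```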