
Memory Management & Manual Chunking #50

@csholmes

Description


Currently in the pipeline for salesforce_chunker is Manual Chunking, for objects that do not have Primary Key Chunking enabled. The premise behind this is that it will query all of the Ids for a given object, and then split the query into batches manually (in salesforce_chunker, as opposed to in Salesforce).

I have a proof of concept here: https://github.com/Shopify/salesforce_chunker/compare/manual-chunking.

After experimentation, this approach has several advantages and disadvantages compared to Salesforce PK Chunking.

The disadvantage is that it is more complicated and a bit slower, especially when extracting a large amount of data (e.g. 10 million rows). Generating the batches alone for 10 million rows takes 7-8 minutes.

The advantages happen when using a where clause and extracting a small percentage of the data from an object with many rows. For example, if you are extracting the 100k rows modified in the last day from an object with 10M rows, the batch generation will only take 10-15 seconds or so. If your batch size is 50k, then your data will be returned in two batches of ~50k records with Manual Chunking instead of 200 batches of ~500 records with PK Chunking. This will end up using fewer API calls, and taking roughly the same amount of time.
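
As an illustration (this is not the linked proof of concept), here is a minimal sketch of the manual batch generation, assuming we can first fetch a sorted list of Ids; fetch_all_ids, the SELECT field list, and the batch size are placeholders, not the gem's API:

# Hypothetical sketch: fetch every Id once, then turn the sorted Id list into
# Id-range sub-queries of roughly batch_size records each.
# fetch_all_ids and the SELECT field list are assumptions, not salesforce_chunker's API.
def manual_chunk_queries(object_name, where_clause, batch_size: 50_000)
  ids = fetch_all_ids(object_name, where_clause).sort
  ids.each_slice(batch_size).map do |slice|
    "SELECT Id, Name FROM #{object_name} " \
      "WHERE (#{where_clause}) AND Id >= '#{slice.first}' AND Id <= '#{slice.last}'"
  end
end

# e.g. manual_chunk_queries("Account", "LastModifiedDate = YESTERDAY")
# would return two ~50k-record queries for the 100k-rows-modified-yesterday case.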

Another disadvantage of Manual Chunking is memory usage. First, we need to discuss what happens in memory in this gem.

The JSON records are returned in "pages" of maximum size 1 GB, where the 1 GB refers to the uncompressed text size. Something like 50-500 MB of compressed data is transferred over the internet. Then it is inflated into a string of at most 1 GB. Then, when this string is deserialized into Ruby objects, it becomes even larger: 2 to 10 GB.
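
As a rough sketch of those three stages (the gzip handling here is an assumption about the transport, not the gem's actual code):

require "zlib"
require "json"
require "stringio"

# compressed_body stands in for the compressed HTTP response body (~50-500 MB).
def parse_page(compressed_body)
  json_text = Zlib::GzipReader.new(StringIO.new(compressed_body)).read  # <= ~1 GB string
  JSON.parse(json_text)  # becomes 2-10 GB of Ruby hashes, arrays, and strings
end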

This is somewhat inefficient and causes memory issues. With PK Chunking, we have a lever to reduce this: just decrease the batch size. This takes more API calls, but solves the issue fairly easily. However, with Manual Chunking, we may need to retrieve all of the Ids from 10M+ records, which can introduce memory issues.

(Salesforce Design Flaw 1) One of the main reasons this happens is that Salesforce attaches an "attributes" field to every record in the JSON response, like this:

{
  "attributes" : {
    "type" : "Account",
    "url" : "/services/data/v40.0/sobjects/Account/<accountId>"
  },
  "Id" : "<accountId>"
}

This field is not useful to us, and it makes the uncompressed Id list massively larger.

There are several ways to get around this issue: throw more memory and resources at our jobs, parse the JSON in a more memory-efficient streaming manner, or use CSV or XML instead of JSON.

  • the oj library can be used for streaming JSON parsing (a sketch follows this list). It uses C code and is actually ~10% faster than the current approach and takes far less memory (roughly 5x less). However, it is not available on JRuby, which we need. Supposedly, a version of oj for JRuby is being worked on.

  • CSV parsing is another option. The compressed and uncompressed files are smaller, meaning less data is sent over the internet. However, there are two main disadvantages, both related to Salesforce design:

    • (Salesforce Design Flaw 2) When you choose CSV (or XML) as your contentType when creating a job, the responses to all of the batch status requests (create_batch, get_batch_statuses, retrieve_batch_results) are returned in XML rather than JSON, regardless of your "Content-Type" header. Since XML does not have a 1:1 relationship with JSON due to its handling of single versus multiple elements, this is challenging to handle and makes the code significantly more complicated.

    • (Salesforce Design Flaw 3) The CSV data is not returned in the same format as the JSON data. For example, empty values are returned as empty strings "" in CSV but as null in JSON. Also, some timestamps are returned as integers in JSON but as strings in CSV. This makes the two formats hard to reconcile.
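
For reference, a minimal sketch of what streaming parsing with oj could look like, using its Saj (SAX-style) callback API; collecting only the Id values is an illustrative choice, not necessarily how the gem would use it:

require "oj"

# SAX-style handler that keeps only "Id" values and discards everything else
# (including the "attributes" noise), so a page of records is never
# materialized as one huge Ruby object graph.
class IdCollector < Oj::Saj
  attr_reader :ids

  def initialize
    @ids = []
  end

  def add_value(value, key)
    @ids << value if key == "Id"
  end
end

handler = IdCollector.new
File.open("results_page.json") do |io|
  Oj.saj_parse(handler, io)  # streams the IO; memory stays roughly constant
end
# handler.ids now holds every Id from the page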

I think the plan going forward is as follows:

  • Implement Manual Chunking
  • Implement oj streaming parsing for Ruby only and default to regular parsing on JRuby
  • Throw more resources at the problem on JRuby
  • Wait and see if oj is implemented for JRuby
  • Possibly add CSV/XML support in the future
