JDBC Incremental Sync with Limit #36180
Replies: 3 comments
-
Hi @jhecking - thanks for submitting the idea and the thorough research. We are generally in the habit of full data syncs, so it's not likely that we would work on something like this anytime soon. Here are a few options available to you:
cc @misteryeo on some interesting requirements here. |
Beta Was this translation helpful? Give feedback.
-
Thanks for the update @bleonard. Since the source databases are not our own, but our customers’, we don’t have the option to create views. We are currently working on the 3rd option, i.e. modifying the existing Airbyte connectors to suit our needs. We are trying to keep our changes as minimal and non-invasive as possible, and in some time we are planning to submit our changes as pull requests back to the Airbyte upstream, to see if some/all of them could be merged back. Let me know if that seems reasonable? |
Beta Was this translation helpful? Give feedback.
-
@jhecking any updates on that? I'm a new airbyte user but a veteran of cask and streamsets where LIMIT is supported in incremental SQL. The design choice of not supporting LIMIT on the client-side is unexpected. Its a common need for the same reasons an API client has the capability to control page size in a GET call. Often DBAs prefer query grains and open query time to be limited. I wish you could simply define a LIMIT inside of a named view and then connect airbyte to the view. If you did that the query airbyte issues would apply after the LIMIT, not the other way around. The cursor would advance to the limit defined inside the view, no further. Constructing a temporary table and lining this up with the time airbyte sync happens to execute it awkward at best, produces many more parts and failure modes. Seems like the only viable option on the table is to run large queries until they succeed or fail, hope the checkpoints work sufficiently, or produce a fork of the code supporting limits. I'll try and look into the logistics of adding LIMIT as an option in the JDBC/database-source class hierarchy. The code is reasonably easy to follow at first impression however the requested change seems to affect multiple layers and possibly multiple sources. |
Beta Was this translation helpful? Give feedback.
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Tell us about the problem you're trying to solve
We have a source database that can receive a very high level of updates. We would like to incrementally load data from that source into a different destination to run compute-intensive analytics on the data. For the purpose of the analytics, we don’t really need all of the source data, i.e. a representative sample of the data would be sufficient and would reduce the compute required to run the analysis.
Describe the solution you’d like
We would like to use Airbyte to run an incremental load from the source to the destination, but would like to apply a limit to each incremental sync operation. E.g. run an incremental sync every hour, but limit the amount of data loaded to at most 1000 rows.
Describe the alternative you’ve considered or used
One alternative we have considered is to load all of the data into the destination first, but then discard the excessive parts of the data before running the analytics. But we would rather avoid transferring a lot of data just to discard most of it immediately, as that would put extra load on both the source and the destination.
Are you willing to submit a PR?
Yes. In fact, we have already done some preliminary investigation, and found that incremental queries for JDBC data sources are generated here:
airbyte/airbyte-integrations/connectors/source-jdbc/src/main/java/io/airbyte/integrations/source/jdbc/AbstractJdbcSource.java
Line 295 in 626f51f
If we could find some way to pass a limit into this function, achieving what we want to do might be as easy as amending the query from
SELECT %s FROM %s WHERE %s > ?
toSELECT %s FROM %s WHERE %s > ? ORDER BY %s DESC LIMIT %s
.(The reverse order by the cursor field is necessary to ensure the query returns the most recent cursor value, so that the next incremental sync can pick up where the current one stopped.)
But before we proceed further, we would like to hear from the Airbyte team whether a change like this would be accepted and also whether there are any other potential challenges with this approach that we might not be aware of.
Beta Was this translation helpful? Give feedback.
All reactions