Commit 89adaf1

Drop prevent_duplicate_insert which has no use-case now
1 parent: 22dc720 · commit: 89adaf1

6 files changed: 2 additions & 60 deletions

README.md

Lines changed: 0 additions & 17 deletions
````diff
@@ -50,7 +50,6 @@ v0.3.x has incompatibility changes with v0.2.x. Please see [CHANGELOG.md](CHANGE
 | auto_create_table | boolean | optional | false | See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
 | schema_file | string | optional | | /path/to/schema.json |
 | template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
-| prevent_duplicate_insert | boolean | optional | false | See [Prevent Duplication](#prevent-duplication) |
 | job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
 | job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
 | is_skip_job_result_check | boolean | optional | false | Skip waiting Load job finishes. Available for append, or delete_in_advance mode |
@@ -355,22 +354,6 @@ out:
   payload_column_index: 0 # or, payload_column: payload
 ```
 
-### Prevent Duplication
-
-`prevent_duplicate_insert` option is used to prevent inserting same data for modes `append` or `append_direct`.
-
-When `prevent_duplicate_insert` is set to true, embulk-output-bigquery generate job ID from md5 hash of file and other options.
-
-`job ID = md5(md5(file) + dataset + table + schema + source_format + file_delimiter + max_bad_records + encoding + ignore_unknown_values + allow_quoted_newlines)`
-
-[job ID must be unique(including failures)](https://cloud.google.com/bigquery/loading-data-into-bigquery#consistency) so that same data can't be inserted with same settings repeatedly.
-
-```yaml
-out:
-  type: bigquery
-  prevent_duplicate_insert: true
-```
-
 ### GCS Bucket
 
 This is useful to reduce number of consumed jobs, which is limited by [100,000 jobs per project per day](https://cloud.google.com/bigquery/quotas#load_jobs).
````
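For context on what was removed: the deleted README section described deriving the load job ID deterministically from the input file and the load settings. Below is a minimal Ruby sketch of that idea, following the formula quoted in the diff above; the method name and the exact task keys are assumptions for illustration, not the plugin's actual `Helper.create_load_job_id` implementation.

```ruby
require 'digest/md5'

# Illustrative sketch only: a deterministic load-job ID built from the md5 of
# the file plus the load settings, per the formula in the removed README
# section. The task keys below are assumed, not verified against the plugin's
# real Helper.create_load_job_id.
def deterministic_load_job_id(task, path)
  fields = [
    Digest::MD5.file(path).hexdigest,  # md5(file)
    task['dataset'], task['table'], task['schema'],
    task['source_format'], task['file_delimiter'], task['max_bad_records'],
    task['encoding'], task['ignore_unknown_values'], task['allow_quoted_newlines'],
  ]
  # Same file + same settings => same job ID, so BigQuery rejects a rerun.
  "embulk_load_job_#{Digest::MD5.hexdigest(fields.join)}"
end
```

Because BigQuery refuses to create a second job with an ID it has already seen (including IDs of failed jobs), a rerun with identical inputs could not load the same data twice. This commit drops that behavior in favor of a random UUID per job.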

example/config_prevent_duplicate_insert.yml

Lines changed: 0 additions & 30 deletions
This file was deleted.

lib/embulk/output/bigquery.rb

Lines changed: 0 additions & 1 deletion
````diff
@@ -53,7 +53,6 @@ def self.configure(config, schema, task_count)
       'job_status_max_polling_time' => config.param('job_status_max_polling_time', :integer, :default => 3600),
       'job_status_polling_interval' => config.param('job_status_polling_interval', :integer, :default => 10),
       'is_skip_job_result_check' => config.param('is_skip_job_result_check', :bool, :default => false),
-      'prevent_duplicate_insert' => config.param('prevent_duplicate_insert', :bool, :default => false),
       'with_rehearsal' => config.param('with_rehearsal', :bool, :default => false),
       'rehearsal_counts' => config.param('rehearsal_counts', :integer, :default => 1000),
       'abort_on_error' => config.param('abort_on_error', :bool, :default => nil),
````

lib/embulk/output/bigquery/bigquery_client.rb

Lines changed: 2 additions & 10 deletions
````diff
@@ -79,11 +79,7 @@ def load_from_gcs(object_uris, table)
       begin
         # As https://cloud.google.com/bigquery/docs/managing_jobs_datasets_projects#managingjobs says,
         # we should generate job_id in client code, otherwise, retrying would cause duplication
-        if @task['prevent_duplicate_insert'] and (@task['mode'] == 'append' or @task['mode'] == 'append_direct')
-          job_id = Helper.create_load_job_id(@task, path, fields)
-        else
-          job_id = "embulk_load_job_#{SecureRandom.uuid}"
-        end
+        job_id = "embulk_load_job_#{SecureRandom.uuid}"
         Embulk.logger.info { "embulk-output-bigquery: Load job starting... job_id:[#{job_id}] #{object_uris} => #{@project}:#{@dataset}.#{table} in #{@location_for_log}" }
 
         body = {
@@ -174,11 +170,7 @@ def load(path, table, write_disposition: 'WRITE_APPEND')
       if File.exist?(path)
         # As https://cloud.google.com/bigquery/docs/managing_jobs_datasets_projects#managingjobs says,
         # we should generate job_id in client code, otherwise, retrying would cause duplication
-        if @task['prevent_duplicate_insert'] and (@task['mode'] == 'append' or @task['mode'] == 'append_direct')
-          job_id = Helper.create_load_job_id(@task, path, fields)
-        else
-          job_id = "embulk_load_job_#{SecureRandom.uuid}"
-        end
+        job_id = "embulk_load_job_#{SecureRandom.uuid}"
         Embulk.logger.info { "embulk-output-bigquery: Load job starting... job_id:[#{job_id}] #{path} => #{@project}:#{@dataset}.#{table} in #{@location_for_log}" }
       else
         Embulk.logger.info { "embulk-output-bigquery: Load job starting... #{path} does not exist, skipped" }
````
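The surviving branch keeps the client-generated random job ID. To illustrate why generating the ID in client code still matters for retry safety (the point of the comment preserved in the diff), here is a minimal sketch using the google-api-client gem; the project, dataset, table, and file names are hypothetical, and this is not the plugin's exact call.

```ruby
require 'securerandom'
require 'google/apis/bigquery_v2'

# Sketch only: submitting a load job with a client-generated job ID, as the
# linked BigQuery docs recommend. If a network error hides a successful
# submission, retrying with the *same* job_id fails fast with a duplicate-job
# error instead of silently starting a second load.
client = Google::Apis::BigqueryV2::BigqueryService.new
# client.authorization = ... (credentials omitted in this sketch)

job_id = "embulk_load_job_#{SecureRandom.uuid}"
job = Google::Apis::BigqueryV2::Job.new(
  job_reference: Google::Apis::BigqueryV2::JobReference.new(
    project_id: 'my-project',  # hypothetical
    job_id: job_id
  ),
  configuration: Google::Apis::BigqueryV2::JobConfiguration.new(
    load: Google::Apis::BigqueryV2::JobConfigurationLoad.new(
      destination_table: Google::Apis::BigqueryV2::TableReference.new(
        project_id: 'my-project', dataset_id: 'my_dataset', table_id: 'my_table'
      ),
      source_format: 'NEWLINE_DELIMITED_JSON'
    )
  )
)
client.insert_job('my-project', job, upload_source: '/tmp/data.jsonl',
                  content_type: 'application/octet-stream')
```

With a fresh `SecureRandom.uuid` per load, this protection applies to network-level retries of a single submission; unlike the removed md5-based scheme, it no longer deduplicates across reruns of the whole Embulk job.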

test/test_configure.rb

Lines changed: 0 additions & 1 deletion
````diff
@@ -62,7 +62,6 @@ def test_configure_default
     assert_equal 3600, task['job_status_max_polling_time']
     assert_equal 10, task['job_status_polling_interval']
     assert_equal false, task['is_skip_job_result_check']
-    assert_equal false, task['prevent_duplicate_insert']
     assert_equal false, task['with_rehearsal']
     assert_equal 1000, task['rehearsal_counts']
     assert_equal [], task['column_options']
````

test/test_example.rb

Lines changed: 0 additions & 1 deletion
````diff
@@ -33,7 +33,6 @@ def embulk_run(config_path)
   files.each do |config_path|
     if %w[
       config_expose_errors.yml
-      config_prevent_duplicate_insert.yml
     ].include?(File.basename(config_path))
       define_method(:"test_#{File.basename(config_path, ".yml")}") do
        assert_false embulk_run(config_path)
````
