Commit 455de15

Merge pull request #104 from embulk/remove-prevent_duplicate_insert
Drop prevent_duplicate_insert which has no use-case now
2 parents: eb3b401 + 89adaf1

File tree: 6 files changed (+2, −60 lines)

README.md

Lines changed: 0 additions & 17 deletions
@@ -50,7 +50,6 @@ v0.3.x has incompatibility changes with v0.2.x. Please see [CHANGELOG.md](CHANGE
 | auto_create_table | boolean | optional | false | See [Dynamic Table Creating](#dynamic-table-creating) and [Time Partitioning](#time-partitioning) |
 | schema_file | string | optional | | /path/to/schema.json |
 | template_table | string | optional | | template table name. See [Dynamic Table Creating](#dynamic-table-creating) |
-| prevent_duplicate_insert | boolean | optional | false | See [Prevent Duplication](#prevent-duplication) |
 | job_status_max_polling_time | int | optional | 3600 sec | Max job status polling time |
 | job_status_polling_interval | int | optional | 10 sec | Job status polling interval |
 | is_skip_job_result_check | boolean | optional | false | Skip waiting Load job finishes. Available for append, or delete_in_advance mode |
@@ -354,22 +353,6 @@ out:
   payload_column_index: 0 # or, payload_column: payload
 ```
 
-### Prevent Duplication
-
-`prevent_duplicate_insert` option is used to prevent inserting same data for modes `append` or `append_direct`.
-
-When `prevent_duplicate_insert` is set to true, embulk-output-bigquery generate job ID from md5 hash of file and other options.
-
-`job ID = md5(md5(file) + dataset + table + schema + source_format + file_delimiter + max_bad_records + encoding + ignore_unknown_values + allow_quoted_newlines)`
-
-[job ID must be unique(including failures)](https://cloud.google.com/bigquery/loading-data-into-bigquery#consistency) so that same data can't be inserted with same settings repeatedly.
-
-```yaml
-out:
-  type: bigquery
-  prevent_duplicate_insert: true
-```
-
 ### GCS Bucket
 
 This is useful to reduce number of consumed jobs, which is limited by [100,000 jobs per project per day](https://cloud.google.com/bigquery/quotas#load_jobs).
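
For reference, the removed README section described a deterministic job-ID scheme: hash the file contents together with the load settings, so that re-running the same data with the same settings produces the same job ID, which BigQuery then rejects as a duplicate. A minimal Ruby sketch of that formula follows; the removed plugin code did this in `Helper.create_load_job_id`, and the exact task keys and concatenation used there may have differed from this reconstruction.

```ruby
require 'digest/md5'

# Illustrative only: rebuilds the formula from the removed README,
#   job ID = md5(md5(file) + dataset + table + schema + source_format + ...)
# so identical data loaded with identical settings maps to one job ID.
# The task keys below (e.g. 'schema_file' for "schema") are assumptions.
def deterministic_load_job_id(task, path)
  file_md5 = Digest::MD5.file(path).hexdigest
  fields = [
    file_md5,
    task['dataset'], task['table'], task['schema_file'],
    task['source_format'], task['file_delimiter'], task['max_bad_records'],
    task['encoding'], task['ignore_unknown_values'], task['allow_quoted_newlines'],
  ]
  "embulk_load_job_#{Digest::MD5.hexdigest(fields.join)}"
end
```

After this commit, every load instead gets a fresh `embulk_load_job_#{SecureRandom.uuid}`, so re-running the same Embulk job in an append mode always appends.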

example/config_prevent_duplicate_insert.yml

Lines changed: 0 additions & 30 deletions
This file was deleted.

lib/embulk/output/bigquery.rb

Lines changed: 0 additions & 1 deletion
@@ -53,7 +53,6 @@ def self.configure(config, schema, task_count)
         'job_status_max_polling_time' => config.param('job_status_max_polling_time', :integer, :default => 3600),
         'job_status_polling_interval' => config.param('job_status_polling_interval', :integer, :default => 10),
         'is_skip_job_result_check' => config.param('is_skip_job_result_check', :bool, :default => false),
-        'prevent_duplicate_insert' => config.param('prevent_duplicate_insert', :bool, :default => false),
         'with_rehearsal' => config.param('with_rehearsal', :bool, :default => false),
         'rehearsal_counts' => config.param('rehearsal_counts', :integer, :default => 1000),
         'abort_on_error' => config.param('abort_on_error', :bool, :default => nil),

lib/embulk/output/bigquery/bigquery_client.rb

Lines changed: 2 additions & 10 deletions
@@ -79,11 +79,7 @@ def load_from_gcs(object_uris, table)
         begin
           # As https://cloud.google.com/bigquery/docs/managing_jobs_datasets_projects#managingjobs says,
           # we should generate job_id in client code, otherwise, retrying would cause duplication
-          if @task['prevent_duplicate_insert'] and (@task['mode'] == 'append' or @task['mode'] == 'append_direct')
-            job_id = Helper.create_load_job_id(@task, path, fields)
-          else
-            job_id = "embulk_load_job_#{SecureRandom.uuid}"
-          end
+          job_id = "embulk_load_job_#{SecureRandom.uuid}"
           Embulk.logger.info { "embulk-output-bigquery: Load job starting... job_id:[#{job_id}] #{object_uris} => #{@project}:#{@dataset}.#{table} in #{@location_for_log}" }
 
           body = {
@@ -174,11 +170,7 @@ def load(path, table, write_disposition: 'WRITE_APPEND')
         if File.exist?(path)
           # As https://cloud.google.com/bigquery/docs/managing_jobs_datasets_projects#managingjobs says,
           # we should generate job_id in client code, otherwise, retrying would cause duplication
-          if @task['prevent_duplicate_insert'] and (@task['mode'] == 'append' or @task['mode'] == 'append_direct')
-            job_id = Helper.create_load_job_id(@task, path, fields)
-          else
-            job_id = "embulk_load_job_#{SecureRandom.uuid}"
-          end
+          job_id = "embulk_load_job_#{SecureRandom.uuid}"
           Embulk.logger.info { "embulk-output-bigquery: Load job starting... job_id:[#{job_id}] #{path} => #{@project}:#{@dataset}.#{table} in #{@location_for_log}" }
         else
           Embulk.logger.info { "embulk-output-bigquery: Load job starting... #{path} does not exist, skipped" }
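
The comment kept in both hunks carries the rationale for client-side job IDs: if the client mints the ID, a retry can resend the same ID, and BigQuery treats the second insert as a duplicate of the already-submitted job rather than loading the file twice. A minimal sketch of that retry pattern, with a hypothetical `insert_load_job` stub standing in for the real `jobs.insert` call:

```ruby
require 'securerandom'

# Hypothetical stand-in for the real BigQuery jobs.insert call.
def insert_load_job(job_id)
  puts "inserting load job #{job_id}"
end

# One UUID per logical load, reused across retries of that same load:
# if the first attempt actually reached the server, BigQuery rejects a
# second insert with the same job_id as a duplicate instead of loading
# the data twice.
job_id = "embulk_load_job_#{SecureRandom.uuid}"

attempts = 0
begin
  attempts += 1
  insert_load_job(job_id)
rescue IOError
  retry if attempts < 3  # resend with the SAME job_id
  raise
end
```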

test/test_configure.rb

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@ def test_configure_default
     assert_equal 3600, task['job_status_max_polling_time']
     assert_equal 10, task['job_status_polling_interval']
     assert_equal false, task['is_skip_job_result_check']
-    assert_equal false, task['prevent_duplicate_insert']
     assert_equal false, task['with_rehearsal']
     assert_equal 1000, task['rehearsal_counts']
     assert_equal [], task['column_options']

test/test_example.rb

Lines changed: 0 additions & 1 deletion
@@ -33,7 +33,6 @@ def embulk_run(config_path)
   files.each do |config_path|
     if %w[
       config_expose_errors.yml
-      config_prevent_duplicate_insert.yml
     ].include?(File.basename(config_path))
       define_method(:"test_#{File.basename(config_path, ".yml")}") do
         assert_false embulk_run(config_path)
