Prevent creating unnecessary tables. #148


Conversation


@kashira202111 kashira202111 commented Dec 22, 2022

Why

The plugin creates two unnecessary tables in project, in addition to the table in destination_project: one is an empty table and the other is an empty temp table.

This behavior requires extra permissions and datasets, so I would like to fix it.
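The core idea of the fix can be sketched in a few lines of Ruby. This is a minimal illustration, not the plugin's actual internals; the method name `creation_project` and the task-hash layout are assumptions for the sketch. Table-creation calls should target destination_project when it is set, falling back to project (which remains the billing project for API calls) otherwise.

```ruby
# Hypothetical sketch (not the plugin's real code): tables should be
# created in destination_project when it is set; project stays the
# billing project used for API calls.
def creation_project(task)
  task['destination_project'] || task['project']
end

task = { 'project' => 'xxx', 'destination_project' => 'yyy' }
creation_project(task)                 # => "yyy"
creation_project('project' => 'xxx')   # => "xxx" (fallback when unset)
```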

How to verify the behavior

Settings

load.yml

in:
  type: file
  path_prefix: "data.csv"
  parser:
    charset: UTF-8
    type: csv
    delimiter: ','
    quote: '"'
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
out:
  type: bigquery
  mode: replace
  project: kashira-test-embulk-20240712-1
  destination_project: kashira-test-embulk-20240712-2
  dataset: test_embulk
  auth_method: service_account
  json_keyfile: xxx
  auto_create_table: true
  auto_create_dataset: false
  table: test_data

data.csv

1,"Alice"
2,"Bob"

[Screenshot: initial state of GCP]

Run command

embulk run load.yml

Results

Before

[Screenshots: state before the fix]

After

[Screenshot: state after the fix]

@p-eye
Contributor

p-eye commented Jun 23, 2024

@hiroyuki-sato Hi, could you review this?
I want to separate the destination project {yyy} from the billing project {xxx},
but without this commit it always tries to insert into the billing project and fails.

It needs to insert into the destination project instead.

@hiroyuki-sato
Member

@p-eye It seems this PR needs additional modification.
The trocco-io version already uses this feature
trocco-io@8343629

I'll take a look at the difference.

@hiroyuki-sato
Member

@kashira202111, @p-eye I misunderstood about this PR.
Could you provide a reproduced example? (For my test)
@kashira202111 Could you resolve the conflicts?

@kashira202111
Author

@hiroyuki-sato Thank you for your reply. I'm going to resolve the conflicts. Please wait a moment.

@hiroyuki-sato
Member

@kashira202111 Thank you for your work, and sorry for the wait on the review.

I want to confirm the following:

  • Current behavior: it raises a "dataset not found" error and a permission-denied error (I want to know how to reproduce these errors).
  • Whether this PR changes the current behavior or not.

If my understanding is correct and this PR is a fix for that error, we don't need to consider compatibility.
But if this PR changes the current behavior, we need to take care about compatibility.

Please tell me the details about the above.

@kashira202111 kashira202111 force-pushed the kashira/fix_create_table_to_destination_project branch from f3796f9 to 18c2816 Compare June 25, 2024 09:40
@kashira202111
Author

  • Current behavior: it raises a "dataset not found" error and a permission-denied error (I want to know how to reproduce these errors).

The main purpose of this PR is to create the table in the destination_project project, not in project.
Current behavior: the table is created in project regardless of the destination_project setting.

To reproduce this error, you must prepare the following:

  • Use auto_create_table: true
  • Set both the project parameter and the destination_project parameter
    • project and destination_project are not equal.
  • Use auto_create_dataset: false
  • The 'aaa' dataset does not exist in the xxx project but does exist in the yyy project.
    • Then the "dataset not found" error is raised.
out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: 'xxx'
  destination_project: 'yyy'
  dataset: 'aaa'
  json_keyfile: '{service_account.json}'
  auto_create_dataset: false
  auto_create_table: true

@kashira202111
Author

@hiroyuki-sato Sorry for my unclear English. Could you check again?

@hiroyuki-sato
Member

@kashira202111 Thank you for your work. I'll review this PR in a week. Please wait.

@hiroyuki-sato
Member

@kashira202111 I can't find time for the review yet. Please wait for a while.

@hiroyuki-sato
Member

Hello, @kashira202111 and @p-eye

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration?
If you don't use gcs_bucket, you can load data into the table you specified in destination_project.

The following configuration created the bqtest2.mydataset.hello_embulk table.

Is my understanding correct?
I have not set up gcs_bucket for testing yet (I don't use the gcs_bucket parameter in my project).
Please wait while I set it up.

As a workaround, can you remove gcs_bucket?

out:
  type: bigquery
  mode: replace
  project: bqtest
  destination_project: bqtest2
  dataset: mydataset
  auth_method: service_account
  json_keyfile: 'key_file'
  auto_create_dataset: false
  auto_create_table: true
  table: hello_embulk

@kashira202111
Author

@hiroyuki-sato
Thanks for the review.

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration?

I'm going to check this. Please wait for a while.

@kashira202111
Author

@hiroyuki-sato

Whether or not the gcs_bucket parameter is used, we can load data into destination_project with this commit.

However, I found another problem with gcs_bucket in the latest release. After that problem was fixed, I confirmed that I can load data.

@kashira202111
Author

I'll file an issue for that problem later. Please wait for a while.

@kashira202111
Author

@hiroyuki-sato

This is the problem that I mentioned above.

#164

@hiroyuki-sato
Member

@kashira202111

I would like to double-check this issue.
Without applying this PR, I could load data into the destination_project table.
#148 (comment)
So I don't completely understand the problem yet.

Could you explain the problem again?

@kashira202111 kashira202111 changed the title creates table into destination project when setting destination_project, Prevent creating unnecessary tables. Jul 16, 2024
@kashira202111
Author

@hiroyuki-sato
Hi. I'm sorry that my PR title and description caused a misunderstanding. I have rewritten them.

Could you review this PR again?

@kashira202111 kashira202111 force-pushed the kashira/fix_create_table_to_destination_project branch 2 times, most recently from 08ea588 to 876553b Compare July 16, 2024 13:01
@hiroyuki-sato
Member

@kashira202111 Thank you for proposing this PR. I confirmed this PR fixes the problem.

Expected behavior: load data into the table in the destination_project project and remove all intermediate tables.
However, without this PR, it behaves as follows:

  • Dataset does not exist in project: raises an exception (below)
  • Dataset already exists in project: creates unnecessary tables (temp and destination)
org.embulk.exec.PartialExecutionException: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
	at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:580)
	at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:36)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
	at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
	at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
	at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:278)
	at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:288)
	at org.embulk.EmbulkRunner.run(EmbulkRunner.java:153)
	at org.embulk.cli.EmbulkRun.runInternal(EmbulkRun.java:115)
	at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:24)
	at org.embulk.cli.Main.main(Main.java:53)
	Suppressed: java.lang.NullPointerException
		at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:477)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:411)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:408)
		at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
		at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:408)
		at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:283)
		... 5 more
Caused by: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:462)
	at RUBY.auto_create(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:315)
	at RUBY.transaction(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:342)
	at RUBY.transaction(/path/to/embulk-0.11.4-java/lib/embulk/output_plugin.rb:64)
Caused by: org.jruby.exceptions.StandardError: (ClientError) notFound: Not found: Dataset bqtest-253710:mydataset
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:244)
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/api_command.rb:135)
	at RUBY.process_response(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:207)
	at RUBY.execute_once(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:326)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:131)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:128)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:118)
	at RUBY.execute(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:109)
	at RUBY.execute_or_queue_command(/path/to/google-apis-core-0.14.0/lib/google/apis/core/base_service.rb:477)
	at RUBY.insert_table(/path/to/google-apis-bigquery_v2-0.65.0/lib/google/apis/bigquery_v2/service.rb:1442)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	at RUBY.with_network_retry(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/google_client.rb:51)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	... 3 more
	... 3 more

Error: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}

Member

@hiroyuki-sato hiroyuki-sato left a comment


LGTM👍

@joker1007 Could you take a look when you have time?
For an unrelated reason, CI doesn't run correctly; #165 fixes that problem.

After merging #165, I'll try to rerun CI.

@hiroyuki-sato hiroyuki-sato requested a review from joker1007 July 17, 2024 12:41
@hiroyuki-sato
Member

@kashira202111 CI was fixed in #166. Could you rebase this PR to re-check CI?

@kashira202111 kashira202111 force-pushed the kashira/fix_create_table_to_destination_project branch from 876553b to 5ed9f0d Compare July 19, 2024 07:08
@kashira202111
Author

@hiroyuki-sato Thank you. I rebased this PR!

@hiroyuki-sato
Member

@joker1007 CI passed. Please take a look when you have time.

@hiroyuki-sato hiroyuki-sato merged commit b20fc4e into embulk:master Jul 21, 2024
2 checks passed
@hiroyuki-sato
Member

Thanks!

@hiroyuki-sato
Member

@p-eye and @kashira202111 embulk-output-bigquery 0.7.2 has been released. 0.7.2 includes this PR. Please try it.

@p-eye
Contributor

p-eye commented Jul 22, 2024

It works well, thanks to you guys.
