Prevent creating unnecessary tables. #148


Conversation


@kashira202111 kashira202111 commented Dec 22, 2022

Why

The plugin creates two unnecessary tables in project, in addition to the table in destination_project: one is an empty table and the other is an empty temp table.

This behavior requires extra permissions and datasets, so I would like to fix it.
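The core idea of the fix can be sketched in a few lines of Ruby. This is a minimal illustration, not the plugin's actual internals; the method name `creation_project` and the task-hash layout are assumptions for the sketch. Table-creation calls should target destination_project when it is set, falling back to project (which remains the billing project for API calls) otherwise.

```ruby
# Hypothetical sketch (not the plugin's real code): tables should be
# created in destination_project when it is set; project stays the
# billing project used for API calls.
def creation_project(task)
  task['destination_project'] || task['project']
end

task = { 'project' => 'xxx', 'destination_project' => 'yyy' }
creation_project(task)                 # => "yyy"
creation_project('project' => 'xxx')   # => "xxx" (fallback when unset)
```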

How to verify the behavior

Settings

load.yml

in:
  type: file
  path_prefix: "data.csv"
  parser:
    charset: UTF-8
    type: csv
    delimiter: ','
    quote: '"'
    columns:
    - {name: id, type: long}
    - {name: name, type: string}
out:
  type: bigquery
  mode: replace
  project: kashira-test-embulk-20240712-1
  destination_project: kashira-test-embulk-20240712-2
  dataset: test_embulk
  auth_method: service_account
  json_keyfile: xxx
  auto_create_table: true
  auto_create_dataset: false
  table: test_data

data.csv

1,"Alice"
2,"Bob"

[Screenshot: initial state of GCP]

Run command

embulk run load.yml

Results

Before

[Screenshots: state before the fix]

After

[Screenshot: state after the fix]

@p-eye
Contributor

p-eye commented Jun 23, 2024

@hiroyuki-sato Hi, could you review this?
I want to separate the destination project {yyy} from the billing project {xxx},
but without this commit it always tries to insert into the billing project and fails.

It needs to insert into the destination project instead.

@hiroyuki-sato
Member

@p-eye It seems this PR needs additional modification.
The trocco-io version already uses this feature
trocco-io@8343629

I'll take a look at the difference.

@hiroyuki-sato
Member

@kashira202111, @p-eye I misunderstood about this PR.
Could you provide a reproduced example? (For my test)
@kashira202111 Could you resolve the conflicts?

@kashira202111
Author

@hiroyuki-sato Thank you for your reply. I'm going to resolve the conflicts. Please wait a moment.

@hiroyuki-sato
Member

@kashira202111 Thank you for your work, and sorry for the wait on the review.

I want to confirm the following:

  • Current behavior: it raises a "dataset not found" error and a permission-denied error (I want to know how to reproduce these errors).
  • Whether this PR changes the current behavior or not.

If my understanding is correct and this PR is a fix for that error, we don't need to consider compatibility.
But if this PR changes the current behavior, we need to take care about compatibility.

Please tell me the details about the above.

@kashira202111 kashira202111 force-pushed the kashira/fix_create_table_to_destination_project branch from f3796f9 to 18c2816 Compare June 25, 2024 09:40
@kashira202111
Author

  • Current behavior: it raises a "dataset not found" error and a permission-denied error (I want to know how to reproduce these errors).

The main purpose of this PR is to create the table in the destination_project project, not in project.
Current behavior: the table is created in project regardless of the destination_project setting.

To reproduce this error, you must prepare the following:

  • Use auto_create_table: true
  • Set both the project parameter and the destination_project parameter
    • project and destination_project are not equal.
  • Use auto_create_dataset: false
  • The 'aaa' dataset does not exist in the xxx project but does exist in the yyy project.
    • Then the "dataset not found" error is raised.
out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: 'xxx'
  destination_project: 'yyy'
  dataset: 'aaa'
  json_keyfile: '{service_account.json}'
  auto_create_dataset: false
  auto_create_table: true

@kashira202111
Author

@hiroyuki-sato Sorry for my unclear English. Could you check again?

@hiroyuki-sato
Member

@kashira202111 Thank you for your work. I'll review this PR in a week. Please wait.

@hiroyuki-sato
Member

@kashira202111 I can't find time for the review yet. Please wait for a while.

@hiroyuki-sato
Member

Hello, @kashira202111 and @p-eye

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration?
If you don't use gcs_bucket, you can load data into the table you specified in destination_project.

The following configuration created the bqtest2.mydataset.hello_embulk table.

Is my understanding correct?
I have not set up gcs_bucket for testing yet (I don't use the gcs_bucket parameter in my project).
Please wait while I set it up.

As a workaround, can you remove gcs_bucket?

out:
  type: bigquery
  mode: replace
  project: bqtest
  destination_project: bqtest2
  dataset: mydataset
  auth_method: service_account
  json_keyfile: 'key_file'
  auto_create_dataset: false
  auto_create_table: true
  table: hello_embulk

@kashira202111
Author

@hiroyuki-sato
Thanks for the review.

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration?

I'm going to check this. Please wait for a while.

@kashira202111
Author

@hiroyuki-sato

Whether or not the gcs_bucket parameter is used, we can load data into destination_project with this commit.

However, I found another problem with gcs_bucket in the latest release. After that problem was fixed, I confirmed that I can load data.

@kashira202111
Author

I'll file an issue for that problem later. Please wait for a while.

@kashira202111
Author

@hiroyuki-sato

This is the problem that I mentioned above.

#164

@hiroyuki-sato
Member

@kashira202111

I would like to double-check this issue.
Without applying this PR, I could load data into the destination_project table.
#148 (comment)
So I don't completely understand the problem yet.

Could you explain the problem again?

@kashira202111 kashira202111 changed the title creates table into destination project when setting destination_project, Prevent creating unnecessary tables. Jul 16, 2024
@kashira202111
Author

@hiroyuki-sato
Hi. I'm sorry that my PR title and description caused a misunderstanding. I have rewritten them.

Could you review this PR again?

@kashira202111 kashira202111 force-pushed the kashira/fix_create_table_to_destination_project branch 2 times, most recently from 08ea588 to 876553b Compare July 16, 2024 13:01
@hiroyuki-sato
Member

@kashira202111 Thank you for proposing this PR. I confirmed this PR fixes the problem.

Expected behavior: load data into the table in the destination_project project and remove all intermediate tables.
However, without this PR, it behaves as follows:

  • Dataset does not exist in project: raises an exception (below)
  • Dataset already exists in project: creates unnecessary tables (temp and destination)
org.embulk.exec.PartialExecutionException: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
	at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:580)
	at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:36)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
	at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
	at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
	at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:278)
	at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:288)
	at org.embulk.EmbulkRunner.run(EmbulkRunner.java:153)
	at org.embulk.cli.EmbulkRun.runInternal(EmbulkRun.java:115)
	at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:24)
	at org.embulk.cli.Main.main(Main.java:53)
	Suppressed: java.lang.NullPointerException
		at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:477)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:411)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:408)
		at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
		at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:408)
		at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:283)
		... 5 more
Caused by: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:462)
	at RUBY.auto_create(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:315)
	at RUBY.transaction(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:342)
	at RUBY.transaction(/path/to/embulk-0.11.4-java/lib/embulk/output_plugin.rb:64)
Caused by: org.jruby.exceptions.StandardError: (ClientError) notFound: Not found: Dataset bqtest-253710:mydataset
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:244)
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/api_command.rb:135)
	at RUBY.process_response(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:207)
	at RUBY.execute_once(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:326)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:131)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:128)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:118)
	at RUBY.execute(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:109)
	at RUBY.execute_or_queue_command(/path/to/google-apis-core-0.14.0/lib/google/apis/core/base_service.rb:477)
	at RUBY.insert_table(/path/to/google-apis-bigquery_v2-0.65.0/lib/google/apis/bigquery_v2/service.rb:1442)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	at RUBY.with_network_retry(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/google_client.rb:51)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	... 3 more
	... 3 more

Error: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}

Member

@hiroyuki-sato hiroyuki-sato left a comment


LGTM👍

@joker1007 Could you take a look when you have time?
For an unrelated reason, CI doesn't run correctly; #165 fixes that problem.

After merging #165, I'll try to rerun CI.

@hiroyuki-sato hiroyuki-sato requested a review from joker1007 July 17, 2024 12:41
@hiroyuki-sato
Member

@kashira202111 CI was fixed in #166. Could you rebase this PR to re-check CI?

@kashira202111 kashira202111 force-pushed the kashira/fix_create_table_to_destination_project branch from 876553b to 5ed9f0d Compare July 19, 2024 07:08
@kashira202111
Author

@hiroyuki-sato Thank you. I rebased this PR!

@hiroyuki-sato
Member

@joker1007 CI passed. Please take a look when you have time.

@hiroyuki-sato hiroyuki-sato merged commit b20fc4e into embulk:master Jul 21, 2024
2 checks passed
@hiroyuki-sato
Member

Thanks!

@hiroyuki-sato
Member

@p-eye and @kashira202111 embulk-output-bigquery 0.7.2 has been released. 0.7.2 includes this PR. Please try it.

@p-eye
Contributor

p-eye commented Jul 22, 2024

It works well, thanks to you guys.
