Commit dd10cb7

Merge pull request #35 from trocco-io/merge_origin: Merge origin
2 parents 5b6658b + 11f283e

9 files changed: +216, -6 lines

CHANGELOG.md — 9 additions, 0 deletions

@@ -1,3 +1,12 @@
+## 0.7.5 - 2025-05-13
+* [enhancement] Add range partitioning support (Thanks to kitagry) #174
+
+## 0.7.4 - 2024-12-19
+* [maintenance] Use the primary location unless location is set explicitly (Thanks to joker1007) #172
+
+## 0.7.3 - 2024-08-28
+* [enhancement] Add TIME type conversion to string converter (Thanks to p-eye)
+
 ## 0.7.2 - 2024-07-21
 * [maintenance] Fix GitHub Actions #166
 * [maintenance] Fix gcs_client in order to load data using gcs_bucket parameter (Thanks to kashira202111) #164

README.md — 26 additions, 2 deletions

@@ -112,6 +112,12 @@ Following options are same as [bq command-line tools](https://cloud.google.com/b
 | time_partitioning.type | string | required | nil | The only type supported is DAY, which will generate one partition per day based on data loading time. |
 | time_partitioning.expiration_ms | int | optional | nil | Number of milliseconds for which to keep the storage for a partition. |
 | time_partitioning.field | string | optional | nil | `DATE` or `TIMESTAMP` column used for partitioning |
+| range_partitioning | hash | optional | nil | See [Range Partitioning](#range-partitioning) |
+| range_partitioning.field | string | required | nil | `INT64` column used for partitioning |
+| range_partitioning.range | hash | required | nil | Defines the ranges for range partitioning |
+| range_partitioning.range.start | int | required | nil | The start of range partitioning, inclusive. |
+| range_partitioning.range.end | int | required | nil | The end of range partitioning, exclusive. |
+| range_partitioning.range.interval | int | required | nil | The width of each interval. |
 | clustering | hash | optional | nil | Currently, clustering is supported for partitioned tables, so must be used with `time_partitioning` option. See [clustered tables](https://cloud.google.com/bigquery/docs/clustered-tables) |
 | clustering.fields | array | required | nil | One or more fields on which data should be clustered. The order of the specified columns determines the sort order of the data. |
 | schema_update_options | array | optional | nil | (Experimental) List of `ALLOW_FIELD_ADDITION` or `ALLOW_FIELD_RELAXATION` or both. See [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.schemaUpdateOptions). NOTE for the current status: `schema_update_options` does not work for `copy` job, that is, is not effective for most of modes such as `append`, `replace` and `replace_backup`. `delete_in_advance` deletes origin table so does not need to update schema. Only `append_direct` can utilize schema update. |

@@ -332,8 +338,8 @@ Column options are used to aid guessing BigQuery schema, or to define conversion
 - boolean: `BOOLEAN`, `STRING` (default: `BOOLEAN`)
 - long: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `INTEGER`)
 - double: `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP` (default: `FLOAT`)
-- string: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, `DATETIME`, `DATE`, `RECORD` (default: `STRING`)
-- timestamp: `INTEGER`, `FLOAT`, `STRING`, `TIMESTAMP`, `DATETIME`, `DATE` (default: `TIMESTAMP`)
+- string: `BOOLEAN`, `INTEGER`, `FLOAT`, `STRING`, `TIME`, `TIMESTAMP`, `DATETIME`, `DATE`, `RECORD` (default: `STRING`)
+- timestamp: `INTEGER`, `FLOAT`, `STRING`, `TIME`, `TIMESTAMP`, `DATETIME`, `DATE` (default: `TIMESTAMP`)
 - json: `STRING`, `RECORD` (default: `STRING`)
 - numeric: `STRING`
 - **mode**: BigQuery mode such as `NULLABLE`, `REQUIRED`, and `REPEATED` (string, default: `NULLABLE`)

@@ -458,6 +464,24 @@ MEMO: [jobs#configuration.load.schemaUpdateOptions](https://cloud.google.com/big
 to update the schema of the destination table as a side effect of the load job, but it is not available for copy job.
 Thus, it was not suitable for embulk-output-bigquery idempotence modes, `append`, `replace`, and `replace_backup`, sigh.

+### Range Partitioning
+
+See also [Creating and Updating Range-Partitioned Tables](https://cloud.google.com/bigquery/docs/creating-partitioned-tables).
+
+To load into a partition, specify the `range_partitioning` option and a `table` parameter with a partition decorator as:
+
+```yaml
+out:
+  type: bigquery
+  table: table_name$1
+  range_partitioning:
+    field: customer_id
+    range:
+      start: 1
+      end: 99999
+      interval: 1
+```
+
 ## Development

 ### Run example:
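A note on the new options' semantics: `range.start` is inclusive, `range.end` is exclusive, and `range.interval` is the width of each partition, mirroring BigQuery's native integer range partitioning. A minimal sketch of which partition a given column value lands in — `range_partition_for` is a hypothetical helper for illustration, not part of the plugin:

```ruby
# Hypothetical helper (not in the plugin): maps a column value to its
# range partition under [start, end) semantics with fixed-width intervals.
def range_partition_for(value, start_at, end_at, interval)
  # BigQuery routes values outside [start, end) to a special partition.
  return :out_of_range if value < start_at || value >= end_at
  lower = start_at + ((value - start_at) / interval) * interval # integer division
  [lower, [lower + interval, end_at].min] # the partition covering [lower, upper)
end

# README example: start 1, end 99999, interval 1
p range_partition_for(42, 1, 99999, 1) #=> [42, 43]
```

With `interval: 1` the README example yields one partition per `customer_id` value; the `$1` decorator in `table` then addresses the partition whose range starts at 1.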

embulk-output-bigquery.gemspec — 1 addition, 1 deletion

@@ -1,6 +1,6 @@
 Gem::Specification.new do |spec|
   spec.name = "embulk-output-bigquery"
-  spec.version = "0.6.9.trocco.0.1.0"
+  spec.version = "0.7.5.trocco.0.0.1"
   spec.authors = ["Satoshi Akama", "Naotoshi Seo"]
   spec.summary = "Google BigQuery output plugin for Embulk"
   spec.description = "Embulk plugin that insert records to Google BigQuery."
New file (filename not shown in this view) — 36 additions, 0 deletions

@@ -0,0 +1,36 @@
+in:
+  type: file
+  path_prefix: example/example.csv
+  parser:
+    type: csv
+    charset: UTF-8
+    newline: CRLF
+    null_string: 'NULL'
+    skip_header_lines: 1
+    comment_line_marker: '#'
+    columns:
+      - {name: date, type: string}
+      - {name: timestamp, type: timestamp, format: "%Y-%m-%d %H:%M:%S.%N", timezone: "+09:00"}
+      - {name: "null", type: string}
+      - {name: long, type: long}
+      - {name: string, type: string}
+      - {name: double, type: double}
+      - {name: boolean, type: boolean}
+out:
+  type: bigquery
+  mode: replace
+  auth_method: service_account
+  json_keyfile: example/your-project-000.json
+  dataset: your_dataset_name
+  table: your_field_partitioned_table_name
+  source_format: NEWLINE_DELIMITED_JSON
+  compression: NONE
+  auto_create_dataset: true
+  auto_create_table: true
+  schema_file: example/schema.json
+  range_partitioning:
+    field: 'long'
+    range:
+      start: 90
+      end: 100
+      interval: 1

lib/embulk/output/bigquery.rb — 43 additions, 1 deletion

@@ -89,6 +89,7 @@ def self.configure(config, schema, task_count)
       'ignore_unknown_values' => config.param('ignore_unknown_values', :bool, :default => false),
       'allow_quoted_newlines' => config.param('allow_quoted_newlines', :bool, :default => false),
       'time_partitioning' => config.param('time_partitioning', :hash, :default => nil),
+      'range_partitioning' => config.param('range_partitioning', :hash, :default => nil),
       'clustering' => config.param('clustering', :hash, :default => nil), # google-api-ruby-client >= v0.21.0
       'schema_update_options' => config.param('schema_update_options', :array, :default => nil),
       'merge_keys' => config.param('merge_keys', :array, :default => []),

@@ -229,14 +230,55 @@ def self.configure(config, schema, task_count)
       task['abort_on_error'] = (task['max_bad_records'] == 0)
     end

+    if task['time_partitioning'] && task['range_partitioning']
+      raise ConfigError.new "`time_partitioning` and `range_partitioning` cannot be used at the same time"
+    end
+
     if task['time_partitioning']
       unless task['time_partitioning']['type']
         raise ConfigError.new "`time_partitioning` must have `type` key"
       end
-    elsif Helper.has_partition_decorator?(task['table'])
+    end
+
+    if Helper.has_partition_decorator?(task['table'])
+      if task['range_partitioning']
+        raise ConfigError.new "Partition decorators(`#{task['table']}`) don't support `range_partition`"
+      end
       task['time_partitioning'] = {'type' => 'DAY'}
     end

+    if task['range_partitioning']
+      unless task['range_partitioning']['field']
+        raise ConfigError.new "`range_partitioning` must have `field` key"
+      end
+      unless task['range_partitioning']['range']
+        raise ConfigError.new "`range_partitioning` must have `range` key"
+      end
+
+      range = task['range_partitioning']['range']
+      unless range['start']
+        raise ConfigError.new "`range_partitioning` must have `range.start` key"
+      end
+      unless range['start'].is_a?(Integer)
+        raise ConfigError.new "`range_partitioning.range.start` must be an integer"
+      end
+      unless range['end']
+        raise ConfigError.new "`range_partitioning` must have `range.end` key"
+      end
+      unless range['end'].is_a?(Integer)
+        raise ConfigError.new "`range_partitioning.range.end` must be an integer"
+      end
+      unless range['interval']
+        raise ConfigError.new "`range_partitioning` must have `range.interval` key"
+      end
+      unless range['interval'].is_a?(Integer)
+        raise ConfigError.new "`range_partitioning.range.interval` must be an integer"
+      end
+      if range['start'] + range['interval'] >= range['end']
+        raise ConfigError.new "`range_partitioning.range.start` + `range_partitioning.range.interval` must be less than `range_partitioning.range.end`"
+      end
+    end
+
     if task['clustering']
       unless task['clustering']['fields']
         raise ConfigError.new "`clustering` must have `fields` key"
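Read together, the new `configure` checks form a small validation contract for `range_partitioning`: `field` and `range` must exist, the three bounds must be present integers, and the range must be wide enough. A condensed sketch, as a standalone method for illustration only (the plugin runs these checks inline and raises `ConfigError` rather than `ArgumentError`):

```ruby
# Condensed restatement of the checks above (illustration only).
def validate_range_partitioning!(rp)
  raise ArgumentError, '`range_partitioning` must have `field` key' unless rp['field']
  raise ArgumentError, '`range_partitioning` must have `range` key' unless rp['range']

  range = rp['range']
  %w[start end interval].each do |key|
    raise ArgumentError, "`range_partitioning` must have `range.#{key}` key" unless range[key]
    raise ArgumentError, "`range_partitioning.range.#{key}` must be an integer" unless range[key].is_a?(Integer)
  end
  if range['start'] + range['interval'] >= range['end']
    raise ArgumentError, '`range.start` + `range.interval` must be less than `range.end`'
  end
end

# The README example passes:
validate_range_partitioning!('field' => 'customer_id',
                             'range' => { 'start' => 1, 'end' => 99999, 'interval' => 1 })
```

Note the strict inequality in the last check: a range describing exactly one partition (`start + interval == end`) is rejected as well.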

lib/embulk/output/bigquery/bigquery_client.rb — 15 additions, 2 deletions

@@ -21,7 +21,7 @@ def initialize(task, schema, fields = nil)
       @destination_project = @task['destination_project']
       @dataset = @task['dataset']
       @location = @task['location']
-      @location_for_log = @location.nil? ? 'us/eu' : @location
+      @location_for_log = @location.nil? ? 'Primary location' : @location

       @task['source_format'] ||= 'CSV'
       @task['max_bad_records'] ||= 0

@@ -300,6 +300,7 @@ def wait_load(kind, response)

       while true
         job_id = _response.job_reference.job_id
+        location = @location || _response.job_reference.location
         elapsed = Time.now - started
         status = _response.status.state
         if status == "DONE"

@@ -319,7 +320,7 @@ def wait_load(kind, response)
           "job_id:[#{job_id}] elapsed_time:#{elapsed.to_f}sec status:[#{status}]"
         }
         sleep wait_interval
-        _response = with_network_retry { client.get_job(@project, job_id, location: @location) }
+        _response = with_network_retry { client.get_job(@project, job_id, location: location) }
       end
     end

@@ -434,6 +435,18 @@ def create_table_if_not_exists(table, dataset: nil, options: nil)
         }
       end

+      options['range_partitioning'] ||= @task['range_partitioning']
+      if options['range_partitioning']
+        body[:range_partitioning] = {
+          field: options['range_partitioning']['field'],
+          range: {
+            start: options['range_partitioning']['range']['start'].to_s,
+            end: options['range_partitioning']['range']['end'].to_s,
+            interval: options['range_partitioning']['range']['interval'].to_s,
+          },
+        }
+      end
+
       options['clustering'] ||= @task['clustering']
       if options['clustering']
         body[:clustering] = {
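Two separate improvements land in this file. In `wait_load`, polling now passes the job's own location to `get_job` when `location` is not configured explicitly, so the status check reaches the correct regional endpoint (this pairs with the 'Primary location' log change above). In `create_table_if_not_exists`, the task's `range_partitioning` is copied into the `tables.insert` request body, with the bounds converted via `.to_s` because the BigQuery REST API transports 64-bit integers as strings. For the README example, the relevant fragment of the body would look roughly like this (a sketch; the rest of the body is assembled elsewhere in the method):

```ruby
# Approximate request body fragment for the README example (illustration only;
# schema, table_reference, etc. are filled in by the surrounding method).
body = {
  range_partitioning: {
    field: 'customer_id',
    range: { start: '1', end: '99999', interval: '1' }, # Int64 values as strings
  },
}
```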

lib/embulk/output/bigquery/value_converter_factory.rb — 13 additions, 0 deletions

@@ -230,6 +230,14 @@ def string_converter
             val # Users must care of BQ timestamp format
           }
         end
+      when 'TIME'
+        # TimeWithZone doesn't affect any change to the time value
+        Proc.new {|val|
+          next nil if val.nil?
+          with_typecast_error(val) do |val|
+            TimeWithZone.set_zone_offset(Time.parse(val), zone_offset).strftime("%H:%M:%S.%6N")
+          end
+        }
       when 'RECORD'
         Proc.new {|val|
           next nil if val.nil?

@@ -284,6 +292,11 @@ def timestamp_converter
           next nil if val.nil?
           val.localtime(zone_offset).strftime("%Y-%m-%d %H:%M:%S.%6N")
         }
+      when 'TIME'
+        Proc.new {|val|
+          next nil if val.nil?
+          val.localtime(zone_offset).strftime("%H:%M:%S.%6N")
+        }
       else
         raise NotSupportedType, "cannot take column type #{type} for timestamp column"
       end
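Both new branches normalize values to `HH:MM:SS` with microsecond precision. The string converter parses the input with `Time.parse` and pins the configured zone offset without shifting the clock time (that is what the `TimeWithZone` comment means), while the timestamp converter shifts the instant into the configured offset first. Stripped of the plugin's typecast-error handling and `TimeWithZone` helper, the two paths behave roughly like this:

```ruby
require 'time'

# String input: parse, keep the clock time as written, format the time of day.
puts Time.parse('3:22 PM').strftime('%H:%M:%S.%6N')
#=> 15:22:00.000000

# Timestamp input: shift into the configured offset, then format.
ts = Time.parse('2016-02-25 15:00:00.500000 +00:00')
puts ts.localtime('+09:00').strftime('%H:%M:%S.%6N')
#=> 00:00:00.500000
```

These match the expectations in the new `test_time` cases below.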

test/test_configure.rb — 38 additions, 0 deletions

@@ -273,6 +273,44 @@ def test_time_partitioning
       assert_equal 'DAY', task['time_partitioning']['type']
     end

+    def test_range_partitioning
+      config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 3, 'interval' => 1 }})
+      assert_nothing_raised { Bigquery.configure(config, schema, processor_count) }
+
+      # field is required
+      config = least_config.merge('range_partitioning' => {'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+
+      # range is required
+      config = least_config.merge('range_partitioning' => {'field' => 'foo'})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+
+      # range.start is required
+      config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'end' => 2, 'interval' => 1 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+
+      # range.end is required
+      config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'interval' => 1 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+
+      # range.interval is required
+      config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+
+      # range.start + range.interval should be less than range.end
+      config = least_config.merge('range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 2 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+    end
+
+    def test_time_and_range_partitioning_error
+      config = least_config.merge('time_partitioning' => {'type' => 'DAY'}, 'range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+
+      config = least_config.merge('table' => 'table_name$20160912', 'range_partitioning' => {'field' => 'foo', 'range' => { 'start' => 1, 'end' => 2, 'interval' => 1 }})
+      assert_raise { Bigquery.configure(config, schema, processor_count) }
+    end
+
     def test_clustering
       config = least_config.merge('clustering' => {'fields' => ['field_a']})
       assert_nothing_raised { Bigquery.configure(config, schema, processor_count) }

test/test_value_converter_factory.rb — 35 additions, 0 deletions

@@ -262,6 +262,23 @@ def test_datetime
       assert_equal "2016-02-26 00:00:00", converter.call("2016-02-26 00:00:00")
     end

+    def test_time
+      converter = ValueConverterFactory.new(SCHEMA_TYPE, 'TIME').create_converter
+      assert_equal nil, converter.call(nil)
+      assert_equal "00:03:22.000000", converter.call("00:03:22")
+      assert_equal "15:22:00.000000", converter.call("3:22 PM")
+      assert_equal "03:22:00.000000", converter.call("3:22 AM")
+      assert_equal "00:00:00.000000", converter.call("2016-02-26 00:00:00")
+
+      # TimeWithZone doesn't affect any change to the time value
+      converter = ValueConverterFactory.new(
+        SCHEMA_TYPE, 'TIME', timezone: 'Asia/Tokyo'
+      ).create_converter
+      assert_equal "15:00:01.000000", converter.call("15:00:01")
+
+      assert_raise { converter.call('foo') }
+    end
+
     def test_record
       converter = ValueConverterFactory.new(SCHEMA_TYPE, 'RECORD').create_converter
       assert_equal({'foo'=>'foo'}, converter.call(%Q[{"foo":"foo"}]))

@@ -350,6 +367,24 @@ def test_datetime
       assert_raise { converter.call('foo') }
     end

+    def test_time
+      converter = ValueConverterFactory.new(SCHEMA_TYPE, 'TIME').create_converter
+      assert_equal nil, converter.call(nil)
+      timestamp = Time.parse("2016-02-26 00:00:00.500000 +00:00")
+      expected = "00:00:00.500000"
+      assert_equal expected, converter.call(timestamp)
+
+      converter = ValueConverterFactory.new(
+        SCHEMA_TYPE, 'TIME', timezone: 'Asia/Tokyo'
+      ).create_converter
+      assert_equal nil, converter.call(nil)
+      timestamp = Time.parse("2016-02-25 15:00:00.500000 +00:00")
+      expected = "00:00:00.500000"
+      assert_equal expected, converter.call(timestamp)
+
+      assert_raise { converter.call('foo') }
+    end
+
     def test_record
       assert_raise { ValueConverterFactory.new(SCHEMA_TYPE, 'RECORD').create_converter }
     end
