Commit 0a68ad8

feat: add Lineage metrics for CloudBigtableIO (#4438)
1 parent 358a043 commit 0a68ad8

File tree: 4 files changed (+328 / -38 lines)
bigtable-dataflow-parent/bigtable-beam-import/README.md
(45 additions, 19 deletions)
@@ -3,15 +3,19 @@

This folder contains tools to support importing and exporting HBase data to
Google Cloud Bigtable using Cloud Dataflow.

## Setup

To use the tools in this folder, you can download them from the Maven
repository, or you can build them using Maven.

[//]: # ({x-version-update-start:bigtable-client-parent:released})

### Download the jars

Download [the import/export jars](https://search.maven.org/artifact/com.google.cloud.bigtable/bigtable-beam-import),
which is an aggregation of all required jars.
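As a hedged sketch of that download step, one option is fetching the jar straight from Maven Central; the exact coordinates, the 2.3.0 version, and whether the unclassified jar is the aggregated artifact are assumptions here, so the Maven Central page linked above remains authoritative:

```
# Assumed coordinates and version; adjust to the release you actually need.
curl -L -O https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-beam-import/2.3.0/bigtable-beam-import-2.3.0.jar
```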

### Build the jars yourself
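The build commands themselves fall outside the hunks shown in this diff; as a rough sketch, assuming a standard source checkout, a local Maven build of this module might look like the following (the paths and the `-DskipTests` flag are illustrative, not taken from this commit):

```
# Clone the repository and build the import/export tool with Maven
# (tests skipped here only to keep the illustration short).
git clone https://github.com/googleapis/java-bigtable-hbase.git
cd java-bigtable-hbase/bigtable-dataflow-parent/bigtable-beam-import
mvn clean package -DskipTests
```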

@@ -25,12 +29,13 @@ cd bigtable-dataflow-parent/bigtable-beam-import
```

***

# Tools

## Data export pipeline

You can export data into a snapshot or into sequence files. If you're migrating
your data from HBase to Bigtable, using snapshots is the preferred method.

### Exporting snapshots from HBase

@@ -50,20 +55,20 @@ Perform these steps from Unix shell on an HBase edge node.
   echo "snapshot '$TABLE_NAME', '$SNAPSHOT_NAME'" | hbase shell -n
   ```
1. Export the snapshot
   1. Install [hadoop connectors](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md)
   1. Copy to a GCS bucket
      ```
      hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot $SNAPSHOT_NAME \
      -copy-to $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/data -mappers $NUM_MAPPERS
      ```
1. Create hashes for the table to be used during the data validation step.
   [Visit the HBase documentation for more information on each parameter](http://hbase.apache.org/book.html#_step_1_hashtable).
   ```
   hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=10 --numhashfiles=10 \
   $TABLE_NAME $BUCKET_NAME$SNAPSHOT_EXPORT_PATH/hashtable
   ```

### Exporting sequence files from HBase
@@ -74,7 +79,8 @@ Perform these steps from Unix shell on an HBase edge node.
   hadoop fs -mkdir -p ${EXPORTDIR}
   MAXVERSIONS=2147483647
   ```
1. On an edge node that has the HBase classpath configured, run the export commands.
   ```
   cd $HBASE_HOME
   bin/hbase org.apache.hadoop.hbase.mapreduce.Export \
@@ -114,19 +120,22 @@ Exporting HBase snapshots from Bigtable is not supported.
   --region=$REGION
   ```

## Importing to Bigtable

You can import data into Bigtable from a snapshot or sequence files. Before you
begin your import, you must create the tables and column families in Bigtable via
the [schema translation tool](https://github.com/googleapis/java-bigtable-hbase/tree/master/bigtable-hbase-1.x-parent/bigtable-hbase-1.x-tools)
or by using the Bigtable command line tool and running the following:

    cbt createtable your-table-name
    cbt createfamily your-table-name your-column-family

Once your import is completed, follow the instructions for the validator below to
ensure it was successful.

Please pay attention to the cluster CPU usage and adjust the number of Dataflow
workers accordingly.
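A hedged sketch of that adjustment, assuming the standard Dataflow runner option `--maxNumWorkers` and a purely illustrative cap of 10 workers (neither is taken from this commit), is:

```
# Illustrative only: cap Dataflow autoscaling so the import cannot overwhelm
# the Bigtable cluster; raise the cap if cluster CPU usage stays low.
java -jar bigtable-beam-import-2.3.0.jar importsnapshot \
    --runner=dataflow \
    --maxNumWorkers=10 \
    <remaining importsnapshot flags as shown in the import command below>
```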
### Snapshots (preferred method)
@@ -140,7 +149,7 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow
   SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
   SNAPSHOT_NAME=your-snapshot-name
   ```

1. Run the import.
   ```
   java -jar bigtable-beam-import-2.3.0.jar importsnapshot \
@@ -214,7 +223,7 @@ Please pay attention to the Cluster CPU usage and adjust the number of Dataflow

## Validating data

Once your snapshot or sequence file is imported, you should run the validator to
check if there are any rows with mismatched data.

1. Set the environment variables.
   ```
@@ -225,7 +234,8 @@ check if there are any rows with mismatched data.
   SNAPSHOT_GCS_PATH="$BUCKET_NAME/hbase-migration-snap"
   ```
1. Run the sync job. It will put the results into `$SNAPSHOT_GCS_PATH/data-verification/output-TIMESTAMP`.
   ```
   java -jar bigtable-beam-import-2.3.0.jar sync-table \
   --runner=dataflow \
@@ -239,5 +249,21 @@ check if there are any rows with mismatched data.
   --region=$REGION
   ```

## Tracking lineage

CloudBigtableIO supports data lineage for Dataflow jobs.
[Data lineage](https://cloud.google.com/dataplex/docs/about-data-lineage) is a
[Dataplex](https://cloud.google.com/dataplex/docs/introduction) feature that
lets you track how data moves through your systems. To begin automatically
tracking lineage information, [enable the Lineage API](https://cloud.google.com/dataplex/docs/use-lineage#enable-apis)
in the project where the Dataflow job is running and in the project where you
view lineage in the Dataplex web interface. In addition, you must
[enable lineage during Dataflow job creation](https://cloud.devsite.corp.google.com/dataflow/docs/guides/lineage#enable-data-lineage)
by providing the service option `--enable_lineage`.
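A hedged sketch of those two prerequisites follows; the `gcloud services enable` call and the `--dataflowServiceOptions=enable_lineage=true` syntax are assumptions drawn from general Dataplex and Dataflow usage, not text from this commit, so treat the linked documentation as authoritative:

```
# Assumed: enable the Data Lineage API in the project that runs the Dataflow
# job (repeat in the project where you view lineage in Dataplex, if different).
gcloud services enable datalineage.googleapis.com --project=$PROJECT_ID

# Assumed syntax: pass the lineage service option when launching the pipeline,
# e.g. on the importsnapshot command shown earlier.
java -jar bigtable-beam-import-2.3.0.jar importsnapshot \
    --runner=dataflow \
    --dataflowServiceOptions=enable_lineage=true \
    <remaining importsnapshot flags as above>
```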
[//]: # ({x-version-update-end})
