Switch to using v2 schema for importer #5583
Replies: 12 comments
-
In v1, we initially used a lot of user-defined types but in later versions removed most of those. So you could use v1 against a Postgres database, then export the latest version of that schema for use in BigQuery. You could use v2, but it's pre-alpha and we break it all the time. Plus it's specific to Citus, so it wouldn't work without tweaking. Just set the … We also already export the data to BigQuery via our …
-
@steven-sheehy I've had a lot of trouble with … I'm still seeing too many user-defined types in the schema for it to be usable in BigQuery. Right now, I have a user-defined type in almost every table, thereby making every table error out in BigQuery (the most common one being …).
-
True, some tables still use user-defined types even in the latest version. You can adjust anything that uses entity_id or hbar_tinybars to be bigint to make it work. That should not require any code changes since they're compatible. We also use enum types, which I'm not sure how BigQuery would handle.
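If it helps, here is a minimal sketch of that adjustment, assuming entity_id and hbar_tinybars are Postgres domains over bigint as suggested above. The table and column names in the ALTER statement are only illustrative, not an exhaustive list:

```sql
-- Find every column that still uses one of the user-defined domain types.
SELECT table_name, column_name, domain_name
FROM information_schema.columns
WHERE domain_name IN ('entity_id', 'hbar_tinybars')
ORDER BY table_name, column_name;

-- Convert an affected column to plain bigint (repeat for each row returned above).
-- The table/column names here are hypothetical examples, not the full schema.
ALTER TABLE transaction
    ALTER COLUMN entity_id TYPE bigint,
    ALTER COLUMN charged_tx_fee TYPE bigint;
```

Since both domains wrap bigint, the stored values don't change; only the declared column type does.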
-
@steven-sheehy BigQuery is not a fan. Do you have any recommendations on the best way to do this? I'm currently running a Kubernetes cluster on GCP with the hedera-mirror helm chart and a long custom values.yaml file. I could write a script that runs a series of …
-
Hi @josneville, for the BigQuery schema errors, the schema described here should help solve your problem with the common user-defined types, such as … Can you reference that and see if it helps with all the user-defined types in BigQuery?
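For illustration only (this is not the project's published BigQuery schema), a table on the BigQuery side can simply declare those columns as INT64, assuming both user-defined types reduce to 64-bit integers. The project/dataset/table names below are placeholders:

```sql
-- Illustrative BigQuery DDL; the real published schema should take precedence.
-- Assumes entity_id and hbar_tinybars map to plain 64-bit integers.
CREATE TABLE `my_project.my_dataset.transaction` (
  consensus_timestamp INT64 NOT NULL,
  entity_id           INT64,
  charged_tx_fee      INT64
);
```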
-
@josneville Regarding the connection errors and only 1/10th of the transactions going through, can you post a snippet of the importer log output?
-
@edwin-greene This is the main error I see with the importer when I use it in pubsub mode:
…
On the topic, I receive about 40-ish requests a second, far below what I need speed-wise to ingest live data.
-
@edwin-greene I've tried increasing …
-
@josneville Please see the Pub/Sub batching settings here: https://github.com/hashgraph/hedera-mirror-node/blob/main/hedera-mirror-importer/src/test/resources/config/application-pubsub.yml
Enabling batching and adjusting the batch settings to your environment/setup may help resolve the connection problems you have been experiencing. There is an open issue to move the batch settings to the main application.yml.
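As a rough sketch of the kind of settings that file adjusts, assuming the importer picks up Spring Cloud GCP's standard publisher batching properties; the exact keys and values below are assumptions, so verify them against the linked application-pubsub.yml and your Spring Cloud GCP version:

```yaml
# Assumed Spring Cloud GCP publisher batching settings; verify the exact keys
# against the linked application-pubsub.yml before relying on them.
spring:
  cloud:
    gcp:
      pubsub:
        publisher:
          batching:
            enabled: true
            # Flush a batch once it holds this many messages...
            element-count-threshold: 100
            # ...or once it reaches this many bytes...
            request-byte-threshold: 1000000
            # ...or after this many seconds, whichever comes first.
            delay-threshold-seconds: 1
```

Larger count/byte thresholds trade a little latency for fewer, bigger publish calls, which tends to help when individual publish requests are failing or being throttled.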
-
@edwin-greene That makes sense. We decided to stick with the Postgres download for now. Do you or @steven-sheehy have recommendations for speeding up syncing the mirror node to live data (starting with historical data first)? If cost weren't a restriction, how would you guys recommend scaling up this node to download faster? It's currently downloading 1-2 days' worth of data each day; does that sound about right?
-
Pub/Sub publishing speed was recently improved. Make sure you are on version 0.75 or greater, with batch settings as described in the application-pubsub.yml linked above.
-
@josneville Converted this issue to a discussion since it seems to have strayed from its original purpose into a bunch of separate questions. Historical syncing is known to be slow, as we have not optimized for that path. There is a ticket that notes some ways that should be able to dramatically speed it up. If you end up attempting that, we'd greatly appreciate any documentation you can contribute toward solving that ticket. 🙏🏻
-
Problem
I'm trying to build a Google Datastream from the Postgres database in the helm chart to a BigQuery database. I'm running into an issue where a lot of the column types in Postgres aren't recognized by BigQuery, since they are user-defined types (entity_id, hbar_tinybars, etc.). This leaves a lot of columns unable to be imported into BigQuery.
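For context, a minimal sketch of the mismatch, assuming (as suggested in the replies above) that these user-defined types are Postgres domains wrapping bigint. The table below is an illustrative fragment, not the real schema:

```sql
-- Hypothetical v1-style definitions: the domains are just bigint underneath,
-- but Datastream/BigQuery sees the domain name rather than a type it knows.
CREATE DOMAIN entity_id AS bigint;
CREATE DOMAIN hbar_tinybars AS bigint;

CREATE TABLE transaction_example (
    consensus_timestamp bigint NOT NULL,
    entity_id           entity_id,      -- user-defined type, not mapped downstream
    charged_tx_fee      hbar_tinybars   -- user-defined type, not mapped downstream
);
```

Declaring such columns as plain bigint avoids the issue, which is what the Solution below asks about.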
Solution
I noticed that there is a v2 schema provided in the code, under hedera-mirror-node/hedera-mirror-importer/bin/main/db/migration/v2/V2.0.0__create_tables.sql, that uses native Postgres types as opposed to user-defined types. Is it possible to use this schema to initialize the database instead of the v1 schema, via flags, environment variables, or some kubeconfig?
Alternatives
No response