
Commit a9b5807

d116626 authored, with co-authors rdahis, lucasnascm, gustavoairestiago, and lucascr91
[infra] Python 1.6.2 (#1109)
* feat(infra): create version 1.6.2
* feat(infra): create version 1.6.2
* feat(infra): create version 1.6.2
* [infra] python-v1.6.2 (#1089)
* [infra] fix dataset_config.yaml folder path (#1067)
* feat(infra): merge master
* [infra] conform Metadata to new metadata changes (#1093)
* [dados-bot] br_ms_vacinacao_covid19 (2022-01-23) (#1086)
  Co-authored-by: terminal_name <github_email>
* [dados] br_bd_diretorios_brasil.etnia_indigena (#1087)
* Upload the etnia_indigena directory
* Update table_config.yaml
* Update table_config.yaml
* feat: conform Metadata's schema to the new one
* fix: conform yaml generation to the new schema
* fix: delete test_dataset folder
  Co-authored-by: Lucas Moreira <65978482+lucasnascm@users.noreply.github.com>
  Co-authored-by: Gustavo Aires Tiago <36206956+gustavoairestiago@users.noreply.github.com>
  Co-authored-by: Ricardo Dahis <6617207+rdahis@users.noreply.github.com>
  Co-authored-by: Lucas Moreira <65978482+lucasnascm@users.noreply.github.com>
  Co-authored-by: Gustavo Aires Tiago <36206956+gustavoairestiago@users.noreply.github.com>
* feat(infra): 1.6.2a3 version
* feat(infra): 1.6.2a3 version
* fix(infra): edit partitions and update_locally
* feat(infra): update_columns gains new fields and accepts local files
* [infra] option to make dataset public (#1020)
* feat(infra): option to make dataset public
* feat(infra): fix None data
* fix(infra): roll back
* fix(infra): fix retry in storage upload
* fix(infra): add option for dataset data location
* feat(infra): make staging dataset not public
* feat(infra): make staging dataset not public
* fix(infra): change bd version in actions
* fix(infra): add toml to the CI install
* fix(infra): remove a forgotten print
* fix(infra): fix location
* fix(infra): fix dataset description
* feat(infra): bump version
* feat(infra): temporal coverage as list in update_columns
* feat(infra): add new parameters to cli
* feat(infra): fix cli options
* [infra] change download functions to consume CKAN endpoints #1129 (#1130)
* [infra] add function to wrap bd_dataset_search endpoint
* Update download.py
* [infra] modify list_datasets function to consume CKAN endpoint
* [infra] fix list_dataset function to include limit and remove order_by
* [infra] change function list_dataset_tables to use CKAN endpoint
* [infra] apply PEP8 to list_dataset_tables and respective tests
* add get_dataset_description, get_table_description, get_table_columns
* [infra] fix dataset_config.yaml folder path (#1067)
* feat(infra): merge master
* fix file organization to match master
* remove download.py
* remove test_download
* Delete test_download.py
* remove test files
* remove test_download.py
* remove test_download.py
* remove test_download.py
* remove test_download.py
* add tests metadata
* remove test_download.py
* remove unused imports
* [infra] add _safe_fetch and get_table_size functions
  Co-authored-by: lucascr91 <lucas.ecomg@gmail.com>
* fix(infra): add an empty list when there is no partition
* [infra] Add support for Avro and Parquet (#1145)
* add support for Avro and Parquet uploads
* Add tests for source formats
* [infra] update tests for avro, parquet, and csv upload
  Co-authored-by: Gabriel Gazola Milan <gabriel.gazola@poli.ufrj.br>
  Co-authored-by: Isadora Bugarin <isadorabugarin@gmail.com>
  Co-authored-by: lucascr91 <lucas.ecomg@gmail.com>
* [infra] Feedback messages in upload methods [issue #1059] (#1085)
* Create dataclass config
* Success messages for create and update (table.py) using loguru
* feat: improve log level control
* refactor: move logger config to Base.__init__
* Improve log level control
* Adjust log level control function in base.py
* Fix repeated 'DELETE' messages every time a Table is replaced
* Import 'dataclass' from 'dataclasses' to make config work
* Fix repeated 'UPDATE' messages inside other functions
* Define a new script message format
* Define standard log messages for 'dataset.py' functions
* Define standard log messages for 'storage.py' functions
* Define standard log messages for 'table.py' functions
* Define standard log messages for 'metadata.py' functions
* Add standard configuration for billing_project_id in download.py
* Configure billing_project_id in download.py
* Configure config_path in base.py
  Co-authored-by: Guilherme Salustiano <guissalustiano@gmail.com>
  Co-authored-by: Isadora Bugarin <isadorabugarin@gmail.com>
* update toml

Co-authored-by: Ricardo Dahis <6617207+rdahis@users.noreply.github.com>
Co-authored-by: Lucas Moreira <65978482+lucasnascm@users.noreply.github.com>
Co-authored-by: Gustavo Aires Tiago <36206956+gustavoairestiago@users.noreply.github.com>
Co-authored-by: lucascr91 <lucas.ecomg@gmail.com>
Co-authored-by: Isadora Bugarin <57679195+isadorabugarin@users.noreply.github.com>
Co-authored-by: Gabriel Gazola Milan <gabriel.gazola@poli.ufrj.br>
Co-authored-by: Isadora Bugarin <isadorabugarin@gmail.com>
Co-authored-by: Guilherme Salustiano <guissalustiano@gmail.com>
1 parent 86b12f3 commit a9b5807

22 files changed: +1706 −1187 lines changed


.github/workflows/python-ci.yml

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@ jobs:
       run: |
         cd python-package
         pip install -r requirements-dev.txt
-        pip install coveralls
+        pip install coveralls toml
       shell: bash
     - name: Install package
       run: |
@@ -109,7 +109,7 @@ jobs:
       run: |
         cd python-package
         pip install -r requirements-dev.txt
-        pip install coveralls
+        pip install coveralls toml
       shell: cmd
     - name: Install package
       run: |
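The `toml` dependency added above is only needed at CI time. A hedged sketch of the presumed consumer, assuming the workflow reads the package version out of pyproject.toml (the path and keys below are illustrative, not taken from this diff):

```python
# Illustrative only: read the version that `poetry version` writes, the kind
# of check that requires the `toml` package in CI.
import toml

pyproject = toml.load("python-package/pyproject.toml")  # assumed path
print(pyproject["tool"]["poetry"]["version"])  # e.g. "1.6.2"
```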

bases/br_bd_diretorios_brasil/dataset_config.yaml

Lines changed: 2 additions & 1 deletion
@@ -45,4 +45,5 @@ github_url:

 # Não altere esse campo.
 # Data da última modificação dos metadados gerada automaticamente pelo CKAN.
-metadata_modified: '2022-02-09T21:59:32.440801'
+
+metadata_modified: '2022-02-09T21:59:32.440801'

bases/test_dataset/README.md

Lines changed: 0 additions & 7 deletions
This file was deleted.

python-package/README.md

Lines changed: 7 additions & 0 deletions
@@ -37,3 +37,10 @@ Publique nova versão
 poetry version [patch|minor|major]
 poetry publish --build
 ```
+
+Versão Alpha e Beta
+
+```
+version = "1.6.2-alpha.3"
+version = "1.6.2-beta.3"
+```

python-package/basedosdados/__init__.py

Lines changed: 2 additions & 1 deletion
@@ -21,4 +21,5 @@
     get_dataset_description,
     get_table_columns,
     get_table_size,
-)
+    search
+)
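A hedged usage sketch of the newly exported `search`, which the commit message describes as a wrapper around the CKAN `bd_dataset_search` endpoint; the keyword arguments below are assumptions, not a documented signature:

```python
import basedosdados as bd

# `query` and `order_by` are assumed parameters, inferred from the
# bd_dataset_search wrapper mentioned in the commit message.
results = bd.search(query="vacinacao covid19", order_by="score")
print(results.head())
```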

python-package/basedosdados/cli/cli.py

Lines changed: 83 additions & 24 deletions
@@ -77,10 +77,25 @@ def mode_text(mode, verb, obj_id):
     default="raise",
     help="[raise|update|replace|pass] if dataset alread exists",
 )
+@click.option(
+    "--dataset_is_public",
+    default=True,
+    help="Control if prod dataset is public or not. By default staging datasets like `dataset_id_staging` are not public.",
+)
+@click.option(
+    "--location",
+    default=None,
+    help="Location of dataset data. List of possible region names locations: https://cloud.google.com/bigquery/docs/locations",
+)
 @click.pass_context
-def create_dataset(ctx, dataset_id, mode, if_exists):
+def create_dataset(ctx, dataset_id, mode, if_exists, dataset_is_public, location):

-    Dataset(dataset_id=dataset_id, **ctx.obj).create(mode=mode, if_exists=if_exists)
+    Dataset(dataset_id=dataset_id, **ctx.obj).create(
+        mode=mode,
+        if_exists=if_exists,
+        dataset_is_public=dataset_is_public,
+        location=location,
+    )

     click.echo(
         click.style(
@@ -96,9 +111,9 @@ def create_dataset(ctx, dataset_id, mode, if_exists):
     "--mode", "-m", default="all", help="What datasets to create [prod|staging|all]"
 )
 @click.pass_context
-def update_dataset(ctx, dataset_id, mode):
+def update_dataset(ctx, dataset_id, mode, location):

-    Dataset(dataset_id=dataset_id, **ctx.obj).update(mode=mode)
+    Dataset(dataset_id=dataset_id, **ctx.obj).update(mode=mode, location=location)

     click.echo(
         click.style(
@@ -110,10 +125,17 @@ def update_dataset(ctx, dataset_id, mode):

 @cli_dataset.command(name="publicize", help="Make a dataset public")
 @click.argument("dataset_id")
+@click.option(
+    "--dataset_is_public",
+    default=True,
+    help="Control if prod dataset is public or not. By default staging datasets like `dataset_id_staging` are not public.",
+)
 @click.pass_context
-def publicize_dataset(ctx, dataset_id):
+def publicize_dataset(ctx, dataset_id, dataset_is_public):

-    Dataset(dataset_id=dataset_id, **ctx.obj).publicize()
+    Dataset(dataset_id=dataset_id, **ctx.obj).publicize(
+        dataset_is_public=dataset_is_public
+    )

     click.echo(
         click.style(
@@ -168,7 +190,12 @@ def cli_table():
     help="[raise|replace|pass] actions if table config files already exist",
 )
 @click.option(
-    "--columns_config_url",
+    "--source_format",
+    default="csv",
+    help="Data source format. Only 'csv' is supported. Defaults to 'csv'.",
+)
+@click.option(
+    "--columns_config_url_or_path",
     default=None,
     help="google sheets URL. Must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>. The sheet must contain the column name: 'coluna' and column description: 'descricao'.",
 )
@@ -180,14 +207,16 @@ def init_table(
     data_sample_path,
     if_folder_exists,
     if_table_config_exists,
-    columns_config_url,
+    source_format,
+    columns_config_url_or_path,
 ):

     t = Table(table_id=table_id, dataset_id=dataset_id, **ctx.obj).init(
         data_sample_path=data_sample_path,
         if_folder_exists=if_folder_exists,
         if_table_config_exists=if_table_config_exists,
-        columns_config_url=columns_config_url,
+        source_format=source_format,
+        columns_config_url_or_path=columns_config_url_or_path,
     )

     click.echo(
@@ -232,9 +261,24 @@ def init_table(
     help="[raise|replace|pass] actions if table config files already exist",
 )
 @click.option(
-    "--columns_config_url",
+    "--source_format",
+    default="csv",
+    help="Data source format. Only 'csv' is supported. Defaults to 'csv'.",
+)
+@click.option(
+    "--columns_config_url_or_path",
+    default=None,
+    help="Path to the local architeture file or a public google sheets URL. Path only suports csv, xls, xlsx, xlsm, xlsb, odf, ods, odt formats. Google sheets URL must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>.",
+)
+@click.option(
+    "--dataset_is_public",
+    default=True,
+    help="Control if prod dataset is public or not. By default staging datasets like `dataset_id_staging` are not public.",
+)
+@click.option(
+    "--location",
     default=None,
-    help="google sheets URL. Must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>",
+    help="Location of dataset data. List of possible region names locations: https://cloud.google.com/bigquery/docs/locations",
 )
 @click.pass_context
 def create_table(
@@ -247,7 +291,10 @@ def create_table(
     force_dataset,
     if_storage_data_exists,
     if_table_config_exists,
-    columns_config_url,
+    source_format,
+    columns_config_url_or_path,
+    dataset_is_public,
+    location,
 ):

     Table(table_id=table_id, dataset_id=dataset_id, **ctx.obj).create(
@@ -257,7 +304,10 @@ def create_table(
         force_dataset=force_dataset,
         if_storage_data_exists=if_storage_data_exists,
         if_table_config_exists=if_table_config_exists,
-        columns_config_url=columns_config_url,
+        source_format=source_format,
+        columns_config_url_or_path=columns_config_url_or_path,
+        dataset_is_public=dataset_is_public,
+        location=location,
     )

     click.echo(
@@ -297,23 +347,32 @@ def update_table(ctx, dataset_id, table_id, mode):
 @click.argument("dataset_id")
 @click.argument("table_id")
 @click.option(
-    "--columns_config_url",
+    "--columns_config_url_or_path",
     default=None,
-    help="""\nGoogle sheets URL. Must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>.
-    \nThe sheet must contain the columns:\n
-    - nome: column name\n
-    - descricao: column description\n
-    - tipo: column bigquery type\n
-    - unidade_medida: column mesurement unit\n
-    - dicionario: column related dictionary\n
-    - nome_diretorio: column related directory in the format <dataset_id>.<table_id>:<column_name>
+    help="""\nFills columns in table_config.yaml automatically using a public google sheets URL or a local file. Also regenerate
+    \npublish.sql and autofill type using bigquery_type.\n
+
+    \nThe sheet must contain the columns:\n
+    - name: column name\n
+    - description: column description\n
+    - bigquery_type: column bigquery type\n
+    - measurement_unit: column mesurement unit\n
+    - covered_by_dictionary: column related dictionary\n
+    - directory_column: column related directory in the format <dataset_id>.<table_id>:<column_name>\n
+    - temporal_coverage: column temporal coverage\n
+    - has_sensitive_data: the column has sensitive data\n
+    - observations: column observations\n
+    \nArgs:\n
+    \ncolumns_config_url_or_path (str): Path to the local architeture file or a public google sheets URL.\n
+    Path only suports csv, xls, xlsx, xlsm, xlsb, odf, ods, odt formats.\n
+    Google sheets URL must be in the format https://docs.google.com/spreadsheets/d/<table_key>/edit#gid=<table_gid>.\n
     """,
 )
 @click.pass_context
-def update_columns(ctx, dataset_id, table_id, columns_config_url):
+def update_columns(ctx, dataset_id, table_id, columns_config_url_or_path):

     Table(table_id=table_id, dataset_id=dataset_id, **ctx.obj).update_columns(
-        columns_config_url=columns_config_url,
+        columns_config_url_or_path=columns_config_url_or_path,
     )

     click.echo(
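A minimal sketch of the same new knobs driven through the Python API instead of the CLI; the keyword names come straight from the diff above, while the dataset id and region are placeholders:

```python
from basedosdados import Dataset

Dataset(dataset_id="my_dataset").create(
    mode="all",
    if_exists="raise",
    dataset_is_public=True,         # prod dataset becomes publicly readable
    location="southamerica-east1",  # any BigQuery region name is accepted
)
```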

python-package/basedosdados/configs/templates/table/table_description.txt

Lines changed: 2 additions & 2 deletions
@@ -48,8 +48,8 @@ Email: {{ data_cleaned_by.email }}
 {% call input(partitions) -%}
 Partições (Filtre a tabela por essas colunas para economizar dinheiro e tempo)
 ---------
-{% if (partitions.split(',') is not none) -%}
-{% for partition in partitions.split(',') -%}
+{% if (partitions is not none) -%}
+{% for partition in partitions -%}
 - {{ partition }}
 {% endfor -%}
 {%- endif %}
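The template change above means `partitions` now arrives as a list and is iterated directly, instead of being split from a comma-separated string. An illustrative rendering of just that snippet:

```python
from jinja2 import Template

# Same loop as the template above, reduced to the changed lines.
snippet = Template(
    "{% if partitions is not none %}"
    "{% for partition in partitions %}- {{ partition }}\n{% endfor %}"
    "{% endif %}"
)
print(snippet.render(partitions=["ano", "sigla_uf"]))
# - ano
# - sigla_uf
```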

python-package/basedosdados/constants.py

Lines changed: 9 additions & 1 deletion
@@ -1,6 +1,14 @@
-__all__ = ["constants"]
+__all__ = ["config", "constants"]

 from enum import Enum
+from dataclasses import dataclass
+
+
+@dataclass
+class config:
+    verbose: bool = True
+    billing_project_id: str = None
+    project_config_path: str = None


 class constants(Enum):
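A hedged usage sketch of the new `config` dataclass: it acts as a module-level settings object, so assigning to its class attributes changes package-wide defaults (the project id below is a placeholder):

```python
from basedosdados.constants import config

config.billing_project_id = "my-billing-project"  # placeholder project id
config.verbose = False                            # quiet the new log messages
```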

python-package/basedosdados/download/download.py

Lines changed: 19 additions & 6 deletions
@@ -18,6 +18,7 @@
     BaseDosDadosInvalidProjectIDException,
     BaseDosDadosNoBillingProjectIDException,
 )
+from basedosdados.constants import config, constants
 from pandas_gbq.gbq import GenericGBQException

@@ -49,6 +50,10 @@ def read_sql(
         Query result
     """

+    # standard billing_project_id configuration
+    if billing_project_id is None:
+        billing_project_id == config.billing_project_id
+
     try:
         # Set a two hours timeout
         bigquery_storage_v1.client.BigQueryReadClient.read_rows = partialmethod(
@@ -86,8 +91,8 @@ def read_sql(
 def read_table(
     dataset_id,
     table_id,
-    query_project_id="basedosdados",
     billing_project_id=None,
+    query_project_id="basedosdados",
     limit=None,
     from_file=False,
     reauth=False,
@@ -101,10 +106,10 @@ def read_table(
         table_id (str): Optional.
             Table id available in basedosdados.dataset_id.
             It should always come with dataset_id.
-        query_project_id (str): Optional.
-            Which project the table lives. You can change this you want to query different projects.
         billing_project_id (str): Optional.
             Project that will be billed. Find your Project ID here https://console.cloud.google.com/projectselector2/home/dashboard
+        query_project_id (str): Optional.
+            Which project the table lives. You can change this you want to query different projects.
         limit (int): Optional.
             Number of rows to read from table.
         from_file (boolean): Optional.
@@ -122,6 +127,10 @@ def read_table(
         Query result
     """

+    # standard billing_project_id configuration
+    if billing_project_id is None:
+        billing_project_id == config.billing_project_id
+
     if (dataset_id is not None) and (table_id is not None):
         query = f"""
         SELECT *
@@ -147,8 +156,8 @@ def download(
     query=None,
     dataset_id=None,
     table_id=None,
-    query_project_id="basedosdados",
     billing_project_id=None,
+    query_project_id="basedosdados",
     limit=None,
     from_file=False,
     reauth=False,
@@ -180,10 +189,10 @@ def download(
         table_id (str): Optional.
             Table id available in basedosdados.dataset_id.
             It should always come with dataset_id.
-        query_project_id (str): Optional.
-            Which project the table lives. You can change this you want to query different projects.
         billing_project_id (str): Optional.
             Project that will be billed. Find your Project ID here https://console.cloud.google.com/projectselector2/home/dashboard
+        query_project_id (str): Optional.
+            Which project the table lives. You can change this you want to query different projects.
         limit (int): Optional
             Number of rows.
         from_file (boolean): Optional.
@@ -201,6 +210,10 @@ def download(
             "Either table_id, dataset_id or query should be filled."
         )

+    # standard billing_project_id configuration
+    if billing_project_id is None:
+        billing_project_id == config.billing_project_id
+
     client = google_client(query_project_id, billing_project_id, from_file, reauth)

     # makes sure that savepath is a filepath and not a folder
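Note that the three guards added above use `==`, a comparison whose result is discarded, so as committed they never copy the configured default into a `None` argument. A minimal sketch of the fallback the commit message ("standard billing_project_id configuration") presumably intends, with an assignment instead:

```python
from basedosdados.constants import config

def resolve_billing_project_id(billing_project_id=None):
    # Fall back to the package-wide default when no explicit id is given.
    if billing_project_id is None:
        billing_project_id = config.billing_project_id  # `=`, not `==`
    return billing_project_id
```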
