Module pipelines.rj_escritorio.data_catalog.tasks

Tasks for generating a data catalog from BigQuery.

Functions

def generate_dataframe_from_list_of_tables(list_of_tables: list) ‑> pandas.core.frame.DataFrame

Generate a Pandas DataFrame from a list of tables.

Args

list_of_tables
List of tables.

Returns

Pandas DataFrame.
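As an illustration only, the conversion could look like the sketch below. It assumes each table is a dictionary in the format documented for list_tables. The helper name tables_to_dataframe and the sample values are hypothetical, and this is not necessarily how the task itself is implemented.

```python
import pandas as pd


def tables_to_dataframe(list_of_tables: list) -> pd.DataFrame:
    """Hypothetical helper: build a DataFrame with one row per table."""
    # Fix the column order so the result is stable even for an empty input.
    columns = ["project_id", "dataset_id", "table_id", "url", "private"]
    return pd.DataFrame(list_of_tables, columns=columns)


# Sample input (made-up values) in the documented dictionary format.
tables = [
    {
        "project_id": "my-project",
        "dataset_id": "my_dataset",
        "table_id": "table_a",
        "url": "https://console.cloud.google.com/bigquery?p=my-project&d=my_dataset&t=table_a&page=table",
        "private": False,
    },
    {
        "project_id": "my-project",
        "dataset_id": "my_dataset",
        "table_id": "table_b",
        "url": "https://console.cloud.google.com/bigquery?p=my-project&d=my_dataset&t=table_b&page=table",
        "private": True,
    },
]

df = tables_to_dataframe(tables)
```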

def list_projects(mode: str = 'prod', exclude_dev: bool = True) ‑> List[str]

Lists all GCP projects that we have access to.

Args

mode
Credentials mode.
exclude_dev
Exclude projects that end with "-dev".

Returns

List of project IDs.
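The effect of exclude_dev can be sketched in isolation. The example below assumes the project IDs have already been fetched. The helper filter_dev_projects and the sample project IDs are hypothetical and only illustrate the "-dev" suffix rule described above.

```python
from typing import List


def filter_dev_projects(project_ids: List[str], exclude_dev: bool = True) -> List[str]:
    """Hypothetical helper: drop project IDs that end with "-dev"."""
    if not exclude_dev:
        # Return a copy so callers cannot mutate the input by accident.
        return list(project_ids)
    return [pid for pid in project_ids if not pid.endswith("-dev")]


# Made-up project IDs for illustration.
projects = ["my-project", "my-project-dev", "other-project"]
filtered = filter_dev_projects(projects)
```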

def list_tables(project_id: str, client: google.cloud.bigquery.client.Client = None, mode: str = 'prod', exclude_staging: bool = True, exclude_test: bool = True, exclude_logs: bool = True)

List all datasets and tables in a project.

Args

client
BigQuery client.
project_id
Project ID.
mode
BigQuery client mode.
exclude_staging
Exclude staging datasets.
exclude_test
Exclude anything that contains the word "test".
exclude_logs
Exclude log datasets.

Returns

List of dictionaries in the format:

    {
        "project_id": "project_id",
        "dataset_id": "dataset_id",
        "table_id": "table_id",
        "url": "https://console.cloud.google.com/bigquery?p={project_id}&d={dataset_id}&t={table_id}&page=table",
        "private": True/False,
    }
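To make the entry format concrete, here is a minimal sketch that builds one such dictionary from its three identifiers, using the URL template shown above. The helper build_table_entry and the sample identifiers are hypothetical. How the real task decides the value of "private" is not documented here, so it is passed in explicitly.

```python
def build_table_entry(
    project_id: str, dataset_id: str, table_id: str, private: bool = False
) -> dict:
    """Hypothetical helper: build one entry in the documented format."""
    # URL template taken from the documented return format.
    url = (
        "https://console.cloud.google.com/bigquery"
        f"?p={project_id}&d={dataset_id}&t={table_id}&page=table"
    )
    return {
        "project_id": project_id,
        "dataset_id": dataset_id,
        "table_id": table_id,
        "url": url,
        "private": private,  # assumption: supplied by the caller
    }


# Made-up identifiers for illustration.
entry = build_table_entry("my-project", "my_dataset", "my_table")
```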

def merge_list_of_list_of_tables(list_of_list_of_tables: list) ‑> list

Merge a list of lists of tables into a single flat list of tables.

Args

list_of_list_of_tables
List of lists of tables.

Returns

List of tables.
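One common way to flatten such a nested list is itertools.chain.from_iterable, sketched below. The helper name merge_table_lists and the sample data are hypothetical, and the actual task may use a different approach.

```python
from itertools import chain


def merge_table_lists(list_of_list_of_tables: list) -> list:
    """Hypothetical helper: flatten one level of nesting, preserving order."""
    return list(chain.from_iterable(list_of_list_of_tables))


# Made-up per-project results; an empty inner list is handled naturally.
per_project = [
    [{"table_id": "a"}, {"table_id": "b"}],
    [],
    [{"table_id": "c"}],
]
merged = merge_table_lists(per_project)
```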

def update_gsheets_data_catalog(dataframe: pandas.core.frame.DataFrame, spreadsheet_url: str, sheet_name: str) ‑> None

Update a Google Sheets spreadsheet with a DataFrame.

Args

dataframe
Pandas DataFrame.
spreadsheet_url
Google Sheets spreadsheet URL.
sheet_name
Google Sheets sheet name.