Module pipelines.rj_escritorio.data_catalog.tasks
Tasks for generating a data catalog from BigQuery.
Functions
def generate_dataframe_from_list_of_tables(list_of_tables: list) ‑> pandas.core.frame.DataFrame
-
Generate a Pandas DataFrame from a list of tables.
Args
list_of_tables
- List of tables.
Returns
Pandas DataFrame.
def list_projects(mode: str = 'prod', exclude_dev: bool = True) ‑> List[str]
-
List all GCP projects that we have access to.
Args
mode
- Credentials mode.
exclude_dev
- Exclude projects whose IDs end with "-dev".
Returns
List of project IDs.
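The effect of `exclude_dev` can be illustrated in isolation. This is a minimal sketch, not the module's implementation: `filter_dev_projects` and the sample project IDs are hypothetical, and the real task obtains its project list from GCP using the credentials selected by `mode`.

```python
from typing import List


def filter_dev_projects(project_ids: List[str], exclude_dev: bool = True) -> List[str]:
    """Hypothetical helper: drop project IDs that end with "-dev"."""
    if not exclude_dev:
        # Return a copy so the caller's list is never mutated.
        return list(project_ids)
    return [pid for pid in project_ids if not pid.endswith("-dev")]


# Made-up project IDs, for illustration only.
sample_projects = ["example-project", "example-project-dev", "another-project"]
filtered_projects = filter_dev_projects(sample_projects)
```

With the default `exclude_dev=True`, only IDs with the exact "-dev" suffix are dropped; an ID that merely contains "dev" elsewhere is kept.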
def list_tables(project_id: str, client: google.cloud.bigquery.client.Client = None, mode: str = 'prod', exclude_staging: bool = True, exclude_test: bool = True, exclude_logs: bool = True)
-
List all datasets and tables in a project.
Args
client
- BigQuery client.
project_id
- Project ID.
mode
- BigQuery client mode.
exclude_staging
- Exclude staging datasets.
exclude_test
- Exclude anything that contains the word "test".
exclude_logs
- Exclude log datasets.
Returns
List of dictionaries in the format:
{
    "project_id": "project_id",
    "dataset_id": "dataset_id",
    "table_id": "table_id",
    "url": "https://console.cloud.google.com/bigquery?p={project_id}&d={dataset_id}&t={table_id}&page=table",
    "private": True/False,
}
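To make the return format concrete, the sketch below builds a single entry with the documented keys and URL pattern. `build_table_entry` is a hypothetical helper, not part of the module, and the identifiers are made up. How the real task decides the value of `private` is not documented here, so the caller supplies it.

```python
from typing import Any, Dict


def build_table_entry(
    project_id: str, dataset_id: str, table_id: str, private: bool
) -> Dict[str, Any]:
    """Hypothetical helper: build one entry in the documented return format."""
    # URL pattern copied from the documented return value.
    url = (
        "https://console.cloud.google.com/bigquery"
        f"?p={project_id}&d={dataset_id}&t={table_id}&page=table"
    )
    return {
        "project_id": project_id,
        "dataset_id": dataset_id,
        "table_id": table_id,
        "url": url,
        "private": private,  # assumption: decided by the caller in this sketch
    }


# Made-up identifiers, for illustration only.
entry = build_table_entry("example-project", "example_dataset", "example_table", private=False)
```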
def merge_list_of_list_of_tables(list_of_list_of_tables: list) ‑> list
-
Merge a list of lists of tables into a single flat list of tables.
Args
list_of_list_of_tables
- List of lists of tables.
Returns
List of tables.
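Merging amounts to flattening one level of nesting. The sketch below shows one way to do that with `itertools.chain.from_iterable`. It is not necessarily how the task is implemented; `merge_tables` and the sample entries are hypothetical, and the entries are shortened to a few keys for brevity.

```python
from itertools import chain
from typing import Any, Dict, List


def merge_tables(list_of_list_of_tables: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Hypothetical helper: flatten per-project table lists into a single list."""
    # chain.from_iterable walks each inner list in order, preserving entry order.
    return list(chain.from_iterable(list_of_list_of_tables))


# Made-up entries, one inner list per project.
per_project = [
    [{"project_id": "project-a", "table_id": "t1"}, {"project_id": "project-a", "table_id": "t2"}],
    [],  # a project with no tables contributes nothing
    [{"project_id": "project-b", "table_id": "t3"}],
]
merged = merge_tables(per_project)
```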
def update_gsheets_data_catalog(dataframe: pandas.core.frame.DataFrame, spreadsheet_url: str, sheet_name: str) ‑> None
-
Update a Google Sheets spreadsheet with a DataFrame.
Args
dataframe
- Pandas DataFrame.
spreadsheet_url
- Google Sheets spreadsheet URL.
sheet_name
- Google Sheets sheet name.