Module pipelines.utils.dump_url.tasks

General purpose tasks for dumping data from URLs.

Functions

def download_url(url: str, fname: str, url_type: str = 'direct', gsheets_sheet_order: int = 0, gsheets_sheet_name: str = None, gsheets_sheet_range: str = None) -> None

Downloads a file from a URL and saves it to a local file, trying to keep memory usage low. It is not optimized for Google Sheets downloads.

Args

url
URL to download from.
fname
Name of the file to save to.
url_type
Type of URL that is being passed. direct: a common URL to download directly; google_drive: a Google Drive URL; google_sheet: a Google Sheet URL.
gsheets_sheet_order
Worksheet index, used when selecting the worksheet by position. Indexes start at zero.
gsheets_sheet_name
Worksheet name, used when selecting the worksheet by name.
gsheets_sheet_range
Range in selected worksheet to get data from. Defaults to entire worksheet.

Returns

None.
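
The low-memory behaviour described above can be illustrated with a minimal sketch. This is not this module's implementation: it only shows one common way to copy a response body to disk in fixed-size chunks using the standard library. The helper name stream_to_file and its chunk_size parameter are hypothetical, and only the 'direct' case is covered.

```python
import shutil
import urllib.request


def stream_to_file(url: str, fname: str, chunk_size: int = 1024 * 1024) -> None:
    """Hypothetical helper: save the body of `url` to `fname` chunk by chunk."""
    with urllib.request.urlopen(url) as response, open(fname, "wb") as out:
        # copyfileobj reads and writes at most `chunk_size` bytes at a time,
        # so the whole file is never held in memory at once.
        shutil.copyfileobj(response, out, chunk_size)
```

With this approach, peak memory use is bounded by the chunk size rather than by the size of the downloaded file.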

def dump_files(file_path: str, partition_columns: List[str], save_path: str = '.', chunksize: int = 1000000, build_json_dataframe: bool = False, dataframe_key_column: str = None, encoding: str = 'utf-8', on_bad_lines: str = 'error', separator: str = ',') -> None

Dumps files according to chunk size and read mode.