Module pipelines.utils.dump_url.tasks
General purpose tasks for dumping data from URLs.
Functions
def download_url(url: str, fname: str, url_type: str = 'direct', gsheets_sheet_order: int = 0, gsheets_sheet_name: str = None, gsheets_sheet_range: str = None) ‑> None
-
Downloads a file from a URL and saves it to a local file, trying to avoid using a lot of RAM. It is not optimized for Google Sheets downloads.
Args
url
- URL to download from.
fname
- Name of the file to save to.
url_type
- Type of URL that is being passed. One of:
  direct -> common URL to download directly;
  google_drive -> Google Drive URL;
  google_sheet -> Google Sheet URL.
gsheets_sheet_order
- Worksheet index, in case you want to select the worksheet by index. Worksheet indexes start from zero.
gsheets_sheet_name
- Worksheet name, in case you want to select the worksheet by name.
gsheets_sheet_range
- Range in selected worksheet to get data from. Defaults to entire worksheet.
Returns
None.
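The low-memory behaviour described for direct URLs can be illustrated with a minimal sketch. This is not the implementation of download_url: the helper name stream_to_file and its chunk_size parameter are hypothetical, and the sketch uses only the Python standard library. It assumes a plain, directly downloadable URL and ignores the google_drive and google_sheet cases entirely.

```python
import urllib.request


def stream_to_file(url: str, fname: str, chunk_size: int = 1024 * 1024) -> int:
    """Copy the body of `url` into `fname` chunk by chunk.

    Hypothetical helper, not part of pipelines.utils.dump_url.tasks.
    Returns the number of bytes written.
    """
    written = 0
    with urllib.request.urlopen(url) as response, open(fname, "wb") as out:
        while True:
            # Read at most chunk_size bytes, so memory use stays bounded
            # regardless of the total size of the download.
            chunk = response.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

Because only one chunk is held in memory at a time, peak memory use depends on chunk_size rather than on the size of the file being downloaded.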
def dump_files(file_path: str, partition_columns: List[str], save_path: str = '.', chunksize: int = 1000000, build_json_dataframe: bool = False, dataframe_key_column: str = None, encoding: str = 'utf-8', on_bad_lines: str = 'error', separator: str = ',') ‑> None
-
Dumps files according to chunk size and read mode.