PDF • Intermediate
Extract Tables from PDF
Robomotion • Updated 6 months ago

Overview
Locates tables inside a PDF and extracts their rows into structured data. A drop-in stage in any OCR-adjacent pipeline.
Although PDF is a common format for sharing content, editing the tables inside a PDF directly is cumbersome. Robomotion lets users extract tables from PDF files and store them in other file types, such as Excel worksheets, for easier editing.
What Extract Tables from PDF can do
- `Core.Flow.SubFlow` downloads fixtures; a Function builds `msg.sample_pdf` (`.../fixtures/tables.pdf`).
- Input Dialog titled `Extract PDF tables to Excel`, message `Select the PDF to extract table(s) from:`, default `msg.sample_pdf` → `msg.pdf_path`.
- Validate (`Core.Programming.Function`, `outputs: 2`): require a `.pdf` path; derive `msg.xlsx_path` (`<stem>_tables.xlsx`) and timestamped `msg.tables_json_path` and `msg.ps_script_path` next to the source.
- `Robomotion.Pandas.PdfToDataTable` (`optPages: 'all'`, `optTableSettings: 'lines'`) → `msg.table_list`.
- Function serialises `msg.table_list` to `msg.tables_json` and embeds the PowerShell script as `msg.ps_script`.
- Two `Core.FileSystem.WriteFile` nodes (`optMode: 'truncate'`) write `msg.tables_json_path` and `msg.ps_script_path`; a Function builds `msg.ps_args = ['-NoProfile', '-ExecutionPolicy', 'Bypass', '-File', msg.ps_script_path, '-JsonPath', msg.tables_json_path, '-XlsxPath', msg.xlsx_path]`.
- `Core.Process.StartProcess` runs `powershell` with `msg.ps_args` in the foreground; the script adds one sheet per table (named `Table_1`, `Table_2`, …) and saves with Excel format code `51` (`.xlsx`).
- Two `Core.FileSystem.Delete` nodes clean up the JSON and PS1 temp files; a Function builds `msg.dialog_text = 'Extracted tables saved in: ' + msg.xlsx_path`; `Core.Dialog.MessageBox` titled `Done!` (type `info`) displays it.
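
The Validate step and the PowerShell argument list could look roughly like the sketch below. It is not the template's actual Function code: the helper names `derivePaths` and `buildPsArgs` are hypothetical, and the exact naming of the timestamped temp files is an assumption (the description only says they are timestamped and sit next to the source PDF).

```javascript
// Hypothetical helper mirroring the Validate step; names are illustrative only.
// Returns null for a non-PDF path so the caller can route to the second output.
function derivePaths(pdfPath, now = Date.now()) {
  if (typeof pdfPath !== 'string' || !/\.pdf$/i.test(pdfPath.trim())) {
    return null;
  }
  const clean = pdfPath.trim();
  // Split on the last path separator, accepting Windows or POSIX style.
  const cut = Math.max(clean.lastIndexOf('\\'), clean.lastIndexOf('/'));
  const dir = clean.slice(0, cut + 1); // keeps the trailing separator, '' if none
  const stem = clean.slice(cut + 1).replace(/\.pdf$/i, '');
  return {
    xlsx_path: `${dir}${stem}_tables.xlsx`,
    // Assumed temp-file naming; only "timestamped" is stated in the description.
    tables_json_path: `${dir}${stem}_tables_${now}.json`,
    ps_script_path: `${dir}${stem}_tables_${now}.ps1`,
  };
}

// Builds the argument array in the order listed in the steps above.
function buildPsArgs(paths) {
  return [
    '-NoProfile', '-ExecutionPolicy', 'Bypass',
    '-File', paths.ps_script_path,
    '-JsonPath', paths.tables_json_path,
    '-XlsxPath', paths.xlsx_path,
  ];
}
```

Inside a Function node, the returned fields would be copied onto `msg` (for example `msg.xlsx_path`), and a `null` result would send the message to the second output. Passing the arguments as an array rather than one concatenated string avoids quoting problems when the PDF path contains spaces.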
Behind the scenes
- The flow uses a JSON-plus-PowerShell bridge rather than driving Excel with UI automation: `Robomotion.Pandas.PdfToDataTable` produces structured rows, `ConvertFrom-Json` rehydrates them, and the COM object writes each sheet deterministically. This avoids UI timing issues and keeps Excel invisible (`$excel.Visible = $false`, `$excel.DisplayAlerts = $false`).
- The workbook is saved to `msg.xlsx_path` and Excel is cleanly closed (`$wb.Close($false)`, `$excel.Quit()`, `ReleaseComObject`) so there is no orphan `EXCEL.EXE` and no "do you want to save?" prompt on a later run.
- Temp file names are timestamped (`Date.now()`) so two concurrent runs against the same PDF do not collide; the cleanup deletes use `continueOnError: true` so an AV-locked file does not abort the flow after the main work is done.
- `optTableSettings: 'lines'` extracts tables defined by visible rules; switch to `'text'` for borderless tables that rely on whitespace alignment.
- Format code `51` corresponds to `xlOpenXMLWorkbook` (the modern `.xlsx` container); change to `56` if a legacy `.xls` is required.
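
To make the JSON hand-off and the format-code choice concrete, here is a minimal sketch in JavaScript, the language of the flow's Function nodes. Several things are assumptions: the helper names `excelFormatCode` and `toSheetPayload` are hypothetical, the table shape `{ columns, rows }` is a guess at what `msg.table_list` contains, and in the template itself the sheet naming and the save call happen inside the PowerShell script rather than in a Function node.

```javascript
// Excel XlFileFormat values referenced in the notes above.
const XL_FORMAT = new Map([
  ['.xlsx', 51], // xlOpenXMLWorkbook
  ['.xls', 56],  // xlExcel8 (legacy binary workbook)
]);

// Picks the SaveAs format code from the output file extension.
function excelFormatCode(outPath) {
  const match = /\.[^.\\/]+$/.exec(outPath);
  const ext = match ? match[0].toLowerCase() : '';
  if (!XL_FORMAT.has(ext)) {
    throw new Error(`Unsupported workbook extension: ${ext || '(none)'}`);
  }
  return XL_FORMAT.get(ext);
}

// Serialises tables into a JSON string that ConvertFrom-Json could rehydrate.
// Assumed input shape per table: { columns: string[], rows: any[][] }.
function toSheetPayload(tableList) {
  return JSON.stringify(
    tableList.map((table, index) => ({
      sheet: `Table_${index + 1}`, // matches the Table_1, Table_2, … naming
      columns: table.columns,
      rows: table.rows,
    }))
  );
}
```

Deriving the code from the extension keeps the file name and the container format in step; saving a `.xls` path with code `51`, or the reverse, tends to produce a workbook that Excel warns about on open.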