A vibrant community of Excel enthusiasts. Get expert tips, ask questions, and share your love for all things Excel. Elevate your spreadsheet skills with us!
Extracting data from PDF to excel spreadsheet
Hello everyone,
I'm looking for help to extract data from a big pdf file to excel spreadsheet.
The data I have in unstructured. I did watch a couple youtube videos and tried to use power query and I ended up having 1000+ tables. I'm not sure what I did wrong.
Would appreciate any help.
PS. I'm dumb when it comes to coding.
u/omarjibory - Your post was submitted successfully.
Once your problem is solved, reply to the answer(s) saying
Solution Verified
to close the thread.Follow the submission rules -- particularly 1 and 2. To fix the body, click edit. To fix your title, delete and re-post.
Include your Excel version and all other relevant information
Failing to follow these steps may result in your post being removed without warning.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.
It really depends on how your Pdf looks like. Are there Tables inside? What can you select if you import it into PowerQuery?
No tables in the data. it's basically patients with certain diagnosis. Every time I try to do PowerQuery (combine and transform), I get groups of tables. 1 group each individual patient MRN and ID in one table then diagnosis in a separate one.
So there are groups of tables or only pages? You use Excel on Windows, don't you? What do you combine and transform? Are these multiple PDFs or one PDF with multiple data? Can you query the data before combining?
It's excel on windows
I watched a youtube video and this is what I did
I went to data ----> Get Data ----> from File ----> from folder and then I do combine and transform. This is where I get so many tables. I took a screenshot to show of it.
If the tables are not suitable for you, there should be pages at the bottom. Choose Parameter1 and in the next step filter out all tables and use the pages to continue.
I had to find a way to extract data from multiple unstructured PDFs to Excel.
To do this, I used Excel VBA and referenced the appropriate library to communicate with the PDF editor.
Then I used regular expressions to match for certain patterns in the PDF. After I matched for the string patterns, I would store the strings into an ArrayList object and transfer the data to Excel that way.
Btw, even though this is related to coding, might as well give it a shot if macros are allowed in your company. You might even learn something new and impress your colleagues if your program is stable / fool proofed enough.
Try https://bankstmtconverter.com where you can extract tables from PDF. Can process 100 pages in single PDF
Thank you, I will try it.
If you know exactly what you want to extract, just take each page separately and use an automation tool for data extraction like this.
I'm exploring this option. Thanks for the tip.
Could you share part of that doc to see if is there a possible solution?
Absolutely,
It's redacted for privacy but I want to collect all the elements listed in the reports.
You can see one page might have 1 report only while others have more than 1
opening the file in power query does not seem to present problems with the structure of your doc, are you using multiple pdf files for each report? or how exactly did you open the file?
I watched a youtube video and this is what I did
I went to data ----> Get Data ----> from File ----> from folder and then I do combine and transform. This is where I get so many tables. I took a screenshot to show of it.
I wrote this for inspecting PDF files using PQ: https://www.dropbox.com/scl/fi/isec3htdw4206ck2naeje/PDFinfoV2.xlsx?rlkey=7bqcrmssm0a6bprwnh9x320jp&dl=1