Extracting data from PDF to excel spreadsheet : r/excel Skip to main content

Get the Reddit app

Scan this QR code to download the app now
Or check it out in the app stores
r/excel icon
r/excel icon
Go to excel
r/excel
A banner for the subreddit

A vibrant community of Excel enthusiasts. Get expert tips, ask questions, and share your love for all things Excel. Elevate your spreadsheet skills with us!


Members Online

Extracting data from PDF to excel spreadsheet

unsolved

Hello everyone,

I'm looking for help to extract data from a big pdf file to excel spreadsheet.

The data I have in unstructured. I did watch a couple youtube videos and tried to use power query and I ended up having 1000+ tables. I'm not sure what I did wrong.

Would appreciate any help.

PS. I'm dumb when it comes to coding.

Share
Sort by:
Top
Open comment sort options
u/AutoModerator avatar
Moderator Announcement Read More »

It really depends on how your Pdf looks like. Are there Tables inside? What can you select if you import it into PowerQuery?

u/omarjibory avatar

No tables in the data. it's basically patients with certain diagnosis. Every time I try to do PowerQuery (combine and transform), I get groups of tables. 1 group each individual patient MRN and ID in one table then diagnosis in a separate one.

So there are groups of tables or only pages? You use Excel on Windows, don't you? What do you combine and transform? Are these multiple PDFs or one PDF with multiple data? Can you query the data before combining?

u/omarjibory avatar

It's excel on windows

I watched a youtube video and this is what I did

I went to data ----> Get Data ----> from File ----> from folder and then I do combine and transform. This is where I get so many tables. I took a screenshot to show of it.

Comment Image

If the tables are not suitable for you, there should be pages at the bottom. Choose Parameter1 and in the next step filter out all tables and use the pages to continue.

More replies
More replies
More replies
More replies

I had to find a way to extract data from multiple unstructured PDFs to Excel.

To do this, I used Excel VBA and referenced the appropriate library to communicate with the PDF editor.

Then I used regular expressions to match for certain patterns in the PDF. After I matched for the string patterns, I would store the strings into an ArrayList object and transfer the data to Excel that way.

Btw, even though this is related to coding, might as well give it a shot if macros are allowed in your company. You might even learn something new and impress your colleagues if your program is stable / fool proofed enough.

Try https://bankstmtconverter.com where you can extract tables from PDF. Can process 100 pages in single PDF

u/omarjibory avatar

Thank you, I will try it.

More replies
u/Pineapple_Playful avatar

If you know exactly what you want to extract, just take each page separately and use an automation tool for data extraction like this.

u/omarjibory avatar

I'm exploring this option. Thanks for the tip.

More replies

Could you share part of that doc to see if is there a possible solution?

u/omarjibory avatar

Absolutely,

It's redacted for privacy but I want to collect all the elements listed in the reports.

You can see one page might have 1 report only while others have more than 1

Comment Image
Edited

opening the file in power query does not seem to present problems with the structure of your doc, are you using multiple pdf files for each report? or how exactly did you open the file?

Comment Image
u/omarjibory avatar

I watched a youtube video and this is what I did

I went to data ----> Get Data ----> from File ----> from folder and then I do combine and transform. This is where I get so many tables. I took a screenshot to show of it.

Comment Image
More replies
More replies
More replies