Extract text from PDF and Word files

Question

How can I extract plain text from PDF or Word files in C#, dropping bold, images, and other rich-text formatting?

pbz · Accepted Answer · 2010-09-10 03:21:30Z

You can use the filters (IFilters) designed for and used by the indexing service. They extract the plain text from various document types, which is what makes searching inside a document possible. This works for Office files, PDFs, HTML and so on: basically any file type that has a filter. The only downside is that the filters have to be installed on the server, so if you don't have direct access to the server this may not be possible. Some filters come pre-installed with Windows, but others, like the PDF filter, you have to install yourself. For a C# implementation, see the article "Using IFilter in C#".

Kurt Pfeifle · Accepted Answer · 2010-09-07 00:00:02Z

PDF:

You have various options.

pdftotext:
Download the XPDF utilities. The .zip file contains various command-line utilities. One of them is pdftotext(.exe). It can extract all text content from a well-behaved PDF file. Type pdftotext -help to learn about some of its command-line parameters.
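Since the question asks about C#, here is a minimal sketch of calling pdftotext from C# and capturing its output. It assumes pdftotext.exe has been unpacked somewhere you can point to, and it relies on the convention that passing - as the output file name makes pdftotext write to stdout. The class and method names (PdfToTextRunner, BuildArguments, ExtractText) are made up for illustration and are not part of XPDF.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTextRunner
{
    // Wrap a path in double quotes so that spaces survive command-line parsing.
    public static string Quote(string path)
    {
        return "\"" + path + "\"";
    }

    // "-" as the output file name asks pdftotext to write the text to stdout.
    public static string BuildArguments(string pdfPath)
    {
        return Quote(pdfPath) + " -";
    }

    // Runs pdftotext and returns whatever it printed to stdout.
    // pdfToTextExe is the (assumed) location of pdftotext.exe on your machine.
    public static string ExtractText(string pdfToTextExe, string pdfPath)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = pdfToTextExe,
            Arguments = BuildArguments(pdfPath),
            UseShellExecute = false,        // required for output redirection
            RedirectStandardOutput = true,
            CreateNoWindow = true
        };

        using (Process process = Process.Start(startInfo))
        {
            // Read before WaitForExit so a full output buffer cannot block the child.
            string text = process.StandardOutput.ReadToEnd();
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    "pdftotext exited with code " + process.ExitCode);
            }
            return text;
        }
    }
}
```

A call would look like PdfToTextRunner.ExtractText(@"C:\tools\xpdf\pdftotext.exe", @"C:\docs\input.pdf"), where both paths are placeholders. The same pattern should carry over to the Ghostscript command in the next section by swapping the executable and arguments.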

Ghostscript:
Install Ghostscript (v8.71 was the latest version when this was written). Ghostscript is a PostScript and PDF interpreter. You can use it to extract text from a PDF as well:

gswin32c.exe ^
 -q ^
 -sFONTPATH=c:/windows/fonts ^
 -dNODISPLAY ^
 -dSAFER ^
 -dDELAYBIND ^
 -dWRITESYSTEMDICT ^
 -dSIMPLE ^
 -f ps2ascii.ps ^
 -dFirstPage=3 ^
 -dLastPage=7 ^
 input.pdf ^
 -dQUIET

This will output text contained on pages 3-7 of input.pdf to stdout. You can redirect this to a file by appending > /path/to/output.txt to the command. (Check to make sure that the PostScript utility program ps2ascii.ps is present in your Ghostscript's lib subdirectory.)

If you omit the -dSIMPLE parameter, the output tries to guess line breaks and word spacing. For details, look at the comments inside the ps2ascii.ps file itself. You can even replace that parameter with -dCOMPLEX to get additional text formatting info.

Adnan · Accepted Answer · 2010-09-06 16:39:57Z

For PDF, have you taken a look at TallPDF?

Also check this one: http://www.codeproject.com/KB/files/PDF_to_TEXT.aspx


Dmitry Karpezo · Accepted Answer · 2010-09-06 17:28:04Z

Use the Word object model. It's the only reliable way, since the Word format is not open and varies from version to version.
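As a rough sketch of what that could look like: the code below opens a document through the Microsoft.Office.Interop.Word assembly and reads the plain text of its main body. It assumes Word is installed on the machine running the code, that the project references the interop assembly, and that you are on C# 4 or later (for named arguments on COM calls). WordTextExtractor, ExtractText and NormalizeLineBreaks are illustrative names, not part of the Word API.

```csharp
using System;
using Word = Microsoft.Office.Interop.Word;

public static class WordTextExtractor
{
    // Word typically reports paragraph marks as a bare carriage return ("\r");
    // convert every line-break style to "\n" so the result is uniform.
    public static string NormalizeLineBreaks(string text)
    {
        return text.Replace("\r\n", "\n").Replace("\r", "\n");
    }

    // Opens the document read-only in a background Word instance and returns
    // the plain text of its main body. docPath should be an absolute path.
    public static string ExtractText(string docPath)
    {
        Word.Application app = new Word.Application();
        Word.Document doc = null;
        try
        {
            doc = app.Documents.Open(FileName: docPath, ReadOnly: true);
            // Content is the Range covering the main document story;
            // its Text property carries no formatting.
            return NormalizeLineBreaks(doc.Content.Text);
        }
        finally
        {
            // The casts avoid the method/event name clash on Close and Quit.
            if (doc != null)
            {
                ((Word._Document)doc).Close(SaveChanges: false);
            }
            ((Word._Application)app).Quit(SaveChanges: false);
        }
    }
}
```

Each call starts and stops a Word process, so for many files it is probably worth reusing one Application instance. Like the indexing filters, this needs software installed on the machine that runs it.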

But how though? This is a useless response without a code sample.
– Kyle
Dec 27, 2011 at 19:48


Andrew Cash · Accepted Answer · 2010-09-07 14:42:13Z

You might want to look at PDFBox. Here is a link to a Code Project page that shows how to use it from C#, along with other useful comments.

http://www.codeproject.com/KB/string/pdf2text.aspx

As for Word, the suggestion of using the Word object model is probably the most accurate.


Bobrovsky · Accepted Answer · 2024-03-20 16:23:01Z

The Docotic.Pdf library can be used to extract text from PDF files.

The library can extract plain text and text with formatting. A collection of words or characters with their bounding rectangles can also be retrieved using the library's API.

Disclaimer: I work for the vendor of the library.

answered Apr 29, 2012 at 14:42
