Converting PDF to Text in C#

Omar 8/28/2016

There are three main ways to extract text from PDF files in .NET: the Microsoft IFilter interface together with the Adobe IFilter implementation, iTextSharp, and PDFBox.
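Since each of these approaches has its own prerequisites, one option is to try them in turn and keep the first usable result. The following is a minimal sketch of that idea, not part of any of the libraries: the `PdfTextFallback` class and `ExtractWithFallback` method are hypothetical names, and the extractors are passed in as plain delegates so the sketch compiles without any PDF library installed.

```csharp
using System;
using System.Collections.Generic;

public static class PdfTextFallback
{
    // Hypothetical helper: tries each extractor in order and returns the
    // first result that is not null, empty, or whitespace.
    public static string ExtractWithFallback(
        string path, IEnumerable<Func<string, string>> extractors)
    {
        foreach (var extract in extractors)
        {
            try
            {
                string text = extract(path);
                if (!string.IsNullOrWhiteSpace(text))
                {
                    return text;
                }
            }
            catch (Exception)
            {
                // Assumption: a failing extractor should not stop the chain,
                // so the exception is swallowed and the next one is tried.
            }
        }

        // No extractor produced any text.
        return string.Empty;
    }
}
```

Each delegate would wrap one of the `ExtractTextFromPdf` routines from the steps below. Whether swallowing exceptions is acceptable depends on the application; logging them may be preferable.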

C#
//Step 1
/*
To parse PDF files through the IFilter interface you need the following:
  Windows 2000 or later
  Adobe Acrobat or Reader 7.0.5 (or the standalone Adobe PDF IFilter [adobe.com])
  IFilter COM wrapper class [dotlucene.net]
Sample code:
*/

using IFilter;

// ...

public static string ExtractTextFromPdf(string path)
{
    return DefaultParser.Extract(path);
}


//Step 2
/*
iTextSharp is a .NET port of iText, a PDF manipulation library for Java. It focuses primarily on creating PDFs rather than reading them, but it also supports extracting text from a PDF.

*/
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// ...

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        // iTextSharp page numbers start at 1.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
        }

        return text.ToString();
    }
}


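//Step 3
/*
A variant of Step 2 that uses LocationTextExtractionStrategy, which orders text by its position on the page.
*/

public static string ExtractTextFromPdf(string path)
{
    using (PdfReader reader = new PdfReader(path))
    {
        StringBuilder text = new StringBuilder();

        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            // Create a fresh strategy for every page: a reused instance keeps
            // the text of earlier pages and would repeat it in later results.
            ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();
            string thePage = PdfTextExtractor.GetTextFromPage(reader, i, its);
            string[] theLines = thePage.Split('\n');

            foreach (var theLine in theLines)
            {
                text.AppendLine(theLine);
            }
        }

        return text.ToString();
    }
}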

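//Step 4
/*
PDFBox is a Java library; the org.apache.pdfbox namespaces below assume a .NET build of it (for example, one produced with IKVM).
*/

using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

private static string ExtractTextFromPdf(string path)
{
    PDDocument doc = null;
    try
    {
        doc = PDDocument.load(path);
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        // Always release the document, even if extraction throws.
        if (doc != null)
        {
            doc.close();
        }
    }
}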
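The Step 3 variant splits each page on '\n' only. If the extracted text happens to contain Windows-style or mixed line endings, that leaves stray '\r' characters at the end of lines. As a rough sketch, assuming mixed line endings are possible in the extractor output, a small helper can normalize them before the text is saved. `PageTextNormalizer` and `NormalizeLines` are hypothetical names, not part of iTextSharp or PDFBox.

```csharp
using System;
using System.Text;

public static class PageTextNormalizer
{
    // Hypothetical helper: splits on "\r\n", "\n", or "\r", trims trailing
    // whitespace from each line, and rejoins using the platform newline.
    public static string NormalizeLines(string pageText)
    {
        var result = new StringBuilder();

        // "\r\n" is listed first so a Windows line ending counts as one break.
        string[] lines = pageText.Split(
            new[] { "\r\n", "\n", "\r" }, StringSplitOptions.None);

        foreach (string line in lines)
        {
            result.AppendLine(line.TrimEnd());
        }

        return result.ToString();
    }
}
```

In the Step 3 loop, the call `text.Append(PageTextNormalizer.NormalizeLines(thePage))` could stand in for the manual split and `AppendLine` calls.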

