Read a Specific Page from a PDF File in Python


Document processing is one of the most common use cases for the Python programming language. Python can process many kinds of files, such as database files, multimedia files and encrypted files, to name a few. This article will teach you how to read a specific page from a PDF (Portable Document Format) file in Python.

Method 1: Using the PyMuPDF library to read a page in Python

The PIL (Python Imaging Library), along with the PyMuPDF library, will be used for PDF processing in this method. PIL is distributed today as the Pillow package. To install the PyMuPDF library, run the following command in the command processor of the operating system:

pip install pymupdf

Note: The PyMuPDF library is imported under the name fitz, using the following statement.

import fitz

Reading a page from a PDF file requires loading the file and then displaying the contents of only one of its pages. A single rendered page is essentially equivalent to an image. Therefore, the page from the PDF file can be read and displayed as an image.

The following example demonstrates the above process:

Python3

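import fitz  # PyMuPDF
from PIL import Image

# Path to the input PDF
input_file = r"test.pdf"

# Open the PDF document
file_handle = fitz.open(input_file)

# Load the first page (pages are zero-indexed)
page = file_handle[0]

# Render the page to a pixel map and save it as a PNG
page_img = page.get_pixmap()
page_img.save('PDF_page.png')

# Open the saved PNG with PIL and display it
img = Image.open('PDF_page.png')
img.show()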

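Output:

The first page of test.pdf is saved as PDF_page.png and opened in the system's default image viewer.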

Explanation:

First, the PDF file is opened, and its file handle is stored. Then the first page of the PDF (at index 0) is loaded using list-style indexing. This page's pixel map (pixel array) is obtained using the get_pixmap function, and the resulting pixel map is stored in a variable. The pixel map is then saved as a PNG image file. This PNG file is opened using the open function of the Image module of PIL. Finally, the image is displayed using the show function.

Note: The first open function (fitz.open) is used to open the PDF file, and the second one (Image.open) is used to open the PNG image file. The functions belong to different libraries and are used for different purposes.
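The steps above can also be wrapped in a small helper that takes a page number and skips the intermediate PNG file. The following is a minimal sketch, not part of the original example: the names to_index and render_page are hypothetical, page numbers are assumed to start at 1, and the dpi argument of get_pixmap is assumed to be available (it was added in newer PyMuPDF releases). It relies on get_pixmap producing RGB samples without an alpha channel by default.

```python
def to_index(page_number, page_count):
    """Convert a 1-based page number to a 0-based index, validating the range."""
    if not 1 <= page_number <= page_count:
        raise ValueError(f"page {page_number} is outside 1..{page_count}")
    return page_number - 1


def render_page(pdf_path, page_number, dpi=150):
    """Render one page of a PDF to a PIL image without writing a file."""
    # Imported here so that to_index() works even without the libraries installed.
    import fitz  # PyMuPDF
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        page = doc[to_index(page_number, doc.page_count)]
        # The default pixmap is RGB with no alpha channel.
        pix = page.get_pixmap(dpi=dpi)
        # Build the image directly from the pixmap's raw sample bytes.
        return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
```

For instance, render_page("test.pdf", 1).show() would display the first page of the sample file used in this article, and a page number beyond the end of the document raises a ValueError instead of an IndexError.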

Method 2: Reading a specific page from a PDF using PyPDF2

For the second example, the PyPDF2 library will be used. It can be installed by running the following command:

pip install PyPDF2

A specific page can also be read using the PyPDF2 library, this time as text rather than as an image. The library allows processing of PDF files and supports various operations such as reading, writing or creating a PDF file. For the task at hand, the extractText function is used to obtain the text from the PDF page and display it. The code for this is as follows:

Python3

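import PyPDF2

# Path to the input PDF
input_file = r"test.pdf"

# Zero-based index of the page to read (4 is the fifth page)
page = 4

# Open the PDF in binary read mode
pdfFileObj = open(input_file, 'rb')

# Create a reader object from the file object.
# PdfFileReader, getPage and extractText are the pre-3.0 PyPDF2 names;
# PyPDF2 3.0 and later use PdfReader, reader.pages[i] and extract_text().
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

# Get the page object and extract its text
pageObj = pdfReader.getPage(page)
data = pageObj.extractText()

# Close the file and print the extracted text
pdfFileObj.close()
print(data)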

Output:

He started this Journey with just one 
thought- every geek should have 
access to a never ending range of 
academic resources and with a lot 
of hardwork and determination, 
GeeksforGeeks was born.
Through this platform, he has 
successfully enriched the minds of 
students with knowledge which has 
led to a boost in their careers. But 
most importantly, GeeksforGeeks 
will always help students stay in 
touch with their Geeky side!
EXPERT ADVICE
CEO and Founder of 
GeeksforGeeks
                  I understand that many 
students who come to us are 
either fans of the sciences or 
have been pushed into this 
feild by their parents.
And I just want you to 
know that no matter 
where life takes you, we 
at GeeksforGeeks hope 
to have made this 
journey easier for  
you.Mr. Sandeep Jain
3

Explanation:

First, the path to the input PDF and the page number are defined in separate variables. Then the PDF file is opened, and its file object is stored in a variable. This variable is passed as an argument to the PdfFileReader function, which creates a PDF reader object from the file object. Next, the page at the index defined in the page variable is obtained and stored in a variable. The text is then extracted from that PDF page, and the file object is closed. Finally, the extracted text is displayed.
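The listing above uses the older PyPDF2 method names, which PyPDF2 3.0 and later no longer accept. As a rough sketch of the same steps with the newer names (PdfReader, reader.pages and extract_text), the version below also uses a with block so the file is closed automatically. The helper names page_text and read_page are hypothetical, and page numbers are assumed to start at 1 rather than 0.

```python
def page_text(pages, page_number):
    """Return the text of a 1-based page number from a sequence of page objects."""
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page {page_number} is outside 1..{len(pages)}")
    # Each page object is expected to provide an extract_text() method.
    return pages[page_number - 1].extract_text()


def read_page(pdf_path, page_number):
    """Open a PDF with PyPDF2 and return the text of one page."""
    # Imported here so that page_text() works even without PyPDF2 installed.
    from PyPDF2 import PdfReader

    # The with block closes the file even if extraction fails.
    with open(pdf_path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return page_text(reader.pages, page_number)
```

With this sketch, print(read_page("test.pdf", 5)) would read the same page as index 4 in the listing above, since page number 5 corresponds to zero-based index 4.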
