How To Extract Text From Image In Python using Pytesseract

Hey everyone, welcome to this tutorial on how to extract text from an image in Python. In this tutorial, you will learn how to extract text from an image using Python and the pytesseract library. Extracting text from an image is done with optical character recognition (OCR), which is a form of image processing. So let’s see how to do that.

Contents

1 What Is OCR (Optical Character Recognition)?
2 How To Extract Text From Image In Python
3 Related Articles :

What Is OCR (Optical Character Recognition)?

Introduction

OCR is a widespread technology for recognising text inside images, such as scanned documents and photos.
It is used to convert virtually any kind of image containing written text (typed, handwritten or printed) into machine-readable text data.

How To Implement OCR?

Now the question is how you can implement OCR. In Python you can use pytesseract, a third-party package that wraps the Tesseract OCR engine. That is, it will recognize and “read” the text embedded in images.

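What Is pytesseract?

pytesseract (Python-tesseract) is a Python wrapper for the Tesseract OCR engine. It recognizes and reads the text present in images.
It can read the image types supported by the Pillow library, including png, jpeg, gif, tiff and bmp. It’s widely used to process everything from scanned documents to photos.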

Installing pytesseract

To install pytesseract, run the following command in your terminal.


pip install pytesseract
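
If you want to confirm that the installation worked before going further, a small check like the one below can help. This is only a sketch: the helper name `missing_packages` and the `REQUIRED` mapping are made up for this example, and it assumes the rest of the tutorial needs the `pytesseract` and `PIL` (Pillow) modules.

```python
import importlib.util

# Module name to import -> package name to pass to pip.
# (Hypothetical mapping for this tutorial, not part of pytesseract.)
REQUIRED = {"pytesseract": "pytesseract", "PIL": "Pillow"}


def missing_packages(required=REQUIRED):
    """Return the pip names of the required modules that cannot be found."""
    return [
        pip_name
        for module_name, pip_name in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


missing = missing_packages()
if missing:
    print("Please install:", ", ".join(missing))
else:
    print("pytesseract and Pillow are installed.")
```

If anything is listed as missing, run pip install for that package and try again.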

How To Extract Text From Image In Python

So now we will see how to implement the program.

Downloading and Installing Tesseract

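The first thing you need to do is download and install Tesseract on your system. Tesseract is a popular open-source OCR engine, and pytesseract needs it because it only calls the Tesseract executable to do the actual recognition.

Download Tesseract from this link.
Then install it the same way you install any other software.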
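
After installing, you may want to check that the Tesseract executable can actually be found. The sketch below uses only the Python standard library. The function name `find_tesseract` is made up for this example, and the fallback path is the Windows location used later in this tutorial, so treat it as an assumption and adjust it if you installed Tesseract somewhere else.

```python
import shutil
from pathlib import Path

# Windows install location used later in this tutorial (an assumption;
# change it if your installer put Tesseract somewhere else).
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=DEFAULT_WINDOWS_PATH):
    """Return the path to the tesseract executable, or None if not found."""
    # First look on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise check the fallback location.
    if Path(fallback).is_file():
        return fallback
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract was not found. Install it or check the path.")
else:
    print("Tesseract found at:", tesseract_path)
```

The path printed here is the value you can assign to pytesseract.pytesseract.tesseract_cmd in the program.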

Creating New Project

Now create your project as usual. In my case, the project looks like this:

Python Program For How To Extract Text From Image

This is our image, and we want to extract the text from it.

So guys, now write the following code to extract the text from the above image. I will explain it later.


# Import modules
from PIL import Image
import pytesseract

# Tell pytesseract where the Tesseract executable is
# (change this path if you installed Tesseract somewhere else)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Open the image file as a PIL image object
image = Image.open('F:/imagess.jpg')

# Pass the image to pytesseract
# lang='eng' selects Tesseract's English language data
image_to_text = pytesseract.image_to_string(image, lang='eng')

# Print the text
print(image_to_text)

Explanation

First of all, you have to import the Image class from the PIL package. PIL stands for Python Imaging Library. The maintained version of the library is called Pillow, but it is still imported under the name PIL.
The Image class is required so that we can load our input image from disk in PIL format.
Then import pytesseract.
Now you have to tell pytesseract where the Tesseract executable is by setting tesseract_cmd. This is needed when the executable is not on your system PATH.
Then create a PIL image object by opening the image file.
Now pass that image to pytesseract.
image_to_string runs Tesseract OCR on the image and returns the recognized text as a string. The lang='eng' argument tells Tesseract to use its English language data.
Then finally print the text.
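
The steps above can also be wrapped in one reusable function. This is a minimal sketch, not the only way to do it: the function name `extract_text` and its parameters are made up for this example. It checks that the image file exists first, so a wrong path gives a clear error instead of a confusing one.

```python
from pathlib import Path


def extract_text(image_path, lang="eng", tesseract_cmd=None):
    """Run OCR on one image file and return the recognized text."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f"No image found at {path}")

    # Imported here so the path check above still works
    # even before the third-party packages are installed.
    from PIL import Image
    import pytesseract

    # Only override the executable location when one is given.
    if tesseract_cmd:
        pytesseract.pytesseract.tesseract_cmd = tesseract_cmd

    # The with block closes the image file once OCR is done.
    with Image.open(path) as image:
        return pytesseract.image_to_string(image, lang=lang)
```

With the image and install path from this tutorial, you would call it as extract_text('F:/imagess.jpg', tesseract_cmd=r"C:\Program Files\Tesseract-OCR\tesseract.exe") and print the result.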

Output

Now run the above code and check the output.

So guys, you can see the code is working successfully. The text has been extracted from the image.
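
The raw string returned by OCR can contain blank lines and extra spaces, and some Tesseract versions may also end the text with a form-feed character. If you want tidier output, a small helper like the one below can clean it up. The name `clean_ocr_text` is made up for this example, and it is just one simple way to do the clean-up.

```python
def clean_ocr_text(raw_text):
    """Strip whitespace around each line and drop empty lines."""
    # splitlines() also splits on a form-feed character, so a trailing
    # page separator ends up as an empty line and gets dropped.
    lines = (line.strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


sample = "Hello  \n\n  World\n\x0c"
print(clean_ocr_text(sample))
```

In the program above you could call print(clean_ocr_text(image_to_text)) instead of printing the raw text.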

So this was all about the How To Extract Text From Image In Python tutorial. I hope you have learned a lot from it. If you have any confusion regarding this tutorial, feel free to ask. And please share it with your friends and other Python learners. Thanks, everybody.