Detect and Extract table data using OpenCV

This example demonstrates how to use OpenCV for table data detection and extraction. We’ll be analyzing some example outputs generated by the following code. The Colab link for this code can be found at the end of the page.

Example output:

Output of table detected
Output of table with cells detected
Cropped image of a table cell
Output of table extracted data
from google.colab.patches import cv2_imshow
import pandas as pd
import cv2
import numpy as np
import easyocr
from tabulate import tabulate

# EasyOCR reader for Thai and English text
reader = easyocr.Reader(['th','en'])

def table_detection(img_path):
    img = cv2.imread(img_path)
    if img is None:
        # cv2.imread returns None instead of raising when the file cannot be read
        raise FileNotFoundError(img_path)
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # With THRESH_OTSU the threshold is chosen automatically; the 128 is ignored
    (thresh, img_bin) = cv2.threshold(img_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # Invert so that lines and text are white on a black background
    img_bin = cv2.bitwise_not(img_bin)

    # Keep only long vertical strokes: erode, then dilate with a tall, 1-pixel-wide kernel
    kernel_length_v = (np.array(img_gray).shape[1])//120
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, kernel_length_v))
    im_temp1 = cv2.erode(img_bin, vertical_kernel, iterations=3)
    vertical_lines_img = cv2.dilate(im_temp1, vertical_kernel, iterations=3)

    # Keep only long horizontal strokes: same idea with a wide, 1-pixel-tall kernel
    kernel_length_h = (np.array(img_gray).shape[1])//40
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_length_h, 1))
    im_temp2 = cv2.erode(img_bin, horizontal_kernel, iterations=3)
    horizontal_lines_img = cv2.dilate(im_temp2, horizontal_kernel, iterations=3)

    # Merge both line images, invert, thicken the lines and binarize again
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    table_segment = cv2.addWeighted(vertical_lines_img, 0.5, horizontal_lines_img, 0.5, 0.0)
    table_segment = cv2.erode(cv2.bitwise_not(table_segment), kernel, iterations=2)
    thresh, table_segment = cv2.threshold(table_segment, 0, 255, cv2.THRESH_OTSU)

    contours, hierarchy = cv2.findContours(table_segment, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

    full_list=[]   # all rows, in the order the contours are visited
    row=[]         # cells of the row currently being collected
    first_iter=0
    firsty=-1

    for c in contours:
        x, y, w, h = cv2.boundingRect(c)

        # Treat contours with a plausible cell height as table cells
        if h > 9 and h<100:
            if first_iter==0:
                first_iter=1
                firsty=y
            # A different y coordinate means a new table row has started
            if firsty!=y:
                row.reverse()
                full_list.append(row)
                row=[]
            print(x,y,w,h)
            cropped = img[y:y + h, x:x + w]
            cv2_imshow(cropped)
            bounds = reader.readtext(cropped)

            # Use the first recognized text, or "--" when OCR finds nothing
            text = bounds[0][1] if bounds else "--"
            row.append([text, w])
            firsty=y
            cv2.rectangle(img,(x, y),(x + w, y + h),(0, 255, 0), 2)

    # Flush the last collected row; otherwise it is never added to full_list
    if row:
        row.reverse()
        full_list.append(row)

    cv2_imshow(img)
    full_list.reverse()
    print(full_list)

    # Keep only the text of each cell and drop the stored width
    new_data=[]
    for i in full_list:
        new_row=[]
        for j in i:
            new_row.append(j[0])
        new_data.append(new_row)
    print(new_data)

    # Convert list of lists into a DataFrame; rows shorter than the widest row are padded with ''
    df = pd.DataFrame(new_data)
    df = df.fillna('')
    table = tabulate(df, headers='firstrow', tablefmt='grid')

    # Print DataFrame
    print(table)

table_detection("/content/table.png")

1. Image Loading and Grayscale Conversion:

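cv2.imread(img_path) loads the image as a three-channel array in BGR order, and cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) converts it to a single-channel grayscale image.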

2. Thresholding and Binarization:

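(thresh, img_bin) = cv2.threshold(img_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU): Applies Otsu’s method to convert the grayscale image to binary (black and white). Otsu’s method is a global technique, not an adaptive one: it computes a single threshold from the image histogram, so the 128 passed here is ignored and the value actually used is returned in thresh. The next line, img_bin = cv2.bitwise_not(img_bin), inverts the result so that the dark foreground (table lines and text) becomes white on a black background, which is the form the morphological operations in the next step work on.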

3. Vertical and Horizontal Line Detection:

  1. Kernel Creation:

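kernel_length_v = (np.array(img_gray).shape[1]) // 120 (vertical) and kernel_length_h = (np.array(img_gray).shape[1]) // 40 (horizontal): Calculate the kernel lengths. In the code above both are derived from the image width (shape[1]): the vertical kernel is 1/120 of the width and the horizontal kernel is 1/40 of the width, so the kernels scale with the size of the input image.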

vertical_kernel and horizontal_kernel: Create structuring elements (kernels) for morphological operations. These are essentially small rectangles used for erosion and dilation.

  2. Erosion and Dilation:

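im_temp1 (vertical) and im_temp2 (horizontal): Apply erosion with the matching line-shaped kernel for three iterations. Only strokes that run long enough in the kernel’s direction survive, so text and small noise disappear and the remaining pixels belong to candidate table lines.

vertical_lines_img (vertical) and horizontal_lines_img (horizontal): Apply dilation with the same kernel and the same number of iterations to grow the surviving lines back to their original length.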

  3. Combining Vertical and Horizontal Lines:

kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3)): Creates a square kernel for further processing.

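table_segment = cv2.addWeighted(vertical_lines_img, 0.5, horizontal_lines_img, 0.5, 0.0): Combines the detected vertical and horizontal lines using a weighted average. With equal weights of 0.5, pixels that lie on only one kind of line end up at roughly half intensity, intersections stay at 255, and everything else stays at 0.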

4. Additional Erosion and Thresholding:

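table_segment = cv2.erode(cv2.bitwise_not(table_segment), kernel, iterations=2): Inverts the combined image so that the lines are dark on a white background, then erodes it. Eroding the white regions makes the dark lines thicker, which helps close small gaps so that each cell becomes a separate enclosed white region.

thresh, table_segment = cv2.threshold(table_segment, 0, 255, cv2.THRESH_OTSU): Applies Otsu thresholding again to turn the result back into a clean two-level image, with the grid lines in black and the cell interiors in white.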

5. Cell Detection and Extraction:

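contours, hierarchy = cv2.findContours(table_segment, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE): Finds contours (boundaries) of potential cells in the table segment image. For each contour, cv2.boundingRect(c) returns the bounding box (x, y, w, h). Boxes whose height is between 9 and 100 pixels are treated as cells: each one is cropped from the original image and passed to reader.readtext, and the first recognized text (or "--" when nothing is found) is stored for that cell. A change in the y coordinate between consecutive cells is taken as the start of a new row.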
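The code above starts a new row whenever the y coordinate of a bounding box differs from the previous one, which depends on the order in which contours are returned and on cells in a row sharing exactly the same y value. As an alternative sketch (not the method used above), bounding boxes can be grouped into rows with a small tolerance and then sorted left to right. The function name group_boxes_into_rows and the y_tolerance parameter are hypothetical, and the box values are made up.

```python
def group_boxes_into_rows(boxes, y_tolerance=5):
    """Group (x, y, w, h) boxes into rows: top to bottom, then left to right.

    y_tolerance is an assumed value; a box joins the current row when its y
    is within this many pixels of the first box in that row.
    """
    rows = []
    for box in sorted(boxes, key=lambda b: (b[1], b[0])):
        if rows and abs(box[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(box)
        else:
            rows.append([box])
    # Order the cells of each row by their x coordinate.
    return [sorted(row, key=lambda b: b[0]) for row in rows]


# Four cells of a 2x2 table, listed in arbitrary order with slightly uneven y values.
boxes = [(48, 33, 35, 20), (8, 8, 35, 20), (48, 9, 35, 20), (8, 34, 35, 20)]
rows = group_boxes_into_rows(boxes)
for row in rows:
    print(row)
```

Each inner list can then be cropped and passed to OCR in reading order, without relying on the order of the contour list.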
