Crawler: 9. CAPTCHA Recognition

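CAPTCHA recognition is an essential skill for writing crawlers, but today's CAPTCHAs come in endless varieties. This tutorial only covers the simpler ones; CAPTCHAs that are hard even for the human eye, or whose characters are distorted and run into one another, are also very hard to recognize correctly.
We are not aiming for 100% accuracy; reaching even 10% is already a good result.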

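Recognition approach:

graph LR
A[OpenCV image preprocessing] --> B[pytesseract recognition]

If the accuracy is poor, consider manually training Tesseract on the preprocessed images and recognizing with the resulting trained language pack.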
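To judge whether the accuracy is "poor", it helps to measure it on a handful of CAPTCHAs whose answers were typed in by hand. The sketch below is one minimal way to do that, not part of the original workflow: `recognize` stands for whatever recognition function you end up with, and `fake_recognize`, the file names and the labels are all made up purely for illustration.

```python
def accuracy(recognize, labelled_samples):
    """Fraction of samples for which recognize(path) matches the expected text.

    labelled_samples is a list of (path, expected_text) pairs.
    """
    if not labelled_samples:
        return 0.0
    correct = 0
    for path, expected in labelled_samples:
        # OCR output often carries stray whitespace, so strip it before comparing
        if recognize(path).strip() == expected:
            correct += 1
    return correct / float(len(labelled_samples))


# Hypothetical stand-in for a real recognizer, used only to exercise accuracy()
def fake_recognize(path):
    canned = {"a.png": "x7k2\n", "b.png": "9pq1", "c.png": ""}
    return canned.get(path, "")


# Hypothetical hand-labelled samples: (file name, correct answer)
samples = [("a.png", "x7k2"), ("b.png", "9pql"), ("c.png", "ab12"), ("d.png", "zz00")]
print(accuracy(fake_recognize, samples))
```

Running the same measurement before and after changing a preprocessing step shows whether the change actually helped.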

Recognition process

graph LR
A[Grayscale conversion] --> B[Denoising]
B --> C[Otsu's binarization]
C --> D[pytesseract recognition]

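1. Grayscale conversion

Grayscale conversion can be implemented with several different algorithms. Here we use the function imread, which converts the image to grayscale as it is loaded.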

import cv2
img = cv2.imread('image.png', 0)

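The first argument is the file name. The second argument is the read flag; the two values used here are 0, which stands for cv2.IMREAD_GRAYSCALE and reads the image as grayscale,
and 1, which stands for cv2.IMREAD_COLOR and reads the image in color.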
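To make "different algorithms" concrete, here is a small pure-Python sketch comparing two common ways of turning one RGB pixel into a gray value. The weights 0.299, 0.587 and 0.114 are the ones OpenCV documents for its color-to-gray conversion; whether imread applies exactly this formula when loading a particular file format is an assumption here. The function names are illustrative only.

```python
def luminance_gray(r, g, b):
    """Weighted sum that reflects how bright each channel looks to the eye."""
    return int(round(0.299 * r + 0.587 * g + 0.114 * b))


def average_gray(r, g, b):
    """Plain average of the three channels."""
    return int(round((r + g + b) / 3.0))


# Pure green and pure blue get the same gray value with the plain average,
# but clearly different values with the luminance weights.
for name, pixel in [("green", (0, 255, 0)), ("blue", (0, 0, 255))]:
    print(name, luminance_gray(*pixel), average_gray(*pixel))
```

The choice matters for CAPTCHAs that use colored characters on a colored background: two colors that collapse to the same gray value can no longer be separated by thresholding.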

2. Denoising

Denoising is usually done by smoothing, that is, blurring, the image. OpenCV provides four ways to smooth an image:

1) Averaging filter (homogeneous blur)

blur = cv2.blur(img,(5,5))

2) Gaussian filter (Gaussian blur)

blur = cv2.GaussianBlur(img,(5,5),0)

3) Median filter (median blur)

median = cv2.medianBlur(img,5)

4) Bilateral filter (bilateral blur)

blur = cv2.bilateralFilter(img,9,75,75)
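The difference between these filters is easiest to see on a tiny image. The following pure-Python sketch applies a 3x3 averaging filter and a 3x3 median filter to a white image with a single black noise pixel. It illustrates the idea only and is not how OpenCV implements the filters: the 3x3 window, the integer rounding and the untouched border are simplifying assumptions.

```python
def filter_3x3(image, reducer):
    """Apply reducer to every 3x3 neighbourhood; border pixels are copied unchanged."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            result[y][x] = reducer(window)
    return result


def mean(values):
    # Integer average of the window
    return sum(values) // len(values)


def median(values):
    # Middle element of the sorted window
    return sorted(values)[len(values) // 2]


# 5x5 white background (255) with one black noise pixel in the centre
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0

mean_result = filter_3x3(noisy, mean)
median_result = filter_3x3(noisy, median)

print(mean_result[2])    # middle row after averaging
print(median_result[2])  # middle row after the median filter
```

Averaging smears the noise pixel into its neighbours as a faint gray patch, while the median filter discards it completely, which is why median filtering suits isolated speckle noise.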

In addition, four built-in functions can also be used for denoising:

1. cv2.fastNlMeansDenoising() - works with a single grayscale image
2. cv2.fastNlMeansDenoisingColored() - works with a color image.
3. cv2.fastNlMeansDenoisingMulti() - works with an image sequence captured in a short period of time (grayscale images)
4. cv2.fastNlMeansDenoisingColoredMulti() - same as above, but for color images.

Testing on the CAPTCHAs we encountered showed that the Gaussian filter and the bilateral filter gave the best denoising results.

3. Otsu's binarization

ret, th = cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)
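In the call above, ret is the threshold that Otsu's method selected and th is the binarized image. To show what that selection means, here is a pure-Python sketch of the textbook formulation: try every gray level and keep the one that maximizes the variance between the two resulting classes. It is written for readability, and the sample pixel values are made up for illustration; OpenCV's own implementation may differ in details such as tie-breaking.

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximises the between-class variance
    when pixels are split into values <= t and values > t."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for level in range(256):
        weight_low += histogram[level]
        sum_low += level * histogram[level]
        weight_high = total - weight_low
        if weight_low == 0:
            continue  # nothing in the dark class yet
        if weight_high == 0:
            break     # everything is in the dark class from here on
        mean_low = sum_low / float(weight_low)
        mean_high = (total_sum - sum_low) / float(weight_high)
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Made-up sample: four dark "character" pixels and four bright "background" pixels
pixels = [20, 22, 25, 30, 200, 210, 220, 230]
t = otsu_threshold(pixels)

# Same rule as cv2.THRESH_BINARY: values above the threshold become 255, the rest 0
binary = [255 if value > t else 0 for value in pixels]
print(t, binary)
```

Because the threshold is derived from the histogram of each image, the same call works on CAPTCHAs with different brightness levels without hand-tuning.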

4. Recognition with pytesseract

pytesseract.image_to_string(image1, lang='eng')

Here lang selects the language pack, in this case eng (English).
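Besides lang, image_to_string also accepts a config string that is handed to Tesseract. The sketch below is a suggestion, not part of the original workflow: it builds such a string with `--psm 7` (treat the image as a single line of text) and an optional character whitelist. The helper names are hypothetical, and whether the whitelist is honored depends on the installed Tesseract version and engine, so treat it as something to try.

```python
def build_tesseract_config(psm=7, whitelist=None):
    """Build the string passed as image_to_string(..., config=...)."""
    parts = ["--psm {}".format(psm)]
    if whitelist:
        # Ask Tesseract to restrict its output to these characters
        parts.append("-c tessedit_char_whitelist={}".format(whitelist))
    return " ".join(parts)


def recognize(image, lang="eng", psm=7, whitelist=None):
    """Run OCR on a PIL image and return the text without surrounding whitespace."""
    # Imported here so the config helper can be used without Tesseract installed
    import pytesseract
    config = build_tesseract_config(psm, whitelist)
    return pytesseract.image_to_string(image, lang=lang, config=config).strip()


# For a digits-only CAPTCHA one might try:
print(build_tesseract_config(whitelist="0123456789"))
```

A CAPTCHA is normally one short line of text, so narrowing the layout analysis and the alphabet removes two sources of misreads that preprocessing alone cannot fix.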

Sample code

# -*- coding:utf-8 -*-

"""
File Name : 'test'.py
Description:
Author: 'chengwei'
Date: '2016/5/24' '10:17'
python:2.7.10
"""
import cv2
import numpy as np
from matplotlib import pyplot as plt
from PIL import Image
import pytesseract
import os
import time


# Directory containing the CAPTCHA image files (listed by get_files below)
image_path = "D:\\test_img"

def get_files(path):
    # Return the full path of every regular file directly inside path
    file_list = []
    for f in os.listdir(path):
        full_path = os.path.join(path, f)
        if os.path.isfile(full_path):
            file_list.append(full_path)
    return file_list

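# Gaussian filter
def guassian_blur(img, a, b):
    # (a, b) is the Gaussian kernel size (both values must be positive and odd);
    # the 0 is the standard deviation, which lets OpenCV derive it from the kernel size.
    # a = b = 5 is a typical choice.
    blur = cv2.GaussianBlur(img,(a,b),0)
    # With cv2.THRESH_OTSU the threshold is chosen automatically,
    # so otsu_s passes 0 as the threshold argument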
    ret, th = otsu_s(blur)
    return ret, th

# Averaging filter
def hamogeneous_blur(img):
    blur = cv2.blur(img,(5,5))
    ret, th = otsu_s(blur)
    return ret, th

# Median filter
def median_blur(img):
    blur = cv2.medianBlur(img,5)
    ret, th = otsu_s(blur)
    return ret, th

# Bilateral filter
def bilatrial_blur(img):
    blur = cv2.bilateralFilter(img,9,75,75)
    ret, th = otsu_s(blur)
    return ret, th

def otsu_s(img):
    ret, th = cv2.threshold(img,0,255,cv2.THRESH_BINARY+cv2.THRESH_OTSU)
    return ret, th

def main():
    """
    Test Otsu's binarization after blurring
    :return:
    """
    file_list = get_files(image_path)
    for filename in file_list:
        print(filename)
        img = cv2.imread(filename, 0)
        ret1, th1 = guassian_blur(img, 5, 5)
        ret2, th2 = bilatrial_blur(img)

        # Save the binarized images so that PIL can reopen them below
        cv2.imwrite('temp1.png', th1)
        cv2.imwrite('temp2.png', th2)

        # Show the original and both binarized versions side by side
        titles = ['original', 'gaussian', 'bilateral']
        images = [img, th1, th2]
        for i in range(3):
            plt.subplot(1, 3, i + 1)
            plt.imshow(images[i], 'gray')
            plt.title(titles[i])
            plt.xticks([])
            plt.yticks([])
        plt.show()

        image1 = Image.open("temp1.png")
        image2 = Image.open("temp2.png")
        image3 = Image.open(filename)

        # OCR the Gaussian, bilateral and unprocessed images for comparison
        print(pytesseract.image_to_string(image1, lang='eng'))
        print(pytesseract.image_to_string(image2, lang='eng'))
        print(pytesseract.image_to_string(image3, lang='eng'))

if __name__ == '__main__':
    main()

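Additional notes:

If the accuracy is not good enough, it can be improved by manual training; for the training procedure, see http://www.cnblogs.com/samlin/p/Tesseract-OCR.html
OpenCV has many powerful features and this is only the tip of the iceberg; if you are interested, visit its official homepage.
Is there a better library? scikit?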
