String-Based Fuzzy Matching

Posted by Mike617 on 2019-07-30 18:39 / 521 views

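Abstract: Some of the residential-community names and street names stored in our database contain abbreviations, typos, and other non-standard spellings, and these needed to be corrected. Edit distance is a string metric that measures how different two strings are.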

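Some of the residential-community names and street names stored in our database contain abbreviations, typos, and other non-standard spellings, and these needed to be corrected. During the correction work we used edit distance to match each stored name accurately against a reference table.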
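The lookup against a reference table can be sketched as follows. This is only an illustration: it uses the standard library's `difflib.get_close_matches` as a stand-in scorer rather than the edit-distance functions discussed in this post, and the table contents, the `correct_name` helper, and the `cutoff` value are all hypothetical.

```python
from difflib import get_close_matches

# Hypothetical reference table of canonical names (not from the original data).
REFERENCE_NAMES = [
    "Century Pearl No.45",
    "Sunshine Garden No.12",
    "Peace Road No.8",
]


def correct_name(raw_name, reference=REFERENCE_NAMES, cutoff=0.7):
    """Return the closest canonical name, or None if nothing is similar enough.

    difflib scores with SequenceMatcher.ratio(), which is NOT an edit
    distance; it is used here only as a readily available stand-in.
    """
    candidates = get_close_matches(raw_name, reference, n=1, cutoff=cutoff)
    return candidates[0] if candidates else None


print(correct_name("Saiding Line Century Pearl No.45"))  # Century Pearl No.45
```

Returning `None` below the cutoff keeps clearly unrelated strings from being "corrected" to an arbitrary table entry; the right cutoff depends on the data and would need tuning.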

Edit distance

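1. Levenshtein distance is a string metric that measures how different two strings are. It can be thought of as the minimum number of single-character edits (substitutions, insertions, or deletions) needed to turn one string into the other.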

2. Jaro distance

3. Jaro-Winkler distance
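For reference, the two Jaro variants listed above are usually defined through a similarity score, as sketched below. Here m is the number of matching characters, t is half the number of matched characters that appear in a different order in the two strings, ℓ is the length of the common prefix (commonly capped at 4), and p is a prefix scaling factor (commonly 0.1). In the standard definition, two characters only count as matching if their positions differ by no more than ⌊max(|s1|, |s2|) / 2⌋ − 1.

```latex
\[
\mathrm{sim}_j =
\begin{cases}
0, & m = 0,\\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right), & m > 0,
\end{cases}
\qquad
\mathrm{sim}_w = \mathrm{sim}_j + \ell\, p\,\bigl(1 - \mathrm{sim}_j\bigr)
\]
```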

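Note: for each of these, similarity = 1 - distance (for Levenshtein, this assumes the distance has first been normalized to the range [0, 1]).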
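A minimal sketch of Levenshtein distance and a similarity derived from it is shown below. The function names are illustrative, and dividing by the length of the longer string is just one common way to normalize the distance into [0, 1]; it is an assumption here, not something taken from a particular library.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning `a` into `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca -> cb (or keep)
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
```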

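The Jaro distance relies on a local matching window: characters that lie farther apart than the window allows are not counted as matches, even when the same substring appears in both strings. Most of our business data carries a fairly long prefix, which hurts the accuracy of the final match. We therefore widened the window to the length of the longer of the two strings being compared, which meant modifying part of the package's source code. The Python code is as follows: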

def count_matches(s1, s2, len1, len2):
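    # s1 must be the shorter, non-empty string; string_metrics() guarantees this.
    assert len1 and len1 <= len2
    # Original package: the match window is limited to max(len2 // 2 - 1, 0).
    # search_range = max(len2//2-1, 0)
    # Modified: widen the window to the full length of the longer string so a
    # long prefix cannot push matching characters out of range.
    search_range = len2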
    num_matches = 0

    flags1 = [0] * len1
    flags2 = [0] * len2

    for i, char in enumerate(s1):

        lolim = max(i - search_range, 0)
        hilim = min(i + search_range, len2 - 1)

        for j in range(lolim, hilim + 1):

            if not flags2[j] and char == s2[j]:
                flags1[i] = flags2[j] = 1
                # where_matched[i] = j
                num_matches += 1
                break
    return num_matches, flags1, flags2  # , where_matched

def count_half_transpositions(s1, s2, flags1, flags2):
    half_transposes = 0
    k = 0

    for i, flag in enumerate(flags1):
        if not flag: continue
        while not flags2[k]: k += 1
        if s1[i] != s2[k]:
            half_transposes += 1
        k += 1
    return half_transposes

def count_typos(s1, s2, flags1, flags2, typo_table):
    assert 0 in flags1

    typo_score = 0
    for i, flag1 in enumerate(flags1):
        if flag1: continue  # Iterate through unmatched chars
        row = s1[i]
        if row not in typo_table:
            # If we don't have a similarity mapping for the char, continue
            continue
        typo_row = typo_table[row]

        for j, flag2 in enumerate(flags2):
            if flag2: continue
            col = s2[j]
            if col not in typo_row: continue

            # print "Similarity!", row, col
            typo_score += typo_row[col]
            flags2[j] = 2
            break
    return typo_score, flags2

def fn_jaro(len1, len2, num_matches, half_transposes, typo_score, typo_scale):
    if not len1:
        if not len2: return 1.0
        return 0.0
    if not num_matches: return 0.0

    similar = (typo_score / typo_scale) + num_matches
    weight = (similar / len1
              + similar / len2
              + (num_matches - half_transposes // 2) / num_matches)

    return weight / 3

def string_metrics(s1, s2, typo_table=None, typo_scale=1, boost_threshold=None,
                   pre_len=0, pre_scale=0, longer_prob=False):
    len1 = len(s1)
    len2 = len(s2)

    if len2 < len1:
        s1, s2 = s2, s1
        len1, len2 = len2, len1
    assert len1 <= len2

    if not (len1 and len2): return len1, len2, 0, 0, 0, 0, False

    num_matches, flags1, flags2 = count_matches(s1, s2, len1, len2)

    # If no characters in common - return
    if not num_matches: return len1, len2, 0, 0, 0, 0, False

    half_transposes = count_half_transpositions(s1, s2, flags1, flags2)

    # adjust for similarities in non-matched characters
    typo_score = 0
    if typo_table and len1 > num_matches:
        typo_score, flags2 = count_typos(s1, s2, flags1, flags2, typo_table)

    if not boost_threshold:
        # Same 7-tuple shape as the other return paths (last item is a bool)
        return len1, len2, num_matches, half_transposes, typo_score, 0, False

    pre_matches = 0
    adjust_long = False
    weight_typo = fn_jaro(len1, len2, num_matches, half_transposes,
                          typo_score, typo_scale)

    # Continue to boost the weight if the strings are similar
    if weight_typo > boost_threshold:
        # Adjust for having up to first "pre_len" chars (not digits) in common
        limit = min(len1, pre_len)
        while pre_matches < limit:
            char1 = s1[pre_matches]
            if not (char1.isalpha() and char1 == s2[pre_matches]):
                break
            pre_matches += 1

        if longer_prob:
            cond = len1 > pre_len
            cond = cond and num_matches > pre_matches + 1
            cond = cond and 2 * num_matches >= len1 + pre_matches
            cond = cond and s1[0].isalpha()
            if cond:
                adjust_long = True

    return (len1, len2, num_matches, half_transposes,
            typo_score, pre_matches, adjust_long)

def metric_jaro(string1, string2):
    "The standard, basic Jaro string metric."

    ans = string_metrics(string1, string2)
    len1, len2, num_matches, half_transposes = ans[:4]
    assert ans[4:] == (0, 0, False)
    return fn_jaro(len1, len2, num_matches, half_transposes, 0, 1)

def metric_jaro_score(s1, s2):
    return metric_jaro(s1, s2)

print(metric_jaro_score("賽鼎線世紀明珠45號", "世紀明珠45號"))
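The effect of widening the matching window can be seen in isolation with a small experiment. The sketch below reimplements only the matching step in a simplified form; the helper name `count_window_matches` and the ASCII sample strings are made up for illustration. The shorter string appears verbatim in the longer one, but behind a 10-character prefix.

```python
def count_window_matches(short, long_, search_range):
    """Count Jaro-style matches of `short` in `long_` within +/- search_range."""
    used = [False] * len(long_)  # each character of long_ may match only once
    matches = 0
    for i, ch in enumerate(short):
        lo = max(i - search_range, 0)
        hi = min(i + search_range, len(long_) - 1)
        for j in range(lo, hi + 1):
            if not used[j] and ch == long_[j]:
                used[j] = True
                matches += 1
                break
    return matches


short = "pearl45"
long_ = "zzzzzzzzzz" + "pearl45"  # same name behind a 10-character prefix

standard_range = max(len(long_) // 2 - 1, 0)  # 7 for a 17-character string
print(count_window_matches(short, long_, standard_range))  # 0
print(count_window_matches(short, long_, len(long_)))      # 7
```

With the standard window every character of `short` sits 10 positions away from its counterpart, beyond the allowed 7, so nothing matches and the Jaro score would be 0. With the full-length window all 7 characters match. The trade-off is that a wider window also lets unrelated, far-apart characters match, so scores for dissimilar strings can rise as well.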

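Copyright of this article belongs to the author; please do not repost without permission. If this article violates any rules, you can contact the administrator to have it removed.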

When reposting, please cite this article's address: http://systransis.cn/yun/42818.html
