亚洲综合图片区自拍_思思91精品国产综合在线观看_一区二区三区欧美_欧美黑人又粗又大_亚洲人成精品久久久久桥本

世界即時(shí):【算法與數(shù)據(jù)結(jié)構(gòu)】Trie樹簡(jiǎn)介及應(yīng)用

2023-01-30 12:24:08 來源:51CTO博客

作者:京東物流 馬瑞

1 什么是Trie樹

1.1 Trie樹的概念

Trie樹,即字典樹,又稱單詞查找樹或鍵樹,是一種樹形結(jié)構(gòu),典型應(yīng)用是用于統(tǒng)計(jì),排序和保存大量的字符串(但不僅限于字符串),所以經(jīng)常被搜索引擎系統(tǒng)用于文本詞頻統(tǒng)計(jì)。它的優(yōu)點(diǎn)是:利用字符串的公共前綴來減少查詢時(shí)間,最大限度地減少無謂的字符串比較,查詢效率比哈希樹高。


(資料圖)

Trie, also called digital tree and sometimes radix tree or prefix tree (as they can be searched by prefixes), is a kind of search tree—an ordered tree data structure that is used to store a dynamic set or associative array where the keys are usually strings. It is one of those data-structures that can be easily implemented.

1.2 Trie樹優(yōu)點(diǎn)

最大限度地減少無謂的字符串比較,查詢效率比較高。核心思想是空間換時(shí)間,利用字符串的公共前綴來降低查詢時(shí)間的開銷以達(dá)到提高效率的目的。

插入、查找的時(shí)間復(fù)雜度均為O(N),其中N為字符串長(zhǎng)度??臻g復(fù)雜度是26^n級(jí)別的,非常龐大(可采用雙數(shù)組實(shí)現(xiàn)改善)。

1.3 Trie樹的三個(gè)基本性質(zhì)

根節(jié)點(diǎn)不包含字符,除根節(jié)點(diǎn)外每一個(gè)節(jié)點(diǎn)都只包含一個(gè)字符從根節(jié)點(diǎn)到某一節(jié)點(diǎn),路徑上經(jīng)過的字符連接起來,為該節(jié)點(diǎn)對(duì)應(yīng)的字符串每個(gè)節(jié)點(diǎn)的所有子節(jié)點(diǎn)包含的字符都不相同

2 Trie樹數(shù)據(jù)結(jié)構(gòu)

以字符串”hi”和”經(jīng)海路”的數(shù)據(jù)為例:

Java的數(shù)據(jù)結(jié)構(gòu)定義:

@Datapublic class TrieTreeNode {    private Character data;    private Map children;    private boolean isEnd;    //  前綴,冗余信息,可選    private String prefix;    //  后綴,冗余信息,可選    private String suffix;}

如果只是處理26個(gè)英文字符,data可以通過children數(shù)組是否為空來判斷。如果處理程序,默認(rèn)children為空來判斷是否為最后一個(gè)節(jié)點(diǎn),則isEnd字段可以省略。前綴prefix和suffix可以在數(shù)據(jù)處理的時(shí)候,方便拿到當(dāng)前節(jié)點(diǎn)前綴和后綴信息,如果不需要可以去除。

3 Trie樹在臟話過濾中的應(yīng)用

3.1 臟話關(guān)鍵詞Keyword定義

public class KeyWord implements Serializable {    /**     * 關(guān)鍵詞內(nèi)容     */    private String word;//其他省略}

3.2 關(guān)鍵詞查詢器

public class KWSeeker {    /**     * 所有的關(guān)鍵詞     */    private Set sensitiveWords;    /**     * 關(guān)鍵詞樹     */    private Map wordsTree = new ConcurrentHashMap();    /**     * 最短的關(guān)鍵詞長(zhǎng)度。用于對(duì)短于這個(gè)長(zhǎng)度的文本不處理的判斷,以節(jié)省一定的效率     */    private int wordLeastLen = 0;//其他處理方法,省略}

3.3 字符串構(gòu)造一棵樹

/**    * 將指定的詞構(gòu)造到一棵樹中。    *     * @param tree 構(gòu)造出來的樹    * @param word 指定的詞    * @param KeyWord 對(duì)應(yīng)的詞    * @return    */public static Map makeTreeByWord(Map tree, String word,        KeyWord KeyWord) {    if (StringUtils.isEmpty(word)) {        tree.putAll(EndTagUtil.buind(KeyWord));        return tree;    }    String next = word.substring(0, 1);    Map nextTree = tree.get(next);    if (nextTree == null) {        nextTree = new HashMap();    }    // 遞歸構(gòu)造樹結(jié)構(gòu)    tree.put(next, makeTreeByWord(nextTree, word.substring(1), KeyWord));    return tree;}

對(duì)關(guān)鍵詞字符串,逐個(gè)字符進(jìn)行構(gòu)建。

3.4 詞庫樹的生成

/**    * 構(gòu)造、生成詞庫樹。并返回所有敏感詞中最短的詞的長(zhǎng)度。    *     * @param sensitiveWords 詞庫    * @param wordsTree 聚合詞庫的樹    * @return 返回所有敏感詞中最短的詞的長(zhǎng)度。    */public int generalTree(Set sensitiveWords, Map wordsTree) {    if (sensitiveWords == null || sensitiveWords.isEmpty() || wordsTree == null) {        return 0;    }    wordsTreeTmp.clear();    int len = 0;    for (KeyWord kw : sensitiveWords) {        if (len == 0) {            len = kw.getWordLength();        } else if (kw.getWordLength() < len) {            len = kw.getWordLength();        }        AnalysisUtil                .makeTreeByWord(wordsTreeTmp, StringUtils.trimToEmpty(kw.getWord()), kw);    }    wordsTree.clear();    wordsTree.putAll(wordsTreeTmp);    return len;}

3.5 關(guān)鍵詞提取

/** * 將文本中的關(guān)鍵詞提取出來。 */public List process(Map wordsTree, String text,        AbstractFragment fragment, int minLen) {    // 詞的前面一個(gè)字    String pre = null;    // 詞匹配的開始位置    int startPosition = 0;    // 返回結(jié)果    List rs = new ArrayList();    while (true) {        try {            if (wordsTree == null || wordsTree.isEmpty() || StringUtils.isEmpty(text)) {                return rs;            }            if (text.length() < minLen) {                return rs;            }            String chr = text.substring(0, 1);            text = text.substring(1);            Map nextWord = wordsTree.get(chr);            // 沒有對(duì)應(yīng)的下一個(gè)字,表示這不是關(guān)鍵詞的開頭,進(jìn)行下一個(gè)循環(huán)            if (nextWord == null) {                pre = chr;                continue;            }            List keywords = Lists.newArrayList();            KeyWord kw = AnalysisUtil.getSensitiveWord(chr, pre, nextWord, text, keywords);            if (keywords == null || keywords.size() == 0) {                // 沒有匹配到完整關(guān)鍵字,下一個(gè)循環(huán)                pre = chr;                continue;            }            for (KeyWord tmp : keywords) {                // 同一個(gè)word多次出現(xiàn)記錄在一起                SensitiveWordResult result = new SensitiveWordResult(startPosition, tmp.getWord());                int index = rs.indexOf(result);                if (index > -1) {                    rs.get(index).addPosition(startPosition, tmp.getWord());                } else {                    rs.add(result);                }            }            // 從text中去除當(dāng)前已經(jīng)匹配的內(nèi)容,進(jìn)行下一個(gè)循環(huán)匹配            // 這行注釋了,避免"中國(guó)人",導(dǎo)致"國(guó)人"。搜索不出來,逐個(gè)字符遍歷            // text = text.substring(kw.getWordLength() - 1);            pre = kw.getWord().substring(kw.getWordLength() - 1, kw.getWordLength());            continue;        } finally {            if (pre != null) {                startPosition = startPosition + pre.length();            }        }    }}/** * 查詢文本開頭的詞是否在詞庫樹中,如果在,則返回對(duì)應(yīng)的詞,如果不在,則返回null。return 返回找到的最長(zhǎng)關(guān)鍵詞 *  * @param append 追加的詞 * @param pre 詞的前一個(gè)字,如果為空,則表示前面沒有內(nèi)容 * @param nextWordsTree 下一層樹 * @param text 剩余的文本內(nèi)容 * @param keywords 返回的keywords,可能多個(gè) * @return 返回找到的最長(zhǎng)關(guān)鍵詞 */public static KeyWord getSensitiveWord(String append, String pre,                Map nextWordsTree, String text, List keywords) {    if (nextWordsTree == null || nextWordsTree.isEmpty()) {        return null;    }    Map endTag = nextWordsTree.get(EndTagUtil.TREE_END_TAG);    // 原始文本已到末尾    if (StringUtils.isEmpty(text)) {        // 如果有結(jié)束符,則表示匹配成功,沒有,則返回null        if (endTag != null) {            keywords.add(checkPattern(getKeyWord(append, endTag), pre, null));            return checkPattern(getKeyWord(append, endTag), pre, null);        } else {            return null;        }    }    String next = text.substring(0, 1);    String suffix = text.substring(0, 1);    Map nextTree = nextWordsTree.get(next);    // 沒有遇到endTag,繼續(xù)匹配    if (endTag == null) {        if (nextTree != null && nextTree.size() > 0) {            // 沒有結(jié)束標(biāo)志,則表示關(guān)鍵詞沒有結(jié)束,繼續(xù)往下走。            return getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords);        }        // 如果沒有下一個(gè)匹配的字,表示匹配結(jié)束!        return null;    } else { // endTag , 添加關(guān)鍵字        KeyWord tmp = getKeyWord(append, endTag);        keywords.add(checkPattern(tmp, pre, suffix));    }    // 有下一個(gè)匹配的詞則繼續(xù)匹配,一直取到最大的匹配關(guān)鍵字    KeyWord tmp = null;    if (nextTree != null && nextTree.size() > 0) {        // 如果大于0,則表示還有更長(zhǎng)的詞,繼續(xù)往下找        tmp = getSensitiveWord(append + next, pre, nextTree, text.substring(1), keywords);        if (tmp == null) {            // 沒有更長(zhǎng)的詞,則就返回這個(gè)詞。在返回之前,先判斷它是模糊的,還是精確的            tmp = getKeyWord(append, endTag);        }        return checkPattern(tmp, pre, suffix);    }    // 沒有往下的詞了,返回該關(guān)鍵詞。    return checkPattern(getKeyWord(append, endTag), pre, suffix);}

思路是對(duì)某個(gè)字符串text,逐個(gè)字符ch,獲取ch對(duì)應(yīng)的詞庫樹的children,然后獲取匹配到的單個(gè)或多個(gè)結(jié)果,將匹配到的關(guān)鍵詞在text中的開始和結(jié)束下標(biāo)進(jìn)行記錄,如后續(xù)需要html標(biāo)記,或者字符替換可直接使用。如果未能在詞庫樹中找到對(duì)應(yīng)的ch的children,或者詞庫樹的children未能匹配到去除ch的子字符串,則繼續(xù)循環(huán)。具體可再詳細(xì)讀一下代碼。

4 Radix Tree的應(yīng)用

4.1 RAX - Redis Tree

Redis實(shí)現(xiàn)了不定長(zhǎng)壓縮前綴的radix tree,用在集群模式下存儲(chǔ)slot對(duì)應(yīng)的的所有key信息。

/* Representation of a radix tree as implemented in this file, that contains * the strings "foo", "foobar" and "footer" after the insertion of each * word. When the node represents a key inside the radix tree, we write it * between [], otherwise it is written between (). * * This is the vanilla representation: * *              (f) "" *                \ *                (o) "f" *                  \ *                  (o) "fo" *                    \ *                  [t   b] "foo" *                  /     \ *         "foot" (e)     (a) "foob" *                /         \ *      "foote" (r)         (r) "fooba" *              /             \ *    "footer" []             [] "foobar" * * However, this implementation implements a very common optimization where * successive nodes having a single child are "compressed" into the node * itself as a string of characters, each representing a next-level child, * and only the link to the node representing the last character node is * provided inside the representation. So the above representation is turned * into: * *                  ["foo"] "" *                     | *                  [t   b] "foo" *                  /     \ *        "foot" ("er")    ("ar") "foob" *                 /          \ *       "footer" []          [] "foobar" * * However this optimization makes the implementation a bit more complex. * For instance if a key "first" is added in the above radix tree, a * "node splitting" operation is needed, since the "foo" prefix is no longer * composed of nodes having a single child one after the other. This is the * above tree and the resulting node splitting after this event happens: * * *                    (f) "" *                    / *                 (i o) "f" *                 /   \ *    "firs"  ("rst")  (o) "fo" *              /        \ *    "first" []       [t   b] "foo" *                     /     \ *           "foot" ("er")    ("ar") "foob" *                    /          \ *          "footer" []          [] "foobar" * * Similarly after deletion, if a new chain of nodes having a single child * is created (the chain must also not include nodes that represent keys), * it must be compressed back into a single node. * */#define RAX_NODE_MAX_SIZE ((1<<29)-1)typedef struct raxNode {    uint32_t iskey:1;     /* Does this node contain a key? */    uint32_t isnull:1;    /* Associated value is NULL (don"t store it). */    uint32_t iscompr:1;   /* Node is compressed. */    uint32_t size:29;     /* Number of children, or compressed string len. */    unsigned char data[];} raxNode;typedef struct rax {    raxNode *head;    uint64_t numele;    uint64_t numnodes;} rax;typedef struct raxStack {    void **stack; /* Points to static_items or an heap allocated array. */    size_t items, maxitems; /* Number of items contained and total space. */    void *static_items[RAX_STACK_STATIC_ITEMS];    int oom; /* True if pushing into this stack failed for OOM at some point. */} raxStack;

如Redis源碼中的注釋所寫,RAX進(jìn)行了一些優(yōu)化,并不會(huì)將一個(gè)字符串直接按照每個(gè)字符進(jìn)行樹的構(gòu)建,而是在Insert有沖突時(shí)節(jié)點(diǎn)分割處理,在Delete時(shí)如果子節(jié)點(diǎn)和父節(jié)點(diǎn)都只有一個(gè),則需要進(jìn)行合并操作。對(duì)于RAX有興趣的同學(xué),可以看一下rax.h、rax.c的相關(guān)代碼。

4.2 Linux內(nèi)核

Linux radix樹最廣泛的用途是用于內(nèi)存管理,結(jié)構(gòu)address_space通過radix樹跟蹤綁定到地址映射上的核心頁,該radix樹允許內(nèi)存管理代碼快速查找標(biāo)識(shí)為dirty或writeback的頁。Linux radix樹的API函數(shù)在lib/radix-tree.c中實(shí)現(xiàn)。Linux基數(shù)樹(radix tree)是將指針與long整數(shù)鍵值相關(guān)聯(lián)的機(jī)制,它存儲(chǔ)有效率,并且可快速查詢,用于指針與整數(shù)值的映射(如:IDR機(jī)制)、內(nèi)存管理等。

struct radix_tree_node {        unsigned int    path;        unsigned int    count;        union {                struct {                        struct radix_tree_node *parent;                        void *private_data;                };                struct rcu_head rcu_head;        };        /* For tree user */        struct list_head private_list;        void __rcu      *slots[RADIX_TREE_MAP_SIZE];        unsigned long   tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];};

關(guān)于Linux內(nèi)核使用Radix Tree的具體代碼,有興趣的同學(xué)可以繼續(xù)深入。

5 總結(jié)

Trie樹在單詞搜索、統(tǒng)計(jì)、排序等領(lǐng)域有大量的應(yīng)用。文章從基礎(chǔ)概念到具體的臟話過濾的應(yīng)用、Redis的RAX和Linux內(nèi)核的Radix Tree對(duì)Trie樹做了介紹。數(shù)據(jù)結(jié)構(gòu)和算法是程序高性能的基礎(chǔ),本文拋磚引玉,希望大家對(duì)Trie樹有所了解,并在未來開發(fā)過程實(shí)踐和應(yīng)用Trie樹解決中類似情景的問題。

標(biāo)簽: 數(shù)據(jù)結(jié)構(gòu) 最大限度 冗余信息

上一篇:
下一篇: