[[standard-tokenizer]]
=== standard Tokenizer

A _tokenizer_ accepts a string as input, processes((("words", "identifying", "using standard tokenizer")))((("standard tokenizer")))((("tokenizers"))) the string to break it
into individual words, or _tokens_ (perhaps discarding some characters like
punctuation), and emits a _token stream_ as output.

What is interesting is the algorithm that is used to _identify_ words. The
`whitespace` tokenizer((("whitespace tokenizer"))) simply breaks on whitespace--spaces, tabs, line
feeds, and so forth--and assumes that contiguous nonwhitespace characters
form a single token. For instance:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=whitespace
You're the 1st runner home!
--------------------------------------------------

This request would return the following terms:
`You're`, `the`, `1st`, `runner`, `home!`

The `letter` tokenizer, on the other hand, breaks on any character that is
not a letter, and so would((("letter tokenizer"))) return the following terms: `You`, `re`, `the`,
`st`, `runner`, `home`.
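You can verify this for yourself by repeating the earlier request with the
`letter` tokenizer:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=letter
You're the 1st runner home!
--------------------------------------------------

Because the apostrophe and the digits are not letters, they act as token
boundaries and are discarded from the output.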
The `standard` tokenizer((("Unicode Text Segmentation algorithm"))) uses the Unicode Text Segmentation algorithm (as
defined in http://unicode.org/reports/tr29/[Unicode Standard Annex #29]) to
find the boundaries _between_ words,((("word boundaries"))) and emits everything in-between. Its
knowledge of Unicode allows it to successfully tokenize text containing a
mixture of languages.

Punctuation may((("punctuation", "in words"))) or may not be considered part of a word, depending on
where it appears:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=standard
You're my 'favorite'.
--------------------------------------------------

In this example, the apostrophe in `You're` is treated as part of the word,
while the single quotes in `'favorite'` are not, resulting in the following
terms: `You're`, `my`, `favorite`.

[TIP]
==================================================
The `uax_url_email` tokenizer works((("uax_url_email tokenizer"))) in exactly the same way as the `standard`
tokenizer, except that it recognizes((("email addresses and URLs, tokenizer for"))) email addresses and URLs and emits them
as single tokens. The `standard` tokenizer, on the other hand, would try to
break them into individual words. For instance, the email address
`joe-bloggs@foo-bar.com` would result in the tokens `joe`, `bloggs`, `foo`,
`bar.com`.
==================================================
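To see the difference, you can pass the same email address to the
`uax_url_email` tokenizer:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=uax_url_email
joe-bloggs@foo-bar.com
--------------------------------------------------

This request would return the whole address as a single term:
`joe-bloggs@foo-bar.com`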
The `standard` tokenizer is a reasonable starting point for tokenizing most
languages, especially Western languages. In fact, it forms the basis of most
of the language-specific analyzers like the `english`, `french`, and
`spanish` analyzers. Its support for Asian languages, however, is limited,
and you should consider using the `icu_tokenizer` instead,((("icu_tokenizer"))) which is
available in the ICU plug-in.
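The following is a minimal sketch, assuming the ICU plug-in is installed;
the mixed-language sample text is purely illustrative:

[source,js]
--------------------------------------------------
GET /_analyze?tokenizer=icu_tokenizer
Hello 世界
--------------------------------------------------

Tokenizers provided by plug-ins are referenced by name just like the
built-in ones; the request itself is otherwise unchanged.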
