Skip to content

Tokenizing Text

ODict includes a built-in NLP tokenizer that segments text into words and automatically matches each token against dictionary entries. This is especially useful for languages without whitespace-delimited words (Chinese, Japanese, Korean, Thai, Khmer) as well as compound-word languages (German, Swedish).

Language familyLanguagesTokenizer
ChineseSimplified & Traditional Chinesejieba
JapaneseJapaneseLindera (UniDic)
KoreanKoreanLindera (KoDic)
ThaiThaiICU-based
KhmerKhmerICU-based
GermanicGerman, SwedishCompound word splitting
Latin-scriptEnglish, French, Spanish, etc.Unicode word boundaries
use odict::{OpenDictionary, tokenize::TokenizeOptions};
fn main() -> odict::Result<()> {
let file = OpenDictionary::from_path("my-dictionary.odict")?;
let dict = file.contents()?;
let tokens = dict.tokenize(
"the cat ran",
TokenizeOptions::default(),
)?;
for token in &tokens {
println!("'{}' ({} entries found)",
token.lemma,
token.entries.len()
);
}
Ok(())
}

For Chinese (and other CJK languages), ODict automatically detects the script and uses the appropriate segmenter.

let tokens = dict.tokenize("你好世界", TokenizeOptions::default())?;
for token in &tokens {
println!("Lemma: {}, Script: {:?}, Language: {:?}",
token.lemma,
token.script.name(),
token.language.as_ref().map(|l| l.code())
);
}

Like lookup, tokenization supports following see cross-references.

let options = TokenizeOptions::default().follow(true);
let tokens = dict.tokenize("the cat ran", options)?;
for token in &tokens {
for result in &token.entries {
if let Some(from) = &result.directed_from {
println!("'{}' → '{}'",
from.term.as_str(),
result.entry.term.as_str()
);
}
}
}
// e.g. 'ran' → 'run'
let options = TokenizeOptions::default().insensitive(true);
// "DOG" will match the "dog" entry
let tokens = dict.tokenize("DOG cat", options)?;

Each token returned by tokenize() includes metadata about the match.

PropertyDescription
lemmaThe original text of the token as it appears in the input
languageDetected language code (e.g. "cmn" for Mandarin), if applicable
scriptDetected script name (e.g. "Han", "Latin")
kindToken kind (e.g. "Word", "Punctuation")
startStart byte offset in the original text
endEnd byte offset in the original text
entriesArray of LookupResult objects for matched dictionary entries
let options = TokenizeOptions::default()
.follow(true)
.insensitive(true);
let tokens = dict.tokenize("The CAT RaN away", options)?;