class SpellChecker
Class implementing stateful spellchecking behaviour.
This class is designed to implement a spell-checking loop over
a block of text, correcting/ignoring/replacing words as required.
This loop is implemented using an iterator paradigm so it can be
embedded inside other loops of control.
The SpellChecker object is stateful, and the appropriate methods
must be called to alter its state and affect the progress of
the spell checking session. At any point during the checking
session, the attribute 'word' will hold the current erroneously
spelled word under consideration. The action to take on this word
is determined by calling methods such as 'replace', 'replace_always'
and 'ignore_always'. Once this is done, calling 'next' advances
to the next misspelled word.
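
The iterator paradigm described above can be pictured with a toy class. This is only a sketch, not the enchant implementation: `ToyChecker`, its `known` word set and its whitespace tokenization are hypothetical stand-ins. It shows how an object can be its own iterator, expose the current error through `word`, and return itself from each step.

```python
class ToyChecker:
    """Hypothetical stand-in for a stateful checker; not the enchant class."""

    def __init__(self, known, text):
        self.known = set(known)      # words treated as correctly spelled
        self.tokens = text.split()   # naive whitespace tokenization (assumption)
        self.pos = -1                # index of the token under consideration
        self.word = None             # current misspelled word

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        # Advance to the next token that is not in the known-word set.
        self.pos += 1
        while self.pos < len(self.tokens):
            if self.tokens[self.pos] not in self.known:
                self.word = self.tokens[self.pos]
                return self          # returning self makes 'for err in chkr' work
            self.pos += 1
        self.word = None
        raise StopIteration

    def replace(self, repl):
        # Overwrite the token currently under consideration.
        self.tokens[self.pos] = repl

    def get_text(self):
        return " ".join(self.tokens)


chkr = ToyChecker({"this", "is", "text"}, "this is sme text")
for err in chkr:
    err.replace(err.word.upper())
result = chkr.get_text()
```

Because each step returns the checker itself, the loop variable and the checker are the same object, so state-changing methods can be called on either name.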
As a quick (and rather silly) example, the following code replaces
each misspelled word with the string "SPAM":
>>> from enchant.checker import SpellChecker
>>> text = "This is sme text with a fw speling errors in it."
>>> chkr = SpellChecker("en_US", text)
>>> for err in chkr:
...     err.replace("SPAM")
...
>>> chkr.get_text()
'This is SPAM text with a SPAM SPAM errors in it.'
>>>
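
A slightly more useful loop replaces each error with the checker's first suggestion. The helper below is a sketch: `replace_with_first_suggestion` is a hypothetical name, and it relies only on the `word` attribute and the `suggest` and `replace` methods documented on this page. It assumes `suggest()` returns a list that may be empty.

```python
def replace_with_first_suggestion(chkr):
    """Replace every error with its first suggestion, if there is one.

    'chkr' is any object that behaves like the SpellChecker described here.
    Returns the list of (original, replacement) pairs that were applied.
    """
    applied = []
    for err in chkr:
        original = err.word
        suggestions = err.suggest()      # no argument: use the current word
        if suggestions:                  # leave the word alone otherwise
            err.replace(suggestions[0])
            applied.append((original, suggestions[0]))
    return applied
```

With pyenchant installed, the helper would be passed a checker created as in the example above, and the corrected text read back with 'get_text'. The suggestions returned depend on the installed dictionary.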
Internally, the SpellChecker always works with arrays of (possibly
unicode) character elements. This allows the in-place modification
of the string as it is checked, and is the closest thing Python has
to a mutable string. The text can be supplied as a normal string, a
Unicode string, a character array or a Unicode character array. The
'get_text' method returns the modified array object if an array was
supplied, or a new string object if a string was used.
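
To see why a mutable sequence of characters helps, the sketch below performs an in-place replacement on a plain Python list of characters. The list is a stand-in chosen for illustration (the checker itself works with character arrays, as stated above), and `replace_span` is a hypothetical helper.

```python
def replace_span(chars, start, end, repl):
    """Replace chars[start:end] in place and return the change in length."""
    chars[start:end] = list(repl)         # slice assignment mutates the list
    return len(repl) - (end - start)


chars = list("a fw errors")
same_object = chars                       # second reference to the same list
delta = replace_span(chars, 2, 4, "few")  # "fw" occupies positions 2 and 3
text = "".join(chars)
```

Every reference to the sequence sees the change, and a non-zero `delta` means that all positions after the edited word have shifted by that amount.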
Words input to the SpellChecker may be either plain strings or
Unicode objects. They will be converted to the same type as the
text being checked, using Python's default encoding/decoding
settings.
If using an array of characters with this object and the
array is modified outside of the spellchecking loop, use the
'set_offset' method to reposition the internal loop pointer
to make sure it doesn't skip any words.
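
The 'set_offset' method takes a 'whence' argument with three modes, listed in its entry under the methods. The sketch below spells out one reading of those modes as plain arithmetic. `resolve_offset` and its `current` and `length` parameters are hypothetical, and the way the end of the text is measured for whence=2 is an assumption, so the real method may differ by a constant.

```python
def resolve_offset(current, off, whence=0, length=0):
    """Return the new offset under the three documented 'whence' modes."""
    if whence == 0:
        return current + off     # increment relative to the current offset
    if whence == 1:
        return off               # distance from the start of the text
    if whence == 2:
        return length - off      # distance from the end (assumed arithmetic)
    raise ValueError("whence must be 0, 1 or 2")


# An external edit removed 3 characters before the loop pointer at offset 10,
# so the pointer is moved back by the same amount to avoid skipping words.
shifted = resolve_offset(10, -3)                        # whence=0: increment
from_start = resolve_offset(10, 4, whence=1)
from_end = resolve_offset(10, 4, whence=2, length=20)
```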
Methods defined here:
- __init__(self, lang=None, text=None, tokenize=None, chunkers=None, filters=None)
- Constructor for the SpellChecker class.
SpellChecker objects can be created in two ways, depending on
the nature of the first argument. If it is a string, it
specifies a language tag from which a dictionary is created.
Otherwise, it must be an enchant Dict object to be used.
Optional keyword arguments are:
* text: to set the text to be checked at creation time
* tokenize: a custom tokenization function to use
* chunkers: a list of chunkers to apply during tokenization
* filters: a list of filters to apply during tokenization
If <tokenize> is not given and the first argument is a Dict,
its 'tag' attribute must be a language tag so that a tokenization
function can be created automatically. If this attribute is missing,
the user's default language will be used.
- __iter__(self)
- Each SpellChecker object is its own iterator
- __next__(self)
- add(self, word=None)
- Add given word to the personal word list.
If no word is given, the current erroneous word is added.
- add_to_personal(self, word=None)
- Add given word to the personal word list.
If no word is given, the current erroneous word is added.
- check(self, word)
- Check correctness of the given word.
- coerce_string(self, text, enc=None)
- Coerce string into the required type.
This method can be used to automatically ensure that strings
are of the correct type required by this checker - either unicode
or standard. If there is a mismatch, conversion is done using
Python's default encoding unless another encoding is specified.
- get_text(self)
- Return the spell-checked text.
- ignore_always(self, word=None)
- Add given word to list of words to ignore.
If no word is given, the current erroneous word is added.
- leading_context(self, chars)
- Get <chars> characters of leading context.
This method returns up to <chars> characters of leading
context - the text that occurs in the string immediately
before the current erroneous word.
- next(self)
- Process text up to the next spelling error.
This method is designed to support the iterator protocol.
Each time it is called, it will advance the 'word' attribute
to the next spelling error in the text. When no more errors
are found, it will raise StopIteration.
The method will always return self, so that it can be used
sensibly in common idioms such as:
    for err in checker:
        err.do_something()
- replace(self, repl)
- Replace the current erroneous word with the given string.
- replace_always(self, word, repl=None)
- Always replace given word with given replacement.
If a single argument is given, this is used to replace the
current erroneous word. If two arguments are given, that
combination is added to the list for future use.
- set_offset(self, off, whence=0)
- Set the offset of the tokenization routine.
For more details on the purpose of the tokenization offset,
see the documentation of the 'enchant.tokenize' module.
The optional argument whence indicates the method by
which to change the offset:
* 0 (the default) treats <off> as an increment
* 1 treats <off> as a distance from the start
* 2 treats <off> as a distance from the end
- set_text(self, text)
- Set the text to be spell-checked.
This method must be called, or the 'text' argument supplied
to the constructor, before calling the 'next()' method.
- suggest(self, word=None)
- Return suggested spellings for the given word.
If no word is given, the current erroneous word is used.
- trailing_context(self, chars)
- Get <chars> characters of trailing context.
This method returns up to <chars> characters of trailing
context - the text that occurs in the string immediately
after the current erroneous word.
- wants_unicode(self)
- Check whether the checker wants unicode strings.
This method will return True if the checker wants unicode strings
as input, False if it wants normal strings. It's important to
provide the correct type of string to the checker.
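
The "up to <chars> characters" wording in 'leading_context' and 'trailing_context' can be illustrated with plain string slicing. The functions below are a sketch with hypothetical signatures: the documented methods take only `chars` and work from the checker's current word, whereas here the word's start and end positions are passed in explicitly.

```python
def leading_context(text, start, chars):
    """Up to 'chars' characters immediately before position 'start'."""
    return text[max(0, start - chars):start]


def trailing_context(text, end, chars):
    """Up to 'chars' characters immediately after position 'end'."""
    return text[end:end + chars]


text = "This is sme text"
start = text.index("sme")                    # position of the misspelled word
end = start + len("sme")
before = leading_context(text, start, 5)
after = trailing_context(text, end, 100)     # request exceeds what is left
clipped = leading_context(text, start, 100)  # request exceeds what precedes
```

Asking for more context than exists simply returns what is available, which matches the "up to" wording.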