public class TokenizerImpl extends Object implements Tokenizer
Modifier and Type | Field and Description |
---|---|
private int |
currentChar
The current character, whether it comes from the file or the input text.
|
private int |
currentPosition
The current character position in the input text (not the file).
This is called "file_pos" in Flite.
|
static String |
DEFAULT_POSTPUNCTUATION_SYMBOLS
A string containing the default post-punctuation characters.
|
static String |
DEFAULT_PREPUNCTUATION_SYMBOLS
A string containing the default pre-punctuation characters.
|
static String |
DEFAULT_SINGLE_CHAR_SYMBOLS
A string containing the default single characters.
|
static String |
DEFAULT_WHITESPACE_SYMBOLS
A string containing the default whitespace characters.
|
static int |
EOF
A constant indicating that the end of the stream has been read.
|
private String |
errorDescription
The error description.
|
private String |
inputText
The input text (from the Utterance) to tokenize.
|
private Token |
lastToken |
private int |
lineNumber
The line number.
|
private String |
postpunctuationSymbols |
private String |
prepunctuationSymbols |
private Reader |
reader
The file to read input text from, if using file mode.
|
private String |
singleCharSymbols |
private Token |
token
A place to store the current token.
|
private String |
whitespaceSymbols
The delimiting symbols of this tokenizer.
|
Constructor and Description |
---|
TokenizerImpl()
Constructs a Tokenizer.
|
TokenizerImpl(Reader file)
Creates a tokenizer that will return tokens from
the given file.
|
TokenizerImpl(String string)
Creates a tokenizer that will return tokens from
the given string.
|
Modifier and Type | Method and Description |
---|---|
String |
getErrorDescription()
If hasErrors() returns true, this returns a
description of the error encountered; otherwise
it returns null. |
private int |
getNextChar()
Advances the currentPosition pointer by 1 (if not exceeding the
length of inputText), and returns the character pointed to by
currentPosition.
|
Token |
getNextToken()
Returns the next token.
|
private String |
getTokenByCharClass(String charClass,
boolean containThisCharClass)
Provides a single "compressed" implementation shared by
getTokenOfCharClass() and getTokenNotOfCharClass().
|
private String |
getTokenNotOfCharClass(String endingCharClass)
Starting from the current position of the input text/file,
returns the subsequent characters that are not of type singleCharSymbols,
ending at the first character of type endingCharClass.
|
private String |
getTokenOfCharClass(String charClass)
Starting from the current position of the input text,
returns the subsequent characters of type charClass,
and not of type singleCharSymbols.
|
boolean |
hasErrors()
Returns true if there were errors while reading tokens. |
boolean |
hasMoreTokens()
Returns true if there are more tokens, false otherwise. |
boolean |
isBreak()
Determines if the current token should start a new sentence.
|
private void |
removeTokenPostpunctuation()
Removes the postpunctuation characters from the current token.
|
void |
setInputReader(Reader reader)
Sets the input reader.
|
void |
setInputText(String inputString)
Sets the text to tokenize.
|
void |
setPostpunctuationSymbols(String symbols)
Sets the postpunctuation symbols of this Tokenizer to the given
symbols.
|
void |
setPrepunctuationSymbols(String symbols)
Sets the prepunctuation symbols of this Tokenizer to the given
symbols.
|
void |
setSingleCharSymbols(String symbols)
Sets the single character symbols of this Tokenizer to the given
symbols.
|
void |
setWhitespaceSymbols(String symbols)
Sets the whitespace symbols of this Tokenizer to the given symbols.
|
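The three private character-class helpers summarized above (getTokenOfCharClass, getTokenNotOfCharClass, and getTokenByCharClass) can be pictured with a small stand-alone sketch. It is not the TokenizerImpl source: the class name CharClassSketch, the explicit text and start parameters, and the symbol sets used in main are assumptions made for illustration. It only shows the scanning rule the descriptions suggest, where a flag chooses between collecting characters inside a class and collecting characters up to that class, and a single-character symbol always stops the scan.

```java
// Hypothetical sketch; names and parameters are illustrative assumptions.
final class CharClassSketch {

    /**
     * Scans text from start. With contain == true, collects characters while
     * they belong to charClass. With contain == false, collects characters
     * until one belonging to charClass is met. In both modes a character
     * found in singleCharSymbols ends the scan.
     */
    static String tokenByCharClass(String text, int start, String charClass,
                                   String singleCharSymbols, boolean contain) {
        StringBuilder buffer = new StringBuilder();
        for (int i = start; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean inClass = charClass.indexOf(c) != -1;
            if (inClass != contain || singleCharSymbols.indexOf(c) != -1) {
                break;
            }
            buffer.append(c);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        String whitespace = " \t\n\r";   // illustrative, not the documented default
        String singleChars = "(){}[]";   // illustrative, not the documented default
        // Leading whitespace run (characters of the class).
        System.out.println("[" + tokenByCharClass("  hi there", 0, whitespace, singleChars, true) + "]");
        // Word up to the next whitespace (characters not of the class).
        System.out.println("[" + tokenByCharClass("hi there", 0, whitespace, singleChars, false) + "]");
    }
}
```

Under this reading, getTokenOfCharClass(c) corresponds to the call with contain set to true and getTokenNotOfCharClass(c) to the call with contain set to false.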
public static final int EOF
public static final String DEFAULT_WHITESPACE_SYMBOLS
public static final String DEFAULT_SINGLE_CHAR_SYMBOLS
public static final String DEFAULT_PREPUNCTUATION_SYMBOLS
public static final String DEFAULT_POSTPUNCTUATION_SYMBOLS
private int lineNumber
private int currentChar
private int currentPosition
private String whitespaceSymbols
private String singleCharSymbols
private String prepunctuationSymbols
private String postpunctuationSymbols
private String errorDescription
public TokenizerImpl()
public TokenizerImpl(String string)
Parameters: string - the string to tokenize
public TokenizerImpl(Reader file)
Parameters: file - where to read the input from
public void setWhitespaceSymbols(String symbols)
Specified by: setWhitespaceSymbols in interface Tokenizer
Parameters: symbols - the whitespace symbols
public void setSingleCharSymbols(String symbols)
Specified by: setSingleCharSymbols in interface Tokenizer
Parameters: symbols - the single character symbols
public void setPrepunctuationSymbols(String symbols)
Specified by: setPrepunctuationSymbols in interface Tokenizer
Parameters: symbols - the prepunctuation symbols
public void setPostpunctuationSymbols(String symbols)
Specified by: setPostpunctuationSymbols in interface Tokenizer
Parameters: symbols - the postpunctuation symbols
public void setInputText(String inputString)
Specified by: setInputText in interface Tokenizer
Parameters: inputString - the string to tokenize
public void setInputReader(Reader reader)
Specified by: setInputReader in interface Tokenizer
Parameters: reader - the input source
public Token getNextToken()
Specified by: getNextToken in interface Tokenizer
Returns: null if no more tokens
public boolean hasMoreTokens()
Returns true if there are more tokens, false otherwise.
Specified by: hasMoreTokens in interface Tokenizer
Returns: true if there are more tokens, false otherwise
private int getNextChar()
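As a rough illustration of the getNextChar() behavior summarized earlier (advance currentPosition by one without running past inputText, and return the character), a minimal sketch for the string-input case might look like the following. The class name NextCharSketch, the value -1 for EOF, the position() accessor, and the read-then-advance order are assumptions, not details taken from the TokenizerImpl source. Reader mode and line counting are left out.

```java
// Hypothetical sketch of string-mode character stepping; not the real class.
final class NextCharSketch {
    static final int EOF = -1;        // assumed sentinel value

    private final String inputText;
    private int currentPosition = 0;  // index of the next character to read
    private int currentChar = EOF;

    NextCharSketch(String inputText) {
        this.inputText = inputText;
    }

    /** Returns the next character, or EOF once the text is exhausted. */
    int getNextChar() {
        if (currentPosition < inputText.length()) {
            currentChar = inputText.charAt(currentPosition);
            currentPosition++;        // advance only while inside the text
        } else {
            currentChar = EOF;        // position is left unchanged at the end
        }
        return currentChar;
    }

    int position() {
        return currentPosition;
    }

    public static void main(String[] args) {
        NextCharSketch sketch = new NextCharSketch("ab");
        int c;
        while ((c = sketch.getNextChar()) != EOF) {
            System.out.println((char) c);
        }
    }
}
```

Storing the character as an int rather than a char leaves room for a sentinel that cannot collide with any real character, which is presumably why currentChar and EOF are both declared as int.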
private String getTokenOfCharClass(String charClass)
Parameters: charClass - the type of characters to look for
buffer - the place to append characters of type charClass
private String getTokenNotOfCharClass(String endingCharClass)
Parameters: endingCharClass - the type of characters to look for
private String getTokenByCharClass(String charClass, boolean containThisCharClass)
If containThisCharClass is true, then a string from the current position to the last character in charClass is returned.
If containThisCharClass is false, then a string before the first occurrence of a character in charClass is returned.
Parameters: charClass - the string of characters you want included or excluded in your return
containThisCharClass - determines if you want characters in charClass in the returned string or not
private void removeTokenPostpunctuation()
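removeTokenPostpunctuation() is documented only as removing the postpunctuation characters from the current token. One plausible reading is that trailing characters found in postpunctuationSymbols are split off the end of the token text. The sketch below illustrates that reading only. The helper name splitPostpunctuation, the two-element array it returns, and the symbol set used in main are hypothetical, and the real method may treat edge cases (such as a token made up entirely of punctuation) differently.

```java
// Hypothetical sketch; the real method works on the tokenizer's Token field.
final class PostpunctuationSketch {

    /**
     * Splits trailing postpunctuation off a word.
     * Returns { wordWithoutPostpunctuation, removedPostpunctuation }.
     */
    static String[] splitPostpunctuation(String word, String postpunctuationSymbols) {
        int end = word.length();
        // Walk backwards while the last remaining character is postpunctuation.
        while (end > 0 && postpunctuationSymbols.indexOf(word.charAt(end - 1)) != -1) {
            end--;
        }
        return new String[] { word.substring(0, end), word.substring(end) };
    }

    public static void main(String[] args) {
        String symbols = ".,:;!?";   // illustrative, not the documented default
        String[] parts = splitPostpunctuation("world!?", symbols);
        System.out.println(parts[0] + " | " + parts[1]);
    }
}
```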
public boolean hasErrors()
Returns true if there were errors while reading tokens.
public String getErrorDescription()
If hasErrors() returns true, this will return a description of the error encountered; otherwise it will return null.
Specified by: getErrorDescription in interface Tokenizer
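Taken together, the public methods suggest a simple calling pattern: loop while hasMoreTokens() is true, take each token from getNextToken(), then consult hasErrors() and getErrorDescription(). The sketch below shows that pattern against a stand-in, because the real class and its Token type are not reproduced here. SimpleTokenizer, WhitespaceTokenizer, and drain are hypothetical names, the stand-in returns String where the real getNextToken() returns a Token, and checking for errors after the loop rather than on each iteration is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins; only the method names mirror the documented API.
final class TokenLoopSketch {

    interface SimpleTokenizer {
        boolean hasMoreTokens();
        String getNextToken();          // the documented method returns a Token
        boolean hasErrors();
        String getErrorDescription();
    }

    /** Toy implementation that splits on whitespace and never reports errors. */
    static final class WhitespaceTokenizer implements SimpleTokenizer {
        private final String[] words;
        private int next = 0;

        WhitespaceTokenizer(String text) {
            String trimmed = text.trim();
            words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        }

        public boolean hasMoreTokens() { return next < words.length; }
        public String getNextToken() { return hasMoreTokens() ? words[next++] : null; }
        public boolean hasErrors() { return false; }
        public String getErrorDescription() { return null; }
    }

    /** Collects every token, then surfaces any error the tokenizer recorded. */
    static List<String> drain(SimpleTokenizer tokenizer) {
        List<String> tokens = new ArrayList<>();
        while (tokenizer.hasMoreTokens()) {
            String token = tokenizer.getNextToken();
            if (token == null) {
                break;                  // documented: null if no more tokens
            }
            tokens.add(token);
        }
        if (tokenizer.hasErrors()) {
            throw new IllegalStateException(tokenizer.getErrorDescription());
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(drain(new WhitespaceTokenizer("Hello, world. How are you?")));
    }
}
```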
WebARTS Library Licensed Under the GNU - General Public License. Other Libraries licensed under their respective Open Source Licenses