NuriaProject Framework
0.1
General-purpose run-time tokenizer.
#include <tokenizer.hpp>
Public Member Functions
Tokenizer (QObject *parent=nullptr)
~Tokenizer () override
void addTokenizerRules (const QString &name, const TokenizerRules &ruleSet)
bool atEnd () const
int currentColumn () const
int currentPosition () const
int currentRow () const
const TokenizerRules & currentTokenizerRules () const
TokenizerRules & defaultTokenizerRules ()
int errorColumn () const
int errorPosition () const
int errorRow () const
bool hasError () const
Token nextToken ()
void removeTokenizerRules (const QString &name)
void setCurrentTokenizerRules (const QString &name)
void setDefaultTokenizerRules (const TokenizerRules &ruleSet)
void setPosition (int position, int column, int row)
void tokenize (const QByteArray &data)
QByteArray tokenizeData () const
TokenizerRules tokenizerRules (const QString &name) const
General-purpose run-time tokenizer.
This is a general-purpose tokenizer which can be constructed and configured at run-time, allowing for complex token-schemes.
Basic usage consists of creating an instance of Nuria::Tokenizer, filling the default rule-set with data (See defaultTokenizerRules() ) and then using tokenize() and nextToken() to iterate over a data stream.
Please note that Nuria::Tokenizer uses std::regex for regular expressions as QRegularExpression only works on QStrings. Although the syntax is pretty similar, please have a look at the documentation of ECMAScript regular expressions, e.g. here: http://www.cplusplus.com/reference/regex/ECMAScript/
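To illustrate why std::regex is a natural fit for byte data, the following stand-alone sketch matches an ECMAScript pattern against a raw byte buffer at a given offset. It is not taken from Nuria's implementation: matchAt() is a hypothetical helper, and std::string stands in for QByteArray so that the sketch does not depend on Qt.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: returns the length of the match of `pattern` that
// starts exactly at `offset` in `data`, or 0 if the pattern does not match
// at that offset (an empty match is reported as 0 as well).
std::size_t matchAt(const std::string &data, std::size_t offset,
                    const std::regex &pattern) {
    assert(offset <= data.size());
    const char *begin = data.data() + offset;
    const char *end = data.data() + data.size();

    std::cmatch match;
    // match_continuous anchors the match at `begin`, so the search does not
    // skip ahead to a later occurrence of the pattern.
    if (std::regex_search(begin, end, match, pattern,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(match.length(0));
    }
    return 0;
}
```

Calling matchAt("abc 123", 4, std::regex("[0-9]+")) would report a match of length 3, while the same pattern at offset 0 reports no match.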
For some types of data it may be desirable to use multiple rule-sets. Nuria::Tokenizer allows this use-case too.
To do this, you'll first need to create all needed rule-sets by creating instances of TokenizerRules and filling them.
Second, you'll need to use TokenizerRules::setTokenAction() to register a token handler for all tokens which should switch the currently used rule-set. You can do this by calling setCurrentTokenizerRules() on the Nuria::Tokenizer instance passed to the handler.
After this, it's just a matter of adding all rule-sets using addTokenizerRules().
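The idea of switching rule-sets from within a token handler can be sketched without Nuria as follows. This is a simplified model under stated assumptions, not the library's API: all Mini* types, addRules(), setCurrentRules(), next() and tokenizeWithStrings() are hypothetical names that only mirror the roles of TokenizerRules, addTokenizerRules(), setCurrentTokenizerRules() and nextToken(). A quote character switches from a "code" rule-set to a "string" rule-set and back.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// All Mini* names are hypothetical and exist only for this sketch.
struct MiniToken {
    int id;
    std::string text;
};

class MiniTokenizer;

struct MiniRule {
    int id;
    std::regex pattern;
    // Optional handler, invoked after the token was read.
    std::function<void(MiniTokenizer &, MiniToken &)> action;
};

using MiniRuleSet = std::vector<MiniRule>;

class MiniTokenizer {
public:
    void addRules(const std::string &name, MiniRuleSet rules) { m_sets[name] = std::move(rules); }
    void setCurrentRules(const std::string &name) { m_current = name; }
    void tokenize(std::string data) { m_data = std::move(data); m_pos = 0; }
    bool atEnd() const { return m_pos >= m_data.size(); }

    // Tries the rules of the current rule-set in order at the cursor.
    // Returns a token with id 0 and jumps to the end if no rule matches.
    MiniToken next() {
        assert(m_sets.count(m_current) == 1);
        const MiniRuleSet &rules = m_sets.at(m_current);
        const char *begin = m_data.data() + m_pos;
        const char *end = m_data.data() + m_data.size();
        for (const MiniRule &rule : rules) {
            std::cmatch match;
            if (std::regex_search(begin, end, match, rule.pattern,
                                  std::regex_constants::match_continuous) &&
                match.length(0) > 0) {
                MiniToken token{rule.id, match.str(0)};
                m_pos += token.text.size();
                if (rule.action) {
                    rule.action(*this, token); // may switch the rule-set
                }
                return token;
            }
        }
        m_pos = m_data.size();
        return MiniToken{0, std::string()};
    }

private:
    std::map<std::string, MiniRuleSet> m_sets;
    std::string m_current;
    std::string m_data;
    std::size_t m_pos = 0;
};

// Token ids used here: 1 = word, 2 = quote, 3 = string content, 4 = whitespace.
std::vector<MiniToken> tokenizeWithStrings(const std::string &input) {
    auto switchTo = [](std::string name) {
        return [name](MiniTokenizer &tokenizer, MiniToken &) { tokenizer.setCurrentRules(name); };
    };

    MiniTokenizer tokenizer;
    tokenizer.addRules("code", {{1, std::regex("[A-Za-z]+"), nullptr},
                                {2, std::regex("\""), switchTo("string")},
                                {4, std::regex("\\s+"), nullptr}});
    tokenizer.addRules("string", {{3, std::regex("[^\"]+"), nullptr},
                                  {2, std::regex("\""), switchTo("code")}});
    tokenizer.setCurrentRules("code");
    tokenizer.tokenize(input);

    std::vector<MiniToken> tokens;
    while (!tokenizer.atEnd()) {
        tokens.push_back(tokenizer.next());
    }
    return tokens;
}
```

In this model, whitespace inside the quotes becomes part of the string-content token because the "string" rule-set has no whitespace rule, which is the kind of context-dependent behaviour that multiple rule-sets are meant for.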
All tokens with negative token ids will be ignored and silently discarded. If nextToken() encounters such a token, it'll read on until it finds a token which is not ignored or until it reaches the end of the data-stream.
To decide at run-time whether a token should be discarded, you can register a token action handler on a specific token. If you want to ignore that token, simply set the tokenId of the handler's 'token' argument to a negative value.
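The discard rule can be modelled in a few lines. The sketch below is not Nuria code: SketchToken, SketchAction and dropIgnored() are hypothetical names, and only the tokenId field name is taken from the text above. It runs a handler over each token and then drops every token whose id ended up negative.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-ins for a token and a token action handler.
struct SketchToken {
    int tokenId;
    std::string value;
};

using SketchAction = std::function<void(SketchToken &)>;

// Applies `action` (if any) to each token, then keeps only the tokens whose
// id is not negative. Tokens that were negative from the start are dropped too.
std::vector<SketchToken> dropIgnored(std::vector<SketchToken> raw,
                                     const SketchAction &action) {
    std::vector<SketchToken> kept;
    for (SketchToken &token : raw) {
        if (action) {
            action(token); // the handler may flip the id to a negative value
        }
        if (token.tokenId < 0) {
            continue; // negative ids are discarded silently
        }
        kept.push_back(token);
    }
    return kept;
}
```

A handler that sets tokenId to -1 for values starting with '#' would, in this model, hide comment tokens from the caller without needing a separate rule.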
Nuria::Tokenizer automatically takes care of location and error tracking. You can access the cursor position using currentRow(), currentColumn() or currentPosition() to get the current row, column, or position in the data-stream respectively.
The same goes for the error position, which can be accessed using errorRow(), errorColumn() and errorPosition().
Token action handlers can move the internal cursor using setPosition().
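How a position, row and column can be kept in sync while data is consumed is sketched below. This is an assumption-laden illustration, not Nuria's implementation: SketchCursor and advanceCursor() are hypothetical, and zero-based rows and columns with '\n' as the only line terminator are choices made for this sketch, since the documentation above does not state them.

```cpp
#include <cassert>
#include <string>

// Hypothetical cursor; zero-based counting is an assumption of this sketch.
struct SketchCursor {
    int position = 0;
    int column = 0;
    int row = 0;
};

// Advances the cursor over the bytes that were just consumed.
void advanceCursor(SketchCursor &cursor, const std::string &consumed) {
    for (char c : consumed) {
        ++cursor.position;
        if (c == '\n') {
            ++cursor.row;      // a newline starts the next row ...
            cursor.column = 0; // ... at its first column
        } else {
            ++cursor.column;
        }
    }
}
```

A handler that jumps to another position would have to supply a matching row and column itself, which is why setPosition() takes all three values.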
Nuria::Tokenizer::Tokenizer (QObject *parent = nullptr)
Constructor.
Nuria::Tokenizer::~Tokenizer () override
Destructor.
void Nuria::Tokenizer::addTokenizerRules (const QString &name, const TokenizerRules &ruleSet)
Adds ruleSet to the tokenizer under the name name.
bool Nuria::Tokenizer::atEnd () const
Returns true if the tokenizer reached the end of the data stream.
int Nuria::Tokenizer::currentColumn () const
Returns the current column in the data-stream.
int Nuria::Tokenizer::currentPosition () const
Returns the current position in the data-stream.
int Nuria::Tokenizer::currentRow () const
Returns the current row in the data-stream.
const TokenizerRules& Nuria::Tokenizer::currentTokenizerRules () const
Returns the currently used tokenizer rule-set.
TokenizerRules& Nuria::Tokenizer::defaultTokenizerRules ()
Returns the default tokenizer rule-set.
int Nuria::Tokenizer::errorColumn () const
Returns the column where the error occurred.
int Nuria::Tokenizer::errorPosition () const
Returns the position in the data-stream where the error occurred.
int Nuria::Tokenizer::errorRow () const
Returns the row where the error occurred.
bool Nuria::Tokenizer::hasError () const
Returns true if the last call to nextToken() raised an error.
Token Nuria::Tokenizer::nextToken ()
Moves the tokenizer onwards by one token, returning the most-recently read token.
If the token id of the returned token is less than 0, the returned token is to be ignored by the caller. This happens in the following scenarios:
void Nuria::Tokenizer::removeTokenizerRules (const QString &name)
Removes the rule-set called name. If name is empty, the call has no effect. If name is the currently used rule-set, the tokenizer uses the default rule-set after the call.
void Nuria::Tokenizer::setCurrentTokenizerRules (const QString &name)
Tells the tokenizer to use the rule-set known as name from now on. If name is not a known rule-set, the default rule-set is used.
void Nuria::Tokenizer::setDefaultTokenizerRules (const TokenizerRules &ruleSet)
Sets the default tokenizer ruleSet.
void Nuria::Tokenizer::setPosition (int position, int column, int row)
Moves the cursor to position in the data being tokenized. Also sets the current column and row, which are only used for diagnostics.
void Nuria::Tokenizer::tokenize (const QByteArray &data)
Sets data to be tokenized. Use nextToken() to acquire the next token.
QByteArray Nuria::Tokenizer::tokenizeData () const
Returns the data as passed to the last call to tokenize().
TokenizerRules Nuria::Tokenizer::tokenizerRules (const QString &name) const
Returns the rule-set called name.