NuriaProject Framework  0.1
The NuriaProject Framework
Nuria::Tokenizer Class Reference

General-purpose run-time tokenizer.

#include <tokenizer.hpp>

Public Member Functions

 Tokenizer (QObject *parent=nullptr)
 
 ~Tokenizer () override
 
void addTokenizerRules (const QString &name, const TokenizerRules &ruleSet)
 
bool atEnd () const
 
int currentColumn () const
 
int currentPosition () const
 
int currentRow () const
 
const TokenizerRules & currentTokenizerRules () const
 
TokenizerRules & defaultTokenizerRules ()
 
int errorColumn () const
 
int errorPosition () const
 
int errorRow () const
 
bool hasError () const
 
Token nextToken ()
 
void removeTokenizerRules (const QString &name)
 
void setCurrentTokenizerRules (const QString &name)
 
void setDefaultTokenizerRules (const TokenizerRules &ruleSet)
 
void setPosition (int position, int column, int row)
 
void tokenize (const QByteArray &data)
 
QByteArray tokenizeData () const
 
TokenizerRules tokenizerRules (const QString &name) const
 

Detailed Description

General-purpose run-time tokenizer.

This is a general-purpose tokenizer which can be constructed and configured at run-time, allowing for complex token-schemes.

Usage

Basic usage consists of creating an instance of Nuria::Tokenizer, filling the default rule-set with data (See defaultTokenizerRules() ) and then using tokenize() and nextToken() to iterate over a data stream.

Please note that Nuria::Tokenizer uses std::regex for regular expressions as QRegularExpression only works on QStrings. Although the syntax is pretty similar, please have a look at the documentation of ECMAScript regular expressions, e.g. here: http://www.cplusplus.com/reference/regex/ECMAScript/
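
As an illustration of how std::regex can be applied to raw bytes rather than QStrings, the following minimal sketch matches an ECMAScript pattern at a fixed offset of a byte buffer. The helper name matchLengthAt is hypothetical and not part of Nuria::Tokenizer; how the tokenizer itself drives std::regex is not documented here, so treat this as an assumption about one reasonable approach.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper for illustration; not part of the Nuria API.
// Returns the length of the match of `pattern` that starts exactly at
// `offset` in `data`, or 0 if the pattern does not match there.
std::size_t matchLengthAt(const std::string &data, std::size_t offset,
                          const std::regex &pattern) {
    if (offset > data.size()) {
        return 0;
    }

    const char *begin = data.data() + offset;
    const char *end = data.data() + data.size();
    std::cmatch match;

    // match_continuous only accepts matches that begin at `begin`, which is
    // what a tokenizer needs when it tests a rule at the cursor position.
    if (std::regex_search(begin, end, match, pattern,
                          std::regex_constants::match_continuous)) {
        return static_cast<std::size_t>(match.length(0));
    }

    return 0;
}
```

Note that a pattern which can match the empty string also yields 0 here, so rules used this way should always consume at least one byte.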

Using multiple rule-sets

For some types of data it may be desirable to use multiple rule-sets. Nuria::Tokenizer allows this use-case too.

To do this, you'll first need to create all needed rule-sets by creating instances of TokenizerRules and filling them.

Second, you'll need to use TokenizerRules::setTokenAction() to register a token handler for all tokens which should switch the currently used rule-set. You can do this by calling setCurrentTokenizerRules() on the Nuria::Tokenizer instance passed to the handler.

After this, it's just a matter of adding all rule-sets using addTokenizerRules().

Note
The default rule-set has the name "" (empty string).
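
The steps above can be sketched with a small self-contained stand-in. MiniTokenizer, MiniRule and the other names below are hypothetical and only mimic the idea of named rule-sets and token handlers; they are not the Nuria classes, and the real TokenizerRules interface may look different. The example switches to a "string" rule-set when it reads a double quote and back to the default rule-set "" at the closing quote.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <regex>
#include <string>
#include <utility>
#include <vector>

struct MiniToken {
    int id;
    std::string text;
};

class MiniTokenizer;
using MiniAction = std::function<void(MiniTokenizer &, MiniToken &)>;

struct MiniRule {
    int id;
    std::regex pattern;
    MiniAction action;  // optional handler, may be empty
};
using MiniRuleSet = std::vector<MiniRule>;

// Hypothetical stand-in, not Nuria::Tokenizer.
class MiniTokenizer {
public:
    void addRules(const std::string &name, MiniRuleSet rules) {
        sets_[name] = std::move(rules);
    }

    // Unknown names fall back to the default rule-set "".
    void setCurrentRules(const std::string &name) {
        current_ = sets_.count(name) ? name : std::string();
    }

    void tokenize(std::string data) {
        data_ = std::move(data);
        pos_ = 0;
    }

    // Reads one token using the current rule-set. Returns false at the end
    // of the data or when no rule matches at the cursor.
    bool next(MiniToken &out) {
        const char *begin = data_.data() + pos_;
        const char *end = data_.data() + data_.size();
        for (const MiniRule &rule : sets_[current_]) {
            std::cmatch m;
            if (std::regex_search(begin, end, m, rule.pattern,
                                  std::regex_constants::match_continuous) &&
                m.length(0) > 0) {
                out = MiniToken{rule.id, m.str(0)};
                pos_ += static_cast<std::size_t>(m.length(0));
                if (rule.action) rule.action(*this, out);  // may switch sets
                return true;
            }
        }
        return false;
    }

private:
    std::map<std::string, MiniRuleSet> sets_;
    std::string current_;  // "" names the default rule-set
    std::string data_;
    std::size_t pos_ = 0;
};

// Tokenizes words and double-quoted strings, returning the token ids seen.
std::vector<int> demoRuleSetSwitch(const std::string &input) {
    MiniRuleSet normal = {
        {1, std::regex("[a-z]+"), nullptr},
        {2, std::regex(" +"), nullptr},
        {3, std::regex("\""),
         [](MiniTokenizer &t, MiniToken &) { t.setCurrentRules("string"); }},
    };
    MiniRuleSet inString = {
        {10, std::regex("[^\"]+"), nullptr},
        {11, std::regex("\""),
         [](MiniTokenizer &t, MiniToken &) { t.setCurrentRules(""); }},
    };

    MiniTokenizer tokenizer;
    tokenizer.addRules("", std::move(normal));
    tokenizer.addRules("string", std::move(inString));
    tokenizer.tokenize(input);

    std::vector<int> ids;
    MiniToken token;
    while (tokenizer.next(token)) ids.push_back(token.id);
    return ids;
}
```

Inside the quotes the space is part of token 10 instead of being a separate whitespace token, which is the kind of context-dependent behaviour multiple rule-sets are meant for.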

Ignoring tokens

All tokens with negative token ids are ignored and silently discarded. If nextToken() encounters such a token, it reads on until it finds a token which is not ignored or reaches the end of the data-stream.

To decide later if a token should be discarded, you can register a token action handler on a specific token. If you want to ignore that token, you can simply set the tokenId of the 'token' argument to a negative value.
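
The skipping behaviour described in this section can be pictured with the following minimal sketch. RawToken, TokenHandler and nextVisibleToken are hypothetical names working on an already-split token list; they only model the rule "negative id means ignored" and are not the Nuria implementation.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical types for illustration only.
struct RawToken {
    int tokenId;
    std::string text;
};

using TokenHandler = std::function<void(RawToken &)>;

// Returns the next token whose id is not negative, starting at `cursor` and
// advancing it. Once nothing visible remains, a token with id -1 is returned.
RawToken nextVisibleToken(const std::vector<RawToken> &input,
                          std::size_t &cursor, const TokenHandler &handler) {
    while (cursor < input.size()) {
        RawToken candidate = input[cursor++];

        // The handler may decide late that the token should be dropped by
        // setting its id to a negative value.
        if (handler) {
            handler(candidate);
        }

        if (candidate.tokenId >= 0) {
            return candidate;
        }
    }

    return RawToken{-1, std::string()};
}
```

A handler such as `[](RawToken &t) { if (t.tokenId == 2) t.tokenId = -2; }` would, under these assumptions, hide every token with id 2 from the caller.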

Location and error handling

Nuria::Tokenizer automatically takes care of location and error tracking. You can access the cursor position using currentRow(), currentColumn() and currentPosition(), which return the current row, column and position in the data-stream respectively.

The same goes for the error position, which can be accessed using errorRow(), errorColumn() and errorPosition().

Token action handlers can move the internal cursor using setPosition().
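
To make the relation between position, column and row concrete, here is a minimal sketch of how such a cursor could be advanced over consumed bytes. It assumes zero-based rows and columns and '\n' as the only line break; neither assumption is confirmed by this documentation, and CursorLocation and advanceCursor are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Hypothetical cursor record; rows and columns are assumed to be zero-based.
struct CursorLocation {
    int position = 0;
    int column = 0;
    int row = 0;
};

// Advances the cursor over `length` bytes of `data`, starting at the current
// position. A '\n' byte starts a new row and resets the column.
void advanceCursor(CursorLocation &cursor, const std::string &data, int length) {
    const int size = static_cast<int>(data.size());
    for (int i = 0; i < length && cursor.position < size; ++i) {
        if (data[static_cast<std::size_t>(cursor.position)] == '\n') {
            ++cursor.row;
            cursor.column = 0;
        } else {
            ++cursor.column;
        }
        ++cursor.position;
    }
}
```

A handler that consumes extra input itself would have to keep all three values consistent in the same way before handing them to setPosition().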

Constructor & Destructor Documentation

Nuria::Tokenizer::Tokenizer ( QObject *  parent = nullptr)

Constructor.

Nuria::Tokenizer::~Tokenizer ( )
override

Destructor.

Member Function Documentation

void Nuria::Tokenizer::addTokenizerRules ( const QString &  name,
const TokenizerRules &  ruleSet 
)

Registers ruleSet with the tokenizer under the given name.

bool Nuria::Tokenizer::atEnd ( ) const

Returns true if the tokenizer reached the end of the data stream.

int Nuria::Tokenizer::currentColumn ( ) const

Returns the current column in the data-stream.

int Nuria::Tokenizer::currentPosition ( ) const

Returns the current position in the data-stream.

int Nuria::Tokenizer::currentRow ( ) const

Returns the current row in the data-stream.

const TokenizerRules& Nuria::Tokenizer::currentTokenizerRules ( ) const

Returns the currently used tokenizer rule-set.

TokenizerRules& Nuria::Tokenizer::defaultTokenizerRules ( )

Returns the default tokenizer rule-set.

int Nuria::Tokenizer::errorColumn ( ) const

Returns the column where the error occurred.

int Nuria::Tokenizer::errorPosition ( ) const

Returns the position in the data-stream where the error occurred.

int Nuria::Tokenizer::errorRow ( ) const

Returns the row where the error occurred.

bool Nuria::Tokenizer::hasError ( ) const

Returns true if the last call to nextToken() raised an error.

Token Nuria::Tokenizer::nextToken ( )

Moves the tokenizer onwards by one token, returning the most-recently read token.

If the token id of the returned token is less than 0, the returned token is to be ignored by the caller. This happens in the following scenarios:

  • If the tokenizer is already at the end (See atEnd() )
  • If an error occurred (See hasError() )
  • If all data from the current position to the end consists of ignored tokens
See also
atEnd hasError
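
A caller loop built on the rules above might look like the following sketch. drainTokens is a hypothetical helper written as a template, so it only relies on nextToken() and hasError() as documented here, plus the assumption that the returned token exposes a public tokenId member. FakeTokenizer is a stand-in used to exercise the loop without the Nuria library.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

enum class DrainResult { ReachedEnd, Failed };

// Hypothetical helper: feeds every non-ignored token to `onToken` and reports
// why iteration stopped. Assumes the token type has a public `tokenId`.
template <typename TokenizerT, typename Callback>
DrainResult drainTokens(TokenizerT &tokenizer, Callback onToken) {
    for (;;) {
        auto token = tokenizer.nextToken();
        if (token.tokenId >= 0) {
            onToken(token);
            continue;
        }

        // A negative id means: error, end of data, or only ignored tokens
        // were left. hasError() separates the first case from the others.
        return tokenizer.hasError() ? DrainResult::Failed
                                    : DrainResult::ReachedEnd;
    }
}

// Stand-in types for trying the loop out; not part of the Nuria API.
struct FakeToken {
    int tokenId;
};

class FakeTokenizer {
public:
    FakeTokenizer(std::vector<int> ids, bool failAtEnd)
        : ids_(std::move(ids)), failAtEnd_(failAtEnd) {}

    FakeToken nextToken() {
        if (index_ < ids_.size()) {
            return FakeToken{ids_[index_++]};
        }
        error_ = failAtEnd_;  // simulate an error once the ids run out
        return FakeToken{-1};
    }

    bool hasError() const { return error_; }

private:
    std::vector<int> ids_;
    bool failAtEnd_;
    std::size_t index_ = 0;
    bool error_ = false;
};
```

On failure, the caller would typically go on to read errorRow(), errorColumn() and errorPosition() to build a diagnostic message.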

void Nuria::Tokenizer::removeTokenizerRules ( const QString &  name)

Removes the rule-set called name. If name is empty the call will have no effect. If name is the currently used rule-set, the default rule-set will be the currently used one after the call.

void Nuria::Tokenizer::setCurrentTokenizerRules ( const QString &  name)

Tells the tokenizer to use the rule-set known as name from now on. If name is not a known rule-set, the default rule-set is used.
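
The fallback rules of removeTokenizerRules() and setCurrentTokenizerRules() can be summarised in a small model. RuleSetRegistry is a hypothetical class that stores a placeholder string per rule-set instead of real TokenizerRules; it only mirrors the behaviour described in the text and makes no claim about how Nuria stores rule-sets internally.

```cpp
#include <map>
#include <string>

// Hypothetical model of the documented fallback behaviour.
class RuleSetRegistry {
public:
    RuleSetRegistry() { sets_[""] = "default"; }

    void add(const std::string &name, const std::string &rules) {
        sets_[name] = rules;
    }

    // An empty name has no effect. Removing the current rule-set makes the
    // default rule-set the current one.
    void remove(const std::string &name) {
        if (name.empty()) {
            return;
        }
        sets_.erase(name);
        if (current_ == name) {
            current_.clear();
        }
    }

    // Unknown names select the default rule-set.
    void setCurrent(const std::string &name) {
        current_ = sets_.count(name) ? name : std::string();
    }

    const std::string &currentName() const { return current_; }

private:
    std::map<std::string, std::string> sets_;  // name -> placeholder rules
    std::string current_;                      // "" is the default rule-set
};
```

In this model the default rule-set can never disappear, which is why both functions are able to fall back to it.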

void Nuria::Tokenizer::setDefaultTokenizerRules ( const TokenizerRules &  ruleSet)

Sets the default tokenizer ruleSet.

void Nuria::Tokenizer::setPosition ( int  position,
int  column,
int  row 
)

Moves the cursor to position in the data being tokenized. Also sets the current column and row, which are only used for diagnostics.

void Nuria::Tokenizer::tokenize ( const QByteArray &  data)

Sets data to be tokenized. Use nextToken() to acquire the next token.

QByteArray Nuria::Tokenizer::tokenizeData ( ) const

Returns the data as passed to the last call to tokenize().

TokenizerRules Nuria::Tokenizer::tokenizerRules ( const QString &  name) const

Returns the rule-set called name.


The documentation for this class was generated from the following file: