Simplexer
Description:
The Challenge
You'll need to implement a simple lexer type, Simplexer, which, when constructed with a given string containing an expression in a simple language, transforms that string into a stream of Tokens.
Simplexer
Your Simplexer type is created with the expression it should tokenize. It should act like an iterator, yielding Token items until there are no more items to yield, at which point it should do whatever the appropriate action is for your chosen language.
Instances of the Simplexer class are initialized with a string and should be iterators as well as iterable, i.e. they must implement both __iter__ and __next__.
Like all iterators, __next__ should raise a StopIteration exception when no more tokens remain to be yielded.
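The iterator protocol described above can be sketched in isolation. This is a minimal illustration, not the kata's solution: `Letters` is a hypothetical stand-in that yields single characters instead of Token objects, so that only the `__iter__` / `__next__` / `StopIteration` mechanics are on show.

```python
class Letters:
    """Hypothetical stand-in for Simplexer: yields one character at a time."""

    def __init__(self, text: str):
        self._text = text
        self._pos = 0  # index of the next character to yield

    def __iter__(self):
        # An iterator returns itself, which makes it iterable as well.
        return self

    def __next__(self):
        if self._pos >= len(self._text):
            # Nothing left: signal exhaustion the way every iterator does.
            raise StopIteration
        ch = self._text[self._pos]
        self._pos += 1
        return ch


stream = Letters("ab")
print(iter(stream) is stream)  # True
print(list(stream))            # ['a', 'b']
print(list(Letters("")))       # []
```

A real Simplexer would keep the same three methods and replace the single-character step with token matching.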
Tokens
Tokens are represented by Token objects, which are preloaded for you and take the following shape:
class Token:
    def __init__(self, text: str, kind: str):
        self.text = text
        self.kind = kind
- Token.text is the value of the matched portion of the expression
- Token.kind is the type of the token (see below)
Language Grammar
The language for this task has a simple grammar, consisting of the following constructs and their associated token types:
Type Construct
integer: Any sequence of one or more decimal digits (leading zeroes allowed, no negative numbers)
boolean: Any of the following words: [true, false]
string: Any sequence of zero or more characters surrounded by "double quotes"
operator: Any of the following characters: [+, -, *, /, %, (, ), =]
keyword: Any of the following words: if, else, for, while, return, func, break
whitespace: Any sequence of the following characters: [' ', '\t', '\n']
- Consecutive whitespace should be collapsed into a single token
identifier: Any sequence of alphanumeric characters, as well as '_' and '$'
- Must not start with a digit
- Make sure that keywords and booleans aren't matched as identifiers
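One possible way to turn the grammar above into code is a single regular expression with one named group per construct. The sketch below is an assumption-laden illustration rather than a reference solution: the `Token` class is a local stand-in for the preloaded one, "alphanumeric" is taken to mean ASCII letters and digits, and the names `TOKEN_RE`, `KEYWORDS` and `BOOLEANS` are made up for this example. Words are matched as a whole first and classified afterwards, so keywords and booleans are never reported as identifiers.

```python
import re


class Token:
    # Local stand-in for the preloaded Token class (same shape as shown above).
    def __init__(self, text: str, kind: str):
        self.text = text
        self.kind = kind


KEYWORDS = {"if", "else", "for", "while", "return", "func", "break"}
BOOLEANS = {"true", "false"}

# One named group per construct; "word" is refined into
# keyword / boolean / identifier after matching.
TOKEN_RE = re.compile("|".join([
    r'(?P<string>"[^"]*")',
    r'(?P<whitespace>[ \t\n]+)',
    r'(?P<integer>[0-9]+)',
    r'(?P<word>[A-Za-z_$][A-Za-z0-9_$]*)',
    r'(?P<operator>[-+*/%()=])',
]))


class Simplexer:
    def __init__(self, expression: str):
        # finditer is lazy, so tokens are produced only on demand.
        self._matches = TOKEN_RE.finditer(expression)

    def __iter__(self):
        return self

    def __next__(self):
        match = next(self._matches)  # raises StopIteration when exhausted
        text, kind = match.group(), match.lastgroup
        if kind == "word":
            if text in KEYWORDS:
                kind = "keyword"
            elif text in BOOLEANS:
                kind = "boolean"
            else:
                kind = "identifier"
        return Token(text, kind)


print([(t.text, t.kind) for t in Simplexer("x+y")])
# [('x', 'identifier'), ('+', 'operator'), ('y', 'identifier')]
```

Because `finditer` silently skips characters that match no group, this approach relies on the guarantee that the input is lexically valid.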
Notes
Individual constructs are disambiguated by whitespace if necessary, so
- true123 is an identifier, as opposed to a boolean followed by an integer
- 123true is an integer followed by a boolean
- "123"true is a string followed by a boolean
- x+y is identifier operator identifier
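The cases above follow from scanning greedily: consume the longest run of digits or word characters first, and only then decide what kind of token it is. A hand-written sketch of that step is shown here; `scan_word_or_integer` is a hypothetical helper that assumes ASCII letters and digits and handles only words and integers, not the other constructs.

```python
import string

KEYWORDS = {"if", "else", "for", "while", "return", "func", "break"}
BOOLEANS = {"true", "false"}
WORD_START = set(string.ascii_letters + "_$")
WORD_CHARS = WORD_START | set(string.digits)


def scan_word_or_integer(source: str, start: int):
    """Return (text, kind) for the word or integer beginning at source[start].

    Precondition (assumed): source[start] is a digit or a word-start character.
    """
    end = start
    if source[start] in string.digits:
        # An integer stops at the first non-digit, so "123true" splits after "123".
        while end < len(source) and source[end] in string.digits:
            end += 1
        return source[start:end], "integer"

    # A word swallows letters, digits, '_' and '$', so "true123" stays whole.
    while end < len(source) and source[end] in WORD_CHARS:
        end += 1
    text = source[start:end]
    if text in KEYWORDS:
        return text, "keyword"
    if text in BOOLEANS:
        return text, "boolean"
    return text, "identifier"


print(scan_word_or_integer("true123", 0))  # ('true123', 'identifier')
print(scan_word_or_integer("123true", 0))  # ('123', 'integer')
print(scan_word_or_integer("123true", 3))  # ('true', 'boolean')
```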
Any character is permissible between double quotes, including keywords, numbers and arbitrary whitespace, so "true" and "123" are strings. The quotes "" are to be included in the Token.
The input strings are guaranteed to be lexically valid according to the grammar above. Specifically:
- Input will consist only of valid constructs that can be mapped unambiguously to one of the above tokens
- No assumptions need be made regarding the structure of tokens in the input, i.e. syntax.
- Input may be the empty string
That means the input will not contain any surprising characters, there is no need for error handling, and quotes will always appear in balanced pairs. This does not mean that the input needs to make semantic or syntactic sense. For example,
if 123) return else"five")(
is valid input for this task.
After all, the job of a lexer is not to interpret the given input, merely transform it into tokens that could then be passed on to e.g. a parser, which would then check that the tokens received are syntactically valid and imbue them with semantics.
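To make the point concrete, here is how the nonsensical example input might break down into tokens. The sketch uses a compact, hypothetical `lex` generator that yields (text, kind) pairs instead of Token objects, and it assumes ASCII letters and digits for identifiers; it is meant to show the expected token stream, not a definitive implementation.

```python
import re

KEYWORDS = {"if", "else", "for", "while", "return", "func", "break"}
BOOLEANS = {"true", "false"}
PATTERN = re.compile(
    r'(?P<string>"[^"]*")|(?P<whitespace>[ \t\n]+)|(?P<integer>[0-9]+)'
    r'|(?P<word>[A-Za-z_$][A-Za-z0-9_$]*)|(?P<operator>[-+*/%()=])'
)


def lex(source: str):
    """Yield (text, kind) pairs; a hypothetical helper for illustration only."""
    for match in PATTERN.finditer(source):
        text, kind = match.group(), match.lastgroup
        if kind == "word":
            kind = ("keyword" if text in KEYWORDS
                    else "boolean" if text in BOOLEANS
                    else "identifier")
        yield text, kind


for text, kind in lex('if 123) return else"five")('):
    print(f"{kind:<10} {text!r}")
# keyword    'if'
# whitespace ' '
# integer    '123'
# operator   ')'
# whitespace ' '
# keyword    'return'
# whitespace ' '
# keyword    'else'
# string     '"five"'
# operator   ')'
# operator   '('
```

Note that the unbalanced parentheses and the string glued to else are tokenized without complaint; rejecting them would be the parser's job.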
Stats:
Created | Jan 15, 2015 |
Published | Jan 15, 2015 |