A Small Programming Language

This document describes the design of PicoPoe, a small programming language. It is small, but rich enough for most projects. There are no constructs such as for or while, although loops are still possible. The language is very close to assembly languages, but it is not assembly: it allows rich (numerical and other) expressions, and it is, to a degree, object-oriented. Read on to find out more.

Monotype text in bold is used for reserved words, and should not be used for new identifiers. Monotype text in italic is meant to be replaced by actual identifiers or other code.

Source Code

Source code files have the extension .pip.

There are two kinds of comments: single-line and multi-line. The first runs from a semicolon until the end of the line:

; single line comment

The second kind is text delimited by braces:

{ multiple line comment { nested comment }
  the rest of the first comment }

An identifier is a sequence of characters that begins with an underscore or a letter (Unicode uppercase or lowercase), followed by any number of underscores, letters, or digits. Identifiers follow camel case notation: variables begin with a lowercase letter, constants begin with the character 'k', and types and functions begin with an uppercase letter.

Source File Structure

Source files begin with the imports:

import file-name, ...
import library-name.module-name, ...
import library-name.module-name.file-name, ...
...

Imports serve two purposes. They let us use the entities declared in other source files without writing their full names, and they tell the compiler about those entities and their features, which helps it prevent errors. file-name is written without an extension.
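As an illustration, a source file might begin as shown below. This is only a sketch: geometry, stdlib, strings, math, containers, and map are hypothetical names, and just the three import forms come from the syntax above.

```picopoe
; all file, module, and library names below are hypothetical
import geometry                     ; a file
import stdlib.strings, stdlib.math  ; two modules of one library
import stdlib.containers.map        ; one file of a module
```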

Next come the macros:

macro macro-name(argument, ...)
  replacement-text
  ...

macro-name and the arguments are just identifiers, and replacement-text is one or more lines of code. Notice how the arguments are delimited by parentheses and how the macro contents are indented one tab relative to the macro header. This indentation mechanism appears in other parts of PicoPoe as well, and it lets us drop unnecessary braces (or other delimiters) and semicolons at the end of lines. We usually indent our code anyway, so why not make use of that habit?

To invoke a macro we simply say:

macro-name(expression, ...)

and this bit of code will be replaced by the replacement text defined in the macro. Macro arguments are, of course, optional.
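A small sketch may make this concrete. It assumes that, on expansion, each argument name in the replacement text is substituted by the corresponding expression of the invocation; the macro Swap and the variables width, height, and t are hypothetical.

```picopoe
; hypothetical macro: exchanges two variables through a temporary
macro Swap(a, b, tmp)
  tmp <- a
  a <- b
  b <- tmp

; the invocation
Swap(width, height, t)

; would then be replaced by
t <- width
width <- height
height <- t
```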

Next in the source file come one or more protocols, wrappers, unions, or structs. Protocols look like this:

protocol protocol-name(argument:type, ...)
  operator-declaration
  array-declaration
  block-declaration
  ...

The arguments are used for what other languages call generics and are optional. See below for block, array, and operator declarations (note that these are just declarations, not implementations). For instance:

protocol Map(Key:struct Comparable, Data:struct Object)
  block AddElement(k:Key, e:Data):Void
  block RemoveElement(k:Key):Void
  block GetElement(k:Key):Data
  block ElementExists(k:Key):Bool

Note the use of struct Comparable and struct Object in the generic arguments of the protocol. The keyword struct says that Key and Data are types, not data. If we wrote only Comparable and Object, without struct, Key and Data would be data, not types. We may then say:

data myObject:Map(String, UWord)

myObject.RemoveElement("myElement")

Wrappers are just (extended) typedefs. This is the syntax:

wrapper wrapper-name(argument:type, ...) <underlying-type>
  literal-implementation
  operator-implementation
  array-implementation
  block-implementation
  ...

Again, arguments are for generics and optional. underlying-type is the old type name. wrapper-name is the new type name. For example, we may simply say:

wrapper Handle <**Object>

and, from then on, Handle and **Object will be synonyms. On the other hand, we have the option of completely redefining the interface of the underlying type and adding our own methods, operators, and more to it. We just cannot add new data members to a wrapper; we have to work within the underlying type's restrictions.
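As a sketch of the second option, the hypothetical wrapper below gives Real a new name and adds one method to it. It assumes that, inside a wrapper, self stands for the wrapped value and can be cast back to the underlying type with the cast expression described in the Expressions section.

```picopoe
; hypothetical wrapper: a length in meters, stored as a plain Real
wrapper Meters <Real>
  block ToKilometers:Real
    ! (self, Real) / 1000.0     ; assumes self is the wrapped value
```

No data is added here: the wrapper only attaches a new interface to the storage that Real already has.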

Unions are like this:

union union-name(argument:type, ...)
  data-declaration
  union-declaration
  struct-declaration
  ...

union-name is optional. If it is absent, an anonymous union is declared, and the identifiers inside the union must differ from the other identifiers in the scope where the union is declared, to avoid naming conflicts.

Structs inside unions and inside other structs are declared like this:

struct struct-name(argument:type, ...)
  data-declaration
  union-declaration
  struct-declaration
  ...

Here is an example:

union Registers
  struct
    bytes[4] eax
    bytes[4] ebx
    bytes[4] ecx
    bytes[4] edx
  struct
    bytes[2] reserved
    bytes[2] ax
    bytes[2] reserved
    bytes[2] bx
    bytes[2] reserved
    bytes[2] cx
    bytes[2] reserved
    bytes[2] dx
  struct
    bytes[2] reserved
    byte ah
    byte al
    bytes[2] reserved
    byte bh
    byte bl
    bytes[2] reserved
    byte ch
    byte cl
    bytes[2] reserved
    byte dh
    byte dl

Note the use of the keyword reserved (reserved fields cannot be accessed) and of anonymous structs, which work like anonymous unions. Elsewhere we can say:

data r:Registers

r.eax <- r.bx × (r.cl / 2)

Finally come the structs declared outside of other structs or unions. These are the classes, although they still use the keyword struct, like this:

struct struct-name(argument:type, ...) <category-name>
  equ-declaration
  data-declaration
  union-declaration
  struct-declaration
  literal-implementation
  operator-implementation
  array-implementation
  block-implementation
  initer-implementation
  ...

category-name may be omitted (in which case we don't write < >), or it may be the name of the super class, the name of a protocol, the reserved word Private, or a general identifier beginning with an uppercase letter. Usually, a class is just a sequence of these struct declarations, each grouped under its own category. If the category name is omitted, the class inherits from no super class. A category with no name, or with the super class name, may add data to the class. A category named after a protocol must implement the methods and operators of that protocol. The methods implemented in a Private category are not accessible to subclasses or to other classes. A class may have only one category of each name, except Private, which may appear more than once in a class definition. Here is an example:

struct Rectangle <GeometricFigure>
  data x:Real, y:Real, w:Real, h:Real

  block Init(x:Real, y:Real, w:Real, h:Real):Rectangle
    self.x <- x
    self.y <- y
    self.w <- w
    self.h <- h
    ! self

  block Area:Real
    ! w × h

  block MakeSquare(s:Real):Void
    w h <- s

struct Rectangle <ChangeOrigin>
  block MoveToOrigin:Void
    x y <- 0.0

  block CenterOnOrigin:Void
    x <- -w / 2.0
    y <- -h / 2.0

Here we see that class Rectangle is defined across two categories so far: GeometricFigure (its super class) and ChangeOrigin, a protocol whose two methods are declared elsewhere in the program and implemented here. We could have added more protocol categories, or even Private or general categories, but not more super class categories or unnamed categories. Exactly one category of these last two kinds must be present, and it must be the first category of the class.
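To round off the example, the ChangeOrigin protocol might be declared as sketched below, following the protocol syntax seen earlier, and the class could gain a Private category. The declaration of ChangeOrigin is not given in this document, so this one is an assumption, and the HalfWidth method is hypothetical.

```picopoe
; assumed declaration of the protocol implemented by Rectangle
protocol ChangeOrigin
  block MoveToOrigin:Void
  block CenterOnOrigin:Void

; a further category of the same class; its methods cannot be
; called by subclasses or by other classes (HalfWidth is hypothetical)
struct Rectangle <Private>
  block HalfWidth:Real
    ! w / 2.0
```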

Primitive Types

There are only four primitive data types, and they're all related:

bit
bits[unsigned-word]
byte
bytes[unsigned-word]

If we say:

data s:bit
data v:bits[7]

we end up with data occupying exactly one byte in memory: the single bit and the seven bits are packed together, rather than each being given whole bytes of its own as in other programming languages.
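The same packing can be used to lay out a record of flags. The sketch below assumes that consecutive members of a struct are packed just like the two declarations above; Status and its member names are hypothetical.

```picopoe
; hypothetical status record: 1 + 1 + 2 + 4 = 8 bits, that is, one byte
struct Status
  data carry:bit
  data zero:bit
  data mode:bits[2]
  data level:bits[4]
```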

If we wish to pass blocks around, we may do so with the block type:

(argument-type, ...)::(return-type, ...)

This is the signature of a block, giving both its argument types and its return types. It may be used anywhere we expect a block name to be passed around.
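As a sketch of how a block type might be used, the hypothetical block Twice below receives another block as its first argument. Two things are assumed here and not stated above: that a block is passed simply by writing its name, and that an argument of block type is called with the ordinary call syntax.

```picopoe
; Halve matches the block type (Real)::(Real)
block Halve(x:Real):Real
  ! x / 2.0

; Twice takes any block of that type and applies it two times
block Twice(f:(Real)::(Real), x:Real):Real
  ! f(f(x))

quarter <- Twice(Halve, 10.0)
```

Under these assumptions, quarter would receive 2.5, since 10.0 is halved twice.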

Member Declarations

Constants are declared in one of several ways. The easiest case is with just one constant:

static equ constant <- expression

This works if the constant is of the same type as the class. If not, we can use:

static equ constant1:type1 <- expression1, constant2:type2 <- expression2, ...

Any combination of these two cases is possible. static is optional: when present, the constant belongs to the class; when absent, the constant may differ per instance. Enumerations are like this:

equ enum-name
  name1
  name2 name3
  name4 <- unsigned-word
  name5
  ...

The identifier enum-name is optional, and, if absent, care must be taken to avoid naming conflicts. name1 is 0 (zero), name2 and name3 are on the same line and therefore synonyms (both equal to 1), name4 is initialized with a number, name5 is that number plus 1, and so on.
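For instance, the hypothetical enumeration below applies these rules. The names begin with 'k' because they are constants, and the values in the comments follow from the numbering just described.

```picopoe
equ Priority
  kLow                ; 0
  kNormal kDefault    ; synonyms, both 1
  kHigh <- 10         ; 10
  kUrgent             ; 11

p <- Priority.kHigh   ; accessed as enum-name.equ
```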

Data declarations are like this, for example:

static getter setter data data1:type1 <- expression1, data2:type2 <- expression2, ...

static, getter, and setter are optional. static says the data belongs to the class, not to its instances. getter allows the data to be read from outside of the class, like this:

x <- myObj.data

setter allows the data to be written from outside of its class:

myObj.data <- x

These are similar to the public/private mechanisms of other languages. The initializer expressions expression1, expression2, ... are also optional. A type is:

*type-name(generics)[array-size]

* indicates the data is in fact a pointer. We may have pointers to pointers to pointers... Generics were seen above, with the Map example. Arrays, if present, may be multidimensional, like this:

data myArray:UWord[10][20][30]
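Putting the modifiers and the type forms together, a class might declare its data as in the sketch below. Inventory and its members are hypothetical, and the comments on what is allowed assume that data without getter or setter cannot be reached from outside the class.

```picopoe
struct Inventory <Object>
  static getter data count:UWord <- 0     ; one value shared by the class
  getter setter data owner:String         ; readable and writable from outside
  getter data stock:Map(String, UWord)    ; generic type, readable from outside
  data next:*Inventory                    ; pointer to another Inventory
  data grid:UWord[10][20]                 ; two-dimensional array

; elsewhere, with inv of type Inventory
name <- inv.owner         ; allowed: owner has a getter
inv.owner <- "Ann"        ; allowed: owner has a setter
; x <- inv.next           ; not allowed: next has no getter
```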

A literal is declared like this:

literal regular-expression
  statement
  ...

Here is an excerpt of an example:

struct Bool <Object>
  data value:bit

  literal false
    value <- 0

  literal true
    value <- 1

Operators come in several flavours:

prefixop symbol:return-type
  statement
  ...

suffixop symbol:return-type
  statement
  ...

linfixop symbol(argument:argument-type):return-type
  statement
  ...

rinfixop symbol(argument:argument-type):return-type
  statement
  ...

There are prefix operators, like ¬myVar, suffix operators like myVar--, and infix operators (left and right associative) like myVar ∈ mySet. symbol is any Unicode mathematical symbol, or combination (without spaces), with a few exceptions. Operators may be overloaded. In case of ambiguities, the compiler should select longest match first, followed by order of imports, followed by order of implementation. The order of implementation gives us the operator precedence, from highest to lowest.
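As a sketch, a hypothetical two-dimensional vector class could define a dot product as a left-associative infix operator. The symbol ⋅ is chosen only for illustration, and the example assumes that Real provides an infix + operator, which is not shown in this document.

```picopoe
struct Vector2 <Object>
  getter data x:Real, y:Real

  ; dot product, written a ⋅ b
  linfixop ⋅(other:Vector2):Real
    ! (x × other.x) + (y × other.y)     ; assumes + exists for Real
```

The parentheses keep the example independent of the relative precedence of × and +.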

Array access is like the following:

arrayget(index:index-type):return-type
  statement
  ...

arrayset(index:index-type, value:value-type):return-type
  statement
  ...

We have already seen examples of blocks. They're just pieces of code, like this:

static block name(argument:argument-type, ...):return-type, ...
  statement
  ...

Blocks may be overloaded. They may or may not be static. They may return more than one value.

An initer is simply code that gets called at class load time, to initialize static data or run other code at that time. Initers are usually the last blocks to be implemented in the class.

initer
  statement
  ...
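For example, an initer might compute static data that would be awkward to write as a literal. The class Angle, the constant kPi, and the data degreesPerRadian below are hypothetical.

```picopoe
struct Angle <Object>
  static equ kPi:Real <- 3.14159265
  static getter data degreesPerRadian:Real

  ; runs once, when the class is loaded
  initer
    degreesPerRadian <- 180.0 / kPi
```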

Statements

The simplest statement is probably a data declaration. These are similar to the data declaration we saw above, but have no getter and no setter modifiers. They must appear after the block arguments and before all other statements. Next are the assignments (we already saw a few):

data1 data2 ... <- expression

And the associated return:

! expression1, expression2, ...

For example, the code:

block MyFunc(x:Real, y:Real):Real, Real
  ! x / y, x \ y

quo rem <- MyFunc(10.0, 5.0)

places the quotient of dividing 10.0 by 5.0 in quo and the remainder of that division in rem. If a block returns no values, a return may come all by itself:

!

A statement may also be an expression (see below). Next come the labels and the gotos:

block name(argument:argument-type, ...):return-type, ...
  statement1
  ...
@label
  statement2
  ...

Labels are identifiers, and are preceded by the character @, indented at the same level as the block where they appear. Gotos are like this:

boolean -> label-if-true | label-if-false

The boolean expression is evaluated. If it is true, execution continues at the instruction following the label label-if-true; if it is false, execution continues at the instruction following label-if-false. The part “| label-if-false” may be omitted, in which case no jump occurs when the expression is false. A block call may be used in place of either label. More generally, we have the following construct:

unsigned-word -> 0:label-if-0 | 1,3,5..9:label-if-1-3-5-6-7-8-9 | ... | label-if-default

Or the more readable:

unsigned-word -> 0:label-if-0
  | 1,3,5..9:label-if-1-3-5-6-7-8-9
  | ...
  | label-if-default

This jumps to the labels according to the value of the unsigned-word expression. Again, the labels may be block calls. The numbers 0, 1 and so on may be enum constants. Note the use of ranges and of multiple values. Finally, there's the assembler statement:

block name(argument:argument-type, ...):return-type, ...
  statement1
  ...
  asm(cpu)
    asm-statement1
    ...
  @asm-label
    asm-statement2
    ...
  statement2
  ...

Notice how the @asm-label is indented at the same level as the asm() statement.
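Labels and gotos are what replace the for and while of other languages. The sketch below shows one possible loop and one multi-way jump. The blocks SumTo and Classify are hypothetical, and the loop assumes that UWord provides the infix operators > and +, which are not shown in this document; the literal true is the one from the Bool example.

```picopoe
; sums the integers from 0 to n
block SumTo(n:UWord):UWord
  data i:UWord <- 0, total:UWord <- 0
@next
  i > n -> done           ; leave the loop once i has passed n
  total <- total + i
  i <- i + 1
  true -> next            ; unconditional jump back to @next
@done
  ! total

; classifies a value as a kind of decimal digit
block Classify(digit:UWord):String
  digit -> 0:zero
    | 2,3,5,7:prime
    | 4,6,8..9:composite
    | other
@zero
  ! "zero"
@prime
  ! "prime"
@composite
  ! "composite"
@other
  ! "other"
```

Under these assumptions, SumTo(3) would return 6, Classify(9) would return "composite", and Classify(1) would fall through to the default label and return "other".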

Expressions

Any expression may be enclosed in parentheses. It may be data, optionally preceded by an object:

object.data

It may be a constant:

object.equ
enum-name.equ

Or a block call:

object.block(expression1, expression2, ...)

It may be a literal:

object.literal

It may be a struct or a union field:

struct.data
union.data

It may be the contents pointed to by a variable (we may have pointers to pointers to pointers...):

*data

It may be a meta access:

data@size
data@name
data@addr
data@type

Or an array access:

data[unsigned-word]

It may involve an operator:

prefix-op expression
expression suffix-op
expression infix-op expression

It may be a cast:

(expression, type)

It may be a primitive call:

pmtv(operation, expression1, expression2, ...)

It may be the ternary operator of other languages:

boolean ? expression-if-true : expression-if-false

It may be:

self
super

Or it may be the following structure used to initialize arrays, dictionaries or maps, or other complex structures:

[expression1a | expression1b | ..., expression2a | expression2b | ..., ...]

Finally, it may be the array:

args

used when a block is declared with a variable number of arguments (with an ellipsis). We may say:

args.count
args[unsigned-word]
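As a sketch, a hypothetical block Total could add up all of its arguments. Several details are assumptions: the placement of the ellipsis in the header, the cast of each argument to Real, and the infix operators ≥ and +, none of which are spelled out in this document.

```picopoe
; hypothetical variadic block; the ellipsis marks the variable arguments
block Total(...):Real
  data i:UWord <- 0, sum:Real <- 0.0
@next
  i ≥ args.count -> done          ; stop after the last argument
  sum <- sum + (args[i], Real)    ; cast each argument to Real
  i <- i + 1
  true -> next
@done
  ! sum
```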

Preprocessor Directives

There's just the conditional compilation directive:

#if condition
statement1
...
#elsif condition
statement2
...
#else
statement3
...
#fi

condition may be complex, using the logical operators ¬, ∧, and ∨, parentheses, and symbols defined in the call to the compiler.
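For example, a block might print messages only when the compiler is not called with a symbol that silences them. In the sketch below, QUIET is a hypothetical symbol defined in the call to the compiler, and Console.Print is a hypothetical block.

```picopoe
block Report(msg:String):Void
#if ¬QUIET
  Console.Print(msg)      ; hypothetical block that prints a string
#else
  !                       ; nothing to do when QUIET is defined
#fi
```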