5. Type System¶
The Mu type system is defined in the specification.
Like many programming languages and frameworks, Mu also has a type system.
The Mu type system is low level. There is no object-oriented programming concepts, such as class, inheritance, polymorphism. There is no high-level concepts such as strings, either. The language implementer is responsible to implement these high-level concepts.
But the Mu type system is also not too low level. Notably, unlike C, C++ or LLVM, the Mu type system still has object reference types in it, and the garbage collector is fully aware of the presence of them. The main idea is, as long as you use the Mu type system, and refer to heap objects using references, you can forget about garbage collection details, such as stack maps, GC-safe points, and read/write barriers.
5.1. Overview¶
Some of the types (actually type constructors, explained later) contain angular brackets. These are parameters to these types which may be integer literals, other types or function signatures.
The types in the Mu type system can be put into several categories:
- Scalar value types:
int<n>
,float
,double
,uptr<T>
andufuncptr<sig>
. These types represent plain values. - Scalar reference types:
ref<T>
,iref<T>
,weakref<T>
,funcref<sig>
,threadref
,stackref
,framecursorref
,irbuilderref
andtagref64
. These types refer to “things” in the Mu micro VM. All such references are opaque in the sense that their representation is implementation dependent. - Composite types:
struct<F1 F2 ...>
,array<T n>
,hybrid<F1 F2 ... V>
andvector<T n>
. These types combine simpler types into more complex types. - The void type
void
. It has only one use case: when used as the “referenced type” of references or pointers, it conveys the meaning of “reference/pointer to anything”.
This is a complete list of Mu types. All values of Mu come from one of these types.
5.2. Define Types and Function Signatures¶
5.2.1. Type definition¶
To define a type in Mu, you use the top-level type definition: .typedef
.
.typedef @i1 = int<1>
.typedef @i8 = int<8>
.typedef @i16 = int<16>
.typedef @i32 = int<32>
.typedef @i64 = int<64>
.typedef @float = float
.typedef @double = double
.typedef @ptri32 = uptr<@i32>
.typedef @foo.fp = ufuncptr<@foo_sig>
.typedef @refi32 = ref <@i32>
.typedef @irefi64 = iref<@i64>
.typedef @weakreffloat = weakref<@float>
.funcsig @foo.sig = (@i32) -> (@i32)
.typedef @foo.fr = funcref<@foo_sig>
.typedef @stackref = stackref
.typedef @threadref = threadref
.typedef @stkref = stackref
.typedef @thrref = threadref
.typedef @fcref = framecursorref
.typedef @ibref = irbuilderref
.typedef @struct1 = struct<@i32 @i32 @i32>
.typedef @struct2 = struct<@i64 @double>
.typedef @array1 = array<@i32 10>
.typedef @array2 = array<@i32 4096>
.typedef @hybrid1 = hybrid<@i64 @i32>
.typedef @hybrid2 = hybrid<@i8>
.typedef @hybrid3 = hybrid<@i64 @i64 @i64 @float>
.typedef @4xi32 = vector<@i32 4>
.typedef @void = void
Every type defined in the Mu IR has a name, which is on the left side of the
equal sign. All characters in [a-zA-Z0-9_.]
are legal. You can use the dot
.
arbitrarily in the name. So the dot in @foo.sig
does not mean anything
special to Mu.
On the right side of the equal sign is the type constructor: it constructs a type. Some type constructors take parameters while others do not.
What are type constructors?
If we imagine a Mu type as a Java or C++ object, then the type constructor
is like the constructor of such an object. int
is just the abstract concept
of integer, but int<32>
is a concrete 32-bit integer type. Similarly
ref<@i32>
constructs a reference type to @i32
:
.typedef @i32 = int<32>
.typedef @refi32 = ref<@i32>
Some type constructors, such as float
, double
, threadref
or
void
, do not take any parameters. You can consider them as C++/Java
constructors with an empty parameter list. You may have written new Object()
or new StringBuilder()
before. Similarly you define a concrete instance of
float
type in this way:
.typedef @float = float
.typedef @blahblah = float
, where the name @float
or @blahblah
are just names.
When types or function signatures are taken as argument, their names (such as
@i32
, @float
and @void
, not int<32>
, float
or void
) are
used. So the following are not accepted by Holstein:
.typedef @refi32 = ref<int<32>> // ERROR! int<32> must be defined separately.
.typedef @refvoid = ref<void> // ERROR! void must be defined separately.
.typedef @bar.ref = funcref<(@i32) -> (@float)> // ERROR! The signature must be defined separately.
But these are right:
.typedef @i32 = int<32>
.typedef @refi32 = ref<@i32> // Correct.
.typedef @void = void
.typedef @refvoid = ref<@void> // Correct.
.typedef @float = float
.funcsig @bar.sig = (@i32) -> (@float)
.typedef @bar.ref = funcref<@bar.sig> // Correct.
Note
So why does Mu force all types to be “constructed” at the top level? Well,
that’s what Holstein accepts now. There are alternative text Mu IR parsers that accept in-line types
such as ref<int<32>>
.
Actually, productional Mu and client implementations will use the IR building API. It will skip the text parsing phase completely.
The reason why Holstein was designed like that was to let the text match the
actual data structure of the IR. In the IR building API, each type is a
“node”. Types that have parameters (such as ref<T>
) refer to other
nodes by their IDs. Similarly, in the text form, such type constructors
refer to other types by names.
If you have used LLVM before, you may find that you can write types “directly”, “inline”, in the LLVM IR, such as:
%c = add i32 %a, %b
%f = fadd double %d, %e
%g = load i32* %x
But have a look at the C++ API of the LLVM:
Type *i32 = Type::getInt32Ty(ctx);
Type *i64 = IntegerType::get(ctx, 64); // alternative method
Type *floatTy = Type::getFloatTy(ctx);
Type *doubleTy = Type::getDoubleTy(ctx);
Type *voidTy = Type::getVoidTy(ctx);
Type *blahblah = Type::getFloatTy(ctx);
Type *ptri32 = Type::getInt32PtrTy(ctx);
Type *ptri64 = PointerType::getUnqual(i64);
In this API, the programmer still needs to refer to types by pointers to the types. So this API is more similar to having to define (or, at least, make pointers to) the types separately.
On the other hand, there is only 19 types in the Mu type system, among which only 6 do not take arguments. Even if the client programmer has to define each and every types, all common types can be defined in about 20 lines as above, and his/her pain ends there.
5.2.2. Function signature definition¶
A function signature defines the parameter types and the return types of a
function. It is defined by the .funcsig
top-level definition:
.typedef @i32 = int<32>
.typedef @float = float
.funcsig @sig1 = (@i32) -> (@float)
.funcsig @sig2 = (@i32 @i32 @i32) -> (@i32 @float)
.funcsig @sig3 = () -> (@i32)
.funcsig @sig4 = (@i32) -> ()
.funcsig @sig5 = () -> ()
.typedef @funcref1 = funcref <@sig1>
.typedef @ufuncptr1 = ufuncptr<@sig1>
On the left side of =
is the name of the signature. On the right side is the
function signature constructor. In Mu, a function takes 0 or more parameters and
return 0 or more values. It is written in the form (parameter types) ->
(return types)
.
A function signature is not a type. Unlike the C or C++ programming language, there is no “function type” in Mu. In fact, in C, if an expression has function type, it is implicitly converted to the pointer of that function. Mu takes the explicit approach: there are two types that use function signatures:
- The
funcref<sig>
type refers to a Mu function which has signaturesig
. - The
ufuncptr<sig>
type is a pointer that points to a native function that has signaturesig
.
When defining or declaring functions, such as:
.funcdecl @foo <@sig1>
.funcdef @bar VERSION %v1 <@sig2> {
...
%rv = CALL <@sig1> @foo (...) // arguments omitted
...
}
The names of the functions @foo
and @bar
has the funcref<@sig1>
and
the funcref<@sig2>
type, respectively, when used as a value.
5.3. Details¶
This section will only discuss the most important types. For more details, you can read the Type System section of the specification.
5.3.1. Integer and FP types¶
int<n>
is the integer type of n
bits. Like LLVM, the int
type is
fixed-length. For example, int<32>
is the 32-bit integer type.
.typedef @i32 = int<32>
.typedef @i64 = int<64>
It is also signedness-neutral: whether an integer is signed or not depends on
the operation, not the type. Most instructions, such as ADD
, SUB
,
MUL
, work correctly for both signed and unsigned integers. Some instructions
have signed and unsigned variants, such as SDIV
/UDIV
,
FPTOSI
/FPTOUI
.
Like LLVM, int<1>
is returned by most instructions that return Boolean
results, such as EQ
and SLT
.
.typedef @i1 = int<1>
float
and double
are the IEEE 754 single and double-precision floating
point number types, respectively.
.typedef @float = float
.typedef @double = double
Like LLVM but unlike some intermediate languages such as C minus minus, Mu does not use a single type (such as “bits32”) to hold both integers and FP numbers, because in modern machines integers and FP numbers are usually held in different kinds of registers.
5.3.2. References to the memory¶
ref<T>
is the object reference type. It refers to objects in the
garbage-collected Mu heap.
iref<T>
is the internal reference type. It refers to a memory
location, that is, a place in the Mu memory that can be loaded or stored. A
field of a heap object is a memory location.
Attention
“Memory location” does not mean “address”. Do not assume a Mu heap object or any other memory locations have addresses. This is very important in Mu. This will discussed in details in later chapters. The specification contains some explanation
Both ref
and iref
may be NULL
.
Note
Sorry for the billion-dollar mistake, but NULL
is
really easy to implement, and Mu is closer to the machine. The client, on
the other hand, should implement a decent language and help the programmers
prevent such mistakes.)
The <T>
type parameter is the type of the heap object it refers to.
For ref<T>
, the T
means it refers to a heap object of type T
. For
example, ref<@i32>
refers to a heap object of @i32
type, which we
previously defined as int<32>
:
.typedef @refi32 = ref<@i32>
.typedef @refdouble = ref<@double>
.typedef @link = ref<@link>
In the last line, @link
is recursively defined as ref<@link>
. It means
it refers to a heap object, whose entire content is an object reference to the
same type, or NULL
. It is very similar to the C definition: struct Link {
struct Link *next; }
. Mu does not need struct
to construct recursive
types.
For iref<T>
, the T
means it refers to a memory location of type T
.
So if you use the LOAD
instruction on an iref<@i32>
, you get a value of
type @i32
. You can also STORE
an @i64
value to a memory location
referred by an iref<@i64>
.
.typedef @irefi32 = iref<@i32>
.typedef @irefi64 = iref<@i64>
5.3.3. References to Mu functions¶
funcref<sig>
is the function reference type. It refers to a Mu
function. Whenever you call a Mu function, you call it with its function
reference. sig
is the function signature.
.funcsig @sig1 = (@i32) -> (@float)
.typedef @funcref1 = funcref <@sig1>
funcref
only refers to Mu functions. It cannot refer to C functions (that is
what ufuncptr
is for).
Like other references, funcref
can also be NULL
(sorry).
5.3.4. Aggregate types¶
Among all aggregate types, hybrid
is the only “variable-length” types. All
others are “fixed-length”.
5.3.4.1. Fixed-length aggregate types¶
struct<F1 F2 ...>
is the structure type. Like the struct type in
C, it has many fields of types F1
, F2
, ...
.typedef @struct1 = struct<@i32 @i32 @i32>
.typedef @struct2 = struct<@i64 @double>
Structs may contain other structs, arrays or vectors, but cannot contain themselves (otherwise it will be infinitely big). It must have at least one field. But it may contain references so that you can allocate many structs in the Mu heap, each refer to another object.
.typedef @ListNode = struct<@i64 @ListNodeRef>
.typedef @ListNodeRef = ref<@ListNode>
array<T n>
is the fixed-size array type. T
is the element type.
n
is an integer literal and it is part of the type. A particular
array<T n>
holds exactly n
instances of T
. For example, array<@i32
10>
contains exactly 10 @i32
values:
.typedef @array1 = array<@i32 10>
.typedef @array2 = array<@i32 4096>
Like structs, arrays may contain other structs, arrays or vectors, but not itself. It must have at least one element. Arrays of references are allowed.
vector<T n>
is the vector type. It is designed for single-instruction
multiple-data (SIMD) operations. Most modern desktop processors have SIMD
capabilities. Vectors are used in very different ways compared to arrays.
Vectors are usually small and are usually similar to the vector sizes supported
by the machine.
Even today, architectures still do not agree upon any particular vector sizes. Mu only mandate the following three vector types to be implemented:
.typedef @4xi32 = vector<@i32 4>
.typedef @4xfloat = vector<@float 4>
.typedef @2xdouble = vector<@double 2>
5.3.4.2. The hybrid¶
hybrid<F1 F2 ... V>
is a hybrid of a struct and an array. It starts
with a fixed part: F1
, F2
, ... which is like a struct. It is
followed by a variable part: an array of many elements of type V
.
.typedef @hybrid1 = hybrid<@i64 @i32>
.typedef @hybrid2 = hybrid<@i8>
.typedef @hybrid3 = hybrid<@i64 @i64 @i64 @float>
In the above example, @hybrid1
has one @i64
field in its fixed part, and
many @i32
elements in its variable part. @hybrid2
has an empty fixed
part, and its variable parts are many @i8
elements. @hybrid3
has three
@i64
fields in its fixed part, and many @float
elements in the variable
part.
hybrid
is the only variable-size type type in Mu whose size is determined
at allocation site rather than the type itself. A hybrid must be allocated by
special instructions, such as NEWHYBRID
, which takes not only the type but
also the length as its arguments.
%length1 = .....
%length2 = .....
%r1 = NEWHYBRID <@hybrid1 @i64> %length1 // @i64 is the length of %length1
%r2 = NEWHYBRID <@hybrid1 @i64> %length2 // @i64 is the length of %length2
In the above example, %r1
and %r2
refers to two different objects. Both
have type @hybrid1
, but the length of their variable parts are %length1
and %length2
, respectively.
Since the length cannot be determined by the type itself, it cannot be embedded in other aggregate types, not even other hybrids:
.typedef @some_struct = struct<@i64 @hybrid1 @hybrid2> // ERROR! cannot embed hybrids
Hybrid is the counterpart of the C99 structs with “flexible array elements”. In C99, you can write something like:
struct hybrid1 { int64_t f1; int32_t v[]; };
struct hybrid2 { int32_t v[]; };
struct hybrid3 { int64_t f1, f2, f3; float v[]; };
struct hybrid1 *p1 = malloc(sizeof(int64_t) + 1000*sizeof(int32_t));
struct hybrid1 *p2 = malloc(sizeof(int64_t) + 2000*sizeof(int32_t));
Once malloc-ed with enough memory, C can access the dynamically allocated “tail” elements.
5.3.5. The void type¶
void
means “anything”, and can only be used as the target of references or
pointers. For example:
.typedef @void = void
.typedef @refvoid = ref<@void>
.typedef @irefvoid = iref<@void>
.typedef @uptrvoid = uptr<@void>
ref<@void>
means the object reference can refer to any object.
iref<@void>
means the internal reference can refer to any memory location.
uptr<@void>
means it is... err... just a pointer, and has not been assigned
a type yet.
Mu does not have the concept of “inheritance”, but there are some “prefix rules”
so that a reference may refer to some more complex objects than its <T>
parameter. void
is just the “simplest” type: no content at all.
You can allocate heap objects of the void
type.
%r = NEW <@void>
Such objects have no contents, but each allocated void
object is different,
and compares equal (EQ
) to only itself.
In Java, such use is like Object o1 = new Object();
. There are some corner
cases where such objects can be used as a “key” to identify something.
5.3.6. Other types¶
stackref
, threadref
and framecursorref
refers to “special
things” in Mu: stacks, threads and frame cursors. You will need the first two to
start a Mu program, and need the third to perform stack introspection and
on-stack replacement.
weakref<T>
is the weak object reference type.
tagref64
is the tagged reference type. It uses some clever bit-magic to
reuse the NaN space of double
to represent a tagged union of double
,
int<52>
and a struct of ref<void>
and int<6>
.
uptr<T>
and ufuncptr<sig>
are untraced (raw) pointers. They are
defined to be represented as integers of the pointer size, which is
implementation-specific. For example, on a 64-bit implementation, it is 64 bits.
But if you want to perform pointer arithmetic, you need to convert them to
integers first.
You are unlikely to use raw pointers unless your program interacts with native programs (usually written in C). The garbage collector will not trace them: they are treated just like integers.
If you worked with x86 before, you may ask: Wait! Pointer is not just the address, but also its segment. Sorry, x86. But we see the trend is to move away from segmented architecture (x86_64 moved away from segments, too). For embedded systems that may have multiple address spaces, Mu is not designed for such systems, but supporting such architectures is an open topic.
5.4. Bonus section¶
Note
These contents should be moved to other chapters in the future. But if you are interested and patient enough, you can keep reading.
In fact, an internal reference refers to a “memory location” (discussed
in later chapters) of type T
. Memory location is a very important
concept in Mu. It is a location in the Mu memory that can hold a Mu value.
A field of an object is one kind of memory location. All memory accessing
operations, such as LOAD
and STORE
directly work on internal
references. This is different from JVM, where there are getfield
and
setfield
instructions that work on object references.
If you worked with C before, it is the counterpart of the concept of “object”. (What? You say C does not have “objects” but C++ does? Go ahead and read the C specification. In C, “object” means “a region of data storage” and does not mean object-oriented programming.) But the word “object” is used as a synonym of “heap object” in Mu. To avoid ambiguity, we use the word “memory location” instead.