Homogenous sequences
I was just idly thinking about those "sequence of xxx" discussed eons ago.
type          current  proposed
#80 (0b0000)  seq      empty sequence (and/or "legacy behaviour")
#81 (0b0001)           sequence of integer [1] *
#82 (0b0010)  string   sequence of 64-bit floats (8 bytes per) [array] **
#83 (0b0011)           sequence of atom [1] ***
#84 (0b0100)           string (1 byte per, see the UTF notes below)
#85 (0b0101)           sequence of string|int [1] ***
#86 (0b0110)           sequence of 128-bit floats (16 bytes per) [array] **
#87 (0b0111)           sequence of string|atom [1] ***
#88 (0b1000)           sequence of sequence [1] ***
#89 (0b1001)           sequence of sequence|integer [1] ***
#8A (0b1010)           sequence of 64-bit arrays (/illegal?)
#8B (0b1011)           sequence of sequence|atom [1] ***
#8C (0b1100)           sequence of string [1] ***
#8D (0b1101)           sequence of sequence|string|integer [1] ***
#8E (0b1110)           sequence of 128-bit arrays (/illegal?)
#8F (0b1111)           sequence of object [1] (=="legacy")
#12           atom     (unchanged)
Note those are the [ref*4-1] bytes, not the 1..15 T_integer..T_object of the compiler.
So, first replace all #82 in the backend with #84 (should be no problems with doing that).
#8F is basically what we have now, maybe try that (blanketwise) and see what probs arise.
UTF-8 and/or UTF-16-encoded binary strings are perfectly good enough, bar ws[i]<->#hhhh,
in other words it is no great hardship to convert to UTF-32 before anything char-by-char.
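
As a rough illustration of the "convert first, then go char-by-char" point, here is a minimal sketch. Python is used purely for illustration, and `to_utf32` is a hypothetical name, not an existing routine. It decodes an encoded byte string into one code point per element, after which ws[i] indexes characters rather than bytes.

```python
def to_utf32(raw, encoding="utf-8"):
    """Decode a UTF-8 (or UTF-16) byte string into a list with one code point per element."""
    return [ord(ch) for ch in raw.decode(encoding)]

raw = "h\u00e9llo".encode("utf-8")   # the accented letter occupies two bytes
ws = to_utf32(raw)                   # one entry per character
print(len(raw), len(ws), hex(ws[1]))
```

The byte string is the longer of the two, which is exactly why raw[i] and ws[i] disagree once any non-ASCII character is present.
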
Existing opcodes opRepe etc should ||= type bytes, opSubse etc can avoid some refcounts.
Obviously opMkSq should also collect element types, and repeat() shd be a smidge easier.
Even better savings occur inline: types shd be mirrored in the new parse tree localtypes.
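
The ||= idea might look something like the sketch below, written in Python for illustration only. The names `elem_bits`, `make_seq_type` and `store_elem` are hypothetical. It assumes a float element widens the type to "atom" (0b0011), so that the bare 0b0010 code stays free for the packed arrays, and it leaves strings out because the table gives them codes (#84, #8C) that are not a plain OR of element bits.

```python
T_INT, T_ATOM, T_SEQ = 0b0001, 0b0011, 0b1000   # low-nibble bits, as in the table

def elem_bits(x):
    """Type bits contributed by one element (strings deliberately not covered)."""
    if isinstance(x, int):
        return T_INT
    if isinstance(x, float):
        return T_ATOM      # assumption: widen to "atom", never the bare 0b0010
    if isinstance(x, list):
        return T_SEQ
    raise TypeError("element kind not covered by this sketch")

def make_seq_type(elems):
    """opMkSq-style: OR together the bits of every element (#80 when empty)."""
    bits = 0
    for x in elems:
        bits |= elem_bits(x)
    return 0x80 | bits

def store_elem(type_byte, new_elem):
    """opRepe-style: a store can only widen the type byte, so it is a single OR."""
    return type_byte | elem_bits(new_elem)

t = make_seq_type([1, 2, 3])   # sequence of integer
t = store_elem(t, 2.5)         # widens to sequence of atom
t = store_elem(t, [4, 5])      # widens to sequence of sequence|atom
print(hex(t))
```

Note that the byte is conservative under this scheme: overwriting the only float with an integer leaves the type at "sequence of atom", since nothing ever clears a bit.
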
A new array type is required to use the new #82/#86, existing opcodes should just crash.
There should also be a new float64|128 type for dealing with array elements, probably.
If parameters (etc) of that type are allowed, they would also have to be reference types.
Obviously if you box and unbox into an atom type there will just be no performance gains.
I assume there is simply no point trying to implement an array of 80-bit floats, anyway.
The ability to take or replace slices of arrays is probably just simply not worthwhile.
Allowing explicit "sequence of" in the source code is a nicety, and not a necessity, and
it would probably be best to introduce a new implicit udt-style routine to enforce that.
The handling of opTchk/T_seq within pilx86.e might not actually even need any tweaks.
[1] standard 4/8 bytes per element, can be nested refs except o/c for "of integer".
*,** : good optimisations available with these
*** : lesser optimisations (on the localtypes of extracted elements)
Note that any nested sequences have the same potential optimisations available to them.
There is probably very little gain for "sequence of arrays" over "sequence of sequence".
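
To make the table a little more concrete, here is a hedged sketch of how a backend might read two facts off the proposed type byte: the element stride, and whether elements could be references needing refcount work. It is Python for illustration only. `bytes_per_element`, `may_hold_refs` and `WORD` are hypothetical names, and treating #80 as possibly holding refs is an assumption made because the table allows it to mean "legacy behaviour".

```python
WORD = 4  # assumption: 32-bit build; would be 8 on 64-bit ("standard 4/8 bytes per element")

def bytes_per_element(type_byte, word=WORD):
    """Element stride implied by the proposed type byte."""
    if type_byte == 0x84:
        return 1       # string: 1 byte per element
    if type_byte == 0x82:
        return 8       # packed array of 64-bit floats
    if type_byte == 0x86:
        return 16      # packed array of 128-bit floats
    return word        # everything else stores word-sized integers or refs

def may_hold_refs(type_byte):
    """True if elements might be references; #80 is kept conservative (legacy)."""
    return type_byte not in (0x81, 0x82, 0x84, 0x86)

for tb in (0x81, 0x82, 0x84, 0x8F):
    print(hex(tb), bytes_per_element(tb), may_hold_refs(tb))
```

The second function is where the refcount savings mentioned for opSubse would come from: extracting an element from a #81 sequence never needs an incref.
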
pointer types
I am also toying with the idea of pointer types. There would be insufficient gain on a (single integer)* or (single atom)* so we can forget about those and assume that a pointer implicitly refers to an array. Further, to achieve a significant performance gain they should be homogenous. An array cannot be resized, except perhaps shorten-in-situ.

    procedure p(integer* a)
    end procedure
    p(&myarray)

The procedure p receives a pointer to an array of integer. The invocation makes it clear that myarray might and probably will be modified. A fatal error occurs should myarray not have a reference count of 1, and that is not incremented by the call, or of course decremented when p returns. The actual argument a is probably an atom, boxed into a float when it exceeds #3FFFFFFF, but the compiler and hopefully also the debugger both know what it really is. A string* would be an array of strings, the recipient of a single string would be a byte* (which may be raw binary). Other types would be float64* and float128*, but I just don’t see atom*, sequence*, or object* as being sufficiently worthwhile. Anonymous arguments such as &g() might be an issue: forcing such to be stored in a local var first would a) offer sufficient type info to be inferred/specified and b) be somewhere or something which we can decref and free properly. Or maybe integer^ to mean sole owner?
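
The calling rule described above (sole owner required, reference count left untouched) can be modelled roughly as follows. This is a Python sketch for illustration only: the `Array` class and `call_with_pointer` are hypothetical stand-ins, and a real implementation would perform the check in the generated code rather than in a helper.

```python
class Array:
    """Stand-in for a homogeneous array carrying an explicit reference count."""
    def __init__(self, data):
        self.data = list(data)
        self.refcount = 1

def call_with_pointer(proc, arr):
    """Model of p(&myarray): insist on sole ownership, then call without refcounting."""
    if arr.refcount != 1:
        raise RuntimeError("fatal: array passed by pointer must have a reference count of 1")
    proc(arr.data)   # callee works in place; no incref before, no decref after

def p(a):
    # modifies the caller's array directly
    for i in range(len(a)):
        a[i] *= 2

myarray = Array([1, 2, 3])
call_with_pointer(p, myarray)
print(myarray.data, myarray.refcount)
```

Because the count must already be 1, the callee can write in place with no risk of the change being visible through some other variable, which is what removes the need for copy-on-write.
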