Description:
|
Return the length of the sequence or string s.
The compiler issues a compile-time error if s is known to be an atom or integer, and plants run-time tests if it merely might be one
(see technicalia).
If s is a string, the result is the number of bytes; see technicalia for utf8/16 handling.
If s is a sequence, the result is the number of top-level elements.
(In the latter case, nested sub-sequences, however complicated, and string elements each count as just one towards the length.)
Alternatively (when the constant ORAC is 1 in pmain.e, which it is in all released versions), ~s is shorthand for length(s), as shown in the final example below.
|
Examples:
|
?length("") -- 0: the empty string
?length({}) -- 0: the empty sequence
?length({{}}) -- 1: s[1] is {}.
?length("four") -- 4: 4 bytes/characters
?length({1.2,"three",{4,{5,"six"}},7}) -- 4: s[1] is 1.2
-- s[2] is "three",
-- s[3] is {4,{5,"six"}},
-- s[4] is 7.
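?~"four" -- 4: (with ORAC=1, the default) ~ is the shorthand for length()
--?length(3) -- this would be a compile-time error, see technicalia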
|
Technicalia:
|
The length of each sequence (and string) is stored internally, allowing fast lookup.
The compiler only invokes the generic/full-blown version, :%opLen in builtins\VM\pLen.e, as a very last resort.
The length functionality is inlined by the compiler if any of the following are true:
- The argument is known to have a fixed length, in which case a literal constant is used without referencing the argument at all.
- The argument is known to have been assigned a sequence, in which case run-time checks for that are not emitted.
- The result is known to be an integer (ie it does not require decref/dealloc).
Since length() is one of the most frequently called functions, inlining can have a significant impact on performance; especially
in tight inner loops it can be worthwhile checking the list.asm to ensure the compiler is not emitting calls to :%opLen.
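As a purely illustrative sketch, this is the sort of tight inner loop where that inlining matters, and where a quick glance at the
list.asm confirms whether it actually occurred:

    sequence s = {1,2,3,4,5}
    integer total = 0
    for i=1 to length(s) do     -- the length() here should be inlined
        total += s[i]
    end for
    ?total                      -- 15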
The last element of a sequence can be referenced as s[length(s)], s[$], or s[-1].
The latter two are in fact absolutely identical, and are to be preferred in every sense over an explicit length() call:
the compiler replaces eg s[length(s)] with s[$] as another optimisation, however more complex equivalents such as
s[i][j][length(s[i][j])] that the compiler might have missed may benefit (performance-wise, as well as in typing/RSI/reading terms)
from the programmer explicitly writing s[i][j][$] instead.
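A minimal sketch (the variables s and t exist only for this illustration):

    sequence s = {1,2,3}
    ?s[length(s)]   -- 3 (the compiler rewrites this as s[$])
    ?s[$]           -- 3
    ?s[-1]          -- 3 (identical to s[$])
    sequence t = {{},{},{10,20,30}}
    ?t[3][$]        -- 30, rather than the clunkier t[3][length(t[3])]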
Phix strings are null-terminated, and technically s[length(s)+1] is a null character, however it is an error to try to reference it.
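For instance (a deliberately broken sketch, with the offending line commented out):

    string s = "abc"
    ?s[$]              -- 99, ie 'c'
    --?s[length(s)+1]  -- uncommenting this triggers a fatal index error,
                       -- despite the terminating null being physically present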
Obviously, for traditional ascii/ansi/latin-1 strings the number of bytes is the same as the number of characters, however with unicode
encodings such as UTF-8 some characters are multibyte, so the two counts are clearly not the same. Thankfully, UTF-32 uses a much more
consistent 4-bytes-per-character(/unicode point), for which Phix almost always uses a dword_sequence (UTF-32 is
almost never written to file, so I am not going to discuss theoretical 4-bytes-per-character LE/BE binary strings any further).
Almost all manipulation of UTF-8 strings can be performed without any conversion to UTF-32 whatsoever; in fact UTF-8 was deliberately
designed for precisely that. However, should you really need to consider each individual unicode point separately, it is a relatively
simple matter of, say, sequence utf32 = utf8_to_utf32(utf8), then invoking length()/subscripting on utf32,
before a final utf8 = utf32_to_utf8(utf32), as sketched below. Something fairly similar applies to
UTF-16 handling, except that the initial utf16 input would either already be a dword sequence or (more rarely) some kind of byte-pair
string, of whichever endianness.
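A minimal sketch of that round trip (the specific literal and byte counts assume the source file itself is saved as UTF-8):

    string utf8 = "€100"            -- the euro sign is 3 bytes in UTF-8
    sequence utf32 = utf8_to_utf32(utf8)
    ?length(utf8)                   -- 6: bytes
    ?length(utf32)                  -- 4: unicode points
    utf32 = utf32[2..$]             -- per-point manipulation, eg drop the currency symbol
    utf8 = utf32_to_utf8(utf32)
    ?utf8                           -- "100"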
In Euphoria, the length() function has been redefined to yield 1 when passed an atom. While that is helpful for certain algorithms,
sometimes making tests for atom() or sequence() unnecessary, I firmly believe that fatal errors from length() play a far more vital
role in debugging, by highlighting errors in logic, and hence Phix will not be following suit.
Besides, it is not exactly difficult to write a tiny shim (a teeny tiny tiniest one!) to achieve the same
effect/benefit as Eu, and use it sparingly.
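A minimal sketch of such a shim (the name eu_length is purely illustrative):

    function eu_length(object s)
        -- mimic the Euphoria behaviour: atoms count as length 1
        if atom(s) then return 1 end if
        return length(s)
    end function

    ?eu_length(5)       -- 1
    ?eu_length("four")  -- 4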
|