Expand/Shrink

Strings

A character is really just an integer. Individual characters may be specified using single quotes, e.g:
     ’B’
which is completely indistinguishable from the (4 or 8 byte) integer 66, the equivalent ASCII code.
Note that single quotes define a single 8-bit byte in the range 0..255 in Phix, and not a string, nor a multibyte UTF-8 character.

Just as integers can be considered an optimised form of atom, strings can be considered an optimised form of sequence of character, with each ASCII character occupying one byte, and multibyte UTF-8 characters occupying said multiple bytes. Strings may be specified using double quotes, e.g:
     "ABCDEFG"
Character strings may be manipulated and operated upon just like any other sequences. Yes, you read that right: strings are mutable, with constants pooled by the compiler and automatically protected from accidental mutilation by the copy-on-write semantics of Phix, and have the baked-in performance characteristics of things like StringBuilder - but all that is handled for you automatically and there is no need to worry about it at all. The following dword-sequence is in fact almost entirely equivalent to the above string, except for explicit type tests and the space used, but they compare as equal:
     {65, 66, 67, 68, 69, 70, 71}
which contains the corresponding ASCII codes (see technical note below).
Note: pass strings containing multibyte UTF-8 characters through utf8_to_utf32() to improve/simplify indexing of individual graphemes: indexes on strings work at the byte-level, which is [only] the same for ASCII-only strings.

It follows that "" is equivalent to {}. Both represent a sequence of length zero, also known as the empty sequence. As a matter of programming style, it is natural to use "" to suggest a length zero string, and {} to suggest some other kind of sequence.

While you can store any string value in a variable declared as a sequence, the reverse is not true.

An individual character is an integer. It must be entered using single quotes. There is a difference between an individual character (which is an integer), and a character string of length 1 (which is a sequence), e.g.
    ’B’   -- equivalent to the integer 66 - the ASCII code for B
    "B"   -- equivalent to the sequence {66}
Again, 'B' is just a notation that is equivalent to typing 66. There isn’t really a "character" in phix, except as part of a string, otherwise individual characters are stored as 4 (or 8) byte integers. Note that string characters are unsigned (one of the relatively few times "unsigned" is a thing in Phix, the other main one being peek[1248Nn]u()), in other words all s[i] of any string (including so-called binary strings, and UTF-8 strings) are strictly 0..255.

Tip: ""&ch is just a simple and handy notation for creating a length-1 string from a single character. Obviously it does exactly what it says, namely appends a single character to a length-0 string, although admittedly it does look a little Perl-esque. If you prefer, the expression ch&"" would also be fine, or extracting s[i..i] is better than extracting s[i] and then having to convert that to a string.

Keep in mind that an atom is not equivalent to a one-element sequence containing the same value, although there are a few built-in routines that choose to treat them similarly.

Special characters may be entered (between quotes) using a back-slash:
       Code     Value   Meaning
        \n       #0A     newline (decimal 10)
        \r       #0D     carriage return (decimal 13)
        \b       #08     backspace (phix-only)
        \t       #09     tab
        \\       #5C     backslash
        \"       #22	 double quote (\ is optional in ', mandatory in ")
        \'       #27	 single quote (\ is optional in ", mandatory in ')
        \0       #00     null
        \e       #1B     escape (ditto \E)
        \#HH     #HH     any hexadecimal byte (phix-only)
        \xHH     #HH     any hexadecimal byte
        \uH4      -      any 16-bit unicode point, eg "\u1234", max #FFFF
        \UH8      -      any 32-bit unicode point, eg "\U00105678", max #10FFFF
 
For example, "Hello, World!\n", or '\\'. By default Edita displays character strings in green. Just as with fractional results converting to floating-point automatically, setting a string element to a non-character value automatically expands it to a dword-sequence, and if that is not desired you are immediately notified in a clear and no-nonsense, human-readable manner.

Unicode points may be specified using "\uHHHH" (must be exactly 4 hex digits) and "\UHHHHHHHH" (must be exactly 8 hex digits). The compiler automatically converts such to UTF-8, though personally I suspect that since the compiler supports UTF-8 source files, it is much easier to do this unicode stuff WYSIWYG-style. Note that you cannot use either \u or \U to specify UTF-16 surrogates, and of course all values outside 0..#10FFFF are automatically replaced with the standard invalid character ("\#EF\#BF\#BD" aka #FFFD). Also note that \u and \U are double-quote-only - they cannot be used between single quotes. Plus of course that is a general point about unicode characters: although your episilon-cidilla-umlaut may look like a single character, it is in fact a 3-byte (UTF-8) string, and therefore cannot be single quoted, even in WYSIWYG-style - the phix compiler supports plain ansi or utf-8 source files, but not utf-32 or utf-16. See also the Unicode Library Routines.

Note that JavaScript does not support \0, \e, \u, or \U, and therefore neither does pwa/p2js.
Technical note: Internally that string data example ("ABCDEFG") occupies 24 bytes whereas the sequence of (4-byte) integers occupies 48 bytes. On 64-bit, it is 39 vs 96 bytes (see builtins\VM\pHeap.e if you must). On very long strings, the space savings approach 75% (32-bit) or 87.5% (64-bit) and that can lead, in extreme cases such as pointless benchmarks, to a 4 or 8-fold performance improvement. Despite such differences, they compare as equal and display the same, and in fact pretty much the only way to tell them apart is with the string() builtin.

Note that strings are byte-subscripted; while theoretically you may store UTF16 data in a string, it is usually easier to store and manipulate such in dword-sequences. The handling is however completely transparent; you quite simply should not care, at least not when dealing with plain ASCII characters.

This (the "quite simply should not care" part) was hammered home to me when I added Unicode file support to Edita, which I knew made heavy use of 8-bit strings. Obviously I had to change readFile() to detect the unicode BOM (Byte Order Mark) and then read a word instead of a byte for each character, and similar for saveFile(). Of course program source files should (must) be ansi or UTF8, not UTF16, it was .reg files that prompted me to support the latter (UTF16LE/BE).

All the other code, be that display/edit/find/replace/cut/copy/paste/whatever, that had only ever been asked to manipulate 8-bit strings, some of which was 7 or 8 years old by then, all worked seamlessly when I started throwing dword sequences at it, without a single change.

Just to avoid being accused of misleading anyone, full unicode support required several other changes, what "worked seamlessly" was ansi in dword sequences. In particular, further changes were needed to use the widestring rather than the ansistring versions of certain windows API routines. Even so, the total number of changes was substantially less than first feared.

UPDATE
Note, however, that over time the benefits of using the builtin string type are slowly creeping into the builtins themselves; for instance chdir just needed a hack because it accepts sequence rather than string, so that may change soon, also panykey/pgetpath/scanf/substitute/timedate and timestamp were all written from the get-go using the string type, as was the entire pGUI cross-platform GUI library. On the plus side, it is usually pretty obvious what needs to be done when that sort of typecheck triggers, and obviously string-only code can be a fair bit faster and smaller than string-or-dword-sequence code.
It should also be noted that some conversions may be plain wrong, especially UTF16 held as a dword-sequence to a byte-truncated string. These will be fixed as and when reported, but failing that I would advise immediately converting any such to UTF8.
Compatibility Note In Phix, \b is used for backspace, whereas (very oddly imo) Euphoria uses it to declare a mid-string binary value, eg "\b01010101" is the same as {#55} or "U", erm, hmmm. I have never seen this used, but if it was byte-sized I would immediately convert it to \xHH form (by putting in a leading 0 to make it 0b01010101 and then using Edita\Case\Show as Hex (Ctrl H)). Obviously, on the compatibility front, you should use \x08 instead of \b, and likewise \xHH not \#HH.
Strings can also be entered by using triple quotes or backticks intead of double quotes to include linebreaks and avoid any backslash interpretation. (On my keyboard the backtick is between the Esc and Tab keys.) If the literal begins with a newline, it is discarded and any immediately following leading underscores specify a (maximum) trimming that should be applied to all subsequent lines. Examples:
ts = """
this
string\thing"""

ts = """
_____this
     string\thing"""

ts = `this
string\thing`

ts = "this\nstring\\thing"
which are all equivalent.

Tip: If you accidentally start with a quadruple quote (and similarly a quintuple, though a sextuplet is the same as "" anyway), instead of a triplequote, then the underscore thing won’t work - obviously the compiler treats it as a triplequote followed by a normal (single) doublequote that does not need escaping, all perfectly valid, so it would be quite wrong for any syntax-colouring or suchlike to suggest any error.
I only mention this because it is very easy to assume there is a bug in the handling(/compiler/editor), when in fact the problem is in your code.
A quadruple closing quote is, in contrast, quite clearly highlighted (in Edix/Edita), and also triggers a compilation error.

Phix also supports hexadecimal strings, eg:
x"1 2 34 5678_AbC" -- same as {0x01, 0x02, 0x34, 0x56, 0x78, 0xAB, 0x0C}
                   -- note it displays as {1,2,52’4’,86’V’,120’x’,171,12}
                   -- whereas x"414243" displays as "ABC" (as all chars)
A hexadecimal string begins with the pair x“ (for 1 byte codes, u“/U“ for 2/4 byte codes) and ends with a double-quote (”) character.
They can only contain hexadecimal digits (0-9 A-F a-f), and space, tab, or underscore. Anything else is invalid.
They may not (unlike Euphoria) span multiple lines, or otherwise contain cr or lf characters.
Whitespace delimits individual values. Underscores are treated as whitespace, unlike Euphoria, which treats them as if they were never there - quite wrongly, imo, since that makes both x"12" and x"1_2" yield {18}, whereas on Phix the latter yields (the string equivalent of) {1,2}. It makes no difference if underscores are ignored or treated as delimeters when they are at even positions (relative to any prior delimeters), however Phix treats x"123_456" as {#12,#3,#45,#6} (or perhaps more accurately as "\x12\x03\x45\x06", length 4) whereas Euphoria treats it as {#12,#34,#56} (or "\x12\x34\x56", length 3).
Each pair of contiguous hex digits represents a single sequence element with a value from 0 to 255.
The value is in fact always an 8-bit string, though as above it may be displayed like a dword_sequence if it contains any obviously non-character bytes, specifically <#20 or >#7F, with a few scattered exclusions (such as \t\n etc).

Note that u“ and U“ produce the utf8 equivalents of the logical utf16 or utf32 sequence. While, for example, in string s = "##", t = x"2323", u = u"00230023", v = U"0000002300000023" all four variables are in fact assigned exactly the same value, the same is obviously not true for x"00230023" and u"00230023", or x"0000002300000023" and U"0000002300000023" (lengths 4,2,8,2 respectively). Equally you would(/should) not expect equivalence when dealing with any unicode points above #7F. It is in fact the x"nnnn" variants you would want for explicit utf16/32 strings, that do not require the invocation of utf8_to_utf16() or utf8_to_utf32(), or need allocate() and poke2() or poke4(), or perhaps instead allocate_wstring(), when speaking to a C API that expects 16-bit WideString or 32-bit UTF32 arguments, whereas \u and \U would (but obviously don’t need the inverse for other uses).

Note, however, that Phix does not currently support Euphoria’s binary strings, such as b"1 10 11_0100 01010110_01111000" == {0x01, 0x02, 0x34, 0x5678} - so far I have not found any practical use, or any actual code that actually uses that feature. Also note that since JavaScript does not support such hexadecimal strings, in that format, pwa/p2js automatically maps them, for example x"123_456" ==> "\x12\x03\x45\x06". As stated/implied at the start, single quotes can only hold single bytes and not multibyte UTF8 characters, for which double uotes must be used.

On a practical note, as long as you have at least 2GB of physical memory, you should experience no problems whatsoever constructing a string with 400 million characters, and you could more than treble that by allocating things up front. However: deliberately hogging the biggest block of memory the system will allow is generally considered poor programming practice, and may lead to disk thrashing. On 64-bit systems such limits can theoretically be multiplied by several billion. As previously mentioned, pHeap.e contains the full and very scary low-down.