[lucy-dev] Clownfish String API

classic Classic list List threaded Threaded
9 messages Options
Reply | Threaded
Open this post in threaded view
|

[lucy-dev] Clownfish String API

Nick Wellnhofer
Lucifers,

here are my condensed thoughts on cleaning up the Clownfish String API:

String

- Rename `new_stack_string` to `init_stack_string`.
- Make all `new_*`, `init_wrap_*` functions public except:
   - `init_stack_string` (only called from macros)
- Remove `Cat_Utf8` and `Cat_Trusted_Utf8`?
   Callers should probably use CharBufs instead.
   Only keep `Cat`.
- Remove underused `Swap_Chars`.
   Maybe add a `Replace` method instead.
- Make stack strings public?
- Add macro to create stack string from null-terminated UTF-8
   without passing the string size:
       String *str = SSTR_WRAP_C("A string.");
- Make `Code_Point_*` return special value instead of 0
   if outside the string.
- Make `Find` return a size_t.
   Requires special value for "not found".
- Make all methods public except:
   - `Is_Copy_On_IncRef`
   - `Hash_Sum`

StringIterator

- Rename StrIter_substring to Str_new_from_iter?
- Make `Starts_With`, `Ends_With` skip over strings?
   I think this better matches current usage patterns.
- Find better names for `Skip_{Next|Prev}_Whitespace`?
- Make all methods public.

Nick
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Clownfish String API

Marvin Humphrey-2
Hi, Nick,

Thanks for thinking this through!

On Wed, Oct 7, 2015 at 4:10 AM, Nick Wellnhofer <[hidden email]> wrote:

> Lucifers,
>
> here are my condensed thoughts on cleaning up the Clownfish String API:
>
> String
>
> - Rename `new_stack_string` to `init_stack_string`.
> - Make all `new_*`, `init_wrap_*` functions public except:
>   - `init_stack_string` (only called from macros)
> - Remove `Cat_Utf8` and `Cat_Trusted_Utf8`?
>   Callers should probably use CharBufs instead.
>   Only keep `Cat`.

+1

> - Remove underused `Swap_Chars`.
>   Maybe add a `Replace` method instead.

+1 to remove Swap_Chars.  If you get serious about adding Replace, let's
contemplate the interface.

> - Make stack strings public?

I'd say wait for now.  They're truly an expert API -- it's too easy to blow
the C stack by using them in a loop, etc.

> - Add macro to create stack string from null-terminated UTF-8
>   without passing the string size:
>       String *str = SSTR_WRAP_C("A string.");
> - Make `Code_Point_*` return special value instead of 0
>   if outside the string.

+1

> - Make `Find` return a size_t.
>   Requires special value for "not found".

Hmm, that's a toughie.  If the sentinel is SIZE_MAX, that might not fit in all
host numeric types.

> - Make all methods public except:
>   - `Is_Copy_On_IncRef`
>   - `Hash_Sum`

In addition, I'd suggest:

*   Remove `Str_less_than`, replacing it in PolyLexicon with
    `Str_Compare_To(a, b) < 0`.
*   Remove `Str_compare` from the public API.
*   Remove `new_from_char`.
*   Remove `BaseX_To_I64`, optionally adding its functionality to
    StringHelper.
*   Rename the argument to `Ends_With` from `postfix` to `suffix`.  Same with
    `Ends_With_Utf8`.
*   The return values for `Trim`, `Trim_Top`, and `Trim_Tail` should be
    `incremented`.

> StringIterator
>
> - Rename StrIter_substring to Str_new_from_iter?

Hmm.  I see where you're going with that, but new_from_iter doesn't
communicate that there must be a single backing String for the two
StringIterator arguments.  Maybe `StrIter_crop`?

> - Make `Starts_With`, `Ends_With` skip over strings?
>   I think this better matches current usage patterns.

Somehow I don't understand what you mean.

Also, as with String, the argument to Ends_With should be `suffix` rather than
`postfix`.

> - Find better names for `Skip_{Next|Prev}_Whitespace`?

`Advance_Whitespace` and `Recede_Whitespace`?

> - Make all methods public.

+1

...

While preparing this reply, I also created the patch below with some doc
tweaks.

Marvin Humphrey

diff --git a/runtime/core/Clownfish/String.cfh b/runtime/core/Clownfish/String.cfh
index e422cbf..b49a0df 100644
--- a/runtime/core/Clownfish/String.cfh
+++ b/runtime/core/Clownfish/String.cfh
@@ -34,52 +34,52 @@ public final class Clownfish::String nickname Str
     size_t      size;
     String     *origin;
 
-    /** Return a new String which holds a copy of the passed-in string.
-     * Check for UTF-8 validity.
+    /** Return a String which holds a copy of the supplied UTF-8 character
+     * data after checking for validity.
      */
     inert incremented String*
     new_from_utf8(const char *utf8, size_t size);
 
-    /** Return a new String which holds a copy of the passed-in string.  No
-     * validity checking is performed.
+    /** Return a String which holds a copy of the supplied UTF-8 character
+     * data, skipping validity checks.
      */
     inert incremented String*
     new_from_trusted_utf8(const char *utf8, size_t size);
 
-    /** Initialize the String using the passed-in string.  Do not check
-     * validity of supplied UTF-8.
+    /** Initialize a String which holds a copy of the supplied UTF-8 character
+     * data, skipping validity checks.
      */
     inert String*
     init_from_trusted_utf8(String *self, const char *utf8, size_t size);
 
-    /** Return a pointer to a new String which assumes ownership of the
-     * passed-in string.  Check validity of supplied UTF-8.
+    /** Return a String which assumes ownership of the supplied buffer
+     * containing UTF-8 character data after checking for validity.
      */
     inert incremented String*
     new_steal_utf8(char *utf8, size_t size);
 
-    /** Return a pointer to a new String which assumes ownership of the
-     * passed-in string.  Do not check validity of supplied UTF-8.
+    /** Return a String which assumes ownership of the supplied buffer
+     * containing UTF-8 character data, skipping validity checks.
      */
     inert incremented String*
     new_steal_trusted_utf8(char *utf8, size_t size);
 
-    /** Initialize the String using the passed-in string.  Do not check
-     * validity of supplied UTF-8.
+    /** Initialize a String which assumes ownership of the supplied buffer
+     * containing UTF-8 character data, skipping validity checks.
      */
     public inert String*
     init_steal_trusted_utf8(String *self, char *utf8, size_t size);
 
-    /** Return a pointer to a new String which wraps an external buffer
-     * containing UTF-8.  The buffer must stay unchanged for the lifetime
-     * of the String.  Check validity of supplied UTF-8.
+    /** Return a String which wraps an external buffer containing UTF-8
+     * character data after checking for validity.  The buffer must stay
+     * unchanged for the lifetime of the String.
      */
     inert incremented String*
     new_wrap_utf8(const char *utf8, size_t size);
 
-    /** Return a pointer to a new String which wraps an external buffer
-     * containing UTF-8.  The buffer must stay unchanged for the lifetime
-     * of the String.  Do not check validity of supplied UTF-8.
+    /** Return a String which wraps an external buffer containing UTF-8
+     * character data, skipping validity checks.  The buffer must stay
+     * unchanged for the lifetime of the String.
      */
     inert incremented String*
     new_wrap_trusted_utf8(const char *utf8, size_t size);
@@ -87,8 +87,8 @@ public final class Clownfish::String nickname Str
     inert incremented String*
     new_stack_string(void *allocation, const char *utf8, size_t size);
 
-    /** Initialize the String which wraps an external buffer containing
-     * UTF-8.  Do not check validity of supplied UTF-8.
+    /** Initialize a String which wraps an external buffer containing UTF-8
+     * character data after checking for validity.
      */
     public inert String*
     init_wrap_trusted_utf8(String *self, const char *utf8, size_t size);
@@ -98,8 +98,8 @@ public final class Clownfish::String nickname Str
     inert incremented String*
     new_from_char(int32_t code_point);
 
-    /** Return a pointer to a new String which contains formatted data
-     * expanded according to CB_VCatF.
+    /** Return a String with content expanded from a pattern and arguments
+     * conforming to the spec defined by CharBuf.
      *
      * Note: a user-supplied `pattern` string is a security hole
      * and must not be allowed.
@@ -128,13 +128,14 @@ public final class Clownfish::String nickname Str
     incremented String*
     Cat(String *self, String *other);
 
-    /** Return the concatenation of the String and the passed-in raw UTF-8.
+    /** Return the concatenation of the String and the supplied UTF-8
+     * character data after checking for validity.
      */
     incremented String*
     Cat_Utf8(String *self, const char *ptr, size_t size);
 
-    /** Return the concatenation of the String and the passed-in raw UTF-8.
-     * Don't check for UTF-8 validity.
+    /** Return the concatenation of the String and the supplied UTF-8
+     * character data, skipping validity checks.
      */
     incremented String*
     Cat_Trusted_Utf8(String *self, const char *ptr, size_t size);
@@ -157,22 +158,22 @@ public final class Clownfish::String nickname Str
     public double
     To_F64(String *self);
 
-    /** Test whether the String starts with the content of another.
+    /** Test whether the String starts with `prefix`.
      */
     bool
     Starts_With(String *self, String *prefix);
 
-    /** Test whether the String starts with the passed-in string.
+    /** Test whether the String starts with `prefix`.
      */
     bool
     Starts_With_Utf8(String *self, const char *prefix, size_t size);
 
-    /** Test whether the String ends with the content of another.
+    /** Test whether the String ends with `postfix`.
      */
     bool
     Ends_With(String *self, String *postfix);
 
-    /** Test whether the String ends with the passed-in string.
+    /** Test whether the String ends with `postfix`.
      */
     bool
     Ends_With_Utf8(String *self, const char *postfix, size_t size);
@@ -186,17 +187,17 @@ public final class Clownfish::String nickname Str
     int64_t
     Find_Utf8(String *self, const char *ptr, size_t size);
 
-    /** Test whether the String matches the passed-in string.
+    /** Test whether the String matches the supplied UTF-8 character data.
      */
     bool
     Equals_Utf8(String *self, const char *ptr, size_t size);
 
-    /** Return the number of Unicode code points in the object's string.
+    /** Return the number of Unicode code points the String contains.
      */
     size_t
     Length(String *self);
 
-    /** Get the String's `size` attribute.
+    /** Return the number of bytes occupied by the String's internal content.
      */
     size_t
     Get_Size(String *self);
@@ -251,35 +252,33 @@ public final class Clownfish::String nickname Str
     String*
     Trim_Tail(String *self);
 
-    /** Return the Unicode code point at the specified number of code points
-     * in.  Return 0 if the string length is exceeded.  (XXX It would be
+    /** Return the Unicode code point located `tick` code points in from the
+     * top.  Return 0 if out of bounds.  (XXX It would be
      * better to throw an exception, but that's not practical with UTF-8 and
      * no cached length.)
      */
     int32_t
     Code_Point_At(String *self, size_t tick);
 
-    /** Return the Unicode code point at the specified number of code points
-     * counted backwards from the end of the string.  Return 0 if outside the
-     * string.
+    /** Return the Unicode code point located `tick` code points counting
+     * backwards from the end.  Return 0 if out of bounds.
      */
     int32_t
     Code_Point_From(String *self, size_t tick);
 
-    /** Return a newly allocated String containing a copy of the indicated
-     * substring.
+    /** Return a new String containing a copy of the specified substring.
      * @param offset Offset from the top, in code points.
      * @param len The desired length of the substring, in code points.
      */
     incremented String*
     SubString(String *self, size_t offset, size_t len);
 
-    /** Return an iterator to the start of the string.
+    /** Return an iterator initialized to the start of the string.
      */
     incremented StringIterator*
     Top(String *self);
 
-    /** Return an iterator to the end of the string.
+    /** Return an iterator initialized to the end of the string.
      */
     incremented StringIterator*
     Tail(String *self);
@@ -363,14 +362,12 @@ public final class Clownfish::StringIterator nickname StrIter
     public size_t
     Skip_Prev_Whitespace(StringIterator *self);
 
-    /** Test whether the content after the iterator starts with
-     * `prefix`.
+    /** Test whether the content after the iterator starts with `prefix`.
      */
     bool
     Starts_With(StringIterator *self, String *prefix);
 
-    /** Test whether the content after the iterator starts with the passed-in
-     * string.
+    /** Test whether the content after the iterator starts with `prefix`.
      */
     bool
     Starts_With_Utf8(StringIterator *self, const char *prefix, size_t size);
@@ -381,8 +378,7 @@ public final class Clownfish::StringIterator nickname StrIter
     bool
     Ends_With(StringIterator *self, String *postfix);
 
-    /** Test whether the content before the iterator ends with the passed-in
-     * string.
+    /** Test whether the content before the iterator ends with `postfix`.
      */
     bool
     Ends_With_Utf8(StringIterator *self, const char *postfix, size_t size);


Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Clownfish String API

Nick Wellnhofer
On 15/10/2015 21:21, Marvin Humphrey wrote:
> On Wed, Oct 7, 2015 at 4:10 AM, Nick Wellnhofer <[hidden email]> wrote:
>> - Make stack strings public?
>
> I'd say wait for now.  They're truly an expert API -- it's too easy to blow
> the C stack by using them in a loop, etc.

OK.

>> - Make `Find` return a size_t.
>>    Requires special value for "not found".
>
> Hmm, that's a toughie.  If the sentinel is SIZE_MAX, that might not fit in all
> host numeric types.

Good point. The problem is not so much the byte size of the return type but
the fact that a host language might not support unsigned integers. Maybe we
should limit string sizes to SSIZE_MAX and make `Find` return an ssize_t. But
this requires to emulate the ssize_t type on non-POSIX platforms.

> In addition, I'd suggest:
>
> *   Remove `Str_less_than`, replacing it in PolyLexicon with
>      `Str_Compare_To(a, b) < 0`.
> *   Remove `Str_compare` from the public API.

+1

> *   Remove `new_from_char`.

-0.5

> *   Remove `BaseX_To_I64`, optionally adding its functionality to
>      StringHelper.

+0

> *   Rename the argument to `Ends_With` from `postfix` to `suffix`.  Same with
>      `Ends_With_Utf8`.
> *   The return values for `Trim`, `Trim_Top`, and `Trim_Tail` should be
>      `incremented`.

+1

>> StringIterator
>>
>> - Rename StrIter_substring to Str_new_from_iter?
>
> Hmm.  I see where you're going with that, but new_from_iter doesn't
> communicate that there must be a single backing String for the two
> StringIterator arguments.  Maybe `StrIter_crop`?

I'm also OK with either StrIter_substring ot StrIter_crop. I only thought it
might be clearer to make this function a constructor of the String class.

>> - Make `Starts_With`, `Ends_With` skip over strings?
>>    I think this better matches current usage patterns.
>
> Somehow I don't understand what you mean.

I mean that `Starts_With` should be named something like `Advance_If_Match`
and make the iterator move past the prefix if it matches. Currently, most code
that calls StrIter_Start_With looks like

     if (StrIter_Starts_With(iter, prefix)) {
         StrIter_Advance(iter, prefix_len);
         ...
     }

This could be replaced with a simpler and more efficient

     if (StrIter_Advance_If_Match(iter, prefix)) {
         ...
     }

>> - Find better names for `Skip_{Next|Prev}_Whitespace`?
>
> `Advance_Whitespace` and `Recede_Whitespace`?

As a non-native speaker, I'm not really sure.

> While preparing this reply, I also created the patch below with some doc
> tweaks.

Looks good. I will start off with your patch.

Nick


Reply | Threaded
Open this post in threaded view
|

[lucy-dev] Str_Find return type

Nick Wellnhofer
On 18/10/2015 14:01, Nick Wellnhofer wrote:

>>> - Make `Find` return a size_t.
>>>    Requires special value for "not found".
>>
>> Hmm, that's a toughie.  If the sentinel is SIZE_MAX, that might not fit in all
>> host numeric types.
>
> Good point. The problem is not so much the byte size of the return type but
> the fact that a host language might not support unsigned integers. Maybe we
> should limit string sizes to SSIZE_MAX and make `Find` return an ssize_t. But
> this requires to emulate the ssize_t type on non-POSIX platforms.

Here's another idea. Most of the time, users of Str_Find don't care about the
position of the substring and only want to know whether the substring is
contained or not. For this use case, a method like Str_Contains returning a
bool is a more appropriate interface.

If someone is interested in the exact position of the substring, it might make
more sense to return a string iterator pointing to the first occurrence of the
substring. So what about:

     public bool
     Contains(String *self, String *substring);

     public incremented nullable StringIterator*
     Find(String *self, String *substring);

Nick

Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Str_Find return type

Marvin Humphrey
On Thu, Oct 22, 2015 at 8:06 AM, Nick Wellnhofer <[hidden email]> wrote:
> On 18/10/2015 14:01, Nick Wellnhofer wrote:

>>>> - Make `Find` return a size_t.
>>>>   Requires special value for "not found".
>>>
>>> Hmm, that's a toughie.  If the sentinel is SIZE_MAX, that might not fit in
>>> all host numeric types.
>>
>>
>> Good point. The problem is not so much the byte size of the return type but
>> the fact that a host language might not support unsigned integers. Maybe we
>> should limit string sizes to SSIZE_MAX and make `Find` return an ssize_t.
>> But this requires to emulate the ssize_t type on non-POSIX platforms.
>
> Here's another idea. Most of the time, users of Str_Find don't care about
> the position of the substring and only want to know whether the substring is
> contained or not. For this use case, a method like Str_Contains returning a
> bool is a more appropriate interface.
>
> If someone is interested in the exact position of the substring, it might
> make more sense to return a string iterator pointing to the first occurrence
> of the substring. So what about:
>
>     public bool
>     Contains(String *self, String *substring);
>
>     public incremented nullable StringIterator*
>     Find(String *self, String *substring);

+1

You're absolutely right, avoiding an index which counts code points is
consistent with our iterator-centric model for string processing.  Good
insight, and nice API proposal!

Marvin Humphrey
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Clownfish String API

Marvin Humphrey
In reply to this post by Nick Wellnhofer
On Sun, Oct 18, 2015 at 5:01 AM, Nick Wellnhofer <[hidden email]> wrote:
> On 15/10/2015 21:21, Marvin Humphrey wrote:

>> *   Remove `new_from_char`.
>
>
> -0.5
>
>> *   Remove `BaseX_To_I64`, optionally adding its functionality to
>>      StringHelper.
>
>
> +0

I don't feel that strongly about either of these, so I'm amenable to your
plan.

>>> StringIterator
>>>
>>> - Rename StrIter_substring to Str_new_from_iter?
>>
>>
>> Hmm.  I see where you're going with that, but new_from_iter doesn't
>> communicate that there must be a single backing String for the two
>> StringIterator arguments.  Maybe `StrIter_crop`?
>
> I'm also OK with either StrIter_substring ot StrIter_crop. I only thought it
> might be clearer to make this function a constructor of the String class.

Then how about `Str_crop`?  I think that would be OK because it correctly
implies that we're cropping a single source String.

>>> - Make `Starts_With`, `Ends_With` skip over strings?
>>>    I think this better matches current usage patterns.
>>
>>
>> Somehow I don't understand what you mean.
>
>
> I mean that `Starts_With` should be named something like `Advance_If_Match`
> and make the iterator move past the prefix if it matches. Currently, most
> code that calls StrIter_Start_With looks like
>
>     if (StrIter_Starts_With(iter, prefix)) {
>         StrIter_Advance(iter, prefix_len);
>         ...
>     }
>
> This could be replaced with a simpler and more efficient
>
>     if (StrIter_Advance_If_Match(iter, prefix)) {
>         ...
>     }

How about `Skip_Match`?

    if (!StrIter_Skip_Match(iter, prefix)) {
        return 0;
    }

>>> - Find better names for `Skip_{Next|Prev}_Whitespace`?
>>
>> `Advance_Whitespace` and `Recede_Whitespace`?
>
> As a non-native speaker, I'm not really sure.

I think `Skip_Whitespace` is the best option for forward skipping.  I can't
think of a good option for moving backwards past any whitespace before the
current iterator position.  It's not a common operation, though, and I can
live with a suboptimal name.  `Skip_Prev_Whitespace` might be the clearest,
since it can be contrasted with `Skip_Whitespace` -- and it's less
gramatically awkward than `Recede_Whitespace`.

Marvin Humphrey
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Clownfish String API

Nathan Kurz
On Thu, Oct 22, 2015 at 2:57 PM, Marvin Humphrey <[hidden email]> wrote:
> I think `Skip_Whitespace` is the best option for forward skipping.  I can't
> think of a good option for moving backwards past any whitespace before the
> current iterator position.  It's not a common operation, though, and I can
> live with a suboptimal name.  `Skip_Prev_Whitespace` might be the clearest,
> since it can be contrasted with `Skip_Whitespace` -- and it's less
> gramatically awkward than `Recede_Whitespace`.

Hi Guys --

Perhaps consider the modifier to be adverb rather than an adjective?
That is, rather than specifying which whitespace, specify how you are
skipping.

For example:

 Skip_Forward_Whitespace / Skip_Backward_Whitespace.

Or put it at the end:

 Skip_Whitespace_Forward / Skip_Whitespace_Backward

Or as you suggest, keep the common case short and modify only the rare one:

 Skip_Whitespace / Skip_Whitespace_[Back|Backward|Reverse]

I think all of these are slightly better than Skip_Prev_Whitespace,
which in turn is indeed much better than Recede_Whitespace.

--nate
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Clownfish String API

Marvin Humphrey
On Thu, Oct 22, 2015 at 3:13 PM, Nathan Kurz <[hidden email]> wrote:

> Perhaps consider the modifier to be adverb rather than an adjective?
> That is, rather than specifying which whitespace, specify how you are
> skipping.
>
> For example:
>
>  Skip_Forward_Whitespace / Skip_Backward_Whitespace.
>
> Or put it at the end:
>
>  Skip_Whitespace_Forward / Skip_Whitespace_Backward
>
> Or as you suggest, keep the common case short and modify only the rare one:
>
>  Skip_Whitespace / Skip_Whitespace_[Back|Backward|Reverse]

Thanks for the fresh perspective, Nate!  I'm fine with any of these;
if I were to pick one combo it would be
`Skip_Whitespace`/`Skip_Whitespace_Back`.

Marvin Humphrey
Reply | Threaded
Open this post in threaded view
|

Re: [lucy-dev] Str_Find return type

Nick Wellnhofer
In reply to this post by Marvin Humphrey
On 22/10/2015 23:21, Marvin Humphrey wrote:
> You're absolutely right, avoiding an index which counts code points is
> consistent with our iterator-centric model for string processing.  Good
> insight, and nice API proposal!

I'm also tempted to remove Str_Code_Point_At and make string iterators the
only way to access character data in a string.

Nick