From: "jhawthorn (John Hawthorn) via ruby-core"
Date: 2023-09-01T00:07:33+00:00
Subject: [ruby-core:114608] [Ruby master Bug#19784] String#delete_prefix! problem

Issue #19784 has been updated by jhawthorn (John Hawthorn).

@ywenc and I found a regression from this patch. We have some code handling a broken UTF-8 String with a combination of valid and invalid bytes (UTF-8 followed by binary, which IMO should probably be binary encoded, but it's surprising that the behaviour changed).

```
"hello\xBE".start_with?("hello") #=> false in trunk, was true on 3.2
"hello\xFE".start_with?("hello") #=> true (both 3.2 and trunk, intended behaviour)

"hello\xBE".delete_prefix("hello") #=> "\xBE" (both on 3.2 and trunk), because we skip the check when the prefix is valid
"\xFFhello\xBE".delete_prefix("\xFFhello") #=> "\xFFhello\xBE" in trunk
```

This is because we look at the character following the prefix, observe that it looks like a UTF-8 continuation byte, and so return false. This approach might work for end_with?/delete_suffix, where we don't break on an invalid character in the suffix, but it doesn't feel right for prefixes.

It sounds like the intended design is that, to the user, this should feel as if we were comparing from the start of the strings, char-by-char for valid characters and byte-by-byte for invalid ones.

We added tests and tried using the end of the previous character, rather than the "start" of the current one, to determine whether the prefix ends at a char boundary.

https://github.com/ruby/ruby/pull/8348

----------------------------------------
Bug #19784: String#delete_prefix! problem
https://bugs.ruby-lang.org/issues/19784#change-104432

* Author: inversion (Yura Babak)
* Status: Closed
* Priority: Normal
* ruby -v: ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN
----------------------------------------
Here is the snippet; the question is in the comments:

``` ruby
fp = 'with_BOM_16.txt'
body = File.read(fp).force_encoding('UTF-8')

p body # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE") # true
body.delete_prefix!("\xFF\xFE") # !!! why doesn't this work?
p body # "\xFF\xFE1\u00001\u0000"
p body.start_with?("\xFF\xFE") # true
body[0, 2] = ''
p body # "1\u00001\u0000"
p body.start_with?("\xFF\xFE") # false
```

Works the same on Linux (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x86_64-linux]) and Windows (ruby 3.2.2 (2023-03-30 revision e51014f9c0) [x64-mingw-ucrt]).

--
https://bugs.ruby-lang.org/
______________________________________________
ruby-core mailing list -- ruby-core@ml.ruby-lang.org
To unsubscribe send an email to ruby-core-leave@ml.ruby-lang.org
ruby-core info -- https://ml.ruby-lang.org/mailman3/postorius/lists/ruby-core.ml.ruby-lang.org/
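
For illustration only: the design described in the update (compare from the start, char-by-char for valid characters and byte-by-byte for invalid bytes) can be approximated in plain Ruby. This is a rough sketch, not the code in the linked pull request. The names `char_prefix?` and `delete_char_prefix` are hypothetical, and the sketch assumes that `String#chars` returns each invalid byte of a broken string as its own one-byte element.

```ruby
# Hypothetical helpers for illustration; they are not part of Ruby's String API
# and are not taken from https://github.com/ruby/ruby/pull/8348.

# True when `prefix` matches the start of `str` element by element, where an
# element is a whole valid character or a single invalid byte (assumed to be
# how String#chars splits a broken string).
def char_prefix?(str, prefix)
  prefix_chars = prefix.chars
  str.chars.first(prefix_chars.size) == prefix_chars
end

# Returns a copy of `str` without `prefix` when the prefix matches on those
# element boundaries; otherwise returns an unchanged copy.
def delete_char_prefix(str, prefix)
  return str.dup unless char_prefix?(str, prefix)

  str.byteslice(prefix.bytesize, str.bytesize - prefix.bytesize)
end

# Valid prefix followed by a stray continuation byte: the prefix still matches.
p char_prefix?("hello\xBE", "hello")          #=> true
# Broken prefix followed by valid text and a stray byte.
p char_prefix?("\xFFhello\xBE", "\xFFhello")  #=> true
# "\xE3" is only the first byte of U+3042, so it is not a prefix on a
# character boundary.
p char_prefix?("\u3042", "\xE3")              #=> false
```

Splitting both strings into arrays is much slower than a byte comparison with a boundary check, so this is only useful as a readable reference for the expected results, for example when writing tests.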