From: "whitehat101 (Jeremy Ebler)"
Date: 2013-10-17T09:31:10+09:00
Subject: [ruby-core:57906] [ruby-trunk - Bug #9028][Open] Make SSLSocket Support Encodings

Issue #9028 has been reported by whitehat101 (Jeremy Ebler).

----------------------------------------
Bug #9028: Make SSLSocket Support Encodings
https://bugs.ruby-lang.org/issues/9028

Author: whitehat101 (Jeremy Ebler)
Status: Open
Priority: Normal
Assignee: drbrain (Eric Hodel)
Category:
Target version:
ruby -v: 1.9.3, 2.0.0-p0
Backport: 1.9.3: UNKNOWN, 2.0.0: UNKNOWN

I was working on a bug in the xmpp4r project that caused REXML exceptions when receiving UTF-8 Strings.
https://github.com/xmpp4r/xmpp4r/issues/13

The issue ended up being that SSLSocket#readline didn't always return strings with the same encoding. It gave plain ASCII strings an encoding of UTF-8, and UTF-8 strings an encoding of ASCII-8BIT.

We were passing the SSLSocket directly to REXML::Parsers::SAX2Parser and REXML throws exceptions when the input is not UTF-8.

Our solution, wrap the socket and always return consistently encoded strings:

    class SSLSocketUtf8 < OpenSSL::SSL::SSLSocket
      def sysread *args
        super.force_encoding ::Encoding::UTF_8
      end
    end

Hello, I'm investigating some strange behavior with OpenSSL::SSL::SSLSocket and string encodings
#readline returns UTF-8 encoded strings, until the string actually contains UTF-8, then it claims that the encoding is ASCII-8BIT
I've been reading through the source, and I'm not sure where to try to patch it
whitehat101: have an example script?
whitehat101: can you reproduce it with #sysread?
if you can, the problem lies in the C code
if you cannot, the problem lies in the OpenSSL::Buffering module
I don't have a concise example, I'm working with the xmpp4r project
whitehat101: look at sample/openssl/echo_*
you can probably make a simple example out of that
I found that #sysread always returns 8BIT, but #readline usually gives UTF-8
Thank you, I'll look at those
whitehat101: then I imagine the problem is that OpenSSL::Buffering#initialize creates a UTF-8 buffer (@rbuffer)
I bet that # encoding: ASCII-8BIT at the very top of the file will fix it
in buffering.rb?
in ext/openssl/lib/openssl/buffering.rb
My feeling is that these functions should be returning UTF-8
A patch that works for my project:

    class SSLSocketUtf8 < OpenSSL::SSL::SSLSocket
      def sysread *args
        super.force_encoding ::Encoding::UTF_8
      end
    end

hrm
they should be returning the encoding of the SSLSocket
It doesn't look like SSLSocket has any support for encodings
I tried setting the encoding of the TCPSocket, but it had no effect
since SSLSocket wraps the TCPSocket, I don't know if that has an effect on SSLSocket#sysread
I'm guessing that SSLSocket has no idea what the encoding is, it just deals with bytes
We're passing the SSLSocket directly to REXML::Parsers::SAX2Parser and REXML throws exceptions when the input is not UTF-8
possibly, since it isn't an IO subclass and doesn't seem to respond to #set_encoding
setting the encoding on the TCPSocket probably has no effect because SSLSocket needs to read binary data off the TCPSocket
the ultimate solution would be "make SSLSocket support encodings"
That sounds right to me
a short-term fix would be "make the SSLSocket methods return a consistent encoding, regardless of correctness"
whitehat101: if you file a bug, maybe I'll find the time to fix it for ruby 2.1
you can file one here: http://bugs.ruby-lang.org/projects/ruby-trunk/issues/new
That would be excellent, thanks
Should I try to make an example, or just include this conversation?
this conversation is enough

--
http://bugs.ruby-lang.org/
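
The buffering theory from the conversation can be illustrated without a socket. The following is a minimal sketch, not the code in buffering.rb: it assumes the read buffer starts out as an empty UTF-8 string and that sysread-style chunks arrive tagged ASCII-8BIT. The helper buffered_encoding and the sample payloads are hypothetical and exist only for this illustration.

```ruby
# Sketch only: plain Ruby strings stand in for the read buffer and for the
# binary chunks that a sysread-style call would return. No SSL is involved.

# Hypothetical helper: append one binary chunk to a fresh UTF-8 buffer and
# return the encoding the buffer carries afterwards.
def buffered_encoding(payload)
  buffer = "".dup.force_encoding(Encoding::UTF_8)           # assumed UTF-8 buffer
  chunk  = payload.dup.force_encoding(Encoding::ASCII_8BIT) # assumed binary chunk
  buffer << chunk
  buffer.encoding
end

# Payload with only 7-bit ASCII bytes.
ascii_result = buffered_encoding("<presence/>\n")

# Payload containing a multibyte UTF-8 character.
utf8_result = buffered_encoding("<body>caf\u00e9</body>\n")

puts "ASCII-only payload: #{ascii_result}"
puts "UTF-8 payload:      #{utf8_result}"

# The workaround from the report, applied to a plain string: relabel the
# bytes as UTF-8 without changing them.
fixed = "caf\u00e9".b.force_encoding(Encoding::UTF_8)
puts "after force_encoding: #{fixed.encoding}, valid=#{fixed.valid_encoding?}"
```

Under these assumptions the buffer keeps its UTF-8 label when the chunk is pure ASCII and takes on the chunk's binary label when the chunk contains non-ASCII bytes, which matches the inconsistency reported for #readline. force_encoding only relabels the bytes, so it is safe only when the peer really sends UTF-8.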