Replies: 15 comments
-
Examples of possible usage:
iconvLite.split(someString, someEncoding).length <= 1 // true when the whole string is encodable

iconvLite.split(someString, someEncoding).map(function(substring, index){
    if( index % 2 === 0 ){ // encodable substring: 0th, 2nd, 4th…
        return substring;
    } else { // non-encodable substring: 1st, 3rd, 5th…
        // TODO: use a real transliteration here instead of a mere slugification
        return require('underscore.string').slugify(substring);
    }
}).join('');
-
I don't quite understand when it could be useful compared to a callback. Do you think several invalid characters could be handled in a smarter way if handled together? The analogy to String.prototype.split alone doesn't convince me. If there is really a smarter way to handle multiple invalid chars as opposed to single ones, then I'd suggest using the same callback, but passing the multiple invalid bytes to it as a Buffer. What do you think?
-
I have to admit I could have misunderstood what a callback in #53 actually meant. What I proposed here is a method to deal with non-encodable parts of a Unicode string, not with non-decodable parts of a Buffer. Therefore here the even parts (the encodable substrings) are still strings, and the odd parts (the non-encodable substrings) are strings as well; no encoding has happened yet. Does that explain the absence of Buffers and the analogy to String.prototype.split?
-
The thought that several invalid characters could be handled in a smarter way if handled together is one I borrowed from the "Encoding" section of Wikipedia's "UTF-7" article.
Imagine a medium where most characters are encoded with some default encoding (probably one byte per character, such as CP866 or KOI-8R), but the rest are converted to UTF-7. Such a method (a default single-byte encoding plus UTF-7 for the rest) is what I have in mind.
-
Ok, now I got you, thanks for the explanation! Well, this would require additional code that does just this in all codecs. It's a rather big time investment for me, plus a burden on all future codecs, so I'm a bit reluctant to implement it without a hard use case. The use case you described above is mostly solved by a callback, with the exception of a single valid char surrounded by invalid ones, which I believe is a rare case (even if the format supports and recommends it).
-
What if I wrote a pull request? I may actually be willing to write it, but that depends on how many codecs iconv-lite has.

Preliminary note 1. For the Unicode encodings (such as UTF-16) any JavaScript character is encodable, and thus split would trivially return the whole string as its single element.

Preliminary note 2. One of the remaining codecs (…)
-
I'm still not convinced of the usefulness of this functionality: it will not only increase code complexity/size and API surface, but also increase the burden of adding each new encoding, even if you write all the code now. IMHO it's just not worth it until we find a compelling use case.
-
Well, I hope I have some (more or less compelling) use case. You know how UTF-7 was first invented, right? There was (and still somewhere is) a medium (MIME headers) where Unicode was not permitted, and thus any characters of a Unicode string either would fit in the other (ASCII) encoding or would have to be encoded (using ASCII characters) and escaped (so as not to be confused with real ASCII characters).

I have to face another similar medium now. That medium is Fidonet, where the design of the most popular message editor (GoldED+) makes any support of multibyte encodings in GoldED+ impossible. Therefore it is also not possible to simply write Unicode messages to Fidonet and expect them to ever be read by the users of GoldED+; however, if the text is mostly in Russian, it becomes possible to write most of the message in a single-byte encoding (such as CP866), split out the substrings that won't fit, encode them differently and escape them (so as not to be confused with the rest of the message). If a standard arises for such encoding and escaping, then the other message editors (and mere Fidonet browsers and WebBBS) could collectively embrace and extend GoldED+.

Speaking of the different encoding of such substrings, at first I suggested (there, last paragraph) that it could be Punycode; then Serguei E. Leontiev pointed out that UTF-7 is more compact. However, before they are differently encoded, these substrings (outside of the CP866 range) have to be isolated, split out of the string containing the original (Unicode) text of the message. That's my use case for the proposed split method.
-
In a nutshell this use case is a generalization of UTF-7's use case: the Unicode characters are forced into some 8-bit medium (defined by one of the supported single-byte encodings) instead of UTF-7's original 7-bit medium. The whole implementation of the use case would look like the following:

var iconvLite = require('iconv-lite');
iconvLite.extendNodeEncodings();
var UnicodeTo8bit = function(sourceString, targetEncoding){
var buffers = iconvLite.split(
sourceString, targetEncoding
).map(function(substring, index){
if( index % 2 === 0 ){ // encodable substring: 0th, 2nd, 4th…
return Buffer(substring, targetEncoding);
} else { // non-encodable substring: 1st, 3rd, 5th…
// TODO: define an escaping function
var escapedString = escapingFunction(
Buffer(substring, 'utf7').toString('utf8')
);
return Buffer(escapedString, targetEncoding);
}
});
return Buffer.concat(buffers);
};
-
Ok, but why can't you do this using callbacks? Each range of non-encodable characters could be converted and escaped right in the callback.

When the scheme stabilizes, you can also make it into its own codec. Or I could.

What do you think?

Alexander Shtuchkin

On Thu, Jul 17, 2014 at 5:08 AM, Mithgol notifications@github.com wrote:
-
Would such a callback receive only one character (the next non-encodable character), or a whole substring of such characters collected until the next encodable character is encountered (or the original string ends)?
-
I'm thinking about the latter case: that would be more convenient and probably faster too. There is one issue though: as iconv-lite is stream-oriented and works chunk-by-chunk, I would not want to accumulate ranges of invalid characters, because we could run out of memory on very long invalid streams. So I'm thinking about the following interface:

function unencodableHandler(str, offset, context, rangeStarted, rangeFinished) {
// str is a string of unencodable characters.
// offset - index of str in context.
// context is the whole string (chunk) currently encoded.
// rangeStarted flag is true when str starts a range of contiguous invalid characters.
// rangeFinished flag is true when str completes the range.
// Flags are always true in non-streaming usage.
// Return characters that can be translated thus far.
return (rangeStarted ? start_escape : "") + escape(str) + (rangeFinished ? end_escape : "");
}
// Convert a string - all unencodable strings will be complete.
buf = iconv.encode(str, "cp866", {unencodableHandler: unencodableHandler});
// Or a stream
inputStream.pipe(iconv.encodeStream("cp866", {unencodableHandler: unencodableHandler})).pipe(outStream);
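This option does not exist in iconv-lite; as a self-contained sketch of how such a handler could be driven, the encoder loop below uses a hypothetical ASCII-only `encodable` check standing in for a real codec, and a toy handler mirroring the start_escape/end_escape pattern above:

```javascript
// Sketch of an encoder loop driving the proposed unencodableHandler.
// encodable() is a hypothetical stand-in for the codec's own check;
// the handler signature follows the interface proposed above.
function encodeWithHandler(str, encodable, handler) {
  var out = '';
  var i = 0;
  while (i < str.length) {
    if (encodable(str.charAt(i))) {
      out += str.charAt(i++);
    } else {
      var start = i;
      while (i < str.length && !encodable(str.charAt(i))) i++;
      // Non-streaming usage: the whole run is available, so both flags are true.
      out += handler(str.slice(start, i), start, str, true, true);
    }
  }
  return out;
}

// Toy handler: wraps each run of unencodable characters in <...>,
// replacing every character with '?' (a real one would escape and re-encode).
var handler = function (run, offset, context, rangeStarted, rangeFinished) {
  return (rangeStarted ? '<' : '') + '?'.repeat(run.length) + (rangeFinished ? '>' : '');
};
var isAscii = function (ch) { return ch.charCodeAt(0) < 128; };
console.log(encodeWithHandler('abc\u0451\u0451def', isAscii, handler)); // abc<??>def
```

In a streaming implementation the same handler would instead be called once per chunk, with rangeStarted/rangeFinished delimiting runs that cross chunk boundaries.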
-
LGTM. That interface should work for my use case.

For example, as Wikipedia says about UTF-7, (…) Thus (…) However, in practice (…) However, that's still better than not having such a handler at all.
-
Just a nudge because ≈nine months have passed.
-
Sorry, not much free time recently. I do remember about it and will fix eventually.
-
I'd like to propose an alternative to #53.
It is supposed in #53 that “invalid” characters (i.e. characters that cannot be encoded in the given encoding) should be dealt with individually. Sometimes, however, it is more useful to deal with whole substrings of such characters. For such cases I propose a method that would split any given string into an array of encodable and non-encodable substrings following each other.
Example:
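iconv-lite does not implement the proposed method, so as an illustration of its intended behaviour, here is a plain-JavaScript sketch; the `isEncodable` predicate (ASCII-only, purely for illustration) is a hypothetical stand-in for a codec's real encodability check:

```javascript
// Hypothetical sketch of the proposed split(): partitions a string into
// alternating encodable (even indices) and non-encodable (odd indices) parts.
// isEncodable is a stand-in predicate; a real implementation would ask the codec.
function split(str, isEncodable) {
  var parts = ['']; // the 0th part is always encodable (and may stay empty)
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    // parts.length odd => the last part has an even index => it is encodable
    var partIsEncodable = (parts.length % 2 === 1);
    if (isEncodable(ch) === partIsEncodable) {
      parts[parts.length - 1] += ch; // same kind: extend the current part
    } else {
      parts.push(ch); // kind changed: start a new part
    }
  }
  return parts;
}

// Illustration with an ASCII-only predicate:
var isAscii = function (ch) { return ch.charCodeAt(0) < 128; };
console.log(split('foo\u0451bar', isAscii)); // [ 'foo', 'ё', 'bar' ]
console.log(split('\u0451bar', isAscii));    // [ '', 'ё', 'bar' ]
console.log(split('foobar', isAscii));       // [ 'foobar' ]
```

Note how the second call returns a leading empty string so that odd indices keep meaning "non-encodable", and how a fully encodable string yields a single-element array, matching the `length <= 1` check suggested later in the thread.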
The above suggested method is inspired by the behaviour of String.prototype.split when it is given a regular expression enclosed in a single set of capturing parentheses.

The proposed method should remind its users of String.prototype.split (hence the name .split) and thus be understood by analogy. To make the similarity complete, it should also behave similarly, i.e. the even array indices (0, 2, 4…) should always correspond to encodable substrings while the odd array indices (1, 3, 5…) should always correspond to non-encodable substrings. (To achieve that, the first substring in the returned array could sometimes be intentionally left blank, as String.prototype.split does in its [ '', '--', 'foo', '-', 'bar' ] example, to preserve the meaning of odd and even indices.)