Chars enables you to work transparently with UTF-8 encoding in the Ruby String class without extensive knowledge of the encoding. A Chars object accepts a string upon initialization and proxies String methods in an encoding-safe manner. All the normal String methods are also implemented on the proxy.
String methods are proxied through the Chars object, and can be accessed through the mb_chars method. Methods which would normally return a String object now return a Chars object so methods can be chained.
"The Perfect String ".mb_chars.downcase.strip.normalize #=> "the perfect string"
Chars objects are perfectly interchangeable with String objects as long as no explicit class checks are made. If a method does explicitly check the class, call to_s on the Chars object before you pass it to that method.
bad.explicit_checking_method "T".mb_chars.downcase.to_s
The default Chars implementation assumes that the encoding of the string is UTF-8. If you want to handle different encodings, you can write your own multibyte string handler and configure it through ActiveSupport::Multibyte.proxy_class.
class CharsForUTF32
  def size
    @wrapped_string.size / 4
  end
  def self.accepts?(string)
    string.length % 4 == 0
  end
end
ActiveSupport::Multibyte.proxy_class = CharsForUTF32
Hangul character boundaries and properties
The BOM (byte order mark) can also be seen as whitespace. It is a non-rendering character used to distinguish between little-endian and big-endian byte order. Byte order is not an issue in UTF-8, so the BOM must be ignored.
All the Unicode whitespace
Compose decomposed characters to the composed form.
# File lib/active_support/multibyte/chars.rb, line 576
def compose_codepoints(codepoints)
  pos = 0
  eoa = codepoints.length - 1
  starter_pos = 0
  starter_char = codepoints[0]
  previous_combining_class = -1
  while pos < eoa
    pos += 1
    lindex = starter_char - HANGUL_LBASE
    # -- Hangul
    if 0 <= lindex and lindex < HANGUL_LCOUNT
      vindex = codepoints[starter_pos+1] - HANGUL_VBASE rescue vindex = -1
      if 0 <= vindex and vindex < HANGUL_VCOUNT
        tindex = codepoints[starter_pos+2] - HANGUL_TBASE rescue tindex = -1
        if 0 <= tindex and tindex < HANGUL_TCOUNT
          j = starter_pos + 2
          eoa -= 2
        else
          tindex = 0
          j = starter_pos + 1
          eoa -= 1
        end
        codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
      end
      starter_pos += 1
      starter_char = codepoints[starter_pos]
    # -- Other characters
    else
      current_char = codepoints[pos]
      current = UCD.codepoints[current_char]
      if current.combining_class > previous_combining_class
        if ref = UCD.composition_map[starter_char]
          composition = ref[current_char]
        else
          composition = nil
        end
        unless composition.nil?
          codepoints[starter_pos] = composition
          starter_char = composition
          codepoints.delete_at pos
          eoa -= 1
          pos -= 1
          previous_combining_class = -1
        else
          previous_combining_class = current.combining_class
        end
      else
        previous_combining_class = current.combining_class
      end
      if current.combining_class == 0
        starter_pos = pos
        starter_char = codepoints[pos]
      end
    end
  end
  codepoints
end
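The Hangul branch of the composition code is plain arithmetic on jamo indices. The following is a minimal standalone sketch of that arithmetic, assuming the standard Unicode Hangul constants. The helper name compose_hangul is hypothetical and is not part of ActiveSupport.

```ruby
# Standard Hangul composition constants from the Unicode Standard.
HANGUL_SBASE  = 0xAC00
HANGUL_LBASE  = 0x1100
HANGUL_VBASE  = 0x1161
HANGUL_TBASE  = 0x11A7
HANGUL_VCOUNT = 21
HANGUL_TCOUNT = 28

# Hypothetical helper: combine a leading consonant, a vowel and an optional
# trailing consonant (all given as codepoints) into one syllable codepoint.
def compose_hangul(lead, vowel, trail = nil)
  lindex = lead - HANGUL_LBASE
  vindex = vowel - HANGUL_VBASE
  tindex = trail ? trail - HANGUL_TBASE : 0
  (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
end

p compose_hangul(0x1112, 0x1161, 0x11AB) == 0xD55C # => true
p compose_hangul(0x1100, 0x1161) == 0xAC00         # => true
```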
Returns true when the proxy class can handle the string. Returns false otherwise.
# File lib/active_support/multibyte/chars.rb, line 122
def self.consumes?(string)
  # Unpack is a little bit faster than regular expressions.
  string.unpack('U*')
  true
rescue ArgumentError
  false
end
Decompose composed characters to the decomposed form.
# File lib/active_support/multibyte/chars.rb, line 555
def decompose_codepoints(type, codepoints)
  codepoints.inject([]) do |decomposed, cp|
    # if it's a hangul syllable starter character
    if HANGUL_SBASE <= cp and cp < HANGUL_SLAST
      sindex = cp - HANGUL_SBASE
      ncp = [] # new codepoints
      ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
      ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
      tindex = sindex % HANGUL_TCOUNT
      ncp << (HANGUL_TBASE + tindex) unless tindex == 0
      decomposed.concat ncp
    # if the codepoint is decomposable in with the current decomposition type
    elsif (ncp = UCD.codepoints[cp].decomp_mapping) and (!UCD.codepoints[cp].decomp_type || type == :compatability)
      decomposed.concat decompose_codepoints(type, ncp.dup)
    else
      decomposed << cp
    end
  end
end
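The Hangul branch of the decomposition can be tried in isolation. This is a sketch under the assumption that only precomposed Hangul syllables need splitting. The module name HangulDecomposition and its constants are illustrative and do not come from the library.

```ruby
module HangulDecomposition
  SBASE  = 0xAC00
  LBASE  = 0x1100
  VBASE  = 0x1161
  TBASE  = 0x11A7
  TCOUNT = 28
  NCOUNT = 588    # VCOUNT * TCOUNT = 21 * 28
  SCOUNT = 11_172 # LCOUNT * NCOUNT = 19 * 588

  module_function

  # Split one precomposed syllable into its jamo codepoints; any other
  # codepoint is returned unchanged.
  def decompose(codepoint)
    sindex = codepoint - SBASE
    return [codepoint] unless (0...SCOUNT).cover?(sindex)

    jamo = [LBASE + sindex / NCOUNT, VBASE + (sindex % NCOUNT) / TCOUNT]
    tindex = sindex % TCOUNT
    jamo << TBASE + tindex unless tindex.zero?
    jamo
  end
end

p HangulDecomposition.decompose(0xD55C) # => [4370, 4449, 4523]
p HangulDecomposition.decompose(0x0041) # => [65]
```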
Reverse operation of g_unpack.
Example:
Chars.g_pack(Chars.g_unpack('क्षि')) #=> 'क्षि'
# File lib/active_support/multibyte/chars.rb, line 526
def g_pack(unpacked)
  (unpacked.flatten).pack('U*')
end
Unpack the string at grapheme boundaries. Returns a list of character lists.
Example:
Chars.g_unpack('क्षि') #=> [[2325, 2381], [2359], [2367]]
Chars.g_unpack('Café') #=> [[67], [97], [102], [233]]
# File lib/active_support/multibyte/chars.rb, line 492
def g_unpack(string)
  codepoints = u_unpack(string)
  unpacked = []
  pos = 0
  marker = 0
  eoc = codepoints.length
  while(pos < eoc)
    pos += 1
    previous = codepoints[pos-1]
    current = codepoints[pos]
    if (
        # CR X LF
        one = ( previous == UCD.boundary[:cr] and current == UCD.boundary[:lf] ) or
        # L X (L|V|LV|LVT)
        two = ( UCD.boundary[:l] === previous and in_char_class?(current, [:l,:v,:lv,:lvt]) ) or
        # (LV|V) X (V|T)
        three = ( in_char_class?(previous, [:lv,:v]) and in_char_class?(current, [:v,:t]) ) or
        # (LVT|T) X (T)
        four = ( in_char_class?(previous, [:lvt,:t]) and UCD.boundary[:t] === current ) or
        # X Extend
        five = (UCD.boundary[:extend] === current)
      )
    else
      unpacked << codepoints[marker..pos-1]
      marker = pos
    end
  end
  unpacked
end
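On Ruby 2.5 or newer, the same shape of result can be approximated with the built-in String#grapheme_clusters. The sketch below assumes that method is available. The helper name grapheme_codepoints is hypothetical, and because the built-in follows newer Unicode segmentation rules, its output for some scripts (Devanagari, for instance) may differ from the examples shown for g_unpack.

```ruby
# Hypothetical helper mirroring the shape of g_unpack's return value,
# built on String#grapheme_clusters (available since Ruby 2.5).
def grapheme_codepoints(string)
  string.grapheme_clusters.map { |cluster| cluster.unpack('U*') }
end

p grapheme_codepoints("Cafe\u0301") # => [[67], [97], [102], [101, 769]]
p grapheme_codepoints("a\r\nb")     # => [[97], [13, 10], [98]]
```

The first call shows the "X Extend" rule (the combining accent stays with its base letter) and the second shows the "CR X LF" rule.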
Detect whether the codepoint is in a certain character class. Returns true when it’s in the specified character class and false otherwise. Valid character classes are: :cr, :lf, :l, :v, :lv, :lvt and :t.
Primarily used by the grapheme cluster support.
# File lib/active_support/multibyte/chars.rb, line 483
def in_char_class?(codepoint, classes)
  classes.detect { |c| UCD.boundary[c] === codepoint } ? true : false
end
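The lookup relies on the case-equality operator, which works for single codepoints and for ranges alike. A minimal sketch follows, assuming a small hand-written boundary table. The BOUNDARY constant here is a simplified excerpt for illustration only, whereas the real table is loaded from the Unicode database and covers more ranges.

```ruby
# Simplified, hand-written excerpt of grapheme boundary classes
# (assumption: the real table comes from the Unicode Character Database).
BOUNDARY = {
  cr: 0x000D,
  lf: 0x000A,
  l:  0x1100..0x115F,
  v:  0x1160..0x11A7,
  t:  0x11A8..0x11FF
}.freeze

# Integer#=== tests equality and Range#=== tests membership, so one
# expression covers both kinds of table entry.
def in_char_class?(codepoint, classes)
  classes.any? { |name| BOUNDARY.fetch(name) === codepoint }
end

p in_char_class?(0x1100, [:l, :v])     # => true
p in_char_class?(0x0041, [:l, :v, :t]) # => false
```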
Creates a new Chars instance by wrapping string.
# File lib/active_support/multibyte/chars.rb, line 83
def initialize(string)
  @wrapped_string = string
  @wrapped_string.force_encoding(Encoding::UTF_8) unless @wrapped_string.frozen?
end
Re-order codepoints so the string becomes canonical.
# File lib/active_support/multibyte/chars.rb, line 539
def reorder_characters(codepoints)
  length = codepoints.length- 1
  pos = 0
  while pos < length do
    cp1, cp2 = UCD.codepoints[codepoints[pos]], UCD.codepoints[codepoints[pos+1]]
    if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
      codepoints[pos..pos+1] = cp2.code, cp1.code
      pos += (pos > 0 ? -1 : 1)
    else
      pos += 1
    end
  end
  codepoints
end
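Canonical reordering is effectively a bubble sort of adjacent combining marks by combining class. The sketch below reproduces the idea with a two-entry combining class table. COMBINING_CLASS, combining_class and reorder are hypothetical names, and a real implementation would read the classes from the Unicode database.

```ruby
# Tiny, hand-picked excerpt of canonical combining classes (assumption:
# a real implementation would load these from the Unicode Character Database).
COMBINING_CLASS = {
  0x0301 => 230, # COMBINING ACUTE ACCENT
  0x0323 => 220  # COMBINING DOT BELOW
}.freeze

def combining_class(codepoint)
  COMBINING_CLASS.fetch(codepoint, 0)
end

# Swap adjacent combining marks until each run is sorted by combining class.
def reorder(codepoints)
  result = codepoints.dup
  pos = 0
  while pos < result.length - 1
    first  = combining_class(result[pos])
    second = combining_class(result[pos + 1])
    if first > second && second > 0
      result[pos], result[pos + 1] = result[pos + 1], result[pos]
      pos -= 1 if pos > 0 # step back to re-check the previous pair
    else
      pos += 1
    end
  end
  result
end

p reorder([0x61, 0x0301, 0x0323]) # => [97, 803, 769]
```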
Replaces all ISO-8859-1 or CP1252 characters by their UTF-8 equivalent resulting in a valid UTF-8 string.
# File lib/active_support/multibyte/chars.rb, line 635
def tidy_bytes(string)
  string.split(//).map do |c|
    c.force_encoding(Encoding::ASCII) if c.respond_to?(:force_encoding)
    if !ActiveSupport::Multibyte::VALID_CHARACTER['UTF-8'].match(c)
      n = c.unpack('C')[0]
      n < 128 ? n.chr :
      n < 160 ? [UCD.cp1252[n] || n].pack('U') :
      n < 192 ? "\xC2" + n.chr : "\xC3" + (n-64).chr
    else
      c
    end
  end.join
end
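A similar effect can be sketched on modern Ruby with String#scrub and the built-in transcoder. This is a different technique from the listing above, not the library's algorithm. The helper name tidy_bytes_sketch is hypothetical, and bytes that Windows-1252 leaves undefined are replaced with U+FFFD here rather than passed through.

```ruby
# Hypothetical helper: reinterpret every byte sequence that is invalid in
# UTF-8 as Windows-1252 and transcode it to UTF-8.
def tidy_bytes_sketch(string)
  string.dup.force_encoding(Encoding::UTF_8).scrub do |bad_bytes|
    bad_bytes.dup.force_encoding(Encoding::Windows_1252)
             .encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
  end
end

p tidy_bytes_sketch("Caf\xE9") == "Caf\u00E9"                    # => true
p tidy_bytes_sketch("\x93quoted\x94") == "\u201Cquoted\u201D"    # => true
```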
Unpack the string at codepoint boundaries. Raises an EncodingError when the encoding of the string isn’t valid UTF-8.
Example:
Chars.u_unpack('Café') #=> [67, 97, 102, 233]
# File lib/active_support/multibyte/chars.rb, line 470
def u_unpack(string)
  begin
    string.unpack 'U*'
  rescue ArgumentError
    raise EncodingError, 'malformed UTF-8 character'
  end
end
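The error translation can be exercised without ActiveSupport. A minimal sketch, assuming only core Ruby, with codepoints_or_raise as a hypothetical stand-in for u_unpack:

```ruby
# Hypothetical stand-in for u_unpack: String#unpack('U*') raises
# ArgumentError on malformed UTF-8, which is re-raised as EncodingError.
def codepoints_or_raise(string)
  string.unpack('U*')
rescue ArgumentError
  raise EncodingError, 'malformed UTF-8 character'
end

p codepoints_or_raise("Caf\u00E9") # => [67, 97, 102, 233]

begin
  codepoints_or_raise("Caf\xE9") # a lone Latin-1 byte, not valid UTF-8
rescue EncodingError => e
  puts e.message # prints: malformed UTF-8 character
end
```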
Returns true if the Chars class can and should act as a proxy for the string string. Returns false otherwise.
# File lib/active_support/multibyte/chars.rb, line 117
def self.wants?(string)
  $KCODE == 'UTF8' && consumes?(string)
end
Returns a new Chars object containing the other object concatenated to the string.
Example:
('Café'.mb_chars + ' périferôl').to_s #=> "Café périferôl"
# File lib/active_support/multibyte/chars.rb, line 146
def +(other)
  self << other
end
Returns -1, 0 or +1 depending on whether the Chars object is to be sorted before, equal to, or after the object on the right side of the operation. It accepts any object that implements to_s. See String#<=> for more details.
Example:
'é'.mb_chars <=> 'ü'.mb_chars #=> -1
# File lib/active_support/multibyte/chars.rb, line 138
def <=>(other)
  @wrapped_string <=> other.to_s
end
Like String#=~ only it returns the character offset (in codepoints) instead of the byte offset.
Example:
'Café périferôl'.mb_chars =~ /ô/ #=> 12
# File lib/active_support/multibyte/chars.rb, line 154
def =~(other)
  translate_offset(@wrapped_string =~ other)
end
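The difference between the two kinds of offset is easy to see on a string with multibyte characters. The sketch below assumes a byte offset is all that is available and converts it by counting the characters in front of it. The helper name char_offset is hypothetical, and on Ruby 1.9 and later String#=~ already returns character offsets for UTF-8 strings.

```ruby
# Hypothetical helper: convert a byte offset into a character offset by
# counting the characters that precede it.
def char_offset(string, byte_offset)
  string.byteslice(0, byte_offset).length
end

text = "Caf\u00E9 p\u00E9rifer\u00F4l"      # "Café périferôl"
byte_offset = text.b.index("\u00F4".b)      # byte-oriented search on binary copies
p byte_offset                               # => 14
p char_offset(text, byte_offset)            # => 12
```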
Like String#[]=, except instead of byte offsets you specify character offsets.
Example:
s = "Müller"
s.mb_chars[2] = "e" # Replace character with offset 2
s #=> "Müeler"
s = "Müller"
s.mb_chars[1, 2] = "ö" # Replace 2 characters at character offset 1
s #=> "Möler"
# File lib/active_support/multibyte/chars.rb, line 230
def []=(*args)
  replace_by = args.pop
  # Indexed replace with regular expressions already works
  if args.first.is_a?(Regexp)
    @wrapped_string[*args] = replace_by
  else
    result = self.class.u_unpack(@wrapped_string)
    if args[0].is_a?(Fixnum)
      raise IndexError, "index #{args[0]} out of string" if args[0] >= result.length
      min = args[0]
      max = args[1].nil? ? min : (min + args[1] - 1)
      range = Range.new(min, max)
      replace_by = [replace_by].pack('U') if replace_by.is_a?(Fixnum)
    elsif args.first.is_a?(Range)
      raise RangeError, "#{args[0]} out of range" if args[0].min >= result.length
      range = args[0]
    else
      needle = args[0].to_s
      min = index(needle)
      max = min + self.class.u_unpack(needle).length - 1
      range = Range.new(min, max)
    end
    result[range] = self.class.u_unpack(replace_by)
    @wrapped_string.replace(result.pack('U*'))
  end
end
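The core of the integer-indexed case is an unpack, an array splice and a repack. A minimal sketch of just that path, assuming valid UTF-8 input, with replace_chars as a hypothetical helper that returns a new string instead of mutating the receiver:

```ruby
# Hypothetical helper: replace `length` characters starting at character
# offset `start`, working on an array of codepoints.
def replace_chars(string, start, length, replacement)
  codepoints = string.unpack('U*')
  codepoints[start, length] = replacement.unpack('U*')
  codepoints.pack('U*')
end

p replace_chars("M\u00FCller", 2, 1, 'e') == "M\u00FCeler"      # => true
p replace_chars("M\u00FCller", 1, 2, "\u00F6") == "M\u00F6ler"  # => true
```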
Enable more predictable duck-typing on String-like classes. See Object#acts_like?.
# File lib/active_support/multibyte/chars.rb, line 111
def acts_like_string?
  true
end
Converts the first character to uppercase and the remainder to lowercase.
Example:
'über'.mb_chars.capitalize.to_s #=> "Über"
# File lib/active_support/multibyte/chars.rb, line 392
def capitalize
  (slice(0) || chars('')).upcase + (slice(1..-1) || chars('')).downcase
end
Works just like String#center, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.center(8).to_s #=> " ¾ cup  "
"¾ cup".mb_chars.center(8, " ").to_s # Use non-breaking whitespace #=> " ¾ cup  "
# File lib/active_support/multibyte/chars.rb, line 292
def center(integer, padstr=' ')
  justify(integer, :center, padstr)
end
Performs composition on all the characters.
Example:
'é'.length #=> 3
'é'.mb_chars.compose.to_s.length #=> 2
# File lib/active_support/multibyte/chars.rb, line 434
def compose
  chars(self.class.compose_codepoints(self.class.u_unpack(@wrapped_string)).pack('U*'))
end
Performs canonical decomposition on all the characters.
Example:
'é'.length #=> 2
'é'.mb_chars.decompose.to_s.length #=> 3
# File lib/active_support/multibyte/chars.rb, line 425
def decompose
  chars(self.class.decompose_codepoints(:canonical, self.class.u_unpack(@wrapped_string)).pack('U*'))
end
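The lengths in the compose and decompose examples are byte counts, since String#length counted bytes on Ruby 1.8. The relationship between the two forms can be checked on current Ruby with String#unicode_normalize, used here as a stand-in for compose and decompose. Escapes spell out the codepoints because the two forms look identical on screen.

```ruby
composed   = "\u00E9"  # é as a single codepoint
decomposed = "e\u0301" # e followed by COMBINING ACUTE ACCENT

p composed.bytesize   # => 2
p decomposed.bytesize # => 3
p composed.unicode_normalize(:nfd) == decomposed # => true
p decomposed.unicode_normalize(:nfc) == composed # => true
```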
Convert characters in the string to lowercase.
Example:
'VĚDA A VÝZKUM'.mb_chars.downcase.to_s #=> "věda a výzkum"
# File lib/active_support/multibyte/chars.rb, line 384
def downcase
  apply_mapping :lowercase_mapping
end
Returns the number of grapheme clusters in the string.
Example:
'क्षि'.mb_chars.length #=> 4
'क्षि'.mb_chars.g_length #=> 3
# File lib/active_support/multibyte/chars.rb, line 443
def g_length
  self.class.g_unpack(@wrapped_string).length
end
Returns true if contained string contains other. Returns false otherwise.
Example:
'Café'.mb_chars.include?('é') #=> true
# File lib/active_support/multibyte/chars.rb, line 187
def include?(other)
  # We have to redefine this method because Enumerable defines it.
  @wrapped_string.include?(other)
end
Returns the position of needle in the string, counting in codepoints. Returns nil if needle isn’t found.
Example:
'Café périferôl'.mb_chars.index('ô') #=> 12
'Café périferôl'.mb_chars.index(/\w/u) #=> 0
# File lib/active_support/multibyte/chars.rb, line 197
def index(needle, offset=0)
  wrapped_offset = self.first(offset).wrapped_string.length
  index = @wrapped_string.index(needle, wrapped_offset)
  index ? (self.class.u_unpack(@wrapped_string.slice(0...index)).size) : nil
end
Inserts the passed string at specified codepoint offsets.
Example:
'Café'.mb_chars.insert(4, ' périferôl').to_s #=> "Café périferôl"
# File lib/active_support/multibyte/chars.rb, line 171
def insert(offset, fragment)
  unpacked = self.class.u_unpack(@wrapped_string)
  unless offset > unpacked.length
    @wrapped_string.replace(
      self.class.u_unpack(@wrapped_string).insert(offset, *self.class.u_unpack(fragment)).pack('U*')
    )
  else
    raise IndexError, "index #{offset} out of string"
  end
  self
end
Works just like String#ljust, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.ljust(8).to_s #=> "¾ cup   "
"¾ cup".mb_chars.ljust(8, " ").to_s # Use non-breaking whitespace #=> "¾ cup   "
# File lib/active_support/multibyte/chars.rb, line 279
def ljust(integer, padstr=' ')
  justify(integer, :left, padstr)
end
Strips entire range of Unicode whitespace from the left of the string.
# File lib/active_support/multibyte/chars.rb, line 302
def lstrip
  chars(@wrapped_string.gsub(UNICODE_LEADERS_PAT, ''))
end
Forward all undefined methods to the wrapped string.
# File lib/active_support/multibyte/chars.rb, line 94
def method_missing(method, *args, &block)
  if method.to_s =~ /!$/
    @wrapped_string.__send__(method, *args, &block)
    self
  else
    result = @wrapped_string.__send__(method, *args, &block)
    result.kind_of?(String) ? chars(result) : result
  end
end
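The forwarding rule can be tried with a small standalone proxy. This is a sketch of the pattern only, not the library class. MiniChars is a hypothetical name, and unlike the listing above it also defines respond_to_missing?, which is the usual companion to method_missing in current Ruby.

```ruby
# Hypothetical miniature proxy: forwards unknown methods to the wrapped
# string and re-wraps String results so that calls can be chained.
class MiniChars
  def initialize(string)
    @wrapped_string = string
  end

  def to_s
    @wrapped_string
  end

  def method_missing(name, *args, &block)
    return super unless @wrapped_string.respond_to?(name)

    result = @wrapped_string.public_send(name, *args, &block)
    if name.to_s.end_with?('!')
      self # bang methods mutate the wrapped string in place
    elsif result.is_a?(String)
      self.class.new(result)
    else
      result
    end
  end

  def respond_to_missing?(name, include_private = false)
    @wrapped_string.respond_to?(name, include_private) || super
  end
end

chained = MiniChars.new(' The Perfect String ').downcase.strip
p chained.class # => MiniChars
p chained.to_s  # => "the perfect string"
```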
Returns the KC normalization of the string by default. NFKC is considered the best normalization form for passing strings to databases and validations.
form - The form you want to normalize in. Should be one of the following: :c, :kc, :d, or :kd. Defaults to ActiveSupport::Multibyte.default_normalization_form. The method takes no string argument; it normalizes the wrapped string.
# File lib/active_support/multibyte/chars.rb, line 403
def normalize(form=ActiveSupport::Multibyte.default_normalization_form)
  # See http://www.unicode.org/reports/tr15, Table 1
  codepoints = self.class.u_unpack(@wrapped_string)
  chars(case form
    when :d
      self.class.reorder_characters(self.class.decompose_codepoints(:canonical, codepoints))
    when :c
      self.class.compose_codepoints(self.class.reorder_characters(self.class.decompose_codepoints(:canonical, codepoints)))
    when :kd
      self.class.reorder_characters(self.class.decompose_codepoints(:compatability, codepoints))
    when :kc
      self.class.compose_codepoints(self.class.reorder_characters(self.class.decompose_codepoints(:compatability, codepoints)))
    else
      raise ArgumentError, "#{form} is not a valid normalization variant", caller
  end.pack('U*'))
end
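The difference between the canonical and compatibility forms shows up on characters such as ligatures. A minimal sketch on current Ruby, assuming String#unicode_normalize as the engine. FORMS and normalize_sketch are hypothetical names that map the short form symbols used here onto the ones Ruby expects.

```ruby
# Hypothetical mapping from the short form names to the names
# String#unicode_normalize expects.
FORMS = { c: :nfc, d: :nfd, kc: :nfkc, kd: :nfkd }.freeze

def normalize_sketch(string, form = :kc)
  target = FORMS.fetch(form) do
    raise ArgumentError, "#{form} is not a valid normalization variant"
  end
  string.unicode_normalize(target)
end

ligature = "\uFB01" # LATIN SMALL LIGATURE FI
p normalize_sketch(ligature, :c) == ligature # => true, canonical forms keep it
p normalize_sketch(ligature, :kc)            # => "fi"
```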
Returns the codepoint of the first character in the string.
Example:
'こんにちは'.mb_chars.ord #=> 12371
# File lib/active_support/multibyte/chars.rb, line 368
def ord
  self.class.u_unpack(@wrapped_string)[0]
end
Returns true if obj responds to the given method. Private methods are included in the search only if the optional second parameter evaluates to true.
# File lib/active_support/multibyte/chars.rb, line 106
def respond_to?(method, include_private=false)
  super || @wrapped_string.respond_to?(method, include_private) || false
end
Reverses all characters in the string.
Example:
'Café'.mb_chars.reverse.to_s #=> 'éfaC'
# File lib/active_support/multibyte/chars.rb, line 321
def reverse
  chars(self.class.g_unpack(@wrapped_string).reverse.flatten.pack('U*'))
end
Returns the position of needle in the string, counting in codepoints, searching backward from offset or the end of the string. Returns nil if needle isn’t found.
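Reversing by grapheme cluster rather than by codepoint matters once combining marks are involved. A short sketch on Ruby 2.5 or newer, with grapheme_reverse as a hypothetical helper built on String#grapheme_clusters:

```ruby
def grapheme_reverse(string)
  string.grapheme_clusters.reverse.join
end

word = "Cafe\u0301" # "Café" with a decomposed é
p word.reverse == "\u0301efaC"           # => true, the accent is detached from its e
p grapheme_reverse(word) == "e\u0301faC" # => true, the accent stays on the e
```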
Example:
'Café périferôl'.mb_chars.rindex('é') #=> 6
'Café périferôl'.mb_chars.rindex(/\w/u) #=> 13
# File lib/active_support/multibyte/chars.rb, line 210
def rindex(needle, offset=nil)
  offset ||= length
  wrapped_offset = self.first(offset).wrapped_string.length
  index = @wrapped_string.rindex(needle, wrapped_offset)
  index ? (self.class.u_unpack(@wrapped_string.slice(0...index)).size) : nil
end
Works just like String#rjust, only integer specifies characters instead of bytes.
Example:
"¾ cup".mb_chars.rjust(8).to_s #=> "   ¾ cup"
"¾ cup".mb_chars.rjust(8, " ").to_s # Use non-breaking whitespace #=> "   ¾ cup"
# File lib/active_support/multibyte/chars.rb, line 266
def rjust(integer, padstr=' ')
  justify(integer, :right, padstr)
end
Strips entire range of Unicode whitespace from the right of the string.
# File lib/active_support/multibyte/chars.rb, line 297
def rstrip
  chars(@wrapped_string.gsub(UNICODE_TRAILERS_PAT, ''))
end
Returns the number of codepoints in the string.
# File lib/active_support/multibyte/chars.rb, line 312
def size
  self.class.u_unpack(@wrapped_string).size
end
Implements Unicode-aware slice with codepoints. Slicing on one point returns the codepoints for that character.
Example:
'こんにちは'.mb_chars.slice(2..3).to_s #=> "にち"
# File lib/active_support/multibyte/chars.rb, line 330
def slice(*args)
  if args.size > 2
    raise ArgumentError, "wrong number of arguments (#{args.size} for 1)" # Do as if we were native
  elsif (args.size == 2 && !(args.first.is_a?(Numeric) || args.first.is_a?(Regexp)))
    raise TypeError, "cannot convert #{args.first.class} into Integer" # Do as if we were native
  elsif (args.size == 2 && !args[1].is_a?(Numeric))
    raise TypeError, "cannot convert #{args[1].class} into Integer" # Do as if we were native
  elsif args[0].kind_of? Range
    cps = self.class.u_unpack(@wrapped_string).slice(*args)
    result = cps.nil? ? nil : cps.pack('U*')
  elsif args[0].kind_of? Regexp
    result = @wrapped_string.slice(*args)
  elsif args.size == 1 && args[0].kind_of?(Numeric)
    character = self.class.u_unpack(@wrapped_string)[args[0]]
    result = character.nil? ? nil : [character].pack('U')
  else
    result = self.class.u_unpack(@wrapped_string).slice(*args).pack('U*')
  end
  result.nil? ? nil : chars(result)
end
Like String#slice!, except instead of byte offsets you specify character offsets.
Example:
s = 'こんにちは'
s.mb_chars.slice!(2..3).to_s #=> "にち"
s #=> "こんは"
# File lib/active_support/multibyte/chars.rb, line 358
def slice!(*args)
  slice = self[*args]
  self[*args] = ''
  slice
end
Works just like String#split, with the exception that the items in the resulting list are Chars instances instead of String. This makes chaining methods easier.
Example:
'Café périferôl'.mb_chars.split(/é/).map { |part| part.upcase.to_s } #=> ["CAF", " P", "RIFERÔL"]
# File lib/active_support/multibyte/chars.rb, line 163
def split(*args)
  @wrapped_string.split(*args).map { |i| i.mb_chars }
end
Strips entire range of Unicode whitespace from the right and left of the string.
# File lib/active_support/multibyte/chars.rb, line 307
def strip
  rstrip.lstrip
end
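The need for a Unicode-aware strip is visible on current Ruby as well, where String#strip only removes ASCII whitespace and null characters. The sketch below assumes the POSIX bracket class [[:space:]] as the whitespace definition and adds U+FEFF (the BOM) explicitly. The pattern names and unicode_strip are hypothetical, and the library's own patterns may cover a different set of characters.

```ruby
# [[:space:]] matches Unicode whitespace in UTF-8 strings; U+FEFF (BOM) is
# added explicitly because it is not classified as whitespace.
UNICODE_LEADERS  = /\A[[:space:]\uFEFF]+/
UNICODE_TRAILERS = /[[:space:]\uFEFF]+\z/

def unicode_strip(string)
  string.sub(UNICODE_LEADERS, '').sub(UNICODE_TRAILERS, '')
end

padded = "\u3000Caf\u00E9\u00A0" # ideographic space, text, no-break space
p padded.strip == padded               # => true, String#strip leaves them in place
p unicode_strip(padded) == "Caf\u00E9" # => true
```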