RMail has invalid byte sequence #22

gauteh · 2013-04-13T12:14:00Z

/home/gaute/.gem/ruby/2.0.0/gems/rmail-1.0.0/lib/rmail/address.rb:694: invalid multibyte escape: /\A[\200-\377\w!$%&\'*+\/=?^_\`{\}|~#-]+/m
/home/gaute/.gem/ruby/2.0.0/gems/rmail-1.0.0/lib/rmail.rb:41:in `<top (required)>'
/home/gaute/dev/ruby/sup/lib/sup.rb:8:in `<top (required)>'
/home/gaute/dev/ruby/sup/sup-version.rb:9:in `<top (required)>'
/home/gaute/dev/ruby/sup/Rakefile:49:in `require_relative'
/home/gaute/dev/ruby/sup/Rakefile:49:in `<top (required)>'

The text was updated successfully, but these errors were encountered:

gauteh · 2013-04-13T12:28:13Z

Seems to do the trick, but I guess the gem is also outdated...:

diff -ru old/lib/rmail/address.rb new/lib/rmail/address.rb
--- old/lib/rmail/address.rb    1970-01-01 01:00:00.000000000 +0100
+++ new/lib/rmail/address.rb    2013-04-13 14:26:00.713612620 +0200
@@ -691,7 +691,7 @@
        @sym = SYM_DOMAIN_LITERAL
        @lexeme = $1.gsub(/(^|[^\\])[\r\n\t ]+/, '\1').gsub(/\\(.)/, '\1')
        break
-          when /\A[\200-\377\w!$%&\'*+\/=?^_\`{\}|~#-]+/m
+          when /\A[\200-\377\w!$%&\'*+\/=?^_\`{\}|~#-]+/nm
             # This is just like SYM_ATOM, but includes all characters
             # with high bits.  This is so we can allow such tokens in
             # the display name portion of an address even though it
diff -ru old/lib/rmail/header.rb new/lib/rmail/header.rb
--- old/lib/rmail/header.rb 1970-01-01 01:00:00.000000000 +0100
+++ new/lib/rmail/header.rb 2013-04-13 14:26:00.720279286 +0200
@@ -73,7 +73,7 @@

     class Field                 # :nodoc:
       # fixme, document methadology for this (RFC2822)
-      EXTRACT_FIELD_NAME_RE = /\A([^\x00-\x1f\x7f-\xff :]+):\s*/o
+      EXTRACT_FIELD_NAME_RE = /\A([^\x00-\x1f\x7f-\xff :]+):\s*/no

       class << self
         def parse(field)

rakoo · 2013-04-13T13:03:51Z

I switched heliotrope from RMail to the mail gem (https://github.com/mikel/mail). The new message.rb is API-compatible, it could be interesting to switch to it.

Here's the diff for that file :

diff --git a/lib/heliotrope/message.rb b/lib/heliotrope/message.rb
index 0c928cd..d87cca9 100644
--- a/lib/heliotrope/message.rb
+++ b/lib/heliotrope/message.rb
@@ -1,10 +1,29 @@
 # encoding: UTF-8

-require 'rmail'
+require 'mail'
 require 'digest/md5'
 require 'json'
 require 'timeout'

+module Mail
+  class Message
+
+    # a common interface that matches all the field
+    # IMPORTANT : if not existing, it must return nil
+    def fetch_header field
+      sym = field.to_sym
+      self[sym] ? self[sym].to_s : nil
+    end
+
+    # Make sure the message has valid message ids for the message, and
+    # fetch them
+    def fetch_message_ids field
+      self[field] ? self[field].message_ids || [self[field].message_id] : []
+    end
+
+  end
+end
+
 module Heliotrope
 class InvalidMessageError < StandardError; end
 class Message
@@ -14,43 +33,53 @@ class Message
   end

   def parse!
-    @m = RMail::Parser.read @rawbody
+    @m = Mail.read_from_string @rawbody

-    @msgid = find_msgids(decode_header(validate_field(:message_id, @m.header["message-id"]))).first
-    ## this next error happens if we have a field, but we can't find a <something> in it
-    raise InvalidMessageError, "can't parse msgid: #{@m.header['message-id']}" unless @msgid
+    # Mail::MessageIdField.message_id returns the msgid with < and >, which is not correct
+    unless @m.message_id
+      @m.message_id = "<#{Time.now.to_i}-defaulted-#{munge_msgid @m.header.to_s}@heliotrope>"
+    end
+    @msgid = @m.message_id
     @safe_msgid = munge_msgid @msgid

-    @from = Person.from_string decode_header(validate_field(:from, @m.header["from"]))
-    @date = begin
-      Time.parse(validate_field(:date, @m.header["date"])).to_i
-    rescue ArgumentError
-      #puts "warning: invalid date field #{@m.header['date']}"
-      0
-    end
+    @from = Person.from_string @m.fetch_header(:from)

-    @to = Person.many_from_string decode_header(@m.header["to"])
-    @cc = Person.many_from_string decode_header(@m.header["cc"])
-    @bcc = Person.many_from_string decode_header(@m.header["bcc"])
-    @subject = decode_header @m.header["subject"]
-    @reply_to = Person.from_string @m.header["reply-to"]
+    @sender = begin
+      # Mail::SenderField.sender returns an array, not a String
+      Person.from_string @m.fetch_header(:sender)
+      rescue InvalidMessageError
+        ""
+    end

-    @refs = find_msgids decode_header(@m.header["references"] || "")
-    in_reply_to = find_msgids decode_header(@m.header["in-reply-to"] || "")
-    @refs += in_reply_to unless @refs.member? in_reply_to.first
-    @safe_refs = @refs.map { |r| munge_msgid(r) }
+    @date = (@m.date || Time.now).to_time.to_i
+
+    @to = Person.many_from_string(@m.fetch_header(:to))
+    @cc = Person.many_from_string(@m.fetch_header(:cc))
+    @bcc = Person.many_from_string(@m.fetch_header(:bcc))
+    @subject =  (@m.subject || "")
+    @reply_to = Person.from_string(@m.fetch_header(:reply_to))
+
+    # same as message_id : we must use message_ids to get them without <
+    # and >
+    begin
+      @refs = @m.fetch_message_ids(:references)
+      in_reply_to = @m.fetch_message_ids(:in_reply_to)
+    rescue Mail::Field::FieldError => e
+      raise InvalidMessageError, e.message
+    end
+    @refs += in_reply_to unless @refs.member?(in_reply_to.first)
+    @safe_refs = @refs.nil? ? [] : @refs.compact.map { |r| munge_msgid(r) }

     ## various other headers that you don't think we will need until we
     ## actually need them.

     ## this is sometimes useful for determining who was the actual target of
     ## the email, in the case that someone has aliases
-    @recipient_email = @m.header["envelope-to"] || @m.header["x-original-to"] || @m.header["delivered-to"]
+    @recipient_email = @m.fetch_header(:envelope_to) || @m.fetch_header(:x_original_to) || @m.fetch_header(:delivered_to)

-    @list_id = @m.header["list-id"]
-    @list_subscribe = @m.header["list-subscribe"]
-    @list_unsubscribe = @m.header["list-unsubscribe"]
-    @list_post = @m.header["list-post"] || @m.header["x-mailing-list"]
+    @list_subscribe = @m.fetch_header(:list_subscribe)
+    @list_unsubscribe = @m.fetch_header(:list_unsubscribe)
+    @list_post = @m.fetch_header(:list_post) || @m.fetch_header(:x_mailing_list)

     self
   end
@@ -75,8 +104,8 @@ class Message

     { :from => (from ? from.to_email_address : ""),
       :to => to.map(&:to_email_address),
-      :cc => cc.map(&:to_email_address),
-      :bcc => bcc.map(&:to_email_address),
+      :cc => (cc || []).map(&:to_email_address),
+      :bcc => (bcc || []).map(&:to_email_address),
       :subject => subject,
       :date => date,
       :refs => refs,
@@ -95,8 +124,8 @@ class Message
   end

   def direct_recipients; to end
-  def indirect_recipients; cc + bcc end
-  def recipients; direct_recipients + indirect_recipients end
+  def indirect_recipients; (cc || []) + (bcc || []) end
+  def recipients; (direct_recipients || []) + (indirect_recipients || []) end

   def indexable_text
     @indexable_text ||= begin
@@ -134,10 +163,7 @@ class Message
   end

   def has_attachment?
-    @has_attachment ||=
-      mime_parts("text/plain").any? do |type, fn, id, content|
-        fn && (type !~ SIGNATURE_ATTACHMENT_TYPE)
-    end
+    @m.has_attachments? # defined in the mail gem
   end

   def signed?
@@ -159,13 +185,9 @@ private
     Digest::MD5.hexdigest msgid
   end

-  def find_msgids msgids
-    msgids.scan(/<(.+?)>/).map(&:first)
-  end
-
   def mime_part_types part=@m
-    ptype = part.header["content-type"] || ""
-    [ptype] + (part.multipart? ? part.body.map { |sub| mime_part_types sub } : [])
+    ptype = part.fetch_header(:content_type)
+    [ptype] + (part.multipart? ? part.body.parts.map { |sub| mime_part_types sub } : [])
   end

   ## unnests all the mime stuff and returns a list of [type, filename, content]
@@ -176,14 +198,14 @@ private
   def decode_mime_parts part, preferred_type, level=0
     if part.multipart?
       if mime_type_for(part) =~ /multipart\/alternative/
-        target = part.body.find { |p| mime_type_for(p).index(preferred_type) } || part.body.first
+        target = part.body.parts.find { |p| mime_type_for(p).index(preferred_type) } || part.body.parts.first
         if target # this can be nil
           decode_mime_parts target, preferred_type, level + 1
         else
           []
         end
       else # decode 'em all
-        part.body.compact.map { |subpart| decode_mime_parts subpart, preferred_type, level + 1 }.flatten 1
+        part.body.parts.compact.map { |subpart| decode_mime_parts subpart, preferred_type, level + 1 }.flatten 1
       end
     else
       type = mime_type_for part
@@ -204,11 +226,11 @@ private
   end

   def mime_type_for part
-    (part.header["content-type"] || "text/plain").gsub(/\s+/, " ").strip.downcase
+    (part.fetch_header(:content_type) || "text/plain").gsub(/\s+/, " ").strip.downcase
   end

   def mime_id_for part
-    header = part.header["content-id"]
+    header = part.fetch_header(:content_id)
     case header
       when /<(.+?)>/; $1
       else header
@@ -217,8 +239,8 @@ private

   ## a filename, or nil
   def mime_filename_for part
-    cd = part.header["Content-Disposition"]
-    ct = part.header["Content-Type"]
+    cd = part.fetch_header(:content_disposition)
+    ct = part.fetch_header(:content_type)

     ## RFC 2183 (Content-Disposition) specifies that disposition-parms are
     ## separated by ";". So, we match everything up to " and ; (if present).
@@ -229,20 +251,7 @@ private
     end

     ## filename could be RFC2047 encoded
-    decode_header(filename).chomp if filename
-  end
-
-  ## rfc2047-decode a header, convert to utf-8, and normalize whitespace
-  def decode_header v
-    return "" if v.nil?
-
-    v = if Decoder.is_rfc2047_encoded? v
-      Decoder.decode_rfc2047 "utf-8", v
-    else # assume it's ascii and transcode
-      Decoder.transcode "utf-8", "ascii", v
-    end
-
-    v.gsub(/\s+/, " ").strip
+    filename.chomp if filename
   end

   CONVERSIONS = {
@@ -255,11 +264,10 @@ private
   def mime_content_for mime_part, preferred_type
     return "" unless mime_part.body # sometimes this happens. not sure why.

-    mt = mime_type_for(mime_part) || "text/plain" # i guess
-    content_type = if mt =~ /^(.+);/ then $1.downcase else mt end
-    source_charset = if mt =~ /charset="?(.*?)"?(;|$)/i then $1 else "US-ASCII" end
+    content_type = mime_part.fetch_header(:content_type) || "text/plain"
+    source_charset = mime_part.charset || "US-ASCII"

-    content = mime_part.decode
+    content = mime_part.decoded
     converted_content, converted_charset = if(converter = CONVERSIONS[[content_type, preferred_type]])
       send converter, content, source_charset
     else

gauteh · 2013-04-13T13:14:40Z

I noticed the Mail project and wondered what effort it would require - this would be great! Would it be possible to just swap or does messageids and such get garbled?

rakoo · 2013-04-13T16:11:13Z

I'm afraid the management is completely different, so you'd have to reindex all your mails. Sup works by reading the Message-Id header and do some funny logic on it to please the index. Heliotrope's messages store 2 ids : the real Message-Id, and a munge_msgid which is an MD5 hash of the Message-Id. The reason behind is the same (the index doesn't like funny tokens) but I think it's much more simple, and it looks like William thought so.

gauteh · 2013-04-19T22:39:16Z

I got something very preliminary going over here: https://github.com/gauteh/sup/tree/rmail-to-mail-remove-iconv-ruby-2-comp, kind of have to do it all at once to get it working. I think some of the stuff should be slightly re-designed in heliotrope style.

gauteh · 2013-05-26T13:24:12Z

Upstream pull request: matta/rubymail#2

gauteh · 2013-09-28T17:44:18Z

Fixed by using rmail-sup.

* rmail -> rmail-sup, it had a bug in 2.0 encoding and sup is maintaining now sup-heliotrope/sup#22 * permutation -> more_math, author combined libraries * apostrophes got html-encoded

gauteh mentioned this issue May 20, 2013

Rmail to Mail: initial work #61

Closed

gauteh mentioned this issue May 26, 2013

Remove Iconv and use built-in encoding, more robust UTF-8 and encoding handling #72

Merged

gauteh mentioned this issue Aug 14, 2013

Use Mail gem over RMail #127

Open

gauteh closed this as completed Sep 28, 2013

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RMail has invalid byte sequence #22

RMail has invalid byte sequence #22

gauteh commented Apr 13, 2013

gauteh commented Apr 13, 2013

rakoo commented Apr 13, 2013

gauteh commented Apr 13, 2013

rakoo commented Apr 13, 2013

gauteh commented Apr 19, 2013

gauteh commented May 26, 2013

gauteh commented Sep 28, 2013

RMail has invalid byte sequence #22

RMail has invalid byte sequence #22

Comments

gauteh commented Apr 13, 2013

gauteh commented Apr 13, 2013

rakoo commented Apr 13, 2013

gauteh commented Apr 13, 2013

rakoo commented Apr 13, 2013

gauteh commented Apr 19, 2013

gauteh commented May 26, 2013

gauteh commented Sep 28, 2013