Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RMail has invalid byte sequence #22

Closed
gauteh opened this issue Apr 13, 2013 · 7 comments
Closed

RMail has invalid byte sequence #22

gauteh opened this issue Apr 13, 2013 · 7 comments

Comments

@gauteh
Copy link
Member

gauteh commented Apr 13, 2013

/home/gaute/.gem/ruby/2.0.0/gems/rmail-1.0.0/lib/rmail/address.rb:694: invalid multibyte escape: /\A[\200-\377\w!$%&\'*+\/=?^_\`{\}|~#-]+/m
/home/gaute/.gem/ruby/2.0.0/gems/rmail-1.0.0/lib/rmail.rb:41:in `<top (required)>'
/home/gaute/dev/ruby/sup/lib/sup.rb:8:in `<top (required)>'
/home/gaute/dev/ruby/sup/sup-version.rb:9:in `<top (required)>'
/home/gaute/dev/ruby/sup/Rakefile:49:in `require_relative'
/home/gaute/dev/ruby/sup/Rakefile:49:in `<top (required)>'
@gauteh
Copy link
Member Author

gauteh commented Apr 13, 2013

Seems to do the trick, but I guess the gem is also outdated...:

diff -ru old/lib/rmail/address.rb new/lib/rmail/address.rb
--- old/lib/rmail/address.rb    1970-01-01 01:00:00.000000000 +0100
+++ new/lib/rmail/address.rb    2013-04-13 14:26:00.713612620 +0200
@@ -691,7 +691,7 @@
        @sym = SYM_DOMAIN_LITERAL
        @lexeme = $1.gsub(/(^|[^\\])[\r\n\t ]+/, '\1').gsub(/\\(.)/, '\1')
        break
-          when /\A[\200-\377\w!$%&\'*+\/=?^_\`{\}|~#-]+/m
+          when /\A[\200-\377\w!$%&\'*+\/=?^_\`{\}|~#-]+/nm
             # This is just like SYM_ATOM, but includes all characters
             # with high bits.  This is so we can allow such tokens in
             # the display name portion of an address even though it
diff -ru old/lib/rmail/header.rb new/lib/rmail/header.rb
--- old/lib/rmail/header.rb 1970-01-01 01:00:00.000000000 +0100
+++ new/lib/rmail/header.rb 2013-04-13 14:26:00.720279286 +0200
@@ -73,7 +73,7 @@

     class Field                 # :nodoc:
       # fixme, document methadology for this (RFC2822)
-      EXTRACT_FIELD_NAME_RE = /\A([^\x00-\x1f\x7f-\xff :]+):\s*/o
+      EXTRACT_FIELD_NAME_RE = /\A([^\x00-\x1f\x7f-\xff :]+):\s*/no

       class << self
         def parse(field)

@rakoo
Copy link

rakoo commented Apr 13, 2013

I switched heliotrope from RMail to the mail gem (https://github.com/mikel/mail). The new message.rb is API-compatible, it could be interesting to switch to it.

Here's the diff for that file :

diff --git a/lib/heliotrope/message.rb b/lib/heliotrope/message.rb
index 0c928cd..d87cca9 100644
--- a/lib/heliotrope/message.rb
+++ b/lib/heliotrope/message.rb
@@ -1,10 +1,29 @@
 # encoding: UTF-8

-require 'rmail'
+require 'mail'
 require 'digest/md5'
 require 'json'
 require 'timeout'

+module Mail
+  class Message
+
+    # a common interface that matches all the field
+    # IMPORTANT : if not existing, it must return nil
+    def fetch_header field
+      sym = field.to_sym
+      self[sym] ? self[sym].to_s : nil
+    end
+
+    # Make sure the message has valid message ids for the message, and
+    # fetch them
+    def fetch_message_ids field
+      self[field] ? self[field].message_ids || [self[field].message_id] : []
+    end
+
+  end
+end
+
 module Heliotrope
 class InvalidMessageError < StandardError; end
 class Message
@@ -14,43 +33,53 @@ class Message
   end

   def parse!
-    @m = RMail::Parser.read @rawbody
+    @m = Mail.read_from_string @rawbody

-    @msgid = find_msgids(decode_header(validate_field(:message_id, @m.header["message-id"]))).first
-    ## this next error happens if we have a field, but we can't find a <something> in it
-    raise InvalidMessageError, "can't parse msgid: #{@m.header['message-id']}" unless @msgid
+    # Mail::MessageIdField.message_id returns the msgid with < and >, which is not correct
+    unless @m.message_id
+      @m.message_id = "<#{Time.now.to_i}-defaulted-#{munge_msgid @m.header.to_s}@heliotrope>"
+    end
+    @msgid = @m.message_id
     @safe_msgid = munge_msgid @msgid

-    @from = Person.from_string decode_header(validate_field(:from, @m.header["from"]))
-    @date = begin
-      Time.parse(validate_field(:date, @m.header["date"])).to_i
-    rescue ArgumentError
-      #puts "warning: invalid date field #{@m.header['date']}"
-      0
-    end
+    @from = Person.from_string @m.fetch_header(:from)

-    @to = Person.many_from_string decode_header(@m.header["to"])
-    @cc = Person.many_from_string decode_header(@m.header["cc"])
-    @bcc = Person.many_from_string decode_header(@m.header["bcc"])
-    @subject = decode_header @m.header["subject"]
-    @reply_to = Person.from_string @m.header["reply-to"]
+    @sender = begin
+      # Mail::SenderField.sender returns an array, not a String
+      Person.from_string @m.fetch_header(:sender)
+      rescue InvalidMessageError
+        ""
+    end

-    @refs = find_msgids decode_header(@m.header["references"] || "")
-    in_reply_to = find_msgids decode_header(@m.header["in-reply-to"] || "")
-    @refs += in_reply_to unless @refs.member? in_reply_to.first
-    @safe_refs = @refs.map { |r| munge_msgid(r) }
+    @date = (@m.date || Time.now).to_time.to_i
+
+    @to = Person.many_from_string(@m.fetch_header(:to))
+    @cc = Person.many_from_string(@m.fetch_header(:cc))
+    @bcc = Person.many_from_string(@m.fetch_header(:bcc))
+    @subject =  (@m.subject || "")
+    @reply_to = Person.from_string(@m.fetch_header(:reply_to))
+
+    # same as message_id : we must use message_ids to get them without <
+    # and >
+    begin
+      @refs = @m.fetch_message_ids(:references)
+      in_reply_to = @m.fetch_message_ids(:in_reply_to)
+    rescue Mail::Field::FieldError => e
+      raise InvalidMessageError, e.message
+    end
+    @refs += in_reply_to unless @refs.member?(in_reply_to.first)
+    @safe_refs = @refs.nil? ? [] : @refs.compact.map { |r| munge_msgid(r) }

     ## various other headers that you don't think we will need until we
     ## actually need them.

     ## this is sometimes useful for determining who was the actual target of
     ## the email, in the case that someone has aliases
-    @recipient_email = @m.header["envelope-to"] || @m.header["x-original-to"] || @m.header["delivered-to"]
+    @recipient_email = @m.fetch_header(:envelope_to) || @m.fetch_header(:x_original_to) || @m.fetch_header(:delivered_to)

-    @list_id = @m.header["list-id"]
-    @list_subscribe = @m.header["list-subscribe"]
-    @list_unsubscribe = @m.header["list-unsubscribe"]
-    @list_post = @m.header["list-post"] || @m.header["x-mailing-list"]
+    @list_subscribe = @m.fetch_header(:list_subscribe)
+    @list_unsubscribe = @m.fetch_header(:list_unsubscribe)
+    @list_post = @m.fetch_header(:list_post) || @m.fetch_header(:x_mailing_list)

     self
   end
@@ -75,8 +104,8 @@ class Message

     { :from => (from ? from.to_email_address : ""),
       :to => to.map(&:to_email_address),
-      :cc => cc.map(&:to_email_address),
-      :bcc => bcc.map(&:to_email_address),
+      :cc => (cc || []).map(&:to_email_address),
+      :bcc => (bcc || []).map(&:to_email_address),
       :subject => subject,
       :date => date,
       :refs => refs,
@@ -95,8 +124,8 @@ class Message
   end

   def direct_recipients; to end
-  def indirect_recipients; cc + bcc end
-  def recipients; direct_recipients + indirect_recipients end
+  def indirect_recipients; (cc || []) + (bcc || []) end
+  def recipients; (direct_recipients || []) + (indirect_recipients || []) end

   def indexable_text
     @indexable_text ||= begin
@@ -134,10 +163,7 @@ class Message
   end

   def has_attachment?
-    @has_attachment ||=
-      mime_parts("text/plain").any? do |type, fn, id, content|
-        fn && (type !~ SIGNATURE_ATTACHMENT_TYPE)
-    end
+    @m.has_attachments? # defined in the mail gem
   end

   def signed?
@@ -159,13 +185,9 @@ private
     Digest::MD5.hexdigest msgid
   end

-  def find_msgids msgids
-    msgids.scan(/<(.+?)>/).map(&:first)
-  end
-
   def mime_part_types part=@m
-    ptype = part.header["content-type"] || ""
-    [ptype] + (part.multipart? ? part.body.map { |sub| mime_part_types sub } : [])
+    ptype = part.fetch_header(:content_type)
+    [ptype] + (part.multipart? ? part.body.parts.map { |sub| mime_part_types sub } : [])
   end

   ## unnests all the mime stuff and returns a list of [type, filename, content]
@@ -176,14 +198,14 @@ private
   def decode_mime_parts part, preferred_type, level=0
     if part.multipart?
       if mime_type_for(part) =~ /multipart\/alternative/
-        target = part.body.find { |p| mime_type_for(p).index(preferred_type) } || part.body.first
+        target = part.body.parts.find { |p| mime_type_for(p).index(preferred_type) } || part.body.parts.first
         if target # this can be nil
           decode_mime_parts target, preferred_type, level + 1
         else
           []
         end
       else # decode 'em all
-        part.body.compact.map { |subpart| decode_mime_parts subpart, preferred_type, level + 1 }.flatten 1
+        part.body.parts.compact.map { |subpart| decode_mime_parts subpart, preferred_type, level + 1 }.flatten 1
       end
     else
       type = mime_type_for part
@@ -204,11 +226,11 @@ private
   end

   def mime_type_for part
-    (part.header["content-type"] || "text/plain").gsub(/\s+/, " ").strip.downcase
+    (part.fetch_header(:content_type) || "text/plain").gsub(/\s+/, " ").strip.downcase
   end

   def mime_id_for part
-    header = part.header["content-id"]
+    header = part.fetch_header(:content_id)
     case header
       when /<(.+?)>/; $1
       else header
@@ -217,8 +239,8 @@ private

   ## a filename, or nil
   def mime_filename_for part
-    cd = part.header["Content-Disposition"]
-    ct = part.header["Content-Type"]
+    cd = part.fetch_header(:content_disposition)
+    ct = part.fetch_header(:content_type)

     ## RFC 2183 (Content-Disposition) specifies that disposition-parms are
     ## separated by ";". So, we match everything up to " and ; (if present).
@@ -229,20 +251,7 @@ private
     end

     ## filename could be RFC2047 encoded
-    decode_header(filename).chomp if filename
-  end
-
-  ## rfc2047-decode a header, convert to utf-8, and normalize whitespace
-  def decode_header v
-    return "" if v.nil?
-
-    v = if Decoder.is_rfc2047_encoded? v
-      Decoder.decode_rfc2047 "utf-8", v
-    else # assume it's ascii and transcode
-      Decoder.transcode "utf-8", "ascii", v
-    end
-
-    v.gsub(/\s+/, " ").strip
+    filename.chomp if filename
   end

   CONVERSIONS = {
@@ -255,11 +264,10 @@ private
   def mime_content_for mime_part, preferred_type
     return "" unless mime_part.body # sometimes this happens. not sure why.

-    mt = mime_type_for(mime_part) || "text/plain" # i guess
-    content_type = if mt =~ /^(.+);/ then $1.downcase else mt end
-    source_charset = if mt =~ /charset="?(.*?)"?(;|$)/i then $1 else "US-ASCII" end
+    content_type = mime_part.fetch_header(:content_type) || "text/plain"
+    source_charset = mime_part.charset || "US-ASCII"

-    content = mime_part.decode
+    content = mime_part.decoded
     converted_content, converted_charset = if(converter = CONVERSIONS[[content_type, preferred_type]])
       send converter, content, source_charset
     else

@gauteh
Copy link
Member Author

gauteh commented Apr 13, 2013

I noticed the Mail project and wondered what effort it would require - this would be great! Would it be possible to just swap or does messageids and such get garbled?

@rakoo
Copy link

rakoo commented Apr 13, 2013

I'm afraid the management is completely different, so you'd have to reindex all your mails. Sup works by reading the Message-Id header and do some funny logic on it to please the index. Heliotrope's messages store 2 ids : the real Message-Id, and a munge_msgid which is an MD5 hash of the Message-Id. The reason behind is the same (the index doesn't like funny tokens) but I think it's much more simple, and it looks like William thought so.

@gauteh
Copy link
Member Author

gauteh commented Apr 19, 2013

I got something very preliminary going over here: https://github.com/gauteh/sup/tree/rmail-to-mail-remove-iconv-ruby-2-comp, kind of have to do it all at once to get it working. I think some of the stuff should be slightly re-designed in heliotrope style.

@gauteh
Copy link
Member Author

gauteh commented May 26, 2013

Upstream pull request: matta/rubymail#2

@gauteh
Copy link
Member Author

gauteh commented Sep 28, 2013

Fixed by using rmail-sup.

@gauteh gauteh closed this as completed Sep 28, 2013
pushcx added a commit to pushcx/chibrary.com that referenced this issue Apr 23, 2015
* rmail -> rmail-sup, it had a bug in 2.0 encoding and sup is maintaining now
  sup-heliotrope/sup#22
* permutation -> more_math, author combined libraries
* apostrophes got html-encoded
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants