[#39863] forループの速度 — Masahiro Sato <msato@...>

15 messages 2004/07/20

[#39868] イテレータとfor文 — OOTANI TAKASHI <otn@...5.so-net.ne.jp>

大谷と申します。

31 messages 2004/07/20
[#39886] Re: イテレータとfor文 — Tietew <tietew-ml-ruby-list@...> 2004/07/21

[ruby-list:39908] htreeの高速化

From: MoonWolf <moonwolf@...>
Date: 2004-07-23 13:23:02 UTC
List: ruby-list #39908
MoonWolfです。

htree(http://raa.ruby-lang.org/project/htree/)で大きなHTMLをパースすると
メモリを大量に消費し処理時間も長いので、高速化・省メモリ化を試みました。
(http://devlog.moonwolf.com/20040723.html#p01)

添付したパッチが、その成果です。

1. 正規表現のオプション
 //oオプションを付け忘れている箇所があったので//oを付け足しました。

2. HTree::Nameオブジェクトの共有化
 immutableなオブジェクトなのでHTree::NAMEというHashにキャッシュして
Name.new(arg)をNAME[arg]に書き換えました。
 ObjectSpaceからオブジェクト数を調べて見ると21,504個あったHTree::Nameが
29個と激減しメモリ使用量も減りました。

3. caseからif〜elsifへの書き換えと比較順序の変更
 一番出現頻度が高いと思われる:text_pcdataを最初にチェックするようにしま
した。

3.5MBのHTMLのparseでメモリ使用量が209MBから156MBまで減りました。しかし、
156MBでもまだ多いと思います。これ以外に改善する方法はなにかないですかねぇ……

Attachments (3)

tag.rb.patch (15.6 KB, text/x-diff)
--- tag.rb.orig	2004-05-06 14:53:34.000000000 +0900
+++ parse.rb	2004-07-23 15:12:32.000000000 +0900
@@ -1,111 +1,407 @@
-require 'htree/raw_string'
+require 'htree/scan'
+require 'htree/htmlinfo'
 require 'htree/text'
-require 'htree/scan' # for Pat::Name and Pat::Nmtoken
+require 'htree/tag'
+require 'htree/leaf'
+require 'htree/doc'
+require 'htree/elem'
+require 'htree/raw_string'
 require 'htree/context'
-require 'htree/name'
+require 'htree/encoder'
 
 module HTree
+  # HTree.parse parses <i>input</i> and return a document tree.
+  # represented by HTree::Doc.
+  #
+  # <i>input</i> should be a String or
+  # an object which respond to read or open method.
+  # For example, IO, StringIO, Pathname, URI::HTTP and URI::FTP are acceptable.
+  # Note that the URIs need open-uri.
+  #
+  # HTree.parse guesses <i>input</i> is HTML or not and XML or not.
+  #
+  # If it is guessed as HTML, the default namespace in the result is set to http://www.w3.org/1999/xhtml
+  # regardless of <i>input</i> has XML namespace declaration or not nor even it is pre-XML HTML.
+  #
+  # If it is guessed as HTML and not XML, all element and attribute names are downcaseed. 
+  #
+  # If opened file or read content has charset method,
+  # HTree.parse decode it according to $KCODE before parsing.
+  # Otherwise HTree.parse assumes the character encoding of the content is
+  # compatible to $KCODE.
+  # Note that the charset method is provided by URI::HTTP with open-uri.
+  def HTree.parse(input)
+    parse_as(input, false)
+  end
+
+  # HTree.parse_xml parses <i>input</i> as XML and
+  # return a document tree represented by HTree::Doc.
+  #
+  # It behaves almost same as HTree.parse but it assumes <i>input</> is XML
+  # even if no XML declaration.
+  # The assumption causes following differences.
+  # * doesn't downcase element name.
+  # * The content of <script> and <style> element is PCDATA, not CDATA.
+  def HTree.parse_xml(input)
+    parse_as(input, true)
+  end
+
   # :stopdoc:
 
-  class STag
-    def initialize(name, attributes=[], inherited_context=DefaultContext)
-      init_raw_string
-      # normalize xml declaration name and attribute value.
-      attributes = attributes.map {|aname, val|
-        if !(Name === aname) && /\A(?:#{Pat::Name}?\{.*\})?#{Pat::Nmtoken}\z/ !~ aname
-          raise HTree::Error, "invalid attribute name: #{aname.inspect}"
-        end
-        if !(Name === aname) && /\Axmlns(?:\z|:)/ =~ aname
-          aname = Name.parse_attribute_name(aname, nil)
-        end
-        val = val.to_node if HTree::Location === val
-        val = Text.new(val) unless Text === val
-        [aname, val]
+  def HTree.parse_as(input, is_xml)
+    input_charset = nil
+    if input.tainted? && 1 <= $SAFE
+      raise SecurityError, "input tainted"
+    end
+    if input.respond_to? :read # IO, StringIO
+      input = input.read.untaint
+      input_charset = input.charset if input.respond_to? :charset
+    elsif input.respond_to? :open # Pathname, URI with open-uri
+      input.open {|f|
+        input = f.read.untaint
+        input_charset = f.charset if f.respond_to? :charset
       }
+    end
+    if input_charset && input_charset != Encoder.internal_charset
+      input = Iconv.conv(Encoder.internal_charset, input_charset, input)
+    end
 
-      @inherited_context = inherited_context
-      @xmlns_decls = {}
+    tokens = []
+    is_xml, is_html = HTree.scan(input, is_xml) {|token|
+      tokens << token
+    }
+    context = is_html ? HTMLContext: DefaultContext
+    structure_list = parse_pairs(tokens, is_xml, is_html)
+    structure_list = fix_structure_list(structure_list, is_xml, is_html)
+    nodes = structure_list.map {|s| build_node(s, is_xml, is_html, context) }
+    Doc.new(nodes)
+  end
 
-      # validate namespace consistency of given Name objects.
-      if Name === name
-        @xmlns_decls[name.namespace_prefix] = name.namespace_uri
-      end
-      attributes.each {|aname, text|
-        next unless Name === aname
-        next if aname.xmlns?
-        if aname.namespace_prefix && aname.namespace_uri
-          if @xmlns_decls.include? aname.namespace_prefix
-            if @xmlns_decls[aname.namespace_prefix] != aname.namespace_uri
-              raise ArgumentError, "inconsistent namespace use: #{aname.namespace_prefix} is used as #{@xmlns_decls[aname.namespace_prefix]} and #{aname.namespace_uri}"
-            end
-          else
-            @xmlns_decls[aname.namespace_prefix] = aname.namespace_uri
+  def HTree.parse_pairs(tokens, is_xml, is_html)
+    stack = [[nil, nil, []]]
+    tokens.each {|token|
+      case token[0]
+      when :stag
+        stag_raw_string = token[1]
+        stagname = stag_raw_string[Pat::Name]
+        stagname = stagname.downcase if !is_xml && is_html
+        stack << [stagname, stag_raw_string, []]
+      when :etag
+        etag_raw_string = token[1]
+        etagname = etag_raw_string[Pat::Name]
+        etagname = etagname.downcase if !is_xml && is_html
+        matched_elem = nil
+        stack.reverse_each {|elem|
+          stagname, _, _ = elem
+          if stagname == etagname
+            matched_elem = elem
+            break
+          end
+        }
+        if matched_elem
+          until matched_elem.equal? stack.last
+            stagname, stag_raw_string, children = stack.pop
+            stack.last[2] << [:elem, stag_raw_string, children]
           end
+          stagname, stag_raw_string, children = stack.pop
+          stack.last[2] << [:elem, stag_raw_string, children, etag_raw_string]
+        else
+          stack.last[2] << [:bogus_etag, etag_raw_string]
         end
-      }
+      else
+        stack.last[2] << token
+      end
+    }
+    elem = nil
+    while 1 < stack.length
+      stagname, stag_raw_string, children = stack.pop
+      stack.last[2] << [:elem, stag_raw_string, children]
+    end
+    stack[0][2]
+  end
+
+  def HTree.fix_structure_list(structure_list, is_xml, is_html)
+    result = []
+    rest = structure_list.dup
+    until rest.empty?
+      structure = rest.shift
+      if structure[0] == :elem
+        elem, rest2 = fix_element(structure, [], [], is_xml, is_html)
+        result << elem
+        rest = rest2 + rest
+      else
+        result << structure
+      end
+    end
+    result
+  end
 
-      attributes.each {|aname, text|
-        next unless Name === aname
-        next unless aname.xmlns?
-        next if @xmlns_decls.include? aname.local_name
-        if aname.local_name
-          @xmlns_decls[aname.local_name] = text.to_s
+  def HTree.fix_element(elem, excluded_tags, included_tags, is_xml, is_html)
+    stag_raw_string = elem[1]
+    children = elem[2]
+    if etag_raw_string = elem[3]
+      return [:elem, stag_raw_string, fix_structure_list(children, is_xml, is_html), etag_raw_string], []
+    else
+      tagname = stag_raw_string[Pat::Name]
+      tagname = tagname.downcase if !is_xml && is_html
+      if ElementContent[tagname] == :EMPTY
+        return [:elem, stag_raw_string, []], children
+      else
+        if ElementContent[tagname] == :CDATA
+          possible_tags = []
         else
-          uri = text.to_s
-          @xmlns_decls[nil] = uri
+          possible_tags = ElementContent[tagname]
+        end
+        if possible_tags
+          excluded_tags2 = ElementExclusions[tagname]
+          included_tags2 = ElementInclusions[tagname]
+          excluded_tags |= excluded_tags2 if excluded_tags2
+          included_tags |= included_tags2 if included_tags2
+          containable_tags = (possible_tags | included_tags) - excluded_tags
+          uncontainable_tags = ElementContent.keys - containable_tags
+        else
+          # If the tagname is unknown, it is assumed that any element
+          # except excluded can be contained.
+          uncontainable_tags = excluded_tags
+        end
+        fixed_children = []
+        rest = children
+        until rest.empty?
+          if rest[0][0] == :elem
+            elem = rest.shift
+            elem_tagname = elem[1][Pat::Name]
+            elem_tagname = elem_tagname.downcase if !is_xml && is_html
+            if uncontainable_tags.include? elem_tagname
+              rest.unshift elem
+              break
+            else
+              fixed_elem, rest2 = fix_element(elem, excluded_tags, included_tags, is_xml, is_html)
+              fixed_children << fixed_elem
+              rest = rest2 + rest
+            end
+          else
+            fixed_children << rest.shift
+          end
         end
+        return [:elem, stag_raw_string, fixed_children], rest
+      end
+    end
+  end
+
+  def HTree.build_node(structure, is_xml, is_html, inherited_context=DefaultContext)
+    s = structure[0]
+    if s==:text_pcdata
+      Text.parse_pcdata(structure[1])
+    elsif s==:elem
+      _, stag_rawstring, children, etag_rawstring = structure
+      etag = etag_rawstring && ETag.parse(etag_rawstring, is_xml, is_html)
+      stag = STag.parse(stag_rawstring, is_xml, is_html, inherited_context)
+      if !children.empty? || etag
+        Elem.new!(stag,
+                  children.map {|c| build_node(c, is_xml, is_html, stag.context) },
+                  etag)
+      else
+        Elem.new!(stag)
+      end
+    elsif s==:emptytag
+      Elem.new!(STag.parse(structure[1], is_xml, is_html, inherited_context))
+    elsif s==:bogus_etag
+      BogusETag.parse(structure[1], is_xml, is_html)
+    elsif s==:xmldecl
+      XMLDecl.parse(structure[1])
+    elsif s==:doctype
+      DocType.parse(structure[1], is_xml, is_html)
+    elsif s==:procins
+      ProcIns.parse(structure[1])
+    elsif s==:comment
+      Comment.parse(structure[1])
+    elsif s==:text_cdata_content
+      Text.parse_cdata_content(structure[1])
+    elsif s==:text_cdata_section
+      Text.parse_cdata_section(structure[1])
+    else
+      raise Exception, "[bug] unknown structure: #{structure.inspect}"
+    end
+  end
+
+  def STag.parse(raw_string, is_xml, is_html, inherited_context=DefaultContext)
+    if /\A#{Pat::StartTag}\z/o =~ raw_string
+      is_stag = true
+    elsif /\A#{Pat::EmptyTag}\z/o =~ raw_string
+      is_stag = false
+    else
+      raise HTree::Error, "cannot recognize as start tag or empty tag: #{raw_string.inspect}"
+    end
+
+    attrs = []
+    if (is_stag ? /\A#{Pat::ValidStartTag_C}\z/o : /\A#{Pat::ValidEmptyTag_C}\z/o) =~ raw_string
+      qname = $1
+      $2.scan(Pat::ValidAttr_C) {
+        attrs << ($5 ? [nil, $5] : [$1, $2 || $3 || $4])
+      }
+    elsif (is_stag ? /\A#{Pat::InvalidStartTag_C}\z/o : /\A#{Pat::InvalidEmptyTag_C}\z/o) =~ raw_string
+      qname = $1
+      last_attr = $3
+      $2.scan(Pat::InvalidAttr1_C) {
+        attrs << ($5 ? [nil, $5] : [$1, $2 || $3 || $4])
       }
+      if last_attr
+        /#{Pat::InvalidAttr1End_C}/o =~ last_attr
+        attrs << [$1, $2 || $3]
+      end
+    else
+      raise Exception, "[bug] cannot recognize as start tag: #{raw_string.inspect}"
+    end
 
-      @context = make_context(@inherited_context)
+    qname = qname.downcase if !is_xml && is_html
 
-      if Name === name
-        @name = name
+    attrs.map! {|aname, aval|
+      if aname
+        aname = (!is_xml && is_html) ? aname.downcase : aname
+        [aname, Text.parse_pcdata(aval)]
       else
-        @name = Name.parse_element_name(name, @context)
+        if val2name = OmittedAttrName[qname]
+          aval_downcase = aval.downcase
+          aname = val2name.fetch(aval_downcase, aval_downcase)
+        else
+          aname = aval
+        end
+        [aname, Text.new(aval)]
       end
+    }
+
+    result = STag.new(qname, attrs, inherited_context)
+    result.raw_string = raw_string
+    result
+  end
+
+  def ETag.parse(raw_string, is_xml, is_html)
+    unless /\A#{Pat::EndTag_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as end tag: #{raw_string.inspect}"
+    end
+
+    qname = $1
+    qname = qname.downcase if !is_xml && is_html
+
+    result = self.new(qname)
+    result.raw_string = raw_string
+    result
+  end
+
+  def BogusETag.parse(raw_string, is_xml, is_html)
+    unless /\A#{Pat::EndTag_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as end tag: #{raw_string.inspect}"
+    end
+
+    qname = $1
+    qname = qname.downcase if !is_xml && is_html
 
-      @attributes = attributes.map {|aname, text|
-        aname = Name.parse_attribute_name(aname, @context) unless Name === aname
-        if !aname.namespace_prefix && !aname.namespace_uri.empty?
-          # xxx: should recover error?
-          raise HTree::Error, "global attribute without namespace prefix: #{aname.inspect}"
+    result = self.new(qname)
+    result.raw_string = raw_string
+    result
+  end
+
+  def Text.parse_pcdata(raw_string)
+    fixed = raw_string.gsub(/&(?:(?:#[0-9]+|#x[0-9a-fA-F]+|([A-Za-z][A-Za-z0-9]*));?)?/o) {|s|
+      name = $1
+      case s
+      when /;\z/
+        s
+      when /\A&#/
+        "#{s};"
+      when '&'
+        '&amp;'
+      else 
+        if NamedCharactersPattern =~ name
+          "&#{name};"
+        else
+          "&amp;#{name}"
         end
-        [aname, text]
-      }
-      @attributes.freeze
+      end
+    }
+    result = new!(fixed)
+    result.raw_string = raw_string
+    result
+  end
+
+  def Text.parse_cdata_content(raw_string)
+    result = Text.new(raw_string)
+    result.raw_string = raw_string
+    result
+  end
+
+  def Text.parse_cdata_section(raw_string)
+    unless /\A#{Pat::CDATA_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as CDATA section: #{raw_string.inspect}"
     end
-    attr_reader :attributes, :inherited_context, :context
 
-    def element_name
-      @name
+    content = $1
+
+    result = Text.new(content)
+    result.raw_string = raw_string
+    result
+  end
+
+  def XMLDecl.parse(raw_string)
+    unless /\A#{Pat::XmlDecl_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as XML declaration: #{raw_string.inspect}"
     end
 
-    def make_context(inherited_context)
-      inherited_context.subst_namespaces(@xmlns_decls)
+    version = $1 || $2
+    encoding = $3 || $4
+    case $5 || $6
+    when 'yes'
+      standalone = true
+    when 'no'
+      standalone = false
+    else
+      standalone = nil
     end
 
-    def each_namespace_attribute
-      @xmlns_decls.each {|name, uri|
-        yield name, uri
-      }
-      nil
+    result = XMLDecl.new(version, encoding, standalone)
+    result.raw_string = raw_string
+    result
+  end
+
+  def DocType.parse(raw_string, is_xml, is_html)
+    unless /\A#{Pat::DocType_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as XML declaration: #{raw_string.inspect}"
     end
 
-    def each_attribute
-      @attributes.each {|name, text|
-        next if name.xmlns?
-        yield name, text
-      }
-      nil
+    root_element_name = $1
+    public_identifier = $2 || $3
+    system_identifier = $4 || $5
+
+    root_element_name = root_element_name.downcase if !is_xml && is_html
+
+    result = DocType.new(root_element_name, public_identifier, system_identifier)
+    result.raw_string = raw_string
+    result
+  end
+
+  def ProcIns.parse(raw_string)
+    unless /\A#{Pat::XmlProcIns_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as processing instruction: #{raw_string.inspect}"
     end
+
+    target = $1
+    content = $2
+
+    result = ProcIns.new(target, content)
+    result.raw_string = raw_string
+    result
   end
 
-  class ETag
-    def initialize(qualified_name)
-      init_raw_string
-      @qualified_name = qualified_name.dup.freeze
+  def Comment.parse(raw_string)
+    unless /\A#{Pat::Comment_C}\z/o =~ raw_string
+      raise HTree::Error, "cannot recognize as comment: #{raw_string.inspect}"
     end
-    attr_reader :qualified_name
+
+    content = $1
+
+    result = Comment.new(content)
+    result.raw_string = raw_string
+    result
   end
 
   # :startdoc:
name.rb.patch (2.63 KB, text/x-diff)
--- name.rb.orig	2004-04-20 15:53:30.000000000 +0900
+++ name.rb	2004-07-23 15:45:12.000000000 +0900
@@ -4,6 +4,11 @@
 module HTree
   # Name represents a element name and attribute name.
   # It consists of a namespace prefix, a namespace URI and a local name.
+  NAME = Hash.new {|hash,(namespace_prefix, namespace_uri, local_name)|
+    name = Name.new(namespace_prefix, namespace_uri, local_name)
+    hash[[namespace_prefix, namespace_uri, local_name]] = name
+    name
+  }
   class Name
 =begin
 element name                    prefix  uri     localname
@@ -21,30 +26,30 @@
       if /\{(.*)\}/ =~ name
         # "{u}n" means "use default namespace",
         # "p{u}n" means "use the specified prefix p"
-        $` == '' ? Name.new(nil, $1, $') : Name.new($`, $1, $')
+        $` == '' ? NAME[[nil, $1, $']] : NAME[[$`, $1, $']]
       elsif /:/ =~ name && !context.namespace_uri($`).empty?
-        Name.new($`, context.namespace_uri($`), $')
+        NAME[[$`, context.namespace_uri($`), $']]
       elsif !context.namespace_uri(nil).empty?
-        Name.new(nil, context.namespace_uri(nil), name)
+        NAME[[nil, context.namespace_uri(nil), name]]
       else
-        Name.new(nil, '', name)
+        NAME[[nil, '', name]]
       end
     end
 
     def Name.parse_attribute_name(name, context)
       if name == 'xmlns'
-        Name.new('xmlns', nil, nil)
+        NAME[['xmlns', nil, nil]]
       elsif /\Axmlns:/ =~ name
-        Name.new('xmlns', nil, $')
+        NAME[['xmlns', nil, $']]
       elsif /\{(.*)\}/ =~ name
         case $`
-        when ''; Name.new(nil, $1, $')
-        else Name.new($`, $1, $')
+        when ''; NAME[[nil, $1, $']]
+        else NAME[[$`, $1, $']]
         end
       elsif /:/ =~ name && !context.namespace_uri($`).empty?
-        Name.new($`, context.namespace_uri($`), $')
+        NAME[[$`, context.namespace_uri($`), $']]
       else
-        Name.new(nil, '', name)
+        NAME[[nil, '', name]]
       end
     end
 
@@ -52,10 +57,10 @@
       @namespace_prefix = namespace_prefix && namespace_prefix.dup.freeze
       @namespace_uri = namespace_uri && namespace_uri.dup.freeze
       @local_name = local_name && local_name.dup.freeze
-      if @namespace_prefix && /\A#{Pat::Nmtoken}\z/ !~ @namespace_prefix
+      if @namespace_prefix && /\A#{Pat::Nmtoken}\z/o !~ @namespace_prefix
         raise HTree::Error, "invalid namespace prefix: #{@namespace_prefix.inspect}"
       end
-      if @local_name && /\A#{Pat::Nmtoken}\z/ !~ @local_name
+      if @local_name && /\A#{Pat::Nmtoken}\z/o !~ @local_name
         raise HTree::Error, "invalid local name: #{@local_name.inspect}"
       end
       if @namespace_prefix == 'xmlns'
parse.rb.patch (1.53 KB, text/x-diff)
--- parse.rb.orig	2004-05-06 15:13:52.000000000 +0900
+++ parse.rb	2004-07-23 15:12:32.000000000 +0900
@@ -189,8 +189,10 @@
   end
 
   def HTree.build_node(structure, is_xml, is_html, inherited_context=DefaultContext)
-    case structure[0]
-    when :elem
+    s = structure[0]
+    if s==:text_pcdata
+      Text.parse_pcdata(structure[1])
+    elsif s==:elem
       _, stag_rawstring, children, etag_rawstring = structure
       etag = etag_rawstring && ETag.parse(etag_rawstring, is_xml, is_html)
       stag = STag.parse(stag_rawstring, is_xml, is_html, inherited_context)
@@ -201,23 +203,21 @@
       else
         Elem.new!(stag)
       end
-    when :emptytag
+    elsif s==:emptytag
       Elem.new!(STag.parse(structure[1], is_xml, is_html, inherited_context))
-    when :bogus_etag
+    elsif s==:bogus_etag
       BogusETag.parse(structure[1], is_xml, is_html)
-    when :xmldecl
+    elsif s==:xmldecl
       XMLDecl.parse(structure[1])
-    when :doctype
+    elsif s==:doctype
       DocType.parse(structure[1], is_xml, is_html)
-    when :procins
+    elsif s==:procins
       ProcIns.parse(structure[1])
-    when :comment
+    elsif s==:comment
       Comment.parse(structure[1])
-    when :text_pcdata
-      Text.parse_pcdata(structure[1])
-    when :text_cdata_content
+    elsif s==:text_cdata_content
       Text.parse_cdata_content(structure[1])
-    when :text_cdata_section
+    elsif s==:text_cdata_section
       Text.parse_cdata_section(structure[1])
     else
       raise Exception, "[bug] unknown structure: #{structure.inspect}"

In This Thread

Prev Next