[ruby-list:39908] htreeの高速化
From:
MoonWolf <moonwolf@...>
Date:
2004-07-23 13:23:02 UTC
List:
ruby-list #39908
MoonWolfです。 htree(http://raa.ruby-lang.org/project/htree/)で大きなHTMLをパースすると メモリを大量に消費し処理時間も長いので、高速化・省メモリ化を試みました。 (http://devlog.moonwolf.com/20040723.html#p01) 添付したパッチが、その成果です。 1. 正規表現のオプション //oオプションを付け忘れている箇所があったので//oを付け足しました。 2. HTree::Nameオブジェクトの共有化 immutableなオブジェクトなのでHTree::NAMEというHashにキャッシュして Name.new(arg)をNAME[arg]に書き換えました。 ObjectSpaceからオブジェクト数を調べて見ると21,504個あったHTree::Nameが 29個と激減しメモリ使用量も減りました。 3. caseからif〜elsifへの書き換えと比較順序の変更 一番出現頻度が高いと思われる:text_pcdataを最初にチェックするようにしま した。 3.5MBのHTMLのparseでメモリ使用量が209MBから156MBまで減りました。しかし、 156MBでもまだ多いと思います。これ以外に改善する方法はなにかないですかねぇ……
Attachments (3)
tag.rb.patch
(15.6 KB, text/x-diff)
--- tag.rb.orig 2004-05-06 14:53:34.000000000 +0900
+++ parse.rb 2004-07-23 15:12:32.000000000 +0900
@@ -1,111 +1,407 @@
-require 'htree/raw_string'
+require 'htree/scan'
+require 'htree/htmlinfo'
require 'htree/text'
-require 'htree/scan' # for Pat::Name and Pat::Nmtoken
+require 'htree/tag'
+require 'htree/leaf'
+require 'htree/doc'
+require 'htree/elem'
+require 'htree/raw_string'
require 'htree/context'
-require 'htree/name'
+require 'htree/encoder'
module HTree
+ # HTree.parse parses <i>input</i> and return a document tree.
+ # represented by HTree::Doc.
+ #
+ # <i>input</i> should be a String or
+ # an object which respond to read or open method.
+ # For example, IO, StringIO, Pathname, URI::HTTP and URI::FTP are acceptable.
+ # Note that the URIs need open-uri.
+ #
+ # HTree.parse guesses <i>input</i> is HTML or not and XML or not.
+ #
+ # If it is guessed as HTML, the default namespace in the result is set to http://www.w3.org/1999/xhtml
+ # regardless of <i>input</i> has XML namespace declaration or not nor even it is pre-XML HTML.
+ #
+ # If it is guessed as HTML and not XML, all element and attribute names are downcaseed.
+ #
+ # If opened file or read content has charset method,
+ # HTree.parse decode it according to $KCODE before parsing.
+ # Otherwise HTree.parse assumes the character encoding of the content is
+ # compatible to $KCODE.
+ # Note that the charset method is provided by URI::HTTP with open-uri.
+ def HTree.parse(input)
+ parse_as(input, false)
+ end
+
+ # HTree.parse_xml parses <i>input</i> as XML and
+ # return a document tree represented by HTree::Doc.
+ #
+ # It behaves almost same as HTree.parse but it assumes <i>input</> is XML
+ # even if no XML declaration.
+ # The assumption causes following differences.
+ # * doesn't downcase element name.
+ # * The content of <script> and <style> element is PCDATA, not CDATA.
+ def HTree.parse_xml(input)
+ parse_as(input, true)
+ end
+
# :stopdoc:
- class STag
- def initialize(name, attributes=[], inherited_context=DefaultContext)
- init_raw_string
- # normalize xml declaration name and attribute value.
- attributes = attributes.map {|aname, val|
- if !(Name === aname) && /\A(?:#{Pat::Name}?\{.*\})?#{Pat::Nmtoken}\z/ !~ aname
- raise HTree::Error, "invalid attribute name: #{aname.inspect}"
- end
- if !(Name === aname) && /\Axmlns(?:\z|:)/ =~ aname
- aname = Name.parse_attribute_name(aname, nil)
- end
- val = val.to_node if HTree::Location === val
- val = Text.new(val) unless Text === val
- [aname, val]
+ def HTree.parse_as(input, is_xml)
+ input_charset = nil
+ if input.tainted? && 1 <= $SAFE
+ raise SecurityError, "input tainted"
+ end
+ if input.respond_to? :read # IO, StringIO
+ input = input.read.untaint
+ input_charset = input.charset if input.respond_to? :charset
+ elsif input.respond_to? :open # Pathname, URI with open-uri
+ input.open {|f|
+ input = f.read.untaint
+ input_charset = f.charset if f.respond_to? :charset
}
+ end
+ if input_charset && input_charset != Encoder.internal_charset
+ input = Iconv.conv(Encoder.internal_charset, input_charset, input)
+ end
- @inherited_context = inherited_context
- @xmlns_decls = {}
+ tokens = []
+ is_xml, is_html = HTree.scan(input, is_xml) {|token|
+ tokens << token
+ }
+ context = is_html ? HTMLContext: DefaultContext
+ structure_list = parse_pairs(tokens, is_xml, is_html)
+ structure_list = fix_structure_list(structure_list, is_xml, is_html)
+ nodes = structure_list.map {|s| build_node(s, is_xml, is_html, context) }
+ Doc.new(nodes)
+ end
- # validate namespace consistency of given Name objects.
- if Name === name
- @xmlns_decls[name.namespace_prefix] = name.namespace_uri
- end
- attributes.each {|aname, text|
- next unless Name === aname
- next if aname.xmlns?
- if aname.namespace_prefix && aname.namespace_uri
- if @xmlns_decls.include? aname.namespace_prefix
- if @xmlns_decls[aname.namespace_prefix] != aname.namespace_uri
- raise ArgumentError, "inconsistent namespace use: #{aname.namespace_prefix} is used as #{@xmlns_decls[aname.namespace_prefix]} and #{aname.namespace_uri}"
- end
- else
- @xmlns_decls[aname.namespace_prefix] = aname.namespace_uri
+ def HTree.parse_pairs(tokens, is_xml, is_html)
+ stack = [[nil, nil, []]]
+ tokens.each {|token|
+ case token[0]
+ when :stag
+ stag_raw_string = token[1]
+ stagname = stag_raw_string[Pat::Name]
+ stagname = stagname.downcase if !is_xml && is_html
+ stack << [stagname, stag_raw_string, []]
+ when :etag
+ etag_raw_string = token[1]
+ etagname = etag_raw_string[Pat::Name]
+ etagname = etagname.downcase if !is_xml && is_html
+ matched_elem = nil
+ stack.reverse_each {|elem|
+ stagname, _, _ = elem
+ if stagname == etagname
+ matched_elem = elem
+ break
+ end
+ }
+ if matched_elem
+ until matched_elem.equal? stack.last
+ stagname, stag_raw_string, children = stack.pop
+ stack.last[2] << [:elem, stag_raw_string, children]
end
+ stagname, stag_raw_string, children = stack.pop
+ stack.last[2] << [:elem, stag_raw_string, children, etag_raw_string]
+ else
+ stack.last[2] << [:bogus_etag, etag_raw_string]
end
- }
+ else
+ stack.last[2] << token
+ end
+ }
+ elem = nil
+ while 1 < stack.length
+ stagname, stag_raw_string, children = stack.pop
+ stack.last[2] << [:elem, stag_raw_string, children]
+ end
+ stack[0][2]
+ end
+
+ def HTree.fix_structure_list(structure_list, is_xml, is_html)
+ result = []
+ rest = structure_list.dup
+ until rest.empty?
+ structure = rest.shift
+ if structure[0] == :elem
+ elem, rest2 = fix_element(structure, [], [], is_xml, is_html)
+ result << elem
+ rest = rest2 + rest
+ else
+ result << structure
+ end
+ end
+ result
+ end
- attributes.each {|aname, text|
- next unless Name === aname
- next unless aname.xmlns?
- next if @xmlns_decls.include? aname.local_name
- if aname.local_name
- @xmlns_decls[aname.local_name] = text.to_s
+ def HTree.fix_element(elem, excluded_tags, included_tags, is_xml, is_html)
+ stag_raw_string = elem[1]
+ children = elem[2]
+ if etag_raw_string = elem[3]
+ return [:elem, stag_raw_string, fix_structure_list(children, is_xml, is_html), etag_raw_string], []
+ else
+ tagname = stag_raw_string[Pat::Name]
+ tagname = tagname.downcase if !is_xml && is_html
+ if ElementContent[tagname] == :EMPTY
+ return [:elem, stag_raw_string, []], children
+ else
+ if ElementContent[tagname] == :CDATA
+ possible_tags = []
else
- uri = text.to_s
- @xmlns_decls[nil] = uri
+ possible_tags = ElementContent[tagname]
+ end
+ if possible_tags
+ excluded_tags2 = ElementExclusions[tagname]
+ included_tags2 = ElementInclusions[tagname]
+ excluded_tags |= excluded_tags2 if excluded_tags2
+ included_tags |= included_tags2 if included_tags2
+ containable_tags = (possible_tags | included_tags) - excluded_tags
+ uncontainable_tags = ElementContent.keys - containable_tags
+ else
+ # If the tagname is unknown, it is assumed that any element
+ # except excluded can be contained.
+ uncontainable_tags = excluded_tags
+ end
+ fixed_children = []
+ rest = children
+ until rest.empty?
+ if rest[0][0] == :elem
+ elem = rest.shift
+ elem_tagname = elem[1][Pat::Name]
+ elem_tagname = elem_tagname.downcase if !is_xml && is_html
+ if uncontainable_tags.include? elem_tagname
+ rest.unshift elem
+ break
+ else
+ fixed_elem, rest2 = fix_element(elem, excluded_tags, included_tags, is_xml, is_html)
+ fixed_children << fixed_elem
+ rest = rest2 + rest
+ end
+ else
+ fixed_children << rest.shift
+ end
end
+ return [:elem, stag_raw_string, fixed_children], rest
+ end
+ end
+ end
+
+ def HTree.build_node(structure, is_xml, is_html, inherited_context=DefaultContext)
+ s = structure[0]
+ if s==:text_pcdata
+ Text.parse_pcdata(structure[1])
+ elsif s==:elem
+ _, stag_rawstring, children, etag_rawstring = structure
+ etag = etag_rawstring && ETag.parse(etag_rawstring, is_xml, is_html)
+ stag = STag.parse(stag_rawstring, is_xml, is_html, inherited_context)
+ if !children.empty? || etag
+ Elem.new!(stag,
+ children.map {|c| build_node(c, is_xml, is_html, stag.context) },
+ etag)
+ else
+ Elem.new!(stag)
+ end
+ elsif s==:emptytag
+ Elem.new!(STag.parse(structure[1], is_xml, is_html, inherited_context))
+ elsif s==:bogus_etag
+ BogusETag.parse(structure[1], is_xml, is_html)
+ elsif s==:xmldecl
+ XMLDecl.parse(structure[1])
+ elsif s==:doctype
+ DocType.parse(structure[1], is_xml, is_html)
+ elsif s==:procins
+ ProcIns.parse(structure[1])
+ elsif s==:comment
+ Comment.parse(structure[1])
+ elsif s==:text_cdata_content
+ Text.parse_cdata_content(structure[1])
+ elsif s==:text_cdata_section
+ Text.parse_cdata_section(structure[1])
+ else
+ raise Exception, "[bug] unknown structure: #{structure.inspect}"
+ end
+ end
+
+ def STag.parse(raw_string, is_xml, is_html, inherited_context=DefaultContext)
+ if /\A#{Pat::StartTag}\z/o =~ raw_string
+ is_stag = true
+ elsif /\A#{Pat::EmptyTag}\z/o =~ raw_string
+ is_stag = false
+ else
+ raise HTree::Error, "cannot recognize as start tag or empty tag: #{raw_string.inspect}"
+ end
+
+ attrs = []
+ if (is_stag ? /\A#{Pat::ValidStartTag_C}\z/o : /\A#{Pat::ValidEmptyTag_C}\z/o) =~ raw_string
+ qname = $1
+ $2.scan(Pat::ValidAttr_C) {
+ attrs << ($5 ? [nil, $5] : [$1, $2 || $3 || $4])
+ }
+ elsif (is_stag ? /\A#{Pat::InvalidStartTag_C}\z/o : /\A#{Pat::InvalidEmptyTag_C}\z/o) =~ raw_string
+ qname = $1
+ last_attr = $3
+ $2.scan(Pat::InvalidAttr1_C) {
+ attrs << ($5 ? [nil, $5] : [$1, $2 || $3 || $4])
}
+ if last_attr
+ /#{Pat::InvalidAttr1End_C}/o =~ last_attr
+ attrs << [$1, $2 || $3]
+ end
+ else
+ raise Exception, "[bug] cannot recognize as start tag: #{raw_string.inspect}"
+ end
- @context = make_context(@inherited_context)
+ qname = qname.downcase if !is_xml && is_html
- if Name === name
- @name = name
+ attrs.map! {|aname, aval|
+ if aname
+ aname = (!is_xml && is_html) ? aname.downcase : aname
+ [aname, Text.parse_pcdata(aval)]
else
- @name = Name.parse_element_name(name, @context)
+ if val2name = OmittedAttrName[qname]
+ aval_downcase = aval.downcase
+ aname = val2name.fetch(aval_downcase, aval_downcase)
+ else
+ aname = aval
+ end
+ [aname, Text.new(aval)]
end
+ }
+
+ result = STag.new(qname, attrs, inherited_context)
+ result.raw_string = raw_string
+ result
+ end
+
+ def ETag.parse(raw_string, is_xml, is_html)
+ unless /\A#{Pat::EndTag_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as end tag: #{raw_string.inspect}"
+ end
+
+ qname = $1
+ qname = qname.downcase if !is_xml && is_html
+
+ result = self.new(qname)
+ result.raw_string = raw_string
+ result
+ end
+
+ def BogusETag.parse(raw_string, is_xml, is_html)
+ unless /\A#{Pat::EndTag_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as end tag: #{raw_string.inspect}"
+ end
+
+ qname = $1
+ qname = qname.downcase if !is_xml && is_html
- @attributes = attributes.map {|aname, text|
- aname = Name.parse_attribute_name(aname, @context) unless Name === aname
- if !aname.namespace_prefix && !aname.namespace_uri.empty?
- # xxx: should recover error?
- raise HTree::Error, "global attribute without namespace prefix: #{aname.inspect}"
+ result = self.new(qname)
+ result.raw_string = raw_string
+ result
+ end
+
+ def Text.parse_pcdata(raw_string)
+ fixed = raw_string.gsub(/&(?:(?:#[0-9]+|#x[0-9a-fA-F]+|([A-Za-z][A-Za-z0-9]*));?)?/o) {|s|
+ name = $1
+ case s
+ when /;\z/
+ s
+ when /\A&#/
+ "#{s};"
+ when '&'
+ '&'
+ else
+ if NamedCharactersPattern =~ name
+ "&#{name};"
+ else
+ "&#{name}"
end
- [aname, text]
- }
- @attributes.freeze
+ end
+ }
+ result = new!(fixed)
+ result.raw_string = raw_string
+ result
+ end
+
+ def Text.parse_cdata_content(raw_string)
+ result = Text.new(raw_string)
+ result.raw_string = raw_string
+ result
+ end
+
+ def Text.parse_cdata_section(raw_string)
+ unless /\A#{Pat::CDATA_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as CDATA section: #{raw_string.inspect}"
end
- attr_reader :attributes, :inherited_context, :context
- def element_name
- @name
+ content = $1
+
+ result = Text.new(content)
+ result.raw_string = raw_string
+ result
+ end
+
+ def XMLDecl.parse(raw_string)
+ unless /\A#{Pat::XmlDecl_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as XML declaration: #{raw_string.inspect}"
end
- def make_context(inherited_context)
- inherited_context.subst_namespaces(@xmlns_decls)
+ version = $1 || $2
+ encoding = $3 || $4
+ case $5 || $6
+ when 'yes'
+ standalone = true
+ when 'no'
+ standalone = false
+ else
+ standalone = nil
end
- def each_namespace_attribute
- @xmlns_decls.each {|name, uri|
- yield name, uri
- }
- nil
+ result = XMLDecl.new(version, encoding, standalone)
+ result.raw_string = raw_string
+ result
+ end
+
+ def DocType.parse(raw_string, is_xml, is_html)
+ unless /\A#{Pat::DocType_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as XML declaration: #{raw_string.inspect}"
end
- def each_attribute
- @attributes.each {|name, text|
- next if name.xmlns?
- yield name, text
- }
- nil
+ root_element_name = $1
+ public_identifier = $2 || $3
+ system_identifier = $4 || $5
+
+ root_element_name = root_element_name.downcase if !is_xml && is_html
+
+ result = DocType.new(root_element_name, public_identifier, system_identifier)
+ result.raw_string = raw_string
+ result
+ end
+
+ def ProcIns.parse(raw_string)
+ unless /\A#{Pat::XmlProcIns_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as processing instruction: #{raw_string.inspect}"
end
+
+ target = $1
+ content = $2
+
+ result = ProcIns.new(target, content)
+ result.raw_string = raw_string
+ result
end
- class ETag
- def initialize(qualified_name)
- init_raw_string
- @qualified_name = qualified_name.dup.freeze
+ def Comment.parse(raw_string)
+ unless /\A#{Pat::Comment_C}\z/o =~ raw_string
+ raise HTree::Error, "cannot recognize as comment: #{raw_string.inspect}"
end
- attr_reader :qualified_name
+
+ content = $1
+
+ result = Comment.new(content)
+ result.raw_string = raw_string
+ result
end
# :startdoc:
name.rb.patch
(2.63 KB, text/x-diff)
--- name.rb.orig 2004-04-20 15:53:30.000000000 +0900
+++ name.rb 2004-07-23 15:45:12.000000000 +0900
@@ -4,6 +4,11 @@
module HTree
# Name represents a element name and attribute name.
# It consists of a namespace prefix, a namespace URI and a local name.
+ NAME = Hash.new {|hash,(namespace_prefix, namespace_uri, local_name)|
+ name = Name.new(namespace_prefix, namespace_uri, local_name)
+ hash[[namespace_prefix, namespace_uri, local_name]] = name
+ name
+ }
class Name
=begin
element name prefix uri localname
@@ -21,30 +26,30 @@
if /\{(.*)\}/ =~ name
# "{u}n" means "use default namespace",
# "p{u}n" means "use the specified prefix p"
- $` == '' ? Name.new(nil, $1, $') : Name.new($`, $1, $')
+ $` == '' ? NAME[[nil, $1, $']] : NAME[[$`, $1, $']]
elsif /:/ =~ name && !context.namespace_uri($`).empty?
- Name.new($`, context.namespace_uri($`), $')
+ NAME[[$`, context.namespace_uri($`), $']]
elsif !context.namespace_uri(nil).empty?
- Name.new(nil, context.namespace_uri(nil), name)
+ NAME[[nil, context.namespace_uri(nil), name]]
else
- Name.new(nil, '', name)
+ NAME[[nil, '', name]]
end
end
def Name.parse_attribute_name(name, context)
if name == 'xmlns'
- Name.new('xmlns', nil, nil)
+ NAME[['xmlns', nil, nil]]
elsif /\Axmlns:/ =~ name
- Name.new('xmlns', nil, $')
+ NAME[['xmlns', nil, $']]
elsif /\{(.*)\}/ =~ name
case $`
- when ''; Name.new(nil, $1, $')
- else Name.new($`, $1, $')
+ when ''; NAME[[nil, $1, $']]
+ else NAME[[$`, $1, $']]
end
elsif /:/ =~ name && !context.namespace_uri($`).empty?
- Name.new($`, context.namespace_uri($`), $')
+ NAME[[$`, context.namespace_uri($`), $']]
else
- Name.new(nil, '', name)
+ NAME[[nil, '', name]]
end
end
@@ -52,10 +57,10 @@
@namespace_prefix = namespace_prefix && namespace_prefix.dup.freeze
@namespace_uri = namespace_uri && namespace_uri.dup.freeze
@local_name = local_name && local_name.dup.freeze
- if @namespace_prefix && /\A#{Pat::Nmtoken}\z/ !~ @namespace_prefix
+ if @namespace_prefix && /\A#{Pat::Nmtoken}\z/o !~ @namespace_prefix
raise HTree::Error, "invalid namespace prefix: #{@namespace_prefix.inspect}"
end
- if @local_name && /\A#{Pat::Nmtoken}\z/ !~ @local_name
+ if @local_name && /\A#{Pat::Nmtoken}\z/o !~ @local_name
raise HTree::Error, "invalid local name: #{@local_name.inspect}"
end
if @namespace_prefix == 'xmlns'
parse.rb.patch
(1.53 KB, text/x-diff)
--- parse.rb.orig 2004-05-06 15:13:52.000000000 +0900
+++ parse.rb 2004-07-23 15:12:32.000000000 +0900
@@ -189,8 +189,10 @@
end
def HTree.build_node(structure, is_xml, is_html, inherited_context=DefaultContext)
- case structure[0]
- when :elem
+ s = structure[0]
+ if s==:text_pcdata
+ Text.parse_pcdata(structure[1])
+ elsif s==:elem
_, stag_rawstring, children, etag_rawstring = structure
etag = etag_rawstring && ETag.parse(etag_rawstring, is_xml, is_html)
stag = STag.parse(stag_rawstring, is_xml, is_html, inherited_context)
@@ -201,23 +203,21 @@
else
Elem.new!(stag)
end
- when :emptytag
+ elsif s==:emptytag
Elem.new!(STag.parse(structure[1], is_xml, is_html, inherited_context))
- when :bogus_etag
+ elsif s==:bogus_etag
BogusETag.parse(structure[1], is_xml, is_html)
- when :xmldecl
+ elsif s==:xmldecl
XMLDecl.parse(structure[1])
- when :doctype
+ elsif s==:doctype
DocType.parse(structure[1], is_xml, is_html)
- when :procins
+ elsif s==:procins
ProcIns.parse(structure[1])
- when :comment
+ elsif s==:comment
Comment.parse(structure[1])
- when :text_pcdata
- Text.parse_pcdata(structure[1])
- when :text_cdata_content
+ elsif s==:text_cdata_content
Text.parse_cdata_content(structure[1])
- when :text_cdata_section
+ elsif s==:text_cdata_section
Text.parse_cdata_section(structure[1])
else
raise Exception, "[bug] unknown structure: #{structure.inspect}"