[#144186] Re: array of object insert polices — "Pe, Botp" <botp@...>

dave [mailto:dave.m@email.it] wrote:

14 messages 2005/06/01

[#144206] Implementing a Read-Only array — Gavin Kistner <gavin@...>

Right up front, let me say that I realize that I can't prevent

14 messages 2005/06/01

[#144224] Method Chaining Issues — "aartist" <aartist@...>

try this:

28 messages 2005/06/01
[#144231] Re: Method Chaining Issues — "Phrogz" <gavin@...> 2005/06/01

This is a FAQ, though no page on the RubyGarden wiki seems to address

[#144240] Re: Method Chaining Issues — Nikolai Weibull <mailing-lists.ruby-talk@...> 2005/06/01

Phrogz wrote:

[#144230] ternary operator confusion — Belorion <belorion@...>

I don't know if this is "improper" use of the ternary operator, but I

19 messages 2005/06/01
[#144233] Re: ternary operator confusion — "Phrogz" <gavin@...> 2005/06/01

true ? a.push(1) : a.push(2)

[#144257] Re: ternary operator confusion — "Marcel Molina Jr." <marcel@...> 2005/06/01

On Thu, Jun 02, 2005 at 01:40:23AM +0900, Phrogz wrote:

[#144263] Re: ternary operator confusion — Eric Mahurin <eric_mahurin@...> 2005/06/01

--- "Marcel Molina Jr." <marcel@vernix.org> wrote:

[#144453] RubyScript2Exe and GUI toolkits — Erik Veenstra <pan@...>

13 messages 2005/06/03

[#144487] Building a business case for Ruby — Joe Van Dyk <joevandyk@...>

Hi,

29 messages 2005/06/03

[#144535] ruby-dev summary 26128-26222 — Minero Aoki <aamine@...>

Hi all,

11 messages 2005/06/04

[#144579] Package, a future replacement for setup.rb and mkmf.rb — Christian Neukirchen <chneukirchen@...>

29 messages 2005/06/04

[#144672] newbie read.scan (?) question — "Bruce D'Arcus" <bdarcus.lists@...>

Hi,

16 messages 2005/06/06

[#144691] making a duck — Eric Mahurin <eric_mahurin@...>

Regarding duck-typing... Is there an easy way make a "duck"?

27 messages 2005/06/06

[#144867] ruby-wish@ruby-lang.org mailing list — dave <dave.m@...>

19 messages 2005/06/08
[#144870] Re: [PROPOSAL] ruby-wish@ruby-lang.org mailing list — "Robert Klemme" <bob.news@...> 2005/06/08

Austin Ziegler wrote:

[#144890] RubyStuff: The Ruby Shop for Ruby Programmers — James Britt <james_b@...>

Announcing the formal grand opening of Ruby Stuff: The Ruby Shop for

36 messages 2005/06/08

[#144966] python/ruby benchmark. — "\"</script>" <groleo@...>

I took a look at

78 messages 2005/06/09
[#144967] Re: python/ruby benchmark. — gabriele renzi <surrender_it@...> 2005/06/09

"</script> ha scritto:

[#144974] Re: python/ruby benchmark. — Lothar Scholz <mailinglists@...> 2005/06/09

Hello gabriele,

[#144977] Re: python/ruby benchmark. — Kent Sibilev <ksruby@...> 2005/06/09

Java is an order of magnitude faster than Ruby. The development of a

[#144980] Re: python/ruby benchmark. — Lothar Scholz <mailinglists@...> 2005/06/09

Hello Kent,

[#144983] Re: python/ruby benchmark. — "Ryan Leavengood" <mrcode@...> 2005/06/09

Lothar Scholz said:

[#145196] Re: python/ruby benchmark(don't shoot the messenger) — ptkwt@... (Phil Tomson) 2005/06/12

In article <9e7db91105061106485b68d629@mail.gmail.com>,

[#145207] Re: python/ruby benchmark(don't shoot the messenger) — Steven Jenkins <steven.jenkins@...> 2005/06/12

Phil Tomson wrote:

[#145212] Re: python/ruby benchmark(don't shoot the messenger) — Austin Ziegler <halostatue@...> 2005/06/12

On 6/12/05, Steven Jenkins <steven.jenkins@ieee.org> wrote:

[#145219] Re: python/ruby benchmark(don't shoot the messenger) — Steven Jenkins <steven.jenkins@...> 2005/06/12

Austin Ziegler wrote:

[#145223] Re: python/ruby benchmark(don't shoot the messenger) — Austin Ziegler <halostatue@...> 2005/06/12

On 6/12/05, Steven Jenkins <steven.jenkins@ieee.org> wrote:

[#145240] Re: python/ruby benchmark(don't shoot the messenger) — Steven Jenkins <steven.jenkins@...> 2005/06/12

Austin Ziegler wrote:

[#145241] Re: python/ruby benchmark(don't shoot the messenger) — Austin Ziegler <halostatue@...> 2005/06/13

On 6/12/05, Steven Jenkins <steven.jenkins@ieee.org> wrote:

[#145000] RDoc

Hi, I have a question. When I compiled ruby-1.8.2

13 messages 2005/06/09
[#145003] Re: RDoc — Eric Hodel <drbrain@...7.net> 2005/06/09

On 09 Jun 2005, at 13:55, Jesffffas Antonio Sfffe1nchez A. wrote:

[#145238] finding Hash subsets based on key value — "ee" <erik.eide@...>

Hi

17 messages 2005/06/12

[#145304] PDF::Writer 1.0 (version 1.0.1) — Austin Ziegler <halostatue@...>

= PDF::Writer

21 messages 2005/06/13
[#145411] Re: [ANN] PDF::Writer 1.0 (version 1.0.1) — Jason Foreman <threeve.org@...> 2005/06/14

No love from PDF::Writer on Mac OS X 10.4.1. I hope to get this fixed

[#145420] Re: [ANN] PDF::Writer 1.0 (version 1.0.1) — Austin Ziegler <halostatue@...> 2005/06/14

On 6/14/05, Jason Foreman <threeve.org@gmail.com> wrote:

[#145432] Re: [ANN] PDF::Writer 1.0 (version 1.0.1) — Jamis Buck <jamis@37signals.com> 2005/06/15

On Jun 14, 2005, at 5:11 PM, Austin Ziegler wrote:

[#145339] survey: what editor do you use to hack ruby? — Lowell Kirsh <lkirsh@...>

I've been having a tough time getting emacs set up properly with ruby

62 messages 2005/06/14

[#145390] Ruby and recursion (Ackermann benchmark) — ptkwt@... (Phil Tomson)

14 messages 2005/06/14

[#145586] How to make a browser in Ruby Tk — sujeet kumar <sujeetkr@...>

Hi

13 messages 2005/06/16

[#145636] Super-scalar Optimizations — "Phrogz" <gavin@...>

I was looking over the shoulder of a C++ coworker yesterday, when he

14 messages 2005/06/16

[#145677] Truth maintenance system in Ruby — "itsme213" <itsme213@...>

Anyone know of any kind of truth-maintenance system implemented in Ruby (or,

12 messages 2005/06/17

[#145720] Frameless RDoc template ('technology preview') — ES <ruby-ml@...>

Hi!

17 messages 2005/06/17

[#145779] Newbe questions... — "Chuck Brotman" <brotman@...>

In Ruby Is there a prefered (or otherwise elegant) way to do an inner &

17 messages 2005/06/18

[#145790] GC.disable not working? — Eric Mahurin <eric_mahurin@...>

From what I can tell, GC.disable doesn't work. I'm wanting to

37 messages 2005/06/18
[#145822] Re: GC.disable not working? — ts <decoux@...> 2005/06/19

>>>>> "E" == Eric Mahurin <eric_mahurin@yahoo.com> writes:

[#146024] evaluation of ruby — "Franz Hartmann" <porschefranz@...> 2005/06/21

Hello all,

[#145830] preventing instantiation — "R. Mark Volkmann" <mark@...>

What is the recommended way in Ruby to prevent other classes from creating

13 messages 2005/06/19
[#145831] Re: preventing instantiation — Gavri Fernandez <gavri.fernandez@...> 2005/06/19

On 6/19/05, R. Mark Volkmann <mark@ociweb.com> wrote:

[#145879] x==1 vs 1==x — Gavin Kistner <gavin@...>

I'm against _premature_ optimization in theory, but believe that a

19 messages 2005/06/20
[#145880] Re: x==1 vs 1==x — ts <decoux@...> 2005/06/20

>>>>> "G" == Gavin Kistner <gavin@refinery.com> writes:

[#145943] Chess Variants (II) (#36) — James Edward Gray II <james@...>

I don't want to spoil all the fun, in case anyone is still attempting

12 messages 2005/06/20

[#146038] 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — Michael Tan <mtan1232000@...>

Just new to Ruby since last week, running my same functional program on the windows XP(Pentium M1.5G), the Ruby version is 10 times slower than the Java version. The program is to find the prime numbers like 2, 3,5, 7, 11, 13... Are there setup issues? or it is normal?

47 messages 2005/06/21
[#146044] Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — "Florian Frank" <flori@...> 2005/06/21

Michael Tan wrote:

[#146047] Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — Jim Freeze <jim@...> 2005/06/21

* Florian Frank <flori@nixe.ping.de> [2005-06-22 05:40:14 +0900]:

[#146050] Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — "Ryan Leavengood" <mrcode@...> 2005/06/21

Jim Freeze said:

[#146132] Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — "Mark Thomas" <mrt@...> 2005/06/22

Florian Frank wrote:

[#146064] rubyscript2exe — Joe Van Dyk <joevandyk@...>

Hi,

14 messages 2005/06/21

[#146169] spidering a website to build a sitemap — Bill Guindon <agorilla@...>

I need to spider a site and build a sitemap for it. I've looked

17 messages 2005/06/22

[#146178] traits-0.4.0 - the coffee release — "Ara.T.Howard" <Ara.T.Howard@...>

15 messages 2005/06/22

[#146328] string to Class object — "R. Mark Volkmann" <mark@...>

How can I create a Class object from a String that contains the name of a class?

15 messages 2005/06/24

[#146380] Application-0.6.0 — Jim Freeze <jim@...>

CommandLine - Application and OptionParser

22 messages 2005/06/24

[#146391] ASP.NET vs Ruby on Rails — Stephen Kellett <snail@...>

HI Folks,

21 messages 2005/06/24
[#146457] Re: ASP.NET vs Ruby on Rails — "Dema" <demetriusnunes@...> 2005/06/25

Hi Stephen,

[#146425] speeding up Process.detach frequency — Joe Van Dyk <joevandyk@...>

Is there any way to speed up Process.detach? The ri documentation for

14 messages 2005/06/25

[#146483] I saw the beauty of Ruby Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — Michael Tan <mtan1232000@...>

22 messages 2005/06/26
[#146485] Re: I saw the beauty of Ruby Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — "Florian Frank" <flori@...> 2005/06/26

Michael Tan wrote:

[#146504] Re: I saw the beauty of Ruby Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — Brad Wilson <dotnetguy@...> 2005/06/26

For comparison, the port of your code to (less than elegant) C#.

[#146515] Re: I saw the beauty of Ruby Re: 1. Ruby result: 101 seconds , 2. Java result:9.8 seconds, 3. Perl result:62 seconds — Florian Gro<florgro@...> 2005/06/26

Brad Wilson wrote:

[#146491] What do you want to see in a Sparklines Library? — Daniel Nugent <nugend@...>

This is sort of an interest gauging/feature request poll.

17 messages 2005/06/26
[#146506] Re: What do you want to see in a Sparklines Library? — Daniel Amelang <daniel.amelang@...> 2005/06/26

See what's already been done before you get too far.

[#146517] Re: What do you want to see in a Sparklines Library? — Daniel Nugent <nugend@...> 2005/06/26

Yup, seen the stuff on RedHanded, I was planning on writing a little

[#146562] RCM - A Ruby Configuration Management System — Michael Neumann <mneumann@...>

Hi all,

22 messages 2005/06/27

[#146630] yield does not take a block — Daniel Brockman <daniel@...>

Under ruby 1.9.0 (2005-06-23) [i386-linux], irb 0.9.5(05/04/13),

48 messages 2005/06/28
[#146666] Re: yield does not take a block — Daniel Brockman <daniel@...> 2005/06/28

Yukihiro Matsumoto <matz@ruby-lang.org> writes:

[#146680] Re: yield does not take a block — Yukihiro Matsumoto <matz@...> 2005/06/28

Hi,

[#146684] Re: yield does not take a block — Eric Mahurin <eric_mahurin@...> 2005/06/28

[#146779] Re: yield does not take a block — "Adam P. Jenkins" <thorin@...> 2005/06/29

Eric Mahurin wrote:

[#146700] Anything in new Eclipse for Rubyists? — "jfry" <jeff.fry@...>

Hey there, I know that a number of folks on the list use Eclipse as

14 messages 2005/06/28

[#146773] Programmers Contest: Fit pictures on a page — hicinbothem@...

GLOSSY: The Summer Programmer Of The Month Contest is underway!

18 messages 2005/06/29

[#146815] shift vs. slice!(0) and others — Eric Mahurin <eric_mahurin@...>

I just did some benchmarking of various ways to insert/delete

12 messages 2005/06/29

Re: REXML libraries and parsing issues

From: mathew <meta@...>
Date: 2005-06-27 17:50:41 UTC
List: ruby-talk #146577
BA wrote:
> Yes, I want to extract the PDAT element, however, I want to use the B110 
> tag to find this element.  The XML *is* predictable, however, there are 
> variations in the placement of the elements (there could be several 
> different address fields and/or many paragraphs that need to be 
> parsed/searched).  The files are *extremely* large (some could be as 
> large as 1-2GB).

Any time you're faced with a huge XML file and you only want to get 
small pieces of it, you should think about using a stream parser, a la 
Java's SAX2. The idea behind a stream parser is that the parser runs 
through the entire file exactly once, and never has to seek forwards or 
backwards; it throws the data to you in whatever size chunks are most 
convenient for it. This means the parser does as little work as 
possible, which means so long as your code is efficient, the end result 
should be maximally efficient.

The downside is that you have to do a bit more work. For example, 
depending on how the parser buffers things internally, it might send you 
a piece of text inside an XML element in two or more pieces, and expect 
you to glue them together. It's also up to you to deal with any 
position-based restrictions on which elements you're interested in.

I assumed you wanted the text inside any PDAT element that was 
*somewhere* inside a B110 element, so I simply track which elements are 
currently "open" and by how many levels. Here's the code.

---
require 'rexml/document'
require 'rexml/parsers/streamparser'

class MyListener

   def initialize
     # Hash to record which elements we are inside at any given moment, and
     # how many of them we are inside
     @inside = Hash.new
     @textbuffer = ''
   end

   def tag_start(name, attrs)
     if @inside[name]
       @inside[name] += 1
     else
       @inside[name] = 1
     end
   end

   def text(text)
     if @inside['B110'] and @inside['PDAT']
       @textbuffer += text
     end
   end

   def tag_end(name)
     if name == 'PDAT'
       # Output the text if we just closed a PDAT inside a B110
       if @inside['B110'] and @inside['PDAT']
         puts @textbuffer
       end
       # Clear the buffer any time we close a PDAT
       @textbuffer = ''
     end
     # Decrement count, set to nil if zero
     # so @inside['foo'] works as a boolean
     if @inside[name] == 1
       @inside[name] = nil
     else
       @inside[name] -= 1
     end
   end
end

listener = MyListener.new
source = File.new "mydoc.xml"
REXML::Document.parse_stream(source, listener)
---

Here's a sample file:

---

<FOO>
   <B110><B110>
     <SOMETHINGELSE>
       <PDAT>This is the text you want</PDAT>This isn't.
     </SOMETHINGELSE>
   </B110>
     <PDAT>This is sneaky good text</PDAT>
   </B110>
   <PDAT>This is bad text</PDAT>
</FOO>

---

Output:

---

This is the text you want
This is sneaky good text

---

Note that this uses the native REXML stream parser API, not the SAX2 
clone, because the SAX2 clone is slower according to the documentation.

Disclaimers:

The above code is only lightly tested. Although a stream parser should 
theoretically be the fastest option, I haven't actually benchmarked it 
against (say) the pull parser. (Which is also documented as having an 
unstable API, so personally I'd avoid it anyway.)

The above code will break if you have a PDAT somewhere inside a PDAT. 
I'm assuming that's not allowed. If it is, you'll have to make your text 
buffer be a stack of strings rather than a simple string, append to 
@textbuffer[@inside['PDAT']], and take the performance hit.

Also, if you need to do more elaborate selection of which elements to 
process, you'll obviously need to make changes to how the current 
position is tracked... e.g. implementing "process PDAT elements only if 
they are not buried more than 2 other elements deep inside a B110 
element" is left as an exercise for the reader :-)

 > (started doing this by parsing the file line by line, however,
> ran into malformed XML where I decided that I needed to use the database 
> functionality of XML.

If by "malformed XML" you mean syntactically invalid XML, such as 
unescaped < > characters, then you may be hosed, as REXML's parsers will 
likely choke on it.


mathew
-- 
<URL:http://www.pobox.com/~meta/>

In This Thread

Prev Next