Re: print - and strip text between tags using Nokogiri

From: Paul Mena <lists@...>
Date: 2012-12-16 16:11:50 UTC
List: ruby-talk #402257
I want to thank everyone for their quick replies and helpful
suggestions. I realized that I should probably be using the real (and
admittedly poorly formed) HTML for this question, not the test HTML
I tried to concoct for this example. The real HTML was generated by
the Hypermail program, which converts an email from mbox form to
HTML. Here is one such file:


<html>
<head>
<title>haiku_archive: watching the news</title>
<meta name="Author" content="Paul David Mena (pauldavidmena@gmail.com)">
<meta name="Subject" content="watching the news">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<h1>watching the news</h1>
<strong>From:</strong> Paul David Mena (<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=&lt;CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com&gt;"><em>pauldavidmena@gmail.com</em></a>)<br>
<strong>Date:</strong> Fri Dec 14 2012 - 18:51:14 EST
<p>
<hr noshade><p>
<!-- body="start" -->
<p>
watching the news
<br>
I feel guilty
<br>
for being alive
<br>
<p><pre>
--
Paul David Mena
--------------------
<a
href="mailto:pauldavidmena@gmail.com?Subject=Re:%20watching%20the%20news&In-Reply-To=&lt;CAOJ9yjPRsvJ8%2BtMKjCeUnKGKcuHGQ3kuakE%2BL%2BHS1gWCMEh8jQ@mail.gmail.com&gt;">pauldavidmena@gmail.com</a>
</pre>
<p><!-- body="end" -->
</body>
</html>


My ultimate goal is to extract all of the text between the
<!-- body="start" --> and <!-- body="end" --> comments, but *not* what
is between the two "pre" tags.  So far I've been able to extract all of
that text but not exclude the "pre" text, using the following code:


#!/usr/bin/env ruby

require "rubygems"
require "nokogiri"

class PlainTextExtractor < Nokogiri::XML::SAX::Document

  attr_reader :plaintext

  # Initialize the state of interest variable with false
  def initialize
    @interesting = false
    @plaintext = ""
  end

  # This method is called whenever a comment occurs, and
  # the comment's text is passed in as a string.
  def comment(string)
    case string.strip       # strip leading and trailing whitespace
    when /^body="start"/    # match starting comment
      @interesting = true
    when /^body="end"/      # match closing comment
      @interesting = false
    end
  end

  # This callback method is called with the text found
  # between tags.
  def characters(string)
    @plaintext << string if @interesting
  end
end

# parse the HTML file named on the command line
pte = PlainTextExtractor.new
parser = Nokogiri::HTML::SAX::Parser.new(pte)
parser.parse_file ARGV[0]
# write to the screen (disabled)
# puts pte.plaintext

# write to a file
begin
  File.open("snippet.txt", "w") { |file| file.write pte.plaintext }
rescue IOError, SystemCallError => e
  # e.g. the directory is not writable
  warn "could not write snippet.txt: #{e.message}"
end

# get the date written
fname = ARGV[0]
start_column = 3
end_column = 5

target_range = (start_column-1)..(end_column-1)

IO.foreach(fname) do |line|
  if line.match(/<strong>Date:<\/strong>/)
    pieces = line.split(" ")
    puts pieces[target_range].join("-")
  end
end

# remove blank lines from file
fh = File.open('snippet.txt')
while( !fh.eof)
    line = fh.readline.chomp
    # remove leading and trailing blanks
    line.strip!
    # skip empty lines
    next if line == ''
    # convert tab chars to blanks
    line.gsub!(/\t/,' ')
    # substitute a single blank for a sequence of blanks
    line.squeeze!(' ')
    # add code to process line if needed
    puts line
end
fh.close
exit(0)


The output is as follows:

pablo@cochituate=> ./extract_haiku.rb /export/www/html/haikupoet/archive/0925.html
watching the news
I feel guilty
for being alive
--
Paul David Mena
--------------------
pauldavidmena@gmail.com


Basically I want to omit the signature (everything from the "--" line
onward), which is wrapped in the "pre" tags.
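
One possible approach, offered as a sketch rather than a tested fix: a SAX
handler also receives start_element and end_element callbacks, so it can
count the "pre" elements that are currently open and ignore character data
while that count is above zero. The class name PreAwareExtractor and the
@pre_depth counter below are made up for illustration, and the events are
fed in by hand to stand in for the parser, so the snippet runs without
Nokogiri installed.

```ruby
# Hypothetical handler: the same callbacks as PlainTextExtractor, plus
# start_element/end_element to track whether we are inside <pre>.
# For real use it would be declared as
#   class PreAwareExtractor < Nokogiri::XML::SAX::Document
# The superclass is omitted here so the sketch runs without the gem.
class PreAwareExtractor
  attr_reader :plaintext

  def initialize
    @interesting = false  # true between the body="start" and body="end" comments
    @pre_depth   = 0      # number of <pre> elements currently open
    @plaintext   = ""
  end

  # Toggle the region of interest on the two marker comments.
  def comment(string)
    case string.strip
    when /\Abody="start"/ then @interesting = true
    when /\Abody="end"/   then @interesting = false
    end
  end

  # Count <pre> openings; other elements are ignored.
  def start_element(name, attrs = [])
    @pre_depth += 1 if name.downcase == "pre"
  end

  # Count <pre> closings, never dropping below zero.
  def end_element(name)
    @pre_depth -= 1 if name.downcase == "pre" && @pre_depth > 0
  end

  # Keep text only inside the marked region and outside any <pre>.
  def characters(string)
    @plaintext << string if @interesting && @pre_depth.zero?
  end
end

# Feed the handler by hand, roughly in the order a SAX parser would
# report the relevant parts of the haiku page.
handler = PreAwareExtractor.new
handler.comment(' body="start" ')
handler.characters("watching the news\n")
handler.characters("I feel guilty\n")
handler.characters("for being alive\n")
handler.start_element("pre")
handler.characters("--\nPaul David Mena\n")
handler.start_element("a")
handler.characters("pauldavidmena@gmail.com")
handler.end_element("a")
handler.end_element("pre")
handler.comment(' body="end" ')
handler.characters("text after the body\n")

puts handler.plaintext
```

In the real script the class would inherit from
Nokogiri::XML::SAX::Document, as PlainTextExtractor does, and be passed to
Nokogiri::HTML::SAX::Parser.new in the same way. How the parser reports the
malformed <p><pre> nesting in Hypermail's output has not been checked here.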

-- 
Posted via http://www.ruby-forum.com/.
