Re: Looking for suggestions processing and comparing 2 very large files

From: Dave Aronson <rubytalk2dave@...>
Date: 2012-10-22 18:57:01 UTC
List: ruby-talk #400509
On Mon, Oct 22, 2012 at 2:21 PM, Ruby Student <ruby.student@gmail.com> wrote:

> Every week I get a large file, over 50 millions records

The big question is... are these files SORTED, preferably on some
UNIQUE key, or at least some order that will remain the same from week
to week?  If yes, then you can use the same sort of techniques as in
the "diff" utility found on every Unix-derived system and many others.
(Windows has something similar but the name escapes me at the moment.
IIRC, in an ironic twist, this is one of those cases where the
Windows command has a *more* cryptic name than its Unix cognate.)  How
to make a "diff" type program has been covered in gazillions of
blog/magazine articles, textbooks, etc., so I won't go into detail.
If you're lucky, you might even be able to just use the ones existing
on your system, with some shell scripting for glue.
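
For the sorted case, here is a minimal sketch of the merge-style walk in
Ruby. It is not what "diff" itself does internally, just a simpler
comparison that works when both files are sorted on a unique key. The
record layout ("key,rest" on each line), the helper names key_of,
next_or_nil and diff_sorted, and the change labels are all assumptions
made up for illustration. It also assumes the files were sorted in plain
byte order, since that is how Ruby compares Strings.

```ruby
# Assumed record layout (hypothetical): "key,rest-of-record", one per line.
def key_of(line)
  line.split(",", 2).first
end

# Returns the next item from an Enumerator, or nil when it is exhausted.
def next_or_nil(enum)
  enum.next
rescue StopIteration
  nil
end

# Walks two inputs that are both sorted by unique key, in one pass each,
# and returns a list of [change_type, record] pairs.
def diff_sorted(old_lines, new_lines)
  changes = []
  old_enum = old_lines.each
  new_enum = new_lines.each
  old = next_or_nil(old_enum)
  new = next_or_nil(new_enum)

  while old || new
    if new.nil? || (old && key_of(old) < key_of(new))
      # Key exists only in last week's file.
      changes << [:removed, old]
      old = next_or_nil(old_enum)
    elsif old.nil? || key_of(new) < key_of(old)
      # Key exists only in this week's file.
      changes << [:inserted, new]
      new = next_or_nil(new_enum)
    else
      # Same key in both files: report only if the record changed.
      changes << [:updated, new] unless old == new
      old = next_or_nil(old_enum)
      new = next_or_nil(new_enum)
    end
  end

  changes
end
```

With real files you would pass something like File.foreach(path) for
each argument, so only one line per file is held in memory at a time,
and push each change to whatever processing you need instead of
collecting them all in an Array.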

On the other claw, if the records are in random order, then you've got
a much more serious problem.  In that case, ASSUMING that the keys,
and number of updated/duplicated records, are both quite small, off
the top of my head I think I'd:

- Extract the keys from last week's file
- Ditto for this week's
- Sort those, assuming the keys are sufficiently smaller than the full records that sorting is reasonable
- Diff them.
- Extract the actual records from both weeks for any matching keys.
- Sort and diff, under the same assumption.

Or, if the above data sets are not small enough to make sorting
reasonable, but the potential dups might at least fit in RAM:

- Extract last week's keys into a Set
- Initialize a "Needs Further Inspection" (NFI) Set
- Iterate over this week's records:
  = Try to find the key in last week's Set of keys
  = If seen, remove from last week's Set and add to NFI Set
  = Else process as an Insertion
- Anything left in last week's Set is a Removal
- (You can now get rid of last week's Set of keys)
- Extract last week's full records matching NFI keys,
  putting them in a hash keyed by the key
- Extract this week's records matching NFI keys,
  looking them up in the hash
- Compare the entire records, processing as either
  Duplicate or Update as needed
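
The Set-based steps above could be sketched in Ruby roughly like this.
Again the "key,rest" record layout, the names key_of and classify, and
the result hash are assumptions for illustration only. It assumes keys
are unique within each week's file, and that each argument can be
iterated more than once (File.foreach(path) would re-read the file on
each pass).

```ruby
require "set"

# Assumed record layout (hypothetical): "key,rest-of-record", one per line.
def key_of(line)
  line.split(",", 2).first
end

def classify(old_records, new_records)
  result = { inserted: [], removed: [], updated: [], duplicate: [] }

  # Extract last week's keys into a Set.
  old_keys = Set.new
  old_records.each { |rec| old_keys << key_of(rec) }

  # First pass over this week's records: split into insertions and
  # keys that need further inspection (NFI).
  nfi = Set.new
  new_records.each do |rec|
    key = key_of(rec)
    if old_keys.delete?(key)   # delete? returns nil if the key was absent
      nfi << key
    else
      result[:inserted] << rec
    end
  end

  # Anything left in last week's Set is a removal (keys only).
  result[:removed] = old_keys.to_a
  old_keys = nil               # no longer needed

  # Pull last week's full records for the NFI keys into a hash.
  old_by_key = {}
  old_records.each do |rec|
    key = key_of(rec)
    old_by_key[key] = rec if nfi.include?(key)
  end

  # Second pass over this week's records: compare the entire records.
  new_records.each do |rec|
    old = old_by_key[key_of(rec)]
    next unless old
    (old == rec ? result[:duplicate] : result[:updated]) << rec
  end

  result
end
```

This only helps if the key Set and the NFI records really do fit in
RAM, as assumed above. For a real run you would probably process each
record as it is classified rather than collecting them in Arrays.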

-Dave

-- 
Dave Aronson, the T. Rex of Codosaurus LLC,
secret-cleared freelance software developer
taking contracts in or near NoVa or remote.
See information at http://www.Codosaur.us/.
