Re: file.seek and unused bytes

From: Greg Willits <lists@...>
Date: 2009-07-06 12:41:56 UTC
List: ruby-talk #340803
Eeek. Opened a can of worms!

It's 5:30 am where I am (I'm "still up" and not "up early"), so I'll hit 
the highlights, and detail more later if for some reason there's an 
interest.

-- application: data aggregation

-- I get data from county services and school districts. They have no 
way to correlate and aggregate their data. That's what I am doing. I 
match kids that are flagged in county services for whatever reason and 
have to match them up with school records, health records, judicial 
records where applicable, etc.

-- I get dozens of CSV files of various subjects. Every school 
district's CSV file for any subject (attendance, grades, etc) has 
similar content but not identical. Ditto for demographics from various 
agencies.

-- before I can even attempt aggregation, I need to normalize all this 
data so it can be indexed and analyzed, and I ensure the data is 
cleaned and standardized for display.

-- there's a top layer DSL where I describe data sources and how to 
transform any one particular raw data file into a normalized data file. 
There's another set of descriptions to allow any one field to come from a 
number of possible sources. So, something like birthcity might come from 
data source X, but if not available, check data source Y, etc.
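
A minimal sketch of that kind of source precedence, not the actual DSL. The names FIELD_SOURCES, resolve_field, :source_x and :source_y, and the sample value, are all hypothetical:

```ruby
# Hypothetical precedence table: for each normalized field, the data
# sources to try, in order.
FIELD_SOURCES = {
  :birthcity => [:source_x, :source_y]
}

# Returns the first non-blank value for +field+, walking the sources in
# precedence order. +records_by_source+ maps a source name to that
# source's record (a Hash) for one person.
def resolve_field(field, records_by_source)
  FIELD_SOURCES.fetch(field).each do |source|
    record = records_by_source[source]
    next if record.nil?
    value = record[field]
    return value unless value.nil? || value.to_s.strip.empty?
  end
  nil
end

# Source X has a blank birthcity, so the value falls through to source Y.
records = {
  :source_x => { :birthcity => "" },
  :source_y => { :birthcity => "Fresno" }
}
city = resolve_field(:birthcity, records)
```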

-- this process has been abstracted into a data aggregation core to do 
the normalization, the indexing, and other tasks, with an 
application-specific layer to handle the definition of transformations, 
indexes that are needed, order of precedence, and the stitching of data 
from various sources into records.

-- So, this particular step I've been talking about is where a raw CSV 
file undergoes normalization by reorganizing the fields of each record 
into a common structure for that given data topic, each field undergoes 
some scrubbing (character case, packing phones, normalizing date 
formats, translation of codes into strings, etc).
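
To illustrate the sort of per-field scrubbing meant here, a rough sketch follows. The helper names, the code table, and the assumed input date format (month/day/year) are invented for the example:

```ruby
require 'date'

# Hypothetical code table; real tables would come from each data source.
GRADE_CODES = { "K" => "Kindergarten", "01" => "First Grade" }

# Normalizes character case: "  jOHN   SMITH " becomes "John Smith".
def scrub_name(value)
  value.to_s.strip.split.map { |word| word.capitalize }.join(" ")
end

# "Packs" a phone number by dropping everything that is not a digit.
def pack_phone(value)
  value.to_s.gsub(/\D/, "")
end

# Rewrites a date from an assumed input format into YYYY-MM-DD.
def normalize_date(value, input_format = "%m/%d/%Y")
  Date.strptime(value.to_s.strip, input_format).strftime("%Y-%m-%d")
end

# Translates a code into its string, leaving unknown codes untouched.
def translate_code(value, table = GRADE_CODES)
  table.fetch(value.to_s.strip, value)
end
```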

-- raw data files range from a handful of columns to a couple dozen 
columns. From a few hundred rows to a couple million rows.

-- data is an array of hashes

-- by the time we get done normalizing a particular raw source, it can 
hit the 4GB memory limit any one ruby fork has available to play with 
(many forks run in parallel on multiple cores)

-- while most CSV files work out just fine reading into RAM, 
transforming them 100% in RAM, and then writing in a simple single step, 
many files do not.

-- so we have updated the process to load X records from the raw file, 
transform X records in memory, then write X records to disk, and loop 
until all records are done.
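
The load/transform/write loop could look roughly like this. It is a sketch that treats each input line as one record and leaves the transformation to a block; process_in_chunks is a made-up name, and real CSV parsing is left out:

```ruby
# Reads +input_path+ +chunk_size+ lines at a time, hands each chunk to the
# block, and writes whatever the block returns to +output_path+.
# Only one chunk of lines is held in memory at any time.
# Returns the number of chunks processed.
def process_in_chunks(input_path, output_path, chunk_size)
  chunks = 0
  File.open(output_path, "w") do |out|
    # File.foreach without a block returns an enumerator, so each_slice
    # pulls lines from disk lazily instead of slurping the whole file.
    File.foreach(input_path).each_slice(chunk_size) do |lines|
      transformed = yield(lines)
      transformed.each { |line| out.puts(line) }
      chunks += 1
    end
  end
  chunks
end
```

With 2,000,000 input lines and a chunk_size of 200,000, the block would be called ten times.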

-- and we have to deal with indexes, duplicates, and other issues in 
there as well

-- imagine 2,000,000 raw records from one file which get processed in 
200,000 record chunks, but output back to another single file.

-- as I step through each chunk of 200,000 records, I can get the 
longest length of that 200,000, and I can store that, but I can't know 
what the longest length is for the next 200,000 that I haven't loaded 
yet.

-- having processed and written the first 200,000 results to disk, and 
then determining the length of the second 200,000, I'm not going to go 
back and change the first 200,000 just to make them all the same. 
There's no value in that at all.

-- So, when I get done with each 200,000 chunk, I now have a series of 
integers which tells me the length of the records in each chunk of 
200,000 rows.

-- the file has already been written, so again, I'm not going to go back 
and move everything to insert this data at the top (which is where I 
would put it if indeed every record was the same length)

-- so, I put this data at the end.

-- BTW, the row lengths are similar enough that the disk efficiency is 
not an issue.

-- I read this last line using tail, and I strip off the leading empty 
bytes (if any) as I described earlier

-- it's a couple of very simple calculations to convert any "index" 
position into the exact byte seek position to find a specific record.
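
One way that calculation could go, as a sketch under stated assumptions: every chunk holds exactly chunk_size rows (only the last may be short), every row in chunk n occupies exactly widths[n] bytes including its newline, and shorter rows are padded with trailing spaces. The method names and the padding scheme are assumptions, not the actual file layout:

```ruby
# Byte offset of record +index+ (0-based). +widths+ is the list of per-chunk
# record lengths, i.e. the series of integers stored at the end of the file.
def record_offset(index, chunk_size, widths)
  chunk, row_in_chunk = index.divmod(chunk_size)
  # Every earlier chunk is full, so it contributes chunk_size * its width.
  bytes_before_chunk = widths.first(chunk).inject(0) { |sum, w| sum + w * chunk_size }
  bytes_before_chunk + row_in_chunk * widths[chunk]
end

# Seeks straight to record +index+ and returns it without its padding.
def read_record(file, index, chunk_size, widths)
  chunk = index / chunk_size
  file.seek(record_offset(index, chunk_size, widths), IO::SEEK_SET)
  file.read(widths[chunk]).rstrip
end
```

For example, with chunk_size 3 and widths [10, 12], record 4 sits in the second chunk at row 1, so its offset is 3 * 10 + 1 * 12 = 42.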

-- from this point on, the records are read as random access from disk 
because otherwise I would need oodles of GB of data in RAM all at once 
during the aggregation process.

-- is doing 200,000 or even 500,000 at a time in chunks really any 
faster than doing them one at a time? That I actually don't know yet. 
I am just now finishing all the code that this "chunking" touches and 
ensuring I get the same results I used to. The size of the chunks isn't as 
important for speed as it is for memory management: making sure I stay 
within 4GB.

-- as for the speed issues, we've done a lot of profiling, and even 
wrote a mini compiler to read our file transformation DSL and output a 
stream of inline variable declarations and commands which gets included 
as a module on the fly for each data source. That trick saved us from 
parsing the DSL for each data row and literally shaved hours off the 
total processing time. We attacked many other levels of optimization 
while working to keep the code as readable as possible, because it's a 
complicated layering of abstractions and processes.
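
The general shape of that trick, as a rough sketch only. The SPEC format and compile_transform are invented for illustration and much simpler than a real transformation DSL; the point is that the spec is walked once to generate Ruby source, rather than being interpreted for every row:

```ruby
# Hypothetical field spec: output field => [source column index, String
# methods to apply in order].
SPEC = {
  :last_name => [0, [:strip, :upcase]],
  :phone     => [2, [:strip]]
}

# Generates the source of a transform(row) method from the spec and
# evaluates it into a fresh module, which can then be mixed in.
def compile_transform(spec)
  assignments = spec.map do |field, (column, ops)|
    chain = ops.map { |op| ".#{op}" }.join
    "    #{field.inspect} => row[#{column}].to_s#{chain}"
  end
  source = "def transform(row)\n  {\n#{assignments.join(",\n")}\n  }\nend"
  mod = Module.new
  mod.module_eval(source)
  mod
end

# Include the generated module on the fly, then call it once per row.
transformer = Object.new.extend(compile_transform(SPEC))
row = transformer.transform([" smith ", "x", " 555 "])
```

Here row ends up with :last_name set to "SMITH" and :phone set to "555", and no spec lookups happen inside transform itself.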

-- I will look into cdb

-- gw




-- 
Posted via http://www.ruby-forum.com/.
