tools for large data sets
     

dd4nyc


Total Posts: 52
Joined: Aug 2005
 
Posted: 2012-06-26 01:46
What tools can I use to analyze relatively large data sets? I tried doing a linear regression in MATLAB (3.8 million observations, about 200 variables) but did not get further than an out-of-memory exception.

Looked briefly into RevoScaleR, but I cannot find in the documentation how to get it onto a cloud or how much it would cost.

Any tools that would allow me to do really large linear regressions / robust regressions (locally or in the cloud)?

homdol


Total Posts: 9
Joined: Jul 2008
 
Posted: 2012-06-26 08:32
You could give KNIME a try:
knime.org

Here you can import your data, choose the necessary statistics node and define the output. All this is done visually.


A similar product is RapidMiner:
Rapid Miner

Scotty


Total Posts: 660
Joined: Jun 2004
 
Posted: 2012-06-26 09:40
Python - pytables

pytables

“Whatever you do, or dream you can, begin it. Boldness has genius and power and magic in it.”

FatChoi


Total Posts: 106
Joined: Feb 2008
 
Posted: 2012-06-26 12:50
PyTables is great for handling data, but you will also need an approach to estimation that makes use of the efficiency and flexibility of the HDF5 datastores that PyTables gives you. For that, this might help, but there are other approaches. For the specific linear regression you mention, you probably only need to find a way of chunking it to use memory more efficiently. If you have further plans and want to get on the cloud easily, you could try PiCloud, which gives you seamless cloud integration with Python but where you'll have to come up with your own parallel algorithms.
For more direct control of infrastructure, Star Cluster is very easy to use.
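
One way to read the "chunking" suggestion is to accumulate the normal-equation sums X'X and X'y one chunk at a time, so that only a k-by-k matrix has to stay in memory rather than the whole data set. The following is a minimal sketch under that assumption, not anything posted in the thread; the function names (accumulate, solve, chunked_ols) and the toy data are hypothetical. Plain Python is used so the sketch has no dependencies; in practice an array library would replace the inner loops.

```python
# Sketch: OLS via chunk-wise accumulation of X'X and X'y.
# All names below are hypothetical, chosen for illustration only.

def accumulate(chunk_rows, xtx, xty):
    """Add one chunk's contribution; each row is (x_values, y)."""
    for x, y in chunk_rows:
        for i, xi in enumerate(x):
            xty[i] += xi * y
            row = xtx[i]
            for j, xj in enumerate(x):
                row[j] += xi * xj

def solve(a, b):
    """Solve a * beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta

def chunked_ols(chunks, k):
    """Fit k coefficients from an iterable of chunks of (x_values, y) rows."""
    xtx = [[0.0] * k for _ in range(k)]
    xty = [0.0] * k
    for chunk in chunks:
        accumulate(chunk, xtx, xty)
    return solve(xtx, xty)

# Toy data: y = 1 + 2*x exactly, delivered in two chunks.
# The first column of ones is the intercept term.
chunks = [
    [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)],
    [([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)],
]
beta = chunked_ols(chunks, k=2)
```

Solving the normal equations directly is generally less numerically stable than a QR-based fit, so this is a starting point for well-conditioned problems rather than a drop-in replacement for a library routine.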

nnja


Total Posts: 229
Joined: Jul 2007
 
Posted: 2012-06-26 14:04
You can do regressions of that size in R; you just need a 64-bit operating system and more memory. I regularly do stepwise non-linear regressions on >10MM data points, 150+ variables with 40GB memory. Currently I use an actual rack server dedicated to this application, but I started on a budget using a regular workstation with as much memory as could fit onboard (IIRC, a few hundred dollars' worth) with a Fedora Linux distro. Using ATLAS helps speed things up.

I don't always test code, but when I do, I prefer it to be in production.

h0h0


Total Posts: 15
Joined: Apr 2010
 
Posted: 2012-06-26 17:44
vowpal wabbit will take care of you

athletico


Total Posts: 894
Joined: Jun 2004
 
Posted: 2012-06-26 18:52
dd4, are you using 32-bit Matlab? I'd be surprised if the 64-bit version was not up to the task if you run it on a beefy box, like the one nnja described.

Also, you mentioned robust regressions, which are slower / iterative compared to OLS regression. You might be looking at prohibitive running times for a problem as large as yours.
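
To illustrate why robust regression is iterative, here is a minimal sketch of iteratively reweighted least squares (IRLS) with Huber weights for a single-predictor line fit. It is an assumption about the kind of method meant, not something stated in the thread: the function names are hypothetical, and the fixed threshold `delta` with no residual scale estimate is a simplification. Each pass is a full weighted least-squares fit, which is where the extra running time comes from.

```python
# Sketch: Huber-weighted IRLS for y = a + b*x.
# Names and the fixed delta are hypothetical simplifications.

def weighted_line_fit(xs, ys, ws):
    """Closed-form weighted least squares for intercept a and slope b."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def huber_irls(xs, ys, delta=1.0, iterations=50):
    """Repeat: fit with current weights, then down-weight large residuals."""
    ws = [1.0] * len(xs)  # with unit weights the first pass is plain OLS
    a, b = weighted_line_fit(xs, ys, ws)
    for _ in range(iterations):
        ws = []
        for x, y in zip(xs, ys):
            r = abs(y - (a + b * x))
            ws.append(1.0 if r <= delta else delta / r)  # Huber weight
        a, b = weighted_line_fit(xs, ys, ws)
    return a, b

# Toy data: y = 1 + 2*x with one gross outlier at x = 9.
xs = [float(x) for x in range(10)]
ys = [1.0 + 2.0 * x for x in xs]
ys[9] = 100.0

a_ols, b_ols = weighted_line_fit(xs, ys, [1.0] * len(xs))
a_rob, b_rob = huber_irls(xs, ys)
```

On this toy data the plain OLS slope is pulled far from 2 by the single outlier, while the reweighted fit stays close to it, at the price of many fits instead of one.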

dd4nyc


Total Posts: 52
Joined: Aug 2005
 
Posted: 2012-06-26 19:22
Thank you Athletico, I'm running the 64-bit version, but based on your comment maybe I don't have enough RAM or maybe something is not configured properly.
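
A back-of-envelope check of the RAM question, assuming the data is held as one dense double-precision matrix (8 bytes per value, which is MATLAB's default numeric type). The helper name is hypothetical, and the figure covers only the raw data matrix; a regression routine will typically need additional working memory on top of it.

```python
# Sketch: memory needed just to hold the design matrix as dense doubles.

def dense_matrix_bytes(rows, cols, bytes_per_value=8):
    """Size in bytes of a dense rows-by-cols matrix."""
    return rows * cols * bytes_per_value

n_bytes = dense_matrix_bytes(3_800_000, 200)  # sizes from the original question
gib = n_bytes / 2**30                         # convert bytes to GiB
print(f"{n_bytes} bytes, about {gib:.2f} GiB")
```

That is roughly 6 GB before any copies are made, so a machine with 8 GB or less could plausibly run out of memory even with 64-bit software.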

Hansi


Total Posts: 189
Joined: Mar 2010
 
Posted: 2012-06-26 21:10
Might want to check biganalytics: https://sites.google.com/site/bigmemoryorg/home/biganalytics

gnarsed


Total Posts: 70
Joined: Feb 2008
 
Posted: 2012-06-27 08:32
I agree with ninja. 64-bit R can handle problems this size, though you should probably be a more advanced user to do this comfortably. I would just give it a try.
If you are going to be working on larger problems in the long run, it is worth exploring other approaches, but the ability to do this in R will still be very helpful due to the rapid interactive prototyping you can do there.

nikol


Total Posts: 311
Joined: Jun 2005
 
Posted: 2012-07-02 07:32

try http://www.hdfgroup.org/

MATLAB has tools to manipulate it. There is a library that you can link to R (I do not know R, so this is just a guess).

My personal conclusion is to use flat files (zipped on the fly) and a matching directory structure (data\dateYY\dateMM\dateDD\underlying.zip). The main reason is that it saves me some development effort.
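
A minimal sketch of that kind of layout, assuming one zipped CSV per underlying per day under year/month/day directories. The function names, the CSV format and the member name inside the archive are hypothetical; the post only describes the directory structure.

```python
# Sketch: zipped flat files in a root/YY/MM/DD/underlying.zip layout.
import csv
import io
import zipfile
from datetime import date
from pathlib import Path

def path_for(root, day, underlying):
    """Build root/YY/MM/DD/underlying.zip for a given date."""
    return (Path(root) / f"{day:%y}" / f"{day:%m}" / f"{day:%d}"
            / f"{underlying}.zip")

def write_rows(root, day, underlying, rows):
    """Write rows as a CSV member inside a compressed zip archive."""
    path = path_for(root, day, underlying)
    path.parent.mkdir(parents=True, exist_ok=True)
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    with zipfile.ZipFile(path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr(f"{underlying}.csv", buf.getvalue())
    return path

def read_rows(root, day, underlying):
    """Read the CSV member back; every field comes back as a string."""
    with zipfile.ZipFile(path_for(root, day, underlying)) as zf:
        text = zf.read(f"{underlying}.csv").decode("utf-8")
    return list(csv.reader(io.StringIO(text)))
```

For example, `write_rows("data", date(2012, 7, 2), "XYZ", rows)` would create `data/12/07/02/XYZ.zip`, and `read_rows` with the same arguments returns the rows. Because the path encodes the date and the underlying, a single day can be loaded without scanning anything else.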
