 dd4nyc
|
|
| Total Posts: 52 |
| Joined: Aug 2005 |
| |
|
What tools can I use to analyze relatively large data sets? I tried running a linear regression in MATLAB (3.8 million observations, about 200 variables) but got no further than an out-of-memory exception.
I looked briefly into RevoScaleR, but I cannot find in the documentation how to get it onto a cloud or how much it would cost.
Are there any tools that would let me run really large linear regressions / robust regressions (locally or in the cloud)?
|
|
|
|
 |
 homdol
|
|
| Total Posts: 9 |
| Joined: Jul 2008 |
| |
|
You could give KNIME a try: knime.org
Here you can import your data, choose the necessary statistics node, and define the output. All this is done visually.
A similar product is RapidMiner. |
|
|
 |
 Scotty
|
|
| Total Posts: 660 |
| Joined: Jun 2004 |
| |
|
Python - PyTables |
“Whatever you do, or dream you can, begin it. Boldness has genius and power and magic in it.” |
|
|
 |
 FatChoi
|
|
| Total Posts: 106 |
| Joined: Feb 2008 |
| |
|
PyTables is great for handling data, but you will also need an approach to estimation that makes use of the efficiency and flexibility of the HDF5 datastores that PyTables gives you. For that, this might help, but there are other approaches. For the specific linear regression you mention, you probably only need to find a way of chunking it to use memory more efficiently. If you have further plans and want to get on the cloud easily, you could try PiCloud, which gives you seamless cloud integration with Python, though you will have to come up with your own parallel algorithms. For more direct control of infrastructure, StarCluster is very easy to use. |
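A minimal sketch of the chunking idea, assuming plain NumPy and an ordinary least-squares fit via the normal equations. The `synthetic_chunks` generator is a hypothetical stand-in for reading blocks of rows from an HDF5 table; only the small X'X and X'y accumulators stay in memory, never the full data set.

```python
import numpy as np

def chunked_ols(chunks, n_features):
    """Fit OLS by accumulating X'X and X'y one chunk at a time."""
    xtx = np.zeros((n_features, n_features))
    xty = np.zeros(n_features)
    for X, y in chunks:
        xtx += X.T @ X
        xty += X.T @ y
    # Solve the normal equations; lstsq tolerates a rank-deficient X'X.
    beta, *_ = np.linalg.lstsq(xtx, xty, rcond=None)
    return beta

def synthetic_chunks(n_chunks, rows_per_chunk, beta_true, seed=0):
    """Hypothetical data source standing in for chunked reads from disk."""
    rng = np.random.default_rng(seed)
    for _ in range(n_chunks):
        X = rng.standard_normal((rows_per_chunk, beta_true.size))
        y = X @ beta_true + 0.1 * rng.standard_normal(rows_per_chunk)
        yield X, y

beta_true = np.arange(1.0, 6.0)  # coefficients 1, 2, 3, 4, 5
beta_hat = chunked_ols(synthetic_chunks(20, 5_000, beta_true), n_features=5)
print(beta_hat)
```

With 200 variables the accumulators are only 200 x 200 and 200 x 1, so memory use is set by the chunk size rather than the number of observations. One caveat: forming X'X squares the condition number, so for badly conditioned data an incremental QR approach may be the safer choice.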
|
|
 |
 nnja
|
|
| Total Posts: 229 |
| Joined: Jul 2007 |
| |
|
| You can do regressions of that size in R; you just need a 64-bit operating system and more memory. I regularly run stepwise non-linear regressions on >10MM data points and 150+ variables with 40GB of memory. Currently I use an actual rack server dedicated to this application, but I started on a budget, using a regular workstation with as much memory as would fit on the board (IIRC, a few hundred dollars' worth) and a Fedora Linux distro. Using ATLAS helps speed things up. |
I don't always test code, but when I do, I prefer it to be in production. |
|
|
 |
 h0h0
|
|
| Total Posts: 15 |
| Joined: Apr 2010 |
| |
|
| Vowpal Wabbit will take care of you |
|
|
 |
|
dd4, are you using 32-bit MATLAB? I'd be surprised if the 64-bit version were not up to the task when run on a beefy box like the one nnja described.
Also, you mentioned robust regressions, which are iterative and therefore slower than OLS regression. You might be looking at prohibitive running times for a problem as large as yours. |
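To illustrate why robust regression costs more than OLS, here is a rough sketch of one common approach: Huber regression fitted by iteratively reweighted least squares (IRLS). The function name, the tuning constant 1.345, and the synthetic data are illustrative assumptions, not anything from the posts above. Each pass is a full weighted least-squares solve, so the running time is roughly the OLS cost multiplied by the number of passes.

```python
import numpy as np

def huber_irls(X, y, delta=1.345, max_iter=50, tol=1e-8):
    """Huber regression via iteratively reweighted least squares."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the OLS fit
    for iteration in range(1, max_iter + 1):
        resid = y - X @ beta
        # Robust scale estimate from the median absolute deviation.
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        scale = max(scale, 1e-12)
        u = np.abs(resid) / scale
        # Huber weights: 1 inside the threshold, delta/|u| outside it.
        w = np.where(u <= delta, 1.0, delta / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        if np.max(np.abs(new_beta - beta)) < tol:
            return new_beta, iteration
        beta = new_beta
    return beta, max_iter

# Synthetic example: intercept 2, slope -3, with 5% gross outliers.
rng = np.random.default_rng(1)
n = 2_000
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
beta_true = np.array([2.0, -3.0])
y = X @ beta_true + 0.5 * rng.standard_normal(n)
y[:100] += 50.0  # contaminate the first 100 observations

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
beta_huber, n_passes = huber_irls(X, y)
print("OLS:  ", beta_ols)
print("Huber:", beta_huber, "after", n_passes, "passes")
```

The outliers drag the OLS intercept well away from 2, while the Huber fit stays close to it, at the price of several least-squares solves instead of one.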
|
|
|
 |
 dd4nyc
|
|
| Total Posts: 52 |
| Joined: Aug 2005 |
| |
|
| Thank you Athletico, I'm running the 64-bit version, but based on your comment maybe I don't have enough RAM, or maybe something is not configured properly. |
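A quick back-of-the-envelope check on the RAM question, assuming the data are held as a dense matrix of 8-byte doubles (the default numeric type in MATLAB). The variable names are illustrative only.

```python
n_obs = 3_800_000      # observations, from the original post
n_vars = 200           # variables, from the original post
bytes_per_double = 8   # assumption: dense double-precision storage

design_bytes = n_obs * n_vars * bytes_per_double
gram_bytes = n_vars * n_vars * bytes_per_double

print(f"design matrix X: {design_bytes / 1e9:.2f} GB")  # 6.08 GB
print(f"Gram matrix X'X: {gram_bytes / 1e6:.2f} MB")    # 0.32 MB
```

So the raw data alone need about 6 GB, and a dense solver will typically want at least one more working copy of that size. That can exhaust a machine with 8 GB or so even under a 64-bit build.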
|
|
 |
 Hansi
|
|
| Total Posts: 189 |
| Joined: Mar 2010 |
| |
|
| Might want to check biganalytics : https://sites.google.com/site/bigmemoryorg/home/biganalytics |
|
|
|
 |
 gnarsed
|
|
| Total Posts: 70 |
| Joined: Feb 2008 |
| |
|
I agree with nnja. 64-bit R can handle problems this size, though you should probably be a fairly advanced user to do this comfortably. I would just give it a try. If you are going to be working on larger problems in the long run, it is worth exploring other approaches, but the ability to do this in R will still be very helpful because of the rapid interactive prototyping you can do there. |
|
|
 |
 nikol
|
|
| Total Posts: 311 |
| Joined: Jun 2005 |
| |
|
Try http://www.hdfgroup.org/
MATLAB has tools to manipulate HDF files. There is also a library that you can link to R (I do not know R, so this is just a guess).
My personal conclusion is to use flat files (zipped on the fly) and a matching directory structure (data\dateYY\dateMM\dateDD\underlying.zip). The main reason is that it saves me some development effort. |
|
|
|
 |