This tutorial demonstrates using pyfora to:
- Load a large CSV file from Amazon S3
- Parse it into a DataFrame
- Run linear regression on the loaded DataFrame
- Download the regression coefficients and intercept back to Python
The example below uses a large dataset: a 64GB CSV file that parses into 20GB of normally distributed, randomly generated floating-point numbers. It takes about 10 minutes to run on three c3.8xlarge instances in EC2.
You can use the pyfora_aws script installed with the pyfora package to easily set up a pyfora cluster in EC2 using either on-demand or spot instances.
If you prefer to try a (much) smaller version of this example, you can use the 5.2GB dataset iid-normal-floats-13mm-by-17.csv by modifying line 9 below accordingly.
```python
import pyfora
from pyfora.pandas_util import read_csv_from_string
from pyfora.algorithms import linearRegression

print "Connecting..."
executor = pyfora.connect('http://<cluster_manager>:30000')

print "Importing data..."
raw_data = executor.importS3Dataset('ufora-test-data', 'iid-normal-floats-20GB-20-columns.csv').result()

print "Parsing and regressing..."
with executor.remotely:
    data_frame = read_csv_from_string(raw_data)
    predictors = data_frame.iloc[:, :-1]
    responses = data_frame.iloc[:, -1:]

    regression_result = linearRegression(predictors, responses)
    coefficients = regression_result[:-1]
    intercept = regression_result[-1]


print 'coefficients:', coefficients.toLocal().result()
print 'intercept:', intercept.toLocal().result()
```
If you are familiar with pandas, the code above should look familiar.
After connecting to a pyfora cluster using pyfora.connect() in line 6, we import a dataset from Amazon S3 in line 9 using executor.importS3Dataset().
All the code inside the with executor.remotely: block that starts in line 12 is shipped to the cluster and executes remotely.
In line 13, we use read_csv_from_string() to read the CSV in raw_data and produce a DataFrame.
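To make the parse-and-split step concrete, here is a minimal local sketch that uses only Python's standard csv module instead of pyfora. The helper name split_predictors_and_response and the sample text are hypothetical, and the sketch assumes the first row is a header. It mirrors the iloc[:, :-1] and iloc[:, -1:] slices from lines 14 and 15 on plain lists.

```python
import csv


def split_predictors_and_response(csv_text):
    """Parse CSV text of floats and return (predictors, responses).

    Assumption for this sketch: the first row is a header and is skipped.
    """
    rows = list(csv.reader(csv_text.splitlines()))
    body = rows[1:]  # drop the assumed header row
    table = [[float(cell) for cell in row] for row in body]
    predictors = [row[:-1] for row in table]  # analogous to iloc[:, :-1]
    responses = [row[-1:] for row in table]   # analogous to iloc[:, -1:]
    return predictors, responses


# Hypothetical three-column sample: two predictors and one response.
sample = "x1,x2,y\n1.0,2.0,3.0\n4.0,5.0,6.0\n"
predictors, responses = split_predictors_and_response(sample)
```

Note that the response slice keeps a one-element list per row, just as iloc[:, -1:] keeps a one-column DataFrame rather than a flat series.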
Our regression fits a linear model to predict the last column from the prior ones. The linearRegression() algorithm returns an array with the linear model’s coefficients and intercept.
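The sketch below shows one way such a result array can be produced. It is a dependency-free ordinary least-squares fit, not pyfora's actual linearRegression() implementation. The function name fit_linear_model and the toy data are hypothetical. It appends a constant column to the predictors, solves the normal equations by Gaussian elimination, and returns the coefficients followed by the intercept, which is the layout the tutorial unpacks with regression_result[:-1] and regression_result[-1].

```python
def fit_linear_model(predictors, responses):
    """Least-squares fit; returns [coef_1, ..., coef_k, intercept]."""
    # Append a constant 1.0 so the last solved weight is the intercept.
    X = [list(row) + [1.0] for row in predictors]
    n = len(X[0])

    # Build the augmented normal equations [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(n)] +
         [sum(r[i] * y for r, y in zip(X, responses))]
         for i in range(n)]

    # Forward elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for row in range(col + 1, n):
            factor = A[row][col] / A[col][col]
            for k in range(col, n + 1):
                A[row][k] -= factor * A[col][k]

    # Back substitution.
    weights = [0.0] * n
    for i in reversed(range(n)):
        known = sum(A[i][j] * weights[j] for j in range(i + 1, n))
        weights[i] = (A[i][n] - known) / A[i][i]
    return weights


# Toy data generated from y = 2*x1 - 3*x2 + 5 with no noise.
predictors = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
responses = [2 * x1 - 3 * x2 + 5 for x1, x2 in predictors]

regression_result = fit_linear_model(predictors, responses)
coefficients = regression_result[:-1]
intercept = regression_result[-1]
```

Because the toy responses are an exact linear function of the predictors, the fit should recover coefficients close to 2 and -3 and an intercept close to 5, up to floating-point error. Unlike the tutorial, the responses here are a flat list rather than a one-column DataFrame.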
In lines 22 and 23, outside the with executor.remotely: block, we bring some of the values computed remotely back into the local Python environment.
Values assigned to variables inside the with executor.remotely: block are left in the pyfora cluster by default because they can be very large, much larger than the amount of memory available on your machine. Instead, they are represented locally using proxy instances whose values can be downloaded using their toLocal() method, as in lines 22 and 23.
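The .toLocal().result() call chain in lines 22 and 23 suggests a future-style API: toLocal() starts the download and result() waits for it. The following is a purely local analogy built on Python's concurrent.futures module, not pyfora's implementation. The class RemoteValueStub is hypothetical, and only the method name toLocal() is borrowed from the tutorial code.

```python
from concurrent.futures import ThreadPoolExecutor


class RemoteValueStub(object):
    """Hypothetical local stand-in for a value that lives on a cluster."""

    # A single shared worker thread plays the role of the remote side.
    _pool = ThreadPoolExecutor(max_workers=1)

    def __init__(self, compute):
        # Hold only a recipe for the value, not the value itself.
        self._compute = compute

    def toLocal(self):
        # Return a future immediately; the "download" runs in the background.
        return self._pool.submit(self._compute)


intercept = RemoteValueStub(lambda: 5.0)
future = intercept.toLocal()   # does not block
local_value = future.result()  # blocks until the value is available
```

The point of the pattern is that holding the stub costs almost nothing locally, and only the values you explicitly request are transferred.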