# Linear Regression

This tutorial demonstrates using pyfora to:

- Load a large CSV file from Amazon S3
- Parse it into a `pandas.DataFrame`
- Run linear regression on the loaded DataFrame
- Download the regression coefficients and intercept back to Python

Important

The example below uses a **large** dataset: a 64GB CSV file that parses into 20GB
of normally distributed, randomly generated floating-point numbers.
It takes about 10 minutes to run on three c3.8xlarge instances in EC2.

You can use the `pyfora_aws` script installed with the pyfora package to easily
set up a pyfora cluster in EC2 using either on-demand or spot instances.

If you prefer to try a (much) smaller version of this example, you can use the 5.2GB dataset
`iid-normal-floats-13mm-by-17.csv` by changing the file name passed to `importS3Dataset()` in the code below.

```
import pyfora
from pyfora.pandas_util import read_csv_from_string
from pyfora.algorithms import linearRegression

print "Connecting..."
executor = pyfora.connect('http://<cluster_manager>:30000')

print "Importing data..."
raw_data = executor.importS3Dataset('ufora-test-data',
                                    'iid-normal-floats-20GB-20-columns.csv').result()

print "Parsing and regressing..."
with executor.remotely:
    data_frame = read_csv_from_string(raw_data)
    predictors = data_frame.iloc[:, :-1]  # every column except the last
    responses = data_frame.iloc[:, -1:]   # the last column

    regression_result = linearRegression(predictors, responses)
    coefficients = regression_result[:-1]
    intercept = regression_result[-1]

print 'coefficients:', coefficients.toLocal().result()
print 'intercept:', intercept.toLocal().result()
```

If you have used `pandas`, the code above should look familiar.
After connecting to a pyfora cluster using `pyfora.connect()`, we import a dataset
from Amazon S3 using `importS3Dataset()`.

The value `raw_data` returned from `importS3Dataset()` is a `RemotePythonObject`
that represents the entire dataset as a string.
The data itself is lazily loaded into memory in the cluster when it is needed.

All the code inside the `with executor.remotely:` block is shipped to the cluster
and executes remotely.

We use `read_csv_from_string()` to read the CSV in `raw_data` and
produce a DataFrame.
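To make the parse-and-split step concrete, here is a small local analogue that uses only the Python 3 standard library instead of pyfora or pandas. It is a sketch under assumptions: the function name `parse_csv_string`, the sample text, and the presence of a header row are all hypothetical and say nothing about how `read_csv_from_string()` is implemented or how the S3 dataset is laid out.

```python
import csv
import io

# Hypothetical in-memory CSV text standing in for raw_data.
csv_text = "x1,x2,y\n1.0,2.0,5.0\n2.0,0.5,4.5\n3.0,1.5,7.5\n"


def parse_csv_string(text):
    """Parse CSV text with a header row into (column_names, rows_of_floats)."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [[float(cell) for cell in row] for row in reader if row]
    return header, rows


header, rows = parse_csv_string(csv_text)

# Same idea as data_frame.iloc[:, :-1] and data_frame.iloc[:, -1:]:
# every column except the last, and the last column kept as a one-column table.
predictors = [row[:-1] for row in rows]
responses = [row[-1:] for row in rows]
```

Slicing with `[-1:]` rather than `[-1]` keeps the response as a one-column table, which matches the shape the tutorial code passes to `linearRegression()`.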

Our regression fits a linear model to predict the last column from the prior ones.
The `linearRegression()` algorithm returns an array with the linear
model’s coefficients and intercept.
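For readers who want to see what such a fit computes, the following is a minimal pure-Python sketch of ordinary least squares via the normal equations. It is not pyfora's implementation: `solve_linear_system`, `linear_regression`, and the sample data are hypothetical names and values chosen for illustration. The only detail borrowed from the tutorial is the result layout, with the coefficients first and the intercept last.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def linear_regression(predictors, responses):
    """Ordinary least squares; returns [coef_1, ..., coef_k, intercept]."""
    # Append a constant 1.0 to each row so the last fitted weight is the intercept.
    design = [list(row) + [1.0] for row in predictors]
    k = len(design[0])
    # Normal equations: (X^T X) w = X^T y
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(row[i] * y for row, y in zip(design, responses))
           for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 2*x1 + 3*x2 + 1.
predictors = [[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]]
responses = [2 * x1 + 3 * x2 + 1 for x1, x2 in predictors]

result = linear_regression(predictors, responses)
coefficients = result[:-1]
intercept = result[-1]
```

On this noise-free data the fit recovers coefficients close to 2 and 3 and an intercept close to 1, up to floating-point error. The sketch does not handle collinear predictors, where the normal equations become singular.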

In the last two lines, outside the `with executor.remotely:` block, we bring some of the values computed
remotely back into the local Python environment.
Values assigned to variables inside the `with executor.remotely:` block are left in the pyfora cluster
by default because they can be very large, much larger than the amount of memory available on your
machine. Instead, they are represented locally using `RemotePythonObject` instances that can be
downloaded using their `toLocal()` function.
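The `.toLocal().result()` chain in the tutorial suggests a future-like return value: the download is requested first and waited on second. The sketch below imitates that pattern with Python 3's `concurrent.futures`. Everything in it is hypothetical, including the `RemoteHandle` class, its `to_local()` method, and the dictionary standing in for cluster memory; it illustrates the idea of an explicit download step and makes no claim about how `RemotePythonObject` works internally.

```python
from concurrent.futures import ThreadPoolExecutor


class RemoteHandle(object):
    """Hypothetical stand-in for a handle to a value that lives elsewhere."""

    def __init__(self, pool, store, key):
        self._pool = pool
        self._store = store  # plays the role of cluster memory
        self._key = key

    def to_local(self):
        # Returns a future immediately; the lookup runs on a worker thread.
        return self._pool.submit(self._store.__getitem__, self._key)


pool = ThreadPoolExecutor(max_workers=1)
cluster_memory = {"intercept": 1.25, "coefficients": [0.5, -2.0]}

intercept = RemoteHandle(pool, cluster_memory, "intercept")
coefficients = RemoteHandle(pool, cluster_memory, "coefficients")

future = intercept.to_local()   # request the download; does not wait
local_intercept = future.result()  # wait until the value has arrived
local_coefficients = coefficients.to_local().result()

pool.shutdown()
```

Holding only a handle until `to_local()` is called means nothing large is copied to the local machine unless the caller asks for it.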