Working With Data in S3¶
Amazon’s Simple Storage Service (S3) is a highly scalable, durable, general purpose store, that has been around since the original launch of Amazon Web Services (AWS), and is one of their most widely used services.
Whether you run a pyfora cluster in AWS or locally, pyfora lets you work with datasets stored in S3 in much the same way you would use files on your local disk.
Reading From S3¶
pyfora lets you treat files stored in S3 as if they are regular python strings even if they
are much larger than amount of memory available on any machine in your cluster.
importS3Dataset() function creates a
RemotePythonObject that represents the entire content of the
specified file in S3 as a string of bytes, which can then be parsed into different data-structures.
For example, to parse a CSV file in S3 into a
import pyfora import pyfora.pandas_util executor = pyfora.connect('http://<cluster_manager_address>:30000') data_as_string = executor.importS3Dataset('bucket_name', 'path/to/file.csv') with executor.remotely: data_frame = pyfora.pandas_util.read_csv_from_string(data_as_string) # data_frame is a pandas.DataFrame that lives in memory in the pyfora cluster num_of_rows = len(data_frame) # do stuff with data_frame... print "Num of rows:", num_of_rows.toLocal().result()
Writing to S3¶
exportS3Dataset() is used to write strings into S3.
import pyfora executor = pyfora.connect('http://<cluster_manager_address>:30000') with executor.remotely: large_string = 'lots of data ' * 10**9 executor.exportS3Dataset(large_string, 'bucket_name', 'path/to/file.txt')
To access private data in S3, the pyfora cluster must be given credentials with appropriate read
and/or write permissions to the buckets and keys being used.
The pyfora worker service reads AWS credentials from two environment variables:
These are the same variables used by
boto and the AWS CLI tools.
When launching pyfora services in docker containers, you can set these variables as part of the
docker run command. For example:
docker run -d -e AWS_ACCESS_KEY_ID=<key> -e AWS_SECRET_ACCESS_KEY=<secret> ufora/service