The Swift filesystem for Hadoop (swiftfs, for short) is a Hadoop file system implementation that allows applications such as MapReduce, Pig, and Hive to read and write directly to containers in an OpenStack Swift object store such as Rackspace Cloud Files. A collaborative effort between Rackspace, Hortonworks, and Mirantis, this work was done as part of HADOOP-8545 and merged into Hadoop in version 2.3.0.
Why is swiftfs important?
swiftfs separates the compute resources of the cluster from its storage resources, allowing each to have a different life span. This separation is beneficial if you need long-term data storage but only periodically need compute resources to process that data.
Also, if you are already using Rackspace Cloud Files to store your data, you can process it in place without copying it into your cluster’s Hadoop Distributed File System (HDFS).
How do I use swiftfs?
File system URLs for Swift take the following form:
swift://acontainer.aservice/path/to/files
The different parts of the URL are explained in the following table:
|swift://||The prefix that passes file system requests to the Swift file system.|
|acontainer||The name of the container in Swift that contains the objects to be accessed. Container names must conform to RFC 952 restrictions for hostnames, that is, the characters A-Z, the numbers 0-9, and the hyphen (-). Nonconforming container names are inaccessible by swiftfs.|
|aservice||A user-friendly “service” name. A service name maps to a collection of configuration entries in the Hadoop core-site.xml file that specify where the container is located (for example, rackspace-dfw).|
|/path/to/files||The name of the object or objects in Swift to be referenced. Although Swift doesn’t support paths, swiftfs attempts to interpret names that look like paths and behave appropriately. For example, an input path named
Using Pig to read data from Swift:
A = LOAD 'swift://logdata.rack-dfw/2013/10/logfile-2013-10-01.txt' AS (a, b, c);
B = FOREACH A GENERATE a, null;
Copying from HDFS to a Swift container named myfiles in the Chicago (ORD) region:
hadoop fs -cp /user/joesmith/files/* swift://myfiles.rack-ord/
If your cluster is in a different region than your container, Hadoop generates traffic over the public (billable) network. To minimize cost and maximize performance, keep your Hadoop cluster and Cloud Files containers in the same region.
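The URL form described earlier can be illustrated with a small sketch that splits a swift:// URL into its container, service, and path parts and rejects container names that use characters outside the set listed in the table. This is not part of swiftfs. The function name parse_swift_url is hypothetical, splitting the host on the first dot is an assumption, and the check covers only the characters named above rather than the full RFC 952 rules.

```python
import re
from urllib.parse import urlsplit

# Letters, digits, and hyphens only, matching the container naming rule
# described in the URL table (not a full RFC 952 validator).
_CONTAINER_RE = re.compile(r"^[A-Za-z0-9-]+$")


def parse_swift_url(url):
    """Split swift://container.service/path into (container, service, path).

    Hypothetical helper for illustration; swiftfs does its own parsing.
    """
    parts = urlsplit(url)
    if parts.scheme != "swift":
        raise ValueError("not a swift:// URL: %r" % url)
    # Assumption: the host portion is "<container>.<service>", split on the first dot.
    container, sep, service = parts.netloc.partition(".")
    if not sep or not container or not service:
        raise ValueError("expected swift://container.service/path, got %r" % url)
    if not _CONTAINER_RE.match(container):
        raise ValueError("container name %r would be inaccessible to swiftfs" % container)
    return container, service, parts.path


if __name__ == "__main__":
    print(parse_swift_url("swift://logdata.rack-dfw/2013/10/logfile-2013-10-01.txt"))
```

A check like this can catch a nonconforming container name, such as one that contains an underscore, before a job is submitted.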
How do I configure swiftfs?
By default, Cloud Big Data clusters are preconfigured with service names for all of the Cloud Files regions. Currently these service names are as follows:
Each of these services is seeded with the cloud credentials (user name and API key) from your Cloud Big Data profile, if you supplied them. Also, each service is configured appropriately to use the public or private network, depending on the Swift region and the location of your cluster.
How do I add services?
You might find that you need additional service names, perhaps to use different credentials or a different Swift endpoint. If so, you need to add the following entries for your new service to the Hadoop /etc/hadoop/conf/core-site.xml file on each node in your cluster. You also need to restart your cluster services after updating the configuration.
|fs.swift.service.aservice.auth.url||The keystone endpoint to authenticate against.|
|fs.swift.service.aservice.tenant||The tenant ID to use during authentication.|
|fs.swift.service.aservice.username||The username to authenticate with.|
|fs.swift.service.aservice.password||The password to authenticate with. Alternatively, you can use an API key for authentication.|
|fs.swift.service.aservice.apikey||The API key to authenticate with. Using an API key is an alternative to using a password; you must supply one or the other.|
|fs.swift.service.aservice.useApikey||True or false value that indicates whether to authenticate with the API key rather than the password.|
|fs.swift.service.aservice.region||The Swift region to use. This value is used to select the appropriate Swift endpoint from the service catalog.|
|fs.swift.service.aservice.public||True or false value that indicates whether traffic goes over the public or private (ServiceNet) network. ServiceNet access works only for Swift containers in the same region as the cluster. Traffic over the public network is subject to bandwidth charges.|
|fs.swift.service.aservice.location-aware||True or false value that indicates whether to enable location awareness for data within Swift. This setting is not currently supported for Rackspace Cloud Files.|
<property>
  <name>fs.swift.service.rack-dfw.auth.url</name>
  <value>https://auth.api.rackspacecloud.com/v2.0/tokens</value>
</property>
<property>
  <name>fs.swift.service.rack-dfw.username</name>
  <value>joesmith</value>
</property>
<property>
  <name>fs.swift.service.rack-dfw.region</name>
  <value>DFW</value>
</property>
<property>
  <name>fs.swift.service.rack-dfw.apikey</name>
  <value>74796C657264757264656E</value>
</property>
<property>
  <name>fs.swift.service.rack-dfw.public</name>
  <value>false</value>
</property>
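Because every entry for a service shares the fs.swift.service.aservice prefix, the property blocks for a new service can be generated rather than typed by hand. The following is a minimal sketch that uses only the Python standard library. The helper names service_properties and render are hypothetical, and the sample values are the ones from the example above. It prints property elements to paste into core-site.xml and does not modify the file itself.

```python
import xml.etree.ElementTree as ET


def service_properties(service, settings):
    """Build <property> elements named fs.swift.service.<service>.<key>.

    Hypothetical helper; `settings` maps the key suffixes from the table
    above (for example "region" or "public") to their values.
    """
    elements = []
    for key, value in settings.items():
        if isinstance(value, bool):
            # Write booleans in lowercase, as in the example above.
            value = "true" if value else "false"
        prop = ET.Element("property")
        ET.SubElement(prop, "name").text = "fs.swift.service.%s.%s" % (service, key)
        ET.SubElement(prop, "value").text = str(value)
        elements.append(prop)
    return elements


def render(elements):
    """Serialize the elements as XML text, one <property> per line."""
    return "\n".join(ET.tostring(e, encoding="unicode") for e in elements)


if __name__ == "__main__":
    props = service_properties("rack-dfw", {
        "auth.url": "https://auth.api.rackspacecloud.com/v2.0/tokens",
        "username": "joesmith",
        "region": "DFW",
        "apikey": "74796C657264757264656E",
        "public": False,
    })
    print(render(props))
```

The generated entries still have to be added to core-site.xml on each node, and the cluster services restarted, as described earlier.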