Use Python to Read Files from HDFS

Chapter 1. Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a Java-based distributed, scalable, and portable filesystem designed to span large clusters of commodity servers. The design of HDFS is based on GFS, the Google File System, which is described in a paper published by Google. Like many other distributed filesystems, HDFS holds a large amount of data and provides transparent access to many clients distributed across a network. Where HDFS excels is in its ability to store very large files in a reliable and scalable manner.

HDFS is designed to store a lot of data, typically petabytes (for very large files), gigabytes, and terabytes. This is accomplished by using a block-structured filesystem. Individual files are split into fixed-size blocks that are stored on machines across the cluster. Files made of several blocks generally do not have all of their blocks stored on a single machine.

HDFS ensures reliability by replicating blocks and distributing the replicas across the cluster. The default replication factor is three, meaning that each block exists three times on the cluster. Block-level replication enables data availability even when machines fail.
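The arithmetic of block-structured, replicated storage can be sketched in a few lines of Python. The 128 MB block size and the helper name below are illustrative assumptions; block size is configurable per cluster.

```python
import math

def hdfs_storage(file_size, block_size=128 * 1024 * 1024, replication=3):
    """Estimate how a file of `file_size` bytes is stored in HDFS.

    Returns (number_of_blocks, raw_bytes_on_cluster). The last block
    only occupies the bytes actually written, so raw storage is simply
    file_size multiplied by the replication factor.
    """
    blocks = math.ceil(file_size / block_size)
    return blocks, file_size * replication

# A 1 GB file split into 128 MB blocks: 8 blocks, 3 GB of raw storage.
blocks, raw = hdfs_storage(1024 * 1024 * 1024)
```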

This chapter begins by introducing the core concepts of HDFS and explains how to interact with the filesystem using the built-in commands. After a few examples, a Python client library is introduced that enables HDFS to be accessed programmatically from within Python applications.

Overview of HDFS

The architectural design of HDFS is composed of two processes: a process known as the NameNode holds the metadata for the filesystem, and one or more DataNode processes store the blocks that make up the files. The NameNode and DataNode processes can run on a single machine, but HDFS clusters commonly consist of a dedicated server running the NameNode process and possibly thousands of machines running the DataNode process.

The NameNode is the most important machine in HDFS. It stores metadata for the entire filesystem: filenames, file permissions, and the location of each block of each file. To allow fast access to this information, the NameNode stores the entire metadata structure in memory. The NameNode also tracks the replication factor of blocks, ensuring that machine failures do not result in data loss. Because the NameNode is a single point of failure, a secondary NameNode can be used to generate snapshots of the primary NameNode's memory structures, thereby reducing the risk of data loss if the NameNode fails.

The machines that store the blocks within HDFS are referred to as DataNodes. DataNodes are typically commodity machines with large storage capacities. Unlike the NameNode, HDFS will continue to operate normally if a DataNode fails. When a DataNode fails, the NameNode will replicate the lost blocks to ensure each block meets the minimum replication factor.
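A toy model of this re-replication decision — purely illustrative, and not the NameNode's actual data structures or API — can make the mechanism concrete:

```python
def under_replicated(block_locations, failed_node, replication=3):
    """Return blocks whose live replica count drops below `replication`
    after `failed_node` goes down, mapped to their surviving replicas
    (the nodes a new copy could be made from)."""
    needy = {}
    for block, nodes in block_locations.items():
        live = [n for n in nodes if n != failed_node]
        if len(live) < replication:
            needy[block] = live
    return needy

# Hypothetical block-to-DataNode mapping with a replication factor of 3.
locations = {
    "blk_1": ["node1", "node2", "node3"],
    "blk_2": ["node2", "node3", "node4"],
}
# If node1 fails, only blk_1 needs a new replica.
```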

The example in Figure 1-1 illustrates the mapping of files to blocks in the NameNode, and the storage of blocks and their replicas within the DataNodes.

The following section describes how to interact with HDFS using the built-in commands.

Figure 1-1. An HDFS cluster with a replication factor of 2; the NameNode contains the mapping of files to blocks, and the DataNodes store the blocks and their replicas

Interacting with HDFS

Interacting with HDFS is primarily performed from the command line using the script named hdfs. The hdfs script has the following usage:

$ hdfs COMMAND [-option <arg>]

The COMMAND argument instructs which functionality of HDFS will be used. The -option argument is the name of a specific option for the specified command, and <arg> is one or more arguments that are specified for this option.
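The same COMMAND/option/argument shape can be driven from Python with the subprocess module. The helper below is hypothetical and only builds the argument list, so it can be shown without a running cluster; pass the result to subprocess.run() on a machine where hdfs is on the PATH.

```python
import subprocess

def hdfs_command(command, *args, options=None):
    """Build an argv list of the form: hdfs COMMAND [-option ...] [arg ...]."""
    argv = ["hdfs", command]
    argv.extend(options or [])
    argv.extend(args)
    return argv

# On a Hadoop machine: subprocess.run(hdfs_command("dfs", "/user", options=["-ls"]))
```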

Common File Operations

To perform basic file manipulation operations on HDFS, use the dfs command with the hdfs script. The dfs command supports many of the same file operations found in the Linux shell.

It is important to note that the hdfs command runs with the permissions of the system user running the command. The following examples are run from a user named "hduser."

List Directory Contents

To list the contents of a directory in HDFS, use the -ls command:

$ hdfs dfs -ls

Running the -ls command on a new cluster will not return any results. This is because the -ls command, without any arguments, will attempt to display the contents of the user's home directory on HDFS. This is not the same home directory as on the host machine (e.g., /home/$USER), but is a directory within HDFS.

Providing -ls with the forward slash (/) as an argument displays the contents of the root of HDFS:

$ hdfs dfs -ls /
Found 2 items
drwxr-xr-x   - hadoop supergroup    0 2015-09-20 14:36 /hadoop
drwx------   - hadoop supergroup    0 2015-09-20 14:36 /tmp

The output provided by the hdfs dfs command is similar to the output on a Unix filesystem. By default, -ls displays the file and folder permissions, owners, and groups. The two folders displayed in this example are automatically created when HDFS is formatted. The hadoop user is the name of the user under which the Hadoop daemons were started (e.g., NameNode and DataNode), and the supergroup is the name of the group of superusers in HDFS (e.g., hadoop).
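As a sketch of what those columns contain, a line of -ls output can be split into fields with plain Python. The field layout assumed here matches the example output above; it can differ between Hadoop versions, so treat this as illustrative parsing, not a stable interface.

```python
def parse_ls_line(line):
    """Split one `hdfs dfs -ls` output line into its columns."""
    parts = line.split()
    return {
        "permissions": parts[0],
        "replication": parts[1],        # "-" for directories
        "owner": parts[2],
        "group": parts[3],
        "size": int(parts[4]),
        "modified": " ".join(parts[5:7]),
        "path": parts[7],
    }

entry = parse_ls_line(
    "drwxr-xr-x   - hadoop supergroup    0 2015-09-20 14:36 /hadoop")
```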

Creating a Directory

Home directories within HDFS are stored in /user/$HOME. From the previous example with -ls, it can be seen that the /user directory does not currently exist. To create the /user directory within HDFS, use the -mkdir command:

$ hdfs dfs -mkdir /user              

To make a home directory for the current user, hduser, use the -mkdir command again:

$ hdfs dfs -mkdir /user/hduser              

Use the -ls command to verify that the previous directories were created:

$ hdfs dfs -ls -R /user
drwxr-xr-x   - hduser supergroup    0 2015-09-22 18:01 /user/hduser

Copy Data onto HDFS

After a directory has been created for the current user, data can be uploaded to the user's HDFS home directory with the -put command:

$ hdfs dfs -put /home/hduser/input.txt /user/hduser

This command copies the file /home/hduser/input.txt from the local filesystem to /user/hduser/input.txt on HDFS.

Use the -ls command to verify that input.txt was copied to HDFS:

$ hdfs dfs -ls
Found 1 items
-rw-r--r--   1 hduser supergroup         52 2015-09-20 13:20 input.txt

Retrieving Data from HDFS

Multiple commands allow data to be retrieved from HDFS. To simply view the contents of a file, use the -cat command. -cat reads a file on HDFS and displays its contents to stdout. The following command uses -cat to display the contents of /user/hduser/input.txt:

$ hdfs dfs -cat input.txt
jack be nimble
jack be quick
jack jumped over the candlestick

Data can also be copied from HDFS to the local filesystem using the -get command. The -get command is the opposite of the -put command:

$ hdfs dfs -get input.txt /home/hduser

This command copies input.txt from /user/hduser on HDFS to /home/hduser on the local filesystem.

HDFS Command Reference

The commands demonstrated in this section are the basic file operations needed to begin using HDFS. Below is a full listing of file manipulation commands possible with hdfs dfs. This listing can also be displayed from the command line by specifying hdfs dfs without any arguments. To get help with a specific option, use either hdfs dfs -usage <option> or hdfs dfs -help <option>.

Usage: hadoop fs [generic options]
    [-appendToFile <localsrc> ... <dst>]
    [-cat [-ignoreCrc] <src> ...]
    [-checksum <src> ...]
    [-chgrp [-R] GROUP PATH...]
    [-chmod [-R] <MODE[,MODE]... | OCTALMODE> PATH...]
    [-chown [-R] [OWNER][:[GROUP]] PATH...]
    [-copyFromLocal [-f] [-p] [-l] <localsrc> ... <dst>]
    [-copyToLocal [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-count [-q] [-h] <path> ...]
    [-cp [-f] [-p | -p[topax]] <src> ... <dst>]
    [-createSnapshot <snapshotDir> [<snapshotName>]]
    [-deleteSnapshot <snapshotDir> <snapshotName>]
    [-df [-h] [<path> ...]]
    [-du [-s] [-h] <path> ...]
    [-expunge]
    [-find <path> ... <expression> ...]
    [-get [-p] [-ignoreCrc] [-crc] <src> ... <localdst>]
    [-getfacl [-R] <path>]
    [-getfattr [-R] {-n name | -d} [-e en] <path>]
    [-getmerge [-nl] <src> <localdst>]
    [-help [cmd ...]]
    [-ls [-d] [-h] [-R] [<path> ...]]
    [-mkdir [-p] <path> ...]
    [-moveFromLocal <localsrc> ... <dst>]
    [-moveToLocal <src> <localdst>]
    [-mv <src> ... <dst>]
    [-put [-f] [-p] [-l] <localsrc> ... <dst>]
    [-renameSnapshot <snapshotDir> <oldName> <newName>]
    [-rm [-f] [-r|-R] [-skipTrash] <src> ...]
    [-rmdir [--ignore-fail-on-non-empty] <dir> ...]
    [-setfacl [-R] [{-b|-k} {-m|-x <acl_spec>} <path>]|[--set <acl_spec> <path>]]
    [-setfattr {-n name [-v value] | -x name} <path>]
    [-setrep [-R] [-w] <rep> <path> ...]
    [-stat [format] <path> ...]
    [-tail [-f] <file>]
    [-test -[defsz] <path>]
    [-text [-ignoreCrc] <src> ...]
    [-touchz <path> ...]
    [-truncate [-w] <length> <path> ...]
    [-usage [cmd ...]]

Generic options supported are
-conf <configuration file>     specify an application configuration file
-D <property=value>            use value for given property
-fs <local|namenode:port>      specify a namenode
-jt <local|resourcemanager:port>    specify a ResourceManager
-files <comma separated list of files>    specify comma separated files to be copied to the map reduce cluster
-libjars <comma separated list of jars>    specify comma separated jar files to include in the classpath.
-archives <comma separated list of archives>    specify comma separated archives to be unarchived on the compute machines.

The general command line syntax is
bin/hadoop command [genericOptions] [commandOptions]

The next section introduces a Python library that allows HDFS to be accessed from within Python applications.

Snakebite

Snakebite is a Python package, created by Spotify, that provides a Python client library, allowing HDFS to be accessed programmatically from Python applications. The client library uses protobuf messages to communicate directly with the NameNode. The Snakebite package also includes a command-line interface for HDFS that is based on the client library.

This section describes how to install and configure the Snakebite package. Snakebite's client library is explained in detail with multiple examples, and Snakebite's built-in CLI is introduced as a Python alternative to the hdfs dfs command.

Installation

Snakebite requires Python 2 and python-protobuf 2.4.1 or higher. Python 3 is currently not supported.

Snakebite is distributed through PyPI and can be installed using pip:

$ pip install snakebite            

Client Library

The client library is written in Python, uses protobuf messages, and implements the Hadoop RPC protocol for talking to the NameNode. This enables Python applications to communicate directly with HDFS without having to make a system call to hdfs dfs.

List Directory Contents

Example 1-1 uses the Snakebite client library to list the contents of the root directory in HDFS.

Example 1-1. python/HDFS/list_directory.py

from snakebite.client import Client

client = Client('localhost', 9000)
for x in client.ls(['/']):
    print x

The most important line of this program, and of every program that uses the client library, is the line that creates a client connection to the HDFS NameNode:

client = Client('localhost', 9000)

The Client() method accepts the following parameters:

host (string)
Hostname or IP address of the NameNode
port (int)
RPC port of the NameNode
hadoop_version (int)
The Hadoop protocol version to be used (default: 9)
use_trash (boolean)
Use trash when removing files
effective_user (string)
Effective user for the HDFS operations (default: None or current user)

The host and port parameters are required and their values are dependent upon the HDFS configuration. The values for these parameters can be found in the hadoop/conf/core-site.xml configuration file under the property fs.defaultFS:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
</property>

For the examples in this section, the values used for host and port are localhost and 9000, respectively.
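These values can also be pulled out programmatically. The sketch below parses a core-site.xml fragment with Python's standard library; the inline XML string stands in for the real hadoop/conf/core-site.xml, and the helper name is an illustrative assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

CORE_SITE = """
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
"""

def namenode_address(xml_text):
    """Return (host, port) from the fs.defaultFS property, or None."""
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if prop.findtext("name") == "fs.defaultFS":
            url = urlparse(prop.findtext("value"))
            return url.hostname, url.port
    return None

host, port = namenode_address(CORE_SITE)
```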

After the client connection is created, the HDFS filesystem can be accessed. The remainder of the previous application used the ls command to list the contents of the root directory in HDFS:

for x in client.ls(['/']):
    print x

It is important to note that many of the methods in Snakebite return generators. Therefore they must be consumed to execute. The ls method takes a list of paths and returns a list of maps that contain the file information.
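The same consumption rule applies to any Python generator, which the following self-contained sketch demonstrates with a stand-in fake_ls (not Snakebite itself):

```python
def fake_ls(paths):
    """Stand-in for a Snakebite-style method: yields a map per path."""
    for p in paths:
        yield {"path": p, "file_type": "d"}

gen = fake_ls(["/tmp", "/user"])   # nothing has executed yet
results = list(gen)                # iterating the generator drives the work
```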

Executing the list_directory.py application yields the following results:

$ python list_directory.py
{'group': u'supergroup', 'permission': 448, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1442752574936L, 'length': 0L, 'blocksize': 0L, 'owner': u'hduser', 'path': '/tmp'}
{'group': u'supergroup', 'permission': 493, 'file_type': 'd', 'access_time': 0L, 'block_replication': 0, 'modification_time': 1442742056276L, 'length': 0L, 'blocksize': 0L, 'owner': u'hduser', 'path': '/user'}

Create a Directory

Use the mkdir() method to create directories on HDFS. Example 1-2 creates the directories /foo/bar and /input on HDFS.

Example 1-2. python/HDFS/mkdir.py

from snakebite.client import Client

client = Client('localhost', 9000)
for p in client.mkdir(['/foo/bar', '/input'], create_parent=True):
    print p

Executing the mkdir.py application produces the following results:

$ python mkdir.py
{'path': '/foo/bar', 'result': True}
{'path': '/input', 'result': True}

The mkdir() method takes a list of paths and creates the specified paths in HDFS. This example used the create_parent parameter to ensure that parent directories were created if they did not already exist. Setting create_parent to True is analogous to the mkdir -p Unix command.

Deleting Files and Directories

Deleting files and directories from HDFS can be accomplished with the delete() method. Example 1-3 recursively deletes the /foo and /input directories, created in the previous example.

Example 1-3. python/HDFS/delete.py
from snakebite.client import Client

client = Client('localhost', 9000)
for p in client.delete(['/foo', '/input'], recurse=True):
    print p

Executing the delete.py application produces the following results:

$ python delete.py
{'path': '/foo', 'result': True}
{'path': '/input', 'result': True}

Performing a recursive delete will delete any subdirectories and files that a directory contains. If a specified path cannot be found, the delete method throws a FileNotFoundException. If recurse is not specified and a subdirectory or file exists, DirectoryException is thrown.

The recurse parameter is equivalent to rm -rf and should be used with care.

Retrieving Data from HDFS

Like the hdfs dfs command, the client library contains multiple methods that allow data to be retrieved from HDFS. To copy files from HDFS to the local filesystem, use the copyToLocal() method. Example 1-4 copies the file /input/input.txt from HDFS and places it under the /tmp directory on the local filesystem.

Example 1-4. python/HDFS/copy_to_local.py
from snakebite.client import Client

client = Client('localhost', 9000)
for f in client.copyToLocal(['/input/input.txt'], '/tmp'):
    print f

Executing the copy_to_local.py application produces the following result:

$ python copy_to_local.py
{'path': '/tmp/input.txt', 'source_path': '/input/input.txt', 'result': True, 'error': ''}

To simply read the contents of a file that resides on HDFS, the text() method can be used. Example 1-5 displays the content of /input/input.txt.

Example 1-5. python/HDFS/text.py
from snakebite.client import Client

client = Client('localhost', 9000)
for l in client.text(['/input/input.txt']):
    print l

Executing the text.py application produces the following results:

$ python text.py
jack be nimble
jack be quick
jack jumped over the candlestick

The text() method will automatically uncompress and display gzip and bzip2 files.
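That decompression behavior can be mimicked locally by sniffing a file's magic bytes, as the sketch below does with the standard library. This illustrates the idea only; it is not Snakebite's implementation, and read_text is a hypothetical name.

```python
import bz2
import gzip

def read_text(data):
    """Decode bytes as text, transparently handling gzip and bzip2 payloads."""
    if data[:2] == b"\x1f\x8b":     # gzip magic number
        return gzip.decompress(data).decode()
    if data[:3] == b"BZh":          # bzip2 magic number
        return bz2.decompress(data).decode()
    return data.decode()

compressed = gzip.compress(b"jack be nimble")
```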

CLI Client

The CLI client included with Snakebite is a Python command-line HDFS client based on the client library. To execute the Snakebite CLI, the hostname or IP address of the NameNode and the RPC port of the NameNode must be specified. While there are many ways to specify these values, the easiest is to create a ~/.snakebiterc configuration file. Example 1-6 contains a sample config with the NameNode hostname of localhost and RPC port of 9000.

Example 1-6. ~/.snakebiterc
{
    "config_version": 2,
    "skiptrash": true,
    "namenodes": [
        {"host": "localhost", "port": 9000, "version": 9}
    ]
}

The values for host and port can be found in the hadoop/conf/core-site.xml configuration file under the property fs.defaultFS.

For more information on configuring the CLI, see the Snakebite CLI documentation online.

Usage

To use the Snakebite CLI client from the command line, simply use the command snakebite. Use the ls option to display the contents of a directory:

$ snakebite ls /
Found 2 items
drwx------   - hadoop    supergroup    0 2015-09-20 14:36 /tmp
drwxr-xr-x   - hadoop    supergroup    0 2015-09-20 11:40 /user

Like the hdfs dfs command, the CLI client supports many familiar file manipulation commands (e.g., ls, mkdir, df, du, etc.).

The major difference between snakebite and hdfs dfs is that snakebite is a pure Python client and does not need to load any Java libraries to communicate with HDFS. This results in quicker interactions with HDFS from the command line.

CLI Command Reference

The following is a full listing of file manipulation commands possible with the snakebite CLI client. This listing can be displayed from the command line by specifying snakebite without any arguments. To view help with a specific command, use snakebite [cmd] --help, where cmd is a valid snakebite command.

snakebite [general options] cmd [arguments]
general options:
  -D --debug               Show debug information
  -V --version             Hadoop protocol version (default: 9)
  -h --help                show help
  -j --json                JSON output
  -n --namenode            namenode host
  -p --port                namenode RPC port (default: 8020)
  -v --ver                 Display snakebite version

commands:
  cat [paths]                  copy source paths to stdout
  chgrp <grp> [paths]          change group
  chmod <mode> [paths]         change file mode (octal)
  chown <owner:grp> [paths]    change owner
  copyToLocal [paths] dst      copy paths to local file system destination
  count [paths]                display stats for paths
  df                           display fs stats
  du [paths]                   display disk usage statistics
  get file dst                 copy files to local file system destination
  getmerge dir dst             concatenates files in source dir into destination local file
  ls [paths]                   list a path
  mkdir [paths]                create directories
  mkdirp [paths]               create directories and their parents
  mv [paths] dst               move paths to destination
  rm [paths]                   remove paths
  rmdir [dirs]                 delete a directory
  serverdefaults               show server information
  setrep <rep> [paths]         set replication factor
  stat [paths]                 stat information
  tail path                    display last kilobyte of the file to stdout
  test path                    test a path
  text path [paths]            output file in text format
  touchz [paths]               creates a file of zero length
  usage <cmd>                  show cmd usage

to see command-specific options
use: snakebite [cmd] --help

Chapter Summary

This chapter introduced and described the core concepts of HDFS. It explained how to interact with the filesystem using the built-in hdfs dfs command. It also introduced the Python library Snakebite. Snakebite's client library was explained in detail with multiple examples. The snakebite CLI was also introduced as a Python alternative to the hdfs dfs command.


Source: https://www.oreilly.com/library/view/hadoop-with-python/9781492048435/ch01.html
