Java: How to Upload a File to a Hadoop Location

5. Working with the Hadoop File System

A common task in Hadoop is interacting with its file system, whether for provisioning, adding new files to existing ones, parsing results, or performing cleanup. Hadoop offers several ways to achieve that: one can use its Java API (namely FileSystem) or use the hadoop command line, in particular the file system shell. However there is no middle ground: one either has to use the (somewhat verbose, full of checked exceptions) API or fall back to the command line, outside the application. SHDP addresses this issue by bridging the two worlds, exposing both the FileSystem and the fs shell through an intuitive, easy-to-use Java API. Add your favorite JVM scripting language right inside your Spring for Apache Hadoop application and you have a powerful combination.

5.1 Configuring the file-system

The Hadoop file-system, HDFS, can be accessed in various ways - this section will cover the most popular protocols for interacting with HDFS and their pros and cons. SHDP does not enforce any specific protocol to be used - in fact, as described in this section, any FileSystem implementation can be used, allowing even other implementations than HDFS to be used.

The table below describes the common HDFS APIs in use:

Table 5.1. HDFS APIs

File System | Comm. Method | Scheme / Prefix | Read / Write | Cross Version
HDFS        | RPC          | hdfs://         | Read / Write | Same HDFS version only
HFTP        | HTTP         | hftp://         | Read only    | Version independent
WebHDFS     | HTTP (REST)  | webhdfs://      | Read / Write | Version independent

This chapter focuses on the core file-system protocols supported by Hadoop. S3, FTP and the rest of the other FileSystem implementations are supported as well - Spring for Apache Hadoop has no dependency on the underlying system but rather just on the public Hadoop API.

The hdfs:// protocol should be familiar to most readers - most docs (and in fact the previous chapter as well) mention it. It works out of the box and it is fairly efficient. However, because it is RPC based, it requires both the client and the Hadoop cluster to share the same version; upgrading one without the other causes serialization errors, meaning the client cannot interact with the cluster. As an alternative one can use hftp://, which is HTTP-based, or its more secure sibling hsftp:// (based on SSL), which gives you a version-independent protocol, meaning you can use it to interact with clusters of an unknown or different version than that of the client. hftp is read only (write operations fail right away) and it is typically used with distcp for reading data. webhdfs:// is one of the additions in Hadoop 1.0 and is a mixture between the hdfs and hftp protocols: it provides a version-independent, read-write, REST-based protocol, which means that you can read and write to/from Hadoop clusters no matter their version. Furthermore, since webhdfs:// is backed by a REST API, clients in other languages can use it with minimal effort.
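Which protocol is used is driven entirely by the scheme of the URI handed to the standard Hadoop FileSystem API. The sketch below illustrates this with plain Hadoop calls; the host names and ports are placeholders, not values taken from this chapter:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SchemeExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // RPC-based access - client and cluster must share the same HDFS version
            FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

            // REST-based access - version independent, read/write
            FileSystem webhdfs = FileSystem.get(URI.create("webhdfs://namenode:50070"), conf);

            System.out.println(hdfs.exists(new Path("/user")));
            System.out.println(webhdfs.exists(new Path("/user")));
        }
    }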

[Note] Note

Not all file systems work out of the box. For example WebHDFS needs to be enabled first in the cluster (through the dfs.webhdfs.enabled property, see this document for more information) while the secure hftp, hsftp require the SSL configuration (such as certificates) to be specified. More about this (and how to use hftp/hsftp for proxying) in this page.

Once the scheme has been decided upon, one can specify it through the standard Hadoop configuration, either through the Hadoop configuration files or its properties:

    <hdp:configuration>
      fs.defaultFS=webhdfs://localhost
      ...
    </hdp:configuration>

This instructs Hadoop (and automatically SHDP) what the default, implied file-system is. In SHDP, one can create additional file-systems (potentially to connect to other clusters) and specify a different scheme:

    <!-- manually creates the default SHDP file-system named 'hadoopFs' -->
    <hdp:file-system uri="webhdfs://localhost" />

    <!-- creates a different FileSystem instance -->
    <hdp:file-system id="old-cluster" uri="hftp://old-cluster/" />

As with the rest of the components, the file systems can be injected where needed - such as the file shell or inside scripts (see the next section).
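Injection works as for any other Spring bean; the sketch below wires both file systems into a plain Java class (the bean names come from the declarations above, the class itself is just an illustration):

    import java.io.IOException;

    import javax.annotation.Resource;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ClusterChecker {

        // the default <hdp:file-system/> bean, named 'hadoopFs' by convention
        @Resource(name = "hadoopFs")
        private FileSystem fs;

        // the additional file-system declared with id="old-cluster"
        @Resource(name = "old-cluster")
        private FileSystem oldClusterFs;

        public boolean existsOnBoth(String path) throws IOException {
            Path p = new Path(path);
            return fs.exists(p) && oldClusterFs.exists(p);
        }
    }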

5.2 Using HDFS Resource Loader

In Spring the ResourceLoader interface is meant to be implemented by objects that can return (i.e. load) Resource instances.

    public interface ResourceLoader {
        Resource getResource(String location);
    }

All application contexts implement the ResourceLoader interface, and therefore all application contexts may be used to obtain Resource instances.

When you call getResource() on a specific application context, and the location path specified doesn't have a specific prefix, you will get back a Resource type that is appropriate to that particular application context. For example, assume the following snippet of code was executed against a ClassPathXmlApplicationContext instance:

    Resource template = ctx.getResource("some/resource/path/myTemplate.txt");

What would be returned would be a ClassPathResource; if the same method were executed against a FileSystemXmlApplicationContext instance, you'd get back a FileSystemResource. For a WebApplicationContext, you'd get back a ServletContextResource, and so on.

As such, you can load resources in a fashion appropriate to the particular application context.

On the other hand, you may also force ClassPathResource to be used, regardless of the application context type, by specifying the special classpath: prefix:

Resource template = ctx.getResource("classpath:some/resource/path/myTemplate.txt");
[Note] Note

For more information about the generic usage of resource loading, check the Spring Framework Documentation.

Spring Hadoop adds its own functionality to the generic concept of resource loading. The Resource abstraction in Spring has always been a way to ease resource access by removing the need to know where the resource is and how it is accessed. This abstraction also goes beyond a single resource by allowing patterns to be used to access multiple resources.

Let's first see how HdfsResourceLoader is used manually.

    <hdp:file-system />

    <hdp:resource-loader id="loader" file-system-ref="hadoopFs" />
    <hdp:resource-loader id="loaderWithUser" user="myuser" uri="hdfs://localhost:8020" />

In the above configuration we created two beans, one with a reference to an existing Hadoop FileSystem bean and one with an impersonated user.

    // absolute paths
    Resource resource = loader.getResource("/tmp/file.txt");
    Resource resource = loaderWithUser.getResource("/tmp/file.txt");

    // relative paths
    Resource resource = loader.getResource("file.txt");
    Resource resource = loaderWithUser.getResource("file.txt");

    // patterns resolving to multiple resources
    Resource[] resources = loader.getResources("/tmp/*");
    Resource[] resources = loader.getResources("/tmp/**/*");
    Resource[] resources = loader.getResources("/tmp/?ile?.txt");

What would be returned in the above examples would be instances of HdfsResource.
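Since these are regular Spring Resource instances, they can be consumed with the usual Resource API; a minimal sketch, assuming the loader bean from the configuration above is injected and the file exists:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.springframework.core.io.Resource;
    import org.springframework.data.hadoop.fs.HdfsResourceLoader;

    public class HdfsResourceReader {

        // prints the content of an HDFS file line by line
        public void printFile(HdfsResourceLoader loader) throws Exception {
            Resource resource = loader.getResource("/tmp/file.txt");
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(resource.getInputStream()));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            } finally {
                reader.close();
            }
        }
    }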

If there is a need for the Spring Application Context to be aware of HdfsResourceLoader it needs to be registered using the hdp:resource-loader-registrar namespace tag.

    <hdp:file-system />

    <hdp:resource-loader file-system-ref="hadoopFs" handle-noprefix="false" />
    <hdp:resource-loader-registrar />
[Note] Note

By default the HdfsResourceLoader will handle all resource paths without prefix. The handle-noprefix attribute can be used to control this behaviour. If this attribute is set to false, non-prefixed resource uris will be handled by the Spring Application Context.

    Resource[] resources = context.getResources("hdfs:default.txt");
    Resource[] resources = context.getResources("hdfs:/*");
    Resource[] resources = context.getResources("classpath:cfg*properties");

What would be returned in the above examples would be instances of HdfsResource and, for the last one, ClassPathResource. If requesting resource paths without an existing prefix, this example would fall back to the Spring Application Context. It may be advisable to let HdfsResourceLoader handle paths without prefix if your application doesn't rely on loading resources from the underlying context without prefixes.

Table 5.2. hdp:resource-loader attributes

Name            | Values                     | Description
file-system-ref | Bean Reference             | Reference to existing Hadoop FileSystem bean
use-codecs      | Boolean (defaults to true) | Indicates whether to use (or not) the codecs found inside the Hadoop configuration when accessing the resource input stream.
user            | String                     | The security user (ugi) to use for impersonation at runtime.
uri             | String                     | The underlying HDFS system URI.
handle-noprefix | Boolean (defaults to true) | Indicates if the loader should handle resource paths without prefix.


Table 5.3. hdp:resource-loader-registrar attributes

Name       | Values         | Description
loader-ref | Bean Reference | Reference to existing Hdfs resource loader bean. Default value is 'hadoopResourceLoader'.


5.3 Scripting the Hadoop API

SHDP scripting supports any JSR-223 (also known as javax.scripting) compliant scripting engine. Simply add the engine jar to the classpath and the application should be able to find it. Most languages (such as Groovy or JRuby) provide JSR-223 support out of the box; for those that do not, see the scripting project that provides various adapters.

Since Hadoop is written in Java, accessing its APIs in a native way provides maximum control and flexibility over the interaction with Hadoop. This holds true for working with its file systems; in fact all the other tools that one might use are built upon these. The main entry point is the org.apache.hadoop.fs.FileSystem abstract class which provides the foundation of most (if not all) of the actual file system implementations out there. Whether one is using a local, remote or distributed store through the FileSystem API, one can query and manipulate the available resources or create new ones. To do so however, one needs to write Java code, compile the classes and configure them, which is somewhat cumbersome especially when performing simple, straightforward operations (like copying a file or deleting a directory).
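For reference, the plain-Java route for the copy operation used throughout this chapter looks roughly like the sketch below; it has to be compiled and wired up by hand, which is exactly the ceremony the scripting support removes:

    import java.util.UUID;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyToHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            String name = UUID.randomUUID().toString();
            Path local = new Path("src/test/resources/test.properties");
            Path remote = new Path(name);

            // copy the local file into HDFS under the random name
            fs.copyFromLocalFile(local, remote);
            // print the length of the copied file
            System.out.println(fs.getFileStatus(remote).getLen());
        }
    }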

JVM scripting languages (such as Groovy, JRuby, Jython or Rhino to name just a few) provide a nice complement to the Java language; they run on the JVM, can interact with Java code with no or few changes or restrictions and have a nicer, simpler, less ceremonial syntax; that is, there is no need to define a class or a method - simply write the code that you want to execute and you are done. SHDP combines the two, taking care of the configuration and the infrastructure so one can interact with the Hadoop environment from the language of choice.

Let us take a look at a JavaScript example using Rhino (which is part of JDK 6 or higher, meaning one does not need any extra libraries):

    <beans xmlns="http://www.springframework.org/schema/beans" ...>
      <hdp:configuration .../>

      <hdp:script id="inlined-js" language="javascript" run-at-startup="true">
        try {load("nashorn:mozilla_compat.js");} catch (e) {} // for Java 8
        importPackage(java.util);

        name = UUID.randomUUID().toString()
        scriptName = "src/test/resources/test.properties"
        // fs - FileSystem instance based on 'hadoopConfiguration' bean
        // call FileSystem#copyFromLocal(Path, Path)
        fs.copyFromLocalFile(scriptName, name)
        // return the file length
        fs.getLength(name)
      </hdp:script>
    </beans>

The script element, part of the SHDP namespace, builds on top of the scripting support in Spring, permitting script declarations to be evaluated and declared as normal bean definitions. Furthermore it automatically exposes Hadoop-specific objects, based on the existing configuration, to the script, such as the FileSystem (more on that in the next section). As one can see, the script is fairly obvious: it generates a random name (using the UUID class from the java.util package) and then copies a local file into HDFS under that random name. The last line returns the length of the copied file, which becomes the value of the declaring bean (in this case inlined-js) - note that this might vary based on the scripting engine used.

[Note] Note

The attentive reader might have noticed that the arguments passed to the FileSystem object are not of type Path but rather String. To avoid the creation of Path objects, SHDP uses a wrapper class, SimplerFileSystem, which automatically does the conversion so you don't have to. For more information see the implicit variables section.
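Because the script result becomes the value of the declaring bean, it can be retrieved from the application context like any other bean. The sketch below is only an illustration: the context file name is hypothetical and the exact bean type and evaluation timing may vary by SHDP version:

    import org.springframework.context.support.ClassPathXmlApplicationContext;

    public class ScriptResult {
        public static void main(String[] args) {
            // 'hadoop-context.xml' is a placeholder for the configuration shown above
            ClassPathXmlApplicationContext ctx =
                    new ClassPathXmlApplicationContext("hadoop-context.xml");
            // the value of the 'inlined-js' bean is the last expression of the script,
            // i.e. the length of the copied file
            Object length = ctx.getBean("inlined-js");
            System.out.println("Copied file length: " + length);
            ctx.close();
        }
    }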

Note that for inlined scripts, one can use Spring's property placeholder configurer to automatically expand variables at runtime. Using one of the examples seen earlier:

    <beans ...>
      <context:property-placeholder location="classpath:hadoop.properties" />

      <hdp:script language="javascript" run-at-startup="true">
        ...
        tracker=${hd.fs}
        ...
      </hdp:script>
    </beans>

Notice how the script above relies on the property placeholder to expand ${hd.fs} with the values from the hadoop.properties file available in the classpath.

As you might have noticed, the script element defines a runner for JVM scripts. And just like the rest of the SHDP runners, it allows one or multiple pre and post actions to be specified, to be executed before and after each run. Typically other runners (such as other jobs or scripts) can be specified, but any JDK Callable can be passed in. Do note that the runner will not run unless triggered manually or if run-at-startup is set to true. For more information on runners, see the dedicated chapter.

5.3.1 Using scripts

Inlined scripting is quite handy for doing simple operations and, coupled with property expansion, is quite a powerful tool that can handle a variety of use cases. However when more logic is required or the script is affected by XML formatting, encoding or syntax restrictions (such as Jython/Python for which white-spaces are important) one should consider externalization. That is, rather than declaring the script directly inside the XML, one can declare it in its own file. And speaking of Python, consider the variation of the previous example:

    <hdp:script location="org/company/basic-script.py" run-at-startup="true" />

The definition does not bring any surprises but do notice there is no need to specify the language (as in the case of an inlined declaration) since the script extension (py) already provides that information. Just for completeness, basic-script.py looks as follows:

    from java.util import UUID
    from org.apache.hadoop.fs import Path

    print "Home dir is " + str(fs.homeDirectory)
    print "Work dir is " + str(fs.workingDirectory)
    print "/user exists " + str(fs.exists("/user"))

    name = UUID.randomUUID().toString()
    scriptName = "src/test/resources/test.properties"
    fs.copyFromLocalFile(scriptName, name)
    print Path(name).makeQualified(fs)

5.4 Scripting implicit variables

To ease the interaction of the script with its enclosing context, SHDP binds by default the so-called implicit variables. These are:

Table 5.4. Implicit variables

Name   | Type                    | Description
cfg    | Configuration           | Hadoop Configuration (relies on hadoopConfiguration bean or singleton type match)
cl     | ClassLoader             | ClassLoader used for executing the script
ctx    | ApplicationContext      | Enclosing application context
ctxRL  | ResourcePatternResolver | Enclosing application context ResourceLoader
distcp | DistCp                  | Programmatic access to DistCp
fs     | FileSystem              | Hadoop File System (relies on 'hadoop-fs' bean or singleton type match, falls back to creating one based on 'cfg')
fsh    | FsShell                 | File System shell, exposing hadoop 'fs' commands as an API
hdfsRL | HdfsResourceLoader      | Hdfs resource loader (relies on 'hadoop-resource-loader' or singleton type match, falls back to creating one automatically based on 'cfg')


[Note] Note

If no Hadoop Configuration can be detected (either by the name hadoopConfiguration or by type), several log warnings will be made and none of the Hadoop-based variables (namely cfg, distcp, fs, fsh or hdfsRL) will be bound.

As mentioned in the Description column, the variables are first looked up (either by name or by type) in the application context and, in case they are missing, created on the spot based on the existing configuration. Note that it is possible to override or add new variables to the scripts through the property sub-element that can set values or references to other beans:

    <hdp:script location="org/company/basic-script.js" run-at-startup="true">
      <hdp:property name="foo" value="bar" />
      <hdp:property name="ref" ref="some-bean" />
    </hdp:script>

5.4.1 Running scripts

The script namespace provides various options to adjust its behaviour depending on the script content. By default the script is simply declared - that is, no execution occurs. One can however change that so that the script gets evaluated at startup (as all the examples in this section do) through the run-at-startup flag (which is by default false) or when invoked manually (through the Callable). Similarly, by default the script gets evaluated on each run. However, for scripts that are expensive and return the same value every time, there are various caching options so the evaluation occurs only when needed, through the evaluate attribute:

Table 5.5. script attributes

Name           | Values                              | Description
run-at-startup | false (default), true               | Whether the script is executed at startup or not
evaluate       | ALWAYS (default), IF_MODIFIED, ONCE | Whether to actually evaluate the script when invoked or use a previous value. ALWAYS means evaluate every time, IF_MODIFIED evaluate if the backing resource (such as a file) has been modified in the meantime and ONCE only once.


5.4.2 Using the Scripting tasklet

For Spring Batch environments, SHDP provides a dedicated tasklet to execute scripts.

    <script-tasklet id="script-tasklet">
      <script language="groovy">
        inputPath = "/user/gutenberg/input/word/"
        outputPath = "/user/gutenberg/output/word/"
        if (fsh.test(inputPath)) {
          fsh.rmr(inputPath)
        }
        if (fsh.test(outputPath)) {
          fsh.rmr(outputPath)
        }
        inputFile = "src/main/resources/data/nietzsche-chapter-1.txt"
        fsh.put(inputFile, inputPath)
      </script>
    </script-tasklet>

The tasklet above embeds the script as a nested element. You can also declare a reference to another script definition, using the script-ref attribute which allows you to externalize the scripting code to an external resource.

    <script-tasklet id="script-tasklet" script-ref="clean-up" />
    <hdp:script id="clean-up" location="org/company/myapp/clean-up-wordcount.groovy" />

5.5 File System Shell (FsShell)

A handy utility provided by the Hadoop distribution is the file system shell which allows UNIX-like commands to be executed against HDFS. One can check for the existence of files, delete, move or copy directories or files, or set permissions. However the utility is only available from the command-line, which makes it difficult to use from/inside a Java application. To address this problem, SHDP provides a lightweight, fully embeddable shell, called FsShell, which mimics most of the commands available from the command line: rather than dealing with System.in or System.out, one deals with objects.
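Since FsShell is a plain Java class, it can also be used directly from Java code, not just from scripts. The sketch below assumes FsShell can be constructed from a Hadoop Configuration (check the FsShell javadoc for the exact constructors) and only uses commands discussed in this section:

    import org.apache.hadoop.conf.Configuration;
    import org.springframework.data.hadoop.fs.FsShell;

    public class ShellExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // assumption: FsShell exposes a constructor taking the Hadoop configuration
            FsShell fsh = new FsShell(conf);

            String dir = "script-dir/";
            if (!fsh.test(dir)) {
                fsh.mkdir(dir);
            }
            // ls returns actual objects rather than printing to System.out
            System.out.println(fsh.ls(dir).toString());
            fsh.rmr(dir);
        }
    }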

Let us take a look at using FsShell by building on the previous scripting examples:

    <hdp:script location="org/company/basic-script.groovy" run-at-startup="true" />
    name = UUID.randomUUID().toString()
    scriptName = "src/test/resources/test.properties"
    fs.copyFromLocalFile(scriptName, name)

    // use the shell made available under variable fsh
    dir = "script-dir/"
    if (!fsh.test(dir)) {
       fsh.mkdir(dir); fsh.cp(name, dir); fsh.chmodr(700, dir)
       println "File content is " + fsh.cat(dir + name).toString()
    }
    println fsh.ls(dir).toString()
    fsh.rmr(dir)

As mentioned in the previous section, a FsShell instance is automatically created and configured for scripts, under the name fsh. Notice how the entire block relies on the usual commands: test, mkdir, cp and so on. Their semantics are exactly the same as in the command-line version, however one has access to a native Java API that returns actual objects (rather than Strings), making it easy to use them programmatically whether in Java or another language. Furthermore, the class offers enhanced methods (such as chmodr, which stands for recursive chmod) and multiple overloaded methods taking advantage of varargs so that multiple parameters can be specified. Consult the API for more information.

To be as close as possible to the command-line shell, FsShell mimics even the messages being displayed. Take a look at line 9, which prints the result of fsh.cat(). The method returns a Collection of Hadoop Path objects (which one can use programmatically). However when invoking toString on the collection, the same printout as from the command-line shell is displayed:

File content is

The same goes for the rest of the methods, such as ls. The same script in JRuby would look something like this:

    require 'java'

    name = java.util.UUID.randomUUID().to_s
    scriptName = "src/test/resources/test.properties"
    $fs.copyFromLocalFile(scriptName, name)

    dir = "script-dir/"
    ...
    print $fsh.ls(dir).to_s

which prints out something like this:

    drwx------   - user     supergroup          0 2012-01-26 14:08 /user/user/script-dir
    -rw-r--r--   3 user     supergroup        344 2012-01-26 14:08 /user/user/script-dir/520cf2f6-a0b6-427e-a232-2d5426c2bc4e

As you can see, not only can you reuse the existing tools and commands with Hadoop inside SHDP, but you can also code against them in various scripting languages. And as you might have noticed, there is no special configuration required - this is automatically inferred from the enclosing application context.

[Note] Note

The careful reader might have noticed that besides the syntax, there are some small differences in how the various languages interact with the Java objects. For example the automatic toString call used in Java for automatic String conversion is not necessarily supported (hence the to_s in Ruby or str in Python). This is to be expected as each language has its own semantics - for the most part these are easy to pick up but do pay attention to details.

5.5.1 DistCp API

Similar to the FsShell, SHDP provides a lightweight, fully embeddable DistCp version that builds on top of the distcp from the Hadoop distro. The semantics and configuration options are the same, however one can use it from within a Java application without having to use the command-line. See the API for more information:

    <hdp:script language="groovy">distcp.copy("${distcp.src}", "${distcp.dst}")</hdp:script>

The bean above triggers a distributed copy relying again on Spring's property placeholder variable expansion for its source and destination.
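The same copy can be triggered directly from Java code. The sketch below is only an illustration: the copy(String, String) call mirrors the script above, but the constructor taking a Hadoop Configuration as well as the source and destination paths are assumptions - consult the DistCp javadoc:

    import org.apache.hadoop.conf.Configuration;
    import org.springframework.data.hadoop.fs.DistCp;

    public class DistCpExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // assumption: DistCp is constructed from the Hadoop configuration
            DistCp distcp = new DistCp(conf);
            // source and destination URIs are placeholders
            distcp.copy("hdfs://old-cluster/data", "hdfs://localhost/data");
        }
    }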


Source: https://docs.spring.io/spring-hadoop/docs/current/reference/html/springandhadoop-fs.html
