Sqoop 1 or Sqoop 2?

Sqoop1 and Sqoop2 are completely different code paths and, as such, have very different feature sets [1]. So, how do we know when we’re using Sqoop1 or Sqoop2? Here’s a quick list of ways:

Different Usecases

1. Sqoop2 has a UI devoted to it, but Sqoop1 does not

The Hue project has a wonderful UI for using Sqoop2, but nothing specific for Sqoop1.

You can run Sqoop1 jobs from Hue, but it has to be through the Job Designer app or the Oozie app.

2. Sqoop2 has an interactive command line interface, Sqoop1 does not

Sqoop2 and Sqoop1 have slightly different command line interfaces. Since Sqoop2 is service oriented and considers metadata first, there is a sequence of steps that’s necessary to start a job:

sqoop:000> create link --cid 1

Creating link for connector with id 1

Please fill following values to create new link object

Name: hdfs1

New link was successfully created with validation status FINE and persistent id 3

sqoop:000> create link --cid 2

Creating link for connector with id 2

Please fill following values to create new link object

Name: mysql

Link configuration

JDBC Driver Class: com.mysql.jdbc.Driver

JDBC Connection String: jdbc:mysql://solaris.abe.cloudera.com/test

Username: root

Password: ****

JDBC Connection Properties:

There are currently 0 values in the map:

entry#

New link was successfully created with validation status FINE and persistent id 4

sqoop:000> create job --from 4 --to 3

Creating job for links with from id 4 and to id 3

Please fill following values to create new job object

Name: MySQLtoHDFS

From database configuration

Schema name:

Table name: test

Table SQL statement:

Table column names:

Partition column name:

Nulls in partition column:

Boundary query:

ToJob configuration

Output format:

0 : TEXT_FILE

1 : SEQUENCE_FILE

Choose: 0

Compression format:

0 : NONE

1 : DEFAULT

2 : DEFLATE

3 : GZIP

4 : BZIP2

5 : LZO

6 : LZ4

7 : SNAPPY

8 : CUSTOM

Choose: 0

Custom compression format:

Output directory: /tmp/sqoop/hdfs

Throttling resources

Extractors:

Loaders:

New job was successfully created with validation status FINE and persistent id 1

sqoop:000> start job --jid 1

Submission details

Job ID: 1

Server URL: http://localhost:12000/sqoop/

Created by: ab

Creation date: 2014-10-21 10:49:24 PDT

Lastly updated by: ab

External ID: job_1412928093521_0015

http://solaris.abe.cloudera.com:8088/proxy/application_1412928093521_0015/

Source Connector schema: Schema{name=test,columns=[

FixedPoint{name=id,nullable=true,byteSize=null,unsigned=null}]}

2014-10-21 10:49:24 PDT: BOOTING - Progress is not available

In Sqoop1, creating a link and job is unnecessary. You just need to specify every thing in the command line:

sqoop import --connect jdbc:oracle:thin:@//localhost/xe --username sqoop --password sqoop --table sqoop --hbase-table sqoop --column-family cf --hbase-row-key id --hbase-create-table --hbase-bulkload

3. Oozie has support for Sqoop1, not Sqoop2 yet

If you want to Sqoop data with Oozie, then you’re going to need to use Sqoop1.

<workflow-app name="test" xmlns="uri:oozie:workflow:0.4">

<start to="MySQL-to-HDFS"/>

<action name="MySQL-to-HDFS">

<sqoop xmlns="uri:oozie:sqoop-action:0.2">

<job-tracker>${jobTracker}</job-tracker>

<name-node>${nameNode}</name-node>

<command>import --connect jdbc:oracle:thin:@//localhost/xe --username sqoop --password sqoop --table sqoop --hbase-table sqoop --column-family cf --hbase-row-key id --hbase-create-table --hbase-bulkload</command>

</sqoop>

<ok to="end"/>

<error to="kill"/>

</action>

<kill name="kill">

<message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>

</kill>

<end name="end"/>

</workflow-app>

4. Cloudera Manager manages Sqoop2 service and Sqoop1

Cloudera manager manages both Sqoop versions (as of CM 5.0). The default Sqoop service that’s installed is the Sqoop 2 service (even though it might be named SQOOP-1).

Which am I Using?

A couple of good ways to know which you version of Sqoop you are using are:

1. Check for /usr/lib/sqoop or /usr/lib/sqoop2

Sqoop2 uses the /usr/lib/sqoop2 directory, while Sqoop1 uses the /usr/lib/sqoop directory. These directories should hold the binaries and content necessary for running both versions of Sqoop.

2. Check for /usr/lib/sqoop or /usr/lib/sqoop2

Sqoop2 uses the /var/lib/sqoop2 directory, while Sqoop1 uses the /var/lib/sqoop directory. These directories should hold extra JDBC drivers necessary for connecting to different data sources (such as Teradata and Netezza).

3. Check /var/log/sqoop2

Sqoop2 stores its logs in /var/log/sqoop2. If this files in this directory are being updated, you’re likely using Sqoop2.

When in doubt about anything, check with the Sqoop mailing list.

Summary

It’s good to know when you’re using Sqoop1 versus Sqoop2. Since integration with the rest of the Big Data ecosystem varies (for instance, there’s a Sqoop2 UI in Hue, but not a Sqoop1 UI), it can get confusing to know which version of Sqoop you’re using. Hopfully the above list cleared things up! Now go get Sqooping!

Source : http://ingest.tips/2014/10/21/sqoop-1-or-sqoop-2/

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s