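1/ List all databases on the MySQL server:
$ sqoop list-databases --connect jdbc:mysql://localhost:3306/ --username root --password 123456

2/ Connect to MySQL and list the tables in the test database:
$ sqoop list-tables --connect jdbc:mysql://localhost:3306/test --username root --password 123456

3/ Copy the structure of a relational table into Hive (structure only, the data is not copied):
$ sqoop create-hive-table --connect jdbc:mysql://localhost:3306/test --table sqoop_testTabinMySql --username root --password 123456 --hive-table testNewTabInHive

4/ Import a table from a relational database into Hive:
$ sqoop import --connect jdbc:mysql://localhost:3306/zxtest --username root --password 123456 --table sqoop_test --hive-import --hive-table s_test -m 1

5/ Export the data of a Hive table to MySQL. The MySQL table hive_test must be created before the export:
$ sqoop export --connect jdbc:mysql://localhost:3306/zxtest --username root --password root --table hive_test --export-dir /user/hive/warehouse/new_test_partition/dt=2012-03-05

6/ Copy the data of a database table to files on HDFS:
$ sqoop import --connect jdbc:mysql://localhost:3306/compression --username hadoop --password 123456 --table HADOOP_USER_INFO -m 1 --target-dir /user/test

7/ Incrementally import table data from the database into HDFS:
$ sqoop import --connect jdbc:mysql://localhost:3306/compression --username hadoop --password 123456 --table HADOOP_USER_INFO -m 1 --target-dir /user/test --check-column id --incremental append --last-value 3

Import
Sqoop data import has the following features:
1. Supports text files (--as-textfile), Avro (--as-avrodatafile), and SequenceFiles (--as-sequencefile). RCFile is not yet supported. The default is text.
2. Supports appending data, specified with --append.
3. Supports column selection (--columns) and row selection (--where), both used together with --table.
4. Supports free-form queries (--query), for example reading the result of a multi-table join such as 'SELECT a.*, b.* FROM a JOIN b on (a.id == b.id)'. This cannot be used together with --table.
5. Supports a custom number of map tasks (-m).
6. Supports compression (--compress).
7. Supports importing relational data into Hive (--hive-import) and HBase (--hbase-table).
Importing into Hive takes three steps: 1) import the data to HDFS, 2) create the Hive table, 3) use "LOAD DATA INPATH" to load the data into the table.
Importing into HBase takes two steps: 1) import the data to HDFS, 2) call the HBase put operation to write the data into the table row by row.

import moves data from a relational database to HDFS. The default directory is /user/${user.name}/${tablename}; the target directory on HDFS can be set with --target-dir. export is the reverse of import: it loads data from HDFS into a relational database.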
Sqoop performs the load with map tasks. Each map task is independent and there is no overall transaction, so some map tasks may fail and leave the data only partly loaded. As a compromise, Sqoop lets you specify an intermediate staging table: data is written there first and moved from the staging table to the destination table only after the load succeeds. The option is --staging-table, and the staging table must also be created in advance:
$ sqoop export --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table sds --export-dir /user/guojian/sds --staging-table sds_tmp
Note that the staging table is not available when using --direct, --update-key, or --call (stored procedure).

create-hive-table
Creates a Hive table from a relational database table.
Arguments:
--hive-home: the Hive installation directory; overrides the default Hive directory
--hive-overwrite: overwrite data that already exists in the Hive table
--create-hive-table: false by default; if the target table already exists, the job fails
--hive-table: the name of the Hive table to create
--table: the name of the relational database table
$ sqoop create-hive-table --connect jdbc:mysql://192.168.81.176/sqoop --username root --password passwd --table sds --hive-table sds_bak
By default sds_bak is created in the default database. This step depends on HCatalog. HCatalog must be installed first, otherwise the following error is reported:
Hive history file=/tmp/guojian/hive_job_log_cfbe2de9-a358-4130-945c-b97c0add649d_1628102887.txt
FAILED: ParseException line 1:44 mismatched input ')' expecting Identifier near '(' in column specification

metastore
Configures shared metadata for Sqoop jobs, so that multiple users can define and execute Sqoop jobs in the same metastore. By default the metadata is stored in ~/.sqoop.
Start: sqoop metastore
Stop: sqoop metastore --shutdown
The storage location of the metastore files is configured in conf/sqoop-site.xml with sqoop.metastore.server.location, which points to a local file.
The metastore is accessible over TCP/IP. The port is configured with sqoop.metastore.server.port and defaults to 16000.
Clients connect by setting sqoop.metastore.client.autoconnect.url or by using --meta-connect, with a value of the form jdbc:hsqldb:hsql://<server-name>:<port>/sqoop, for example jdbc:hsqldb:hsql://metaserver.example.com:16000/sqoop.

Sqoop will read the entire content of the password file and use it as the password. This includes any trailing whitespace characters, such as the newline that most text editors add by default. You need to make sure that your password file contains only characters that belong to your password. On the command line you can use the echo command with the -n switch to store a password without any trailing whitespace. For example, to store the password secret you would call echo -n "secret" > password.file.

Sqoop automatically supports several databases, including MySQL.
Connect strings beginning with jdbc:mysql:// are handled automatically in Sqoop. (A full list of databases with built-in support is provided in the "Supported Databases" section. For some, you may need to install the JDBC driver yourself.)

You can use Sqoop with any other JDBC-compliant database. First, download the appropriate JDBC driver for the type of database you want to import, and install the .jar file in the $SQOOP_HOME/lib directory on your client machine. (This will be /usr/lib/sqoop/lib if you installed from an RPM or Debian package.) Each driver .jar file also has a specific driver class which defines the entry point to the driver. For example, MySQL's Connector/J library has a driver class of com.mysql.jdbc.Driver. Refer to your database vendor-specific documentation to determine the main driver class. This class must be provided as an argument to Sqoop with --driver.

For example, to connect to a SQLServer database, first download the driver from microsoft.com and install it in your Sqoop lib path.

Sqoop can also import the result set of an arbitrary SQL query. Instead of using the --table, --columns and --where arguments, you can specify a SQL statement with the --query argument.

When importing a free-form query, you must specify a destination directory with --target-dir.

Note: If you are issuing the query wrapped in double quotes ("), you will have to use \$CONDITIONS instead of just $CONDITIONS to prevent your shell from treating it as a shell variable. For example, a double-quoted query may look like: "SELECT * FROM x WHERE a='foo' AND \$CONDITIONS"

The facility of using a free-form query in the current version of Sqoop is limited to simple queries where there are no ambiguous projections and no OR conditions in the WHERE clause. Use of complex queries, such as queries that have sub-queries or joins leading to ambiguous projections, can lead to unexpected results. In other words, the WHERE clause must not contain OR.

Sqoop imports data in parallel from most database sources.
You can specify the number of map tasks (parallel processes) to use to perform the import by using the -m or --num-mappers argument. Each of these arguments takes an integer value which corresponds to the degree of parallelism to employ. By default, four tasks are used. Some databases may see improved performance by increasing this value to 8 or 16. (Note: four tasks are started by default.)

When performing parallel imports, Sqoop needs a criterion by which it can split the workload. Sqoop uses a splitting column to split the workload. By default, Sqoop will identify the primary key column (if present) in a table and use it as the splitting column. The low and high values for the splitting column are retrieved from the database, and the map tasks operate on evenly sized components of the total range. For example, if you had a table with a primary key column of id whose minimum value was 0 and maximum value was 1000, and Sqoop was directed to use 4 tasks, Sqoop would run four processes which each execute SQL statements of the form SELECT * FROM sometable WHERE id >= lo AND id < hi, with (lo, hi) set to (0, 250), (250, 500), (500, 750), and (750, 1001) in the different tasks.

Sqoop cannot currently split on multi-column indices. If your table has no index column, or has a multi-column key, then you must also manually choose a splitting column.

If a table does not have a primary key defined and --split-by is not provided, then the import will fail unless the number of mappers is explicitly set to one with the --num-mappers 1 option or the --autoreset-to-one-mapper option is used. The option --autoreset-to-one-mapper is typically used with the import-all-tables tool to automatically handle tables without a primary key in a schema.

Sqoop copies the jars in the $SQOOP_HOME/lib folder to the job cache every time a Sqoop job starts. When launched by Oozie this is unnecessary, since Oozie uses its own Sqoop share lib which keeps Sqoop dependencies in the distributed cache.
Oozie will do the localization of the Sqoop dependencies on each worker node only once, during the first Sqoop job, and reuse the jars on the worker node for subsequent jobs. Using the --skip-dist-cache option in the Sqoop command when launched by Oozie skips the step in which Sqoop copies its dependencies to the job cache and saves massive I/O.

MySQL provides the mysqldump tool, which can export data from MySQL to other systems very quickly. By supplying the --direct argument, you are specifying that Sqoop should attempt the direct import channel. This channel may be higher performance than using JDBC.

By default, Sqoop will import a table named foo to a directory named foo inside your home directory in HDFS. For example, if your username is someuser, then the import tool will write to /user/someuser/foo/(files). You can adjust the parent directory of the import with the --warehouse-dir argument. For example:
$ sqoop import --connect <connect-str> --table foo --warehouse-dir /shared

When using direct mode, you can specify additional arguments which should be passed to the underlying tool. If the argument -- is given on the command line, then subsequent arguments are sent directly to the underlying tool. For example, the following adjusts the character set used by mysqldump:
$ sqoop import --connect jdbc:mysql://server.foo.com/db --table bar \
    --direct -- --default-character-set=latin1

By default, imports go to a new target location. If the destination directory already exists in HDFS, Sqoop will refuse to import and overwrite that directory's contents. If you use the --append argument, Sqoop will import data to a temporary directory and then rename the files into the normal target directory in a manner that does not conflict with existing filenames in that directory.

Sqoop is preconfigured to map most SQL types to appropriate Java or Hive representatives.
However, the default mapping might not be suitable for everyone and can be overridden by --map-column-java (for changing the mapping to Java) or --map-column-hive (for changing the Hive mapping).

Sqoop expects a comma-separated list of mappings in the form <name of column>=<new type>. For example:
$ sqoop import ... --map-column-java id=String,value=Integer
Sqoop will raise an exception if a configured mapping is not used.
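To make the mapping format concrete, here is a minimal Python sketch (not Sqoop's own code) that parses a value in the column=type form and rejects a mapping that names a column which is not part of the import. The function name parse_column_mapping, the table_columns argument, and the use of ValueError are assumptions made for illustration.

```python
def parse_column_mapping(spec, table_columns):
    """Parse 'col=Type,col=Type' and check that every mapped column exists."""
    mapping = {}
    for item in spec.split(","):
        name, sep, new_type = item.partition("=")
        if not sep or not name.strip() or not new_type.strip():
            raise ValueError(f"malformed mapping: {item!r}")
        mapping[name.strip()] = new_type.strip()
    unused = [col for col in mapping if col not in table_columns]
    if unused:
        # Models the documented behaviour: an unused mapping is an error.
        raise ValueError(f"mapping refers to unknown column(s): {unused}")
    return mapping
```

Calling parse_column_mapping("id=String,value=Integer", ["id", "value"]) returns a dictionary with the two overrides, while a mapping for a column that is not in the list raises an error.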
You should specify append mode when importing a table where new rows are continually being added with increasing row id values. You specify the column containing the row’s id with --check-column. Sqoop imports rows where the check column has a value greater than the one specified with --last-value.
At the end of an incremental import, the value which should be specified as --last-value for a subsequent import is printed to the screen. When running a subsequent import, you should specify --last-value in this way to ensure you import only the new or updated data.
You can import data in one of two file formats: delimited text or SequenceFiles.
Delimited text is appropriate for most non-binary data types. It also readily supports further manipulation by other tools, such as Hive.
Reading from SequenceFiles is higher-performance than reading from text files, as records do not need to be parsed.
By default, data is not compressed. You can compress your data by using the deflate (gzip) algorithm with the -z or --compress argument, or specify any Hadoop compression codec using the --compression-codec argument. This applies to SequenceFile, text, and Avro files.
While the choice of delimiters is most important for a text-mode import, it is still relevant if you import to SequenceFiles with --as-sequencefile. The generated class' toString() method will use the delimiters you specify, so subsequent formatting of the output data will rely on the delimiters you choose.
When Sqoop imports data to HDFS, it generates a Java class which can reinterpret the text files that it creates when doing a delimited-format import. The delimiters are chosen with arguments such as --fields-terminated-by; this controls both how the data is written to disk, and how the generated parse() method reinterprets this data. The delimiters used by the parse() method can be chosen independently of the output arguments, by using --input-fields-terminated-by, and so on. This is useful, for example, to generate classes which can parse records created with one set of delimiters, and emit the records to a different set of files using a separate set of delimiters.
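The idea of parsing with one set of delimiters and emitting with another can be sketched in Python with the standard csv module. This is only an analogy for what the generated Java class does; the function name redelimit and its parameters are hypothetical, with in_delim playing the role of --input-fields-terminated-by and out_delim the role of --fields-terminated-by.

```python
import csv
import io

def redelimit(text, in_delim=",", out_delim="\t"):
    """Parse records written with one field delimiter and re-emit them with another."""
    reader = csv.reader(io.StringIO(text), delimiter=in_delim, quotechar='"')
    out = io.StringIO()
    writer = csv.writer(out, delimiter=out_delim, quotechar='"',
                        lineterminator="\n")
    for record in reader:
        # Each record is a list of field strings, independent of the delimiter.
        writer.writerow(record)
    return out.getvalue()
```

Because the records are held as lists of fields between the read and the write, the input and output delimiters can be chosen independently, which is the point the paragraph above makes.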
September 2, 2016: reading notes
Hive can put data into partitions for more efficient query performance. You can tell a Sqoop job to import data for Hive into a particular partition by specifying the --hive-partition-key and --hive-partition-value arguments. The partition value must be a string. Please see the Hive documentation for more details on partitioning.
7.3. Example Invocations
The following examples illustrate how to use the import tool in a variety of situations.
A basic import of a table named EMPLOYEES in the corp database:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES
A basic import requiring a login:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--username SomeUser -P
Enter password: (hidden)
Selecting specific columns from the EMPLOYEES table:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--columns "employee_id,first_name,last_name,job_title"
Controlling the import parallelism (using 8 parallel tasks):
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
-m 8
Storing data in SequenceFiles, and setting the generated class name to com.foocorp.Employee:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--class-name com.foocorp.Employee --as-sequencefile
Specifying the delimiters to use in a text-mode import:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--fields-terminated-by '\t' --lines-terminated-by '\n' \
--optionally-enclosed-by '\"'
Importing the data to Hive:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--hive-import
Importing only new employees:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--where "start_date > '2010-01-01'"
Changing the splitting column from the default:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp --table EMPLOYEES \
--split-by dept_id
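The range splitting described earlier in these notes (minimum 0, maximum 1000, four tasks) can be sketched in Python. This is a simplified model for integer split columns, not Sqoop's actual splitter; the function names split_ranges and split_queries are assumptions.

```python
def split_ranges(lo, hi, num_mappers):
    """Return one (lo, hi) pair per mapper; each mapper reads lo <= col < hi."""
    span = hi - lo
    starts = [lo + span * i // num_mappers for i in range(num_mappers)]
    # The last range ends one past the maximum so the maximum row is included.
    bounds = starts + [hi + 1]
    return list(zip(bounds[:-1], bounds[1:]))

def split_queries(table, column, lo, hi, num_mappers):
    """Build the per-mapper SELECT statements for the computed ranges."""
    return [
        f"SELECT * FROM {table} WHERE {column} >= {a} AND {column} < {b}"
        for a, b in split_ranges(lo, hi, num_mappers)
    ]
```

With split_ranges(0, 1000, 4) the sketch yields (0, 250), (250, 500), (500, 750), and (750, 1001), matching the example in the text. A skewed split column would give the mappers unequal amounts of work, which is one reason to choose --split-by deliberately.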
Verifying that an import was successful:
$ hadoop fs -ls EMPLOYEES
Found 5 items
drwxr-xr-x   - someuser somegrp          0 2010-04-27 16:40 /user/someuser/EMPLOYEES/_logs
-rw-r--r--   1 someuser somegrp    2913511 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00000
-rw-r--r--   1 someuser somegrp    1683938 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00001
-rw-r--r--   1 someuser somegrp    7245839 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00002
-rw-r--r--   1 someuser somegrp    7842523 2010-04-27 16:40 /user/someuser/EMPLOYEES/part-m-00003
$ hadoop fs -cat EMPLOYEES/part-m-00000 | head -n 10
0,joe,smith,engineering
1,jane,doe,marketing
...
Performing an incremental import of new data, after having already imported the first 100,000 rows of a table:
$ sqoop import --connect jdbc:mysql://db.foo.com/somedb --table sometable \
--where "id > 100000" --target-dir /incremental_dataset --append
An import of a table named EMPLOYEES in the corp database that validates the import by comparing the table row count with the number of rows copied into HDFS:
$ sqoop import --connect jdbc:mysql://db.foo.com/corp \
--table EMPLOYEES --validate
The import-all-tables tool imports a set of tables from an RDBMS to HDFS. Data from each table is stored in a separate directory in HDFS.
For the import-all-tables tool to be useful, the following conditions must be met:
Each table must have a single-column primary key, or the --autoreset-to-one-mapper option must be used.
You must intend to import all columns of each table.
You must not intend to use a non-default splitting column, nor impose any conditions via a WHERE clause.
These arguments behave in the same manner as they do when used for the sqoop-import tool, but the --table, --split-by, --columns, and --where arguments are invalid for sqoop-import-all-tables. The --exclude-tables argument is for sqoop-import-all-tables only.
8.3. Example Invocations
Import all tables from the corp database:
$ sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
Verifying that it worked:
$ hadoop fs -ls
Found 4 items
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/EMPLOYEES
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/PAYCHECKS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/DEPARTMENTS
drwxr-xr-x   - someuser somegrp       0 2010-04-27 17:15 /user/someuser/OFFICE_SUPPLIES
The export tool exports a set of files from HDFS back to an RDBMS. The target table must already exist in the database. The input files are read and parsed into a set of records according to the user-specified delimiters.
The default operation is to transform these into a set of INSERT statements that inject the records into the database. In "update mode," Sqoop will generate UPDATE statements that replace existing records in the database, and in "call mode" Sqoop will make a stored procedure call for each record.
Although the Hadoop generic arguments must precede any export arguments, the export arguments can be entered in any order with respect to one another. Note: the export arguments themselves have no required order.
Sqoop supports additional import targets beyond HDFS and Hive. Sqoop can also import records into a table in HBase.
By specifying --hbase-table, you instruct Sqoop to import to a table in HBase rather than a directory in HDFS. Sqoop will import data to the table specified as the argument to --hbase-table. Each row of the input table will be transformed into an HBase Put operation to a row of the output table. The key for each row is taken from a column of the input. By default Sqoop will use the split-by column as the row key column. If that is not specified, it will try to identify the primary key column, if any, of the source table. You can manually specify the row key column with --hbase-row-key. Each output column will be placed in the same column family, which must be specified with --column-family.
Note: This function is incompatible with direct import (parameter --direct).
Note: The analogous import into Accumulo is also incompatible with direct import (parameter --direct), and cannot be used in the same operation as an HBase import.
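As a rough illustration of how a source row becomes an HBase Put, the following Python sketch maps one row to a row key plus cells in a single column family. It does not talk to HBase. The function name row_to_put is hypothetical, and two simplifications are assumptions of this sketch rather than statements from the text: the row key column is not repeated as a cell, and null values are skipped.

```python
def row_to_put(row, row_key_column, column_family):
    """Turn one source row (a dict) into (row_key, cells) in HBase style.

    Every remaining column lands in the same column family, keyed as
    'family:qualifier'. Values are stored as strings.
    """
    row_key = str(row[row_key_column])
    cells = {
        f"{column_family}:{name}": str(value)
        for name, value in row.items()
        # Assumption: skip the key column itself and any null values.
        if name != row_key_column and value is not None
    }
    return row_key, cells
```

Here row_key_column corresponds to --hbase-row-key and column_family to --column-family.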
The --export-dir argument and one of --table or --call are required. These specify the table to populate in the database (or the stored procedure to call), and the directory in HDFS that contains the source data.
By default, all columns within a table are selected for export. You can select a subset of columns and control their ordering by using the --columns argument. This should include a comma-delimited list of columns to export. For example: --columns "col1,col2,col3". Note that columns that are not included in the --columns parameter need to have either a defined default value or allow NULL values. Otherwise your database will reject the imported data, which in turn will make the Sqoop job fail. Note: when exporting, columns that are not selected must be nullable (or have a default) in the database table, otherwise the database refuses the data.
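The effect of leaving a column out of --columns can be reproduced with an in-memory SQLite database in Python. This is only a stand-in for the real target database; the function name export_subset and the column definitions passed in are hypothetical, while the table name bar and the column names come from the examples in these notes.

```python
import sqlite3

def export_subset(rows, columns, ddl):
    """Insert rows naming only `columns`; return None on success or the DB error text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(ddl)
        placeholders = ", ".join("?" for _ in columns)
        sql = f"INSERT INTO bar ({', '.join(columns)}) VALUES ({placeholders})"
        with conn:  # commit on success, roll back on error
            conn.executemany(sql, rows)
        return None
    except sqlite3.IntegrityError as exc:
        # The database rejects the row because an omitted column is NOT NULL
        # and has no default value.
        return str(exc)
    finally:
        conn.close()
```

When the omitted column is nullable or has a default, the insert succeeds. When it is declared NOT NULL without a default, the database raises an integrity error, which is the situation that makes the Sqoop job fail.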
By default, Sqoop will use four tasks in parallel for the export process. This may not be optimal; you will need to experiment with your own particular setup. Additional tasks may offer better concurrency, but if the database is already bottlenecked on updating indices, invoking triggers, and so on, then additional load may decrease performance. The --num-mappers or -m arguments control the number of map tasks, which is the degree of parallelism used. Note: the number of map tasks can be increased, but once the database is the bottleneck, more map tasks reduce performance instead.
If --input-null-string is not specified, then the string "null" will be interpreted as null for string-type columns. If --input-null-non-string is not specified, then both the string "null" and the empty string will be interpreted as null for non-string columns. Note that the empty string will always be interpreted as null for non-string columns, in addition to any other string specified by --input-null-non-string. Note: during an export, a null value destined for a NOT NULL column interrupts the job.
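The null-handling rules above can be written down as a small Python function. It is a sketch of the rules as stated in the text, not Sqoop's parser; the function name parse_export_field is hypothetical, and the default value "null" for both parameters models the case where the corresponding option is not given.

```python
def parse_export_field(text, is_string_column,
                       input_null_string="null",
                       input_null_non_string="null"):
    """Return None when `text` should be treated as SQL NULL, else return `text`."""
    if is_string_column:
        # Only the configured (or default) token means NULL for string columns.
        return None if text == input_null_string else text
    # For non-string columns the empty string is always NULL,
    # in addition to the configured (or default) token.
    if text == "" or text == input_null_non_string:
        return None
    return text
```

Note the asymmetry: an empty field stays an empty string in a string column but becomes NULL in a non-string column.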
Since Sqoop breaks down the export process into multiple transactions, it is possible that a failed export job may result in partial data being committed to the database. This can further lead to subsequent jobs failing due to insert collisions in some cases, or lead to duplicated data in others. You can overcome this problem by specifying a staging table via the --staging-table option, which acts as an auxiliary table that is used to stage exported data. The staged data is finally moved to the destination table in a single transaction. Note: an export is split into several parts with separate transactions, so an error in one part leaves only some of the data in the table. With --staging-table the data is first saved to a temporary table and then moved to the destination table as a single transaction.
Support for staging data prior to pushing it into the destination table is not always available for --direct exports. It is also not available when export is invoked using the --update-key option for updating existing data, and when stored procedures are used to insert the data.
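The staging-table idea can be sketched with an in-memory SQLite database in Python: rows are committed piecemeal into the staging table and moved to the destination in one transaction at the end. This only models the behaviour described above and is not how Sqoop is implemented. The table names sds and sds_tmp come from the earlier example in these notes; the two-column layout, the function names, and the fail_at parameter are assumptions.

```python
import sqlite3

def make_db():
    """Create the destination table and a staging table with the same structure."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sds (id INTEGER PRIMARY KEY, msg TEXT)")
    conn.execute("CREATE TABLE sds_tmp (id INTEGER PRIMARY KEY, msg TEXT)")
    conn.commit()
    return conn

def export_with_staging(conn, rows, fail_at=None):
    """Load rows into sds_tmp with many commits, then move them in one transaction."""
    conn.execute("DELETE FROM sds_tmp")
    conn.commit()
    for i, row in enumerate(rows):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated map task failure")
        conn.execute("INSERT INTO sds_tmp VALUES (?, ?)", row)
        conn.commit()  # partial commits only ever touch the staging table
    with conn:  # single transaction: all staged rows move, or none do
        conn.execute("INSERT INTO sds SELECT * FROM sds_tmp")
        conn.execute("DELETE FROM sds_tmp")
```

If the simulated failure happens midway, the destination table sds is untouched and only sds_tmp holds the partial data, which is the guarantee a staging table is meant to provide.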
By default, sqoop-export appends new rows to a table; each input record is transformed into an INSERT statement that adds a row to the target database table. If your table has constraints (e.g., a primary key column whose values must be unique) and already contains data, you must take care to avoid inserting records that violate these constraints. The export process will fail if an INSERT statement fails. This mode is primarily intended for exporting records to a new, empty table intended to receive these results.
If you specify the --update-key argument, Sqoop will instead modify an existing dataset in the database. Each input record is treated as an UPDATE statement that modifies an existing row. The row a statement modifies is determined by the column name(s) specified with --update-key. For example, consider the following table definition:
Depending on the target database, you may also specify the --update-mode argument with allowinsert mode if you want to update rows if they exist in the database already or insert rows if they do not exist yet.
If an UPDATE statement modifies no rows, this is not considered an error; the export will silently continue. (In effect, this means that an update-based export will not insert new rows into the database.) Likewise, if the column specified with --update-key does not uniquely identify rows and multiple rows are updated by a single statement, this condition is also undetected.
The argument --update-key can also be given a comma-separated list of column names, in which case Sqoop will match all keys from this list before updating any existing record.
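The update-mode semantics can be modelled in Python against an in-memory SQLite database. This is a sketch under stated assumptions: the function name export_update, its parameters, and the foo table used when calling it are hypothetical, and the allow_insert flag only imitates the effect of --update-mode allowinsert with a plain INSERT, whereas real connectors rely on database-specific support.

```python
import sqlite3

def export_update(conn, table, columns, update_key, records, allow_insert=False):
    """Apply each record as UPDATE ... WHERE key = ?; optionally insert misses."""
    set_cols = [c for c in columns if c not in update_key]
    update_sql = "UPDATE {} SET {} WHERE {}".format(
        table,
        ", ".join(f"{c} = ?" for c in set_cols),
        " AND ".join(f"{c} = ?" for c in update_key),
    )
    insert_sql = "INSERT INTO {} ({}) VALUES ({})".format(
        table, ", ".join(columns), ", ".join("?" for _ in columns))
    for rec in records:
        row = dict(zip(columns, rec))
        params = [row[c] for c in set_cols] + [row[c] for c in update_key]
        cur = conn.execute(update_sql, params)
        # An UPDATE that matches no rows is not an error in plain update mode.
        if cur.rowcount == 0 and allow_insert:
            conn.execute(insert_sql, rec)
    conn.commit()
```

With allow_insert left at False, a record whose key is absent from the table is silently dropped, which mirrors the statement that an update-based export will not insert new rows.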
Exports are performed by multiple writers in parallel. Each writer uses a separate connection to the database; these have separate transactions from one another. Sqoop uses the multi-row INSERT syntax to insert up to 100 records per statement. Every 100 statements, the current transaction within a writer task is committed, causing a commit every 10,000 rows. This ensures that transaction buffers do not grow without bound and cause out-of-memory conditions. Therefore, an export is not an atomic process. Partial results from the export will become visible before the export is complete. Note: an export runs in parallel and every writer works in its own transaction. The periodic commits keep the transaction buffers bounded and avoid running out of memory. An export is therefore not an atomic transaction, and data may be visible in the table before the export has finished.
Exports may fail for a number of reasons:
Loss of connectivity from the Hadoop cluster to the database (either due to hardware fault, or server software crashes)
Attempting to INSERT a row which violates a consistency constraint (for example, inserting a duplicate primary key value)
Attempting to parse an incomplete or malformed record from the HDFS source data
Attempting to parse records using incorrect delimiters
Capacity issues (such as insufficient RAM or disk space)
If an export map task fails due to these or other reasons, it will cause the export job to fail. The results of a failed export are undefined. Each export map task operates in a separate transaction. Furthermore, individual map tasks commit their current transaction periodically. If a task fails, the current transaction will be rolled back. Any previously-committed transactions will remain durable in the database, leading to a partially-complete export.
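In short: each export map task works in its own transaction and commits independently. If a task fails, its current transaction is rolled back while the transactions it committed earlier stay in the database, which leaves a partially completed export.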
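The commit arithmetic makes it possible to estimate how much data a failed writer leaves behind. The Python sketch below uses only the two figures quoted above (100 records per statement, a commit every 100 statements); the function name committed_rows_before_failure is hypothetical, and it assumes the writer fails after sending a given number of rows and that the uncommitted tail is rolled back.

```python
ROWS_PER_STATEMENT = 100      # multi-row INSERT size quoted in the text
STATEMENTS_PER_COMMIT = 100   # commit interval quoted in the text

def committed_rows_before_failure(rows_written):
    """Rows that remain in the database if a writer fails after `rows_written` rows."""
    rows_per_commit = ROWS_PER_STATEMENT * STATEMENTS_PER_COMMIT  # 10,000 rows
    # Only whole commit batches survive; the open transaction is rolled back.
    return (rows_written // rows_per_commit) * rows_per_commit
```

For a writer that fails after 25,300 rows, the sketch reports 20,000 rows left in the table, so a rerun without a staging table would collide with or duplicate those rows.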
A basic export to populate a table named bar:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data
This example takes the files in /results/bar_data and injects their contents into the bar table in the foo database on db.example.com. The target table must already exist in the database. Sqoop performs a set of INSERT INTO operations, without regard for existing content. If Sqoop attempts to insert rows which violate constraints in the database (for example, a particular primary key value already exists), then the export fails.
Note: this is an insert operation, so the table must already exist, and violating a consistency constraint makes the export fail.
Alternatively, you can specify the columns to be exported by providing --columns "col1,col2,col3". Please note that columns that are not included in the --columns parameter need to have either a defined default value or allow NULL values. Otherwise your database will reject the imported data, which in turn will make the Sqoop job fail. Note: columns that are not selected for export must either have a default value or allow NULL in the database, otherwise the export fails.
Another basic export to populate a table named bar with validation enabled:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --table bar \
--export-dir /results/bar_data --validate
An export that calls a stored procedure named barproc for every record in /results/bar_data would look like:
$ sqoop export --connect jdbc:mysql://db.example.com/foo --call barproc \
--export-dir /results/bar_data
Validation
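Validate the data copied, for either an import or an export, by comparing the row counts from the source and the target after the copy. Note: the row counts of the source and the target table are compared to check the result of the operation.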
Imports and exports can be repeatedly performed by issuing the same command multiple times. Especially when using the incremental import capability, this is an expected scenario.
Sqoop allows you to define saved jobs which make this process easier. A saved job records the configuration information required to execute a Sqoop command at a later time. The section on the sqoop-job tool describes how to create and work with saved jobs.
By default, job descriptions are saved to a private repository stored in $HOME/.sqoop/. You can configure Sqoop to instead use a shared metastore, which makes saved jobs available to multiple users across a shared cluster. Starting the metastore is covered by the section on the sqoop-metastore tool.
Note: use sqoop job to define imports and exports that are run repeatedly.
The job tool allows you to create and work with saved jobs. Saved jobs remember the parameters used to specify a job, so they can be re-executed by invoking the job by its handle.
If a saved job is configured to perform an incremental import, state regarding the most recently imported rows is updated in the saved job to allow the job to continually import only the newest rows.
Note: a job can be executed repeatedly; for an incremental import, newly added data keeps being appended to the existing files.
Creating saved jobs is done with the --create action. This operation requires a -- followed by a tool name and its arguments. The tool and its arguments will form the basis of the saved job. Consider:
$ sqoop job --create myjob -- import --connect jdbc:mysql://example.com/db \
--table mytable
This creates a job named myjob which can be executed later. The job is not run. This job is now available in the list of saved jobs:
$ sqoop job --list
Available jobs:
myjob
We can inspect the configuration of a job with the show action:
$ sqoop job --show myjob
Job: myjob
Tool: import
Options:
----------------------------
direct.import = false
codegen.input.delimiters.record = 0
hdfs.append.dir = false
db.table = mytable
...
And if we are satisfied with it, we can run the job with exec:
$ sqoop job --exec myjob
10/08/19 13:08:45 INFO tool.CodeGenTool: Beginning code generation
...
The exec action allows you to override arguments of the saved job by supplying them after a --. For example, if the database were changed to require a username, we could specify the username and password with:
$ sqoop job --exec myjob -- --username someuser -P
Enter password:
...
Incremental imports are performed by comparing the values in a check column against a reference value for the most recent import. For example, if the --incremental append argument was specified, along with --check-column id and --last-value 100, all rows with id > 100 will be imported. If an incremental import is run from the command line, the value which should be specified as --last-value in a subsequent incremental import will be printed to the screen for your reference. If an incremental import is run from a saved job, this value will be retained in the saved job. Subsequent runs of sqoop job --exec someIncrementalJob will continue to import only newer rows than those previously imported.
Incremental import: --incremental append --check-column id --last-value 100
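The selection rule of an append-mode incremental import can be sketched in a few lines of Python. The function name incremental_append and the list-of-dicts representation of the table are assumptions; the rule itself (rows with a check-column value greater than the last value, and a new last value to carry forward) is the one described above.

```python
def incremental_append(rows, check_column, last_value):
    """Return (new_rows, next_last_value) for an append-mode incremental import."""
    new_rows = [r for r in rows if r[check_column] > last_value]
    # The largest value seen becomes --last-value for the next run; if nothing
    # new arrived, the previous value is kept.
    next_last_value = max((r[check_column] for r in new_rows), default=last_value)
    return new_rows, next_last_value
```

A saved job effectively stores next_last_value between runs, so each execution only picks up rows added since the previous one.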
The metastore tool configures Sqoop to host a shared metadata repository. Multiple users and/or remote users can define and execute saved jobs (created with sqoop job) defined in this metastore.
Clients must be configured to connect to the metastore in sqoop-site.xml or with the --meta-connect argument.
The metastore is available over TCP/IP. The port is controlled by the sqoop.metastore.server.port configuration parameter, and defaults to 16000.
The merge tool runs a MapReduce job that takes two directories as input: a newer dataset, and an older one. These are specified with --new-data and --onto respectively. The output of the MapReduce job will be placed in the directory in HDFS specified by --target-dir. Note: when merging datasets, specify the new dataset, the old dataset, and the target directory.
When merging the datasets, it is assumed that there is a unique primary key value in each record. The column for the primary key is specified with --merge-key. Multiple rows in the same dataset should not have the same primary key, or else data loss may occur. Note: the merge relies on the consistency constraint that the primary key is unique.
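The merge rule can be illustrated with plain Python dictionaries: records are keyed by the merge key and a record from the newer dataset replaces the older record with the same key. This is a conceptual sketch, not the MapReduce job itself; the function name merge_datasets and the list-of-dicts inputs are assumptions.

```python
def merge_datasets(new_data, onto, merge_key):
    """Merge records by key; rows from new_data replace rows from onto."""
    merged = {row[merge_key]: row for row in onto}
    # A later row with the same key overwrites the earlier one. This is also
    # why duplicate keys inside a single dataset silently lose data.
    merged.update({row[merge_key]: row for row in new_data})
    return list(merged.values())
```

Here new_data corresponds to --new-data, onto to --onto, and merge_key to --merge-key.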
The codegen tool generates Java classes which encapsulate and interpret imported records. The Java definition of a record is instantiated as part of the import process, but can also be performed separately. For example, if Java source is lost, it can be recreated. New versions of a class can be created which use different delimiters between fields, and so on. Note: the Java class files are generated automatically.
Recreate the record interpretation code for the employees table of a corporate database:
$ sqoop codegen --connect jdbc:mysql://db.example.com/corp \
--table employees
The create-hive-table tool populates a Hive metastore with a definition for a table based on a database table previously imported to HDFS, or one planned to be imported. This effectively performs the "--hive-import" step of sqoop-import without running the preceding import.
If data was already loaded to HDFS, you can use this tool to finish the pipeline of importing the data to Hive. You can also create Hive tables with this tool; data then can be imported and populated into the target after a preprocessing step run by the user.
Define in Hive a table named emps with a definition based on a database table named employees:
$ sqoop create-hive-table --connect jdbc:mysql://db.example.com/corp \
--table employees --hive-table emps
The eval tool allows users to quickly run simple SQL queries against a database; results are printed to the console. This allows users to preview their import queries to ensure they import the data they expect.
The eval tool is provided for evaluation purposes only. You can use it to verify a database connection from within Sqoop or to test simple queries. It is not supposed to be used in production workflows.
Note: use eval to try out simple SQL queries quickly. It is meant only for evaluation, such as verifying the database connection, and is not supported in production workflows.
Select ten records from the employees table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
--query "SELECT * FROM employees LIMIT 10"
Insert a row into the foo table:
$ sqoop eval --connect jdbc:mysql://db.example.com/corp \
-e "INSERT INTO foo VALUES(42, 'bar')"
List database schemas available on a MySQL server:
$ sqoop list-databases --connect jdbc:mysql://database.example.com/
information_schema
employees
This only works with HSQLDB, MySQL and Oracle. When using with Oracle, it is necessary that the user connecting to the database has DBA privileges.
The Netezza connector supports an optimized data transfer facility using the Netezza external tables feature. Each map task of the Netezza connector's import job will work on a subset of the Netezza partitions and transparently create and use an external table to transport data. Similarly, export jobs will use the external table to push data fast onto the NZ system. Direct mode does not support staging tables, upsert options, etc.
Here is an example of complete command line for import using the Netezza external table feature.
$ sqoop import \
--direct \
--connect jdbc:netezza://nzhost:5480/sqoop \
--table nztable \
--username nzuser \
--password nzpass \
--target-dir hdfsdir
Here is an example of complete command line for export with tab as the field terminator character.
$ sqoop export \
--direct \
--connect jdbc:netezza://nzhost:5480/sqoop \
--table nztable \
--username nzuser \
--password nzpass \
--export-dir hdfsdir \
--input-fields-terminated-by "\t"
Netezza direct connector supports the null-string features of Sqoop. The null string values are converted to appropriate external table options during export and import operations.