How to install Hadoop 2.10.1 (the latest Hadoop 2 release as of 2021/5/13).
$ tar xf hadoop-2.10.1.tar.gz
$ export HADOOP_HOME=<extraction location>/hadoop-2.10.1
$ export PATH=$PATH:$HADOOP_HOME/bin
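If the PATH is set correctly, the hadoop command becomes available; as a quick sanity check, the following should report version 2.10.1.

$ hadoop version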
By default, pointing the hdfs command at the Amazon S3 scheme s3a results in an error.
(The s3 and s3n schemes are already deprecated, so use s3a.)
$ hdfs dfs -ls s3a://zzz
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2425)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3213)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3245)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3296)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3264)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
    at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
    at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
    at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
    at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2329)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2423)
    ... 16 more
The library needed for S3 access is bundled inside the extracted Hadoop directory (it is just not on the classpath), so adding it to the classpath makes the command work.
$ export HADOOP_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/*"
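To avoid re-exporting these variables in every shell, one option (a sketch assuming a bash login shell; the install path below is a placeholder) is to put the same settings in ~/.bashrc.

# ~/.bashrc (example; adjust the extraction path to your environment)
export HADOOP_HOME=/opt/hadoop-2.10.1
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/*"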
Note that on an EC2 Amazon Linux instance where S3 access is granted through an IAM role (that is, without specifying an access key or secret key), Hadoop's AWS library picks up the role credentials automatically, so no keys need to be configured.
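Conversely, in an environment without an IAM role, the access key and secret key can be supplied through the Hadoop configuration. A minimal sketch for core-site.xml; the key values are placeholders.

<!-- $HADOOP_HOME/etc/hadoop/core-site.xml : placeholder credentials -->
<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>YOUR_ACCESS_KEY</value>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>YOUR_SECRET_KEY</value>
  </property>
</configuration>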
By default, the number of simultaneous connections to S3 is limited. [2021-05-14]
If more concurrent accesses than the limit are made, exceptions like the following occur (unable to execute the HTTP request, timeout waiting for a connection, and so on).
java.io.InterruptedIOException: Reopen at position 4 on s3a://〜/hoge.parquet: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:141)
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:167)
    at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:295)
    at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:378)
    at java.io.DataInputStream.readFully(DataInputStream.java:195)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:756)
    at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
    〜
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1114)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1064)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
    at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
    at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1409)
    at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:160)
    ... 16 more
Caused by: com.amazonaws.thirdparty.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
    at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
    at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
    at com.amazonaws.http.conn.$Proxy9.get(Unknown Source)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
    at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
    at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1236)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
    ... 26 more
The number of simultaneous connections can be changed in the Hadoop configuration (e.g. in core-site.xml).
<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>128</value>
    <description>Controls the maximum number of simultaneous connections to S3.</description>
  </property>
  <property>
    <name>fs.s3a.threads.max</name>
    <value>128</value>
    <description>Maximum number of concurrent active (part)uploads, which each use a thread from the threadpool.</description>
  </property>
  <property>
    <name>fs.s3a.max.total.tasks</name>
    <value>128</value>
    <description>Number of (part)uploads allowed to the queue before blocking additional uploads.</description>
  </property>
</configuration>
Reference: Hadoop-AWS module: Integration with Amazon Web Services
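To confirm which value is actually in effect after editing the configuration, hdfs getconf can print a single key (shown here for fs.s3a.connection.maximum; the same works for the other properties).

$ hdfs getconf -confKey fs.s3a.connection.maximum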
When a file is created on S3 via s3a, a temporary file is first written to local disk. [2021-05-17]
Consequently, the write fails when the local disk runs out of space.
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:326)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at org.apache.hadoop.fs.s3a.S3AOutputStream.write(S3AOutputStream.java:140)
    at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
    〜
A directory named s3a is created under the directory specified by hadoop.tmp.dir, so pointing that setting at a disk with more free space avoids the problem.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hoge/tmp</value>
</property>
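Alternatively, if only the s3a buffer location should move (leaving hadoop.tmp.dir as it is), the fs.s3a.buffer.dir property, whose default is ${hadoop.tmp.dir}/s3a, controls where these temporary files are written. A sketch, with /data/s3a-buffer as a placeholder path:

<property>
  <name>fs.s3a.buffer.dir</name>
  <!-- /data/s3a-buffer is a placeholder; point it at a disk with enough free space -->
  <value>/data/s3a-buffer</value>
</property>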