How to install Hadoop 2.10.1 (the latest Hadoop 2.x release as of 2021-05-13).
$ tar xf hadoop-2.10.1.tar.gz
$ export HADOOP_HOME=<extraction directory>/hadoop-2.10.1
$ export PATH=$PATH:$HADOOP_HOME/bin
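To keep these environment variables across shells, they can be appended to a startup file. A minimal sketch (the path /opt/hadoop-2.10.1 is an example extraction location, not one from the article):

```shell
# Append the Hadoop environment setup to ~/.bashrc
# (/opt/hadoop-2.10.1 is an example path; use your actual extraction directory)
cat >> ~/.bashrc <<'EOF'
export HADOOP_HOME=/opt/hadoop-2.10.1
export PATH=$PATH:$HADOOP_HOME/bin
EOF
```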
By default, trying to access the Amazon S3 scheme s3a with the hdfs command results in an error.
(The s3 and s3n schemes are already deprecated, so use s3a.)
$ hdfs dfs -ls s3a://zzz
-ls: Fatal internal error
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2425)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3213)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3245)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3296)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3264)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:475)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.hadoop.fs.shell.PathData.expandAsGlob(PathData.java:325)
at org.apache.hadoop.fs.shell.Command.expandArgument(Command.java:245)
at org.apache.hadoop.fs.shell.Command.expandArguments(Command.java:228)
at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:103)
at org.apache.hadoop.fs.shell.Command.run(Command.java:175)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:317)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:380)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2329)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2423)
... 16 more
The library needed for access is bundled inside the extracted Hadoop directory (it is simply not on the classpath), so adding it to the classpath makes the command work.
$ export HADOOP_CLASSPATH="$HADOOP_HOME/share/hadoop/tools/lib/*"
Note that on an EC2 Amazon Linux instance where S3 access is granted through an IAM role, Hadoop's AWS library picks up the role credentials automatically, so no access key or secret key needs to be specified.
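When running without an IAM role, credentials have to be supplied explicitly. The s3a connector accepts them through configuration properties (fs.s3a.access.key / fs.s3a.secret.key, per the Hadoop-AWS module documentation); a sketch, with placeholder values:

```xml
<!-- Explicit S3 credentials for environments without an IAM role.
     The values below are placeholders; avoid committing real keys to config files. -->
<property>
  <name>fs.s3a.access.key</name>
  <value>YOUR_ACCESS_KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>YOUR_SECRET_KEY</value>
</property>
```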
By default, the number of simultaneous connections to S3 is limited. [2021-05-14]
Exceeding that limit raises exceptions like the following (unable to execute the HTTP request, timeout waiting for a connection, and so on).
java.io.InterruptedIOException: Reopen at position 4 on s3a://〜/hoge.parquet: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:141)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:167)
at org.apache.hadoop.fs.s3a.S3AInputStream.lazySeek(S3AInputStream.java:295)
at org.apache.hadoop.fs.s3a.S3AInputStream.read(S3AInputStream.java:378)
at java.io.DataInputStream.readFully(DataInputStream.java:195)
at java.io.DataInputStream.readFully(DataInputStream.java:169)
at parquet.hadoop.ParquetFileReader$ConsecutiveChunkList.readAll(ParquetFileReader.java:756)
at parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:494)
...
Caused by: com.amazonaws.SdkClientException: Unable to execute HTTP request: Timeout waiting for connection from pool
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleRetryableException(AmazonHttpClient.java:1114)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1064)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1409)
at org.apache.hadoop.fs.s3a.S3AInputStream.reopen(S3AInputStream.java:160)
... 16 more
Caused by: com.amazonaws.thirdparty.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool
at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager.leaseConnection(PoolingHttpClientConnectionManager.java:286)
at com.amazonaws.thirdparty.apache.http.impl.conn.PoolingHttpClientConnectionManager$1.get(PoolingHttpClientConnectionManager.java:263)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at com.amazonaws.http.conn.$Proxy9.get(Unknown Source)
at com.amazonaws.thirdparty.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:190)
at com.amazonaws.thirdparty.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at com.amazonaws.thirdparty.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at com.amazonaws.thirdparty.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55)
at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1236)
at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
... 26 more
The maximum number of simultaneous connections can be changed in the Hadoop configuration.
<configuration>
  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>128</value>
    <description>Controls the maximum number of simultaneous connections to S3.</description>
  </property>
  <property>
    <name>fs.s3a.threads.max</name>
    <value>128</value>
    <description>Maximum number of concurrent active (part)uploads, which each use a thread from the threadpool.</description>
  </property>
  <property>
    <name>fs.s3a.max.total.tasks</name>
    <value>128</value>
    <description>Number of (part)uploads allowed to the queue before blocking additional uploads.</description>
  </property>
</configuration>
Reference: Hadoop-AWS module: Integration with Amazon Web Services
When a file is created on S3 via s3a, a temporary file is first written to local disk. [2021-05-17]
Consequently, the write fails when the local disk runs out of space.
java.io.IOException: No space left on device
at java.io.FileOutputStream.writeBytes(Native Method)
at java.io.FileOutputStream.write(FileOutputStream.java:326)
at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
at org.apache.hadoop.fs.s3a.S3AOutputStream.write(S3AOutputStream.java:140)
at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:58)
...
An s3a directory is created under the directory specified by hadoop.tmp.dir, so pointing that setting at a disk with more free space solves the problem.
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hoge/tmp</value>
</property>
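Alternatively, the Hadoop-AWS module documents a dedicated property, fs.s3a.buffer.dir, whose default is ${hadoop.tmp.dir}/s3a; setting it moves only the s3a upload buffer without relocating the rest of Hadoop's temporary files. A sketch (the path reuses the example directory above):

```xml
<!-- Relocate only the s3a buffer directory (default: ${hadoop.tmp.dir}/s3a)
     to a disk with more free space -->
<property>
  <name>fs.s3a.buffer.dir</name>
  <value>/home/hoge/tmp/s3a</value>
</property>
```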