I'll try out Spark and Iceberg on Ubuntu 22.04.3 LTS running on WSL2.
First, install Docker and Docker Compose. Following the steps on the site below is enough to get them working.
Install Docker Engine on Ubuntu | Docker Docs
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl gnupg
sudo install -m 0755 -d /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
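The apt source line above is built from two command substitutions, so it can help to preview what it expands to before writing it to /etc/apt/sources.list.d/docker.list. The following is a minimal sketch; `repo_line` is a hypothetical helper name of my own, and the `unknown` fallbacks are an assumption added so the script also runs on machines without dpkg or /etc/os-release. On Ubuntu 22.04 the codename should come out as jammy.

```shell
#!/bin/sh
# repo_line (hypothetical helper): prints the apt source line without writing any file.
repo_line() {
  # CPU architecture as dpkg reports it; fall back to "unknown" if dpkg is absent.
  arch=$(dpkg --print-architecture 2>/dev/null || echo "unknown")

  # Release codename read from /etc/os-release, in a subshell so nothing leaks out.
  if [ -r /etc/os-release ]; then
    codename=$(. /etc/os-release && echo "$VERSION_CODENAME")
  else
    codename=""
  fi
  # Assumption: use "unknown" when the file does not define VERSION_CODENAME.
  : "${codename:=unknown}"

  echo "deb [arch=${arch} signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu ${codename} stable"
}

repo_line
```

If either value prints as `unknown`, the repository line would not point at a valid Docker package index, so it is worth checking before running the `tee` step.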
Next, prepare docker-compose.yml as described on the site below to set up an environment for trying Spark and Iceberg.
docker-compose.yml
version: "3"

services:
  spark-iceberg:
    image: tabulario/spark-iceberg
    container_name: spark-iceberg
    build: spark/
    networks:
      iceberg_net:
    depends_on:
      - rest
      - minio
    volumes:
      - ./warehouse:/home/iceberg/warehouse
      - ./notebooks:/home/iceberg/notebooks/notebooks
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    ports:
      - 8888:8888
      - 8080:8080
      - 10000:10000
      - 10001:10001
  rest:
    image: tabulario/iceberg-rest
    container_name: iceberg-rest
    networks:
      iceberg_net:
    ports:
      - 8181:8181
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
      - CATALOG_WAREHOUSE=s3://warehouse/
      - CATALOG_IO__IMPL=org.apache.iceberg.aws.s3.S3FileIO
      - CATALOG_S3_ENDPOINT=http://minio:9000
  minio:
    image: minio/minio
    container_name: minio
    environment:
      - MINIO_ROOT_USER=admin
      - MINIO_ROOT_PASSWORD=password
      - MINIO_DOMAIN=minio
    networks:
      iceberg_net:
        aliases:
          - warehouse.minio
    ports:
      - 9001:9001
      - 9000:9000
    command: ["server", "/data", "--console-address", ":9001"]
  mc:
    depends_on:
      - minio
    image: minio/mc
    container_name: mc
    networks:
      iceberg_net:
    environment:
      - AWS_ACCESS_KEY_ID=admin
      - AWS_SECRET_ACCESS_KEY=password
      - AWS_REGION=us-east-1
    entrypoint: >
      /bin/sh -c "
      until (/usr/bin/mc config host add minio http://minio:9000 admin password) do echo '...waiting...' && sleep 1; done;
      /usr/bin/mc rm -r --force minio/warehouse;
      /usr/bin/mc mb minio/warehouse;
      /usr/bin/mc policy set public minio/warehouse;
      tail -f /dev/null
      "
networks:
  iceberg_net:
The site says to start the environment with docker-compose up, but the Docker Compose installed earlier is Version 2, so start it with the following command instead.
> docker compose up
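Since Compose v2 runs as a `docker` subcommand while v1 was a separate `docker-compose` binary, a script that has to work on both can probe for whichever is present. This is only a sketch of that idea; `pick_compose` is a hypothetical helper name, not something Docker provides.

```shell
#!/bin/sh
# pick_compose (hypothetical helper): prints the Compose command available on this machine.
pick_compose() {
  if docker compose version >/dev/null 2>&1; then
    echo "docker compose"   # Compose v2, installed as a docker CLI plugin
  elif command -v docker-compose >/dev/null 2>&1; then
    echo "docker-compose"   # standalone Compose v1 binary
  else
    echo "none"             # neither variant was found
  fi
}

COMPOSE=$(pick_compose)
echo "Compose command: $COMPOSE"
# With a usable result, the stack could then be started with:
#   $COMPOSE up
```

On the setup in this post it should print `docker compose`, matching the command used above.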
Once spark-iceberg is up, use Spark SQL to create a table and try INSERT and SELECT.
Open another console, connect to spark-iceberg, and launch spark-sql.
> docker exec -it spark-iceberg spark-sql

Output:
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/01/01 09:39:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/01/01 09:39:43 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
Spark Web UI available at http://2157a499f607:4041
Spark master: local[*], Application Id: local-1704101983121
spark-sql ()>
Next, run CREATE TABLE, INSERT, and SELECT.

CREATE TABLE output:
spark-sql ()> CREATE TABLE demo.nyc.taxis
            > (
            >   vendor_id bigint,
            >   trip_id bigint,
            >   trip_distance float,
            >   fare_amount double,
            >   store_and_fwd_flag string
            > )
            > PARTITIONED BY (vendor_id);
Time taken: 2.788 seconds

INSERT output:
spark-sql ()> INSERT INTO demo.nyc.taxis
            > VALUES (1, 1000371, 1.8, 15.32, 'N'), (2, 1000372, 2.5, 22.15, 'N'), (2, 1000373, 0.9, 9.01, 'N'), (1, 1000374, 8.4, 42.13, 'Y');
Time taken: 4.687 seconds

SELECT output:
spark-sql ()> SELECT * FROM demo.nyc.taxis;
1       1000371 1.8     15.32   N
1       1000374 8.4     42.13   Y
2       1000372 2.5     22.15   N
2       1000373 0.9     9.01    N
Time taken: 0.896 seconds, Fetched 4 row(s)
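The same SELECT can also be run without opening the interactive prompt, which is handy for quick checks from a script. The sketch below assumes the image's spark-sql accepts the standard `-e` option of the Spark SQL CLI; I have not verified that against this particular image. `run_sql` is a hypothetical helper name, and the container name is the one from docker-compose.yml.

```shell
#!/bin/sh
# run_sql (hypothetical helper): runs one SQL statement inside the spark-iceberg container.
run_sql() {
  # No -it here, so the command also works when no terminal is attached.
  docker exec spark-iceberg spark-sql -e "$1"
}

# Only send the query when the container is actually running.
if docker ps --format '{{.Names}}' 2>/dev/null | grep -qx 'spark-iceberg'; then
  run_sql "SELECT * FROM demo.nyc.taxis"
else
  echo "spark-iceberg is not running; start it with: docker compose up"
fi
```

Each call starts a fresh spark-sql session, so it is slower per query than staying in the interactive prompt.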