EO-DataHub/eodhp-local-dev-stack

EODHP Local Development Stack

A Docker Compose stack for running enough of the EODH backend infrastructure locally to test data download and processing workflows without needing access to a Kubernetes cluster.

The stack currently provides:

  • RustFS as an S3-compatible object store
  • Apache Pulsar in standalone mode
  • An S3 init container that creates buckets and uploads seed data
  • An optional harvest-transformer container wired to the local S3 and Pulsar services

This stack is intended for local development and test harnesses; it is not a production deployment.

Services

Service              Purpose                                                          Host URL / port
s3                   RustFS S3-compatible object store                                S3 API: http://localhost:9000, console: http://localhost:9001
s3-init              One-shot AWS CLI container that creates and seeds buckets        No exposed ports
pulsar               Apache Pulsar standalone broker                                  Binary protocol: pulsar://localhost:6650, admin API: http://localhost:8080
harvest-transformer  Local harvest-transformer runner configured against this stack   No exposed ports

Inside Docker Compose, containers should use service names rather than localhost:

S3 endpoint:      http://s3:9000
Pulsar service:   pulsar://pulsar:6650
Pulsar admin API: http://pulsar:8080

From the host machine, use localhost:

S3 endpoint:      http://localhost:9000
Pulsar service:   pulsar://localhost:6650
Pulsar admin API: http://localhost:8080
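The host/in-network split can be captured in a small helper. This is a hypothetical sketch, not part of the stack: the IN_DOCKER flag is an assumed convention you would set yourself in your Compose service environment.

```python
import os


def stack_endpoints(in_docker=None):
    """Return the stack's S3 and Pulsar endpoints.

    Inside Docker Compose, services reach each other by service name;
    from the host, everything is published on localhost.
    """
    if in_docker is None:
        # IN_DOCKER is a hypothetical flag you would set in compose.yml.
        in_docker = os.environ.get("IN_DOCKER") == "1"
    s3_host, pulsar_host = ("s3", "pulsar") if in_docker else ("localhost", "localhost")
    return {
        "s3_endpoint": f"http://{s3_host}:9000",
        "pulsar_service_url": f"pulsar://{pulsar_host}:6650",
        "pulsar_admin_url": f"http://{pulsar_host}:8080",
    }
```

For example, `stack_endpoints(True)` yields the in-network URLs and `stack_endpoints(False)` the host URLs shown above.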

Prerequisites

  • Docker
  • Docker Compose
  • AWS CLI, if you want to inspect the local S3 service from the host

Configuration

Copy the example environment file:

cp .env.example .env

The main settings are:

S3_DATA_DIR=./data/s3
S3_LOG_DIR=./data/logs
S3_ACCESS_KEY=dev_access_key
S3_SECRET_KEY=dev_secret_key_123
S3_BUCKET=transformed
OUTPUT_ROOT=http://localhost/

If S3_DATA_DIR or S3_LOG_DIR is unset, Compose falls back to the defaults:

./data/s3
./data/logs

Create those directories before starting the stack:

mkdir -p ${S3_DATA_DIR:-./data/s3} ${S3_LOG_DIR:-./data/logs}

RustFS must be able to write to these directories. The RustFS container commonly runs as UID 10001, so on Linux you may need:

sudo chown -R 10001:10001 ${S3_DATA_DIR:-./data/s3} ${S3_LOG_DIR:-./data/logs}

RUSTFS_UNSAFE_BYPASS_DISK_CHECK

Set to true to bypass RustFS disk topology checks. Useful on systems where RustFS refuses to start due to disk layout constraints. Defaults to false.
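For example, to enable the bypass, add this line to .env:

```
RUSTFS_UNSAFE_BYPASS_DISK_CHECK=true
```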

Start the stack

docker compose up -d

Check service status:

docker compose ps

View logs:

docker compose logs -f s3
docker compose logs -f pulsar
docker compose logs -f s3-init

Stop the stack:

docker compose down

To remove Docker-managed Pulsar data as well:

docker compose down -v

RustFS / S3 usage

RustFS exposes:

  • S3 API: http://localhost:9000
  • Console: http://localhost:9001

Log in to the console using:

Username: <S3_ACCESS_KEY>
Password: <S3_SECRET_KEY>

For the default .env.example values:

Username: dev_access_key
Password: dev_secret_key_123

AWS CLI profile

Create a local AWS profile.

In ~/.aws/config:

[profile local_s3]
region = eu-west-2
output = json
s3 =
    addressing_style = path

In ~/.aws/credentials:

[local_s3]
aws_access_key_id = dev_access_key
aws_secret_access_key = dev_secret_key_123
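The addressing_style = path setting matters because RustFS serves buckets under the URL path rather than as a subdomain of the endpoint. A small illustrative helper (hypothetical, not part of this repository) shows the URL shape that path-style addressing produces:

```python
def path_style_url(endpoint, bucket, key):
    """Build a path-style S3 object URL: the bucket goes in the path.

    Virtual-hosted style (http://<bucket>.<endpoint>/<key>) does not work
    with a plain localhost endpoint, which is why the profile forces
    path-style addressing.
    """
    return f"{endpoint.rstrip('/')}/{bucket}/{key.lstrip('/')}"
```

With the defaults above, `path_style_url("http://localhost:9000", "transformed", "example.txt")` gives `http://localhost:9000/transformed/example.txt`.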

List buckets:

aws --profile local_s3 --endpoint-url http://localhost:9000 s3 ls

Upload a file:

aws --profile local_s3 --endpoint-url http://localhost:9000 s3 cp ./example.txt s3://transformed/example.txt

Download a file:

aws --profile local_s3 --endpoint-url http://localhost:9000 s3 cp s3://transformed/example.txt ./example.txt

Seeding S3 buckets

The s3-init service runs scripts/init.sh after RustFS becomes healthy.

The init script:

  1. Waits until the S3 endpoint responds
  2. Creates a minimal spdx-public-eodhp bucket used by harvest-transformer
  3. Treats each first-level directory in seed/ as an S3 bucket name
  4. Uploads files from each seed directory into the matching bucket
  5. Ignores .gitkeep files

Example:

seed/
└── transformed/
    └── example.json

becomes:

s3://transformed/example.json

Nested directories are preserved as S3 key prefixes:

seed/transformed/foo/bar/example.json

becomes:

s3://transformed/foo/bar/example.json

To create an empty bucket that is committed to Git, create a directory under seed/ and add a .gitkeep file:

seed/my-empty-bucket/.gitkeep

The bucket will be created, but .gitkeep will not be uploaded.

You can re-run the S3 init step with:

docker compose up s3-init

Pulsar usage

Pulsar runs in standalone mode and exposes:

  • Binary protocol: pulsar://localhost:6650
  • Admin REST API: http://localhost:8080

Check that Pulsar is healthy:

curl http://localhost:8080/admin/v2/clusters

Expected response:

["standalone"]
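The same check can be done programmatically by parsing the admin API response body. A minimal sketch (the fetch itself is left to whatever HTTP client you use, e.g. urllib.request against http://localhost:8080/admin/v2/clusters):

```python
import json


def pulsar_is_healthy(body):
    """Return True if a /admin/v2/clusters response lists the standalone cluster."""
    try:
        clusters = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(clusters, list) and "standalone" in clusters
```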

List topics in the default namespace:

curl http://localhost:8080/admin/v2/persistent/public/default

Create a topic manually, if needed:

docker exec -it local-pulsar \
  bin/pulsar-admin topics create persistent://public/default/transformed

Pulsar can also auto-create topics in this local standalone setup when a producer connects. In application code, use the Pulsar protocol URL, not HTTP:

import pulsar

client = pulsar.Client("pulsar://localhost:6650")
producer = client.create_producer("persistent://public/default/transformed")
producer.send(b"hello from the local stack")
client.close()

Using http://localhost:6650 is incorrect because port 6650 is the binary protocol port.

Harvest transformer

The harvest-transformer service is built from:

https://github.com/EO-DataHub/harvest-transformer.git#main

It is configured to use the local services:

PULSAR_URL=pulsar://pulsar:6650
AWS_ENDPOINT_URL_S3=http://s3:9000
S3_BUCKET=${S3_BUCKET}
S3_SPDX_BUCKET=spdx-public-eodhp
OUTPUT_ROOT=${OUTPUT_ROOT}

It starts only after:

  • s3-init has completed successfully
  • pulsar is healthy

Local application environment

local.env.example contains useful values for applications running on the host:

AWS_PROFILE=local_s3
S3_ENDPOINT=http://localhost:9000
S3_FORCE_PATH_STYLE=true
PULSAR_SERVICE_URL=pulsar://localhost:6650

Applications running inside Compose should use Docker service names instead:

S3_ENDPOINT=http://s3:9000
PULSAR_SERVICE_URL=pulsar://pulsar:6650

Common checks

List S3 buckets:

aws --profile local_s3 --endpoint-url http://localhost:9000 s3 ls

List Pulsar clusters:

curl http://localhost:8080/admin/v2/clusters

List Pulsar topics:

curl http://localhost:8080/admin/v2/persistent/public/default

Check containers:

docker compose ps

Troubleshooting

AWS CLI says InvalidAccessKeyId

This usually means the AWS CLI is talking to real AWS instead of RustFS. Always include the local endpoint:

aws --profile local_s3 --endpoint-url http://localhost:9000 s3 ls

Also check that credentials are in ~/.aws/credentials, not only in ~/.aws/config.

RustFS cannot write to the data directory

Make sure the host directories exist and are writable by the container UID:

mkdir -p ${S3_DATA_DIR:-./data/s3} ${S3_LOG_DIR:-./data/logs}
sudo chown -R 10001:10001 ${S3_DATA_DIR:-./data/s3} ${S3_LOG_DIR:-./data/logs}

Pulsar logs show TooLongFrameException

Check that clients are using:

pulsar://localhost:6650

not:

http://localhost:6650

Port 6650 is Pulsar's binary protocol port. HTTP is available on port 8080 for the admin API.

Topic transformed does not appear

Pulsar topics may not appear until a producer or consumer uses them. You can create the topic manually:

docker exec -it local-pulsar \
  bin/pulsar-admin topics create persistent://public/default/transformed

Notes

This repository is deliberately small. It aims to provide just enough local infrastructure to exercise EODH backend services that normally run against Kubernetes-managed S3-compatible storage and Pulsar.
