Publishing the package
The installable Laravel library lives in packages/datahelm/crawler/ and is published on Packagist as datahelm/crawler. The reference repo (this one) is the development sandbox: Laravel app, Docker, site-specific robots. The public package repo (GitHub → Packagist) is a separate directory that receives only the files needed for composer require datahelm/crawler.
Repository layout
DataHelm is split across three repositories:
| Repository | Visibility | Contents |
|---|---|---|
| datahelm/crawler | Public (Packagist) | Laravel package — composer require datahelm/crawler |
| datahelm/environment | Public | Full Docker stack: nginx, PHP, PostgreSQL, Redis, browserless, FlareSolverr, Supervisor, … |
| DataHelmCrawler | Private / demo | Development sandbox: Laravel app, site-specific robots, examples |
datahelm/crawler                     Packagist / GitHub
├── README.md                        transport table, env vars, quick start
├── docker/
│   └── compose.services.yml         optional: browserless + FlareSolverr only
├── composer.json
├── config/
└── src/

datahelm/environment                 separate Git repo
└── docker-compose.yml               full stack to run DataHelm locally

DataHelmCrawler (this repo)          private demo / dev sandbox
└── docker-compose.yml               same idea as datahelm/environment
What gets synced
scripts/sync-package.include is the rsync include manifest read by scripts/sync-package.sh. Only the following paths are copied:
| Path | Purpose |
|---|---|
| composer.json | Package metadata and PSR-4 autoload |
| README.md | Install guide, transport table, env vars |
| docker/compose.services.yml | Optional browserless + FlareSolverr only |
| config/ | Published config (crawler.php) |
| src/ | DataHelm\Crawler\ library code |
Files that stay only in the publishable repo (not overwritten by sync): .git, LICENSE, phpunit.xml, .github/, etc.
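After a sync it can be useful to confirm that every path in the table above actually reached the destination. The following is a minimal sketch, not part of sync-package.sh; the function name check_synced_paths is hypothetical, and the path list is copied from the table rather than read from the manifest.

```shell
#!/usr/bin/env bash
# Hypothetical post-sync check (not part of sync-package.sh).
# Reports any path from the "What gets synced" table that is missing in DST.
check_synced_paths() {
    local dst="$1" missing=0 path
    for path in composer.json README.md docker/compose.services.yml config src; do
        if [ ! -e "$dst/$path" ]; then
            printf 'missing: %s\n' "$path" >&2
            missing=1
        fi
    done
    # 0 when everything is present, 1 when at least one path is missing
    return "$missing"
}

# Usage (destination path is whatever you passed to the sync script):
#   check_synced_paths /path/to/DataHelm.dev && echo "package tree looks complete"
```

The check only tests for existence; it does not compare file contents with the monorepo.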
Running the sync
From the root of the monorepo:
cd /path/to/DataHelmCrawler
# Default destination: /home/murilo/Docker/DataHelm.dev
./scripts/sync-package.sh
Other options:
# Custom destination
./scripts/sync-package.sh /path/to/DataHelm.dev
# Include unit tests (optional)
./scripts/sync-package.sh --with-tests
# Environment variable for destination
PACKAGE_DST=/home/murilo/Docker/DataHelm.dev ./scripts/sync-package.sh
# Help
./scripts/sync-package.sh --help
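The options above give three ways to choose the destination. The sketch below shows one plausible precedence (positional argument, then PACKAGE_DST, then the default). That ordering is an assumption, and resolve_dst is a hypothetical helper; the real script may resolve the destination differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from sync-package.sh.
# Assumed precedence: first non-flag argument > $PACKAGE_DST > default path.
DEFAULT_DST="/home/murilo/Docker/DataHelm.dev"

resolve_dst() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --*) ;;                               # skip flags such as --with-tests
            *) printf '%s\n' "$arg"; return 0 ;;  # first non-flag argument wins
        esac
    done
    # No positional argument: fall back to the env var, then the default
    printf '%s\n' "${PACKAGE_DST:-$DEFAULT_DST}"
}

# Usage:
#   dst="$(resolve_dst "$@")"
```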
If you get Permission denied:
chmod +x scripts/sync-package.sh
./scripts/sync-package.sh
First-time setup (publishable repo)
mkdir -p /home/murilo/Docker/DataHelm.dev
cd /home/murilo/Docker/DataHelm.dev
git init
# Add once: LICENSE, .gitignore, phpunit.xml, .github/workflows/…
# (README.md and docker/compose.services.yml are synced automatically)
Then run ./scripts/sync-package.sh from the monorepo each time you prepare an update for release.
After sync — commit and release
cd /home/murilo/Docker/DataHelm.dev
git status
git add -A
git commit -m "Sync from monorepo"
git tag v1.0.0
git push && git push --tags
Packagist picks up new versions from Git tags. The sync script only copies files; it does not commit or push for you.
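Because releases are driven by tags, a small guard before git tag can catch a malformed or duplicate tag. This is a sketch under assumptions: both function names are hypothetical, and the pattern is deliberately stricter than what Packagist accepts (it also recognises tags without the v prefix and with stability suffixes such as -beta1).

```shell
#!/usr/bin/env bash
# Hypothetical pre-release checks (not part of the sync script).

# True only for plain vMAJOR.MINOR.PATCH tags such as v1.0.0.
is_release_tag() {
    printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# True when the tag does not exist yet in the current Git repository.
tag_is_free() {
    ! git rev-parse -q --verify "refs/tags/$1" >/dev/null 2>&1
}

# Usage, from the publishable repo:
#   tag="v1.0.1"
#   is_release_tag "$tag" && tag_is_free "$tag" && git tag "$tag"
```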
What package users need
composer require datahelm/crawler works without any Docker extras. The default transport is guzzle (plain HTTP). Optional infrastructure is only required for JS-heavy sites or bot protection:
| Need | Solution |
|---|---|
| Plain HTML / public APIs | Nothing extra — CRAWLER_TRANSPORT=guzzle |
| JS / SPA rendering | browser transport → browserless |
| Cloudflare challenges | flaresolverr transport → FlareSolverr |
| Hardest WAFs (Akamai, PerimeterX) | scraping_api + paid API key |
| Hands-off escalation | CRAWLER_TRANSPORT=auto |
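A deployment script can reject a mistyped transport before the application starts. The sketch below assumes that CRAWLER_TRANSPORT accepts exactly the five names shown in the table; the table confirms this only for guzzle and auto. The function is_known_transport is hypothetical and not part of the package.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check for CRAWLER_TRANSPORT.
# The accepted names are assumed from the table above.
is_known_transport() {
    case "$1" in
        guzzle|browser|flaresolverr|scraping_api|auto) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: fall back to the documented default (guzzle) when the variable is unset.
#   is_known_transport "${CRAWLER_TRANSPORT:-guzzle}" || echo "unknown transport" >&2
```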

