
FileX — Domain Structure Reconstructor

GPL-3.0 · PHP + Vanilla JS · Deploy-ready for InfinityFree

FileX is a powerful web tool that reconstructs the internal file and directory structure of any domain or subdomain by aggregating data from multiple passive and active sources — robots.txt, sitemaps, the Wayback Machine CDX API, DNS, certificate transparency logs, and a multi-threaded wordlist crawler. Results are rendered as an interactive, filterable, exportable file tree with technology fingerprinting, subdomain discovery, and security-relevant path flagging.


Table of Contents

  • Features
  • Architecture
  • How It Works
  • Directory Layout
  • API Endpoints
  • Deployment
  • CI/CD Pipelines
  • Configuration & Options
  • Exporting Results
  • Security & Ethics
  • Why PHP for the Backend?
  • Known Limitations
  • Contributing
  • License


Features

| Feature | Detail |
| --- | --- |
| robots.txt parser | Full directive parsing — Disallow, Allow, Sitemap, Crawl-delay, Host — per user-agent, with comment extraction |
| Sitemap crawler | Recursive sitemap index traversal up to 4 nesting levels, capped at 5,000 URLs |
| Wayback Machine CDX | Up to 5,000 collapsed URL snapshots with MIME stats and path deduplication |
| DNS recon | A/AAAA/CNAME/MX/NS/TXT/SOA/SRV via Google DoH, plus 30 common subdomain probes |
| Certificate transparency | crt.sh wildcard query → all known subdomains from historical TLS certificates |
| Wordlist crawl | 350+ path wordlist probed in parallel batches via PHP curl_multi, with configurable batch size |
| Link probing | Optionally probe every <a href> and <script src> discovered on the root page |
| Technology fingerprinting | CMS (WordPress, Drupal, Joomla, Shopify…), server (nginx, Apache), framework (React, Vue, Next.js, Laravel…), CDN (Cloudflare), mail (GSuite, M365) |
| HTML comment extraction | Surfaces developer comments that may leak paths or credentials |
| Interactive file tree | Collapsible, searchable, filterable tree with status badges, content types, sizes, and redirect targets |
| ⚡ Interesting filter | One click to show only security-relevant paths (open directory listings, 401/403, .env, config, backup…) |
| Export | JSON (full scan data), CSV (path/status/type/size/source), plaintext tree |
| Stat dashboard | Live chips: unique paths, live 2xx, redirects, 401/403, open directories, Wayback captures, subdomains |
| Abort support | Cancel mid-scan at any time with the STOP button |
| Source attribution | Every path tagged with its discovery source (robots, sitemap, wayback, crawl, probe, dns) |

Architecture

Browser (HTML + CSS + JS ES Modules)
    │
    │  fetch() calls — same origin, no CORS
    ▼
PHP API endpoints (public/api/*.php)
    │
    ├── curl / curl_multi  ──► Target domain
    ├── curl               ──► web.archive.org CDX API
    ├── curl               ──► dns.google/resolve (DoH)
    └── curl               ──► crt.sh

The frontend is a single-page app built with ES Modules — no build step, no bundler, no framework. The PHP layer exists entirely to proxy outbound HTTP requests that would be blocked by CORS in a browser context.


How It Works

Stage 1 — robots.txt

robots.php fetches https://{domain}/robots.txt (falling back to HTTP) and parses every directive. All Disallow and Allow paths across every user-agent block are extracted and surfaced as known paths. Sitemap URLs referenced in robots.txt seed Stage 2.
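
A condensed sketch of the per-agent parsing idea follows. It only covers User-agent, Disallow, Allow, and Sitemap; the real robots.php additionally handles Crawl-delay and Host and preserves inline comments, so treat the function below as an illustration rather than the shipped parser.

<?php
// Simplified robots.txt directive parsing (illustrative only).
function parseRobots(string $raw): array {
    $agents   = ['*' => ['disallow' => [], 'allow' => []]];
    $sitemaps = [];
    $current  = '*';
    foreach (preg_split('/\r\n|\r|\n/', $raw) as $line) {
        $line = trim(preg_replace('/#.*$/', '', $line));   // drop comments
        if ($line === '' || !str_contains($line, ':')) continue;
        [$field, $value] = array_map('trim', explode(':', $line, 2));
        switch (strtolower($field)) {
            case 'user-agent':
                $current = $value;
                $agents[$current] ??= ['disallow' => [], 'allow' => []];
                break;
            case 'disallow':
                if ($value !== '') $agents[$current]['disallow'][] = $value;
                break;
            case 'allow':
                if ($value !== '') $agents[$current]['allow'][] = $value;
                break;
            case 'sitemap':
                $sitemaps[] = $value;   // seeds Stage 2
                break;
        }
    }
    return ['agents' => $agents, 'sitemaps' => $sitemaps];
}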

Stage 2 — Sitemap Discovery

sitemap.php tries 9 common sitemap locations (/sitemap.xml, /wp-sitemap.xml, etc.) and recursively follows <sitemapindex> references up to 4 levels deep, collecting <loc> entries up to a 5,000-URL cap.
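
The recursion is easy to picture with a short sketch. The real sitemap.php fetches over cURL and also probes the 9 candidate locations first; the helper name, the use of file_get_contents, and the regex-based <loc> extraction below are assumptions made for brevity.

<?php
// Illustrative recursion over sitemap indexes, bounded at 4 levels / 5,000 URLs.
function collectSitemapUrls(string $url, int $depth = 0, array &$urls = []): array {
    if ($depth > 4 || count($urls) >= 5000) return $urls;
    $body = @file_get_contents($url);            // assumes allow_url_fopen
    if ($body === false) return $urls;
    preg_match_all('~<loc>\s*(.*?)\s*</loc>~is', $body, $m);
    if (stripos($body, '<sitemapindex') !== false) {
        // <loc> entries point at child sitemaps, so recurse into each one
        foreach ($m[1] as $child) collectSitemapUrls(trim($child), $depth + 1, $urls);
    } else {
        // <urlset> leaf: the <loc> entries are page URLs
        foreach ($m[1] as $loc) {
            if (count($urls) >= 5000) break;
            $urls[] = trim($loc);
        }
    }
    return $urls;
}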

Stage 3 — Wayback Machine CDX

wayback.php queries the Internet Archive's CDX Search API with collapse=urlkey to get one representative snapshot per unique URL. It extracts paths, MIME types, status codes, and timestamps — surfacing URLs that existed historically but may have been deleted.
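
The query shape looks roughly like the sketch below. The endpoint and the url/output/collapse/limit/fl parameters are the public CDX API's; exactly which fields the real wayback.php requests and how it post-processes them is illustrative here.

<?php
// Collapsed CDX query plus path extraction (illustrative).
$domain = 'example.com';
$query  = http_build_query([
    'url'      => $domain . '/*',        // every capture under the domain
    'output'   => 'json',
    'collapse' => 'urlkey',              // one representative snapshot per URL
    'limit'    => 5000,
    'fl'       => 'original,mimetype,statuscode,timestamp,length',
]);
$raw  = @file_get_contents('https://web.archive.org/cdx/search/cdx?' . $query);
$rows = $raw === false ? [] : (json_decode($raw, true) ?: []);

$paths = [];
if ($rows) {
    $header = array_shift($rows);        // with output=json the first row holds field names
    foreach ($rows as $row) {
        $snapshot = array_combine($header, $row);
        $paths[]  = parse_url($snapshot['original'], PHP_URL_PATH) ?: '/';
    }
}
$paths = array_values(array_unique($paths));   // path deduplication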

Stage 4 — DNS + Subdomain Enumeration

dns.php queries Google's DoH API (dns.google/resolve) for A, AAAA, CNAME, MX, NS, TXT, SOA, and SRV records. It then probes 30 common subdomain prefixes (www, mail, api, dev, staging, admin, etc.) using PHP's dns_get_record(). Technology inference from nameservers, MX hosts, and TXT records surfaces email providers, DNS providers, and domain verification tokens.
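
A minimal version of both lookups is sketched below. The dns.google/resolve endpoint and its name/type parameters are Google's real DoH JSON API; the helper name dohQuery and the surrounding error handling are illustrative.

<?php
// One DoH lookup per record type (illustrative helper).
function dohQuery(string $domain, string $type): array {
    $url = 'https://dns.google/resolve?' . http_build_query([
        'name' => $domain,
        'type' => $type,                 // A, AAAA, CNAME, MX, NS, TXT, SOA, SRV
    ]);
    $raw  = @file_get_contents($url);    // the real endpoint uses cURL
    $json = $raw === false ? [] : (json_decode($raw, true) ?: []);
    // "Answer" is simply absent when no record of this type exists
    return array_column($json['Answer'] ?? [], 'data');
}

// Existence check of the kind used for the 30 common subdomain prefixes:
$hasWww = !empty(@dns_get_record('www.example.com', DNS_A));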

Stage 5 — Certificate Transparency

certs.php queries crt.sh with a wildcard search for the base domain (%.example.com). This reveals all subdomains that have ever had a TLS certificate issued — including internal, staging, and legacy ones that DNS no longer resolves.
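
The lookup itself is a single JSON request, sketched here; the %.domain wildcard with output=json is crt.sh's public interface, while the filtering and de-duplication logic shown is an assumption.

<?php
// crt.sh wildcard query reduced to a unique subdomain list (illustrative).
$base  = 'example.com';
$url   = 'https://crt.sh/?' . http_build_query(['q' => '%.' . $base, 'output' => 'json']);
$raw   = @file_get_contents($url);       // the real endpoint uses cURL with a timeout
$certs = $raw === false ? [] : (json_decode($raw, true) ?: []);

$subdomains = [];
foreach ($certs as $cert) {
    // name_value may contain several SAN entries separated by newlines
    foreach (explode("\n", $cert['name_value'] ?? '') as $name) {
        $name = strtolower(trim($name));
        if ($name !== '' && !str_contains($name, '*') && str_ends_with($name, $base)) {
            $subdomains[$name] = true;    // keyed array as a cheap de-dup set
        }
    }
}
$subdomains = array_keys($subdomains);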

Stage 6 — Wordlist Crawl

crawl.php contains a built-in wordlist of 350+ paths covering every common web application pattern — CMS paths, API endpoints, configuration files, backup archives, admin panels, framework artifacts, development leftovers, log files, and security-sensitive filenames. Paths are probed in parallel using PHP's curl_multi_* API. The frontend paginates through the wordlist in configurable batches (25/50/100), streaming results in real time. Only paths with a non-zero, non-404 status are highlighted as live.
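
The parallel-probe pattern is the core of this stage, so here is a minimal sketch of one batch. The cURL option values, the User-Agent string, and the result shape are illustrative, not copied from crawl.php; $paths would be one 25/50/100-path slice of the wordlist.

<?php
// Parallel path probing in the spirit of crawl.php's curl_multi batches.
function probeBatch(string $base, array $paths): array {
    $mh = curl_multi_init();
    $handles = [];
    foreach ($paths as $path) {
        $ch = curl_init($base . $path);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => false,   // report redirects instead of following them
            CURLOPT_TIMEOUT        => 8,
            CURLOPT_USERAGENT      => 'FileX-Scanner',   // illustrative UA string
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$path] = $ch;
    }
    do {                                       // drive every transfer to completion
        curl_multi_exec($mh, $running);
        if ($running) curl_multi_select($mh);
    } while ($running > 0);

    $results = [];
    foreach ($handles as $path => $ch) {
        $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        if ($status !== 0 && $status !== 404) {            // "live" = non-zero, non-404
            $results[] = [
                'path'     => $path,
                'status'   => $status,
                'redirect' => curl_getinfo($ch, CURLINFO_REDIRECT_URL) ?: null,
            ];
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}

For example, probeBatch('https://example.com', ['/admin/', '/.env', '/backup.zip']) would return only the entries that answered with something other than a 404.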

Stage 7 — Link Probing (optional)

When enabled, probe.php is called on the root URL to extract all <a href> and <script src> values. Each discovered URL is then individually probed to retrieve its status, content type, size, and response headers.
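
A DOMDocument-based sketch of that extraction is shown below; whether probe.php uses DOM parsing or regex is not specified here, and the relative-URL handling in this helper is deliberately naive.

<?php
// Root-page <a href> / <script src> extraction (illustrative).
function extractLinks(string $html, string $origin): array {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                    // silence warnings from real-world HTML
    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        if ($a->hasAttribute('href')) $urls[] = $a->getAttribute('href');
    }
    foreach ($doc->getElementsByTagName('script') as $s) {
        if ($s->hasAttribute('src')) $urls[] = $s->getAttribute('src');
    }
    // Turn root-relative references into absolute URLs, drop duplicates
    $absolute = array_map(
        fn (string $u) => preg_match('~^https?://~i', $u) ? $u : rtrim($origin, '/') . '/' . ltrim($u, '/'),
        $urls
    );
    return array_values(array_unique($absolute));
}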

Stage 8 — Tree Building

The frontend deduplicates all collected path entries across all sources, builds a nested directory tree structure, and renders it as an interactive DOM tree. Each node carries its probe result (if available), source attribution badge, and semantic badges (DIR LIST, ⚡ interesting, redirect target).
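
The actual implementation lives in the FileTree class in public/assets/js/tree.js (JavaScript); the PHP fragment below only sketches the underlying idea of folding deduplicated paths into a nested structure.

<?php
// Folding paths into a nested directory tree (conceptual sketch).
function buildTree(array $paths): array {
    $tree = [];
    foreach (array_unique($paths) as $path) {
        $node = &$tree;
        foreach (array_filter(explode('/', trim($path, '/')), 'strlen') as $segment) {
            $node[$segment] ??= [];           // create the directory node once
            $node = &$node[$segment];         // descend into it
        }
        unset($node);                          // break the reference before the next path
    }
    return $tree;
}

// buildTree(['/wp-admin/', '/wp-admin/admin-ajax.php', '/sitemap.xml'])
// → ['wp-admin' => ['admin-ajax.php' => []], 'sitemap.xml' => []]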


Directory Layout

filex/
├── .github/
│   └── workflows/
│       ├── lint.yml          # PHP syntax, ESLint, HTML validation
│       └── deploy.yml        # FTP deploy to InfinityFree on push to main
├── public/                   # Entire contents deployed to /htdocs/
│   ├── .htaccess             # Security headers, compression, cache
│   ├── index.html            # Single-page app shell
│   ├── api/
│   │   ├── .htaccess         # API-specific headers, PHP config
│   │   ├── probe.php         # General URL probe + fingerprinting
│   │   ├── robots.php        # robots.txt fetch + parser
│   │   ├── sitemap.php       # Recursive sitemap crawler
│   │   ├── wayback.php       # Wayback Machine CDX proxy
│   │   ├── dns.php           # DNS recon via Google DoH
│   │   ├── certs.php         # Certificate transparency via crt.sh
│   │   └── crawl.php         # Parallel wordlist path prober
│   └── assets/
│       ├── css/
│       │   └── filex.css     # Full stylesheet (CSS variables, dark theme)
│       └── js/
│           ├── filex.js      # Main orchestrator (ES module)
│           └── tree.js       # FileTree class — build, render, export
├── docs/
│   └── api.md                # API endpoint reference
├── .gitignore
├── .gitattributes
├── LICENSE                   # GPL-3.0
└── README.md

API Endpoints

All endpoints accept GET requests and return application/json. They are designed for same-origin use only (the .htaccess sets broad CORS headers for development convenience, which you may want to tighten for production).

probe.php

Query params: url (full URL including scheme)

Fetches the target URL and returns a rich metadata object.

{
  "url": "https://example.com/",
  "finalUrl": "https://example.com/",
  "status": 200,
  "elapsed": 312,
  "contentType": "text/html; charset=UTF-8",
  "contentLength": 14832,
  "server": "nginx/1.24.0",
  "poweredBy": "PHP/8.2.0",
  "redirectCount": 0,
  "headers": { "…": "" },
  "hasDirectoryList": false,
  "cms": ["WordPress"],
  "tech": ["nginx 1.24.0", "PHP 8.2.0", "jQuery 3.6.4"],
  "forms": ["/wp-login.php"],
  "links": ["https://example.com/about/", ""],
  "scripts": ["https://example.com/wp-includes/js/jquery/jquery.min.js"],
  "metaTags": { "generator": "WordPress 6.4" },
  "comments": ["Build: 2024-01-15"],
  "bodySnippet": "<!DOCTYPE html>…"
}

robots.php

Query params: domain

{
  "url": "https://example.com/robots.txt",
  "status": 200,
  "raw": "User-agent: *\nDisallow: /admin/\n",
  "parsed": {
    "agents": {
      "*": {
        "disallow": [{ "path": "/admin/", "comment": "" }],
        "allow": []
      }
    },
    "sitemaps": ["https://example.com/sitemap.xml"],
    "allPaths": ["/admin/", "/private/", ""]
  }
}

sitemap.php

Query params: domain

{
  "domain": "example.com",
  "total": 142,
  "sources": [
    { "url": "https://example.com/sitemap.xml", "status": 200 }
  ],
  "urls": ["https://example.com/about/", ""]
}

wayback.php

Query params: domain, limit (max 5000), from (timestamp), to (timestamp)

{
  "domain": "example.com",
  "total": 2841,
  "paths": ["/", "/old-page/", "/deleted-admin/", ""],
  "mimeStats": { "text/html": 2100, "application/javascript": 400, "…": "" },
  "urls": [
    { "url": "https://example.com/old-page/", "path": "/old-page/", "status": "200", "mime": "text/html", "ts": "20190401120000", "size": 8192 }
  ]
}

dns.php

Query params: domain

{
  "domain": "example.com",
  "ipv4": ["93.184.216.34"],
  "ipv6": ["2606:2800:220:1:248:1893:25c8:1946"],
  "nameservers": ["a.iana-servers.net.", "b.iana-servers.net."],
  "mx": [{ "priority": 10, "host": "mail.example.com." }],
  "txt": ["v=spf1 -all"],
  "cname": null,
  "tech": ["SPF", "DMARC"],
  "subdomains": ["www.example.com", "mail.example.com"]
}

certs.php

Query params: domain

{
  "domain": "example.com",
  "baseDomain": "example.com",
  "total": 7,
  "subdomains": ["api.example.com", "dev.example.com", "staging.example.com", ""],
  "certCount": 34
}

crawl.php

Query params: domain, scheme (http/https), batch, offset, paths (newline-separated custom paths)

{
  "domain": "example.com",
  "total": 350,
  "offset": 0,
  "batch": 50,
  "results": [
    {
      "path": "/.env",
      "status": 403,
      "contentType": "text/html",
      "contentLength": 0,
      "server": "nginx",
      "redirect": null,
      "hasDirectoryList": false,
      "interesting": true
    }
  ]
}

Deployment

Prerequisites

  • A free InfinityFree account
  • A GitHub repository (fork or clone of this one)
  • PHP 8.0+ with curl and libxml enabled (InfinityFree provides both)

InfinityFree Setup

  1. Create a hosting account and note your FTP hostname, FTP username, and FTP password from the control panel.
  2. The deploy target is /htdocs/ — the public/ directory contents go directly here.
  3. Ensure no existing index.html conflicts.
  4. InfinityFree's PHP has curl and allow_url_fopen enabled by default.

GitHub Secrets Required

Navigate to your repo → Settings → Secrets and variables → Actions → New repository secret and add:

| Secret name | Value |
| --- | --- |
| FTP_HOST | e.g. ftpupload.net |
| FTP_USERNAME | e.g. epiz_12345678 |
| FTP_PASSWORD | Your FTP password |

Manual FTP Deploy

If you prefer a one-time manual deploy with lftp:

lftp -e "mirror -R --delete ./public/ /htdocs/; bye" \
     -u "$FTP_USER,$FTP_PASS" ftp://$FTP_HOST

Or use FileZilla to drag public/ contents into /htdocs/.


CI/CD Pipelines

lint.yml

Triggers on every push to main/develop and every pull request to main.

| Job | What it does |
| --- | --- |
| lint-php | php -l syntax check on every .php file, plus a PHP-CS-Fixer dry run against PSR-12 |
| lint-js | ESLint 9 with ES2022 module rules across public/assets/js/ |
| lint-html | html-validate on index.html |

deploy.yml

Triggers on push to main and manual workflow_dispatch.

  1. Runs a pre-deploy PHP syntax check as a gate
  2. Uses SamKirkland/FTP-Deploy-Action to mirror public/ to /htdocs/
  3. Excludes .git*, node_modules, *.bak, *.log
  4. Reports success or failure in the workflow log

Configuration & Options

All options are toggled in the UI before scanning. They apply to the current scan only and do not persist across page reloads.

| Option | Default | Effect |
| --- | --- | --- |
| robots.txt | | Fetch and parse /robots.txt |
| Sitemap | | Recursively crawl all sitemaps |
| Wayback Machine | | Query the CDX API for historical paths |
| DNS + Subdomains | | Full DNS recon plus 30 subdomain probes |
| Cert Transparency | | Query crt.sh for subdomain history |
| Wordlist Crawl | | Probe 350+ paths in parallel batches |
| Probe Discovered Links | | Individually probe <a> and <script> URLs from the root page |
| Scheme | HTTPS | Use HTTP or HTTPS for crawl requests |
| Batch size | 50 | Paths per crawl API call (25/50/100) |

Exporting Results

| Format | Contents |
| --- | --- |
| JSON | Full scan data object — all entries, probe results, robots, sitemap, wayback, dns, certs, source counts |
| Tree (TXT) | ASCII directory tree in the style of tree(1) output, with [status] and (source) annotations |
| CSV | One row per unique path — columns: path, status, contentType, size, source, interesting, redirect |

Security & Ethics

FileX is a passive and semi-passive reconnaissance tool. It:

  • Does not send exploit payloads of any kind
  • Does not brute-force authentication
  • Does not attempt to read file contents beyond HTTP response metadata
  • Uses only publicly available data sources (robots.txt, sitemaps, Wayback Machine, DNS, crt.sh)
  • The wordlist crawl sends only standard HTTP GET requests — the same requests a browser or search-engine crawler would make

You are solely responsible for ensuring you have authorization to scan any domain you target. Scanning domains you do not own or do not have explicit written permission to test may violate computer fraud laws in your jurisdiction. The authors and contributors of FileX accept no liability for misuse.

The tool deliberately omits:

  • Authenticated brute-force (HTTP 401 credential testing)
  • Vulnerability scanning or exploit probing
  • Automated rate-exhaustion or DoS-capable patterns

Why PHP for the Backend?

The frontend makes HTTP requests to arbitrary external domains, which browsers block via CORS unless the target server explicitly allows it — which most do not. A thin PHP proxy running on the same origin as the frontend sidesteps this entirely. InfinityFree provides free PHP hosting with cURL, making it a zero-cost deployment target. The PHP layer is intentionally minimal: it validates input, proxies requests with a custom UA, and returns JSON.
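
The general shape of that proxy layer is sketched below: validate input, fetch with cURL under a custom User-Agent, return JSON. The validation rules, cURL options, User-Agent string, and response fields here are illustrative, not the actual probe.php.

<?php
// Minimal same-origin proxy endpoint (illustrative sketch).
header('Content-Type: application/json');

$url = filter_input(INPUT_GET, 'url', FILTER_VALIDATE_URL);
if (!$url || !preg_match('~^https?://~i', $url)) {
    http_response_code(400);
    echo json_encode(['error' => 'invalid url']);
    exit;
}

$ch = curl_init($url);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_MAXREDIRS      => 5,
    CURLOPT_TIMEOUT        => 10,
    CURLOPT_USERAGENT      => 'FileX-Scanner',   // illustrative UA string
]);
$body = curl_exec($ch);

echo json_encode([
    'url'           => $url,
    'finalUrl'      => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
    'status'        => curl_getinfo($ch, CURLINFO_RESPONSE_CODE),
    'contentType'   => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
    'contentLength' => $body === false ? 0 : strlen($body),
]);
curl_close($ch);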


Known Limitations

  • InfinityFree max_execution_time: InfinityFree caps PHP execution at approximately 30–60 seconds. Large Wayback CDX responses or slow crawl batches may hit this limit. Reduce batch size or limit source counts if you observe timeouts.
  • InfinityFree cURL restrictions: Some shared hosts block outbound cURL to certain ranges. If Wayback or crt.sh calls fail, this may be the cause.
  • No JavaScript execution: probe.php fetches raw HTML only — it cannot render JavaScript-heavy SPAs. Paths loaded via client-side routing will not be discovered unless they appear in other sources.
  • Wayback CDX rate limiting: The Internet Archive imposes soft rate limits on CDX queries. Repeated rapid scans of the same domain may return throttled or empty responses.
  • crt.sh availability: crt.sh is a free community service and occasionally experiences downtime.
  • robots.txt scope: robots.txt only discloses paths the site operator chose to list. It is not a complete inventory.

Contributing

Contributions are welcome. Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/your-feature)
  3. Ensure lint.yml passes (php -l, ESLint, html-validate)
  4. Submit a pull request against main with a clear description

Ideas for contribution:

  • Additional CMS/tech fingerprints in probe.php
  • Expanded wordlist entries in crawl.php
  • JavaScript <link rel="preload"> / <link rel="stylesheet"> path extraction
  • .well-known/ endpoint enumeration
  • Wappalyzer-style comprehensive tech detection
  • Dark/light theme toggle
  • Scan history stored in localStorage

License

FileX is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation.

See LICENSE for the full license text, or visit https://www.gnu.org/licenses/gpl-3.0.en.html.

Copyright (C) FileX Contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see LICENSE for details.
