GPL-3.0 · PHP + Vanilla JS · Deploy-ready for InfinityFree
FileX is a powerful web tool that reconstructs the internal file and directory structure of any domain or subdomain by aggregating data from multiple passive and active sources — robots.txt, sitemaps, the Wayback Machine CDX API, DNS, certificate transparency logs, and a parallel wordlist crawler. Results are rendered as an interactive, filterable, exportable file tree with technology fingerprinting, subdomain discovery, and security-relevant path flagging.
- Features
- Architecture
- How It Works
- Directory Layout
- API Endpoints
- Deployment
- CI/CD Pipelines
- Configuration & Options
- Exporting Results
- Security & Ethics
- Why PHP for the Backend?
- Known Limitations
- Contributing
- License
| Feature | Detail |
|---|---|
| robots.txt parser | Full directive parsing — Disallow, Allow, Sitemap, Crawl-delay, Host — per user-agent with comment extraction |
| Sitemap crawler | Recursive sitemap index traversal across unlimited nesting levels, up to 5,000 URLs |
| Wayback Machine CDX | Up to 5,000 collapsed URL snapshots with MIME stats and path deduplication |
| DNS recon | A/AAAA/CNAME/MX/NS/TXT/SOA/SRV via Google DoH, with 30 common subdomain probes |
| Certificate transparency | crt.sh wildcard query → all known subdomains from historical TLS certificates |
| Wordlist crawl | 350+ path wordlist probed in parallel batches via PHP curl_multi, configurable batch size |
| Link probing | Optionally probe every <a href> and <script src> discovered on the root page |
| Technology fingerprinting | CMS (WordPress, Drupal, Joomla, Shopify…), server (nginx, Apache), framework (React, Vue, Next.js, Laravel…), CDN (Cloudflare), mail (GSuite, M365) |
| HTML comment extraction | Surfaces developer comments that may leak paths or credentials |
| Interactive file tree | Collapsible, searchable, filterable tree with status badges, content types, sizes, redirect targets |
| ⚡ Interesting filter | One-click to show only security-relevant paths (open dirs, 401/403, .env, config, backup…) |
| Export | JSON (full scan data), CSV (path/status/type/size/source), plaintext tree |
| Stat dashboard | Live chips: unique paths, live 2xx, redirects, 401/403, open directories, Wayback captures, subdomains |
| Abort support | Cancel mid-scan at any time with the STOP button |
| Source attribution | Every path tagged with its discovery source (robots, sitemap, wayback, crawl, probe, dns) |
Browser (HTML + CSS + JS ES Modules)
│
│ fetch() calls — same origin, no CORS
▼
PHP API endpoints (public/api/*.php)
│
├── curl / curl_multi ──► Target domain
├── curl ──► web.archive.org CDX API
├── curl ──► dns.google/resolve (DoH)
└── curl ──► crt.sh
The frontend is a single-page app built with ES Modules — no build step, no bundler, no framework. The PHP layer exists entirely to proxy outbound HTTP requests that would be blocked by CORS in a browser context.
robots.php fetches https://{domain}/robots.txt (falling back to HTTP) and parses every directive. All Disallow and Allow paths across every user-agent block are extracted and surfaced as known paths. Sitemap URLs referenced in robots.txt seed Stage 2.
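For illustration, here is a minimal sketch of the kind of directive parsing described above. The function and field names are illustrative, not the actual robots.php implementation.

```php
<?php
// Illustrative robots.txt parser sketch; robots.php's real implementation differs.
function parseRobots(string $raw): array
{
    $agents = [];
    $sitemaps = [];
    $current = [];        // user-agent names the active block applies to
    $inBlock = false;     // true once a path directive for $current has been seen

    foreach (preg_split('/\R/', $raw) as $line) {
        $comment = '';
        if (($pos = strpos($line, '#')) !== false) {
            $comment = trim(substr($line, $pos + 1));   // keep trailing comments for surfacing
            $line = substr($line, 0, $pos);
        }
        if (!preg_match('/^\s*([A-Za-z-]+)\s*:\s*(.*)$/', $line, $m)) {
            continue;
        }
        $key = strtolower($m[1]);
        $value = trim($m[2]);

        if ($key === 'user-agent') {
            if ($inBlock) { $current = []; $inBlock = false; }  // a new agent block starts
            $current[] = $value;
            $agents[$value] ??= ['disallow' => [], 'allow' => []];
        } elseif ($key === 'sitemap') {
            $sitemaps[] = $value;                               // seeds the sitemap stage
        } elseif ($key === 'disallow' || $key === 'allow') {
            $inBlock = true;
            foreach ($current as $ua) {
                $agents[$ua][$key][] = ['path' => $value, 'comment' => $comment];
            }
        }
    }
    return ['agents' => $agents, 'sitemaps' => $sitemaps];
}
```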
sitemap.php tries 9 common sitemap locations (/sitemap.xml, /wp-sitemap.xml, etc.) and recursively follows <sitemapindex> references up to 4 levels deep, collecting <loc> entries up to a 5,000-URL cap.
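As a rough sketch, the recursion can be expressed like this. The function name and fetch mechanism are illustrative assumptions (shown with file_get_contents for brevity; the endpoint itself goes through cURL), while the depth and URL caps come from the description above.

```php
<?php
// Simplified sketch of recursive sitemap traversal (illustrative, not sitemap.php itself).
function collectSitemapUrls(string $sitemapUrl, int $depth = 0, array &$urls = []): array
{
    if ($depth > 4 || count($urls) >= 5000) {       // depth and URL caps noted above
        return $urls;
    }
    $xml = @simplexml_load_string((string) @file_get_contents($sitemapUrl));
    if ($xml === false) {
        return $urls;
    }
    if ($xml->getName() === 'sitemapindex') {
        foreach ($xml->sitemap as $child) {          // follow nested sitemap indexes
            collectSitemapUrls((string) $child->loc, $depth + 1, $urls);
        }
    } else {                                         // a plain <urlset>
        foreach ($xml->url as $entry) {
            if (count($urls) >= 5000) break;
            $urls[] = (string) $entry->loc;
        }
    }
    return $urls;
}
```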
wayback.php queries the Internet Archive's CDX Search API with collapse=urlkey to get one representative snapshot per unique URL. It extracts paths, MIME types, status codes, and timestamps — surfacing URLs that existed historically but may have been deleted.
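The CDX call itself is a single GET request. A hedged sketch of what such a query can look like follows; the field list and parsing are illustrative, and the request is shown with file_get_contents for brevity even though the endpoint itself uses cURL.

```php
<?php
// Illustrative CDX query: collapse=urlkey keeps one snapshot per unique URL key.
$domain = 'example.com';
$cdxUrl = 'https://web.archive.org/cdx/search/cdx?' . http_build_query([
    'url'      => $domain . '/*',
    'output'   => 'json',
    'collapse' => 'urlkey',
    'limit'    => 5000,
    'fl'       => 'original,mimetype,statuscode,timestamp,length',
]);
$rows = json_decode((string) @file_get_contents($cdxUrl), true) ?: [];
array_shift($rows);                                   // first row lists the field names
foreach ($rows as [$original, $mime, $status, $ts, $size]) {
    $path = parse_url($original, PHP_URL_PATH) ?: '/';
    // ...aggregate $path, $mime, $status, $ts, $size into the JSON payload
}
```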
dns.php queries Google's DoH API (dns.google/resolve) for A, AAAA, CNAME, MX, NS, TXT, SOA, and SRV records. It then probes 30 common subdomain prefixes (www, mail, api, dev, staging, admin, etc.) using PHP's dns_get_record(). Technology inference from nameservers, MX hosts, and TXT records surfaces email providers, DNS providers, and domain verification tokens.
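A condensed sketch of the two lookup paths involved; the helper name is illustrative and the request is shown with file_get_contents for brevity.

```php
<?php
// Illustrative DoH lookup against dns.google/resolve (returns JSON).
function dohQuery(string $name, string $type): array
{
    $url = 'https://dns.google/resolve?' . http_build_query(['name' => $name, 'type' => $type]);
    $json = json_decode((string) @file_get_contents($url), true);
    return $json['Answer'] ?? [];        // each answer carries name, type, TTL, data
}

$nameservers = array_column(dohQuery('example.com', 'NS'), 'data');

// The subdomain probes use PHP's resolver directly, e.g.:
$www = @dns_get_record('www.example.com', DNS_A);
```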
certs.php queries crt.sh with a wildcard search for the base domain (%.example.com). This reveals all subdomains that have ever had a TLS certificate issued — including internal, staging, and legacy ones that DNS no longer resolves.
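Sketched out, the lookup is a single JSON request; the parsing details below are illustrative.

```php
<?php
// Illustrative crt.sh wildcard query; output=json returns one entry per certificate.
$base  = 'example.com';
$url   = 'https://crt.sh/?' . http_build_query(['q' => '%.' . $base, 'output' => 'json']);
$certs = json_decode((string) @file_get_contents($url), true) ?: [];

$subdomains = [];
foreach ($certs as $cert) {
    // name_value may contain several SAN entries separated by newlines.
    foreach (explode("\n", $cert['name_value'] ?? '') as $name) {
        $name = ltrim(strtolower(trim($name)), '*.');           // normalise wildcard entries
        if ($name === $base || str_ends_with($name, '.' . $base)) {
            $subdomains[$name] = true;
        }
    }
}
$subdomains = array_keys($subdomains);                          // unique subdomain list
```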
crawl.php contains a built-in wordlist of 350+ paths covering every common web application pattern — CMS paths, API endpoints, configuration files, backup archives, admin panels, framework artifacts, development leftovers, log files, and security-sensitive filenames. Paths are probed in parallel using PHP's curl_multi_* API. The frontend paginates through the wordlist in configurable batches (25/50/100), streaming results in real time. Only paths with a non-zero, non-404 status are highlighted as live.
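The batching pattern looks roughly like this; it is a sketch only, and the option values, user agent, and result shape are illustrative rather than the crawl.php source.

```php
<?php
// Sketch of the curl_multi batching pattern used for parallel path probing.
function probeBatch(string $base, array $paths): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($paths as $path) {
        $ch = curl_init($base . $path);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => false,     // record redirects instead of following them
            CURLOPT_TIMEOUT        => 10,
            CURLOPT_USERAGENT      => 'FileX',   // illustrative UA string
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$path] = $ch;
    }

    do {                                         // drive all transfers concurrently
        curl_multi_exec($mh, $running);
        curl_multi_select($mh);
    } while ($running > 0);

    $results = [];
    foreach ($handles as $path => $ch) {
        $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
        if ($status !== 0 && $status !== 404) {  // only non-zero, non-404 paths count as live
            $results[] = ['path' => $path, 'status' => $status];
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```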
When enabled, probe.php is called on the root URL to extract all <a href> and <script src> values. Each discovered URL is then individually probed to retrieve its status, content type, size, and response headers.
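A minimal sketch of that extraction using DOMDocument (which libxml provides); the function name and the relative-URL handling are simplified and illustrative.

```php
<?php
// Illustrative extraction of <a href> and <script src> values from fetched HTML.
function extractLinks(string $html, string $baseUrl): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                      // suppress warnings on imperfect markup
    $urls = [];
    foreach ($doc->getElementsByTagName('a') as $a) {
        $urls[] = $a->getAttribute('href');
    }
    foreach ($doc->getElementsByTagName('script') as $s) {
        if ($s->getAttribute('src') !== '') {
            $urls[] = $s->getAttribute('src');
        }
    }
    // Resolve relative URLs against the page root and drop duplicates (simplified).
    return array_values(array_unique(array_map(
        fn ($u) => str_starts_with($u, 'http') ? $u : rtrim($baseUrl, '/') . '/' . ltrim($u, '/'),
        array_filter($urls)
    )));
}
```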
The frontend deduplicates all collected path entries across all sources, builds a nested directory tree structure, and renders it as an interactive DOM tree. Each node carries its probe result (if available), source attribution badge, and semantic badges (DIR LIST, ⚡ interesting, redirect target).
filex/
├── .github/
│ └── workflows/
│ ├── lint.yml # PHP syntax, ESLint, HTML validation
│ └── deploy.yml # FTP deploy to InfinityFree on push to main
├── public/ # Entire contents deployed to /htdocs/
│ ├── .htaccess # Security headers, compression, cache
│ ├── index.html # Single-page app shell
│ ├── api/
│ │ ├── .htaccess # API-specific headers, PHP config
│ │ ├── probe.php # General URL probe + fingerprinting
│ │ ├── robots.php # robots.txt fetch + parser
│ │ ├── sitemap.php # Recursive sitemap crawler
│ │ ├── wayback.php # Wayback Machine CDX proxy
│ │ ├── dns.php # DNS recon via Google DoH
│ │ ├── certs.php # Certificate transparency via crt.sh
│ │ └── crawl.php # Parallel wordlist path prober
│ └── assets/
│ ├── css/
│ │ └── filex.css # Full stylesheet (CSS variables, dark theme)
│ └── js/
│ ├── filex.js # Main orchestrator (ES module)
│ └── tree.js # FileTree class — build, render, export
├── docs/
│ └── api.md # API endpoint reference
├── .gitignore
├── .gitattributes
├── LICENSE # GPL-3.0
└── README.md
All endpoints accept GET requests and return application/json. They are designed for same-origin use only (the .htaccess sets broad CORS headers for development convenience, which you may want to tighten for production).
Query params for `probe.php`: `url` (full URL including scheme)
Fetches the target URL and returns a rich metadata object.
{
"url": "https://example.com/",
"finalUrl": "https://example.com/",
"status": 200,
"elapsed": 312,
"contentType": "text/html; charset=UTF-8",
"contentLength": 14832,
"server": "nginx/1.24.0",
"poweredBy": "PHP/8.2.0",
"redirectCount": 0,
"headers": { "…": "…" },
"hasDirectoryList": false,
"cms": ["WordPress"],
"tech": ["nginx 1.24.0", "PHP 8.2.0", "jQuery 3.6.4"],
"forms": ["/wp-login.php"],
"links": ["https://example.com/about/", "…"],
"scripts": ["https://example.com/wp-includes/js/jquery/jquery.min.js"],
"metaTags": { "generator": "WordPress 6.4" },
"comments": ["Build: 2024-01-15"],
"bodySnippet": "<!DOCTYPE html>…"
}

Query params for `robots.php`: `domain`
{
"url": "https://example.com/robots.txt",
"status": 200,
"raw": "User-agent: *\nDisallow: /admin/\n…",
"parsed": {
"agents": {
"*": {
"disallow": [{ "path": "/admin/", "comment": "" }],
"allow": []
}
},
"sitemaps": ["https://example.com/sitemap.xml"],
"allPaths": ["/admin/", "/private/", "…"]
}
}

Query params for `sitemap.php`: `domain`
{
"domain": "example.com",
"total": 142,
"sources": [
{ "url": "https://example.com/sitemap.xml", "status": 200 }
],
"urls": ["https://example.com/about/", "…"]
}

Query params for `wayback.php`: `domain`, `limit` (max 5000), `from` (timestamp), `to` (timestamp)
{
"domain": "example.com",
"total": 2841,
"paths": ["/", "/old-page/", "/deleted-admin/", "…"],
"mimeStats": { "text/html": 2100, "application/javascript": 400, "…": "…" },
"urls": [
{ "url": "https://example.com/old-page/", "path": "/old-page/", "status": "200", "mime": "text/html", "ts": "20190401120000", "size": 8192 }
]
}

Query params for `dns.php`: `domain`
{
"domain": "example.com",
"ipv4": ["93.184.216.34"],
"ipv6": ["2606:2800:220:1:248:1893:25c8:1946"],
"nameservers": ["a.iana-servers.net.", "b.iana-servers.net."],
"mx": [{ "priority": 10, "host": "mail.example.com." }],
"txt": ["v=spf1 -all"],
"cname": null,
"tech": ["SPF", "DMARC"],
"subdomains": ["www.example.com", "mail.example.com"]
}

Query params for `certs.php`: `domain`
{
"domain": "example.com",
"baseDomain": "example.com",
"total": 7,
"subdomains": ["api.example.com", "dev.example.com", "staging.example.com", "…"],
"certCount": 34
}

Query params for `crawl.php`: `domain`, `scheme` (http/https), `batch`, `offset`, `paths` (newline-separated custom paths)
{
"domain": "example.com",
"total": 350,
"offset": 0,
"batch": 50,
"results": [
{
"path": "/.env",
"status": 403,
"contentType": "text/html",
"contentLength": 0,
"server": "nginx",
"redirect": null,
"hasDirectoryList": false,
"interesting": true
}
]
}

- A free InfinityFree account
- A GitHub repository (fork or clone of this one)
- PHP 8.0+ with `curl` and `libxml` enabled (InfinityFree provides both)
- Create a hosting account and note your FTP hostname, FTP username, and FTP password from the control panel.
- The deploy target is `/htdocs/` — the `public/` directory contents go directly here.
- Ensure no existing `index.html` conflicts.
- InfinityFree's PHP has `curl` and `allow_url_fopen` enabled by default.
Navigate to your repo → Settings → Secrets and variables → Actions → New repository secret and add:
| Secret name | Value |
|---|---|
| `FTP_HOST` | e.g. `ftpupload.net` |
| `FTP_USERNAME` | e.g. `epiz_12345678` |
| `FTP_PASSWORD` | Your FTP password |
If you prefer a one-time manual deploy with lftp:
lftp -e "mirror -R ./public/ /htdocs/ --delete; bye" \
-u "$FTP_USER,$FTP_PASS" ftp://$FTP_HOSTOr use FileZilla to drag public/ contents into /htdocs/.
`lint.yml` triggers on every push to `main`/`develop` and every pull request to `main`.
| Job | What it does |
|---|---|
| `lint-php` | `php -l` syntax check on every `.php` file + PHP-CS-Fixer dry run against PSR-12 |
| `lint-js` | ESLint 9 with ES2022 module rules across `public/assets/js/` |
| `lint-html` | `html-validate` on `index.html` |
`deploy.yml` triggers on push to `main` and on manual `workflow_dispatch`. It:
- Runs a pre-deploy PHP syntax check as a gate
- Uses `SamKirkland/FTP-Deploy-Action` to mirror `public/` → `/htdocs/`
- Excludes `.git*`, `node_modules`, `*.bak`, `*.log`
- Reports success or failure in the workflow log
All options are toggled in the UI before scanning. They persist per-scan but not across page reloads.
| Option | Default | Effect |
|---|---|---|
| robots.txt | ✅ | Fetch and parse /robots.txt |
| Sitemap | ✅ | Recursively crawl all sitemaps |
| Wayback Machine | ✅ | Query CDX API for historical paths |
| DNS + Subdomains | ✅ | Full DNS recon + 30 subdomain probes |
| Cert Transparency | ✅ | Query crt.sh for subdomain history |
| Wordlist Crawl | ✅ | Probe 350+ paths in parallel batches |
| Probe Discovered Links | ❌ | Individually probe <a> + <script> URLs from root page |
| Scheme | HTTPS | Use HTTP or HTTPS for crawl requests |
| Batch size | 50 | Paths per crawl API call (25/50/100) |
| Format | Contents |
|---|---|
| JSON | Full scan data object — all entries, probe results, robots, sitemap, wayback, dns, certs, source counts |
| Tree (TXT) | ASCII art directory tree identical to tree(1) output, with [status] and (source) annotations |
| CSV | One row per unique path — columns: path, status, contentType, size, source, interesting, redirect |
FileX is a passive and semi-passive reconnaissance tool. It:
- Does not send exploit payloads of any kind
- Does not brute-force authentication
- Does not attempt to read file contents beyond HTTP response metadata
- Uses only publicly available data sources (robots.txt, sitemaps, Wayback Machine, DNS, crt.sh)
- The wordlist crawl sends standard HTTP GET requests — identical to what a browser or search engine crawler would do
You are solely responsible for ensuring you have authorization to scan any domain you target. Scanning domains you do not own or do not have explicit written permission to test may violate computer fraud laws in your jurisdiction. The authors and contributors of FileX accept no liability for misuse.
The tool deliberately omits:
- Authenticated brute-force (HTTP 401 credential testing)
- Vulnerability scanning or exploit probing
- Automated rate-exhaustion or DoS-capable patterns
The frontend makes HTTP requests to arbitrary external domains, which browsers block via CORS unless the target server explicitly allows it — which most do not. A thin PHP proxy running on the same origin as the frontend sidesteps this entirely. InfinityFree provides free PHP hosting with cURL, making it a zero-cost deployment target. The PHP layer is intentionally minimal: it validates input, proxies requests with a custom UA, and returns JSON.
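A stripped-down sketch of that proxy shape follows; it is illustrative only, and the real endpoints validate input more strictly and return the richer metadata documented above.

```php
<?php
// Minimal illustration of the same-origin proxy pattern described above.
header('Content-Type: application/json');

$target = filter_input(INPUT_GET, 'url', FILTER_VALIDATE_URL);
if ($target === false || $target === null) {
    http_response_code(400);
    echo json_encode(['error' => 'invalid url']);
    exit;
}

$ch = curl_init($target);
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_TIMEOUT        => 15,
    CURLOPT_USERAGENT      => 'FileX',           // custom UA, as noted above
]);
$body = curl_exec($ch);

echo json_encode([
    'status'      => curl_getinfo($ch, CURLINFO_RESPONSE_CODE),
    'contentType' => curl_getinfo($ch, CURLINFO_CONTENT_TYPE),
    'bodySnippet' => is_string($body) ? substr($body, 0, 2000) : null,
]);
curl_close($ch);
```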
- InfinityFree `max_execution_time`: InfinityFree caps PHP execution at approximately 30–60 seconds. Large Wayback CDX responses or slow crawl batches may hit this limit. Reduce batch size or limit source counts if you observe timeouts.
- InfinityFree cURL restrictions: Some shared hosts block outbound cURL to certain ranges. If Wayback or crt.sh calls fail, this may be the cause.
- No JavaScript execution: `probe.php` fetches raw HTML only — it cannot render JavaScript-heavy SPAs. Paths loaded via client-side routing will not be discovered unless they appear in other sources.
- Wayback CDX rate limiting: The Internet Archive imposes soft rate limits on CDX queries. Repeated rapid scans of the same domain may return throttled or empty responses.
- crt.sh availability: crt.sh is a free community service and occasionally experiences downtime.
- robots.txt scope: robots.txt only discloses paths the site operator chose to list. It is not a complete inventory.
Contributions are welcome. Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/your-feature`)
- Ensure `lint.yml` passes (`php -l`, ESLint, html-validate)
- Submit a pull request against `main` with a clear description
Ideas for contribution:
- Additional CMS/tech fingerprints in `probe.php`
- Expanded wordlist entries in `crawl.php`
- JavaScript `<link rel="preload">` / `<link rel="stylesheet">` path extraction
- `.well-known/` endpoint enumeration
- Wappalyzer-style comprehensive tech detection
- Dark/light theme toggle
- Scan history stored in `localStorage`
FileX is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation.
See LICENSE for the full license text, or visit https://www.gnu.org/licenses/gpl-3.0.en.html.
Copyright (C) FileX Contributors
This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it
under certain conditions; see LICENSE for details.