Open
Conversation
Extract the inlined _PARKING_REDIRECT_HOSTS / _is_parking_redirect from
classifier.py into a standalone domain_classifier.parking module with
five categorised host sets (for-sale marketplaces, registrar parking,
monetisation networks, traffic arbitrage, content-farm affiliates) plus
a URL_PARKING_PATTERNS list for host+query-string rules.
Add a new domain_classifier.platforms module holding the hosting/SaaS
platform data that Domain Intelligence was maintaining separately
(PLATFORM_REDIRECT_HOSTS, PLATFORM_HOSTNAME_SUFFIXES, URL_PLATFORM_PATTERNS,
is_platform_url, is_platform_host). One source of truth; downstream
importers replace their local copies with `from domain_classifier.platforms`.
Convention: hosts that are primarily parking landers live in parking.py
only; hosts that are primarily platforms but may also see parking traffic
live in platforms.py only. Membership is mutually exclusive — parking
wins on overlap (homestead.com was in both; kept in parking only).
New URL_PARKING_PATTERNS entry:
("help.com.au", "?d=") — Servers Australia hosting lander; without
this rule the redirect masquerades as an
acquisition signal.
Restores the richer parking coverage (105 hosts) that existed before
the ml/content/fetch extraction refactor trimmed it to 26.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
classifier.pyinto a newdomain_classifier.parkingmodule with 5 categorised subsets (for-sale marketplaces, registrar parking, monetisation networks, traffic arbitrage, content-farm affiliates) plus aURL_PARKING_PATTERNSlist for host+query rulesdomain_classifier.platformsmodule holding the hosting/SaaS platform data that Domain Intelligence was duplicating inredirect_detector.py(now single source of truth)URL_PARKING_PATTERNSentry:("help.com.au", "?d=")— Servers Australia parking lander. Without this rule the redirect masqueraded as an acquisition signal in DIhomestead.comout of platforms (it was in both lists; parking wins on overlap)Why
Domain Intelligence and this repo had drifted: DI was maintaining a separate
PLATFORM_REDIRECT_HOSTS/URL_PLATFORM_PATTERNSset in itsredirect_detector.py, and some hosts (e.g.homestead.com) appeared in both lists under different categories. DI also had to invent aplatformclassification to cover parker hosts likehelp.com.authat the classifier should be catching asparked(grade C).This PR makes the classifier the one source of truth for all redirect-destination reference data; DI will follow up with a submodule bump and
from domain_classifier.platforms import .../ deletion of its local lists.Convention
Hosts that are primarily parking landers live in
parking.pyonly; hosts that are primarily platforms but may also see parking traffic live inplatforms.pyonly. Membership is mutually exclusive — parking wins on overlap.Test plan
tests/test_parking.py— 19 assertions across host match, URL pattern match, and non-match / dedupe guardstests/test_platforms.py— 22 assertions across URL pattern, host suffix, and dedupe guardspytest tests/— 172 passing (2 failing tests are unrelatedtest_rank.pytop-ranked-override WIP, pre-existing)classify_domainrun (CI / reviewer)🤖 Generated with Claude Code