Metaculus · probably-jaden · May 14, 2026 · May 12, 2026 · May 14, 2026 · May 14, 2026
diff --git a/README.ipynb b/README.ipynb
@@ -11,11 +11,11 @@
      "text": [
       "[NbConvertApp] Converting notebook README.ipynb to markdown\n",
       "[NbConvertApp] Writing 44327 bytes to README.md\n",
-      "┌──────────┬────────────┬───────────┐\n",
-      "│ \u001b[1mlast_day\u001b[0m │ \u001b[1mlast_month\u001b[0m │ \u001b[1mlast_week\u001b[0m │\n",
-      "├──────────┼────────────┼───────────┤\n",
-      "│    1,656 │     29,361 │     9,118 │\n",
-      "└──────────┴────────────┴───────────┘\n",
+      "\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u252c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n",
+      "\u2502 \u001b[1mlast_day\u001b[0m \u2502 \u001b[1mlast_month\u001b[0m \u2502 \u001b[1mlast_week\u001b[0m \u2502\n",
+      "\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u253c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n",
+      "\u2502    1,656 \u2502     29,361 \u2502     9,118 \u2502\n",
+      "\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2534\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n",
       "\n"
      ]
     }
@@ -58,13 +58,11 @@
     "This repository contains forecasting and research tools built with Python and Streamlit. The project aims to assist users in making predictions, conducting research, and analyzing data related to hard to answer questions (especially those from Metaculus).\n",
     "\n",
     "Here are the tools most likely to be useful to you:\n",
-    "- 🎯 **Forecasting Bot:** General forecaster that integrates with the Metaculus AI benchmarking competition and provides a number of utilities. You can forecast with a pre-existing bot or override the class to customize your own (without redoing all the aggregation/API code, etc)\n",
-    "- 🔌 **Metaculus API Wrapper:** for interacting with questions and tournaments\n",
-    "- 📊 **Benchmarking:** Randomly sample quality questions from Metaculus and run your bot against them so you can get an early sense of how your bot is doing by comparing to the community prediction and expected baseline scores.\n",
-    "- 🤖 **In-House Metaculus Bots**: You can see all the bots that Metaculus is running on their site in `run_bots.py`\n",
+    "- \ud83c\udfaf **Forecasting Bot:** General forecaster that integrates with the Metaculus AI benchmarking competition and provides a number of utilities. You can forecast with a pre-existing bot or override the class to customize your own (without redoing all the aggregation/API code, etc)\n",
+    "- \ud83d\udd0c **Metaculus API Wrapper:** for interacting with questions and tournaments\n",
+    "- \ud83e\udd16 **In-House Metaculus Bots**: You can see all the bots that Metaculus is running on their site in `run_bots.py`\n",
     "\n",
     "Here are some other features of the project (not all are documented yet):\n",
-    "- **Smart Searcher:** A custom AI-powered internet-informed llm powered by Exa.ai and GPT. It is more configurable than Perplexity AI, allowing you to use any AI model, instruct the AI to decide on filters, get citations linking to exact paragraphs, etc.\n",
     "- **Key Factor Analysis:** Key Factors Analysis for scoring, ranking, and prioritizing important variables in forecasting questions\n",
     "- **Base Rate Researcher:** for calculating event probabilities (still experimental)\n",
     "- **Niche List Researcher:** for analyzing very specific lists of past events or items (still experimental)\n",
@@ -406,139 +404,6 @@
     "# Important Utilities"
    ]
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Benchmarking\n",
-    "Below is an example of how to run the benchmarker"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [
-    {
-     "name": "stdout",
-     "output_type": "stream",
-     "text": [
-      "--------------------------------\n",
-      "Bot: TemplateBot\n",
-      "Score: 53.24105782939477\n",
-      "Num reports in benchmark: 2\n",
-      "Time: 0.23375582297643024min\n",
-      "Cost: $0.03020605\n",
-      "--------------------------------\n",
-      "Bot: CustomBot\n",
-      "Score: 53.24105782939476\n",
-      "Num reports in benchmark: 2\n",
-      "Time: 0.20734789768854778min\n",
-      "Cost: $0.019155650000000003\n"
-     ]
-    }
-   ],
-   "source": [
-    "from forecasting_tools import Benchmarker, TemplateBot, BenchmarkForBot\n",
-    "\n",
-    "class CustomBot(TemplateBot):\n",
-    "    ...\n",
-    "\n",
-    "# Run benchmark on multiple bots\n",
-    "bots = [TemplateBot(), CustomBot()]  # Add your custom bots here\n",
-    "benchmarker = Benchmarker(\n",
-    "    forecast_bots=bots,\n",
-    "    number_of_questions_to_use=2,  # Recommended 100+ for meaningful results\n",
-    "    file_path_to_save_reports=\"benchmarks/\",\n",
-    "        # It will create a file name for you if given a folder.\n",
-    "        # If a file name is given, and the file already exists, it will overwrite it.\n",
-    "    concurrent_question_batch_size=5,\n",
-    ")\n",
-    "benchmarks: list[BenchmarkForBot] = await benchmarker.run_benchmark()\n",
-    "\n",
-    "# View results\n",
-    "for benchmark in benchmarks[:2]:\n",
-    "    print(\"--------------------------------\")\n",
-    "    print(f\"Bot: {benchmark.name}\")\n",
-    "    print(f\"Score: {benchmark.average_expected_baseline_score}\") # Higher is better\n",
-    "    print(f\"Num reports in benchmark: {len(benchmark.forecast_reports)}\")\n",
-    "    print(f\"Time: {benchmark.time_taken_in_minutes}min\")\n",
-    "    print(f\"Cost: ${benchmark.total_cost}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The ideal number of questions to get a good sense of whether one bot is better than another can vary. 100+ should tell you something decent. See [this analysis](https://forum.effectivealtruism.org/posts/DzqSh7akX28JEHf9H/comparing-two-forecasters-in-an-ideal-world) for an exploration of the numbers. With too few questions, the results could just be statistical noise, though how many questions you need depends highly on the difference in skill of your bot versions.\n",
-    "\n",
-    "If you use the average expected baseline score, higher score is better. The scoring measures the expected value of your score without needing an actual resolution by assuming that the community prediction is the 'true probability'. Under this assumption, expected baseline scores are a proper score (see analysis in `scripts/simulate_a_tournament.ipynb`)\n",
-    "\n",
-    "As of May 29, 2025 the benchmarker automatically selects a random set of questions from Metaculus that:\n",
-    "- Are binary questions (yes/no)\n",
-    "- Are currently open\n",
-    "- Opened within the last year\n",
-    "- Have at least 30 forecasters\n",
-    "- Have a community prediction\n",
-    "- Are not part of a group question\n",
-    "\n",
-    "Note that sometimes there are not many questions matching these filters (e.g. at the beginning of a new year when a majority of open questions were just resolved). As of last edit there are plans to expand this to numeric and multiple choice, but right now it just benchmarks binary questions.\n",
-    "\n",
-    "You can grab these questions without using the Benchmarker by running the below\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from forecasting_tools import MetaculusApi\n",
-    "\n",
-    "questions = MetaculusApi.get_benchmark_questions(\n",
-    "    num_of_questions_to_return=100,\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can also save/load benchmarks to/from json"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "from forecasting_tools import BenchmarkForBot\n",
-    "\n",
-    "# Load\n",
-    "file_path = \"benchmarks/benchmark.json\"\n",
-    "benchmarks: list[BenchmarkForBot] = BenchmarkForBot.load_json_from_file_path(file_path)\n",
-    "\n",
-    "# Save\n",
-    "new_benchmarks: list[BenchmarkForBot] = benchmarks\n",
-    "BenchmarkForBot.save_object_list_to_file_path(new_benchmarks, file_path) # Will overwrite the file if it already exists\n",
-    "\n",
-    "# To/From Json String\n",
-    "single_benchmark = benchmarks[0]\n",
-    "json_object: dict = single_benchmark.to_json()\n",
-    "new_benchmark: BenchmarkForBot = BenchmarkForBot.from_json(json_object)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Once you have benchmark files in your project directory you can run `streamlit run forecasting_tools/benchmarking/benchmark_displayer.py` to get a UI with the benchmarks. You can also put `forecasting_tools.run_benchmark_streamlit_page()` into a new file, and run this file with streamlit to achieve the same results. This will allow you to see metrics side by side, explore code of past bots, see the actual bot responses, etc. It will pull in any files in your directory that contain \"bench\" in the name and are json. Results may take a while to load for large benchmark files.\n",
-    "\n",
-    "![Benchmark Displayer Top](./docs/images/benchmark_top_screen.png)\n",
-    "![Benchmark Displayer Bottom](./docs/images/benchmark_bottom_screen.png)"
-   ]
-  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -631,99 +496,43 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# AI Research Tools/Agents"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## Smart Searcher\n",
-    "The Smart Searcher acts like an LLM with internet access. It works a lot like Perplexity.ai API, except:\n",
-    "- It has clickable citations that highlight and link directly to the cited paragraph using text fragments\n",
-    "- You can ask the AI to use filters for domain, date, and keywords\n",
-    "- There are options for structured output (Pydantic objects, lists, dict, list\\[dict\\], etc.)\n",
-    "- Concurrent search execution for faster results\n",
-    "- Optional detailed works cited list"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
+    "### Group Questions\n",
     "\n",
-    "searcher = SmartSearcher(\n",
-    "    temperature=0,\n",
-    "    num_searches_to_run=2,\n",
-    "    num_sites_per_search=10,  # Results returned per search\n",
-    "    include_works_cited_list=False  # Add detailed citations at the end\n",
-    ")\n",
+    "Several of the methods above accept a `group_question_mode` parameter that controls how Metaculus group questions (e.g. \"How many people will die of coronavirus in [period]?\") are handled:\n",
+    "- `\"exclude\"` \u2014 drop group questions from the result.\n",
+    "- `\"unpack_subquestions\"` \u2014 turn each subquestion into a separate normal question.\n",
     "\n",
-    "response = await searcher.invoke(\n",
-    "    \"What is the recent news for Apple?\"\n",
-    ")\n",
+    "For backwards compatibility, the default is `\"exclude\"` for `get_question_by_post_id`, `get_question_by_url`, `ApiFilter` (used by `get_questions_matching_filter`), and `get_benchmark_questions` \u2014 so group questions don't get overweighted in benchmarks. The exception is `get_all_open_questions_from_tournament`, which defaults to `\"unpack_subquestions\"` so all subquestions are forecasted as normal questions.\n",
     "\n",
-    "print(response)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Example output:\n",
-    "> Recent news about Apple includes several significant developments:\n",
-    ">\n",
-    "> 1. **Expansion in India**: Apple is planning to open four more stores in India, with two in Delhi and Mumbai, and two in Bengaluru and Pune. This decision follows record revenues in India for the September 2024 quarter, driven by strong iPhone sales. Tim Cook, Apple's CEO, highlighted the enthusiasm and growth in the Indian market during the company's earnings call \\[[1](https://telecomtalk.info/tim-cook-makes-major-announcement-for-apple-in-india/984260/#:~:text=This%20is%20not%20a%20new,first%20time%20Apple%20confirmed%20it.)\\]\\[[4](https://telecomtalk.info/tim-cook-makes-major-announcement-for-apple-in-india/984260/#:~:text=This%20is%20not%20a%20new,set%20an%20all%2Dtime%20revenue%20record.)\\]\\[[5](https://telecomtalk.info/tim-cook-makes-major-announcement-for-apple-in-india/984260/#:~:text=Previously%2C%20Diedre%20O%27Brien%2C%20Apple%27s%20senior,East%2C%20India%20and%20South%20Asia.)\\]\\[[8](https://telecomtalk.info/tim-cook-makes-major-announcement-for-apple-in-india/984260/#:~:text=At%20the%20company%27s%20earnings%20call,four%20new%20stores%20in%20India.)\\].\n",
-    ">\n",
-    "> 2. **Product Launches**: Apple is set to launch new iMac, Mac mini, and MacBook Pro models with M4 series chips on November 8, 2024. Additionally, the Vision Pro headset will be available in South Korea and the United Arab Emirates starting November 15, 2024. The second season of the Apple TV+ sci-fi series \"Silo\" will also premiere on November 15, 2024 \\[[2](https://www.macrumors.com/2024/11/01/what-to-expect-from-apple-this-november/#:~:text=And%20the%20Vision%20Pro%20launches,the%20App%20Store%2C%20and%20more.)\\]\\[[12](https://www.macrumors.com/2024/11/01/what-to-expect-from-apple-this-november/#:~:text=As%20for%20hardware%2C%20the%20new,announcements%20in%20store%20this%20November.)\\].\n",
-    ">\n",
-    "> ... etc ...\n",
+    "```python\n",
+    "from forecasting_tools import MetaculusApi, ApiFilter\n",
     "\n",
-    "You can also use structured outputs by providing a Pydantic model (or any other simpler type hint) and using the schema formatting helper:"
+    "# Unpack a group question into its subquestions\n",
+    "result = MetaculusApi.get_question_by_post_id(\n",
+    "    post_id=...,  # a group-question post\n",
+    "    group_question_mode=\"unpack_subquestions\",\n",
+    ")  # returns list[MetaculusQuestion] for group posts\n",
+    "\n",
+    "# Same option on a filtered query\n",
+    "api_filter = ApiFilter(\n",
+    "    allowed_statuses=[\"open\"],\n",
+    "    group_question_mode=\"unpack_subquestions\",\n",
+    ")\n",
+    "questions = await MetaculusApi.get_questions_matching_filter(api_filter=api_filter)\n",
+    "```"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
+   "cell_type": "markdown",
    "metadata": {},
-   "outputs": [],
    "source": [
-    "from pydantic import BaseModel, Field\n",
-    "from forecasting_tools import SmartSearcher\n",
-    "\n",
-    "class Company(BaseModel):\n",
-    "    name: str = Field(description=\"Full company name\")\n",
-    "    market_cap: float = Field(description=\"Market capitalization in billions USD\")\n",
-    "    key_products: list[str] = Field(description=\"Main products or services\")\n",
-    "    relevance: str = Field(description=\"Why this company is relevant to the search\")\n",
-    "\n",
-    "searcher = SmartSearcher(temperature=0, num_searches_to_run=4, num_sites_per_search=10)\n",
-    "\n",
-    "schema_instructions = searcher.get_schema_format_instructions_for_pydantic_type(Company)\n",
-    "prompt = f\"\"\"Find companies that are leading the development of autonomous vehicles.\n",
-    "Return as a list of companies with their details. Remember to give me a list of the schema provided.\n",
-    "\n",
-    "{schema_instructions}\"\"\"\n",
-    "\n",
-    "companies = await searcher.invoke_and_return_verified_type(prompt, list[Company])\n",
-    "\n",
-    "for company in companies:\n",
-    "    print(f\"\\n{company.name} (${company.market_cap}B)\")\n",
-    "    print(f\"Relevance: {company.relevance}\")\n",
-    "    print(\"Key Products:\")\n",
-    "    for product in company.key_products:\n",
-    "        print(f\"- {product}\")"
+    "# AI Research Tools/Agents"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "The schema instructions will format the Pydantic model into clear instructions for the AI about the expected output format and field descriptions.\n",
-    "\n",
-    "\n",
     "## Key Factors Researcher\n",
     "The Key Factors Researcher helps identify and analyze key factors that should be considered for a forecasting question. As of last update, this is the most reliable of the tools, and gives something useful and accurate almost every time. It asks a lot of questions, turns search results into a long list of bullet points, rates each bullet point on ~8 criteria, and returns the top results."
    ]
@@ -934,7 +743,7 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "\n",
+    "prompt = \"What is the weather in Tokyo?\"\n",
     "result = await GeneralLlm(model=\"gpt-4o\").invoke(prompt)\n",
     "result = await GeneralLlm(model=\"claude-3-5-sonnet-20241022\").invoke(prompt)\n",
     "result = await GeneralLlm(model=\"metaculus/claude-3-5-sonnet-20241022\").invoke(prompt) # Adding 'metaculus' Calls the Metaculus proxy\n",
@@ -1083,15 +892,14 @@
    "source": [
     "from forecasting_tools import MonetaryCostManager\n",
     "from forecasting_tools import (\n",
-    "    ExaSearcher, SmartSearcher, GeneralLlm\n",
+    "    ExaSearcher, GeneralLlm\n",
     ")\n",
     "\n",
     "max_cost = 5.00\n",
     "\n",
     "with MonetaryCostManager(max_cost) as cost_manager:\n",
     "    prompt = \"What is the weather in Tokyo?\"\n",
     "    result = await GeneralLlm(model=\"gpt-4o\").invoke(prompt)\n",
-    "    result = await SmartSearcher(model=\"claude-3-5-sonnet-20241022\").invoke(prompt)\n",
     "    result = await ExaSearcher().invoke(prompt)\n",
     "    # ... etc ...\n",
     "\n",