
DFR Browser 2

Overview

DFR Browser 2 is a web-based topic modelling browser that provides interactive visualizations and analysis tools for exploring topic models generated by MALLET. It is based on Andrew Goldstone's original dfr-browser and reproduces all of its major functionality, but with an entirely new architecture and additional features. For full documentation, see the DFR Browser 2 repository.

The dfr_browser2 module provides a small helper class, Browser, that automates the steps required to prepare and open a DFR Browser 2 distribution. The helper is designed to be used programmatically from Python and can produce a small static browser bundle. It performs the following functions:

  • Validates that all required MALLET output files exist
  • Auto-generates topic_coords.csv from topic-state.gz if the coordinates file is missing
  • Copies a template DFR Browser 2 folder into a working browser folder
  • Copies all MALLET output and metadata files into the browser's data/, along with optional files (diagnostics.xml) if present
  • Copies an optional file containing the documents used to generate the topic model
  • Manages configuration settings for the browser
  • Checks port availability before starting the server
  • Starts a simple HTTP server to serve the browser and opens it in a web browser

If you have not yet generated a MALLET topic model, it is recommended that you start with the MALLET tutorial.


Browser Class

To create a simple instance of a browser, supply a path to the directory where your MALLET files are located and, optionally, a path to a directory where you want to save the browser:

from lexos.topic_modeling.dfr_browser2 import Browser

b = Browser(
    mallet_files_path="/path/to/mallet_files",
    browser_path="/tmp/dfr_browser_output",  # optional (temporary folder created if omitted)
)

b.serve()

Calling the serve() method will start a localhost server and open the browser in your system's default web browser. The server runs in a subprocess, so you can continue working in your Python session while the browser is running. You can also start the server from the command line by running the server.py script in your DFR Browser 2 folder.

[Screenshot: DFR Browser]

Note

DFR Browser 2 must be served from a server; otherwise it will not have full functionality. The serve() method checks if the specified port is available before starting the server. If the port is already in use, you'll receive a helpful error message with instructions on how to find and terminate the conflicting process, or you can specify a different port using the port parameter. The default port is 8000, which may conflict with Jupyter notebooks or other local servers.
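The port check described in this note can be approximated with the standard library. The sketch below is not the code used by serve(); the helper names port_is_free and first_free_port are hypothetical and only illustrate one common way to test whether a local port is already taken.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use": another process holds the port.
            return False
    return True


def first_free_port(start: int = 8000, attempts: int = 20) -> int:
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if port_is_free(port):
            return port
    raise RuntimeError(f"No free port found in {start}-{start + attempts - 1}")
```

A value chosen this way could then be passed through the port parameter, for example Browser(..., port=first_free_port(8000)). Another process may still claim the port between the check and the server start, so the check reduces conflicts rather than ruling them out.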

Tip

If you're running this in a Jupyter notebook, the browser will open in a new tab. You can stop the server by calling b.stop_server() or by restarting the kernel.

The example below demonstrates a fuller set of options:

b = Browser(
    mallet_files_path="/path/to/mallet_files",
    browser_path="/tmp/dfr_browser_output",  # optional (temporary folder created if omitted)
    data_path="/path/to/docs.txt",  # optional
    template_path="/path/to/dfr_browser2/template",
    filename_map={"doc-topics.txt": "doc-topic.txt"},
    config={"application": {"name": "My Browser"}},
    port=5000
)

The data_path is the path to your original training data file. It should be a tab-separated file with 2 columns per line (ID, content) or 3 columns per line (ID, label, content) — lines are validated during initialization and copied to data/docs.txt.

Because DFR Browser 2 is a separate package, Lexos may not have the latest distribution. If this is the case, you can download the latest version from the DFR Browser 2 repository and set the template_path to this version.

The Browser class assumes canonical names like doc-topic.txt for your input files. You can rename your files to the canonical names or map the current names of your files to the canonical ones with filename_map. See below for further information on using this parameter.

Every instance of DFR Browser 2 has a config.json file containing the browser configuration. You can pass configuration values as a dictionary to this file using the Browser class with the config parameter. For discussion of the configuration options, see the DFR Browser 2 repository.

As mentioned earlier, you can also set the port on which the browser is served with the port parameter.

Typical Usage Flow

  • Construct the Browser object — initialization checks required files, auto-generates missing files (like topic_coords.csv), copies the template, copies files into data/, validates TSVs, and writes config.json.
  • Start serving using b.serve() — this checks port availability, runs a subprocess HTTP server, and opens the browser in your default web browser (or prints instructions if a web browser cannot be opened).
  • Continue working or stop the server using b.stop_server() when done.

Required MALLET Files and Alternate Filenames

The Browser checks a small set of required files in mallet_files_path and supports some alternate names. Canonical names are:

  • metadata.csv (required)
  • topic-keys.txt (required)
  • doc-topic.txt (or doc-topics.txt) (required): the document-to-topic mapping; either name is accepted
  • topic-state.gz (or state.gz) (required): the MALLET state file; either name is accepted
  • topic_coords.csv (or topic-coords.csv) (optional): coordinates for topic display; auto-generated from topic-state.gz if missing
  • diagnostics.xml (optional): MALLET diagnostics file, copied if present

Behavior notes:

  • Auto-generation: If topic_coords.csv is missing, the Browser will automatically generate it from topic-state.gz using Jensen-Shannon divergence and multidimensional scaling (MDS). This ensures the browser has topic coordinates for visualization even if they weren't pre-computed.
  • If you supply a filename_map, keys are considered the original filenames found in mallet_files_path and values are destination names to use inside data/.
  • filename_map is flexible — if you reverse the mapping (i.e., use the destination as key and the source as value), Browser attempts to detect and correct the mapping.
  • For canonicalization, where both doc-topic.txt and doc-topics.txt are present, Browser will deduplicate and select the canonical destination doc-topic.txt.
  • Optional files like diagnostics.xml are automatically copied if they exist in the MALLET output directory.
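To give a sense of the auto-generation step, the sketch below computes pairwise Jensen-Shannon divergence between topics represented as word-probability lists. It is an illustration under assumptions, not Browser's implementation: the function names are hypothetical, the toy distributions are made up, and the subsequent MDS projection to two dimensions is omitted (a library MDS routine would normally take this matrix as input).

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2), bounded between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def pairwise_js(topics):
    """Return a symmetric matrix of JS divergences between topic distributions."""
    n = len(topics)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = js_divergence(topics[i], topics[j])
    return matrix


# Toy example: three "topics" over a two-word vocabulary.
distances = pairwise_js([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
```

Topics with no words in common end up at the maximum divergence of 1, and identical topics at 0, which is what lets MDS place similar topics close together in the display.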

Example: filename_map

  • Standard mapping when you want doc-topics.txt to be called doc-topic.txt in data/:
Browser(..., filename_map={"doc-topics.txt": "doc-topic.txt"})
  • You can also specify a partial map to only rename some files:
Browser(..., filename_map={"topic-keys.txt": "custom-topic-keys.txt"})
  • If your mapping was reversed (the key was the destination), Browser will attempt to handle that by checking if the key or value exists in the mallet_files_path and swapping behavior as necessary.
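The reversed-mapping handling described above could look roughly like the following. This is a sketch of the idea only (check which side of each pair exists in the source folder); normalize_filename_map is a hypothetical helper, and Browser's actual detection logic may differ.

```python
from pathlib import Path


def normalize_filename_map(filename_map, mallet_dir):
    """Return a source -> destination map, flipping entries that look reversed."""
    folder = Path(mallet_dir)
    normalized = {}
    for key, value in filename_map.items():
        key_exists = (folder / key).is_file()
        value_exists = (folder / value).is_file()
        if not key_exists and value_exists:
            # Only the value exists on disk, so it must be the source file.
            normalized[value] = key
        else:
            # Key exists (or neither does): keep the entry as given.
            normalized[key] = value
    return normalized
```

With only doc-topics.txt on disk, both {"doc-topics.txt": "doc-topic.txt"} and its reversed form normalize to the same source-to-destination mapping.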

data_path TSV Validation

If data_path is provided, Browser will ensure it is a non-directory file and validate each row. Each non-empty row must contain exactly 2 or 3 tab-separated columns. If any row fails this validation a ValueError is raised.

Browser will copy the data file to data/docs.txt and update the config accordingly.
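To check a data file before handing it to Browser, the row rule described above can be reproduced in a few lines. This is a minimal sketch, assuming UTF-8 input; validate_tsv is a hypothetical helper and the error message wording is not taken from the library.

```python
def validate_tsv(path):
    """Check that every non-empty row has 2 or 3 tab-separated columns.

    Returns the number of rows checked; raises ValueError on the first bad row.
    """
    checked = 0
    with open(path, encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            row = line.rstrip("\r\n")
            if not row.strip():
                continue  # blank lines are skipped, not counted as errors
            columns = row.split("\t")
            if len(columns) not in (2, 3):
                raise ValueError(
                    f"Line {line_number}: expected 2 or 3 tab-separated "
                    f"columns, found {len(columns)}"
                )
            checked += 1
    return checked
```

Running this first gives a line number for the offending row, which can be easier to act on than an error raised during Browser initialization.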


Merging config.json

Browser reads the template config.json (if it exists) and then merges the config provided by the caller. The merge rules are:

  1. User-supplied config keys are preserved and take precedence — Browser will not overwrite keys in self.config
  2. Browser sets *_file values (e.g. doc_topic_file, topic_keys_file, topic_state_file, etc.) to the files that were copied into data/ — but only when the user did not specify those keys in self.config
  3. Template values are used only where not overridden by the user or the copying process
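The precedence in these rules (user config over copied-file settings over template values) can be sketched as two layered merges. The code below is illustrative only: deep_merge and merge_config are hypothetical names, a recursive merge of nested dictionaries is an assumption, and the placement of the *_file keys at the top level of the config is chosen for the example rather than taken from the library.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where values in override win over values in base."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def merge_config(template, file_settings, user):
    """Apply precedence: template < copied-file settings < user config."""
    return deep_merge(deep_merge(template, file_settings), user)


# Hypothetical values, for illustration only.
template = {"application": {"name": "DFR Browser", "version": "x.y"}}
file_settings = {"doc_topic_file": "data/doc-topic.txt"}
user = {"application": {"name": "My Browser"}}
merged = merge_config(template, file_settings, user)
```

Here merged keeps the user's application name, the template's version, and the copied-file path, since the user did not set doc_topic_file.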

The merged config.json is written to browser_path/config.json and the merged config is saved back to b.config so it is available in-memory. Assigning b.config later (either by b.config = {...} or using config_browser()) also writes the new config to disk.

To explicitly update the config programmatically after the Browser instance has been initialized, you can call:

b.config_browser({"application": {"name": "New Title"}})

This sets b.config, which triggers a write to the config.json file.


Version Behavior

Accessing the version property (b.version) returns the DFR Browser 2 version number. The Browser class has a class-level BROWSER_VERSION that serves as the default. If the template or the user-specified config has an application.version, the property returns that value; otherwise it returns BROWSER_VERSION.

The property is intentionally defensive — if the config.json file is malformed, it will simply fall back to the default BROWSER_VERSION.
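The defensive fallback can be pictured as follows. This is a sketch of the behavior described above rather than the property's source; resolve_version is a hypothetical function and the BROWSER_VERSION value here is a placeholder, not the real default.

```python
import json

BROWSER_VERSION = "0.0.0"  # placeholder default, not the library's actual value


def resolve_version(config_path, default=BROWSER_VERSION):
    """Return application.version from config.json, or default on any problem."""
    try:
        with open(config_path, encoding="utf-8") as handle:
            config = json.load(handle)
        version = config["application"]["version"]
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, malformed JSON, or unexpected structure: fall back.
        return default
    return version if isinstance(version, str) and version else default
```

json.JSONDecodeError is a subclass of ValueError, so malformed JSON is covered by the same except clause as the other failure modes.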


Troubleshooting & Tips

  • If you get FileNotFoundError: Missing required mallet files, check mallet_files_path for expected files and confirm names or provide a filename_map to rename or canonicalize source files.
  • If you get RuntimeError: Server failed to start with a port conflict message, either use a different port by passing port=XXXX to serve(), or follow the instructions in the error message to terminate the conflicting process.
  • If topic_coords.csv is missing, it will be automatically generated from topic-state.gz. This may take a few seconds for large topic models.
  • If a config.json key is missing or incorrect, verify whether you set the config key in config (user-specified configs override template values) or whether Browser wrote a copied/data path into the merged config.json.
  • Use Browser.BROWSER_VERSION or b.version to inspect or assert the configured version.
  • To stop a running server, call b.stop_server() or interrupt the Python process.