This is the best-practice / style guide for the BBOP group, inspired by (and partly cribbed from) Knocean practice.

We are a diverse group working on many different projects with different stakeholders and sets of collaborators. Nevertheless we strive to follow a set of core best practices so we can be most efficient and develop the highest quality code, ontologies, standards, schemas, and analyses.

Git and GitHub

  • use git
  • commit early, commit often
    • perfect later!
    • you should always be working on a branch, so don’t worry about breaking things
  • Make repos public by default
  • Use standard repo layouts
  • Include standard files:
    • README.md
    • LICENSE (BSD3 preferred for software)
    • CONTRIBUTING.md
    • CODE_OF_CONDUCT.md (see for example the kgx CoC)
    • Changes.md
    • .gitignore
    • Makefile or equivalent
  • use GitHub
    • similar to GitLab in principle, but GitHub has the network effect
    • prefer to work on the main repo, not forks, but defer to project-specific guidelines
  • use GitHub issues
    • in general you should always be working on a ticket assigned to you
    • try to assign every issue to somebody
    • try to have a single assignee / responsible person
    • tag people if necessary
      • note: if you tag me with @cmungall it’s likely I won’t see it; alert me to a ticket via slack if my attention is required
    • use GitHub’s default labels: bug, question, enhancement, good first issue, etc.
    • set up standard issue templates
  • use GitHub Pull Requests
    • mark as draft until ready for review, then assign reviewers
    • description should link to an issue “Resolves #1234”
      • otherwise you have to clean up issues manually
    • update description as needed
  • use GitHub Milestones to plan releases
  • use GitHub Releases to tag versions and attach binaries
  • use GitHub Pages for simple static content and documentation
    • prefer the docs/ directory option
  • use GitHub Projects (“project boards”) for coordinating issues and PRs
    • three columns:
      • To do: for manager to fill and prioritize
      • In progress: for developer to keep up-to-date
      • Ready for review: for manager to empty
    • order of preference for cards: PR link, issue link, text
  • set up GitHub actions to do CI
  • set up badges
  • read our GitHub Overview
  • tend your repos
    • remove cruft such as obsolete files (GitHub preserves history)
    • avoid random stuff at top level
    • keep README in sync
    • avoid using spaces in filenames
    • always use standard suffixes (e.g. .tsv, .txt, .md)
    • kebab-case-is-a-good-default.txt
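As a quick illustration of the filename conventions above (no spaces, standard lowercase suffixes, kebab-case), a shell pipeline can normalize a name; the example filename and the `tr` recipe are just a sketch, not a prescribed tool:

```shell
# Normalize a filename to kebab-case with a lowercase suffix.
# The input name here is hypothetical.
name="My Data File.TSV"
kebab=$(echo "$name" | tr '[:upper:]' '[:lower:]' | tr ' _' '--')
echo "$kebab"   # prints: my-data-file.tsv
```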

Software-centric Repos

  • Use an existing repo from a group member as a template for best practice
  • Include a README.md
    • provide sufficient context
    • don’t boil the ocean - put reference material in a separate reference guide
  • Create reference documentation using RTD/Sphinx
    • let inline docstrings in Python do most of the work for you
  • Include installation instructions
  • use an OSI approved LICENSE, BSD3 preferred
  • Use unit tests
    • consult others on framework
  • Use GitHub-integrated CI
    • we formerly used Travis; now use GitHub Actions
  • Release code to PyPI or appropriate repo
    • use GitHub releases
    • use GitHub actions to trigger releases to PyPI
  • Consider a Dockerfile
  • For ETL repos
    • follow standard templates for kg-hub and koza
    • use Jenkins pipelines
  • Badges
    • CI
    • Code coverage
    • PyPI
    • TODO: ADD MORE
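As a sketch of letting inline docstrings drive the RTD/Sphinx reference docs, a function like the following (the name and behavior are hypothetical) needs no separately maintained documentation — Sphinx autodoc pulls the docstring straight into the reference pages:

```python
def normalize_curie(curie: str) -> str:
    """Normalize a CURIE prefix to uppercase (hypothetical example).

    Sphinx autodoc renders this docstring into the reference docs,
    so the code stays the single source of truth.

    :param curie: a prefix:local_id string, e.g. "go:0008150"
    :return: the CURIE with its prefix uppercased, e.g. "GO:0008150"
    """
    prefix, _, local_id = curie.partition(":")
    return f"{prefix.upper()}:{local_id}"
```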

Schema/Standards-centric Repos

  • You will be using linkml
  • Create repo from LinkML template
  • Examples:
  • Register with w3id.org
  • Include comprehensive examples
  • Use LinkML mkdocs framework
  • Understand the difference between OWL-centric and KG-centric modeling
  • include mappings to biolink model
  • always include examples
    • integrate these with documentation
    • integrate these with unit tests
  • enable zenodo syncing
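A minimal LinkML schema might look like the following sketch; the schema ID, class, and attributes are all hypothetical placeholders, not a real project schema:

```yaml
# Minimal LinkML schema sketch; all names are hypothetical.
id: https://w3id.org/example/person-schema
name: person-schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string
classes:
  Person:
    description: A person taking part in a study
    attributes:
      id:
        identifier: true
      name:
        description: the person's full name
```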

Ontology-centric Repos

  • Use ODK seed
  • Register ontology with OBO
    • include detailed metadata
    • include all products
    • include descriptive material in markdown
  • Use GitHub for .owl distribution unless ontology is large, then consider:
    • GitHub releases
    • S3
  • Follow group exemplars: Uberon, Mondo, GO, ENVO, CL, PATO
    • but be aware each has its quirks
  • distribute useful products
    • distribute SSSOM
    • always distribute an .obo file
    • always distribute an OBO JSON (.json) file
    • distribute a kgx file (NEW)
  • enable zenodo syncing

Analysis/Paper-centric Repos

  • One repo per paper
  • Entire analysis must be reproducible via Makefile
    • All steps:
      • download
      • clean/pre-process
      • transform
      • training
      • evaluation
    • check with Chris before using snakemake/CWL/alternatives
    • Chris still uses biomake
  • Use TSVs as default
  • ALL TSVs MUST have data dictionaries
    • use LinkML (see above)
  • check in small-to-mid-size data files (<10 MB)
    • consider cogs if TSVs must be managed in google sheets
  • use JSON for complex data
  • use KGX for anything that should be modeled as a KG
  • use descriptive filenames
  • manage metadata in GitHub
  • sync repo with Zenodo
  • use S3 for larger files
    • release files to Zenodo
  • Dockerize
  • Use Jupyter notebooks
  • Consider Manubot
  • Other recommended best practices
  • enable zenodo syncing
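The reproducible-pipeline requirement above can be sketched as a Makefile; every target, script, and URL here is hypothetical, but the download → clean → train → evaluate dependency chain is the point (note that recipe lines must start with a TAB):

```make
# Sketch of a reproducible analysis Makefile; filenames, scripts,
# and the download URL are hypothetical placeholders.
all: results/evaluation.tsv

data/raw.tsv:
	mkdir -p data
	curl -L -o $@ https://example.org/dataset.tsv

data/clean.tsv: data/raw.tsv
	python scripts/clean.py $< > $@

results/model.pkl: data/clean.tsv
	mkdir -p results
	python scripts/train.py $< $@

results/evaluation.tsv: results/model.pkl data/clean.tsv
	python scripts/evaluate.py $^ > $@
```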

Websites

  • GitHub Pages » Google Sites » wikis
  • Author and manage content as markdown in GitHub, with PRs as for code
  • Google Analytics and similar (recommendations TODO)
  • avoid manually authoring anything that can be derived from metadata
    • exemplars: obofoundry.github.io, this site
  • use a CC license, CC-0 or CC-BY

Documentation

  • all code, schemas, analyses, and ontologies MUST be documented
  • code documentation is a love-letter to your future self
  • understand this four-way distinction: tutorial, how-to, reference, explanation
  • have strategies to avoid staleness and documentation being out of sync
  • use inline documentation
    • publish via appropriate framework (RTD for code, mkdocs for schema, etc)
    • follow appropriate style guide
  • examples, examples, examples
    • fenced examples in markdown docs
    • example standalone scripts
    • example Jupyter notebooks
    • unit tests can serve as examples
  • use Markdown as default
    • RST acceptable for RTD
    • Google docs acceptable for initial brainstorming
    • Don’t use Wikis (mediawiki, GitHub wiki)
    • Manage markdown docs under version control
    • publish as static site (RTD, mkdocs, etc)

Coding/Python

  • Python is the default language; use others as appropriate
    • javascript/typescript for client-side
      • don’t implement domain/business logic in JS; use Python + APIs
    • Rust for speed
    • Scala for performance reasoners
    • historically we used Java for anything requiring OWLAPI, but this is being phased out
    • Chris still uses Prolog
  • Why Python?
    • ubiquitous, cross-platform
    • good for scripting, interactive development
    • strong ecosystem of libraries for almost anything
    • Easy for developers to pick up
    • Most bioinformaticians know it
    • use Python for anything longer than about 10 lines of Bash/Perl
    • use Python 3.6+
  • Conform to the group style guide, or at least some style guide
  • use Flask/FastAPI for web apps
    • don’t author OpenAPI directly; derive
  • avoid authoring complex data models
    • use LinkML and derived datamodel classes
  • use typing
  • use dataclasses or pydantic
  • use an IDE
    • PyCharm is most popular
  • ETL/ingest
  • use requests for URL calls
  • Always provide a CLI
    • use click
    • use de-facto standards
      • -i, --input
      • -o, --output
  • TODO: Best practice for
    • test framework (unittest vs pytest?)
    • environments: venv vs pipenv
    • config: requirements.txt vs toml vs Pipenv vs setup.cfg…
    • layout: src/name vs name
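A small sketch of the typing + dataclasses recommendation above; the `Association` class and its fields are hypothetical, not a prescribed model:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Association:
    """A subject-predicate-object association (hypothetical example)."""
    subject: str
    predicate: str
    object: str
    publications: List[str] = field(default_factory=list)

# type hints + dataclasses give free __init__, __repr__, and __eq__
a = Association("GO:0008150", "rdfs:subClassOf", "GO:0003674")
```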

Database Engines

  • use whatever is appropriate for the job
    • blazegraph for ttl
    • neo4j for KGs
    • sqlite for lightweight tabular
    • avoid vendor lock-in
      • use generic sparql 1.1 API vs triplestore specific APIs
    • solr for searchable / denormalized / analytics
  • always have a schema no matter what the task
    • always derive from LinkML
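For the lightweight-tabular case, Python’s stdlib `sqlite3` is often all you need; a minimal sketch with made-up rows:

```python
import sqlite3

# In-memory SQLite database; the table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE term (id TEXT PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO term VALUES (?, ?)",
    [("GO:0008150", "biological_process"), ("GO:0003674", "molecular_function")],
)
rows = conn.execute("SELECT label FROM term ORDER BY id").fetchall()
# rows == [("molecular_function",), ("biological_process",)]
```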

Developer tools

  • GNU Make
  • cogs
  • linkml
  • odk
  • robot
  • bash; small scripts only
  • pandoc
  • Docker
  • editor of your choice

Programming Libraries

  • Data science
    • this is a fast changing field so recommendations here are general/loose
    • generally prefer Python » R » other languages for data science
    • we frequently use tensorflow, scikit-learn, keras
    • catboost
    • pandas
      • TSV » CSV
      • parquet for large files
      • use # for header comments
    • seaborn within Jupyter
    • KGs
      • kgx
      • BMT
      • EnsmallenGraph (Rust with Python bindings), fast graph ML
      • Embiggen for graph ML (e.g. node2vec), plus related methods such as word2vec
      • NEAT is a Python wrapper for reproducible graph ML in a YAML-driven way
      • also exploring pykeen and ampligraph
  • Ontologies
    • ontobio
    • OWLAPI (JVM) – only where necessary
    • obographviz (js)
    • beware of using rdflib and other RDF-level libraries for working with OWL files; they are too low-level
    • never, ever use XML parsers to parse RDF/XML
    • Ubergraph
  • NER/NLP
    • fast changing but some tools to consider:
      • runNER (which wraps OGER)
      • BERT for language models (experimental)
  • Data
  • Code
    • typing
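A sketch of the pandas-friendly TSV conventions above (TSV » CSV, `#` for header comments); the data is made up, and pandas is assumed to be installed:

```python
import io
import pandas as pd

# A TSV with "#" header comments, as recommended; the rows are made up.
tsv = """# generated: 2021-01-01
# source: hypothetical
gene\ttaxon
BRCA1\t9606
brca1\t7955
"""
# comment="#" makes pandas skip the header-comment lines
df = pd.read_csv(io.StringIO(tsv), sep="\t", comment="#")
# df has columns ["gene", "taxon"] and 2 rows
```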

File formats, languages, and standards

  • General
    • TSVs for columnar data
      • always have a data dictionary (use LinkML)
      • make it pandas-friendly
      • meaningful column names
      • SSSOM is an exemplar
    • hand-author YAML over JSON (+ follow schema)
    • Use JSON-LD / YAML-LD as appropriate
      • understand JSON-LD contexts
      • get context for free with LinkML
    • Turtle for some purposes
    • RDF/XML as default for OWL
  • Ontologies
    • OWL
    • OBO JSON
    • consider obo format deprecated; exception: it can be easier to maintain the edit file as .obo for git diff/PR purposes
    • COB as upper ontology, but also pay attention to biolink
    • Always use official PURLs for downloads
  • Mappings (ontology or otherwise)
    • SSSOM with skos predicates
  • KGs
    • biolink
    • kgx
    • RDF*
    • make available as:
      • RDF dump
      • Neo4J dump
      • sparql endpoint (consider putting into larger endpoint and segregating with NGs)
      • neo4j endpoint
      • KGX dump
      • KGX summary stats
  • Schemas
    • everything must have a schema, including:
      • all TSVs should have data dictionary
      • JSON/YAML
      • KGs
      • OWL ontologies and OWL instance graphs
    • Understand basic concepts:
      • normalized vs de-normalized
      • identifiers and URIs
      • closed-world vs open-world
      • schema vs ontology
    • Always author schemas in linkml
      • derive alternate representations (e.g. json-schema)
    • JSON-schema for JSON-centric projects (never author, always derive from LinkML)
    • ShEx for ontology-centric (try and derive from LinkML)
    • kwalify is deprecated for us
    • Always have a LinkML schema even when using
      • python dicts
      • open-ended JSON/YAML
      • RDF
      • Neo4J
      • ad-hoc TSVs
    • Include mappings:
      • map to biolink
  • Versioning
    • Semantic Versioning (semver) by default
    • ISO-8601 OBO style for ontologies
    • use GitHub releases for versioning as appropriate
    • release versions to appropriate repository/archive
  • Text
    • markdown by default
      • frontmatter metadata where appropriate
      • track in version control
  • APIs
    • RESTfulness
      • true REST may be too high a barrier
      • RPC-style (i.e. swagger/openAPI) may be fine
    • All web APIs should have OpenAPI exploration interface
    • derive OpenAPI from Python code
      • flask or fastapi
    • Must have Docker container
    • Use grlc to make APIs from sparql endpoints
  • CURIEs and IRIs
  • Genomics
    • GFF3
    • SO
  • Annotation
    • GAF
    • GPAD
    • Phenopackets
  • Dates
    • If you don’t use ISO-8601 you will go to hell
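Python’s stdlib `datetime` reads and writes ISO-8601 directly, so there is no excuse; a minimal sketch:

```python
from datetime import date, datetime, timezone

# ISO-8601 output comes for free via isoformat()
d = date(2021, 3, 14).isoformat()  # '2021-03-14'
ts = datetime(2021, 3, 14, 9, 26, 53, tzinfo=timezone.utc).isoformat()
# ts == '2021-03-14T09:26:53+00:00'

# and parsing back is symmetric
parsed = date.fromisoformat("2021-03-14")
```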

Portability

  • it should be easy for anyone to install from any of our repos
  • everything should run on macos or linux
  • provide a Docker image for anything complex
  • use standard installation idioms
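A minimal Dockerfile sketch for a typical Python tool; the base image, module name, and repo layout are assumptions, not a prescribed setup:

```dockerfile
# Sketch of a Dockerfile for a Python tool; names are hypothetical.
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
ENTRYPOINT ["python", "-m", "mytool"]
```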

Key specialized libraries and command line tools

  • ontobio, for ontologies and associations
  • kgx
  • ODK and ROBOT, for ontologies
  • runNER for NER

Building Ontologies

  • ontologies are for users, not ontologists
    • OWL and description logic are necessary for building robust ontologies, but needn’t be exposed
    • Minimize philosophy
    • avoid unnecessary abstractions
  • ontologies should have annotations
    • annotations, as in the sense used by curators
    • ontologies without annotations are generally of limited use, avoid working on them
  • learn tools and best practice for robust ontology engineering
    • Read my Onto-Tips
    • Use ODK
    • Use ROBOT
    • Do the GO OWL tutorial
    • For advanced OWL-centric tasks, use scowl
  • use the ontologies we work on as exemplars
    • GO
    • Mondo
    • Phenotype Ontologies
    • ENVO
    • Uberon
    • RO
  • follow OBO best practice and principles
    • ontologies should be open
    • if OBO is underspecified, follow the examples of projects done in this group
      • oio over IAO
      • liberal axiom annotations
      • key annotation properties: synonyms, definitions, mappings
      • See documentation on uberon synonyms, this is an exemplar for us
      • dosdp over robot, but always use the more appropriate tool for the job
  • include comprehensive definitions clear to biologists
  • understand compositional patterns
  • avoid overmodeling
  • Document ontologies
  • understand limitations
  • use ontologies only where appropriate
    • vocabularies
    • descriptors
    • don’t use an ontology where a schema is more appropriate
    • don’t use an ontology where a KG is more appropriate. See KG vs ontology DPs
  • make best effort attempt to provide mappings
    • use SSSOM
    • use boomer

Collaboration

  • we are a collaborative group, reach out if you have issues
    • join relevant channels on bbop and other slacks
    • questions are always welcome, but make a best effort to check whether the information is available in group reference guides
  • make things easier for those who follow you
    • the same questions often come up repeatedly
    • if someone answers a question for you, update the relevant guide to make it clearer for others
  • follow codes of conduct
  • be constructive in any criticism
  • use your Berkeley Lab account for email, calendars
  • keep your calendar up to date, this facilitates scheduling meetings
  • slack
    • avoid @channel unless necessary
    • don’t be a channel anarchist
    • discussion about tickets is OK, but decisions and key points must be recorded in the ticket
  • use GitHub for requests
  • Data mapping guide: selecting and requesting terms from ontologies, data models, and standards

Google docs/slides/sheets hygiene

  • Use google docs/slides over Microsoft/Apple
    • but sometimes markdown+git is more appropriate than either
    • for grants, papers, and other collaborative documents, move to Word at last possible minute (if at all)
    • pandoc can be used to convert documents to markdown
    • avoid latex/beamer unless it is really called for
  • Use tagging/comments/modes appropriately
    • If it’s not your doc, default to Suggesting mode
      • use your judgment; minor direct edits to correct typos usually OK
      • respect conventions of document owner
    • use comment feature to make comments, don’t write your comment in a different color
    • avoid use of text color as semantics
    • assign/tag people appropriately
    • avoid comment wars
  • Make the doc outline-mode-friendly
    • use H1/H2/etc
    • always have outline mode on (list-like icon near top left)
    • assume the reader has outline mode on
    • a TOC is rarely needed
  • Use formatted templates where appropriate (grants, papers)
  • Use Paperpile for citations / reference management (you have access via the lab)
  • Give documents meaningful names (e.g., not just “meeting”); assume that most people will find the doc via search rather than by going through the folder hierarchy
  • Use a rolling agenda/notes doc, rather than one doc per meeting
  • always have a google doc for every meeting you are in
  • include a link to the rolling doc in calendar invites
  • include the Zoom / videoconference link in the rolling notes doc
  • organize google docs in the relevant folder depending on what project is funding the work
  • understand how navigation works for google docs
  • make visible to all by default
  • include links to slides of general relevance from project repos
  • favor TSV+GitHub over google sheets
    • unless workflows clearly favor sheets
    • when using sheets, use cogs
  • follow TSV guidelines for google sheets
  • don’t use color for semantics; if colors are needed, apply them via conditional formatting
  • reuse slides from existing slide decks, but provide attribution

General Principles

  • DRY: Don’t Repeat Yourself
    • but avoid over-abstraction and frameworkitis
  • Always reuse
    • we probably have a Python library for it
    • reuse general design patterns
    • GitHub templates
    • follow exemplar repos
    • try especially hard not to reinvent what someone in the group or our collaborator has done
  • Avoid perfectionism
    • iterate on solutions
    • smaller batches of incremental progress » long delays on perfect solution (that may turn out to be flawed)
  • For many tasks, the 80/20 rule may suffice
    • Don’t boil the ocean
    • beware of rabbit holes
  • More to come…