This is the best practice / house style guide for the BBOP group. Inspired by / cribbed from Knocean practice and other sources.
We are a diverse group working on many different projects with
different stakeholders and sets of collaborators. Nevertheless we
strive to follow a set of core best practices so we can be most
efficient and develop the highest quality code, ontologies, standards,
schemas, and analyses.
Git and GitHub
- use git
- commit early, commit often
- perfect later!
- you should always be working on a branch, so don’t worry about breaking things
- Make repos public by default
- Use standard repo layouts
- Include standard files:
- README.md
- LICENSE (BSD3 preferred for software)
- CONTRIBUTING.md
- CODE_OF_CONDUCT.md (see for example the kgx CoC)
- Changes.md
- .gitignore
- Makefile or equivalent
- use GitHub
- Like GitLab in principle, but GitHub has network effect
- prefer to work on the main repo, not forks, but defer to project-specific guidelines
- use GitHub issues
- in general you should always be working to a ticket assigned to you
- try to assign every issue to somebody
- try to have a single assignee / responsible person
- tag people if necessary
- note: if you tag me with @cmungall it’s likely I won’t see it. alert me to a ticket via slack if my input is required
- use GitHub’s default labels: bug, question, enhancement, good first issue, etc.
- set up standard issue templates (helps ensure tickets are auto-assigned)
- use GitHub Pull Requests
- mark as draft until ready for review, then assign reviewers
- description should link to an issue “Resolves #1234”
- otherwise you have to clean up issues manually
- update description as needed
- always look over your PRs
- are there unexpected changes? You should only see YOUR changes
- Is it adding files unexpectedly? Some git clients are eager to do this
- are some changes not recognizable as yours? Be careful not to clobber
- follow repo-standard practice for rebase etc
- AVOID:
- making PRs too large
- mixing orthogonal concerns in one PR. Generally 1 PR = 1 issue
- mixing in formatting changes on sections of the code unrelated to the semantic changes you are making
- working on a PR for too long a time without feedback from others
- working on “invisible” branches. ALWAYS make a PR, ALWAYS push. You can mark as draft!
- use GitHub Milestones to plan releases
- use GitHub Releases to tag versions and attach binaries
- use semver
- use the auto-generate release notes feature (corollary: write informative PR titles and never commit on main)
- use GitHub Pages for simple static content and documentation
- prefer the docs/ directory option
- use GitHub Projects (“project boards”) for coordinating issues and PRs
- three columns:
- To do: for manager to fill and prioritize
- In progress: for developer to keep up-to-date
- Ready for review: for manager to empty
- order of preference for cards: PR link, issue link, text
- set up GitHub actions to do CI
- travis no longer recommended
- use GitHub actions
- All changes should be on PRs and thus validated
- main branch should never ever be failing
- set up GitHub teams
- default to public membership
- make sure it is clear who has permission to merge PRs
- set up badges
- always: CI
- pypi, downloads, codecov, zenodo, …
- Configure the “About” (see gear icon on right)
- Orgs
- define a standard topic (see above)
- include a .github repo
- exemplar: github.com/linkml
- pin repos
- read our GitHub Overview
- make sure all relevant artefacts are checked in
- use git status and .gitignore
- in general avoid checking in derived products (but see below)
- avoid checking in .xslx files (use TSVs; or consider cogs instead)
- versioning
- do not check in files with version numbers (e.g. foo.v1.txt) into GitHub - git does versioning for you
- use the GitHub release mechanism
- use ISO-8601 or semver schemes (see guidelines on specific repo types below)
- tend your repos
- remove cruft such as obsolete files (GitHub preserves history)
- avoid random stuff at top level
- keep README in sync
- avoid using spaces in filenames
- always use standard suffixes (e.g. .tsv, .txt, .md)
- kebab-case-is-a-good-default.txt. See filenames in google developer guide
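As a sanity check, the naming rules above can be encoded in a short validator. This is a sketch: the suffix whitelist here is illustrative, not a group-mandated list.

```python
import re

# Illustrative check for the file-naming rules above:
# kebab-case stem, lowercase, no spaces, and a standard suffix.
# The suffix list is an assumption for this example only.
GOOD_NAME = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*\.(?:tsv|txt|md|json|yaml)$")

def is_good_filename(name: str) -> bool:
    """Return True if `name` follows the kebab-case convention."""
    return GOOD_NAME.fullmatch(name) is not None
```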
- use topics and “star” relevant repos
- https://github.com/topics/linkml
- https://github.com/topics/obofoundry
- https://github.com/topics/geneontology
Software-centric Repos
- Use an existing repo from a group member as template for best practice, e.g.,
- Include a README.md
- provide sufficient context
- don’t boil the ocean - put reference material in a separate reference guide
- include examples and use txm to use these as tests
- Create reference documentation using RTD/Sphinx
- let inline docstrings in Python do most of the work for you
- read writethedocs
- Include installation instructions
- use an OSI approved LICENSE, BSD3 preferred
- Use unit tests
- consult others on framework
- Use GitHub-integrated CI
- formerly Travis
- use GitHub actions
- Release code to PyPI or appropriate repo
- use GitHub releases
- use GitHub actions to trigger releases to PyPI
- make release notes automatically see github guide
- relies on using PRs with well-described titles
- Consider a Dockerfile
- For ETL repos, follow standard templates
- Badges
- CI
- Code coverage
- PyPI
- TODO: ADD MORE
Schema/Standards-centric Repos
- You will be using linkml
- Create repo from LinkML template
- Examples:
- Register with w3id.org
- Include comprehensive examples
- Use LinkML mkdocs framework
- Understand the difference between OWL-centric and KG-centric modeling
- include mappings to biolink model
- always include examples
- integrate these with documentation
- integrate these with unit tests
- also include counter-examples
- data deliberately designed to fail validation
- check that validation correctly identifies these in GitHub Actions
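A minimal sketch of the pattern, with a hand-rolled check standing in for a LinkML-derived validator (the CURIE rule and field names here are illustrative, not from any real schema):

```python
import re

# Toy validation rule: `id` must be a CURIE.
# In practice, use a validator derived from your LinkML schema.
CURIE = re.compile(r"^[A-Za-z][\w.]*:\S+$")

def is_valid(instance: dict) -> bool:
    """Return True if the instance conforms to the (toy) schema."""
    value = instance.get("id")
    return isinstance(value, str) and CURIE.fullmatch(value) is not None

# a positive example, and a counter-example deliberately designed to fail
assert is_valid({"id": "GO:0008150"})
assert not is_valid({"id": "no curie here"})
```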
- enable zenodo syncing
Ontology-centric Repos
- Use ODK seed
- Register ontology with OBO
- include detailed metadata
- include all products
- include descriptive material in markdown
- Use GitHub for .owl distribution unless ontology is large, then consider:
- Follow group exemplars: Uberon, Mondo, GO, ENVO, CL, PATO
- but be aware each has their quirks
- distribute useful products
- distribute SSSOM
- always distribute an .obo
- always distribute a obo .json
- distribute a kgx file (NEW)
- distribute an rdftab sqlite file (NEW)
- use a sensible source format (foo-edit.owl)
- .obo is best for diffs but is less expressive and has gotchas for CURIEs
- functional syntax is often preferred
- for template-based ontologies, much of the source may be TSVs
- enable zenodo syncing
- Understand issues relating to git conflicts with ontologies
- .obo as source mitigates some of these
- See this thread
- See this post
- many issues have since been resolved but unfortunately some remain
Analysis/Paper-centric Repos
- One repo per paper
- Entire analysis must be reproducible via Makefile
- All steps:
- download
- clean/pre-process
- transform
- training
- evaluation
- check with Chris before using snakemake/CWL/alternatives
- Chris still uses biomake
- Use TSVs as default
- make pandas-friendly
- use unix newline characters, not dos
- use human readable but computationally friendly column headers
- NO ALL CAPS
- alphanumeric characters preferred
- spaces or underscores as word separators OK, but underscores preferred for formal formats
- csvkit is your friend
- ALL TSVs MUST have data dictionaries
- check in small-mid size data files (<10 MB)
- consider cogs if TSVs must be managed in google sheets
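A TSV following these rules can be read with nothing but the standard library; the column names below are invented for illustration.

```python
import csv
import io

# A pandas-friendly TSV: unix newlines, one header row,
# lowercase underscore-separated column names, no ALL CAPS.
tsv = "sample_id\tbody_site\tread_count\nS1\tgut\t10342\nS2\tskin\t8871\n"

rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))
assert rows[0]["body_site"] == "gut"
```

pandas reads the same content with `pd.read_csv(path, sep="\t")`.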
- use JSON for complex data
- use KGX for anything that should be modeled as a KG
- use descriptive filenames
- manage metadata in GitHub
- sync repo with Zenodo
- use S3 for larger files
- Dockerize
- Use Jupyter notebooks
- Consider Manubot
- Other recommended best practices
- enable zenodo syncing
Websites
- GitHub pages favored over google sites over wikis
- Manage and author content as markdown, managed in github, with PRs as for code
- Google Analytics and similar (recommendations TODO)
- avoid manually authoring anything that can be derived from metadata
- exemplars: obofoundry.github.io, this site
- use a CC license, CC-0 or CC-BY
Documentation
- See google guide on Writing inclusive documentation
- Avoid ableist language
- Avoid unnecessarily gendered language
- Avoid unnecessarily violent language
- all code, schemas, analyses, ontologies, MUST be documented
- documentation is a love-letter to your future self
- understand the Diataxis four-way distinction: tutorial, how-to, reference, explanation
- exemplar: obook
- exemplar: linkml docs
- google API documentation guide
- have strategies to avoid staleness and documentation being out of sync
- use inline documentation
- publish via appropriate framework (RTD for code, mkdocs for schema, etc)
- follow appropriate style guide
- examples, examples, examples
- fenced examples in markdown docs
- example standalone scripts
- example Jupyter notebooks
- double up: unit tests can serve as examples and vice versa
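One cheap way to make an example double as a unit test is a Python doctest; the function here is a made-up illustration.

```python
def kebabify(text: str) -> str:
    """Convert a phrase to a kebab-case filename stem.

    The example below is documentation and a test at the same time:

    >>> kebabify("My Analysis Results")
    'my-analysis-results'
    """
    return "-".join(text.lower().split())

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```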
- use Markdown as default
- RST acceptable for Sphinx projects
- Google docs acceptable for initial brainstorming
- Don’t use Wikis (mediawiki, GitHub wiki)
- Manage markdown docs as version control
- publish as static site (RTD, mkdocs, etc)
Coding/Python
- Python is the default language; use others as appropriate
- javascript/typescript for client-side
- don’t implement domain/business logic in js. use python + APIs
- use typescript
- Rust for speed
- Scala for performance reasoners
- Historically we used Java for anything requiring OWLAPI, but this is being phased out
- Chris still uses Prolog occasionally
- Why Python?
- ubiquitous, cross-platform
- good for scripting, interactive development
- strong ecosystem of libraries for almost anything
- Easy for developers to pick up
- Most bioinformaticians know it
- use for anything more than about 10 lines of Bash/Perl
- use Python 3.7+
- Conform to the group style guide, or at least some style guide
- pep-0008 for Python
- use type annotations PEP484
- google style guide
- See knocean/practices/python
- We are moving towards poetry for all repos
- See sssom-py as exemplar of our best practice
- Use black and flake8
- Use tox
- we use click not argparse
- undecided on pytest vs unittest
- Makefile defaults are good
- Note unlike Knocean, we make use of OO as appropriate
- document all public classes, methods, functions
- Always Use type annotations
- Always provide docstrings
- ReST » numpy-style docstrings or google style
- SOME standard is always better than none
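A sketch of a function following these conventions (google-style docstring plus type annotations); the function itself is hypothetical.

```python
from typing import List, Set

def merge_synonyms(primary: List[str], extra: List[str]) -> List[str]:
    """Merge two synonym lists, preserving order and dropping duplicates.

    Args:
        primary: Synonyms from the main source.
        extra: Additional synonyms to append.

    Returns:
        A de-duplicated list with primary synonyms first.
    """
    seen: Set[str] = set()
    merged: List[str] = []
    for syn in primary + extra:
        if syn not in seen:
            seen.add(syn)
            merged.append(syn)
    return merged
```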
- use flask/fastAPI for web apps
- NEVER author OpenAPI directly; ALWAYS derive
- avoid authoring complex data models
- use LinkML and derived datamodel classes
- use fstrings
- use typing
- makes code more understandable
- allows code completion in PyCharm etc
- helps find bugs
- use dataclasses or pydantic
- for DAOs, derive from linkml
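A minimal dataclass sketch; in a real project the class would be generated from a LinkML schema rather than hand-written.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class OntologyTerm:
    """Hypothetical DAO; derive the real one from LinkML."""
    id: str
    label: str
    synonyms: List[str] = field(default_factory=list)

term = OntologyTerm(id="GO:0008150", label="biological_process")
```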
- use an IDE
- PyCharm and VS Code are equally popular in the group
- ETL/ingest
- use requests for URL calls
- Always provide a CLI
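A minimal click CLI following the group preference; the command and option names are illustrative only.

```python
import click

@click.command()
@click.option("--name", default="world", show_default=True, help="Who to greet.")
def greet(name: str) -> None:
    """Toy command illustrating the click idiom."""
    click.echo(f"Hello, {name}!")

if __name__ == "__main__":
    greet()
```

click commands can be tested in-process with `click.testing.CliRunner`.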
- Learning resources
- TODO: Best practice for
- test framework (unittest vs pytest?)
- environments: venv vs pipenv vs poetry
- config: requirements.txt vs toml vs Pipenv vs setup.cfg…
- layout: src/name vs name
- linter: black?
Shell
Database Engines
- use whatever is appropriate for the job
- blazegraph for ttl
- neo4j for KGs
- Postgresql for SQL db server
- never use non-open SQL db solutions
- Some legacy apps may use MySQL but Pg is preferred
- sqlite for lightweight tabular
- avoid vendor lock-in
- use the generic SPARQL 1.1 API rather than triplestore-specific APIs
- solr for searchable / denormalized / analytics
- always have a schema no matter what the task
- always derive from LinkML
- SQL vs other DB engines
- GNU Make – see Knocean guide
- cogs
- odk
- q – query TSVs via SQL
- csvkit
- jq
- robot
- bash; small scripts only
- pandoc
- Docker
- editor of your choice
Programming Libraries
- Data science
- this is a fast changing field so recommendations here are general/loose
- generally prefer Python » R » other languages for data sciences
- we frequently use tensorflow, scikit-learn, keras
- catboost
- pandas
- TSV » CSV
- parquet for large files
- use # for header comments
- seaborn within Jupyter
- KGs
- kgx
- BMT
- EnsmallenGraph (Rust + Python bindings): fast graph ML
- Embiggen graph ML (e.g. node2vec), and some other things like word2vec
- NEAT is a Python wrapper for reproducible graph ML in a YAML-driven way
- also exploring pykeen ampligraph
- Ontologies
- ontobio
- OWLAPI (JVM) – only where necessary
- obographviz (js)
- beware of using rdflib and RDF-level libraries for working with OWL files, too low level
- never, ever use XML parsers to parse RDF/XML
- Ubergraph
- semsql
- NER/NLP
- fast changing but some tools to consider:
- runNER (which wraps OGER)
- BERT for language models (experimental)
- Data
- Code
- General
- TSVs for columnar data
- always have a data dictionary (use LinkML)
- make it pandas-friendly
- meaningful column names
- SSSOM is an exemplar
- understand TidyData and Codd’s normal forms and when to use them
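The wide-to-tidy distinction in pandas terms (the column names here are invented for illustration):

```python
import pandas as pd

# Wide/untidy: one column per body site
wide = pd.DataFrame({"gene": ["g1", "g2"], "gut": [5, 0], "skin": [2, 7]})

# Long/tidy: one observation per row
tidy = wide.melt(id_vars="gene", var_name="body_site", value_name="count")
```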
- hand-author YAML over JSON (+ follow schema)
- Use JSON-LD / YAML-LD as appropriate
- understand JSON-LD contexts
- get context for free with LinkML
- Turtle for some purposes
- RDF/XML as default for OWL
- Ontologies
- OWL
- OBO JSON
- consider obo format deprecated. Exception: easier to maintain edit file as obo for git diff/PR purposes
- COB as upper ontology, but also pay attention to biolink
- Always use official PURLs for downloads
- Mappings (ontology or otherwise)
- SSSOM with skos predicates
- KGs
- biolink
- kgx
- RDF*
- make available as:
- RDF dump
- Neo4J dump
- sparql endpoint (consider putting into larger endpoint and segregating with NGs)
- neo4j endpoint
- KGX dump
- KGX summary stats
- Schemas
- everything must have a schema, including:
- all TSVs should have data dictionary
- JSON/YAML
- KGs
- OWL ontologies and OWL instance graphs
- Understand basic concepts:
- normalized vs de-normalized
- identifiers and URIs
- closed-world vs open-world
- schema vs ontology
- Always author schemas in linkml
- derive alternate representations (e.g. json-schema)
- JSON-schema for JSON-centric projects (never author, always derive from LinkML)
- ShEx for ontology-centric (try and derive from LinkML)
- kwalify is deprecated for us
- Always have a LinkML schema even when using:
- python dicts
- open-ended JSON/YAML
- RDF
- Neo4J
- ad-hoc TSVs
- Include mappings:
- Versioning
- Semantic Versioning (semver) by default
- software MUST use semver
- schemas SHOULD use semver, but OBO-style may sometimes be appropriate
- ISO-8601 OBO style for OBO ontologies
- use GitHub releases for versioning as appropriate
- always use the autofill feature to make release notes and to name releases
- for software follow the group github-action best practice to auto-release to pypi
- release versions to appropriate repository/archive
- Compression
- use .gz instead of .zip
- if compressing multiple files in an archive, use .tar.gz, not .zip
- Remember compressed files are not diffable in git
- For very large files consider distributing gz files via S3 rather than in GitHub
- remember: if a repo has 10 x 50 MB files that change every release, the repo will be 10 GB in size after 20 releases
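The Python stdlib handles gzip directly, and pandas reads a .tsv.gz transparently with `pd.read_csv(path, sep="\t")`:

```python
import gzip

data = b"gene\tcount\ng1\t5\n"
packed = gzip.compress(data)
# round-trips losslessly; but remember compressed bytes are not git-diffable
assert gzip.decompress(packed) == data
```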
- Text
- markdown by default
- frontmatter metadata where appropriate
- track in version control
- use .rst for sphinx sites where autodoc features are needed
- APIs
- RESTfulness
- true REST may be too high a barrier
- RPC-style (i.e. swagger/openAPI) may be fine
- All web APIs should have OpenAPI exploration interface
- derive OpenAPI from Python code
- fastapi » flask » others
- Must have Docker container
- Use grlc or sparqlfun to make APIs from sparql endpoints
- CURIEs and IRIs
- Read McMurry et al.
- always use CURIEs for IDs
- always use prefixes registered in bioregistry.io
- understand at a broad level the different registries:
- http://identifiers.org
- http://n2t.net – synced(?) with identifiers.org but broader context
- http://bioregistry.io/
- has a lot of advantages over id.org: more transparent, github metadata based, lightweight
- https://github.com/prefixcommons/biocontext
- we developed this as an “overlay” on existing registries
- have an explicit JSON-LD context or prefixes yaml file
- Use the prefixcommons curie util library
- Read the identifiers guides closely, even for projects you are not on
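CURIE expansion is just a prefix-map lookup; the sketch below hand-rolls it for illustration. Real projects should use a curie util library with a prefix map from a registry such as bioregistry.io.

```python
# Illustrative prefix map; in real projects, load prefixes from a
# registered source (bioregistry.io) or a project JSON-LD context.
PREFIX_MAP = {
    "GO": "http://purl.obolibrary.org/obo/GO_",
    "UBERON": "http://purl.obolibrary.org/obo/UBERON_",
}

def expand_curie(curie: str) -> str:
    """Expand prefix:local_id into a full IRI."""
    prefix, local_id = curie.split(":", 1)
    return PREFIX_MAP[prefix] + local_id
```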
- Genomics
- Annotation
- Dates
- use ISO-8601
- use ISO-8601
- use ISO-8601
- use ISO-8601
- never, ever write a date in non-ISO-8601
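In Python the stdlib gives you ISO-8601 for free:

```python
from datetime import date, datetime, timezone

# dates
today = date(2021, 7, 4)
assert today.isoformat() == "2021-07-04"

# timestamps, with an explicit timezone
stamp = datetime(2021, 7, 4, 13, 5, 0, tzinfo=timezone.utc)
assert stamp.isoformat() == "2021-07-04T13:05:00+00:00"
```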
Portability
- it should be easy for anyone to install from any of our repos
- everything should run on macos or linux
- provide a Docker image for anything complex
- use standard installation idioms
Building Ontologies
- ontologies are for users, not ontologists
- OWL and description logic is necessary for building robust ontologies, but needn’t be exposed
- Minimize philosophy
- avoid unnecessary abstractions
- ontologies should have annotations
- annotations, as in the sense used by curators
- ontologies without annotations are generally of limited use, avoid working on them
- learn tools and best practice for robust ontology engineering
- Read my Onto-Tips
- Use ODK
- Use ROBOT
- Do the GO OWL tutorial
- For advanced OWL-centric tasks, use scowl
- Take the OBO Academy training
- work on the components on your own
- attend the Monarch tutorials
- use the ontologies we work on as exemplars
- GO
- Mondo
- Phenotype Ontologies
- ENVO
- Uberon
- RO
- follow OBO best practice and principles
- ontologies should be open
- if OBO is underspecified, follow the examples of projects done in this group
- oio over IAO
- liberal axiom annotations
- key annotation properties: synonyms, definitions, mappings
- See documentation on uberon synonyms, this is an exemplar for us
- Generally dosdp over robot template, but always use the more appropriate tool for the job
- include comprehensive definitions clear to biologists
- understand compositional patterns
- avoid overmodeling
- Document ontologies
- understand limitations
- use ontologies only where appropriate
- vocabularies
- descriptors
- don’t use an ontology where a schema is more appropriate
- don’t use an ontology where a KG is more appropriate. See KG vs ontology DPs
- make best effort attempt to provide mappings
Collaboration
- we are a collaborative group, reach out if you have issues
- join relevant channels on bbop and other slacks
- questions always welcome but make best effort to see if information available in group reference guides
- make things easier for those who follow you
- the same questions often come up repeatedly
- if someone answers a question for you, update the relevant guide to make it clearer for others
- follow codes of conduct
- be constructive in any criticism
- use your Berkeley Lab account for email, calendars
- keep your calendar up to date, this facilitates scheduling meetings
- slack
- avoid @channel unless necessary
- don’t be a channel anarchist
- discussion about tickets OK but decisions and key points must be recorded in ticket
- use GitHub for requests
- Use GitHub for requesting terms from ontologies etc
Google docs/slides/sheets hygiene
- Read Julie’s awesome guide
- Read [Data Organization in Spreadsheets for Ecologists](https://datacarpentry.org/spreadsheet-ecology-lesson/) from Data Carpentry
- Read Data Organization in Spreadsheets by Broman and Woo
- be consistent
- write dates like YYYY-MM-DD
- put just one thing in a cell
- do not merge cells
- organize the data as a single rectangle (with subjects as rows and variables as columns, and with a single header row)
- create a data dictionary
- do not include calculations in the raw data files
- do not use font color or highlighting as data
- choose good names for things
- make backups
- use data validation to avoid data entry errors
- save the data in plain text files
- Use google docs/slides over Microsoft/Apple/Desktop
- but sometimes markdown+git is more appropriate than either
- for grants, papers, and other collaborative documents, move to Word at last possible minute (if at all)
- pandoc can be used to make markdown
- avoid latex/beamer unless it is really called for
- Use tagging/comments/modes appropriately
- If it’s not your doc, default to Suggesting mode
- use your judgment; minor direct edits to correct typos usually OK
- respect conventions of document owner
- use comment feature to make comments, don’t write your comment in a different color
- avoid use of text color as semantics
- assign/tag people appropriately
- avoid comment wars
- Make the doc outline-mode-friendly
- use H1/H2/etc
- always have outline mode on (list-like icon near top left)
- assume the reader has outline mode on
- rarely need for a TOC
- For google sheets / excel
- never manually color code or use font/strikethrough. Always add an explicit field and use conditional formatting
- always have a schema, even if it is a flat data dictionary. linkml-model-enrichment will derive one
- Use formatted templates where appropriate (grants, papers)
- Use Paperpile for citations / reference management (you have access via the lab)
- Give documents meaningful names (e.g., not just “meeting”)–assume that most people will find the doc via search rather than by going through the folder hierarchy
- don’t use camelcase or underscores in google doc names, it hinders search
- Use a rolling agenda/notes doc, rather than one doc per meeting
- most recent first
- ISO-8601 » human readable dates » anything else
- The auto @today feature is useful
- always have a google doc for every meeting you are in
- include a link to the rolling doc in calendar invites
- include the Zoom / videoconference link in the rolling notes doc
- organize google docs in the relevant folder depending on what project is funding the work
- understand how navigation works for google docs
- make visible to all by default, unless sensitive
- include links to slides of general relevance from project repos
- favour TSV+github over google sheets
- unless workflows clearly favor sheets
- when using sheets, use cogs
- follow TSV guidelines for google sheets
- don’t use color for semantics. Always use conditional formatting for colors etc
- reuse slides from existing slide decks, but provide attribution
- Tips
DevOps
General Principles
- DRY: Don’t Repeat Yourself
- but avoid over-abstraction and frameworkitis
- various “ten simple rules” guides:
- Always reuse
- we probably have a Python library for it
- reuse general design patterns
- GitHub templates
- follow exemplar repos
- try especially hard not to reinvent what someone in the group or our collaborator has done
- Avoid perfectionism
- iterate on solutions
- smaller batches of incremental progress » long delays on perfect solution (that may turn out to be flawed)
- For many tasks, the 80/20 rule may suffice
- Don’t boil the ocean
- beware of rabbit holes
- More to come…