The Jornada IM Manual
JRN Metabase (jrn_metabase
) is a PostgreSQL database that can be stored and accessed on either a local or remote host. Generally we configure it on a remote host for multi-user access. PostgreSQL uses a client/server model, which means that the database server application (postgres
) runs on the host system and manages the databases and all incoming connections from client applications. Users can choose from a number of client applications to connect to a server and database(s), either locally (from the server’s host machine), or remotely. The standard, commandline client interface to PostgreSQL is psql
, which can be, or already is, installed on most computers. We also use DBeaver as a graphical client for jrn_metabase
. Links to PostgreSQL and community documentation are on the Postgres Links page.
For the remote host, we use an Ubuntu server running an up-to-date PostgreSQL server. To access a remote host over a terminal connection use:
ssh <username>@<host name or IP>
Note that if the host you are accessing is a Jornada server you will need to use the Jornada VPN from outside Wooton Hall.
To install PostgreSQL, use the most current installation method for the host’s operating system. We installed the default packages available in the latest version of Ubuntu Server. In Linux systems (Ubuntu, Debian, macOS, etc), installation of PostgreSQL creates a system user and a database server role that are both named postgres
. The PostgreSQL administrative shell client, called `psql’, is also installed by default. Many more details on PostgreSQL server administration setup and administration can be found in the official PG Administrator Guide.
psql
clientWhen logged into the PostgreSQL host, any system user with sudoer privileges can switch to the postgres
user and enter the psql
shell with:
sudo -u postgres psql
The postgres=#
prompt will appear indicating you have entered the shell in the postgres
role, which is the default administrative role with superuser privileges. It has no password set on a new install(, so that will need to be set according to instructions below.)
Other PostgreSQL roles should be created for database users. Once these are created and remote access is configured on the host, psql
can be run from remote clients (if installed) so that users can login to databases on the host using commands like:
psql -U <rolename> -h <hostname or IP> -p <postgres port> <database name>
In general the postgres port is 5432.
There are several configurations to set to allow remote access to a PostgreSQL database cluster. Some of these changes involve editing config files and some involve using psql
. Most likely you’ll do this from your user account on the host machine. Check these instructions for updates.
Edit postgresql.conf
to allow remote connections. To do this, open the file (usually in /etc/postgresql/<version>/main
) and locate the listen_addresses='localhost'
line, uncomment if needed, and change it to:
listen_addresses=’*’
Now give the postgres
user a password. Enter the psql
as the default user (postgres)
sudo -u postgres psql # assuming logged on to host as a sudoer
Then change the postgres role’s password to something more secure
postgres=# ALTER USER postgres WITH ENCRYPTED PASSWORD '<password>';
Alter the PostgreSQL authentication config file to allow user postgres
to authenticate with md5 when making a remote (TCP/IP) connection.
sudo vim /etc/postgresql/12/main/pg_hba.conf
Add a line that looks like like this:
hostssl template1 postgres 0.0.0.0/0 scram-sha-256
You could also restrict by database, or ip.
For other users, you can add a similar line to pg_hba.conf
beneath the one above to allow remote connections - just change postgres
to the user name. This can allow users from any IP (0.0.0.0/0) to login using md5. You could also let all users in this way:
host all all 0.0.0.0/0 md5
But it isn’t that secure.
There are some recommended roles to add to a PostgreSQL cluster for LTER_core_metabase (defined here). This will most likely be done in psql while logged into the host.
Create a role for the database owner and set password
postgres=# CREATE ROLE <name> CREATEDB CREATEROLE LOGIN;
postgres=# ALTER USER <name> WITH ENCRYPTED PASSWORD '<password>';
Create other roles specified for LTER_core_metabase. If you create these before creating the LTER_core_metabase in the steps below, they will be granted the correct permissions to the schema and tables.
CREATE ROLE read_write_metabase;
CREATE ROLE read_only_metabase;
If for some reason permissions for these roles need correction, or a new role needs to be added, you might need to re-run the permission granting section in the database dump for LTER-core-metabase (onebigfile.sql
), potentially after substituting in the new role name. JRN created a separate user for one of its metabases (jrn_metabase_dev) using this method.
After making changes on server restart the postgres server.
sudo systemctl restart postgresql.service
There are some PostgreSQL tools available in Linux userspace, so while logged in to the host, you can create a testing database for a user role with:
createdb -O <username> <databasename>
Or you can log into psql
as a particular role and do:
username=# CREATE DATABASE <databasename>
Once this database is created you can log in to the database from the system shel (Note the uppercase -U flag to denote the user):
psql -U <username> <databasename>
Or connect from within psql:
username=# \c <databasename>
After logging in you can issue SQL commands and queries or use the psql
metacommands that are prepended by a backslash and described here.
At Jornada, we are basically using a “stock” version of LTER-core-metabase. It only takes a few minor modifications to install the source.
On the host machine, clone the lter_core_metabase git repository then cd
into the directory.
Edit the 2 sql files, sql/00_create_db.sql
and sql/onebigfile.sql
, to replace ‘%db_owner%’ with the name of the database owner role you created. This could be done with a standard text editor or sed
.
Create the database:
sudo -u postgres psql -f GitHub/LTER-core-metabase/sql/00_create_db.sql
If there is a locale error you may edit locales in 00_create_db.sql
to one present on your system (C.UTF-8
worked best for JRN), and/or create a new locale for your system (locale-gen...
).
Set up the schema with onebigfile.sql
(this is if logged on to host).
psql -U <db_owner username> -h localhost -d lter_core_metabase < GitHub/LTER-core-metabase/sql/onebigfile.sql
There are patches created for LTER_core_metabase periodically that may add new features or fix bugs between versions. These are in the migration branch of the LTER-core-metabase
GitHub repository. You should only apply the patches that are not already in onebigfile.sql
and have not been already installed to your metabase. Check which patches are installed by consulting the pkg_mgmt.version_tracker_metabase
table.
Some logical steps to install a patch are:
jrn_metabase
LTER-core-metabase
repository.Make a copy of the patches for local editing:
cp sql/44_add_provider_id_taxonomy_vw.sql sql_jrn/44_add_provider_id_taxonomy_vw.jrn.sql
Edit the patch to make it compatibile with your metabase configuration. In most cases (assuming the patch is well-tessted) this should easy, and the main task is usually to replace %db_owner%
with the appropriate Postgres role for the installed metabase.
vim sql_jrn/33_semantic_annotation.jrn.sql
Run the patch SQL by logging to the target database as postgres
(or another authorized user; sudo -u postgres psql jrn_metabase
) and issuing:
\i GitHub/LTER-core-metabase/sql_jrn/33_semantic_annotation.jrn.sql
using psql
from the system prompt may work, though there may be permission problems with this:
psql -U <username> -h <hostname> -d <databasename> < GitHub/LTER-core-metabase/sql_jrn/33_semantic_annotation.sql
The JRN Metabase is usually hosted on a server running Linux or a similar OS. In the case of Ubuntu/Debian systems, keep the OS and PostgreSQL up to date with apt
. Important tasks are listed below. For full documentation of PostgreSQL server administration see the official PG Administrator Guide.
Patches are periodically released and are available on the migration branch of the LTER GitHub repository. They are pretty easy to install (see setup document) but they may or may not be needed depending on how our database has evolved. Discuss with the patch creator before installing.
Is there a way to export patches if we change things?
Log in to the psql
shell either from a local terminal (sudo -u postgres psql
) or from a remote client.
psql -U <role name> -h <host name or IP> -p 5432 <database name>
Add a role for the user and assign a password:
postgres=# CREATE ROLE <name> <OTHER OPTIONS> LOGIN;
postgres=# ALTER USER <name> WITH ENCRYPTED PASSWORD '<password>';
Note that LOGIN roles are needed to make initial connections to a database, so normal users should have this. CREATE USER grants LOGIN automatically.
Email the user the new role/user name and password and ask them to change their password using the instructions below.
Grant or revoke the desired editing roles (list with \du
) to user roles:
postgres=# GRANT group_role TO role1, ... ;
postgres=# REVOKE group_role FROM role1, ... ;
In the case of LTER_core_metabase, the important roles are read_only_metabase
and read_write_metabase
.
To configure remote access (TCP/IP) for new users, they will need to be allowed in the pg_hba.conf
file in some form. See the basics of this in the setup page and PostgreSQL specifics for the pg_hba.conf
file.
The administrator (site IM for now), will email you a username and password that will allow you to login to the host and the PostgreSQL server. You will need to at least change the PostgreSQL server password for your user account.
Open a terminal on your computer and check to see if you have the PostgreSQL client (psql
) installed. The command below should return version info. If it doesn’t you need to install psql
.
psql --version
Issue the following command from your terminal:
psql -U <username> -h <host name or IP> -p 5432 -c "ALTER USER <username> WITH ENCRYPTED PASSWORD '<new_password>';" <database name>
where anything in angle brackets needs to be replaced with your username, server, and database information. Don’t forget to leave the single quotes around your new password, but you will leave them out when accessing the database.
If you don’t have psql
available and can connect to jrn_metabase
server with DBeaver, you may instead open an SQL console or SQL script window and type and execute this command:
ALTER USER <username> WITH ENCRYPTED PASSWORD '<password>';
In either case you will need to change the password in your locally-stored connection info (e.g. for DBeaver and R) for future logins.
User roles that no longer need access to a database or cluster should be dropped. The DROP ROLE SQL command will do this, as will the dropuser client application. Often a role will have ownership of database objects, so those ownerships need to be reassigned before dropping. See discussion here.
To copy a database, including schema and data, use:
postgres=# CREATE DATABASE <new_database_name> WITH TEMPLATE <template_database_name>;
More examples here.
To backup the database the basic steps are to dump the database to an SQL file using the pg_dump
utility:
pg_dump -F p the_db_name > the_backup.sql
Some options for pg_dump
are:
-F c|d|t|p
output file format (custom, directory, tar, plain text)-C
include commands to create database in dump-O
no owner-s
schema only-b
include large objects in the dump-v
verbose messages-f
specifies the backup file name-d
specifies the database to backup-U
specifies the user to use-h
specifies the hostSo, to create a backup to a client machine, for example, run:
pg_dump -h <hostname> -U postgres -F p <dbname> > ./path/dbname.bak.sql
Backups using pg_dump
can also be initiated from DBeaver or pgAdmin but not sure how yet.
A backup can be restored with some variation of:
psql the_db_name < the_backup.sql
Basic info here.
Regular database backups and backup rotation should be scheduled on the server - probably with cron
. There are some scripts in the lter_metabase_utils
repository that are based on these.
There are 2 versions of JRN metabase - a development copy called jrn_metabase_dev
and the production version (jrn_metabase
). Periodically, when the development database is well tested and stable, it should be copied to the production version. There are different ways to do this, the easiest of which is probably to delete jrn_metabase
, create a new, empty database with that name, and then restore it with a nightly backup of jrn_metabase_dev
.
On the host shell:
dropdb jrn_metabase
In psql:
CREATE DATABASE jrn_metabase;
On the host shell:
psql -U <username> -d jrn_metabase -f /home/backups/postgresql/2021-04-06-daily/jrn_metabase_dev.sql
The nightly backup will need to be unzipped first. It is also possible do it all in psql with something like:
DROP DATABASE jrn_metabase;
CREATE DATABASE jrn_metabase WITH TEMPLATE jrn_metabase_dev;
To copy the database to a new host the basic steps are to dump the database to an SQL file using the pg_dump
utility (see above), then copy this file to the new host and restore with:
psql the_new_dev_db < the_backup.sql
This can feasably all be done in one command:
pg_dump -C -h remotehost -U remoteuser dbname | psql -h localhost -U localuser dbname
Or the SQL file can be dumped to localhost like this:
ssh remoteuser@remotehost "pg_dump -U remoteuser dbname -h localhost -C --column-inserts" > ~/Desktop/dbname.bak.sql
See discussion here
Roles are also important - to export roles and restore them in a new cluster use:
ssh remoteuser@remotehost "pg_dumpall --roles-only -U remoteuser -h localhost" > ~/Desktop/dbname_roles.bak.sql
See here?
Restoring seems to be easiest with a command like:
pg_restore -C -h localhost -d jstream-data -U postgres /home/backups/postgresql/2024-09-17-daily/jstream-data.custom
TODO: Flesh this out - how we do backups, how to restore them, especially during a potential server upgrade or other change?
Including some issues.
There are 4 schemas available in LTER-core-metabase, but the lter_metabase
schema is the primary collection of tables for describing metadata and data packages. Within this schema there are 3 primary types of tables:
EML
-prefixed tables
List
-prefixed tables
DataSet
-prefixed tables
Storage type refers to the data type in which attribute values are stored in EML (See EMLStorageTypes table). We use integer, float, string, date, and boolean types for our data.
In some ways storageType corresponds to MeasurementScaleDomainIDs
Stuff on the internet about this:
There doesn’t seem to be an easy way to tie geographicdescription to attributes.
The LTER-core-metabase documentation on GitHub is the primary source for understanding this process. The steps to populate the lter-metabase schema are summarized below (and in the docs), with notes on how this process is being adapted for JRN Metabase. Note that before populating, it is worthwhile to learn a little about the LTER Metabase schema and how data are stored within it. See notes here about that.
In general we are using psql
, python, or DBeaver to populate and edit our databases. DBeaver has excellent documentation, but users will need to install it and set it up with their user/role and password to log into JRN Metabases. PgAdmin and other tools might be useful too.
Tables in JRN Metabase can be updated by importing CSV files containing metadata using psql
, DBeaver, or python. The relevant SQL command is COPY FROM
. Setting up incoming CSVs to match the table being copied to will help, and in LTER Metabases it will be best to start with parent tables (`DataSets’ in particular?) so that foreign key rules won’t be violated.
In server-side psql
use:
COPY persons(first_name, last_name, dob, email)
FROM '/home/username/sampledb/persons.csv'
DELIMITER ','
CSV HEADER; # if there is a header in the csv
In client side psql
use \copy
, and be aware that not all roles will be permitted to do client-site operations. All the columns in the csv need a destination column in the database table or else an error will result. This tutorial page helps.
COPY FROM
operations with CSV files can also be initiated from python using a psycopg
database connection. Some python scripts and modules for this are available in the jrn_metabase_tools repository.
In DBeaver, highlight the destination table and use the ‘Import’ tool, then select the CSV file to import. The tool allows you to match columns between CSV and database tables and create/ignore columns if needed. Setting up incoming CSVs to match the tables in the schema beforehand will help. Documentation here.
There is a tool being developed called EML2MB which might allow import of metadata using EML, but not ready for primetime yet.
DataSet
tablesThe DataSet
-prefixed tables need to be added to or updated to add/update JRN Datasets in the database. These tables are all linked by the DataSetID
columns, so new rows will need this key added, and updates to DataSets will need to take place with the correct DataSetID
s. These numbers correspond to the DataSet IDs we currently use for our JRN data packages (210001001, for example).
Before adding a new DataSet to the JRN Metabase, keep in mind that NOT ALL parts of the data package can be added directly to the database. The data entities (CCSVs or other files) and the abstract and methods documents (as .docx, markdown, etc) should be kept in a folder in our usual file system. You will add path to these files in JRN Metabase, but not the files themselves (for now). Once you have a folder to refer to, the order of operations to add a new dataset to JRN MEtabase is:
lter_metabase
schema, open the DataSets
table and add a new row.DataSetID
column.mb2eml_r
schema that identifies seldom changing metadata elements such as DataSetEntities
table and enter information for the data entities. If there are more than one you will enter multiple new rows and order them in the ``EntitySortOrder column. The
DataSetID and
EntitySortOrder` columns together identify DataSetID
and EntitySortOrder
of one will be for describing the attribues in the first dataEntity included in the dataset.DataSetAttributes
table.nominalEnum
measurement scale types are tricky. To make them work:
StorageType
to integer or string, depending what they areUnit
, NumberType
or other values.nominalText
measurement scale and a string StorageType
.
Unit
, NumberType
or other values.ordinalEnum
and ordinalText
measurement scales work yet. Can’t find much documentation about it, though the EML schema might define them.Jornada datasets consist of data files, metadata, and sometimes additional files. Not all of these can be stored directly in the JRN Metabase. So, each Jornada dataset should have a dedicated directory, usually on the Jornada shared drive. These directories are typically prefixed with the Jornada dataset ID number. As noted below, all data entities to be incuded with the published dataset, as well as some metadata to be attached, will be stored in this directory. It is also a good place to store scripts used to QA/QC datafiles and publish finished datasets.
DataSet
tablesMetadata specific to datasets are stored in DataSet
-prefixed tables in JRN Metabase. To create a new dataset record in JRN Metabase, new records (rows) must be added to these tables. To update an existing dataset in JRN Metabase, existing records in DataSet
tables must be altered, and new records (rows) may be added. The DataSet
tables are all linked by the DataSetID
columns, so new records added to a table will require that this key be added, and updates to any dataset will need to take place in records with the correct, corresponding DataSetID
. The DataSetID
numbers in JRN Metabase correspond to the jornada dataset IDs we currently use for all our data packages (210001001, for example).
Before adding a new dataset to JRN Metabase, keep in mind that NOT ALL metadata for the data package will be added directly to the database. Long-form text metadata, particularly the abstract and methods for the dataset, are typically kept in the dataset directory as .docx, markdown or other text files. The filenames for the abstract and methods will be added to JRN Metabase, but not the metadata themselves.
Additionally, the data entities (CSV tables, geoTIFFs, zip archives, PDFs, etc…) that will be included with your dataset must also be in the dataset directory. You will add filenames for these to JRN Metabase, but not the files themselves.
lter_metabase
schema, open the DataSets
table and add a new row.DataSetID
column.mb2eml_r
schema that identifies seldom changing metadata elements such as DataSetEntities
table and enter (or update) a record for each of the data entities. Each of these records refer to a file your dataset directory that will be published as part of the dataset.
EntitySortOrder
column.DataSetID
and EntitySortOrder
columns together will be used to identify data entities in other tables, such as in the DataSetAttributes
table (below).DataSetEntities.EntityType
=”dataTable”) open the DataSetAttributes
table to describe the columns in each tabular data entity.
EntitySortOrder
of “1” will describe the column attributes in the first dataEntity included in the dataset.DataSetID
and incremented EntitySortOrders
(2, 3, 4…) will describe column attributes in additional data entities for the dataset.DataSetAttributes
table defines any colums as categorical variables by having a MeasurementScaleDomain
of “nominalEnum”, you will need look at the categorical data column in the data table and add a record for each categorical variable code used to the DataSetAttributeEnumeration
table, using codes defined in the ListCodes
table.DataSetMethod
with information about the methods file in your dataset directory, usually named “methods.DataSetKeywords
with appropriate keywords chosen from the ListKeywords
table.DataSetPersonnel
with relevant personnel NameID
s chosen from the ListPeople
table.DataSetSites
with a site identifier chosen from the ListSites
table.DataSetTemporal
to define the time periods covered in the data.Note that many columns have constraints set - they will need to contain values, and these may be required to come from a parent table.
DataSetAttributes
table.nominalEnum
measurement scale types are tricky. To make them work:
StorageType
to integer or string, depending what they areUnit
, NumberType
or other values.nominalText
measurement scale and a string StorageType
.
Unit
, NumberType
or other values.ordinalEnum
and ordinalText
measurement scales work yet. Can’t find much documentation about it, though the EML schema might define them.initdb
, pg_dump
, createdb
and psql
.