Commit 33b3faf

Merge pull request #225 from ad-hardy/feature-custom-units-and-entities
custom units and entities
2 parents b7c5586 + 30f20c8 commit 33b3faf

14 files changed: 770 additions, 162 deletions

.travis.yml

Lines changed: 1 addition & 1 deletion
@@ -51,7 +51,7 @@ script:
 - coverage run -a --source=quantulum3 setup.py test -s quantulum3.tests.test_classifier.ClassifierTest
 - coverage run -a --source=quantulum3 setup.py test -s quantulum3.tests.test_scripts.TrainScriptTest
 - coverage run -a --source=quantulum3 setup.py test -s quantulum3.tests.test_load.TestCached
-
+- coverage run -a --source=quantulum3 setup.py test -s quantulum3.tests.test_load.TestLoaders
 after_success:
 - coverage report
 - coveralls

README.md

Lines changed: 91 additions & 74 deletions
@@ -1,11 +1,11 @@
-quantulum3
-==========
+# quantulum3
+
 [![Travis master build state](https://app.travis-ci.com/nielstron/quantulum3.svg?branch=master "Travis master build state")](https://app.travis-ci.com/nielstron/quantulum3)
 [![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=master)](https://coveralls.io/github/nielstron/quantulum3?branch=master)
 [![PyPI version](https://badge.fury.io/py/quantulum3.svg)](https://pypi.org/project/quantulum3/)
 ![PyPI - Python Version](https://img.shields.io/pypi/pyversions/quantulum3.svg)
 [![PyPI - Status](https://img.shields.io/pypi/status/quantulum3.svg)](https://pypi.org/project/quantulum3/)
-
+
 Python library for information extraction of quantities, measurements
 and their units from unstructured text. It is able to disambiguate between similar
 looking units based on their *k-nearest neighbours* in their [GloVe](https://nlp.stanford.edu/projects/glove/) vector representation
@@ -18,23 +18,23 @@ Lagi](https://github.com/marcolagi/quantulum).
 The compatibility with the newest version of sklearn is based on
 the fork of [sohrabtowfighi](https://github.com/sohrabtowfighi/quantulum).

-Installation
-------------
+## User Guide
+
+### Installation

 ```bash
-$ pip install quantulum3
+pip install quantulum3
 ```

 To install dependencies for using or training the disambiguation classifier, use

 ```bash
-$ pip install quantulum3[classifier]
+pip install quantulum3[classifier]
 ```

 The disambiguation classifier is used when the parser find two or more units that are a match for the text.

-Usage
------
+### Usage

 ```pycon
 >>> from quantulum3 import parser
@@ -79,8 +79,7 @@ this library can also be used for simple number extraction.
 [Quantity(2, 'dimensionless')]
 ```

-Units and entities
-------------------
+### Units and entities

 All units (e.g. *litre*) and the entities they are associated to (e.g.
 *volume*) are reconciled against WikiPedia:
@@ -117,8 +116,7 @@ dimensionality:
 Unit(name="kilometre per second", entity=Entity("speed"), uri=None)
 ```

-Disambiguation
---------------
+### Disambiguation

 If the parser detects an ambiguity, a classifier based on the WikiPedia
 pages of the ambiguous units or entities tries to guess the right one:
@@ -147,27 +145,47 @@ In addition to that, the classifier is trained on the most similar words to
 all of the units surfaces, according to their distance in [GloVe](https://nlp.stanford.edu/projects/glove/)
 vector representation.

-Training the classifier
------------------------
-
-If you want to train the classifier yourself, in addition to the packages above, you'll also need
-the packages `stemming` and `wikipedia`.
+## Spoken version

-You can get the classifier dependencies by running
+Quantulum classes include methods to convert them to a speakable unit.

-```bash
-$ pip install quantulum3[classifier]
+```pycon
+>>> parser.parse("Gimme 10e9 GW now!")[0].to_spoken()
+ten billion gigawatts
+>>> parser.inline_parse_and_expand("Gimme $1e10 now and also 1 TW and 0.5 J!")
+Gimme ten billion dollars now and also one terawatt and zero point five joules!
 ```

-You could also [download requirements_classifier.txt](https://raw.githubusercontent.com/nielstron/quantulum3/dev/requirements_classifier.txt)
-and run
+### Manipulation

-```bash
-$ pip install -r requirements_classifier.txt
-```
+While quantities cannot be manipulated within this library, there are
+many great options out there:
+
+- [pint](https://pint.readthedocs.org/en/latest/)
+- [natu](http://kdavies4.github.io/natu/)
+- [quantities](http://python-quantities.readthedocs.org/en/latest/)
+
+## Extension
+
+### Training the classifier
+
+If you want to train the classifier yourself, you will need the dependencies for the classifier (see installation).

 Use `quantulum3-training` on the command line, the script `quantulum3/scripts/train.py` or the method `train_classifier` in `quantulum3.classifier` to train the classifier.

+``` bash
+quantulum3-training --lang <language> --data <path/to/training/file.json> --output <path/to/output/file.joblib>
+```
+
+You can pass multiple training files in to the training command. The output is in joblib format.
+
+To use your custom model, pass the path to the trained model file to the
+parser:
+
+```pyton
+parser = Parser.parse(<text>, classifier_path="path/to/model.joblib")
+```
+
 Example training files can be found in `quantulum3/_lang/<language>/train`.

 If you want to create a new or different `similars.json`, install `pymagnitude`.
@@ -183,62 +201,62 @@ converted to a `.magnitude` file on-the-run. Check out
 [pre-formatted Magnitude formatted word-embeddings](https://github.com/plasticityai/magnitude#pre-converted-magnitude-formats-of-popular-embeddings-models)
 and [Magnitude](https://github.com/plasticityai/magnitude) for more information.

+### Additional units

-To use your custom model, pass the path to the trained model file to the
-parser:
+It is possible to add additional entities and units to be parsed by quantulum. These will be added to the default units and entities. See below code for an example invocation:

-```pyton
-parser = Parser.parse(classifier_path="path/to/model")
+```pycon
+>>> from quantulum3.load import add_custom_unit, remove_custom_unit
+>>> add_custom_unit(name="schlurp", surfaces=["slp"], entity="dimensionless")
+>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")
+[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]
 ```

+The keyword arguments to the function `add_custom_unit` are directly translated
+to the properties of the unit to be created.

-Manipulation
-------------
-
-While quantities cannot be manipulated within this library, there are
-many great options out there:
-
-- [pint](https://pint.readthedocs.org/en/latest/)
-- [natu](http://kdavies4.github.io/natu/)
-- [quantities](http://python-quantities.readthedocs.org/en/latest/)
+### Custom Units and Entities

-Spoken version
---------------
+It is possible to load a completely custom set of units and entities. This can be done by passing a list of file paths to the load_custom_units and load_custom_entities functions. Loading custom untis and entities will replace the default units and entities that are normally loaded.

-Quantulum classes include methods to convert them to a speakable unit.
+The recomended way to load quantities is via a context manager:

 ```pycon
->>> parser.parse("Gimme 10e9 GW now!")[0].to_spoken()
-ten billion gigawatts
->>> parser.inline_parse_and_expand("Gimme $1e10 now and also 1 TW and 0.5 J!")
-Gimme ten billion dollars now and also one terawatt and zero point five joules!
-```
+>>> from quantulum3 import load, parser
+>>> with load.CustomQuantities(["path/to/units.json"], ["path/to/entities.json"]):
+>>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")

-Extension
----------
+[Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]

-#### Custom units
+>>> # default units and entities are loaded again
+```

-It is possible to add custom entities to be parsed by quantulum.
-See below code for an example invocation.
+But it is also possible to load custom units and entities manually:

 ```pycon
->>> from quantulum3.load import add_custom_unit, remove_custom_unit
->>> add_custom_unit(name="schlurp", surfaces=["slp"], entity="dimensionless")
+>>> from quantulum3 import load, parser
+
+>>> load.load_custom_units(["path/to/units.json"])
+>>> load.load_custom_entities(["path/to/entities.json"])
 >>> parser.parse("This extremely sharp tool is precise up to 0.5 slp")
+
 [Quantity(0.5, "Unit(name="schlurp", entity=Entity("dimensionless"), uri=None)")]
+
+>>> # remove custom units and entities and load default units and entities
+>>> load.reset_quantities()
 ```

-The keyword arguments to the function `add_custom_unit` are directly translated
-to the properties of the unit to be created.
+See the Developer Guide below for more information about the format of units and entities files.
+
+## Developer Guide

-#### Extending the set of default units
+### Adding Units and Entities

 See *units.json* for the complete list of units and *entities.json* for
 the complete list of entities. The criteria for adding units have been:

-- the unit has (or is redirected to) a WikiPedia page
-- the unit is in common use (e.g. not the [premetric Swedish units of
+- the unit has (or is redirected to) a WikiPedia page
+- the unit is in common use (e.g. not the [premetric Swedish units of
 measurement](https://en.wikipedia.org/wiki/Swedish_units_of_measurement#Length)).

 It\'s easy to extend these two files to the units/entities of interest.
@@ -251,9 +269,9 @@ Here is an example of an entry in *entities.json*:
 }
 ```

-- The *name* of an entity is its key. Names are required to be unique.
-- *URI* is the name of the wikipedia page of the entity. (i.e. `https://en.wikipedia.org/wiki/Speed` => `Speed`)
-- *dimensions* is the dimensionality, a list of dictionaries each
+- The *name* of an entity is its key. Names are required to be unique.
+- *URI* is the name of the wikipedia page of the entity. (i.e. `https://en.wikipedia.org/wiki/Speed` => `Speed`)
+- *dimensions* is the dimensionality, a list of dictionaries each
 having a *base* (the name of another entity) and a *power* (an
 integer, can be negative).

@@ -277,24 +295,24 @@ Here is an example of an entry in *units.json*:
 }
 ```

-- The *name* of a unit is its key. Names are required to be unique.
-- *URI* follows the same scheme as in the *entities.json*
-- *surfaces* is a list of strings that refer to that unit. The library
+- The *name* of a unit is its key. Names are required to be unique.
+- *URI* follows the same scheme as in the *entities.json*
+- *surfaces* is a list of strings that refer to that unit. The library
 takes care of plurals, no need to specify them.
-- *entity* is the name of an entity in *entities.json*
-- *dimensions* follows the same schema as in *entities.json*, but the
+- *entity* is the name of an entity in *entities.json*
+- *dimensions* follows the same schema as in *entities.json*, but the
 *base* is the name of another unit, not of another entity.
-- *symbols* is a list of possible symbols and abbreviations for that
+- *symbols* is a list of possible symbols and abbreviations for that
 unit.
-- *prefixes* is an optional list. It can contain [Metric](https://en.wikipedia.org/wiki/Metric_prefix) and [Binary prefixes](https://en.wikipedia.org/wiki/Binary_prefix) and
+- *prefixes* is an optional list. It can contain [Metric](https://en.wikipedia.org/wiki/Metric_prefix) and [Binary prefixes](https://en.wikipedia.org/wiki/Binary_prefix) and
 automatically generates according units. If you want to
 add specifics (like different surfaces) you need to create an entry for that
 prefixes version on its own.

 All fields are case sensitive.

-Contributing
-------------
+### Contributing
+
 `dev` build:

 [![Travis dev build state](https://travis-ci.com/nielstron/quantulum3.svg?branch=dev "Travis dev build state")](https://travis-ci.com/nielstron/quantulum3)
@@ -311,8 +329,8 @@ If you'd like to contribute follow these steps:
 (Optional, will be done automatically after pushing)
 8. Create a Pull Request when having commited and pushed your changes

-Language support
------------------
+### Language support
+
 [![Travis dev build state](https://travis-ci.com/nielstron/quantulum3.svg?branch=language_support "Travis dev build state")](https://travis-ci.com/nielstron/quantulum3)
 [![Coverage Status](https://coveralls.io/repos/github/nielstron/quantulum3/badge.svg?branch=language_support)](https://coveralls.io/github/nielstron/quantulum3?branch=dev)

@@ -326,4 +344,3 @@ as in the automatic unittests.

 No changes outside the own language submodule folder (i.e. `_lang.de_DE`) should
 be necessary. If there are problems implementing a new language, don't hesitate to open an issue.
-
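
The README's field descriptions for *entities.json* (the name is the key, *URI* is a Wikipedia page name, *dimensions* is a list of dictionaries with a *base* and an integer *power*) lend themselves to a quick sanity check before a custom file is loaded. The following is a minimal sketch under those assumptions. `check_entities` is a hypothetical helper, not part of quantulum3, and the sample entries are invented for illustration because the README's own example entry is elided from this diff.

```python
import json

# Hypothetical entries shaped after the README's description of entities.json:
# the key is the entity name, "URI" is a Wikipedia page name, and "dimensions"
# is a list of {"base": <entity name>, "power": <int>} dictionaries.
SAMPLE = """
{
  "length": {"URI": "Length", "dimensions": []},
  "time": {"URI": "Time", "dimensions": []},
  "speed": {
    "URI": "Speed",
    "dimensions": [{"base": "length", "power": 1}, {"base": "time", "power": -1}]
  }
}
"""


def check_entities(entities):
    """Return a list of human-readable problems found in an entities mapping."""
    problems = []
    for name, entry in entities.items():
        for dim in entry.get("dimensions", []):
            base, power = dim.get("base"), dim.get("power")
            # Each base must name another entity in the same mapping.
            if base not in entities:
                problems.append(f"{name}: unknown base entity {base!r}")
            # Powers are integers and may be negative; bool is excluded.
            if not isinstance(power, int) or isinstance(power, bool):
                problems.append(f"{name}: power must be an integer, got {power!r}")
    return problems


entities = json.loads(SAMPLE)
print(check_entities(entities))
```

An empty list means no problems were found. The same idea would extend to *units.json*, where *entity* must name an entry in *entities.json* and each *base* must name another unit.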

quantulum3/_lang/en_US/load.py

Lines changed: 5 additions & 4 deletions
@@ -31,6 +31,7 @@ def number_to_words(number):
 ###############################################################################
 def build_common_words():
 # Read raw 4 letter file
+units_ = load.units(lang)
 path = os.path.join(TOPDIR, "common-units.txt")
 with open(path, "r", encoding="utf-8") as file:
 common_units = {line.strip() for line in file if not line.startswith("#")}
@@ -42,15 +43,15 @@ def build_common_words():
 continue
 line = line.rstrip()
 if (
-line not in load.units(lang).surfaces_lower
-and line not in load.units(lang).symbols
+line not in units_.surfaces_lower
+and line not in units_.symbols
 and line not in common_units
 ):
 words[len(line)].append(line)
 plural = load.pluralize(line)
 if (
-plural not in load.units(lang).surfaces_all
-and plural not in load.units(lang).symbols
+plural not in units_.surfaces_all
+and plural not in units_.symbols
 and plural not in common_units
 ):
 words[len(plural)].append(plural)
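
The change in this file replaces repeated `load.units(lang)` calls inside the loop with a single lookup stored in `units_`. The sketch below illustrates that pattern with a stand-in. `fake_units` and its call counter are hypothetical, and nothing here reflects how `load.units` is implemented or how costly it actually is.

```python
from types import SimpleNamespace

calls = 0


def fake_units(lang):
    """Hypothetical stand-in for a registry lookup; counts how often it runs."""
    global calls
    calls += 1
    return SimpleNamespace(surfaces_lower={"metre", "gram"}, symbols={"m", "g"})


def filter_words_repeated(words, lang="en_US"):
    # Looks the registry up on every membership test, as the removed lines did.
    return [
        w for w in words
        if w not in fake_units(lang).surfaces_lower
        and w not in fake_units(lang).symbols
    ]


def filter_words_hoisted(words, lang="en_US"):
    # Looks the registry up once and reuses the result, as the added lines do.
    units_ = fake_units(lang)
    return [
        w for w in words
        if w not in units_.surfaces_lower and w not in units_.symbols
    ]


words = ["tree", "gram", "m", "lake"]

calls = 0
repeated = filter_words_repeated(words)
repeated_calls = calls

calls = 0
hoisted = filter_words_hoisted(words)
hoisted_calls = calls

print(repeated == hoisted, repeated_calls, hoisted_calls)
```

Both variants return the same words. Only the number of lookups differs, and the repeated variant grows with the length of the input.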

quantulum3/_lang/en_US/parser.py

Lines changed: 6 additions & 3 deletions
@@ -242,6 +242,9 @@ def build_quantity(
 """
 # TODO rerun if change occurred
 # Re parse unit if a change occurred
+
+units_ = load.units(lang)
+
 dimension_change = True

 # Extract "absolute " ...
@@ -250,7 +253,7 @@ def build_quantity(
 unit.name == "dimensionless"
 and _absolute == orig_text[span[0] - len(_absolute) : span[0]]
 ):
-unit = load.units(lang).names["kelvin"]
+unit = units_.names["kelvin"]
 unit.original_dimensions = unit.dimensions
 surface = _absolute + surface
 span = (span[0] - len(_absolute), span[1])
@@ -278,7 +281,7 @@ def build_quantity(
 # k/M etc is only applied if non-symbolic surfaces of other units
 # (because colloquial) or currency units
 symbolic = all(
-dim["surface"] in load.units(lang).names[dim["base"]].symbols
+dim["surface"] in units_.names[dim["base"]].symbols
 for dim in unit.original_dimensions[1:]
 )
 if not symbolic:
@@ -430,7 +433,7 @@ def build_quantity(
 unit.original_dimensions, orig_text, lang, classifier_path
 )
 else:
-unit = load.units(lang).names["dimensionless"]
+unit = units_.names["dimensionless"]

 # Discard irrelevant txt2float extractions, cardinal numbers, codes etc.
 if (
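
The `symbolic` expression touched by the third hunk checks whether every dimension after the first was written using a symbol of its base unit. The following is a minimal sketch with stand-in data. The `names` mapping, the `is_symbolic` wrapper and the sample dimensions are hypothetical and only mimic the attribute shapes visible in the diff.

```python
from types import SimpleNamespace

# Hypothetical registry: unit name -> object with a `symbols` collection.
names = {
    "dollar": SimpleNamespace(symbols=["$"]),
    "tesla": SimpleNamespace(symbols=["T"]),
    "mile": SimpleNamespace(symbols=["mi"]),
}


def is_symbolic(original_dimensions):
    """True if every dimension after the first uses a symbol of its base unit."""
    return all(
        dim["surface"] in names[dim["base"]].symbols
        for dim in original_dimensions[1:]
    )


# Second dimension written as the symbol "T".
dollar_tesla = [
    {"base": "dollar", "surface": "$"},
    {"base": "tesla", "surface": "T"},
]
# Second dimension written as the word "miles", which is not a symbol.
k_miles = [
    {"base": "kelvin", "surface": "k"},
    {"base": "mile", "surface": "miles"},
]

print(is_symbolic(dollar_tesla), is_symbolic(k_miles))
```

Because the slice skips the first dimension, its base is never looked up, and a unit with a single dimension counts as symbolic since `all` of an empty sequence is `True`.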
