gimie.parsers.license package

Module contents

class gimie.parsers.license.LicenseParser

Bases: Parser

Parses the body of a LICENSE file into schema:license <spdx-url>. Uses TF-IDF-based matching.

parse(data: bytes) → Set[Tuple[URIRef, URIRef | Literal]]

Extracts an SPDX URL from a license file and returns a set containing a single tuple <schema:license> <spdx_url>. If no matching URL is found, an empty set is returned.
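
The return contract described above can be illustrated with a minimal sketch. It is not the gimie implementation: `parse_sketch`, the `matcher` argument, and `SCHEMA_LICENSE` are hypothetical names, plain strings stand in for rdflib `URIRef` objects, and the schema.org IRI used for the predicate is an assumption.

```python
from typing import Callable, Optional, Set, Tuple

# Assumed IRI for the schema:license predicate (illustration only).
SCHEMA_LICENSE = "http://schema.org/license"


def parse_sketch(
    data: bytes, matcher: Callable[[bytes], Optional[str]]
) -> Set[Tuple[str, str]]:
    """Return {(predicate, spdx_url)} on a match, or an empty set otherwise."""
    url = matcher(data)  # hypothetical matcher: returns an SPDX URL or None
    if url is None:
        return set()
    return {(SCHEMA_LICENSE, url)}
```

Because the result is a set that is either empty or holds exactly one tuple, a caller can merge it into a larger set of properties without special-casing the unmatched case.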

gimie.parsers.license.is_license_filename(filename: str) → bool

Given an input filename, returns a boolean indicating whether the filename looks like that of a license file.

Parameters:

filename – A filename to check.

Examples

>>> is_license_filename('LICENSE-APACHE')
True
>>> is_license_filename('README.md')
False
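
A check of this kind could be sketched with a regular expression, as below. This is one plausible approach, not the pattern gimie actually uses: `looks_like_license` and `_LICENSE_NAME` are hypothetical names, and the accepted spellings (LICENSE, LICENCE, COPYING) are assumptions.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern: names that start with LICENSE, LICENCE or COPYING.
_LICENSE_NAME = re.compile(r"^(licen[cs]e|copying)", re.IGNORECASE)


def looks_like_license(filename: str) -> bool:
    """Return True if the final path component resembles a license file name."""
    name = PurePosixPath(filename).name  # drop any leading directories
    return bool(_LICENSE_NAME.match(name))
```

With this sketch, `looks_like_license('LICENSE-APACHE')` is true and `looks_like_license('README.md')` is false, matching the examples above.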

gimie.parsers.license.load_spdx_ids() → List[str]

Load the list of SPDX license identifiers from disk.

gimie.parsers.license.load_tfidf_matrix() → csr_matrix

Load the pre-computed TF-IDF matrix of SPDX licenses from disk. The matrix has dimensions (n_licenses, n_features).

gimie.parsers.license.load_tfidf_vectorizer() → TfidfVectorizer

Load the pre-fitted TF-IDF vectorizer from disk.

gimie.parsers.license.match_license(data: bytes, min_similarity: float = 0.9) → str | None

Given the contents of a license file, returns the URL of the most similar SPDX license. The license text is converted to a TF-IDF vector and compared against the SPDX license corpus; the closest match by cosine similarity is returned.

Parameters:

data – The license body as bytes.

min_similarity – Minimum cosine similarity required to accept a match. Defaults to 0.9.

Examples

>>> match_license(open('LICENSE', 'rb').read())
'https://spdx.org/licenses/Apache-2.0.html'
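
The matching idea (TF-IDF vectors compared by cosine similarity, with a minimum-similarity cutoff) can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the gimie implementation, which loads a pre-computed matrix and a fitted `TfidfVectorizer` from disk. The toy `CORPUS`, the helper names, and the smoothed IDF formula are all hypothetical choices.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Optional

# Toy stand-in for the SPDX corpus (hypothetical, heavily abbreviated texts).
CORPUS: Dict[str, str] = {
    "MIT": "permission is hereby granted free of charge to any person",
    "Apache-2.0": "licensed under the apache license version 2 you may not use",
    "GPL-3.0-only": "this program is free software you can redistribute it",
}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_weights(docs: List[List[str]]) -> Dict[str, float]:
    # Smoothed inverse document frequency (an assumed weighting scheme).
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    # Terms outside the corpus vocabulary are dropped from the vector.
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_sketch(data: bytes, min_similarity: float = 0.9) -> Optional[str]:
    ids = list(CORPUS)
    docs = [tokenize(CORPUS[i]) for i in ids]
    idf = idf_weights(docs)
    matrix = [tfidf(d, idf) for d in docs]  # one sparse row per license
    query = tfidf(tokenize(data.decode("utf-8", errors="ignore")), idf)
    scores = [cosine(query, row) for row in matrix]
    best = max(range(len(ids)), key=scores.__getitem__)
    if scores[best] < min_similarity:
        return None  # nothing in the corpus is similar enough
    return f"https://spdx.org/licenses/{ids[best]}.html"
```

In this sketch, text identical to a corpus entry scores a similarity of about 1.0 and is matched, while text that shares no vocabulary with the corpus scores 0.0 and yields `None`.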