gimie.parsers.license package
Module contents
- class gimie.parsers.license.LicenseParser(subject: str)
Bases: Parser
Parse the body of a LICENSE file into a schema:license <spdx-url> statement, using TF-IDF-based matching.
- gimie.parsers.license.is_license_filename(filename: str) → bool
Given an input filename, returns a boolean indicating whether the filename or path looks like a license file.
- Parameters:
filename – A filename to check.
Examples
>>> is_license_filename('LICENSE-APACHE')
True
>>> is_license_filename('README.md')
False
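A filename check of this kind can be sketched with a regular expression over the final path component. The pattern and the function name `looks_like_license` below are hypothetical and are not taken from gimie's source; they only illustrate one plausible heuristic that agrees with the two examples above.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern (an assumption, not gimie's actual rule):
# accept LICENSE / LICENCE / COPYING, optionally followed by a
# separator and a suffix such as "-APACHE" or ".txt".
_LICENSE_PATTERN = re.compile(r"^(licen[sc]e|copying)([._-].*)?$", re.IGNORECASE)


def looks_like_license(filename: str) -> bool:
    """Return True if the last path component resembles a license file name."""
    name = PurePosixPath(filename).name
    return _LICENSE_PATTERN.match(name) is not None
```

Matching only the last path component keeps a path such as `docs/LICENSE.txt` recognisable while rejecting names that merely start with the same letters, such as `licenses.py`.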
- gimie.parsers.license.load_tfidf_matrix() → csr_matrix
Load the pre-computed TF-IDF matrix of SPDX licenses from disk. The matrix has dimensions (n_licenses, n_features).
- gimie.parsers.license.load_tfidf_vectorizer() → TfidfVectorizer
Load the TF-IDF vectorizer from disk.
- gimie.parsers.license.match_license(data: bytes, min_similarity: float = 0.9) → str | None
Given the contents of a license file, returns the URL of the most similar SPDX license. The license text is converted to a TF-IDF vector, and the closest match in the SPDX license corpus is selected by cosine similarity.
- Parameters:
data – The license body as bytes.
min_similarity – The minimum cosine similarity required to accept a match (default 0.9).
Examples
>>> match_license(open('LICENSE', 'rb').read())
'https://spdx.org/licenses/Apache-2.0.html'
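The matching idea described above (TF-IDF vectors compared by cosine similarity, with a minimum-similarity cutoff) can be illustrated with a small standard-library sketch. This is not gimie's implementation: the module loads a pre-computed matrix and a `TfidfVectorizer` from disk, whereas the sketch below builds everything from a toy corpus, takes `str` rather than `bytes`, and returns a license identifier rather than an SPDX URL. The names `tokenize`, `vectorize`, `tfidf_vectors`, `cosine`, and `closest_license`, as well as the smoothed IDF formula, are assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def vectorize(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Build an L2-normalised TF-IDF vector, ignoring terms unknown to the corpus."""
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(docs: list[str]):
    """Return the IDF table and one TF-IDF vector per corpus document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    return idf, [vectorize(toks, idf) for toks in tokenized]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def closest_license(text: str, corpus: dict[str, str], min_similarity: float = 0.9):
    """Return the id of the most similar corpus entry, or None below the cutoff."""
    ids = list(corpus)
    idf, vectors = tfidf_vectors([corpus[i] for i in ids])
    query = vectorize(tokenize(text), idf)
    best_score, best_id = max((cosine(query, v), i) for v, i in zip(vectors, ids))
    return best_id if best_score >= min_similarity else None
```

With a toy corpus of two short texts, passing one of them back in yields its identifier, while unrelated text falls below the cutoff and yields `None`. In the sketch, the cutoff plays the role that `min_similarity` plays in the documented signature.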